diff --git a/README.md b/README.md index 9c4fee4..a9b52ef 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # Discourse Graphs: Structured Scientific Knowledge -This repository contains specifications and schemas for creating **discourse graphs**—structured representations of scientific research as interconnected knowledge components. +This repository contains early-stage prototype specifications and schemas for creating **discourse graphs**—structured representations of scientific research as interconnected knowledge components. It is intended for discussion. ## What Are Discourse Graphs? diff --git a/atproto-lexicon/README.md b/atproto-lexicon/README.md new file mode 100644 index 0000000..6e8dba4 --- /dev/null +++ b/atproto-lexicon/README.md @@ -0,0 +1,218 @@ +# Discourse Graphs ATProto Lexicon (Prototype) + +**NSID namespace:** `org.discoursegraphs.*` + +## Overview + +This is a prototype ATProto Lexicon for the Discourse Graphs protocol — a schema +for federated scientific synthesis infrastructure built on structured networks of +claims, evidence, and their relationships. 
+ +The design follows two sets of principles simultaneously: + +**From the Discourse Grammar spec:** +- Base schema of 4 node types (Question, Claim, Evidence, Source) + 4 relation types + (Supports, Opposes, Addresses, Informs) +- Claims and evidence are deliberately separate first-class types +- Relations are reified (separate assertions with own metadata), not node attributes +- Incremental formalization: nodes born with minimal required formality, progressively + refined with affordances and payoffs at each level +- Local label variations mapped to base types for interoperability + +**From ATProto Lexicon conventions (Lexinomicon):** +- lowerCamelCase for field names +- Open unions and knownValues (not closed enums) for extensibility +- Fields not marked required unless truly necessary for functionality +- Designed for forward/backward compatibility and schema evolution +- Records keyed by TID for natural ordering + +## Lexicon Files + +### Core (Base Schema) + +| File | NSID | Type | Description | +|------|------|------|-------------| +| `defs.json` | `org.discoursegraphs.defs` | defs | Shared types: epistemicStatus, localLabel, sourceRef, provenanceInfo | +| `question.json` | `org.discoursegraphs.question` | record | Research questions organizing inquiry | +| `claim.json` | `org.discoursegraphs.claim` | record | Assertional statements with epistemic status | +| `evidence.json` | `org.discoursegraphs.evidence` | record | Evidence bundles: interpretation + artifact reference | +| `source.json` | `org.discoursegraphs.source` | record | Source documents, datasets, experiments | +| `supports.json` | `org.discoursegraphs.supports` | record | Reified support relations | +| `opposes.json` | `org.discoursegraphs.opposes` | record | Reified opposition relations | +| `addresses.json` | `org.discoursegraphs.addresses` | record | Claim-addresses-question relations | +| `informs.json` | `org.discoursegraphs.informs` | record | Source-informs-evidence (and other informing) 
relations | + +### Extensions + +| File | NSID | Variation | Description | +|------|------|-----------|-------------| +| `issue.json` | `org.discoursegraphs.issue` | Lab | Future experiment / investigation to resolve evidence gaps | +| `pattern.json` | `org.discoursegraphs.pattern` | HCI | Conceptual design patterns abstracted from implementations | +| `artifact.json` | `org.discoursegraphs.artifact` | HCI | Concrete systems instantiating patterns | +| `endorsement.json` | `org.discoursegraphs.endorsement` | Core | Accountability layer: belief/validation of statements | + +## Key Design Decisions + +### Why relations are separate records (not embedded) +The spec is explicit: "relations should be separate assertions (with their own metadata) +rather than attributes of a discourse node." This means every support/oppose/address +relation is its own record with: +- Its own author (via ATProto repo ownership) +- Its own provenance (manual, AI-assisted, etc.) +- Its own timestamp +- Optional warrant (the reasoning justifying the relation) + +This is critical for the accountability layer. When scientist A says evidence E supports +claim C, that's a distinct assertional act from scientist B saying the same thing. The +relation records capture this. + +### Incremental formalization via optional fields +Almost every field except the core text content and createdAt is optional. A claim can +start as just a sentence. Over time, the author (or AI assistance) can add: +- epistemicStatus (is this a hypothesis? a well-supported claim?) +- localLabel (what does my community call this?) +- tags, provenance, etc. + +This matches the spec's principle that "discourse graph entities [should] be born with +only the absolute minimum required formality." + +### knownValues, not enums +ATProto's knownValues pattern (open string with documented known values) is used for +epistemicStatus, relationStrength, sourceType, etc. 
This means: +- Communities can extend with their own values without breaking schema +- Old clients gracefully handle new values they don't recognize +- No schema migration needed when a new community adds "conjecture" as an + epistemic status + +### Local labels for interoperability +The localLabel type lets a community say "we call claims 'hypotheses' in our lab" while +the base type mapping ensures federation still works. An appview aggregating across +communities can display local labels while querying on base types. + +### Provenance as a first-class concern +Every node and relation carries optional provenanceInfo following PROV-O semantics: +- wasGeneratedBy: manual authoring vs. AI-assisted extraction +- wasAttributedTo: the responsible agent (DID) +- validatedBy: who has reviewed AI-generated content +- derivedFrom: what records this was derived from + +This is essential for the tiered trust model: personal graphs → community synthesis → +cross-community federation. + +## Example: E. coli Glucose Repression (Lab Discourse Graph) + +```json +// A researcher's question +{ + "$type": "org.discoursegraphs.question", + "text": "What is the mechanism by which glucose represses lac operon expression in E. 
coli?", + "tags": ["ecoli", "glucoseRepression", "lacOperon"], + "createdAt": "2026-03-10T14:00:00Z" +} + +// A claim addressing that question +{ + "$type": "org.discoursegraphs.claim", + "text": "Glucose repression of the lac operon is primarily mediated by cAMP-CRP regulation rather than inducer exclusion.", + "epistemicStatus": "hypothesis", + "localLabel": { + "label": "Hypothesis", + "baseType": "org.discoursegraphs.claim" + }, + "createdAt": "2026-03-10T14:05:00Z" +} + +// Evidence from a specific experiment +{ + "$type": "org.discoursegraphs.evidence", + "text": "In CRP knockout strains, lac operon expression was reduced by 95% even in the presence of IPTG and absence of glucose, indicating CRP is necessary for activation.", + "localLabel": { + "label": "Result", + "baseType": "org.discoursegraphs.evidence" + }, + "contextEntities": [ + { "name": "E. coli K-12 MG1655 ΔCRP", "entityType": "strain" }, + { "name": "β-galactosidase assay", "entityType": "method" } + ], + "createdAt": "2026-03-10T14:10:00Z" +} + +// A support relation (reified — separate record with its own author) +{ + "$type": "org.discoursegraphs.supports", + "subject": "at://did:plc:abc123/org.discoursegraphs.evidence/3lr...", + "object": "at://did:plc:abc123/org.discoursegraphs.claim/3lr...", + "strength": "strong", + "warrant": "CRP knockout eliminates the regulatory pathway, so loss of expression directly implicates CRP-mediated activation as the dominant mechanism.", + "provenance": { + "wasGeneratedBy": "manualAuthoring", + "wasAttributedTo": "did:plc:abc123" + }, + "createdAt": "2026-03-10T14:15:00Z" +} + +// An issue: a future experiment to strengthen the claim +{ + "$type": "org.discoursegraphs.issue", + "text": "Measure inducer exclusion contribution by comparing intracellular IPTG concentrations in glucose+ vs glucose- conditions in wild-type and PTS mutant strains.", + "motivatedBy": "at://did:plc:abc123/org.discoursegraphs.claim/3lr...", + "status": "open", + "createdAt": 
"2026-03-10T14:20:00Z" +} +``` + +## Relationship to Other Standards + +| Standard | Role in this schema | +|----------|-------------------| +| ATProto | Transport, identity (DIDs), repo storage, federation | +| Nanopublications | Provenance-rich knowledge representation; DG records can be compiled to nanopubs | +| PROV-O | Semantics for provenanceInfo fields | +| ORCID | Can be linked via DID ↔ ORCID mapping in endorsement/provenance | +| SEPIO | Extensible scientific argument schemas; informs warrant and evidence structure | +| JSON-LD | Future: `@context` overlay mapping DG lexicons to RDF IRIs for semantic web interop | + +## JSON-LD Interoperability Layer + +While ATProto lexicons are not natively JSON-LD, records can be mapped to JSON-LD via +a context document. A future `org.discoursegraphs.context` could provide: + +```json +{ + "@context": { + "dg": "https://w3id.org/discoursegraphs/", + "prov": "http://www.w3.org/ns/prov#", + "schema": "http://schema.org/", + "text": "schema:description", + "claim": "dg:Claim", + "evidence": "dg:Evidence", + "supports": "dg:supports", + "opposes": "dg:opposes", + "wasGeneratedBy": "prov:wasGeneratedBy", + "wasAttributedTo": "prov:wasAttributedTo" + } +} +``` + +This enables discourse graph records to be consumed by semantic web tooling and +compiled into nanopublications without changing the ATProto data model. + +## Open Questions + +1. **Should the namespace be `org.discoursegraphs.*` or something under a domain + the project controls?** The NSID needs to map to a domain for lexicon resolution. + +2. **How should domain-specific entity types (cell lines, methods, etc.) interoperate?** + Currently `contextEntities` on evidence is a simple array. Could these be their own + records with community-managed type ontologies? + +3. **Should endorsements reference specific versions of records?** ATProto records + can be updated; an endorsement of version N might not apply to version N+1. + +4. 
**How should cross-community "warrant disputes" be modeled?** When community A + says method X is sufficient evidence and community B disagrees, this should surface + as claims in the graph per the spec — but the UX pattern for this needs design. + +5. **What's the right granularity for compilation to nanopublications?** A single + DG supports relation might map to one nanopub, or a cluster of claim + evidence + + relations might be one nanopub. diff --git a/atproto-lexicon/claim.json b/atproto-lexicon/claim.json new file mode 100644 index 0000000..0cf9de4 --- /dev/null +++ b/atproto-lexicon/claim.json @@ -0,0 +1,53 @@ +{ + "lexicon": 1, + "id": "org.discoursegraphs.claim", + "description": "A statement that an agent asserts or proposes within a discourse graph. Claims are the primary assertional unit: they can be supported or opposed by evidence, address questions, and be compiled into theories or models. The epistemic status distinguishes between claims (believed with adequate evidence), hypotheses (posed for discussion without asserting belief), and community-defined variants.", + "defs": { + "main": { + "type": "record", + "key": "tid", + "record": { + "type": "object", + "required": ["text", "createdAt"], + "properties": { + "text": { + "type": "string", + "maxLength": 8192, + "description": "The claim text. A declarative statement that can be supported or opposed by evidence." + }, + "epistemicStatus": { + "type": "ref", + "ref": "org.discoursegraphs.defs#epistemicStatus", + "description": "The epistemic status of this statement. Defaults to unspecified (generic statement) per incremental formalization: authors can refine to 'claim', 'hypothesis', etc. as their thinking develops." + }, + "description": { + "type": "string", + "maxLength": 16384, + "description": "Optional elaboration, context, or reasoning behind this claim." 
+ }, + "localLabel": { + "type": "ref", + "ref": "org.discoursegraphs.defs#localLabel", + "description": "Optional local label override (e.g., 'Conclusion', 'Finding', 'Design Principle')." + }, + "tags": { + "type": "array", + "maxLength": 32, + "items": { + "type": "string", + "maxLength": 256 + } + }, + "provenance": { + "type": "ref", + "ref": "org.discoursegraphs.defs#provenanceInfo" + }, + "createdAt": { + "type": "string", + "format": "datetime" + } + } + } + } + } +} diff --git a/atproto-lexicon/defs.json b/atproto-lexicon/defs.json new file mode 100644 index 0000000..9b9139f --- /dev/null +++ b/atproto-lexicon/defs.json @@ -0,0 +1,109 @@ +{ + "lexicon": 1, + "id": "org.discoursegraphs.defs", + "description": "Shared type definitions for the Discourse Graphs protocol. Defines the base node types (Question, Claim, Evidence, Source) and relation types (Supports, Opposes, Addresses, Informs) that constitute the core discourse graph schema.", + "defs": { + "epistemicStatus": { + "type": "string", + "description": "The epistemic status of a statement. Open (not closed) to allow community-specific extensions.", + "knownValues": [ + "claim", + "hypothesis", + "conjecture" + ] + }, + "relationStrength": { + "type": "string", + "description": "Optional qualitative indicator of relation strength. Open to extension.", + "knownValues": [ + "strong", + "moderate", + "weak", + "disputed" + ] + }, + "localLabel": { + "type": "object", + "description": "A community-local label mapping for a node or relation type. Enables local variation in terminology (e.g., 'hypothesis' vs 'conjecture') while preserving interoperability through the base type reference.", + "required": ["label", "baseType"], + "properties": { + "label": { + "type": "string", + "maxLength": 128, + "description": "The local display label used by this community or tool." 
+ }, + "baseType": { + "type": "string", + "description": "The base schema type this local label maps to, as an NSID (e.g., 'org.discoursegraphs.claim')." + }, + "description": { + "type": "string", + "maxLength": 1024, + "description": "Optional description of how this community uses this label." + } + } + }, + "sourceRef": { + "type": "object", + "description": "A reference to a source document, which may be identified by DOI, URL, AT-URI, or free text citation.", + "properties": { + "doi": { + "type": "string", + "description": "Digital Object Identifier for the source." + }, + "url": { + "type": "string", + "format": "uri", + "description": "URL for the source." + }, + "atUri": { + "type": "string", + "format": "at-uri", + "description": "AT-URI pointing to another record on the network." + }, + "citation": { + "type": "string", + "maxLength": 4096, + "description": "Free-text citation string." + } + } + }, + "provenanceInfo": { + "type": "object", + "description": "Provenance metadata for a discourse graph node, following PROV-O semantics. Tracks how the node was generated (manual authoring, AI-assisted extraction, etc.) and by whom.", + "properties": { + "wasGeneratedBy": { + "type": "string", + "description": "Description or identifier of the activity that produced this node.", + "knownValues": [ + "manualAuthoring", + "aiAssistedExtraction", + "aiSuggested", + "importedFromSource" + ] + }, + "wasAttributedTo": { + "type": "string", + "format": "did", + "description": "DID of the agent who authored or validated this node." 
+ }, + "validatedBy": { + "type": "array", + "description": "DIDs of agents who have validated this node after initial creation (e.g., after AI extraction).", + "items": { + "type": "string", + "format": "did" + } + }, + "derivedFrom": { + "type": "array", + "description": "AT-URIs of records this node was derived from.", + "items": { + "type": "string", + "format": "at-uri" + } + } + } + } + } +} diff --git a/atproto-lexicon/endorsement.json b/atproto-lexicon/endorsement.json new file mode 100644 index 0000000..0f34e9b --- /dev/null +++ b/atproto-lexicon/endorsement.json @@ -0,0 +1,41 @@ +{ + "lexicon": 1, + "id": "org.discoursegraphs.endorsement", + "description": "An endorsement expressing an agent's belief in or validation of a discourse graph node. At the moment of publication, a researcher can endorse statements — for example, upgrading a hypothesis to a claim by asserting it has sufficient evidential support. Endorsements form part of the accountability layer: they are authored, timestamped, and tied to the endorser's identity and reputation.", + "defs": { + "main": { + "type": "record", + "key": "tid", + "record": { + "type": "object", + "required": ["subject", "createdAt"], + "properties": { + "subject": { + "type": "string", + "format": "at-uri", + "description": "AT-URI of the discourse graph node being endorsed." + }, + "endorsementType": { + "type": "string", + "description": "The nature of the endorsement.", + "knownValues": [ + "believe", + "validated", + "retract", + "qualified" + ] + }, + "qualification": { + "type": "string", + "maxLength": 4096, + "description": "Optional qualification or condition on the endorsement (e.g., 'valid under conditions X but not Y')." 
+ }, + "createdAt": { + "type": "string", + "format": "datetime" + } + } + } + } + } +} diff --git a/atproto-lexicon/evidence.json b/atproto-lexicon/evidence.json new file mode 100644 index 0000000..ae74d80 --- /dev/null +++ b/atproto-lexicon/evidence.json @@ -0,0 +1,84 @@ +{ + "lexicon": 1, + "id": "org.discoursegraphs.evidence", + "description": "An evidence bundle: a minimal interpretation of an artifact (data, observation, experimental outcome) that can support or oppose claims. Evidence is deliberately separated from claims as a first-class type because this distinction confers critical benefits for scientific reasoning — enabling assessment of what is known empirically vs. what is asserted, structuring contributions at different career stages, and supporting flexible abstraction over potentially conflicting evidence. In the lab variation, evidence from ongoing work is labeled a 'result' tied to an 'experiment' source.", + "defs": { + "main": { + "type": "record", + "key": "tid", + "record": { + "type": "object", + "required": ["text", "createdAt"], + "properties": { + "text": { + "type": "string", + "maxLength": 8192, + "description": "The evidence statement: a minimal interpretation of data or observations from a specific study or experiment." + }, + "source": { + "type": "ref", + "ref": "org.discoursegraphs.defs#sourceRef", + "description": "Reference to the source document, dataset, or experiment that produced this evidence." + }, + "sourceRecord": { + "type": "string", + "format": "at-uri", + "description": "AT-URI pointing to an org.discoursegraphs.source record on the network, if available." + }, + "description": { + "type": "string", + "maxLength": 16384, + "description": "Optional additional context: methods, conditions, limitations, or contextual details (e.g., cell line, model organism, task parameters) needed to interpret this evidence." 
+ }, + "localLabel": { + "type": "ref", + "ref": "org.discoursegraphs.defs#localLabel", + "description": "Optional local label override (e.g., 'Result', 'Observation', 'Measurement')." + }, + "contextEntities": { + "type": "array", + "description": "Optional references to domain-specific entities relevant to interpreting this evidence (e.g., cell lines, model organisms, datasets, tasks). Enables queries like 'find all evidence for claim A, subset by cell line'.", + "maxLength": 64, + "items": { + "type": "object", + "required": ["name"], + "properties": { + "name": { + "type": "string", + "maxLength": 512, + "description": "Name or identifier of the entity." + }, + "entityType": { + "type": "string", + "maxLength": 256, + "description": "The type of entity (e.g., 'cellLine', 'modelOrganism', 'dataset', 'method')." + }, + "uri": { + "type": "string", + "format": "uri", + "description": "Optional URI identifying this entity in an external ontology or database." + } + } + } + }, + "tags": { + "type": "array", + "maxLength": 32, + "items": { + "type": "string", + "maxLength": 256 + } + }, + "provenance": { + "type": "ref", + "ref": "org.discoursegraphs.defs#provenanceInfo" + }, + "createdAt": { + "type": "string", + "format": "datetime" + } + } + } + } + } +} diff --git a/atproto-lexicon/issue.json b/atproto-lexicon/issue.json new file mode 100644 index 0000000..aae065b --- /dev/null +++ b/atproto-lexicon/issue.json @@ -0,0 +1,64 @@ +{ + "lexicon": 1, + "id": "org.discoursegraphs.issue", + "description": "An issue represents a candidate future experiment or investigation, closing the loop in the lab discourse graph variation. When a claim lacks sufficient evidence, an issue articulates what experiment or measurement would strengthen or resolve it. Issues, when claimed by a researcher, become experiments that generate new results (evidence). 
Maps to the base Question type for interoperability, but carries additional experimental planning semantics.", + "defs": { + "main": { + "type": "record", + "key": "tid", + "record": { + "type": "object", + "required": ["text", "createdAt"], + "properties": { + "text": { + "type": "string", + "maxLength": 4096, + "description": "Description of the proposed experiment, measurement, or investigation." + }, + "motivatedBy": { + "type": "string", + "format": "at-uri", + "description": "AT-URI of the claim or evidence gap that motivates this issue." + }, + "assignee": { + "type": "string", + "format": "did", + "description": "DID of the person who has claimed this issue for investigation." + }, + "status": { + "type": "string", + "description": "Current status of this issue.", + "knownValues": [ + "open", + "claimed", + "inProgress", + "completed", + "wontDo" + ] + }, + "description": { + "type": "string", + "maxLength": 16384, + "description": "Detailed description of the proposed experiment, including anticipated methods, expected outcomes, or decision criteria." + }, + "tags": { + "type": "array", + "maxLength": 32, + "items": { + "type": "string", + "maxLength": 256 + } + }, + "provenance": { + "type": "ref", + "ref": "org.discoursegraphs.defs#provenanceInfo" + }, + "createdAt": { + "type": "string", + "format": "datetime" + } + } + } + } + } +} diff --git a/atproto-lexicon/supports.json b/atproto-lexicon/supports.json new file mode 100644 index 0000000..ed8d3eb --- /dev/null +++ b/atproto-lexicon/supports.json @@ -0,0 +1,50 @@ +{ + "lexicon": 1, + "id": "org.discoursegraphs.supports", + "description": "A reified 'supports' relation between discourse graph nodes. Relations are separate assertions with their own metadata (author, provenance, strength) rather than attributes of nodes. Typically connects evidence to a claim, or a claim to another claim. 
The support relation itself is a statement that can be believed or contested.", + "defs": { + "main": { + "type": "record", + "key": "tid", + "record": { + "type": "object", + "required": ["subject", "object", "createdAt"], + "properties": { + "subject": { + "type": "string", + "format": "at-uri", + "description": "AT-URI of the supporting node (the evidence or claim that provides support)." + }, + "object": { + "type": "string", + "format": "at-uri", + "description": "AT-URI of the supported node (the claim being supported)." + }, + "strength": { + "type": "ref", + "ref": "org.discoursegraphs.defs#relationStrength", + "description": "Optional qualitative assessment of the strength of this support relation." + }, + "warrant": { + "type": "string", + "maxLength": 4096, + "description": "Optional argumentation warrant: the reasoning or methodological basis that justifies this support relationship. When communities disagree about what counts as evidence or valid inference, warrants make those disagreements explicit." + }, + "description": { + "type": "string", + "maxLength": 8192, + "description": "Optional elaboration on the nature of this support relationship." + }, + "provenance": { + "type": "ref", + "ref": "org.discoursegraphs.defs#provenanceInfo" + }, + "createdAt": { + "type": "string", + "format": "datetime" + } + } + } + } + } +} diff --git a/conceptual-schema-draft.md b/conceptual-schema-draft.md new file mode 100644 index 0000000..3f07015 --- /dev/null +++ b/conceptual-schema-draft.md @@ -0,0 +1,201 @@ +The purpose of this document is to outline a specification for using discourse graphs as an interoperable protocol and operating system for scientific research. + +# Background and Motivation + +Reasoning and retrieval based on *discourse* is a critical feature of scientific discovery. 
+ +Some examples: + +- **Problem formulation:** When mapping out a problem space to identify opportunities for new contributions, researchers want to know the answers to questions like “What are the relevant claims of interest for my question, and what are their relative evidential weights?”, or “What results are worthwhile (e.g., they support important claims) and possible to attempt to replicate (clear inferential link between result and claim)?” + +- **Formulating research contributions:** When thinking through whether a given project has made a “sufficient” contribution, a scientist needs to think through what *evidence* they have generated from their experiments, and the degree to which that evidence *warrants/supports* key *claims* of interest to the community (e.g., ones that imply other interesting claims, or address important shared questions of interest). + +- **Theory construction:** When constructing theories or models of a phenomenon using appropriate domain-specific representations like a causal/knowledge graph or system of differential equations, the theorist needs to ensure that the edges/paths/relations in the model (which can be formulated as a set of statements or *claims*) are sufficiently *evidenced* to warrant inclusion in the model. + +The common thread across these examples is that the task is supported by operations on *discourse* units like claims and evidence (and their interrelations), in addition to domain-specific scientific entities of interest, and separately from research papers (which contain the discourse units of interest). + +Unfortunately, most infrastructure and tools for sharing scientific knowledge operate on *documents* as the core unit of analysis. For instance, academic search engines primarily target retrieval of *papers* (or equivalent documents like books and book chapters or reports). Literature tools like Zotero emphasize organizing and managing documents, not discourse. 
This mismatch between the document-centric data model and the need to operate on discourse units creates immense overhead for scientists using existing tools and infrastructure for these core tasks. + +There is now a growing number of tools and platforms[^1] (including ones we have developed and deployed in real scientific settings[^2]) that support operations on discourse units as first-class objects. Enabling semantic interoperability between these tools has significant potential to accelerate scientific discovery by creating a new (federated) infrastructure for sharing and reusing scientific knowledge that more closely matches the actual information needs of scientists for scientific work. In short, there is a significant opportunity to enable the construction of a new infrastructure for [FAIR](http://gofair.foundation/) publication and sharing of scientific knowledge (not just data or papers). The purpose of this document is to sketch out a proposal for a conceptual and technical schema to enable this interoperability. + +To be more explicit, our **guiding principles** are: + +- **Balance expressivity of the semantics with simplicity/consensus across tools and research/epistemic communities** + + - This means we don't define in the common spec what is going to be (irreducibly) controversial (e.g., requiring quantitative values for support/oppose relations, or deciding on [evidence thresholds](https://roamresearch.com/#/app/discourse-graphs/page/3ehfnbIKg) or belief predicate functions that determine whether something is a claim or a hypothesis). + +- **Prioritize/consider pragmatics (implications for UX, in terms of costs/affordances/payoffs)** + + - One key implication is a core commitment to incremental formalization: allow discourse graph entities to be born with only the absolute minimum required formality, and progressively provide affordances (and corresponding payoffs) for adding more formal properties over time. 
For instance, from a UX point of view, within your own lab notes before you publish to the protocol, your "claims/hypotheses" start out by default as a generic superclass statement with no explicit belief predicate. We then prompt you, if you are able, to express a belief predicate attributed to you, at which point we can resolve the statement to one of the subclasses (HYP, CLM, etc.). + + - Another key implication is to allow for local variations in labels for key elements, to match what is most resonant and meaningful in that setting, and build affordances for translating between these local label variations when interoperating across tools. The technical schema that accompanies this conceptual proposal will be one key solution for this. + +- **Enable interoperability with prior relevant standards, where possible.** + + - Our proposal synthesizes common shared features from significant prior art on discourse-centric data models[^3]. Our goal is not to replace these prior specifications, but rather to specify the *minimum possible schema* to enable interoperability amongst existing discourse-centric tools, while also describing common local variations of the schema to permit downstream development of schema translation/migration, including (where appropriate) usage of established data models. + +# Conceptual specification + +## Base “discourse graph” schema + +The “base schema” has 4 types of nodes: 1) Questions, 2) Claims, 3) Evidence, and 4) Sources. It also has 4 types of relations: 1) Supports, 2) Opposes, 3) Addresses, and 4) Interpreted As. Together this base schema comprises what we call a **discourse graph**. + +![Base discourse graph schema](media/image4.png) + +| **Node Type** | **Description** | **Example** | **Notes** | +|---------------|-----------------|-------------|-----------| +| Question | Scientific unknowns that we want to make known, and are addressable by the systematic application of research methods | What is the NPF for CLIC/GEEC endocytic scission? 
| | +| Claim | Atomic, generalized assertions about the world that (propose to) answer research questions | IRSp53 binds WAVE complex | | +| Evidence | A specific empirical observation from a particular application of a research method | IRSp53 coimmunoprecipitated with WAVE in NIH3T3 cell lysates | Typically written in the past tense to emphasize its contextual nature. Evidence is more like a “bundle” than a simple statement: to be more specific, it is a statement describing a specific observation from a particular application of a research method, interpreted by a researcher from one or more “data artifacts” (figures, quotes, tables, statistics), and directly linked to a description of the particular application of a research method that produced the data artifact | +| Source | Some research source that reports/generates evidence, like an experiment/study, book, conference paper, or journal article | @miki2000irsp53 | What’s important for the link to evidence is that the source describes the relevant research method application that gave rise to the data artifact(s) that are part of an evidence bundle | + +| **Relation Type** | **Sources/Targets** | **Example** | **Notes** | +|--------------------------|---------------------|-------------|---------------------------------------------| +| Supports / Supported By | | | | +| Opposes / Opposed By | | | | +| Addresses / Addressed By | | | | +| InterpretedAs | | | This will probably map to the PROV ontology | + +An important but possibly controversial opinion here is that **claims and evidence should be treated as separate types of things**. While both claims and evidence as defined above can be thought of as assertions or statements, in practice we have found that separating them explicitly as different types of things confers many important benefits for scientific thinking and practice. 
How this distinction should be modeled from a technical data structure perspective is a separate question, but we claim that **a critical portion of the benefits of a discourse-centric infrastructure and tooling system accrue from distinguishing claims and evidence (bundles) as first-class objects and should thus be preserved from a user experience perspective (regardless of how it is modeled from a technical data structure perspective).** + +Examples: + +- We can more naturally reason about the contributions of projects/papers in terms of the importance/novelty of their claims, relative to the rigor/trustworthiness/etc. of the empirical evidence that supports/opposes them. This enables us to more naturally assess what is known or unknown *empirically*, and what needs to be done next to increase the quality of our answers to our research questions. + +- We can structure and scaffold contributions in earlier stages of scientific work: for instance, undergraduate research assistants can easily run experiments and feel comfortable reporting specific empirical evidence from those experiments, and then discuss with the lab over time what claims can be made from those evidence (and/or what additional evidence is needed to support an interesting claim). + +- Starting with evidence “bundles” (by virtue of their link to the contextual details of the particular study that produced the data artifact(s) interpreted as evidence) can support more flexible and reasoned abstraction and sensemaking over collections of potentially conflicting evidence. For instance, you can choose whether you want to abstract from an evidence statement about performance of a post-training algorithm applied to the OLMO model as applying to (open weight large language models, language models in general, and so on), and also recover contextual details that might be necessary to explain conflicting evidence (e.g., context length, dataset, task). 
+
+- We can “compile” theories/models from claims in a transparent, evidence-grounded manner (choose claims that have sufficient evidence associated with them, abstract from measurements to concepts/constructs in a principled manner).
+
+## Common variations and their purposes
+
+### Lab discourse graphs
+
+People have used discourse graphs to track the claims and evidence in their *ongoing, unpublished* research.
+
+![Lab discourse graph](media/image5.png)
+
+In this variation, people
+
+- label each claim as a **hypothesis** (untested claim),
+
+- call evidence from their ongoing work a **result**, and
+
+- treat an *Experiment*[^4] as the source material for a result.
+
+This variation in labels fits quite naturally into ongoing research, to say, e.g., that the target “hypothesis” for an experiment is x, and that the results from these experiments, when taken together, can then be shared with the rest of the scientific community as a meaningful unit of contribution. The base schema label of “claim”, for instance, can feel awkward and too strong for early-stage work, and insufficiently weighty for a project-level conclusion/contribution.
+
+Taking this model for ongoing research one step further, we can introduce an *issue* as a type of candidate experiment:
+
+![Lab discourse graph with issues](media/image1.png)
+
+- **Issue**[^5] is a "future experiment" for the researcher or someone else.
+
+“Issue” closes the loop for ongoing research: the researcher wants to make a claim, but the available evidence isn’t quite strong enough, so that motivates a potential experiment. The issue, when claimed, becomes a new experiment, from which new results may be generated.
+
+This schema may also be extended to create and link to specific node types for Methods or Protocols, or entities like Cell lines, to enable queries like “find all the evidence for claim A, and subset it by the cell line and method”, or “what methods are present in our evidence for claim A”.
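The two example queries above can be sketched roughly as follows. This is a minimal, hypothetical in-memory model and not part of the spec: the `Node` and `Relation` classes, the `method` and `cell_line` attribute names, and the `evidence_for` helper are all assumptions made for illustration, and the second evidence node is invented purely to give the filter something to exclude.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A discourse node; `attrs` holds hypothetical links to domain entities."""
    id: str
    type: str  # base type: "Question", "Claim", "Evidence", or "Source"
    text: str
    attrs: tuple = ()  # pairs such as ("method", ...), ("cell_line", ...)

    def attr(self, key):
        return dict(self.attrs).get(key)


@dataclass(frozen=True)
class Relation:
    """A reified relation: a record of its own, not a field on a node."""
    type: str  # "supports" or "opposes" in this sketch
    source: str
    target: str


def evidence_for(claim_id, nodes, relations, **filters):
    """Return (relation type, evidence node) pairs for a claim, optionally
    keeping only evidence whose attributes match every filter."""
    by_id = {n.id: n for n in nodes}
    hits = []
    for rel in relations:
        if rel.target != claim_id or rel.type not in ("supports", "opposes"):
            continue
        node = by_id.get(rel.source)
        if node is None or node.type != "Evidence":
            continue
        if all(node.attr(k) == v for k, v in filters.items()):
            hits.append((rel.type, node))
    return hits


nodes = [
    Node("clm1", "Claim", "IRSp53 binds WAVE complex"),
    Node("evd1", "Evidence",
         "IRSp53 coimmunoprecipitated with WAVE in NIH3T3 cell lysates",
         (("method", "co-immunoprecipitation"), ("cell_line", "NIH3T3"))),
    # Hypothetical node, invented for illustration only.
    Node("evd2", "Evidence", "Placeholder observation from a second assay",
         (("method", "pull-down assay"), ("cell_line", "HeLa"))),
]
relations = [
    Relation("supports", "evd1", "clm1"),
    Relation("supports", "evd2", "clm1"),
]

# "Find all the evidence for claim A, and subset it by the cell line"
all_evidence = evidence_for("clm1", nodes, relations)
nih3t3_only = evidence_for("clm1", nodes, relations, cell_line="NIH3T3")

# "What methods are present in our evidence for claim A"
methods_present = sorted({n.attr("method") for _, n in all_evidence})
```

Keeping relations as separate records means the query walks the relation list rather than reading a field off the claim, which is the same shape a reified-relation store would have.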
+
+### Discourse graphs and HCI
+
+This model has also been adapted to cover the kinds of questions and contributions that constitute research in Human-Computer Interaction (HCI).
+
+![HCI discourse graph](media/image2.png)
+
+In this variation, people add node types of Patterns and Artifacts to help structure work on contributing new Design Patterns (instantiated in specific Artifacts that we can test and explore) to address "How might we" research problems in HCI.
+
+- **Artifact**: a specific concrete system (prototype, standard, etc.) that instantiates one or more conceptual patterns or methods
+
+- **Pattern**: a conceptual class, such as a theoretical object, heuristic, design pattern, or system/methodological approach, that is abstracted from a \*specific\* implementation. Patterns are what make specific systems "work" or not, matched to a model of the problem.
+
+This model can help structure the synthesis of new HCI research directions in close conversation with existing/prior work.
+
+- For example, if we're making a systems contribution, we can review the design space / history of prior art we build on and/or contribute to, w/ something like the following structure
+
+  - For each key \[\[PTN\]\]
+
+    - Describe key exemplar \[\[ART\]\]s for this \[\[PTN\]\]
+
+    - Key \[\[CLM\]\] and \[\[EVD\]\] about this \[\[PTN\]\] in relation to our core problem
+
+  - And any open \[\[QUE\]\] we address
+
+    - e.g., does \[\[PTN\]\] work for our task? what might the \[\[ART\]\] look like?
+
+    - \[\[ART\]\] seems awesome, but doesn't quite do X, how do we make it do X?
+
+    - lots of people say \[\[PTN\]\] is great, but we don't have great \[\[EVD\]\] that it actually works
+
+- Or if we're making an empirical contribution, we can review prior insights and questions we build on and/or contribute to, w/ something like the following structure
+
+  - Key \[\[CLM\]\] about some core (sub)\[\[QUE\]\]
+
+    - Key supporting and opposing (possibly conflicting!) \[\[EVD\]\], with intuitions about strength of \[\[EVD\]\]
+
+  - And then any open \[\[QUE\]\] that remain that we address
+
+    - e.g., \[\[CLM\]\] is big if true, but we really don't have good \[\[EVD\]\] for it (for our context, or bc the methods suck for XYZ reasons, etc.)
+
+    - \[\[CLM\]\] A and \[\[CLM\]\] B are in major tension, here's what a decisive \[\[EVD\]\] would look like that arbitrates between them
+
+As above with methods and entity types linked to the core discourse graph, this variation expresses interoperability between a discourse graph and a more domain-specific knowledge graph, where it is useful to know what claims and evidence involve or speak to domain-specific entities.
+
+### User experience research
+
+![UX research discourse graph](media/image3.png)
+
+![UX research discourse graph detail](media/image7.png)
+
+# Technical instantiations
+
+We are developing a set of technical instantiations of this conceptual schema that enable specific tools and transport layers to express the schema. This is very much a work in progress: unless otherwise noted, all links below are to proof-of-concept, early-stage drafts whose purpose is to stimulate discussion and iteration, rather than support real-world usage.
+
+Some notes on key design choices we think should constrain implementations:
+
+1. **Where possible, local variations in labels/subtypes for a node type (e.g., conclusions, claims, hypotheses) should point to the nearest base schema node type for the purposes of interoperability.**
+
+2. **Relations should be separate assertions (with their own metadata) rather than attributes of a discourse node.** In other words, all relations should be reified.
+
+### OWL
+
+See [owl/](owl/) for the OWL ontology files: [dg_core.ttl](owl/dg_core.ttl) and [dg_base.ttl](owl/dg_base.ttl).
+
+The ontology will eventually be hosted on [Nanodash](https://nanodash.knowledgepixels.com/explore?26&id=https://w3id.org/np/RAoRxaHtG1GTAjqV5LZAv4lzIuCLnowJe-3GeTFjLcq5Y&label=The+Discourse+Graphs+project&forward-to-part=true).
+
+The purpose of this spec is to define the minimal shared base set of nodes and relations across the discourse graph protocol. At the moment, the draft defines specs for the "Base “discourse graph” schema" as defined above.
+
+This spec will also explicitly map the node and relation types to pre-existing schemas from the Semantic Web, such as [SEPIO](https://va-spec.ga4gh.org/en/latest/appendices/sepio-framework.html), [micropublications](https://link.springer.com/article/10.1186/2041-1480-5-28), and [ScholOnto](https://research-archive.stem.open.ac.uk/scholonto/), as appropriate, to enable interoperability.
+
+We assert that a tool has a discourse graph and can productively interoperate on the protocol if it contains data structures that can be mapped to these base nodes and relations.
+
+### ATProto Lexicon
+
+See [atproto-lexicon/](atproto-lexicon/) for a prototype ATProto Lexicon (`org.discoursegraphs.*`) that maps the base discourse graph schema to federated ATProto records. Relations are reified as separate records with their own authorship, provenance, and timestamps, enabling a tiered trust model from personal graphs to cross-community federation.
+
+### JSON-LD
+
+Example: https://github.com/DiscourseGraphs/MATSUlab-issue-exchange-analysis
+
+Useful for MCP servers, and for interoperation between ATProto and the semantic web (e.g., nanopublications).
+
+### MyST AST
+
+See [discourse-graphs-myst-spec.md](discourse-graphs-myst-spec.md) for a draft specification of MyST Markdown directives and roles that embed discourse graph semantics directly into scientific documents, enabling continuous lab workflows and cross-lab collaboration.
+
+### OXA
+
+Forthcoming!
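As a rough illustration of the interoperability criterion stated in the OWL section above (a tool's data structures can be mapped to the base nodes and relations), the sketch below resolves local labels to base node types. It is an assumption-laden example rather than any instantiation's actual mechanism: the `LOCAL_TO_BASE` table and both helper functions are hypothetical, and the mapping entries only restate the lab-graph variations described earlier (hypothesis and conclusion as claims, result as evidence, experiment as source).

```python
# The four base node types named in the conceptual specification.
BASE_TYPES = ("Question", "Claim", "Evidence", "Source")
_BASE_BY_LOWER = {t.lower(): t for t in BASE_TYPES}

# Hypothetical local-label table; a real tool would supply its own.
LOCAL_TO_BASE = {
    "hypothesis": "Claim",
    "conclusion": "Claim",
    "result": "Evidence",
    "experiment": "Source",
}


def to_base_type(label):
    """Resolve a local label to its nearest base type, or None if unmapped."""
    key = label.strip().lower()
    return _BASE_BY_LOWER.get(key) or LOCAL_TO_BASE.get(key)


def unmapped_labels(labels):
    """List the labels a tool uses that have no base-type mapping yet."""
    return sorted({label for label in labels if to_base_type(label) is None})


tool_labels = ["Hypothesis", "Result", "Experiment", "Question", "Pattern"]
resolved = {label: to_base_type(label) for label in tool_labels}
missing = unmapped_labels(tool_labels)
```

Labels such as Pattern, which extend the schema rather than rename a base type, come back unmapped here; how such extensions should be handled is left open by the text above.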
+
+![Additional notes diagram](media/image6.png)
+
+[^1]: A non-exhaustive current list includes [Fylo](https://fylo.io/), [Oshima](https://oshimascience.com/), [Octopus](https://www.octopus.ac/), [Nanodash](https://nanodash.knowledgepixels.com/?30), [Polyplexus](https://start.polyplexus.com/), [CIVICDB](https://civicdb.org/welcome), and [Consensus](https://consensus.app/)
+
+[^2]: [https://discoursegraphs.com/](https://discoursegraphs.com/)
+
+[^3]: A non-exhaustive list includes the [ScholOnto](https://research-archive.stem.open.ac.uk/scholonto/) project, [micropublications](https://link.springer.com/article/10.1186/2041-1480-5-28), [nanopublications](https://nanopub.net/), and [SEPIO](https://va-spec.ga4gh.org/en/latest/appendices/sepio-framework.html)
+
+[^4]: Here, we use the term “experiment” to denote a specific set of systematic methods applied to answer a specific question or hypothesis, that will then produce data that can ground a result or evidence (specific observation). It’s therefore a more expansive conceptual unit that can refer to many different kinds of systematic studies using a variety of methods, such as simulations, analyses, surveys, etc. The main goal is to distinguish this entity from “data” (some files, numbers, etc. that came from an experiment). We encourage local discourse graphs that wish to operate on the protocol but disfavor this specific term to still use the entity (via URI) and apply a label that makes more sense in their local context.
+
+[^5]: Note: the term “issue” is up for debate - it has proven useful in practice, but also can sometimes be confused for a more generic “task” or “problem to be solved” (in the IBIS sense). Tools and users should feel free to use a label that more accurately and resonantly captures the core idea of an issue as a *request for an experiment/study* (that someone could claim and produce results that support/oppose a hypothesis).
diff --git a/media/image1.png b/media/image1.png
new file mode 100644
index 0000000..dbd7d11
Binary files /dev/null and b/media/image1.png differ
diff --git a/media/image2.png b/media/image2.png
new file mode 100644
index 0000000..9c6a454
Binary files /dev/null and b/media/image2.png differ
diff --git a/media/image3.png b/media/image3.png
new file mode 100644
index 0000000..1cbc77f
Binary files /dev/null and b/media/image3.png differ
diff --git a/media/image4.png b/media/image4.png
new file mode 100644
index 0000000..10e43da
Binary files /dev/null and b/media/image4.png differ
diff --git a/media/image5.png b/media/image5.png
new file mode 100644
index 0000000..efdfa66
Binary files /dev/null and b/media/image5.png differ
diff --git a/media/image6.png b/media/image6.png
new file mode 100644
index 0000000..d599539
Binary files /dev/null and b/media/image6.png differ
diff --git a/media/image7.png b/media/image7.png
new file mode 100644
index 0000000..ae817d3
Binary files /dev/null and b/media/image7.png differ
diff --git a/owl/dg_base.ttl b/owl/dg_base.ttl
new file mode 100644
index 0000000..86286ad
--- /dev/null
+++ b/owl/dg_base.ttl
@@ -0,0 +1,87 @@
+@prefix rdf: .
+@prefix rdfs: .
+@prefix : .
+@prefix dc: .
+@prefix owl: .
+@prefix dgb: .
+
+
+    dc:date "2025-12-22" ;
+    rdfs:comment "DiscourseGraph foundation vocabulary"@en ;
+    rdfs:label "DiscourseGraph foundation vocabulary"@en ;
+    owl:versionInfo "0 (tentative)" ;
+    a owl:Ontology.
+
+# This is inspired by https://hyperknowledge.org/schemas/hyperknowledge_frames.ttl
+# and topic mapping
+
+dgb:NodeSchema
+rdfs:subClassOf owl:Class;
+    rdfs:comment "Subclasses of DiscourseGraph nodes"@en .
+
+dgb:Role
+    rdfs:subClassOf owl:ObjectProperty,
+        [a owl:Restriction; owl:onProperty rdfs:domain ; owl:allValuesFrom dgb:NodeSchema ],
+        [a owl:Restriction; owl:onProperty rdfs:range ; owl:allValuesFrom dgb:NodeSchema ];
+    rdfs:comment "A role within a node schema"@en .
+
+dgb:RelationDef rdfs:subClassOf owl:ObjectProperty;
+    rdfs:comment "DiscourseGraph relations"@en.
+
+dgb:RelationInstance rdfs:subClassOf rdf:Statement, dgb:NodeSchema,
+    [a owl:Restriction; owl:onProperty rdf:predicate ; owl:allValuesFrom dgb:RelationDef ].
+
+dgb:source a dgb:Role ;
+    rdfs:subPropertyOf rdf:subject ;
+    rdfs:domain dgb:RelationInstance ;
+    rdfs:range dgb:NodeSchema ;
+    rdfs:comment "The source of a binary relation"@en .
+
+dgb:destination a dgb:Role ;
+    rdfs:subPropertyOf rdf:object ;
+    rdfs:domain dgb:RelationInstance ;
+    rdfs:range dgb:NodeSchema ;
+    rdfs:comment "The destination of a binary relation"@en .
+
+dgb:textRefersToNode a owl:ObjectProperty;
+    rdfs:domain dgb:NodeSchema;
+    rdfs:range dgb:NodeSchema;
+    rdfs:comment "The text of a node refers to another node"@en .
+
+
+# examples
+
+# :x a dgb:NodeSchema .
+# :y a dgb:NodeSchema .
+# :x0 a :x.
+# :y0 a :y.
+# :r a dgb:NodeSchema.
+# :x_r a dgb:Role ;
+#     rdfs:domain :r ;
+#     rdfs:range :x .
+
+# :r0 a :r;
+#     :x_r :x0.
+
+# :br a dgb:RelationDef;
+#     rdfs:domain :x;
+#     rdfs:range :y .
+
+# :br0
+#     a dgb:RelationInstance;
+#     rdf:predicate :br ;
+#     dgb:source :x0 ;
+#     dgb:destination :y0 .
+
+# # This is "about" :x0 :br :y0;
+
+# Note: we could also use punning, and define
+# :br rdfs:subClassOf dgb:RelationInstance,
+#     [a owl:Restriction;
+#      owl:onProperty rdf:predicate ;
+#      owl:hasValue :br ].
+# Then we can more simply state
+# :br0
+#     a :br;
+#     dgb:source :x0 ;
+#     dgb:destination :y0 .
diff --git a/owl/dg_core.ttl b/owl/dg_core.ttl
new file mode 100644
index 0000000..95ed997
--- /dev/null
+++ b/owl/dg_core.ttl
@@ -0,0 +1,92 @@
+@prefix rdf: .
+@prefix rdfs: .
+@prefix : .
+@prefix dc: .
+@prefix owl: .
+@prefix vs: .
+@prefix sioc: .
+@prefix prov: .
+@prefix dgb: .
+@prefix dg: .
+
+dg:Question a dgb:NodeSchema;
+    rdfs:label "Question"@en;
+    rdfs:comment "Scientific unknowns that we want to make known, and are addressable by the systematic application of research methods"@en.
+
+dg:Claim a dgb:NodeSchema;
+    rdfs:label "Claim"@en;
+    rdfs:comment "Atomic, generalized assertions about the world that (propose to) answer research questions"@en.
+
+dg:Evidence a dgb:NodeSchema;
+    rdfs:label "Evidence"@en;
+    rdfs:comment "A specific empirical observation from a particular application of a research method"@en.
+
+dg:Source a dgb:NodeSchema;
+    rdfs:label "Source"@en;
+    rdfs:comment "Some research source that reports/generates evidence, like an experiment/study, book, conference paper, or journal article"@en.
+
+dg:opposesCE a dgb:RelationDef;
+    rdfs:label "Opposes"@en;
+    rdfs:range dg:Claim;
+    rdfs:domain dg:Evidence.
+
+dg:opposedByEC a dgb:RelationDef;
+    rdfs:label "Opposed by"@en;
+    owl:inverseOf dg:opposesCE;
+    rdfs:range dg:Evidence;
+    rdfs:domain dg:Claim.
+
+dg:supportsCE a dgb:RelationDef;
+    rdfs:label "Supports"@en;
+    rdfs:range dg:Claim;
+    rdfs:domain dg:Evidence.
+
+dg:supportedByEC a dgb:RelationDef;
+    rdfs:label "Supported by"@en;
+    owl:inverseOf dg:supportsCE;
+    rdfs:range dg:Evidence;
+    rdfs:domain dg:Claim.
+
+dg:opposesCC a dgb:RelationDef;
+    rdfs:label "Opposes"@en;
+    rdfs:range dg:Claim;
+    rdfs:domain dg:Claim.
+
+dg:opposedByCC a dgb:RelationDef;
+    rdfs:label "Opposed by"@en;
+    owl:inverseOf dg:opposesCC;
+    rdfs:range dg:Claim;
+    rdfs:domain dg:Claim.
+
+dg:supportsCC a dgb:RelationDef;
+    rdfs:label "Supports"@en;
+    rdfs:range dg:Claim;
+    rdfs:domain dg:Claim.
+
+dg:supportedByCC a dgb:RelationDef;
+    rdfs:label "Supported by"@en;
+    owl:inverseOf dg:supportsCC;
+    rdfs:range dg:Claim;
+    rdfs:domain dg:Claim.
+
+dg:addresses a dgb:RelationDef;
+    rdfs:label "Addresses"@en;
+    rdfs:range dg:Question;
+    rdfs:domain dg:Claim.
+
+dg:addressedBy a dgb:RelationDef;
+    rdfs:label "Addressed by"@en;
+    owl:inverseOf dg:addresses;
+    rdfs:range dg:Claim;
+    rdfs:domain dg:Question.
+
+dg:curatedTo a dgb:RelationDef;
+    rdfs:label "Curated to"@en;
+    rdfs:range dg:Source;
+    rdfs:domain dg:Evidence.
+
+dg:curatedFrom a dgb:RelationDef;
+    owl:inverseOf dg:curatedTo;
+    rdfs:label "Curated from"@en;
+    rdfs:range dg:Evidence;
+    rdfs:domain dg:Source.
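The domain and range declarations in `dg_core.ttl` above can be read as a small table of which node types each relation connects. The sketch below transcribes that table into plain Python and checks a proposed relation against it. It is an illustration under stated assumptions, not part of the ontology: the `RELATION_DEFS` and `INVERSES` dictionaries and the `normalize` and `is_valid` helpers are hypothetical names, and treating domain/range as a validation rule is a deliberate simplification, since under RDFS/OWL semantics these declarations license inferences rather than reject data.

```python
# (domain, range) pairs transcribed from the forward relations in dg_core.ttl.
RELATION_DEFS = {
    "supportsCE": ("Evidence", "Claim"),
    "opposesCE": ("Evidence", "Claim"),
    "supportsCC": ("Claim", "Claim"),
    "opposesCC": ("Claim", "Claim"),
    "addresses": ("Claim", "Question"),
    "curatedTo": ("Evidence", "Source"),
}

# Inverse relations, keyed to the forward relation named by owl:inverseOf.
INVERSES = {
    "supportedByEC": "supportsCE",
    "opposedByEC": "opposesCE",
    "supportedByCC": "supportsCC",
    "opposedByCC": "opposesCC",
    "addressedBy": "addresses",
    "curatedFrom": "curatedTo",
}


def normalize(predicate, source_type, dest_type):
    """Rewrite an inverse relation into its forward form, swapping endpoints."""
    if predicate in INVERSES:
        return INVERSES[predicate], dest_type, source_type
    return predicate, source_type, dest_type


def is_valid(predicate, source_type, dest_type):
    """Check that the endpoint types match the declared domain and range."""
    predicate, source_type, dest_type = normalize(predicate, source_type, dest_type)
    return RELATION_DEFS.get(predicate) == (source_type, dest_type)


checks = {
    "evidence supports claim": is_valid("supportsCE", "Evidence", "Claim"),
    "claim supported by evidence": is_valid("supportedByEC", "Claim", "Evidence"),
    "evidence addresses question": is_valid("addresses", "Evidence", "Question"),
}
```

Normalizing inverses to one forward direction keeps the table half the size and means a store only needs to record each relation once.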