Summary
Generate a DCAT-AP dataset description after a pipeline run, describing what was produced: triple counts, distribution URLs, provider metadata.
Context
loda-pipeline generates a datasetdescription.ttl after each pipeline run containing:
- dcat:Dataset with title, description, publisher, license
- dcat:Distribution entries for each output file (N-Triples, EDM XML ZIP) with byte size, media type, access URL
- Triple and record counts
- Temporal metadata (modification date)
This description is then validated against NDE's dataset register SHACL shapes and registered with the NDE Dataset Register API.
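To make the shape of such a description concrete, here is a minimal sketch that renders one as Turtle. Everything in it is an assumption made for illustration: the `DatasetInfo` and `DistributionInfo` shapes, the function names, the use of `void:triples` for the triple count, and the IANA-based media type IRIs. None of it is taken from loda-pipeline, and it has not been checked against the NDE SHACL shapes.

```typescript
// Hypothetical input shapes; not taken from loda-pipeline.
interface DistributionInfo {
  accessUrl: string;
  mediaType: string; // IANA media type name, e.g. "application/n-triples"
  byteSize: number;
}

interface DatasetInfo {
  iri: string;
  title: string;
  description: string;
  publisherIri: string;
  licenseIri: string;
  modified: Date;
  tripleCount: number;
  distributions: DistributionInfo[];
}

const PREFIXES: string[] = [
  "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
  "@prefix dct: <http://purl.org/dc/terms/> .",
  "@prefix void: <http://rdfs.org/ns/void#> .",
  "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
];

// Assumption: media types are expressed as IRIs in the IANA registry namespace.
const IANA = "https://www.iana.org/assignments/media-types/";

// Escape a string for use inside a double-quoted Turtle literal.
function literal(value: string): string {
  const escaped = value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
  return `"${escaped}"`;
}

// Render one dcat:Dataset with its distributions as blank nodes.
function renderDatasetDescription(d: DatasetInfo): string {
  const distributions = d.distributions.map(
    (dist) =>
      `dcat:distribution [ a dcat:Distribution ; ` +
      `dcat:accessURL <${dist.accessUrl}> ; ` +
      `dcat:mediaType <${IANA}${dist.mediaType}> ; ` +
      `dcat:byteSize ${dist.byteSize} ]`
  );
  const properties = [
    "a dcat:Dataset",
    `dct:title ${literal(d.title)}`,
    `dct:description ${literal(d.description)}`,
    `dct:publisher <${d.publisherIri}>`,
    `dct:license <${d.licenseIri}>`,
    `dct:modified "${d.modified.toISOString()}"^^xsd:dateTime`,
    // Assumption: the triple count is expressed with void:triples.
    `void:triples ${d.tripleCount}`,
    ...distributions,
  ];
  return [...PREFIXES, "", `<${d.iri}>`, `  ${properties.join(" ;\n  ")} .`, ""].join("\n");
}
```

String concatenation keeps the sketch dependency-free; a real implementation would more likely build quads and hand them to an RDF serializer, which also takes care of IRI validation.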
loda-pipeline also has an update_data_catalog.sh script that combines all individual dataset descriptions into a single dcat:Catalog for multi-dataset pipelines.
Approach
This is distinct from withProvenance() (which records PROV-O process metadata) and from #82.
Could be:
- A post-pipeline step that counts triples written and generates the description
- A Writer decorator that tracks counts as quads flow through, then emits the description at the end
- A standalone utility that takes a pipeline's output files and generates the description
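The Writer decorator option could look roughly like the sketch below. The `Quad`, `Writer`, and `RunCounts` interfaces and the `CountingWriter` name are hypothetical stand-ins; the pipeline's actual Writer interface may well be asynchronous or stream-based, in which case the same idea applies with a different signature.

```typescript
// Minimal stand-in types; the real Writer interface may differ.
interface Quad {
  subject: string;
  predicate: string;
  object: string;
  graph?: string;
}

interface Writer {
  write(quad: Quad): void;
  end(): void;
}

interface RunCounts {
  quadCount: number;
}

// Wraps another Writer, counts quads as they pass through, and reports the
// total once the stream ends so a description can be generated from it.
class CountingWriter implements Writer {
  private readonly inner: Writer;
  private readonly onEnd: (counts: RunCounts) => void;
  private quadCount = 0;

  constructor(inner: Writer, onEnd: (counts: RunCounts) => void) {
    this.inner = inner;
    this.onEnd = onEnd;
  }

  write(quad: Quad): void {
    this.inner.write(quad);
    // Incremented after the inner write, so a quad that makes the inner
    // writer throw is not counted.
    this.quadCount += 1;
  }

  end(): void {
    this.inner.end();
    this.onEnd({ quadCount: this.quadCount });
  }
}
```

Compared with a post-pipeline step, this avoids re-reading the output to count triples, but byte sizes of the output files would still have to be collected after the inner writer has finished.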
The catalog generation (combining multiple dataset descriptions) is a secondary concern for multi-dataset orchestration.
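As a rough illustration of the catalog step, the sketch below emits only the dcat:Catalog node with dcat:dataset links to already-described datasets. The `renderCatalog` function and its parameters are hypothetical; this is not a port of update_data_catalog.sh, and merging the full dataset descriptions into the same file is left out.

```typescript
// Escape a string for use inside a double-quoted Turtle literal.
function turtleString(value: string): string {
  const escaped = value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
  return `"${escaped}"`;
}

// Hypothetical helper: link a set of dataset IRIs from a single dcat:Catalog.
function renderCatalog(catalogIri: string, title: string, datasetIris: string[]): string {
  // Deduplicate and sort so repeated runs produce a stable file.
  const unique = datasetIris
    .filter((iri, index) => datasetIris.indexOf(iri) === index)
    .sort();
  const properties = [
    "a dcat:Catalog",
    `dct:title ${turtleString(title)}`,
    ...unique.map((iri) => `dcat:dataset <${iri}>`),
  ];
  return [
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
    "@prefix dct: <http://purl.org/dc/terms/> .",
    "",
    `<${catalogIri}>`,
    `  ${properties.join(" ;\n  ")} .`,
    "",
  ].join("\n");
}
```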
Relates to