Commit 23cf4d6

docs: final draft of processing page
1 parent 323dfe4 commit 23cf4d6

1 file changed

Lines changed: 35 additions & 5 deletions

docs/source/acquire_upload/processing.md

````diff
@@ -6,16 +6,46 @@ Scientific computing is currently re-organizing pipelines to be per-modality, ra
 
 ### Data
 
-[todo]
+See [Data organization/Derived data conventions](../philosophy/data_organization.md#derived-data-conventions) for file organization conventions in derived assets.
 
 ### Metadata
 
-All processing pipelines that create derived assets will upgrade the [data_description](https://aind-data-schema.readthedocs.io/en/latest/data_description.html) to a derived data description (changing the name and data_level). Processing also creates additional processing metadata as well as quality_control metadata.
+#### data_description.json
 
-Any files that are not modified should simply be copied to the derived asset unchanged.
-How to upgrade a data_description
+All processing pipelines that create derived assets should upgrade the [data_description](https://aind-data-schema.readthedocs.io/en/latest/data_description.html) to a derived data description (changing the name and data_level).
 
-Use the [`DataDescription.from_data_description()`](https://github.com/AllenNeuralDynamics/aind-data-schema/blob/e172cb06a63b722eaeaaf8933d0a17cbedf3feea/src/aind_data_schema/core/data_description.py#L334) function to create derived data_description objects. Pass the process name as a parameter. If more source data assets were used than just the one being passed into the function then pass the optional source_data parameter as well with the names of those data assets.
+##### How to upgrade a data_description
+
+Use the [`DataDescription.from_data_description()`](https://github.com/AllenNeuralDynamics/aind-data-schema/blob/e172cb06a63b722eaeaaf8933d0a17cbedf3feea/src/aind_data_schema/core/data_description.py#L334) function to create derived data_description objects. Pass the process name as a parameter, often just `"processed"`. If more source data assets were used than the one passed into the function, also pass the optional `source_data` parameter with the names of those data assets.
+
+```python
+from pathlib import Path
+from aind_data_schema.core.data_description import DataDescription
+
+# Load the original data_description.json
+original_data_description = DataDescription.model_validate_json(
+    Path("data_description.json").read_text()
+)
+
+# Create a derived data_description with upgrade
+derived_data_description = DataDescription.from_data_description(
+    data_description=original_data_description,
+    process_name="processed"
+)
+
+# Write the derived data_description to the results directory
+derived_data_description.write_standard_file(output_directory="/results")
+```
+
+#### processing.json
+
+Processing pipelines need to track each [DataProcess](https://aind-data-schema.readthedocs.io/en/latest/processing.html#dataprocess) that was run to create the derived data asset.
+
+If processing was performed as part of a Nextflow pipeline, that should be tracked in the `Processing.pipelines` field using a [Code](https://aind-data-schema.readthedocs.io/en/latest/components/identifiers.html#code) object pointing to the GitHub repository with the Nextflow configuration. Use the `DataProcess.pipeline_name` field to indicate that processes were run as part of a pipeline.
+
+#### Other metadata
+
+Metadata `.json` files that are not modified should be copied to the derived asset unchanged.
 
 ## Per-modality physiology pipelines
````
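As a rough illustration of the `processing.json` shape that the new text describes (a `Processing.pipelines` entry holding a Code object, and `DataProcess` entries carrying `pipeline_name`), here is a hand-written JSON sketch. The repository URL, pipeline name, and process name are hypothetical, and the real, required field set is defined by aind-data-schema, not by this snippet:

```python
import json

# Hypothetical sketch of a derived asset's processing.json.
# Only the fields discussed in the docs text are shown; aind-data-schema
# requires more (versions, timestamps, etc.).
processing = {
    "pipelines": [
        {
            # Code object pointing at the repo holding the Nextflow config
            "url": "https://github.com/example-org/example-pipeline",  # hypothetical
            "name": "example-pipeline",
            "version": "1.0.0",
        }
    ],
    "data_processes": [
        {
            "name": "example-process",  # hypothetical process
            # Ties this DataProcess to the pipeline entry above
            "pipeline_name": "example-pipeline",
        }
    ],
}

print(json.dumps(processing, indent=2))
```

The key relationship is that each `DataProcess.pipeline_name` matches the `name` of an entry in `pipelines`.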

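The "copied to the derived asset unchanged" rule for other metadata can be sketched with the standard library alone. The directories here are temporary stand-ins created for the demo, and the `regenerated` set is an assumption about which files a given pipeline rewrites itself:

```python
import shutil
import tempfile
from pathlib import Path

# Stand-in directories for the source asset and the derived (results) asset
src = Path(tempfile.mkdtemp())
dst = Path(tempfile.mkdtemp())
for name in ["subject.json", "procedures.json", "data_description.json"]:
    (src / name).write_text("{}")

# Files this pipeline regenerates itself (assumed set); every other
# metadata .json is copied to the derived asset unchanged.
regenerated = {"data_description.json", "processing.json", "quality_control.json"}
copied = []
for f in sorted(src.glob("*.json")):
    if f.name not in regenerated:
        shutil.copy(f, dst / f.name)
        copied.append(f.name)

print(copied)  # → ['procedures.json', 'subject.json']
```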