Commit c9afc61
feat(prepro): assign segment/subtype using nextclade sort (3/n) (#5402)
resolves #4847
Improves #4821, comes after #5398.
You can use pathoplexus/dev_example_data#2 for testing.
Nextclade sort is now used to assign segments/subtypes for all aligned
sequences. The minimizer index it uses is set in the config:
```
minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>
```
For organisms without a nextclade dataset, the fasta headers can still
be used to determine the segment/subtype. Entries must have the format
`<submissionId>_<segmentName>` (as in the current setup).
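The header convention above can be illustrated with a minimal sketch. The function name `split_fasta_header` is hypothetical, and splitting at the last underscore is an assumption (it lets the submission ID itself contain underscores); the actual preprocessing code may parse headers differently.

```python
def split_fasta_header(header: str) -> tuple[str, str]:
    """Split '<submissionId>_<segmentName>' into (submissionId, segmentName).

    Assumption: the segment name is everything after the LAST underscore.
    """
    submission_id, sep, segment = header.rpartition("_")
    # Reject headers with no underscore or with an empty part on either side.
    if not sep or not submission_id or not segment:
        raise ValueError(
            f"header {header!r} is not of the form <submissionId>_<segmentName>"
        )
    return submission_id, segment


print(split_fasta_header("sample_01_L"))  # ('sample_01', 'L')
```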
As preprocessing now assigns segments, it returns a map from the
segment (or subtype) to the fastaHeader in the processedData:
`sequenceNameToFastaHeaderMap`. This allows us to surface the
assignment on the edit page.
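A minimal sketch of how such a map could be built from nextclade sort results. The row keys (`seqName`, `dataset`, `score`), the helper name, and the rule of raising on duplicate segments are assumptions for illustration, not the actual preprocessing implementation.

```python
def build_sequence_name_to_fasta_header_map(
    sort_rows: list[dict],
    accepted_matches: dict[str, list[str]],
) -> dict[str, str]:
    """Map each segment name to the FASTA header assigned to it.

    sort_rows: one dict per (sequence, dataset) hit, with the assumed keys
        'seqName' (the FASTA header), 'dataset' and 'score'.
    accepted_matches: segment name -> dataset names accepted for that segment.
    """
    # Invert accepted_matches so a dataset name looks up its segment.
    dataset_to_segment = {
        dataset: segment
        for segment, datasets in accepted_matches.items()
        for dataset in datasets
    }

    # Keep only the best-scoring accepted hit per FASTA header.
    best_hit: dict[str, dict] = {}
    for row in sort_rows:
        if row["dataset"] not in dataset_to_segment:
            continue
        current = best_hit.get(row["seqName"])
        if current is None or row["score"] > current["score"]:
            best_hit[row["seqName"]] = row

    result: dict[str, str] = {}
    for header, row in best_hit.items():
        segment = dataset_to_segment[row["dataset"]]
        if segment in result:
            raise ValueError(f"segment {segment!r} matched by more than one sequence")
        result[segment] = header
    return result


rows = [
    {"seqName": "sub1_a", "dataset": "nextstrain/cchfv/linked/L", "score": 0.9},
    {"seqName": "sub1_a", "dataset": "nextstrain/cchfv/linked/M", "score": 0.2},
    {"seqName": "sub1_b", "dataset": "nextstrain/cchfv/linked/S", "score": 0.8},
]
accepted = {
    "L": ["nextstrain/cchfv/linked/L"],
    "M": ["nextstrain/cchfv/linked/M"],
    "S": ["nextstrain/cchfv/linked/S"],
}
print(build_sequence_name_to_fasta_header_map(rows, accepted))
```

In this example the first sequence hits both the L and M datasets, and only the higher-scoring L hit is kept, so the result maps `L` to `sub1_a` and `S` to `sub1_b`.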
## Prepro config changes
Instead of a dictionary for the nextclade datasets and servers,
`nucleotideSequences` is now a list of sequences. The first block below
shows the old format, the second the new one:
```
nextclade_dataset_name:
  L: nextstrain/cchfv/linked/L
  M: nextstrain/cchfv/linked/M
  S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```
```
nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level>
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used>
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix>
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
```
Note the templates now also generate the genes list from the merged
config.
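For existing configs, the change of shape described above could be applied mechanically. The sketch below is a hypothetical helper, not part of this PR: it assumes the old `nextclade_dataset_name` value is a dictionary from segment name to dataset name, and it copies all other keys (such as `nextclade_dataset_server` and `genes`) unchanged, which may not match what the templates expect.

```python
def migrate_prepro_config(old: dict) -> dict:
    """Convert the old dictionary-style config into the new list-style one."""
    # Copy every key except the one whose shape changed (assumption:
    # the remaining keys keep their meaning in the new format).
    new = {key: value for key, value in old.items() if key != "nextclade_dataset_name"}
    # One list entry per segment, in the order given in the old dictionary.
    new["nucleotideSequences"] = [
        {"name": segment, "nextclade_dataset_name": dataset}
        for segment, dataset in old.get("nextclade_dataset_name", {}).items()
    ]
    return new


old_config = {
    "nextclade_dataset_name": {
        "L": "nextstrain/cchfv/linked/L",
        "M": "nextstrain/cchfv/linked/M",
        "S": "nextstrain/cchfv/linked/S",
    },
    "genes": ["RdRp", "GPC", "NP"],
}
for entry in migrate_prepro_config(old_config)["nucleotideSequences"]:
    print(entry)
```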
### PR Checklist
- [ ] Update values.schema.json
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using:
https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: EVs from the
test folder were submitted with the same fastaHeader as the submissionId,
which succeeded; CCHF was submitted with a fastaID column in the metadata
(also in the folder above); and revision of a segment was tested
- [x] Have preprocessing send back a segment: fastaHeader mapping
## Future Work
- [ ] add integration testing for full EV submission user journey
- [ ] improve CCHF minimizer (some segments are again not assigned)
- [ ] discuss if the originalData dictionary should be migrated
(persistent DB has segmentName as key, now we have fastaHeader as key)
- [ ] update PPX docs with new multi-segment submission format
🚀 Preview: https://sort-multi-path.loculus.org
1 parent: 81d9054
38 files changed
Lines changed: 754 additions & 361 deletions
File tree
- backend/src
  - main/kotlin/org/loculus/backend
    - api
    - service/submission
  - test/kotlin/org/loculus/backend
    - controller/submission
    - service
    - utils
- integration-tests/tests
  - fixtures
  - specs/features
    - search
- kubernetes/loculus
  - templates
- preprocessing
  - nextclade
    - src/loculus_preprocessing
    - tests
      - ebola-dataset/minimizer
- website
  - src
    - components
      - Edit
      - ReviewPage
      - Submission
        - FileUpload
    - types