Commit c9afc61

feat(prepro): assign segment/subtype using nextclade sort (3/n) (#5402)
resolves #4847

### Screenshot

Improves #4821, comes after #5398

You can use pathoplexus/dev_example_data#2 for testing.

Nextclade sort will be used to assign segments/subtypes for all aligned sequences:

```
minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>
```

For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype. Entries must have the format `<submissionId>_<segmentName>` (as in the current setup).

As preprocessing now assigns segments, it returns a map from the segment (or subtype) to the fastaHeader in the processedData: `sequenceNameToFastaHeaderMap`. This allows us to surface this assignment on the edit page.

## Prepro config changes

Instead of having a dictionary for the nextclade datasets and servers, we make `nucleotideSequences` a list of sequences. The old format:

```
nextclade_dataset_name:
  L: nextstrain/cchfv/linked/L
  M: nextstrain/cchfv/linked/M
  S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

becomes:

```
nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level>
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used>
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix>
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
```

Note the templates now also generate the genes list from the merged config.
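
The override and fallback rules described for the new config can be sketched roughly as follows. This is only an illustration, not the preprocessing pipeline's code: the interfaces and the functions `resolveServers` and `acceptedMatches` are hypothetical, and treating `accepted_sort_matches` as a list of strings is an assumption.

```typescript
// Hypothetical types: field names mirror the YAML keys in the description,
// but this is an illustrative sketch and not the project's implementation.
interface SequenceConfig {
    name: string;
    nextclade_dataset_name?: string;
    nextclade_dataset_tag?: string;
    nextclade_dataset_server?: string;
    accepted_sort_matches?: string[]; // assumption: a list of dataset names
    gene_prefix?: string;
}

interface OrganismConfig {
    nextclade_dataset_server?: string;
    nucleotideSequences: SequenceConfig[];
}

// A per-sequence server overrides the organism-level default.
function resolveServers(config: OrganismConfig): Map<string, string | undefined> {
    const resolved = new Map<string, string | undefined>();
    for (const seq of config.nucleotideSequences) {
        resolved.set(seq.name, seq.nextclade_dataset_server ?? config.nextclade_dataset_server);
    }
    return resolved;
}

// If accepted_sort_matches is not given, the dataset name is used instead.
function acceptedMatches(seq: SequenceConfig): string[] {
    if (seq.accepted_sort_matches !== undefined) {
        return seq.accepted_sort_matches;
    }
    return seq.nextclade_dataset_name !== undefined ? [seq.nextclade_dataset_name] : [];
}
```

Resolving the defaults once, when the config is loaded, keeps later steps free of repeated fallback logic.
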
### PR Checklist

- [ ] Update values.schema.json
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested
- [x] Have preprocessing send back a segment: fastaHeader mapping

## Future Work

- [ ] add integration testing for full EV submission user journey
- [ ] improve CCHF minimizer (some segments are again not assigned)
- [ ] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key)
- [ ] update PPX docs with new multi-segment submission format

🚀 Preview: https://sort-multi-path.loculus.org
1 parent 81d9054 commit c9afc61

38 files changed

Lines changed: 754 additions & 361 deletions

backend/src/main/kotlin/org/loculus/backend/api/SubmissionTypes.kt

Lines changed: 5 additions & 0 deletions
@@ -167,6 +167,11 @@ data class ProcessedData<SequenceType>(
         description = "The key is the gene name, the value is a list of amino acid insertions",
     )
     val aminoAcidInsertions: Map<GeneName, List<Insertion>>,
+    @Schema(
+        example = """{"segment1": "fastaHeader1", "segment2": "fastaHeader2"}""",
+        description = "The key is the segment name, the value is the fastaHeader of the original Data",
+    )
+    val sequenceNameToFastaHeaderMap: Map<SegmentName, String> = emptyMap(),
     @Schema(
         example = """{"raw_reads": [{"fileId": "s0m3-uUiDd", "name": "data.fastaq"}], "sequencing_logs": []}""",
         description = "The key is the file category name, the value is a list of files, with ID and name.",
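
For organisms without a nextclade dataset, the commit message says the segment is still read from fasta headers of the form `<submissionId>_<segmentName>`. A minimal sketch of how such headers could be turned into the segment-to-fastaHeader map is shown below. The function name is hypothetical, and splitting at the last underscore is an assumption (it presumes segment names contain no underscore); the actual preprocessing code is not part of the diffs shown here.

```typescript
// Illustrative only: builds a segment -> fastaHeader map from headers
// that follow the "<submissionId>_<segmentName>" convention.
// Assumption: the segment name is everything after the LAST underscore.
function buildSequenceNameToFastaHeaderMap(
    fastaHeaders: string[],
    knownSegments: string[],
): Record<string, string> {
    const result: Record<string, string> = {};
    for (const header of fastaHeaders) {
        const separatorIndex = header.lastIndexOf('_');
        if (separatorIndex === -1) {
            continue; // no segment suffix, cannot be assigned by header
        }
        const segment = header.slice(separatorIndex + 1);
        if (knownSegments.includes(segment)) {
            result[segment] = header;
        }
    }
    return result;
}
```

Headers that carry no recognizable segment suffix are simply left out of the map in this sketch; a real pipeline would more likely report them as errors.
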

backend/src/main/kotlin/org/loculus/backend/service/submission/CompressionService.kt

Lines changed: 2 additions & 0 deletions
@@ -118,6 +118,7 @@ class CompressionService(
                 }
             },
             processedData.aminoAcidInsertions,
+            processedData.sequenceNameToFastaHeaderMap,
             processedData.files,
         )

@@ -144,6 +145,7 @@ class CompressionService(
                 }
             },
             processedData.aminoAcidInsertions,
+            processedData.sequenceNameToFastaHeaderMap,
             processedData.files,
         )

backend/src/main/kotlin/org/loculus/backend/service/submission/EmptyProcessedDataProvider.kt

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ class EmptyProcessedDataProvider(private val backendConfig: BackendConfig) {
             alignedAminoAcidSequences = referenceGenome.genes.map { it.name }.associateWith { null },
             nucleotideInsertions = referenceGenome.nucleotideSequences.map { it.name }.associateWith { emptyList() },
             aminoAcidInsertions = referenceGenome.genes.map { it.name }.associateWith { emptyList() },
+            sequenceNameToFastaHeaderMap = referenceGenome.nucleotideSequences.map { it.name }.associateWith { "" },
             files = null,
         )
     }

backend/src/main/kotlin/org/loculus/backend/service/submission/ProcessedSequenceEntryValidator.kt

Lines changed: 5 additions & 0 deletions
@@ -232,6 +232,11 @@ class ProcessedSequenceEntryValidator(private val schema: Schema, private val re
             "alignedNucleotideSequences",
         )

+        validateNoUnknownSegment(
+            processedData.sequenceNameToFastaHeaderMap,
+            "sequenceNameToFastaHeaderMap",
+        )
+
         validateNoUnknownSegment(
             processedData.unalignedNucleotideSequences,
             "unalignedNucleotideSequences",
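
The body of `validateNoUnknownSegment` is not shown in this diff. Judging only by its name and call sites, it rejects maps whose keys are not configured segments. The TypeScript sketch below illustrates that idea under this assumption; the helper `findUnknownSegments` and the error message are hypothetical and do not reflect the Kotlin implementation.

```typescript
// Hypothetical helper: returns the keys of `data` that are not configured segments.
function findUnknownSegments(data: Record<string, unknown>, configuredSegments: string[]): string[] {
    const known = new Set(configuredSegments);
    return Object.keys(data)
        .filter((segment) => !known.has(segment))
        .sort();
}

// Assumed behavior: fail when any key of the map is not a configured segment.
function validateNoUnknownSegment(
    data: Record<string, unknown>,
    configuredSegments: string[],
    fieldName: string,
): void {
    const unknown = findUnknownSegments(data, configuredSegments);
    if (unknown.length > 0) {
        throw new Error(`Unknown segments in '${fieldName}': ${unknown.join(', ')}`);
    }
}
```

Applying the same check to `sequenceNameToFastaHeaderMap` as to the sequence maps keeps all segment-keyed fields consistent with the reference genome configuration.
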

backend/src/main/kotlin/org/loculus/backend/service/submission/SubmissionDatabaseService.kt

Lines changed: 1 addition & 0 deletions
@@ -461,6 +461,7 @@ class SubmissionDatabaseService(
         aminoAcidInsertions = processedData.aminoAcidInsertions.mapValues { (_, it) ->
             it.map { insertion -> insertion.copy(sequence = insertion.sequence.uppercase(Locale.US)) }
         },
+        sequenceNameToFastaHeaderMap = processedData.sequenceNameToFastaHeaderMap,
     )

     private fun validateExternalMetadata(

backend/src/test/kotlin/org/loculus/backend/controller/submission/PreparedProcessedData.kt

Lines changed: 3 additions & 0 deletions
@@ -62,6 +62,7 @@ val defaultProcessedData = ProcessedData(
             Insertion(123, "RN"),
         ),
     ),
+    sequenceNameToFastaHeaderMap = mapOf(MAIN_SEGMENT to "header"),
     files = null,
 )

@@ -101,6 +102,7 @@ val defaultProcessedDataMultiSegmented = ProcessedData(
             Insertion(123, "RN"),
         ),
     ),
+    sequenceNameToFastaHeaderMap = mapOf("notOnlySegment" to "header1", "secondSegment" to "header2"),
     files = null,
 )

@@ -117,6 +119,7 @@ val defaultProcessedDataWithoutSequences = ProcessedData<GeneticSequence>(
     nucleotideInsertions = emptyMap(),
     alignedAminoAcidSequences = emptyMap(),
     aminoAcidInsertions = emptyMap(),
+    sequenceNameToFastaHeaderMap = emptyMap(),
     files = null,
 )

backend/src/test/kotlin/org/loculus/backend/service/ProcessedMetadataPostprocessorTest.kt

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ class ProcessedMetadataPostprocessorTest(
             nucleotideInsertions = emptyMap(),
             alignedAminoAcidSequences = emptyMap(),
             aminoAcidInsertions = emptyMap(),
+            sequenceNameToFastaHeaderMap = emptyMap(),
             files = null,
         )

backend/src/test/kotlin/org/loculus/backend/service/ProcessedSequencesPostprocessorTest.kt

Lines changed: 4 additions & 0 deletions
@@ -91,6 +91,10 @@ class ProcessedSequencesPostprocessorTest(
                 unconfiguredPresentGene to listOf(Insertion(13, "TT")),
                 unconfiguredNullGene to emptyList(),
             ),
+            sequenceNameToFastaHeaderMap = mapOf(
+                "configuredPresentSeg" to "header1",
+                "unconfiguredPresentSeg" to "header2",
+            ),
             files = null,
         )

backend/src/test/kotlin/org/loculus/backend/utils/EarliestReleaseDateFinderTest.kt

Lines changed: 1 addition & 0 deletions
@@ -62,6 +62,7 @@ fun row(
         nucleotideInsertions = emptyMap(),
         alignedAminoAcidSequences = emptyMap(),
         aminoAcidInsertions = emptyMap(),
+        sequenceNameToFastaHeaderMap = emptyMap(),
         files = null,
     ),
     isRevocation = false,

integration-tests/tests/fixtures/sequence.fixture.ts

Lines changed: 3 additions & 6 deletions
@@ -23,16 +23,13 @@ export const test = groupTest.extend<SequenceFixtures>({
             collectionDate: '2021-10-15',
             authorAffiliations: 'Test Institute, France',
         });
-        const fastaHeaderL = `${submissionId}_L`;
-        const fastaHeaderM = `${submissionId}_M`;
-        const fastaHeaderS = `${submissionId}_S`;

         await submissionPage.fillSequenceData({
-            [fastaHeaderL]:
+            fastaHeaderL:
'CCACATTGACACAGANAGCTCCAGTAGTGGTTCTCTGTCCTTATTAAACCATGGACTTCTTAAGAAACCTTGACTGGACTCAGGTGATTGCTAGTCAGTATGTGACCAATCCCAGGTTTAATATCTCTGATTACTTCGAGATTGTTCGACAGCCTGGTGACGGGAACTGTTTCTACCACAGTATAGCTGAGTTAACCATGCCCAACAAAACAGATCACTCATACCATAACATCAAACATCTGACTGAGGTGGCAGCACGGAAGTATTATCAGGAGGAGCCGGAGGCTAAGCTCATTGGCCTGAGTCTGGAAGACTATCTTAAGAGGATGCTATCTGACAACGAATGGGGATCGACTCTTGAGGCATCTATGTTGGCTAAGGAAATGGGTATTACTATCATCATTTGGACTGTTGCAGCCAGTGACGAAGTGGAAGCAGGCATAAAGTTTGGTGATGGTGATGTGTTTACAGCCGTGAATCTTCTGCACTCCGGACAGACACACTTTGATGCCCTCAGAATACTGCCNCANTTTGAGGCTGACACAAGAGAGNCCTTNAGTCTGGTAGACAANNTNATAGCTGTGGACCANNTGACCTCNTCTTCAAGTGATGAANTGCAGGACTANGAAGANCTTGCTTTAGCACTTACNAGNGCGGAAGAACCATNTAGACGGTCTAGCNTGGATGAGGTNACCCTNTCTAAGAAACAAGCAGAGNTATTGAGGCAGAAGGCATCTCAGTTGTCNAAACTGGTTAATAAAAGTCAGAACATACCGACTAGAGTTGGCAGGGTTCTGGACTGTATGTTTAACTGCAAACTATGTGTTGAAATATCAGCTGACACTCTAATTCTGCGACCAGAATCTAAAGAAAGAATTGG',
-            [fastaHeaderM]:
+            fastaHeaderM:
'GTGGATTGAGCATCTTAATTGCAGCATACTTGTCAACATCATGCATATATCATTGATGTATGCAGTTTTCTGCTTGCAGCTGTGCGGTCTAGGGAAAACTAACGGACTACACAATGGGACTGAACACAATAAGACACACGTTATGACAACGCCTGATGACAGTCAGAGCCCTGAACCGCCAGTGAGCACAGCCCTGCCTGTCACACCGGACCCTTCCACTGTCACACCTACAACACCAGCCAGCGGATTAGAAGGCTCAGGAGAGGTTCACACATCCTCTCCAATCACCACCAAGGGTTTGTCTCTGCCGGGGGCTACATCTGAGCTCCCTGCGACTACTAGCATAGTCACTTCAGGTGCAAGTGATGCCGATTCTAGCACACAGGCAGCCAGAGACACCCCTAAACCATCAGTCCGCACGAGTCTGCCCAACAGCCCTAGCACACCATCCACACCACAAGGCACACACCATCCCGTGAGGAGTCTGCTTTCAGTCACGAGCCCTAAGCCAGAAGAAACACCAACACCGTCAAAATCAAGCAAAGATAGCTCAGCAACCAACAGTCCTCACCCAGCCGCCAGCAGACCAACAACCCCTCCCACAACAGCCCAGAGACCCGCTGAAAACAACAGCCACAACACCACCGAACAGCTTGAGTCCTTAACACAATTAGCAACTTCAGGTTCAATGATCTCTCCAACACAGACAGTCCTCCCAAAGAGTGTTACTTCTATAGCCATTCAAGACATTCATCCCAGCCCAACAAATAGGTCTAAAAGAAACCTTGATATGGAAATAATCT',
-            [fastaHeaderS]:
+            fastaHeaderS:
'GTGTTCTCTTGAGTGTTGGCAAAATGGAAAACAAAATCGAGGTGAACAACAAAGATGAGATGAACAAATGGTTTGAGGAGTTCAAGAAAGGAAATGGACTTGTGGACACTTTCACAAACTCNTATTCCTTTTGTGAAAGCGTNCCAAATCTGGACAGNTTTGTNTTCCAGATGGCNAGTGCCACTGATGATGCACAAAANGANTCCATCTACGCATCTGCNCTGGTGGANGCAACCAAATTTTGTGCACCTATATACGAGTGTGCTTGGGCTAGCTCCACTGGCATTGTTAAAAAGGGACTGGAGTGGTTCGAGAAAAATGCAGGAACCATTAAATCCTGGGATGAGAGTTATACTGAGCTTAAAGTTGAAGTTCCCAAAATAGAACAACTCTCCAACTACCAGCAGGCTGCTCTCAAATGGAGAAAAGACATAGGCTTCCGTGTCAATGCAAATACGGCAGCTTTGAGTAACAAAGTCCTAGCAGAGTACAAAGTTCCTGGCGAGATTGTAATGTCTGTCAAAGAGATGTTGTCAGATATGATTAGAAGNAGGAACCTGATTCTCAACAGAGGTGGTGATGAGAACCCACGCGGCCCAGTTAGCCGTGAACATGTGGAGTGGTGC',
         });
