Describe the bug
When handling a report, megaqc loops over each data value and checks whether a matching SampleDataType already exists. However, the check uses only data_id and ignores data_section. As a result, if multiple report types (data sections) reuse the same data_id, megaqc reuses the existing SampleDataType even when its data_section is wrong for the incoming report.
This becomes problematic if you want to query for historic results based on data_section.
This is due to this code, which
1. Checks sample_data_type to see if the field's name has been seen before
2. If it has NOT been seen before, creates a new entry with data_key = "{}__{}".format(section, d_key).
But in step (1) it will reuse any key matching d_key, even if section does not match.
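The two steps above can be illustrated with a minimal in-memory sketch. This is not megaqc's actual code: `find_or_create`, `find_or_create_buggy`, `find_or_create_fixed`, and the list-of-dicts "table" are hypothetical stand-ins, and only the `data_key` format string is taken from this report. The fixed variant shows one possible remedy, which is to match on `data_section` as well as `data_id`.

```python
def find_or_create(table, section, d_key, match):
    """Return the first row accepted by `match`, else create a new row."""
    for row in table:
        if match(row):
            return row
    row = {
        "data_id": d_key,
        "data_section": section,
        # Same key format as quoted in this report.
        "data_key": "{}__{}".format(section, d_key),
    }
    table.append(row)
    return row


def find_or_create_buggy(table, section, d_key):
    # Behaviour described in this report: only data_id is compared.
    return find_or_create(
        table, section, d_key,
        lambda row: row["data_id"] == d_key,
    )


def find_or_create_fixed(table, section, d_key):
    # Assumed fix: compare data_section as well as data_id.
    return find_or_create(
        table, section, d_key,
        lambda row: row["data_id"] == d_key and row["data_section"] == section,
    )


buggy, fixed = [], []
incoming = [
    ("Pipeline_A_Result-plot", "patient_id"),
    ("Pipeline_B_Result-plot", "patient_id"),
]
for section, d_key in incoming:
    find_or_create_buggy(buggy, section, d_key)
    find_or_create_fixed(fixed, section, d_key)

print(len(buggy), len(fixed))  # 1 2
```

With the data_id-only match, the second report silently reuses the Pipeline_A_Result-plot row; with the combined match, each section gets its own row.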
To Reproduce
Here is a barebones multiqc_config and set of report files that can reveal the issue.
multiqc_config.yaml
custom_data:
  Pipeline_A_Result:
    file_format: "csv"
  Pipeline_B_Result:
    file_format: "csv"
sp:
  Pipeline_A_Result:
    fn: "*A_report.csv"
  Pipeline_B_Result:
    fn: "*B_report.csv"
A_report.csv (generated by Pipeline A)
sample_id,patient_id,variant_count
sample_1,patient_1,10
B_report.csv (generated by Pipeline B)
sample_id,patient_id,pvalue
sample_2,patient_2,0.0001
Steps:
- Run pipeline A and submit its data to megaqc
- Run pipeline B and submit its data to megaqc
megaqc erroneously attributes every patient_id value to Pipeline_A_Result, even though one of them comes from Pipeline_B_Result.
Specifically, the sample_data and sample_data_type tables will look like
sample_data_type
| sample_data_type_id | data_id | data_section | data_key | schema |
|---|---|---|---|---|
| 0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
| 1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
| 2 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |
sample_data
| sample_data_id | report_id | sample_data_type_id | sample_id | value |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | patient_1 |
| 1 | 0 | 1 | 0 | 10 |
| 2 | 1 | 0 (*) | 1 | patient_2 |
| 3 | 1 | 2 | 1 | 0.0001 |
* NOTE: sample_data_type_id=0 refers to data_section=Pipeline_A_Result-plot, even though this value actually came from Pipeline_B.
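To make the effect on queries by data_section concrete, here is a small sketch that loads the rows shown above into an in-memory SQLite database and filters by section. The column types, the omission of the schema column, and the `values_for_section` helper are assumptions made for illustration; megaqc's real schema and query layer may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sample_data_type (
        sample_data_type_id INTEGER PRIMARY KEY,
        data_id TEXT,
        data_section TEXT,
        data_key TEXT
    );
    CREATE TABLE sample_data (
        sample_data_id INTEGER PRIMARY KEY,
        report_id INTEGER,
        sample_data_type_id INTEGER,
        sample_id INTEGER,
        value TEXT
    );
    """
)

# Rows copied from the tables in this report (schema column omitted).
conn.executemany(
    "INSERT INTO sample_data_type VALUES (?, ?, ?, ?)",
    [
        (0, "patient_id", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__patient_id"),
        (1, "variant_count", "Pipeline_A_Result-plot", "Pipeline_A_Result-plot__variant_count"),
        (2, "pvalue", "Pipeline_B_Result-plot", "Pipeline_B_Result-plot__pvalue"),
    ],
)
conn.executemany(
    "INSERT INTO sample_data VALUES (?, ?, ?, ?, ?)",
    [
        (0, 0, 0, 0, "patient_1"),
        (1, 0, 1, 0, "10"),
        (2, 1, 0, 1, "patient_2"),  # came from Pipeline B, but typed as section A
        (3, 1, 2, 1, "0.0001"),
    ],
)


def values_for_section(section):
    """Return (data_id, value) pairs whose data type belongs to `section`."""
    return conn.execute(
        """
        SELECT t.data_id, d.value
        FROM sample_data AS d
        JOIN sample_data_type AS t
          ON t.sample_data_type_id = d.sample_data_type_id
        WHERE t.data_section = ?
        ORDER BY d.sample_data_id
        """,
        (section,),
    ).fetchall()


section_a = values_for_section("Pipeline_A_Result-plot")
section_b = values_for_section("Pipeline_B_Result-plot")
print(section_a)
print(section_b)
```

In this sketch the Pipeline_A_Result-plot query returns patient_2 even though that value was submitted by Pipeline B, and the Pipeline_B_Result-plot query returns no patient_id at all.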
Expected behavior
data_id='patient_id' will appear in two separate sample_data_type rows, once with data_section='Pipeline_A_Result-plot' and once with data_section='Pipeline_B_Result-plot'
| sample_data_type_id | data_id | data_section | data_key | schema |
|---|---|---|---|---|
| 0 | patient_id | Pipeline_A_Result-plot | Pipeline_A_Result-plot__patient_id | null |
| 1 | variant_count | Pipeline_A_Result-plot | Pipeline_A_Result-plot__variant_count | null |
| 2 | patient_id | Pipeline_B_Result-plot | Pipeline_B_Result-plot__patient_id | null |
| 3 | pvalue | Pipeline_B_Result-plot | Pipeline_B_Result-plot__pvalue | null |
System