incompatibility with tab character in line1 of FASTQ #241

@iorga

Hello,

KMC (I am using version 3.2.4) does not count k-mers correctly on FASTQ files generated by dorado demux --emit-fastq. In the example below, it reports 1 read but 0 k-mers:

~/temp_test > cat dorado.fastq 
@924f595d-fb8d-473e-bb99-b554dfdd5ce9	st:Z:2024-09-11T03:04:06.675+00:00	RG:Z:41b0457e40e383b3df64af4d8e649576ca9a4668_dna_r10.4.1_e8.2_400bps_fast@v5.0.0
GAAGCGACAGCGTATGCGCGTGTTTAAGTTCGACTGGTTCTCTGCCACCGGTACCGCCATTCTTTTTGCTGCCCTGCTCTCGATTGTCTGGCTGAAGATGAAACCATCTGACGCTATCAGCGCCTTCGGCAGCACGCTGAAGGACTGGCTCTGCCTATCTACTCCATCGGTATGGTGCTGGCGTTCGCCTTTATCTCGAACTATTCCGGACTATCATCAACGCTGGCGCTGGCGCTCGCACACACCGGCCATGCATTCCGCCTTTTCTCTCTCGCCGTTCCTCGGCTGGCTTGGTGTCTTCCTGACCGGATCGGATACCTCATCTAACGCCCTGTTCGCCGCCCTGCAAGCCGCTGCAGCACAACAAATTGGCGTTTCTGACCTGTTGTTGGTTGCCGCCAACACCGCCGGTGGTGTCGCCGGTTAAGATGATCTCTTCCGCAATCTATCGCTATCACCTATGCGGGGATAGGCGTGGTAGGCAAAGAGTCAGATCTCTCGCTTTACCATCAAACGCAGCTAAATCTCACCTGTATGGTCGGCGTGATCGCCACGCTCAGGCTTATGTCTTAACGTGGATAATTTGCTAATGATTGTTTTACCCAGACGCCTGTCAGACAAGGTCCGATCGTGTGCGGGCGCTGATGGTGATG
+
89D>AABD:9:2106.-&)'))**4495@50'&'7.99=54166AH@A>=D>86789A=>=ADSDMIHB@>??BG?=<1<...1>8;9>656M@9869711A?<:422<<GFBC@>E<77<(((EI4116213321882/>=)0'62669)'(01/%%%&%&)+68B?@ABGCAB==>KHDCGB71246:DB64331427/2(%$%%%+,/.>>?>>:99@888B58))*=<@++*)()(((348:8;7/3//:'&&*,*46<64@33)7.@9;<HAAA66B0..@S546>(((*<603878F8<<C+)**0/,'7-,.5//0/3)*2579<;<...6>?96?A811/<546,**42+,--.,03./43-+,-1-/,(+-6>AS?ACD::6771)((-/75.1777.-')-6312596**,1571-*,,,,-420.1*'<<:**-655<,,-299&%&&:''&..6071*+/0.*)*;<76:7+*79.-.%$#($.+%'$&'(;7*&%$()1(76,*+))(%%*0.-*56:HIE??=77''-/-,.345+*+/-(/<';:;?@946@732;<ABEBA::>-.3++/..-26,,11))+>C8==G:66:534H>B5/,,02020---))),()**(.%),;'''.++(*&$&&)

~/temp_test > kmc -sm -m8 -t20 -k21 -ci1 dorado.fastq  /home_local/tmp/MLeT7OaRUA/kmc /home_local/tmp/MLeT7OaRUA
**
Stage 1: 100%
Stage 2: 100%


1st stage: 0.622076s
2nd stage: 0.110107s
3rd stage: 0.0044s
Total    : 0.736583s
Tmp size : 0MB
Tmp size strict memory : 0MB
Tmp total: 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :            0
   No. of unique counted k-mers       :            0
   Total no. of k-mers                :            0
   Total no. of reads                 :            1
   Total no. of super-k-mers          :            0

I realized that this is caused by the tab characters in line 1 (the header line) of the FASTQ record. After converting the tabs to spaces, KMC behaves as expected:

~/temp_test > cat dorado.fastq | tr '\t' ' ' > dorado_notab.fastq

~/temp_test > kmc -sm -m8 -t20 -k21 -ci1 dorado_notab.fastq  /home_local/tmp/MLeT7OaRUA/kmc /home_local/tmp/MLeT7OaRUA
**
Stage 1: 100%
Stage 2: 100%


1st stage: 0.645912s
2nd stage: 0.102941s
3rd stage: 0.002236s
Total    : 0.751089s
Tmp size : 0MB
Tmp size strict memory : 0MB
Tmp total: 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :          633
   No. of unique counted k-mers       :          633
   Total no. of k-mers                :          633
   Total no. of reads                 :            1
   Total no. of super-k-mers          :           93
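
The tr command above replaces every tab in the file. As a narrower workaround, the following sketch rewrites only the header lines and leaves the other lines unchanged. It assumes strict 4-line FASTQ records (no wrapped sequence or quality lines), and the function name fastq_header_tabs_to_spaces is a hypothetical helper, not part of KMC or dorado:

```shell
# Replace tabs with spaces on FASTQ header lines only.
# Assumption: every record is exactly 4 lines, so headers are lines 1, 5, 9, ...
fastq_header_tabs_to_spaces() {
    awk 'NR % 4 == 1 { gsub(/\t/, " ") } { print }' "$@"
}

# Usage, with the file names from the example above:
#   fastq_header_tabs_to_spaces dorado.fastq > dorado_notab.fastq
```

Note that the tags dorado writes after the read ID (st:Z:, RG:Z:) are then separated by spaces rather than tabs, which may matter to downstream tools that parse them.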

Could you please fix this issue in a future release? Many thanks!

Bogdan
