Sequence files come in two formats, FASTQ and FASTA. Raw reads will typically come as FASTQ files, sometimes compressed. Some programs can read compressed files directly, but I find it's easier to decompress the files first to avoid any possible hang-ups. Compressed files typically have either a .gz or .zip file extension.
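If you would rather decompress from a script than from the command line, here is a minimal sketch using Python's standard gzip module. It only handles .gz files, and the function name and file names are placeholders of my own, not part of any pipeline.

```python
import gzip
import shutil


def decompress_gz(gz_path, out_path):
    """Write the decompressed contents of a .gz file to out_path."""
    # Stream in chunks so large read files are never loaded into memory at once.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Example call (hypothetical file names):
# decompress_gz("reads.fastq.gz", "reads.fastq")
```

The original compressed file is left in place, so you can delete it yourself once you have checked the output.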
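The FASTQ format always has 4 lines per read: the read name (starting with '@'), the sequence, a '+' separator line, and a quality string with one character per base.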
@M04771:122:000000000-B62C2:1:1101:13253:1557 1:N:0:0
TCCTTTTTTTTTTCTCTTTTTTCTTCTTTCCTTTTCTTCCCTCTTTCTTCTTCTTTTTTTTCCTTCTTTTTTTCCCTCCCTTTTCTTCCCTTTTTTTTTTCCTTTTTTCCTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTT
+
>>>11311>@00A012BEF1AABEE22DA21DDFF2BF210B0B112DA2DF2DB111>>E01@@11@1@EE?011B0//011BBF12101111<///</011?<11<0<?1<1<-@0<=000<<--------9@---9-9--99-///;9
@M04771:122:000000000-B62C2:1:1101:16150:1745 1:N:0:0
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTCTGGTCAGTCAGATGTGAAAGCCCCGGGCTCAACCTGGGAATTGCATTTGATACTGCCAGACTCGAGTGTGGTAGAGGGAAGTGGAATTCCGAGT
+
AAAA?1AD?AAA1FGGGGE?0BGFF00FFGAGFFGEEGGAGHFEA/AE>EEEBEGE/E00BGDGG>1B1BFGGFFFFECEE/<ECGC1CG<F00FA1?FDGBFGG>FFFHHG1<110?F1?.CGA>1AFFFBGH?.<AGBF0DDC<G/.:C
@M04771:122:000000000-B62C2:1:1101:14494:1756 1:N:0:0
TACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTATTCAGTCAGAGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCTTTGATACTGCTAGTCTTGAGTTCGAGAGAGGTGAGTGGAATTCCGAGT
+
A1AA11AD@1>>AAGGEGEEFFHFEC/FGFBGFFGEEGC?/GEE>/>EEEEGFEE/>E12BF2GGD1B11?GHGB>BGCC11/CCHE11G?<00?011F<FGGGH1??1FH11F11>GHHF11GGD..0.@.?CGA0.DDEEC;C:C/.:;
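Because every read takes exactly 4 lines, a FASTQ file can be read in groups of 4. The sketch below is one way to do that in Python. It assumes strict 4-line records with no blank lines, and the function name and the toy reads are mine, for illustration only. Counting lines is safer than looking for '@', because '@' can also show up in the quality line, as it does in the first read above.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, quality) for each 4-line record in an open FASTQ file."""
    while True:
        # Take the next 4 lines; an empty list means end of file.
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            break
        if len(record) < 4 or not record[0].startswith("@") or not record[2].startswith("+"):
            raise ValueError("malformed FASTQ record: %r" % (record,))
        name, seq, _, qual = record
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ for %s" % name)
        yield name[1:], seq, qual  # drop the leading '@'


# Toy data (made up) standing in for a real file opened with open("reads.fastq")
toy = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n@@II\n")
for name, seq, qual in read_fastq(toy):
    print(name, seq, qual)
```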
The FASTA format has two lines per read: the read name (starting with '>') and the sequence.
>M04771:122:000000000-B62C2:1:1101:13253:1557 1:N:0:0
TCCTTTTTTTTTTCTCTTTTTTCTTCTTTCCTTTTCTTCCCTCTTTCTTCTTCTTTTTTTTCCTTCTTTTTTTCCCTCCCTTTTCTTCCCTTTTTTTTTTCCTTTTTTCCTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTT
>M04771:122:000000000-B62C2:1:1101:16150:1745 1:N:0:0
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTCTGG
TCAGTCAGATGTGAAAGCCCCGGGCTCAACCTGGGAATTGCATTTGATACTGCCAGACTC
GAGTGTGGTAGAGGGAAGTGGAATTCCGAGT
>M04771:122:000000000-B62C2:1:1101:14494:1756 1:N:0:0
TACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTATTCAGTCAGAGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCTTTGATACTGCTAGTCTTGAGTTCGAGAGAGGTGAGTGGAATTCCGAGT
Though technically the FASTA format is 2 lines per sequence, you may notice that the sequence for the second read here is spread over more than 1 line. In practice, programs that read FASTA files look for the '>' at the start of the read name line to mark the beginning of a new sequence, so wrapped sequences are still read correctly. As a best practice, though, use the 2-line format.
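To make the '>' rule concrete, here is a rough Python sketch that reads a FASTA file whether or not the sequences are wrapped, and writes it back out in 2-line format. The function names and the toy records are my own; this is not taken from any particular program.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs, joining sequences that span several lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' starts a new record, so emit the one collected so far.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            if name is None:
                raise ValueError("sequence line found before any '>' header")
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def write_two_line_fasta(records, handle):
    """Write each record as exactly two lines: header, then the full sequence."""
    for name, seq in records:
        handle.write(">%s\n%s\n" % (name, seq))


# Toy data (made up): the second record is wrapped over three lines.
wrapped = io.StringIO(">read1\nACGTACGT\n>read2\nTACG\nGAGG\nGT\n")
out = io.StringIO()
write_two_line_fasta(read_fasta(wrapped), out)
print(out.getvalue())
```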
The reference database is typically in FASTA format, and is used for two purposes in this pipeline. First, the reads are compared against it for chimera screening. Second, it is used to classify the read clusters as belonging to particular taxa. The most commonly used databases are Silva, GreenGenes, and the RDP.
>S000494589 uncultured bacterium; YRM60L1D06060904 Lineage=Root;rootrank;Bacteria;domain;"Actinobacteria";phylum;Actinobacteria;class;Acidimicrobidae;subclass;Acidimicrobiales;order;"Acidimicrobineae";suborder;Acidimicrobiaceae;family;Acidimicrobium;genus
gcggcgtgctacacatgcagtcgtacgcggtggcacaccgagtggcgaacgggtgcgtaacacgtgaggaacctaccccg
aagtgggggataacaccgggaaaccggtgctaataccgcatacgctccccggaccgcatggtccagggagcaaagcctcc
gggcgcttcgggacggcctcgcggcctatcagcttgttggtggggtaacggcccaccaaggcgacgacgggtagctggtc
tgagaggacgatcagccacactgggactgagacacggcccagactcctacgggaggcagcagtggggaatattgcgcaat
gggcgaaagcctgacgcagcaacgccgcgtggaggacgaaggccttcgggttgtaaactcctttcagcagggacgaaact
gacggtacctgcagaagaagccccggctaactacgtgccagcagccgcggtaag
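The taxonomy in a header like the one above can be pulled apart with a few lines of Python. This sketch assumes the header follows the pattern in this one example, where everything after "Lineage=" alternates between a taxon name and its rank. Other databases format their headers differently, so treat it as an illustration only. The function name is mine.

```python
def parse_lineage(header):
    """Return (sequence_id, {rank: taxon}) from a header shaped like the example above."""
    description, _, lineage = header.lstrip(">").partition("Lineage=")
    # Fields alternate name;rank;name;rank... Some names are quoted in the file.
    fields = [f.strip().strip('"') for f in lineage.split(";") if f.strip()]
    ranks = dict(zip(fields[1::2], fields[0::2]))
    seq_id = description.split()[0]
    return seq_id, ranks


header = ('>S000494589 uncultured bacterium; YRM60L1D06060904 '
          'Lineage=Root;rootrank;Bacteria;domain;"Actinobacteria";phylum;'
          'Actinobacteria;class;Acidimicrobidae;subclass;Acidimicrobiales;order;'
          '"Acidimicrobineae";suborder;Acidimicrobiaceae;family;Acidimicrobium;genus')
seq_id, ranks = parse_lineage(header)
print(seq_id, ranks["family"], ranks["genus"])
```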
These files are used to match the barcodes of multiplexed reads to their source samples. Depending on the sequencing facility there may be 2 files, a mapping file and an I file. The mapping file is used to match each barcode to its sample.
#SampleID BarcodeSequence LinkerPrimer Description
ATD1 TCAGCCTCAGCC GTGTGYCAGCMGCCGCGGTAA E_Quiroz_plate1_A1
ATD2 TGCAGGGAACCG GTGTGYCAGCMGCCGCGGTAA E_Quiroz_plate1_A2
ATD3 CAGCGCGGCTAG GTGTGYCAGCMGCCGCGGTAA E_Quiroz_plate1_A3
ATD4 ATGGACCTAGCT GTGTGYCAGCMGCCGCGGTAA E_Quiroz_plate1_A4
ATD5 CTAGCCCGTTCG GTGTGYCAGCMGCCGCGGTAA E_Quiroz_plate1_A5
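A mapping file like this one is easy to load into a barcode-to-sample lookup. The Python sketch below assumes the columns are separated by whitespace (tabs or spaces), that the first two columns are the sample ID and the barcode, and that lines starting with '#' are headers. The function name is my own.

```python
import io


def read_mapping(handle):
    """Return a {barcode: sample_id} dict from an open mapping file."""
    barcode_to_sample = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header row
        sample_id, barcode = line.split()[:2]
        if barcode in barcode_to_sample:
            raise ValueError("barcode %s is listed for more than one sample" % barcode)
        barcode_to_sample[barcode] = sample_id
    return barcode_to_sample


# The first two rows from the example above, standing in for a real file.
mapping = io.StringIO(
    "#SampleID BarcodeSequence LinkerPrimer Description\n"
    "ATD1 TCAGCCTCAGCC GTGTGYCAGCMGCCGCGGTAA E_Quiroz_plate1_A1\n"
    "ATD2 TGCAGGGAACCG GTGTGYCAGCMGCCGCGGTAA E_Quiroz_plate1_A2\n"
)
barcodes = read_mapping(mapping)
print(barcodes["TCAGCCTCAGCC"])
```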
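The I file matches each read to its barcode. It is a FASTQ file with the same read names as the sequence file, but the sequence line holds the read's barcode.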
@M04771:122:000000000-B62C2:1:1101:13253:1557 1:N:0:0
TTTTTTTTTTCT
+
1111>33B3B13
@M04771:122:000000000-B62C2:1:1101:16150:1745 1:N:0:0
TGGGTTAACACA
+
>111>>11B3B1
@M04771:122:000000000-B62C2:1:1101:14494:1756 1:N:0:0
TTTCTTTTCTTT
+
111113313B33
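Putting the two files together, each read can be assigned to a sample by looking up its barcode from the I file in the mapping file. Here is a minimal Python sketch of that idea. It assumes exact barcode matches only and takes a ready-made barcode-to-sample dictionary; the function name and the toy records are hypothetical. Reads whose barcode is not in the mapping file are assigned None.

```python
import io
from itertools import islice


def assign_samples(index_handle, barcode_to_sample):
    """Return {read_name: sample_id or None} using the barcodes in an I file."""
    assignments = {}
    while True:
        # The I file is FASTQ, so each read takes 4 lines.
        record = [line.strip() for line in islice(index_handle, 4)]
        if len(record) < 4:
            break
        read_name = record[0][1:]  # drop the leading '@'
        barcode = record[1]
        assignments[read_name] = barcode_to_sample.get(barcode)
    return assignments


# Toy data (made up): one barcode is in the lookup table, one is not.
lookup = {"TCAGCCTCAGCC": "ATD1"}
index_reads = io.StringIO(
    "@read1\nTCAGCCTCAGCC\n+\n111111111111\n"
    "@read2\nTTTTTTTTTTCT\n+\n111111111111\n"
)
print(assign_samples(index_reads, lookup))
```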
Schuyler Smith
Ph.D. Student - Bioinformatics and Computational Biology
Iowa State University. Ames, IA.