If you have pair-end reads, reads that were sequenced from both forward and reverse directions, then you need to combine them before processing through the rest of the pipeline. Our tool of choice for this is pandaseq, which is included in RDP-tools.
mkdir -p $DIRECTORY/pandaseq/assembled/../stats
$PANDASEQ -T $CORES -o $OVERLAP -N -F -d rbkfms -l $MINLENGTH -L $MAXLENGTH -f $RAWDAT_DIRECTORY/*_R1_* -r $RAWDAT_DIRECTORY/*_R2_* 1 > $DIRECTORY/pandaseq/assembled/sequences.assembled.fastq 2> $DIRECTORY/pandaseq/stats/sequences.assembled.stats.txt.bz2
find $DIRECTORY/pandaseq/assembled -type f -size 0 -exec rm {} +
As an extra measure of quality control, we can use the RDP SeqFilters
program to check the pandaseq outputs for quality reads. This is additionally important if you did not have to assemble pair-end reads befor this.
mkdir -p $DIRECTORY/quality_check/seqs_25/../chimera_removal/../final_seqs
java -jar $RDP/RDPTools/SeqFilters.jar -Q $Q -s $DIRECTORY/pandaseq/assembled/sequences.assembled.fastq -o $DIRECTORY/quality_check/seqs_$Q -O sequences
python $SCRIPTS/fastq_to_fasta.py $DIRECTORY/quality_check/seqs_$Q/sequences/NoTag/NoTag_trimmed.fastq $DIRECTORY/quality_check/seqs_$Q/sequences/sequences.fa
Sequencing files are very large. Running through them repeatedly with each program can take a while. Fortunately, we can process the files in smaller sets in parrallel to expedite the process. It depends on the facility that does the sequencing, but in most cases the fastq files are already demultiplexed. When this is the case, it is easy, becasue our job is already done. But sometimes they are not, so we need to demultiplex them ourselves, or find an alternative to make the files easier to process.
The first step we need to do is to format the mapping file to fit the RDP parser.
mkdir -p $DIRECTORY/parallel_scripts/../i_file/../demultiplex/parse_index/../bins/../empty_samples/../demultiplex_finalized; split -l 1000000 $RAWDAT_DIRECTORY/*_I*.fastq* $DIRECTORY/i_file/
ls $DIRECTORY/i_file/* | rev | cut -d "/" -f 1 | sort -u | rev > $DIRECTORY/seqs_list.txt
mkdir -p $DIRECTORY/demultiplex/parse_index/../bins/../empty_samples/../demultiplex_finalized
perl $SCRIPTS/dos2unix.pl $MAPPING_FILE > $DIRECTORY/demultiplex/tag_file.txt
python $SCRIPTS/MiSeq_rdptool_map_parser.py $DIRECTORY/demultiplex/tag_file.txt > $DIRECTORY/demultiplex/tag_file.tag
The mapping file needs to be in the order of barcode then sample ID.
BarcodeSequence #SampleID
AGCCTTCGTCGC A1
TCCATACCGGAA A2
AGCCCTGCTACA A3
CCTAACGGTCCA A4
CGCGCCTTAAAC A5
TATGGTACCCAG A6
And this needs to be the tag_file.tag
for the pipeline. Sometimes the files may not be in the expected format and need a bit of work to get them in order.
Once the tag file is set, the RDP SeqFilter will create indices of which reads belong to which samples.
while read I;
do mkdir $DIRECTORY/demultiplex/parse_index/$I
echo "java -jar $RDP/RDPTools/SeqFilters.jar --seq-file $DIRECTORY/i_file/$I --tag-file $DIRECTORY/demultiplex/tag_file.tag --outdir $DIRECTORY/demultiplex/parse_index/$I"
done < $DIRECTORY/seqs_list.txt > $DIRECTORY/parallel_scripts/demultiplex.sh
cat $DIRECTORY/parallel_scripts/demultiplex.sh | parallel -j $CORES
awk '{print $2}' $DIRECTORY/demultiplex/tag_file.tag | tail -n +2 > seqs_list.txt
while read LANE;
do echo "cat $DIRECTORY/demultiplex/parse_index/*/result_dir/$LANE/$LANE\_trimmed.fastq > $DIRECTORY/demultiplex/bins/$LANE\_trimmed.fastq"
done < $DIRECTORY/seqs_list.txt > $DIRECTORY/parallel_scripts/cat_lanes.sh
cat $DIRECTORY/parallel_scripts/cat_lanes.sh | parallel -j $CORES
ls $DIRECTORY/demultiplex/bins/* | rev | cut -d "/" -f 1 | sort -u | rev | cut -d "_" -f 1 > $DIRECTORY/seqs_list.txt
When the bins are made, then the assembles reads can be placed into their corresponding sample.
python $SCRIPTS/bin_reads.py $DIRECTORY/quality_check/seqs_$Q/sequences/sequences.fa $DIRECTORY/demultiplex/bins $DIRECTORY/demultiplex/demultiplex_finalized
Additionally a file seqs_list.txt
will be created. This file contains the names of the split fastq files. The code uses a lot of relativistic commands, so it is possible that it does not properly work if your file has an unconventional name. If that is the case and you need assistance, feel free to conatact me. seqs_list.txt
should appear as:
A file with the concatenation of all the reads from each sample is made. This master file will be used to process chimeras and then to create OTUs/clusters. Once that is created, the reads from each bin will be mapped to match the master file.
find $DIRECTORY/demultiplex/demultiplex_finalized -type f -size 0 -exec mv -t $DIRECTORY/demultiplex/empty_samples/ {} +
ls $DIRECTORY/demultiplex/demultiplex_finalized/* | rev | cut -d "/" -f 1 | sort -u | rev | cut -d "_" -f 1 > $DIRECTORY/seqs_list.txt
cat $DIRECTORY/demultiplex/demultiplex_finalized/* > $DIRECTORY/quality_check/chimera_removal/all_combined_q$Q.fa
Schuyler Smith
Ph.D. Student - Bioinformatics and Computational Biology
Iowa State University. Ames, IA.