Read clustering has recently become a hot topic in these pipelines. Historically, reads have been clustered at 97% similarity. The reasons for this were partly biological and mostly computational: the difference in file size between clustering at 97% and at 99% is significant. Some advocate not clustering at all. I'm going to recommend 99% clustering.
Clustering, after all, is your OTU calling.
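To make the two thresholds concrete, here is a minimal sketch of how many mismatched bases each cutoff tolerates. The 250 bp read length is an assumption chosen only for illustration, and the arithmetic ignores gaps and the details of how CD-Hit aligns and scores sequences.

```python
def max_mismatches(read_length: int, identity_percent: int) -> int:
    """Largest number of mismatched bases that still meets the identity cutoff."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return read_length * (100 - identity_percent) // 100


# ASSUMPTION: 250 bp reads (hypothetical; substitute your own amplicon length).
for pct in (97, 99):
    print(pct, max_mismatches(250, pct))
```

Under that assumption, a 97% cutoff lets reads that differ at several positions collapse into one OTU, while 99% keeps them apart unless they differ at only a couple of bases.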
There are two main types of clustering you will hear about. As with chimera detection, they are de novo and reference-based clustering.
Reference-based clustering matches the reads against a database. This is certainly a very accurate way of clustering reads and, though it can take more computational time, it is the best method for well-annotated systems.
Soil, in particular, is not a very well-annotated system, meaning you are going to find many organisms that have not been previously sequenced and thus do not exist in a database. For this reason, we used de novo clustering, and we like to use CD-Hit.
For this pipeline we first cluster all reads from every sample together to create a master_otu file.
mkdir -p $DIRECTORY/cdhit_clustering/master_otus/../R/../combined_seqs/../renamed_seqs
$CDHIT/cd-hit-est -i $DIRECTORY/quality_check/chimera_removal/relabeled_denovo_ref.good -o $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER -c $CLUSTER -M 200000 -T $CORES
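CD-Hit writes the representative sequences to the path given with -o and a cluster listing to the same path with a .clstr suffix. As a rough sketch of how that listing can be read, the following picks out the representative of each cluster (the line ending in `*`). The sequence names in the example are made up, and this is not the pipeline's own script.

```python
import re
from typing import Dict, Iterable

# Captures the sequence name between ">" and the trailing "..." on a member line.
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def representatives(clstr_lines: Iterable[str]) -> Dict[str, str]:
    """Map each cluster label (e.g. "Cluster 0") to its representative sequence name."""
    reps: Dict[str, str] = {}
    cluster = None
    for line in clstr_lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            cluster = line[1:]  # drop the leading ">"
        elif line.endswith("*") and cluster is not None:
            match = MEMBER_RE.search(line)
            if match:
                reps[cluster] = match.group(1)
    return reps


# Hypothetical .clstr content; real names come from your own reads.
EXAMPLE = [
    ">Cluster 0",
    "0\t253nt, >read_12... *",
    "1\t253nt, >read_98... at +/99.60%",
    ">Cluster 1",
    "0\t251nt, >read_3... *",
]
print(representatives(EXAMPLE))
```

For a real run, pass an open file handle instead of the example list, for instance `with open(path) as fh: representatives(fh)`.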
Once we have that master list we then map the individual samples back to the master_otus.
python $SCRIPTS/renaming_seq_w_short_sample_name.py "S_" $DIRECTORY/cdhit_clustering/renamed_seqs/sample_filename_map.txt $DIRECTORY/cdhit_clustering/renamed_seqs/sequence_name_map.txt $DIRECTORY/demultiplex/demultiplex_finalized/*.fasta > $DIRECTORY/cdhit_clustering/renamed_seqs/all_renamed_sequences.fa
$CDHIT/cd-hit-est-2d -i $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER -i2 $DIRECTORY/cdhit_clustering/renamed_seqs/all_renamed_sequences.fa -o $DIRECTORY/cdhit_clustering/master_otus/otu_mapping_cdhit_$CLUSTER -c $CLUSTER -M 200000 -T $CORES
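The mapping step produces a .clstr file in which each cluster holds one master OTU sequence (marked with `*`) plus the sample reads that matched it. A minimal sketch of turning that into long-format counts, one entry per (OTU, sample) pair, might look like the following. It is not the pipeline's cdhit_otu_mapping.py. The read naming scheme `<sample>_<read number>` and the helper `sample_of` are assumptions made for illustration, so adjust them to whatever the renaming script actually emits.

```python
import re
from collections import Counter
from typing import Iterable

# Captures the sequence name between ">" and the trailing "..." on a member line.
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def sample_of(read_name: str) -> str:
    # ASSUMPTION: read names look like "<sample>_<read number>", e.g. "S_1_42".
    return read_name.rsplit("_", 1)[0]


def long_otu_counts(clstr_lines: Iterable[str]) -> Counter:
    """Count mapped reads per (cluster, sample), skipping the representative line."""
    counts: Counter = Counter()
    cluster = None
    for line in clstr_lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            cluster = line[1:]
        elif line and not line.endswith("*") and cluster is not None:
            match = MEMBER_RE.search(line)
            if match:
                counts[(cluster, sample_of(match.group(1)))] += 1
    return counts


# Hypothetical mapping output; names are invented for the example.
EXAMPLE = [
    ">Cluster 0",
    "0\t253nt, >otu_rep_a... *",
    "1\t253nt, >S_1_1... at +/99.60%",
    "2\t252nt, >S_1_2... at +/99.21%",
    "3\t253nt, >S_2_1... at +/100.00%",
    ">Cluster 1",
    "0\t251nt, >otu_rep_b... *",
    "1\t251nt, >S_2_2... at +/99.20%",
]
for (otu, sample), n in sorted(long_otu_counts(EXAMPLE).items()):
    print(otu, sample, n, sep="\t")
```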
The taxa table is independent of the OTU table, so classification does not need to happen prior to clustering. RDP has an efficient classifier that can use the given database to match sequences.
python $SCRIPTS/cdhit_otu_mapping.py $DIRECTORY/cdhit_clustering/master_otus/otu_mapping_cdhit_$CLUSTER.clstr > $DIRECTORY/cdhit_clustering/master_otus/cdhit_otu_table_long.txt
Rscript $SCRIPTS/convert_otu_table_long_to_wide_format.R $DIRECTORY/cdhit_clustering/master_otus/cdhit_otu_table_long.txt $DIRECTORY/cdhit_clustering/R/cdhit_otu_table_wide.txt
python $SCRIPTS/rep_seq_to_otu_mapping.py $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER.clstr > $DIRECTORY/cdhit_clustering/master_otus/rep_seq_to_cluster.map
java -Xmx24g -jar $RDP/classifier.jar classify -c 0.5 -f filterbyconf -o $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER\_taxa_filterbyconf.txt -h $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER\_taxa_filterbyconf_hierarchy.txt $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER
Rscript $SCRIPTS/renaming_taxa_rep_seq_to_otus.R $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER\_taxa_filterbyconf.txt $DIRECTORY/cdhit_clustering/master_otus/rep_seq_to_cluster.map $DIRECTORY/cdhit_clustering/R/cdhit_taxa_table_w_repseq.txt
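The long-to-wide conversion of the OTU table used above is a plain pivot: one row per OTU, one column per sample, and zeros where a sample has no reads in that OTU. The pipeline does this in R. The sketch below shows the same idea in Python with invented OTU and sample names, and it makes no claim about the column order or header that the R script writes.

```python
from typing import Dict, Iterable, List, Tuple


def long_to_wide(rows: Iterable[Tuple[str, str, int]]) -> List[List[str]]:
    """Pivot (otu, sample, count) rows into a table with one row per OTU."""
    table: Dict[str, Dict[str, int]] = {}
    samples = set()
    for otu, sample, count in rows:
        row = table.setdefault(otu, {})
        row[sample] = row.get(sample, 0) + count
        samples.add(sample)
    columns = sorted(samples)
    wide = [["OTU"] + columns]  # header row
    for otu in sorted(table):
        # Samples with no reads in this OTU are filled with 0.
        wide.append([otu] + [str(table[otu].get(s, 0)) for s in columns])
    return wide


# Hypothetical long-format rows: (OTU, sample, read count).
LONG_ROWS = [("OTU_1", "S_1", 2), ("OTU_1", "S_2", 1), ("OTU_2", "S_2", 1)]
for line in long_to_wide(LONG_ROWS):
    print("\t".join(line))
```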
Schuyler Smith
Ph.D. Student - Bioinformatics and Computational Biology
Iowa State University. Ames, IA.