4.1 Clustering

Read clustering has become a hot-topic in these pipelines recently. Historically, reads have been clustered to 97% similiarity. The reasons for this were partly biological and mostly computational. The file size difference between clustering to 97% and 99% is significant. Some advocate not clustering at all. I’m going to recommend 99% clustering.

Clustering, afterall, is your OTU calling

There are two main types of clustering you will here about. Same as with the chimera detection, they are de novo and reference based clustering.

Reference Based

This type of clustering uses a database t0 match the reads to. This is certainly a very accurate way of clustering reads, though potentially taking more computational time, it is the best method for well annotated systems.

De Novo Based

In particular, soil is not a very well annotated system. Meaning, you are going to find many organism that have not been previosuly sequenced, and thus do not exist in a database. For this reason, we used de novo clustering, and we like to use CD-Hit.

For this pipeline we first cluster all reads from every sample together to create a master_otu file.

mkdir -p $DIRECTORY/cdhit_clustering/master_otus/../R/../combined_seqs/../renamed_seqs
$CDHIT/cd-hit-est -i $DIRECTORY/quality_check/chimera_removal/relabeled_denovo_ref.good -o $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER -c $CLUSTER -M 200000 -T -j $CORES

4.2 Sample Mapping

Once we have that master list we then map the individual samples back to the master_otus.

python $SCRIPTS/renaming_seq_w_short_sample_name.py "S_" $DIRECTORY/cdhit_clustering/renamed_seqs/sample_filename_map.txt $DIRECTORY/cdhit_clustering/renamed_seqs/sequence_name_map.txt $DIRECTORY/demultiplex/demultiplex_finalized/*.fasta > $DIRECTORY/cdhit_clustering/renamed_seqs/all_renamed_sequences.fa
$CDHIT/cd-hit-est-2d -i $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER -i2 $DIRECTORY/cdhit_clustering/renamed_seqs/all_renamed_sequences.fa -o $DIRECTORY/cdhit_clustering/master_otus/otu_mapping_cdhit99 -c $CLUSTER -M 200000 -T -j $CORES

4.3 Taxa Classification

The taxa table is independent of the otu table, so the classification does not need to happen prior to clustering. RDP has an efficient classifier that can use the given database to match sequences.

python $SCRIPTS/cdhit_otu_mapping.py $DIRECTORY/cdhit_clustering/master_otus/otu_mapping_cdhit_$CLUSTER.clstr > $DIRECTORY/cdhit_clustering/master_otus/cdhit_otu_table_long.txt
Rscript $SCRIPTS/convert_otu_table_long_to_wide_format.R $DIRECTORY/cdhit_clustering/master_otus/cdhit_otu_table_long.txt $DIRECTORY/cdhit_clustering/R/cdhit_otu_table_wide.txt
python $SCRIPTS/rep_seq_to_otu_mapping.py $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER.clstr > $DIRECTORY/cdhit_clustering/master_otus/rep_seq_to_cluster.map 

java -Xmx24g -jar $RDP/classifier.jar classify -c 0.5 -f filterbyconf -o $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER\_taxa_filterbyconf.txt -h $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER\_taxa_filterbyconf_hierarchy.txt $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER
Rscript $SCRIPTS/renaming_taxa_rep_seq_to_otus.R $DIRECTORY/cdhit_clustering/master_otus/relabeled_denovo_ref_good_cdhit_$CLUSTER\_taxa_filterbyconf.txt $DIRECTORY/cdhit_clustering/master_otus/rep_seq_to_cluster.map $DIRECTORY/cdhit_clustering/R/cdhit_taxa_table_w_repseq.txt



Schuyler Smith
Ph.D. Student - Bioinformatics and Computational Biology
Iowa State University. Ames, IA.