Your Shell Environment

1.1 Paths to Your Data

For simplicity, keep your data in a dedicated directory, with your raw FASTQ read files in a subdirectory of it. However, if you are prone to deleting entire directories and don’t keep backups of your files, keep the raw files somewhere separate.

DIRECTORY=/mnt/research/germs/Schuyler/Projects/X
RAWDAT_DIRECTORY=$DIRECTORY/original
MAPPING_FILE=$RAWDAT_DIRECTORY/map.txt

These variables last only as long as your terminal window stays open, so you will need to re-enter them if you stop partway through the pipeline and start again in a new session.
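One way to avoid retyping the variables is to save them in a small file and reload it in each new session. This is a minimal sketch, not part of the pipeline itself; the file name `pipeline_env.sh` and the `ENV_FILE` variable are hypothetical and can be anything you like.

```shell
# Hypothetical file name; pick any location you will remember.
ENV_FILE="${HOME:-.}/pipeline_env.sh"

# Write the assignments to the file. The quoted 'EOF' keeps $DIRECTORY
# unexpanded here, so it is resolved later, when the file is sourced.
cat > "$ENV_FILE" <<'EOF'
DIRECTORY=/mnt/research/germs/Schuyler/Projects/X
RAWDAT_DIRECTORY=$DIRECTORY/original
MAPPING_FILE=$RAWDAT_DIRECTORY/map.txt
EOF

# In any new terminal session, reload the variables with:
. "$ENV_FILE"
echo "$MAPPING_FILE"
```

Sourcing the file with `.` runs the assignments in your current shell, which is what makes the variables visible to the commands you type afterwards.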

1.2 Paths to Programs

The pipeline uses several programs, many of which are included in the RDP package. To keep the commands simple, we assign the path of each program to a short variable name.

SCRIPTS is where the files from the scripts folder in this repository are kept.
RDP is wherever you downloaded and unpacked RDPTools.
PANDASEQ is a standalone package, but it was improved by the RDP group and so is included in their package. This shortcut should point to the program's executable.
VSEARCH: vsearch is a free, open-source alternative to usearch. This shortcut should point to the program's executable.
CDHIT is the directory where CD-HIT is installed.
CHIMERA_DB is the database you compare your sequences against for chimera screening.
CORES is how many CPU cores you want to run each process on. If you don’t know how many you have, leave it at 2. It is set in section 1.4.

SCRIPTS=/mnt/research/germs/Schuyler/code
RDP=/mnt/research/germs/softwares/RDPTools
PANDASEQ=/mnt/research/germs/softwares/pandaseq/pandaseq
VSEARCH=/mnt/research/germs/softwares/vsearch-2.5.0/bin/vsearch
CDHIT=/mnt/research/germs/softwares/cdhit/
CHIMERA_DB=/mnt/research/germs/databases/greengene/current_Bacteria_unaligned.fa
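Since a mistyped path only shows up later as a failed command, it can help to confirm that each shortcut points at something that exists. The sketch below assumes the variables above have already been entered; `check_path` is a hypothetical helper written for this example, not part of the pipeline.

```shell
# Hypothetical helper: report whether a named path exists.
# $1 = variable name (used only in the message), $2 = its value
check_path() {
    if [ -e "$2" ]; then
        echo "ok       $1=$2"
    else
        echo "MISSING  $1=$2"
        return 1
    fi
}

missing=0
for name in SCRIPTS RDP PANDASEQ VSEARCH CDHIT CHIMERA_DB; do
    # Look up the value of the variable whose name is in $name;
    # an unset variable is treated as an empty (missing) path.
    eval "value=\${$name:-}"
    check_path "$name" "$value" || missing=$((missing + 1))
done
echo "$missing path(s) missing"
```

The check only tests that each path exists; it does not verify that PANDASEQ or VSEARCH are executable or that the database is in the expected format.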

1.3 Sequence Read Parameters

These parameters are used for combining paired-end reads and filtering out poor-quality reads.

OVERLAP is the minimum number of bp by which your read pairs must overlap to be assembled.
MINLENGTH is the minimum length; all assembled reads shorter than this will be discarded.
MAXLENGTH is the maximum length; all assembled reads longer than this will be discarded.
Q is the minimum quality score for a read to be kept, where \(Q = -10\log_{10}(e)\) and \(e\) is the estimated probability of error.

OVERLAP=10 
MINLENGTH=210
MAXLENGTH=280
Q=25
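To get a feel for what a given Q threshold means, the formula can be rearranged to \(e = 10^{-Q/10}\). The sketch below computes that with awk; `phred_to_error` is a hypothetical helper name used only for this illustration.

```shell
# Hypothetical helper: convert a quality score Q to the error
# probability e = 10^(-Q/10), printed to four decimal places.
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_to_error 25   # prints 0.0032
```

So Q=25 corresponds to an error probability of roughly 0.3%, between Q=20 (1%) and Q=30 (0.1%).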

1.4 Cores

Several of the programs support multiprocessing. For those that do not, we can use parallel to run separate processes on multiple cores. A PC typically has 2-8 cores. On the HPCC you can request many more; I like 20. The variable CORES tells the scripts how many are available.

CORES=20
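If you are unsure how many cores your machine has, you can ask the system and cap your request accordingly. This is a sketch under the assumption that either `nproc` (GNU coreutils) or `getconf` is available; `AVAILABLE` and `REQUESTED` are hypothetical names used only here.

```shell
# Ask the system how many cores are usable; fall back to 2 if
# neither tool is present.
if command -v nproc >/dev/null 2>&1; then
    AVAILABLE=$(nproc)
else
    AVAILABLE=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)
fi

# Cap the requested number (20, as in the example above) at what
# the system reports.
REQUESTED=20
if [ "$REQUESTED" -gt "$AVAILABLE" ]; then
    CORES=$AVAILABLE
else
    CORES=$REQUESTED
fi
echo "CORES=$CORES"
```

On a shared cluster the number reported may reflect the whole node rather than your allocation, so the value you requested from the scheduler is still the safer upper bound.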

Schuyler Smith
Ph.D. Student - Bioinformatics and Computational Biology
Iowa State University. Ames, IA.