To keep things simple, put your data into a dedicated directory and your raw fastq read files into a subdirectory of it. However, if you are prone to deleting entire directories and don’t keep backups of your files, keep the raw files somewhere separate.
DIRECTORY=/mnt/research/germs/Schuyler/Projects/X
RAWDAT_DIRECTORY=$DIRECTORY/original
MAPPING_FILE=$RAWDAT_DIRECTORY/map.txt
These variables last only as long as your terminal session stays open, so you will need to re-enter them if you stop partway through the pipeline and start again in a new session.
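One way to avoid retyping the variables is to keep them in a small file and reload it in each new session. The sketch below assumes a file named `pipeline_vars.sh` in the current directory; that name and location are hypothetical and not part of the pipeline.

```shell
# Hypothetical file name; any location you can write to works.
VARS_FILE=./pipeline_vars.sh

# Write the definitions once. The quoted 'EOF' keeps $DIRECTORY unexpanded
# in the file, so it is resolved when the file is loaded.
cat > "$VARS_FILE" <<'EOF'
DIRECTORY=/mnt/research/germs/Schuyler/Projects/X
RAWDAT_DIRECTORY=$DIRECTORY/original
MAPPING_FILE=$RAWDAT_DIRECTORY/map.txt
EOF

# In a new terminal session, reload the variables with:
. "$VARS_FILE"
echo "$MAPPING_FILE"
```

The remaining variables in this section could be appended to the same file as you define them.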
The pipeline uses several programs, many of which are included in the RDP package. To keep the commands short, we give their paths shortcut names.
SCRIPTS is where the files from the scripts folder in this repository are kept.
RDP is wherever you downloaded and unpacked the RDPTools.
PANDASEQ is a standalone package, but it was improved by the RDP group and so is included in their package. This shortcut should point to the program's executable.
CHIMERA_DB is whatever database you are comparing your sequences against for chimera screening.
VSEARCH is a free, open-source alternative to usearch. This shortcut should point to the program's executable.
CORES is how many CPU cores you want to run each process on. If you don’t know how many you have, leave it at 2.
SCRIPTS=/mnt/research/germs/Schuyler/code
RDP=/mnt/research/germs/softwares/RDPTools
PANDASEQ=/mnt/research/germs/softwares/pandaseq/pandaseq
VSEARCH=/mnt/research/germs/softwares/vsearch-2.5.0/bin/vsearch
CDHIT=/mnt/research/germs/softwares/cdhit/
CHIMERA_DB=/mnt/research/germs/databases/greengene/current_Bacteria_unaligned.fa
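Because a mistyped path only shows up later as a failed command, it can help to check the shortcuts right after setting them. This is a minimal sketch; `check_paths` is a hypothetical helper, not one of the scripts in this repository.

```shell
# Hypothetical helper: report any named variable that is unset or that
# does not point at an existing file or directory.
check_paths() {
    missing=0
    for name in "$@"; do
        # Look up the value of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            missing=1
        elif [ ! -e "$value" ]; then
            echo "$name points to a missing path: $value"
            missing=1
        fi
    done
    return "$missing"
}

check_paths SCRIPTS RDP PANDASEQ VSEARCH CDHIT CHIMERA_DB \
    || echo "Fix the paths listed above before continuing."
```

The check only confirms that each path exists; it does not verify that an executable actually runs.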
These parameters are used for combining paired-end reads and filtering out poor-quality reads.
OVERLAP is the minimum number of bp your read pairs need to overlap in order to assemble.
MINLENGTH is the lower length cutoff: all assembled reads shorter than this will be discarded.
MAXLENGTH is the upper length cutoff: all assembled reads longer than this will be discarded.
Q is the minimum quality score required to keep a read, where \(Q = -10\log_{10}(e)\) and \(e\) is the estimated probability of error.
OVERLAP=10
MINLENGTH=210
MAXLENGTH=280
Q=25
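To get a feel for what a given cutoff means, the quality formula can be inverted to \(e = 10^{-Q/10}\). The sketch below does the arithmetic with `awk`; the variable name `ERROR_PROB` is only for illustration and is not used by the pipeline.

```shell
Q=25

# Invert Q = -10*log10(e) to get e = 10^(-Q/10).
# exp() and log() are used so the expression works in any POSIX awk.
ERROR_PROB=$(awk -v q="$Q" 'BEGIN { printf "%.5f", exp(-q / 10 * log(10)) }')

echo "Q=$Q corresponds to an error probability of $ERROR_PROB"
```

With Q=25 this gives roughly 0.00316, or about one expected error in every 316 bases.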
Several of the programs support multiprocessing. For those that do not, we can use parallel to run separate processes on multiple cores. A typical PC has anywhere from 2 to 8 cores. On the HPCC you can request a larger number; I like 20. We use the variable CORES to let the script know how many are available.
CORES=20
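If you are not sure how many cores your machine has, you can ask the system instead of guessing. The sketch below is one possible approach, not part of the pipeline: it uses `nproc` where available, falls back to `getconf`, and caps the requested value at what is actually present. The variable `AVAILABLE` is hypothetical.

```shell
# Ask the system how many cores are available; fall back to 2 if neither
# tool can answer.
if command -v nproc >/dev/null 2>&1; then
    AVAILABLE=$(nproc)
else
    AVAILABLE=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 2)
fi

CORES=20

# Cap the request at what the machine actually has.
if [ "$CORES" -gt "$AVAILABLE" ]; then
    CORES=$AVAILABLE
fi

echo "Using $CORES cores"
```

On a cluster, the number reported may reflect the node rather than your job's allocation, so the scheduler's request remains the authoritative limit.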
Schuyler Smith
Ph.D. Student - Bioinformatics and Computational Biology
Iowa State University. Ames, IA.