Quality control
The first step in the process is to learn about the quality of sequence reads. As Fastq files contains a quality score per base which indicates the level of accuracy for a particular base read. Our goal in the first step is to check is there something abnormal with the quality scores, e.g., do quality degrades after a particular read position.
Thankfully, we have tools which can automate this lookup process for us. The first tool is Fastqc
which scans each sequence read file and then generate a report per file, providing essential information on many quality aspects of sequence data.
Now imagine going through each generated file to come up with some sort of understanding of the overall quality of datasets. It could be an exhausting task. Thanks again to the bioinformatics field, we have another tool Multiqc
which summarizes multiple report files into a single report summary which makes it easier to determine the quality of sequence reads.
Generating quality report using fastqc
The fastqc command has the following syntax
fastqc -o ./fastqc_results -t 10 ./raw_data/*.fastq.gz
Here, -o flag specifies output directory, and -t flag specifies number of threads to process the files.
Summarizing all quality reports using multiqc
The following command runs multiqc
and generate a summary report of fastqc reports.
multiqc ./fastqc_results -o ./multiqc_results
Applying trimmomatic
Now, we will use insights gained from multiqc report to determine some parameters such as minimum length of sequences. These parameters then we will pass in the trimmomatic
command.
trimmomatic -PE ADenoma12-2799_S12_L001_R1_001.fastq.gz ADenoma12-2799_S12_L001_R2_001.fastq.gz -baseout Ademona1-2065_trimmed_S1.fastq.gz CROP:150
Similar command as above must be run for each paired (or single) end sequences file. To automate this, we can use a shell script.
#!/usr/bin/env bash
# author: pankaj chejara
# Script to iterate over paired read sequences to apply trimmomatic
samples=()
for filename in ./raw_data/*.fastq.gz; do
base=$(basename "$filename" .fastq.gz)
nf=$(echo $base | sed -e 's/.......$//');
if ! [[ ${samples[@]} =~ $nf ]]
then
samples+=("$nf");
fi
done
for sample in "${samples[@]}"; do
forward="./raw_data/${sample}_R1_001.fastq.gz"
backward="./raw_data/${sample}_R2_001.fastq.gz"
baseout="./trimm_outputs/${sample}.fastq.gz"
trimmomatic PE -threads 20 $forward $backward -baseout $baseout CROP:200"
done;