Quality control

The first step in the process is to learn about the quality of sequence reads. As Fastq files contains a quality score per base which indicates the level of accuracy for a particular base read. Our goal in the first step is to check is there something abnormal with the quality scores, e.g., do quality degrades after a particular read position.

Thankfully, we have tools which can automate this lookup process for us. The first tool is Fastqc which scans each sequence read file and then generate a report per file, providing essential information on many quality aspects of sequence data.

Now imagine going through each generated file to come up with some sort of understanding of the overall quality of datasets. It could be an exhausting task. Thanks again to the bioinformatics field, we have another tool Multiqc which summarizes multiple report files into a single report summary which makes it easier to determine the quality of sequence reads.

Generating quality report using fastqc

The fastqc command has the following syntax

fastqc  -o ./fastqc_results -t 10 ./raw_data/*.fastq.gz

Here, -o flag specifies output directory, and -t flag specifies number of threads to process the files.

Summarizing all quality reports using multiqc

The following command runs multiqc and generate a summary report of fastqc reports.

multiqc ./fastqc_results -o ./multiqc_results

Applying trimmomatic

Now, we will use insights gained from multiqc report to determine some parameters such as minimum length of sequences. These parameters then we will pass in the trimmomatic command.

trimmomatic -PE ADenoma12-2799_S12_L001_R1_001.fastq.gz ADenoma12-2799_S12_L001_R2_001.fastq.gz -baseout Ademona1-2065_trimmed_S1.fastq.gz CROP:150
Tip

Similar command as above must be run for each paired (or single) end sequences file. To automate this, we can use a shell script.

#!/usr/bin/env bash
# author: pankaj chejara
# Script to iterate over paired read sequences to apply trimmomatic

samples=()

for filename in ./raw_data/*.fastq.gz; do
    base=$(basename "$filename" .fastq.gz)
    nf=$(echo $base | sed -e 's/.......$//');
    if ! [[ ${samples[@]} =~ $nf ]]
    then
      samples+=("$nf");
    fi
  done

for sample in "${samples[@]}"; do
  forward="./raw_data/${sample}_R1_001.fastq.gz"
  backward="./raw_data/${sample}_R2_001.fastq.gz"
  baseout="./trimm_outputs/${sample}.fastq.gz"
  trimmomatic PE -threads 20 $forward $backward -baseout $baseout CROP:200"
done;

References