Quality control
We will start by looking into quality scores of our sequences to do a quality check. To achieve this task, we use FASTQC
program with raw sequencing data. FASTQC provides an easy-to-understand html report, facilitating an efficient quality check of fastq files.
The command below run the fastqc
program for each fastq (or compressed fastq.gz) file and produces a report for each file in the output_directory
.
fastqc fastq_directory -o output_directory
In our case, we have multiple compressed fastq files with the file extension of .fastq.gz
. We will use the expression *.fastq.gz
to specify all those files for the processing. The following command run fastqc program for every file.
fastqc ./raw_data/*.fastq.gz -o ./fastqc_results
This execution results in the generation of a html
and a zip
file for each processed fastq file. The html file provides a report consisting of tables, figures about the quality check. The zip file contains all the figures, tables, and summaries. Figure fig-quality shows the distribution of quality scores for each position for a particular fastq file (SRR6915103_1
).
![](Meta_chapter1_files/figure-html/41b9f22f-08f9-49ef-9170-fe135a87d1aa-1-71752315-3e2a-49fc-9b0d-cc52a77b1275.png)
Such report is generated for each fastq
file and it can become cumbersome checking each of those files for quality check. To simplify this step and aggregate all such reports into a single one, we use MULTIQC
program (Ewels et al. 2016).
The command below run multiqc program which generates a html report and a directory of images, tables, and summaries.
multiqc ./fastqc_results -o multiqc_results
The below image shows a snapshot of multiqc report.
![](Meta_chapter1_files/figure-html/d7320c68-6a80-4a8f-9864-4011dd44ccaa-1-084c4b9b-b0eb-4cf8-a4b0-a8df898ef865.png)
Trimming sequences to enhance quality
This task involves removing sequences due to low quality or due to other measures (e.g., shorter than a particular length).
We will use here trim-galore
(Krueger 2015) to perform the trimming task. The program runs for each paired-eng sequence. Therefore, we will write a script to iterate over all the fastq files in the directory.
The following script provides that functionality.
#!/bin/bash
# Create output directory if it doesn't exist
mkdir -p glore_output
# Loop over all R1 files ending with _1.fastq.gz
for r1 in raw_data/*_1.fastq.gz; do
# Derive the corresponding R2 file by replacing _1 with _2
r2="raw_data/$(basename "$r1" _1.fastq.gz)_2.fastq.gz"
# Run Trim Galore for each pair and store output in glore_output
trim_galore --nextera --quality 20 --length 75 --paired "$r1" "$r2" --output_dir glore_output
done