Quality check

The first step of the workflow is a quality check of the raw sequence data. This check shows the quality of the reads, which helps in choosing parameters such as the length to trim.

Here, we use the FastQC and MultiQC tools. FastQC generates a report in HTML format for each sequence file. MultiQC summarises all those reports into a single file, making it easier to review the quality metrics for the entire dataset.

File names

To execute FastQC for each sequence file, we need access to all file names. They could be listed manually, but Snakemake provides a utility called glob_wildcards that automatically scans a directory and collects all file names that follow a given naming convention.

For example, our dataset has filenames in a specific format (e.g., Cancer1-2355_S61_L001_R1_001.fastq.gz). Here, Cancer1-2355_S61 is the sample identifier and R1 is the direction of read.

We can represent all those files with a single pattern, <sampleid>_L001_<read>_001.fastq.gz. Here, sampleid is the identifier of the sample and read is the identifier of the read direction, i.e., R1 or R2.

To extract every sampleid and read value, a single glob_wildcards statement is enough. The following statement in our workflow scans the directory and collects the sample and num values (num being the read direction) into two lists, SAMPLES and NUMS.

# Global wildcard lists: sample identifiers and read directions (one entry per matching file)
(SAMPLES,NUMS) = glob_wildcards("/{sample}_L001_{num}_001.fastq.gz")
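
To make the pattern matching concrete, the following is a minimal sketch in plain Python that mimics what glob_wildcards does for this naming convention. It is not Snakemake's implementation: the regular expression and the hard-coded file list are assumptions made for illustration, and the second and third file names are hypothetical.

```python
import re

# Hypothetical directory listing; only the first name comes from the text above.
filenames = [
    "Cancer1-2355_S61_L001_R1_001.fastq.gz",
    "Cancer1-2355_S61_L001_R2_001.fastq.gz",
    "README.txt",  # does not follow the convention, so it is skipped
]

# Each wildcard ({sample}, {num}) is emulated by a named group.
pattern = re.compile(r"^(?P<sample>.+)_L001_(?P<num>.+)_001\.fastq\.gz$")

samples, nums = [], []
for name in filenames:
    match = pattern.match(name)
    if match:
        samples.append(match.group("sample"))
        nums.append(match.group("num"))

print(samples)  # ['Cancer1-2355_S61', 'Cancer1-2355_S61']
print(nums)     # ['R1', 'R2']
```

Note that the two lists are parallel: there is one entry per matching file, so a sample identifier appears once for each of its read files.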

FastQC rule

##########################################################
#                 FASTQC - QUALITY REPORTS
##########################################################
rule fastqc_before:
    input:
        INPUTDIR + "/" + FILE_NAME_PATTERN + EXT
    output:
        html = OUTPUTDIR + "/fastqc/before_trim/" + "{sample}_{num}_fastqc.html",
        zip = OUTPUTDIR + "/fastqc/before_trim/" + "{sample}_{num}_fastqc.zip",
    log:
        OUTPUTDIR + "/logs/" + "fastqc/fastqc_{sample}_{num}.log",
    threads: 20
    resources:
        mem_mb = 1024
    wrapper:
        "v5.5.2/bio/fastqc"
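
As a rough illustration of how the {sample} and {num} wildcards in this rule turn into concrete paths, the sketch below fills the output and log templates for one file using plain str.format. Snakemake performs its own wildcard resolution, so this only mirrors the result; the value of OUTPUTDIR is a hypothetical placeholder.

```python
# Hypothetical output directory; the workflow defines its own OUTPUTDIR.
OUTPUTDIR = "results"

# Templates copied from the rule's output and log sections.
html_template = OUTPUTDIR + "/fastqc/before_trim/" + "{sample}_{num}_fastqc.html"
log_template = OUTPUTDIR + "/logs/" + "fastqc/fastqc_{sample}_{num}.log"

# Wildcard values for the example file named earlier in this section.
wildcards = {"sample": "Cancer1-2355_S61", "num": "R1"}

html_path = html_template.format(**wildcards)
log_path = log_template.format(**wildcards)

print(html_path)  # results/fastqc/before_trim/Cancer1-2355_S61_R1_fastqc.html
print(log_path)   # results/logs/fastqc/fastqc_Cancer1-2355_S61_R1.log
```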

MultiQC rule

##########################################################
#                 MULTIQC - QUALITY REPORTS MERGE
##########################################################
rule multiqc_before:
    input:
        # zip pairs each sample with its own read direction instead of building every combination
        expand(OUTPUTDIR + "/fastqc/before_trim/" + "{sample}_{num}_fastqc.zip", zip, sample=SAMPLES, num=NUMS)
    output:
        OUTPUTDIR + "/multiqc/before_trim/" + "multiqc_report.html",
    log:
        OUTPUTDIR + "/logs" + "/multiqc/multiqc.log",
    params:
        use_input_files_only=True,
    wrapper:
        "v6.2.0/bio/multiqc"
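
Because SAMPLES and NUMS are parallel lists with one entry per file, it matters how expand combines them. By default expand builds every combination of the supplied values, while passing zip as its second argument pairs them element by element. The sketch below emulates both behaviours in plain Python rather than calling Snakemake; the sample names are hypothetical.

```python
from itertools import product

# Hypothetical parallel lists, shaped like the output of glob_wildcards.
SAMPLES = ["A_S1", "A_S1", "B_S2", "B_S2"]
NUMS = ["R1", "R2", "R1", "R2"]

template = "{sample}_{num}_fastqc.zip"

# Default behaviour of expand: every sample combined with every read direction.
product_paths = [template.format(sample=s, num=n) for s, n in product(SAMPLES, NUMS)]

# Behaviour with zip: the i-th sample paired with the i-th read direction.
zipped_paths = [template.format(sample=s, num=n) for s, n in zip(SAMPLES, NUMS)]

print(len(product_paths))       # 16
print(len(set(product_paths)))  # 4
print(zipped_paths)
# ['A_S1_R1_fastqc.zip', 'A_S1_R2_fastqc.zip', 'B_S2_R1_fastqc.zip', 'B_S2_R2_fastqc.zip']
```

With complete paired-end data both approaches name the same set of files, but the product repeats each path several times and would also request files that do not exist if a sample were missing one read direction.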

References

Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data. Available at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047-3048.