Streamlining the analysis process

In the previous chapters, we have seen various steps in generating taxonomic assignment for raw reads data. Manual execution of all those steps, one by one, could be time-consuming and may be prone to errors. To address these challenges, there are several workflow management tools available (e.g., snakemake, nextflow). These tools automate the execution of the entire pipeline and also enable reproducibility of running the same workflow with different datasets.

Here, we will see how to use Snakemake, a workflow management tool, to streamline the entire process of taxonomy generation.

Snakemake configuration file

As the first step, we will create a configuration file where we will keep all needed information for our analysis. It includes project directory, names of output files created in the analysis, parameters values for tools, etc.

This configuration file has information in the format of key:value. The below file is a configuration file which we are going to use for our short analysis (i.e., primer removal -> asv generation -> taxonomy assignment).

## Basic config
project: trial
scratch: /Users/pankaj/Documents/Metrosert/public_dataset/zakular2014
raw_data: partial_raw_data
outputDIR: partial_raw_asv
manifest: manifest_partial.csv
metadata: metadata_partial.tsv

## 16S adapters from Zackular et al., 2014
primerF: GTGCCAGCMGCCGCGGTAA
primerR: GGACTACHVGGGTWTCTAAT
primer_err: 0.4
primer_overlap: 3

## Reference database
database: silva-138.1-ssu-nr99-seqs-515f-806r.qza
database_classifier: silva_classifier.qza
database_tax: silva-138.1-ssu-nr99-tax.qza

## DADA2 - ASV flags
truncation_err: 2
truncation_len-f: 150
truncation_len-r: 140
truncation_err: 2
quality_err: 2

## Diversity metrics
sampling_depth: 500

Pipeline

Now, we will put our analysis codes in one place to be used by Snakemake. To do that, we need to create a file without specifying any extension. For example, we create a file Snakefile-asv. Next, we will write our analysis codes in the file a specific format (which we will see later).

Basically, we will do the following

Specify configuration file
Extract all information provided in configuration file
Write rules for each analysis step

# Snakemake file - input quality controlled fastq reads to generate asv
# Pankaj chejara

# Base snakefile: https://github.com/shu251/tagseq-qiime2-snakemake/blob/master/Snakefile-asv

configfile: "config.yaml"


import io
import os
import pandas as pd
import pathlib

##########################################################
#                 SET CONFIG VARS
##########################################################

PROJ = config["project"]
SCRATCH = config["scratch"]
INPUTDIR = config["raw_data"]
OUTPUTDIR = config['outputDIR']
MANIFEST = config["manifest"]
METADATA = config["metadata"]
SAMPLING_DEPTH= config['sampling_depth']

# Database information
DB = config["database"]
DB_classifier = config["database_classifier"]
DB_tax = config["database_tax"]


rule all:
  input:
    q2_import = SCRATCH + "/" + OUTPUTDIR +"/" +  PROJ + "-PE-demux.qza",
    q2_primerRM = SCRATCH + "/" + OUTPUTDIR +"/" +  PROJ + "-PE-demux-noprimer.qza",
    # Visualization
    raw = SCRATCH + "/" + OUTPUTDIR + "/viz/" + PROJ + "-PE-demux.qzv",
    primer = SCRATCH + "/" + OUTPUTDIR + "/viz/" + PROJ + "-PE-demux-noprimer.qzv",
    # Dada2 results
    table = SCRATCH + "/" + OUTPUTDIR + "/asv/" + PROJ + "-asv-table.qza",
    rep = SCRATCH + "/" + OUTPUTDIR + "/asv/" + PROJ + "-rep-seqs.qza",
    stats = SCRATCH + "/" + OUTPUTDIR + "/asv/" + PROJ + "-stats-dada2.qza",
    stats_viz = SCRATCH + "/" + OUTPUTDIR + "/viz/" + PROJ + "-stats-dada2.qzv",
    # Taxonomic table
    sklearn = SCRATCH + "/" + OUTPUTDIR + "/asv/" +  PROJ + "-tax_sklearn.qza",
    table_biom = SCRATCH + "/" + OUTPUTDIR + "/asv/" + "feature-table.biom",
    table_tsv = SCRATCH + "/" + OUTPUTDIR + "/asv/" + PROJ + "-asv-table.tsv",
    table_tax = SCRATCH + "/" + OUTPUTDIR + "/asv/"  + "taxonomy.tsv",
    # Phylogenetic outputs
    aligned_seqs = SCRATCH + "/" + OUTPUTDIR + "/asv/" + "tree/" + PROJ + "-aligned-rep-seqs.qza",
    aligned_masked = SCRATCH + "/" + OUTPUTDIR + "/asv/" + "tree/" + PROJ + "-masked-aligned-rep-seqs.qza",
    unrooted_tree = SCRATCH + "/" + OUTPUTDIR + "/asv/" + "tree/" + PROJ + "-unrooted-tree.qza",
    rooted_tree = SCRATCH + "/" + OUTPUTDIR + "/asv/" + "tree/" + PROJ + "-rooted-tree.qza",
    # Diversity metrics
    output_dir = directory(SCRATCH + "/qiime2/diversity")


##########################################################
#                 LOAD DATA 
##########################################################
rule import_qiime:
  input:
    MANIFEST
  output:
    q2_import = SCRATCH + "/" + OUTPUTDIR +"/" + PROJ + "-PE-demux.qza"
  log:
    SCRATCH + "/" + OUTPUTDIR + "/logs/" + PROJ + "_q2.log"
  params:
    type="SampleData[PairedEndSequencesWithQuality]",
    input_format="PairedEndFastqManifestPhred33"
  shell:
    """qiime tools import \
    --type {params.type} \
    --input-path {input} \
    --output-path {output.q2_import} \
    --input-format {params.input_format} 
    """

##########################################################
#                 REMOVE PRIMERS
##########################################################

rule rm_primers:
  input:
    q2_import = SCRATCH + "/" + OUTPUTDIR +"/" + PROJ + "-PE-demux.qza"
  output:
    q2_primerRM = SCRATCH + "/" + OUTPUTDIR +"/" + PROJ +  "-PE-demux-noprimer.qza"
  log:
     SCRATCH + "/" + OUTPUTDIR + "/logs/" + PROJ + "_primer_q2.log"

  shell:
    """qiime cutadapt trim-paired \
       --i-demultiplexed-sequences {input.q2_import} \
       --p-front-f {config[primerF]} \
       --p-front-r {config[primerR]} \
       --p-error-rate {config[primer_err]} \
       --p-overlap {config[primer_overlap]} \
       --o-trimmed-sequences {output.q2_primerRM}"""


##########################################################
#                 QC STATS
##########################################################

rule get_stats:
  input:
    q2_import = SCRATCH + "/" + OUTPUTDIR +"/" +  PROJ + "-PE-demux.qza",
    q2_primerRM = SCRATCH + "/" + OUTPUTDIR +"/" +  PROJ + "-PE-demux-noprimer.qza"
  output:
    raw = SCRATCH + "/" + OUTPUTDIR + "/viz/" + PROJ + "-PE-demux.qzv",
    primer = SCRATCH + "/" + OUTPUTDIR + "/viz/" + PROJ + "-PE-demux-noprimer.qzv"
  log:
    SCRATCH + "/" + OUTPUTDIR + "/logs/" + PROJ +  "_getviz_q2.log"
  shell:
    """
     qiime demux summarize --i-data {input.q2_import} --o-visualization {output.raw}
     qiime demux summarize --i-data {input.q2_primerRM} --o-visualization {output.primer}
    """

##########################################################
#                 DENOISE & ASVs
##########################################################

rule dada2:
  input:
    q2_primerRM = SCRATCH + "/" + OUTPUTDIR +"/" + PROJ + "-PE-demux-noprimer.qza"
  output:
    table = SCRATCH + "/" + OUTPUTDIR + "/asv/" + PROJ + "-asv-table.qza",
    rep = SCRATCH + "/" + OUTPUTDIR + "/asv/" + PROJ + "-rep-seqs.qza",
    stats = SCRATCH + "/" + OUTPUTDIR + "/asv/" + PROJ + "-stats-dada2.qza"
  log:
    SCRATCH + "/" + OUTPUTDIR + "/logs/" + PROJ + "_dada2_q2.log"
  shell:
    """qiime dada2 denoise-paired \
        --i-demultiplexed-seqs {input.q2_primerRM} \
        --p-trunc-q {config[truncation_err]} \
        --p-trunc-len-f {config[truncation_len-f]} \
        --p-trunc-len-r {config[truncation_len-r]} \
        --o-table {output.table} \
        --o-representative-sequences {output.rep} \
        --o-denoising-stats {output.stats}"""


rule dada2_stats:
  input:
    stats = SCRATCH + "/" + OUTPUTDIR + "/asv/" + PROJ + "-stats-dada2.qza"
  output:
    stats_viz = SCRATCH + "/" + OUTPUTDIR + "/viz/" + PROJ + "-stats-dada2.qzv"
  log:
    SCRATCH + "/" + OUTPUTDIR + "/logs/" + PROJ + "_dada2-stats_q2.log"
  shell:
   """qiime metadata tabulate \
       --m-input-file {input.stats} \
       --o-visualization {output.stats_viz}"""


##########################################################
#                 TAXONOMIC ASSIGNMENT
##########################################################

rule assign_tax:
  input:
    rep = SCRATCH + "/" + OUTPUTDIR + "/asv/" +  PROJ + "-rep-seqs.qza",
    db_classified = DB_classifier
  output:
    sklearn = SCRATCH + "/" + OUTPUTDIR + "/asv/" +  PROJ + "-tax_sklearn.qza"
  log:
    SCRATCH + "/" + OUTPUTDIR + "/logs/" + PROJ +  "_sklearn_q2.log"
  shell:
    """qiime feature-classifier classify-sklearn \
      --i-classifier {input.db_classified} \
      --i-reads {input.rep} \
      --o-classification {output.sklearn}"""


##########################################################
#                 TAXONOMIC TABLE GENERATION
##########################################################
rule gen_table:
  input:
    table = SCRATCH + "/" + OUTPUTDIR + "/asv/" + PROJ + "-asv-table.qza"
  output:
    table_biom = SCRATCH + "/" + OUTPUTDIR + "/asv/" + "feature-table.biom"
  log:
    SCRATCH + "/" + OUTPUTDIR + "/logs/" + PROJ + "_exportBIOM_q2.log"
  params:
    directory(SCRATCH + "/" + OUTPUTDIR + "/asv/")
  shell:
    "qiime tools export --input-path {input.table} --output-path {params}"

rule convert:
  input:
    table_biom = SCRATCH + "/" + OUTPUTDIR + "/asv/" + "feature-table.biom"
  output:
    SCRATCH + "/" + OUTPUTDIR + "/asv/" + PROJ + "-asv-table.tsv"
  log:
    SCRATCH + "/" + OUTPUTDIR + "/logs/" + PROJ + "_exportTSV_q2.log"
  shell:
    "biom convert -i {input} -o {output} --to-tsv"

rule gen_tax:
  input:
    sklearn = SCRATCH + "/" + OUTPUTDIR + "/asv/" +  PROJ + "-tax_sklearn.qza"
  output:
     table_tax = SCRATCH + "/" + OUTPUTDIR + "/asv/"  + "taxonomy.tsv",
  log:
    SCRATCH + "/" + OUTPUTDIR + "/logs/" + PROJ + "_exportTAXTSV_q2.log"
  params:
    directory(SCRATCH + "/" + OUTPUTDIR + "/asv/")
  shell:
    "qiime tools export --input-path {input.sklearn} --output-path {params}"


##########################################################
#                 PHYLOGENETIC TREE 
##########################################################
rule phy_tree:
  input:
     rep = SCRATCH + "/" + OUTPUTDIR + "/asv/" + PROJ + "-rep-seqs.qza",
  output:
    aligned_seqs = SCRATCH + "/" + OUTPUTDIR + "/asv/" + "tree/" + PROJ + "-aligned-rep-seqs.qza",
    aligned_masked = SCRATCH + "/" + OUTPUTDIR + "/asv/" + "tree/" + PROJ + "-masked-aligned-rep-seqs.qza",
    unrooted_tree = SCRATCH + "/" + OUTPUTDIR + "/asv/" + "tree/" + PROJ + "-unrooted-tree.qza",
    rooted_tree = SCRATCH + "/" + OUTPUTDIR + "/asv/" + "tree/" + PROJ + "-rooted-tree.qza",
  log:
    SCRATCH + "/" + OUTPUTDIR + "/logs/" + PROJ + "_phylogeneticTREE_q2.log"

  shell:
    """qiime phylogeny align-to-tree-mafft-fasttree \
        --i-sequences {input.rep} \
        --o-alignment {output.aligned_seqs} \
        --o-masked-alignment {output.aligned_masked} \
        --o-tree {output.unrooted_tree} \
        --o-rooted-tree {output.rooted_tree}"""


##########################################################
#                 DIVERSITY METRICS 
##########################################################
rule div_met:
  input:
     rooted_tree = SCRATCH + "/" + OUTPUTDIR + "/asv/" + "tree/" + PROJ + "-rooted-tree.qza",
     table = SCRATCH + "/" + OUTPUTDIR + "/asv/" + PROJ + "-asv-table.qza"
  output:
     output_dir = directory(SCRATCH + "/qiime2/diversity")
  
  log:
    SCRATCH + "/" + OUTPUTDIR + "/logs/" + PROJ + "_phylogeneticTREE_q2.log"

  shell:
    """qiime diversity core-metrics-phylogenetic \
        --i-phylogeny {input.rooted_tree} \
        --i-table {input.table} \
        --p-sampling-depth {SAMPLING_DEPTH} \
        --m-metadata-file {METADATA} \
        --output-dir {output.output_dir}"""

References

https://github.com/shu251/tagseq-qiime2-snakemake/blob/master/Snakefile-asv
https://forum.qiime2.org/t/qiime2-snakemake-workflow-tutorial-18s-16s-tag-sequencing/11334