Reference database

We will first set up our reference database in QIIME 2 (version 2024.10). For this we will use the SILVA database (Pruesse et al. 2007; Quast et al. 2012). Other databases, such as Greengenes and RDP, are also available.

To simplify setting up the reference database, we will use the excellent RESCRIPt plugin (Robeson et al. 2021). This plugin provides a ready-made pipeline to download and process different reference databases (read more here).

Installing RESCRIPt plugin

We will use the following command to install the plugin:

pip install git+https://github.com/bokulich-lab/RESCRIPt.git

Preparing SILVA database as reference database

We will follow the steps provided here to download and build the SILVA database for use in QIIME 2.

Step-1: Download SILVA

qiime rescript get-silva-data \
    --p-version '138.1' \
    --p-target 'SSURef_NR99' \
    --o-silva-sequences silva-138.1-ssu-nr99-rna-seqs.qza \
    --o-silva-taxonomy silva-138.1-ssu-nr99-tax.qza

RESCRIPt error

The RESCRIPt plugin works well with some QIIME 2 versions but not all. If you encounter the error cannot import name 'DNASequence' from 'q2_types.genome_data', refer to this page for detailed instructions on resolving it. The same steps are also provided below.

We will create an additional conda environment and install the required packages so that RESCRIPt can function properly.

conda install -c conda-forge -c bioconda -c qiime2 \
-c https://packages.qiime2.org/qiime2/2023.9/shotgun/released/  \
-c defaults   xmltodict 'q2-types-genomics>2023.5' ncbi-datasets-pylib

pip install git+https://github.com/bokulich-lab/RESCRIPt.git

qiime dev refresh-cache

qiime rescript --help

Step-2: Converting RNA sequences to DNA sequences

The sequences from the above step are RNA sequences (semantic type FeatureData[RNASequence]). To ensure smooth downstream analysis, we will convert them to DNA sequences (FeatureData[Sequence]) using the following command.

qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138.1-ssu-nr99-rna-seqs.qza \
    --o-dna-sequences silva-138.1-ssu-nr99-seqs.qza
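
Before moving on, it can help to confirm that the conversion produced what we expect. The sketch below loops over the three artifacts created so far and runs qiime tools peek on each, which prints the artifact's UUID, semantic type, and data format. The guard around the call is only there so the snippet exits cleanly when an artifact or the qiime command is not available; the variable names are illustrative.

```shell
# Artifacts produced by Step-1 and Step-2 (file names as used above).
artifacts="silva-138.1-ssu-nr99-rna-seqs.qza silva-138.1-ssu-nr99-tax.qza silva-138.1-ssu-nr99-seqs.qza"

seen=0
for artifact in $artifacts; do
    if [ -f "$artifact" ] && command -v qiime >/dev/null 2>&1; then
        # Prints the UUID, semantic type, and data format of the artifact.
        qiime tools peek "$artifact"
    else
        echo "Skipping $artifact: file or qiime command not found"
    fi
    seen=$((seen + 1))
done
echo "Went through $seen artifact name(s)"
```

If the conversion worked, the type reported for silva-138.1-ssu-nr99-seqs.qza should be a DNA sequence type rather than an RNA one.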

The resulting QIIME 2 artifact, silva-138.1-ssu-nr99-seqs.qza, can now be used to train a classifier. Additional steps can also be added before building the classifier, such as removing low-quality sequences or filtering by length.

The link provides some of these examples, along with the commands to run them using the RESCRIPt plugin.
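
The optional curation steps mentioned above could look roughly like the following sketch, adapted from the RESCRIPt SILVA tutorial. It removes sequences with too many ambiguous bases or long homopolymers, filters by a minimum length per domain, and then dereplicates identical sequences. The output file names, the curate_silva function name, and the length thresholds (900, 1200, and 1400 bases for Archaea, Bacteria, and Eukaryota) are assumptions that should be adjusted to your data.

```shell
seqs=silva-138.1-ssu-nr99-seqs.qza
tax=silva-138.1-ssu-nr99-tax.qza

# Hypothetical helper that chains the three curation steps;
# each step runs only if the previous one succeeded.
curate_silva() {
    # 1. Drop low-quality sequences using the plugin's default thresholds.
    qiime rescript cull-seqs \
        --i-sequences "$seqs" \
        --o-clean-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza &&
    # 2. Keep only sequences that meet a minimum length for their domain.
    qiime rescript filter-seqs-length-by-taxon \
        --i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \
        --i-taxonomy "$tax" \
        --p-labels Archaea Bacteria Eukaryota \
        --p-min-lens 900 1200 1400 \
        --o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza \
        --o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza &&
    # 3. Collapse identical sequences that share the same taxonomy.
    qiime rescript dereplicate \
        --i-sequences silva-138.1-ssu-nr99-seqs-filt.qza \
        --i-taxa "$tax" \
        --p-mode 'uniq' \
        --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza \
        --o-dereplicated-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza
}

# Run the helper only when QIIME 2 and the input artifacts are present.
if command -v qiime >/dev/null 2>&1 && [ -f "$seqs" ] && [ -f "$tax" ]; then
    curate_silva
else
    echo "Skipping curation: qiime command or input artifacts not found"
fi
```

If you adopt these steps, the later commands should take the dereplicated sequence and taxonomy files as input instead of the unfiltered ones.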

Building a taxonomy classifier

Next we will train a taxonomy classifier, which we will use later in our analysis to assign taxonomy labels to sequence data. Before doing that, we will extract the V4 region from the SILVA sequences and use the extracted reads to train the classifier.

Extracting the V4 region before training has been found to improve classification performance.

Extract V4 region

This step extracts the target 16S region (here V4) from the database sequences using the provided primers.

Tip

It is recommended to use the same primers that were used to amplify the 16S region in the dataset under study.

qiime feature-classifier extract-reads \
    --i-sequences silva-138.1-ssu-nr99-seqs.qza \
    --p-f-primer GTGYCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACNVGGGTWTCTAAT \
    --p-n-jobs 2 \
    --p-read-orientation 'forward' \
    --o-reads silva-138.1-ssu-nr99-seqs-515f-806r.qza

Train the classifier

This step trains a Naive Bayes classifier on the reads extracted in the previous step. The resulting classifier is stored as a QIIME 2 artifact that can be readily used for classification tasks in QIIME 2.

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads silva-138.1-ssu-nr99-seqs-515f-806r.qza \
  --i-reference-taxonomy silva-138.1-ssu-nr99-tax.qza \
  --o-classifier silva_classifier.qza
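
As a preview of how the trained classifier is used later, a minimal sketch of the classification call is shown here. It assumes your representative sequences are stored in a file called rep-seqs.qza and writes the result to taxonomy.qza; both names are placeholders, not files created in this section.

```shell
classifier=silva_classifier.qza
reads=rep-seqs.qza   # placeholder: representative sequences from your own dataset

# Stays "pending" if the inputs are missing or the command fails.
status="pending"
if command -v qiime >/dev/null 2>&1 && [ -f "$classifier" ] && [ -f "$reads" ]; then
    # Assign taxonomy to each representative sequence with the trained classifier.
    qiime feature-classifier classify-sklearn \
        --i-classifier "$classifier" \
        --i-reads "$reads" \
        --o-classification taxonomy.qza && status="classified"
fi
echo "Classification status: $status"
```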

References

Pruesse, Elmar, Christian Quast, Katrin Knittel, Bernhard M. Fuchs, Wolfgang Ludwig, Jörg Peplies, and Frank Oliver Glöckner. 2007. “SILVA: A Comprehensive Online Resource for Quality Checked and Aligned Ribosomal RNA Sequence Data Compatible with ARB.” Nucleic Acids Research 35 (21): 7188–96. https://doi.org/10.1093/nar/gkm864.

Quast, Christian, Elmar Pruesse, Pelin Yilmaz, Jan Gerken, Timmy Schweer, Pablo Yarza, Jörg Peplies, and Frank Oliver Glöckner. 2012. “The SILVA Ribosomal RNA Gene Database Project: Improved Data Processing and Web-Based Tools.” Nucleic Acids Research 41 (D1): D590–96. https://doi.org/10.1093/nar/gks1219.

Robeson, Michael S., Devon R. O’Rourke, Benjamin D. Kaehler, Michal Ziemski, Matthew R. Dillon, Jeffrey T. Foster, and Nicholas A. Bokulich. 2021. “RESCRIPt: Reproducible Sequence Taxonomy Reference Database Management.” PLoS Computational Biology 17 (11): e1009581. https://doi.org/10.1371/journal.pcbi.1009581.