Reference database
We will first set up our reference database in QIIME 2 (version 2024.10). For this we will use the SILVA database (Pruesse et al. 2007; Quast et al. 2012). Other databases, such as Greengenes and RDP, are also available.
To simplify setting up the reference database, we will use the RESCRIPt plugin (Robeson et al. 2021). This plugin provides a ready-made pipeline for downloading and processing different reference databases (read more here).
Installing RESCRIPt plugin
We will use the following command to install the plugin:
pip install git+https://github.com/bokulich-lab/RESCRIPt.git
Preparing SILVA as the reference database
We will follow the steps provided here to download and build the SILVA database for use in QIIME 2.
Step-1: Download SILVA
qiime rescript get-silva-data \
--p-version '138.1' \
--p-target 'SSURef_NR99' \
--o-silva-sequences silva-138.1-ssu-nr99-rna-seqs.qza \
--o-silva-taxonomy silva-138.1-ssu-nr99-tax.qza
The RESCRIPt plugin works well with some QIIME 2 versions but not all. If you encounter the error "cannot import name 'DNASequence' from 'q2_types.genome_data'", refer to this page for detailed instructions on resolving it. The same steps are also provided below.
We will create an additional conda environment and install the required packages so that RESCRIPt functions properly.
conda install -c conda-forge -c bioconda -c qiime2 \
-c https://packages.qiime2.org/qiime2/2023.9/shotgun/released/ \
-c defaults \
xmltodict 'q2-types-genomics>2023.5' ncbi-datasets-pylib
pip install git+https://github.com/bokulich-lab/RESCRIPt.git
qiime dev refresh-cache
qiime rescript --help
Step-2: Converting RNA sequences to DNA sequences
The sequences downloaded in the previous step are RNA sequences (semantic type FeatureData[RNASequence]). To ensure a smooth downstream analysis, we will convert them to DNA sequences (FeatureData[Sequence]) using the following command.
qiime rescript reverse-transcribe \
--i-rna-sequences silva-138.1-ssu-nr99-rna-seqs.qza \
--o-dna-sequences silva-138.1-ssu-nr99-seqs.qza
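To make the conversion concrete, here is a minimal sketch of what it amounts to at the sequence level: every U becomes a T, while the record headers stay unchanged. The FASTA records are made-up toy data, not SILVA entries, and the awk one-liner is only an illustration; it is not how RESCRIPt is implemented, and it does not change the artifact's semantic type the way the plugin does.

```shell
# Toy FASTA with RNA letters (made-up records, not taken from SILVA).
rna_fasta=$(printf '>seq1\nGUGCCAGCAGCCGCGGUAA\n>seq2 U-rich header\nAUUAGAUACCC\n')

# Replace U with T on sequence lines only; header lines (starting with ">")
# are printed untouched so that IDs and descriptions are preserved.
dna_fasta=$(printf '%s\n' "$rna_fasta" |
  awk '/^>/ { print; next } { gsub(/U/, "T"); gsub(/u/, "t"); print }')

printf '%s\n' "$dna_fasta"
```

Note that the U in the second header is kept, which is why the substitution is restricted to sequence lines.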
The resulting QIIME 2 artifact silva-138.1-ssu-nr99-seqs.qza can now be used to train a classifier. Additional steps can also be added before building the classifier, such as removing low-quality sequences or filtering by length.
The link provides examples of these steps and the commands to run them with the RESCRIPt plugin.
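As an illustration of what length-based filtering means, the sketch below drops every record shorter than a threshold from a toy FASTA. The threshold min_len, the record names, and the sequences are all hypothetical, and the sketch assumes each record is exactly one header line followed by one sequence line. For real reference data, the RESCRIPt commands from the linked examples are the better choice, since they work directly on QIIME 2 artifacts.

```shell
# Hypothetical minimum length, chosen only for this toy example.
min_len=10

# Toy FASTA (made-up records): lengths are 16, 4 and 10.
toy_fasta=$(printf '>long_seq\nACGTACGTACGTACGT\n>short_seq\nACGT\n>ok_seq\nACGTACGTAC\n')

# Remember each header, then print header + sequence only if the
# sequence line is at least min_len characters long.
filtered=$(printf '%s\n' "$toy_fasta" | awk -v min="$min_len" '
  /^>/ { header = $0; next }
  length($0) >= min { print header; print $0 }
')

printf '%s\n' "$filtered"
```

Only short_seq is removed here; ok_seq is kept because its length equals the threshold.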
Building a taxonomy classifier
Next we will train a taxonomy classifier, which we will use later in the analysis to assign taxonomy labels to sequence data. Before training, we will extract the V4 region from the SILVA sequences and train the classifier on those extracted reads.
Restricting the reference to the V4 region has been found to improve classifier performance.
Extract V4 region
This step extracts the target 16S region from the database sequences using the provided primers.
It is recommended to use the same primers that were used to amplify the 16S region in the dataset under study.
qiime feature-classifier extract-reads \
--i-sequences silva-138.1-ssu-nr99-seqs.qza \
--p-f-primer GTGYCAGCMGCCGCGGTAA \
--p-r-primer GGACTACNVGGGTWTCTAAT \
--p-n-jobs 2 \
--p-read-orientation 'forward' \
--o-reads silva-138.1-ssu-nr99-seqs-515f-806r.qza
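The primers above contain IUPAC degenerate bases (Y, M, N, V, W), each of which stands for a set of nucleotides. The following sketch expands both primers into regular expressions so the ambiguity codes are easier to read. The helper iupac_to_regex and the toy read are hypothetical and are not part of QIIME 2. The sketch performs exact matching on the forward strand only, so it ignores both reverse-complementing the reverse primer and the mismatch tolerance that real primer-search tools apply.

```shell
# Hypothetical helper: expand IUPAC ambiguity codes into regex character classes.
# The replacements contain only A, C, G, T, so applying them in sequence is safe.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g' \
    -e 's/W/[AT]/g'  -e 's/K/[GT]/g'  -e 's/M/[AC]/g' \
    -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' -e 's/H/[ACT]/g' \
    -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

fwd_regex=$(iupac_to_regex GTGYCAGCMGCCGCGGTAA)   # GTG[CT]CAGC[AC]GCCGCGGTAA
rev_regex=$(iupac_to_regex GGACTACNVGGGTWTCTAAT)  # GGACTAC[ACGT][ACG]GGGT[AT]TCTAAT

# Made-up read containing one concrete instance of the forward primer.
toy_read=TTGTGCCAGCAGCCGCGGTAATACG
if printf '%s\n' "$toy_read" | grep -Eq "$fwd_regex"; then
  match=yes
else
  match=no
fi

printf 'forward: %s\nreverse: %s\nforward primer found in toy read: %s\n' \
  "$fwd_regex" "$rev_regex" "$match"
```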
Train the classifier
This step trains a Naive Bayes classifier on the reads extracted in the previous step. The resulting classifier is stored as a QIIME 2 artifact that can be used directly for classification tasks in QIIME 2.
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads silva-138.1-ssu-nr99-seqs-515f-806r.qza \
--i-reference-taxonomy silva-138.1-ssu-nr99-tax.qza \
--o-classifier silva_classifier.qza
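Because each step depends on the output of the previous one, it can help to confirm that all expected artifacts exist before moving on. The sketch below uses a hypothetical helper, check_artifacts, which is not a QIIME 2 command. It only tests that each file is present and non-empty and does not validate the artifact contents. The file names are the ones used in the commands above.

```shell
# Hypothetical helper: report which of the given files exist and are non-empty.
# Returns 0 if all files are present, 1 otherwise.
check_artifacts() {
  status=0
  for f in "$@"; do
    if [ -s "$f" ]; then
      printf 'ok       %s\n' "$f"
    else
      printf 'MISSING  %s\n' "$f"
      status=1
    fi
  done
  return "$status"
}

# Artifacts produced by the steps in this section, in order of creation.
check_artifacts \
  silva-138.1-ssu-nr99-rna-seqs.qza \
  silva-138.1-ssu-nr99-tax.qza \
  silva-138.1-ssu-nr99-seqs.qza \
  silva-138.1-ssu-nr99-seqs-515f-806r.qza \
  silva_classifier.qza ||
  echo "Some artifacts are missing; re-run the corresponding step."
```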