
Config file

A configuration file specifies all project-related information in a single YAML (.yml) file. It is the only file that needs to be updated to run the workflow for another project, which makes a bioinformatician's life easier. Imagine having to change the workflow itself every time you run it for a different project: the likelihood of human error would be high.

To avoid this situation, all project-related information is kept in a YAML file that our workflow uses as its configuration.

The table below describes each key and its meaning.

Key Description
project Name of the project; serves as an identifier for the analysis.
scratch Path to the scratch or working directory where outputs will be stored.
raw_data Directory containing raw input data files (e.g., raw FASTQ files).
outputDIR Directory where final pipeline outputs will be stored.
metadata Path to the QIIME2 metadata file containing sample information such as groups or treatments.
manifest Path to the manifest file listing the paths of the FASTQ files.
tmp_dir Temporary directory for QIIME2 operations (e.g., caching files during execution).
file_name_pattern File name pattern used to parse the sample ID and read direction from each file name.
prefix File prefix used in raw FASTQ file naming (e.g., _L001_).
r1_suf Identifier used for forward read files (R1) in paired-end sequencing data.
r2_suf Identifier used for reverse read files (R2) in paired-end sequencing data.
file_r1 File name pattern for the saved Trimmomatic forward-read (R1) output file.
file_r2 File name pattern for the saved Trimmomatic reverse-read (R2) output file.
trimm_params Parameters for trimming raw sequences with Trimmomatic (e.g., cropping sequences to a length of 200).
primerF Forward primer sequence used for amplifying the 16S rRNA region.
primerR Reverse primer sequence used for amplifying the 16S rRNA region.
primer_err Maximum allowable primer error rate during sequence matching.
primer_overlap Minimum overlap between primer and sequence for matching.
database Path to the QIIME2 reference database sequences file (e.g., SILVA 138.1).
database_classifier Path to the QIIME2 pre-trained classifier for taxonomic classification.
database_tax Path to the QIIME2 taxonomy file associated with the reference database.
truncation_err Maximum allowable error for sequence truncation in DADA2 (used for quality filtering).
truncation_len-f Length to truncate forward reads (R1) in DADA2 analysis.
truncation_len-r Length to truncate reverse reads (R2) in DADA2 analysis.
quality_err Maximum allowable quality error for sequence processing in DADA2.
sampling_depth Depth of subsampling used for diversity analysis to ensure consistency across samples.
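Because a missing or misspelled key is an easy mistake to make when adapting the file to a new project, the keys in the table can be checked before a run. The following is a minimal sketch, not part of the workflow itself: the helper name `missing_keys` is hypothetical, and treating every key in the table as required is an assumption.

```python
# Keys copied from the table above, in table order.
# Assumption: every key is required; adjust the list to your own setup.
REQUIRED_KEYS = [
    "project", "scratch", "raw_data", "outputDIR", "metadata", "manifest",
    "tmp_dir", "file_name_pattern", "prefix", "r1_suf", "r2_suf",
    "file_r1", "file_r2", "trimm_params", "primerF", "primerR",
    "primer_err", "primer_overlap", "database", "database_classifier",
    "database_tax", "truncation_err", "truncation_len-f",
    "truncation_len-r", "quality_err", "sampling_depth",
]


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `config`.

    `config` is a plain dict, e.g. the result of parsing the YAML file.
    """
    return [key for key in required if config.get(key) in (None, "")]


if __name__ == "__main__":
    # Toy config with only two keys filled in, for illustration.
    example = {"project": "trial", "raw_data": "raw_data"}
    for key in missing_keys(example):
        print(f"missing config key: {key}")
```

The dict passed to `missing_keys` could come from any YAML parser, for example `yaml.safe_load` from PyYAML, if that package is available in your environment.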

The following is the configuration file used for the workflow.

# Basic setup
project: trial
raw_data: 'raw_data'
outputDIR: results
metadata: metadata.tsv
manifest: manifest.csv

# Temp directory
tmp_dir: ./

# File naming pattern
# This pattern will be used to specify input files for the first step
# For example, a file named "Healthy3-3021_S33_L001_R2_001.fastq.gz"
#    yields 'Healthy3-3021_S33' as sample and 'R2' as num
file_name_pattern: "{sample}_L001_{num}_001"


# Fastq file naming config
extension: .fastq.gz
prefix: _L001_
r1_suf: R1
r2_suf: R2

# Trimmomatic config
file_r1: "{sample}_L001_R1_001"
file_r2: "{sample}_L001_R2_001"
threads: 20
trimm_params: CROP:200

## 16S adapters from Zackular et al., 2014
primerF: GTGCCAGCMGCCGCGGTAA
primerR: GGACTACHVGGGTWTCTAAT
primer_err: 0.4
primer_overlap: 3

## Reference database
database: reference_db/silva-138.1-ssu-nr99-seqs-515f-806r.qza
database_classifier: reference_db/silva_classifier.qza
database_tax: reference_db/silva-138.1-ssu-nr99-tax.qza

## DADA2 - ASV flags
truncation_err: 2
truncation_len-f: 150
truncation_len-r: 140
quality_err: 2

## Diversity metrics
sampling_depth: 500
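
To illustrate how `file_name_pattern` and `extension` from the configuration above split a file name into `sample` and `num`, here is a small standard-library sketch. It is not necessarily how the workflow resolves the pattern (workflow engines usually have their own wildcard matching); the helper names `pattern_to_regex` and `parse_fastq_name` are hypothetical.

```python
import re


def pattern_to_regex(pattern, extension=""):
    """Turn a '{name}'-style pattern into a compiled regex with named groups."""
    # Splitting on a single capture group alternates literal text (even
    # indexes) with placeholder names (odd indexes).
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)       # literal text, matched verbatim
        else:
            regex += f"(?P<{part}>.+)"     # placeholder, captured by name
    return re.compile(regex + re.escape(extension) + "$")


def parse_fastq_name(filename, pattern, extension):
    """Return a dict of placeholder values, or None if the name does not fit."""
    match = pattern_to_regex(pattern, extension).match(filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    fields = parse_fastq_name(
        "Healthy3-3021_S33_L001_R2_001.fastq.gz",
        "{sample}_L001_{num}_001",   # file_name_pattern from the config
        ".fastq.gz",                 # extension from the config
    )
    print(fields)  # {'sample': 'Healthy3-3021_S33', 'num': 'R2'}
```

Files that do not follow the pattern return `None`, which makes it easy to list unexpected names in the raw data directory before starting a run.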