Config file
A configuration file specifies all project-related information in a YAML (`.yml`) file. It is the only file that needs to be updated to run the workflow for another project, which makes a bioinformatician's life easier. Imagine having to edit the workflow itself every time you run it for a different project: the likelihood of human error would be high.
To avoid this situation, all project-related information is kept in a single YAML file.
The table below describes each key and its meaning.
Key | Description |
---|---|
project | Name of the project; serves as an identifier for the analysis. |
scratch | Path to the scratch or working directory where outputs will be stored. |
raw_data | Directory containing raw input data files (e.g., raw FASTQ files). |
outputDIR | Directory where final pipeline outputs will be stored. |
metadata | Path to the QIIME2 metadata file containing sample information such as groups or treatments. |
manifest | Path to the manifest file listing the paths to the FASTQ files. |
tmp_dir | Temporary directory for QIIME2 operations (e.g., caching files during execution). |
file_name_pattern | File name pattern used to parse the sample ID and read direction from input file names. |
extension | File extension of the raw FASTQ files (e.g., .fastq.gz). |
prefix | File prefix used in raw FASTQ file naming (e.g., _L001_ ). |
r1_suf | Identifier used for forward read files (R1) in paired-end sequencing data. |
r2_suf | Identifier used for reverse read files (R2) in paired-end sequencing data. |
file_r1 | File name pattern for the Trimmomatic forward-read (R1) output file. |
file_r2 | File name pattern for the Trimmomatic reverse-read (R2) output file. |
threads | Number of threads to use (listed with the Trimmomatic settings in the configuration file). |
trimm_params | Parameters for trimming raw sequences with Trimmomatic (e.g., cropping sequences to a length of 200). |
primerF | Forward primer sequence used for amplifying the 16S rRNA region. |
primerR | Reverse primer sequence used for amplifying the 16S rRNA region. |
primer_err | Maximum allowable primer error rate during sequence matching. |
primer_overlap | Minimum overlap between primer and sequence for matching. |
database | Path to the QIIME2 reference database sequences file (e.g., SILVA 138.1). |
database_classifier | Path to the QIIME2 pre-trained classifier for taxonomic classification. |
database_tax | Path to the QIIME2 taxonomy file associated with the reference database. |
truncation_err | Maximum allowable error for sequence truncation in DADA2 (used for quality filtering). |
truncation_len-f | Length to truncate forward reads (R1) in DADA2 analysis. |
truncation_len-r | Length to truncate reverse reads (R2) in DADA2 analysis. |
quality_err | Maximum allowable quality error for sequence processing in DADA2. |
sampling_depth | Depth of subsampling used for diversity analysis to ensure consistency across samples. |
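
Because a missing or empty key is an easy mistake to make when adapting the file to a new project, a small check before running the workflow can help. The following is a minimal sketch, not part of the workflow itself: it assumes the YAML file has already been loaded into a plain Python dictionary (for example with PyYAML's `yaml.safe_load`), and both the helper `missing_keys` and the `REQUIRED_KEYS` list are hypothetical. Which keys the workflow truly requires is an assumption here; the list is only a subset of the keys in the table above.

```python
# Hypothetical subset of keys taken from the table above; adjust it to
# whatever the workflow actually requires.
REQUIRED_KEYS = [
    "project", "raw_data", "outputDIR", "metadata", "manifest",
    "file_name_pattern", "primerF", "primerR", "sampling_depth",
]


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in required if config.get(key) in (None, "")]


if __name__ == "__main__":
    # A deliberately incomplete configuration, for illustration only.
    partial = {"project": "trial", "raw_data": "raw_data", "outputDIR": "results"}
    for key in missing_keys(partial):
        print(f"missing configuration key: {key}")
```

Returning a list rather than raising on the first problem lets all missing keys be reported in one pass.
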
The following is the configuration file used for the workflow.
# Basic setup
project: trial
raw_data: 'raw_data'
outputDIR: results
metadata: metadata.tsv
manifest: manifest.csv
# Temp directory
tmp_dir: ./
# File naming pattern
# This pattern will be used to specify input files for the first step
# For example, a file named "Healthy3-3021_S33_L001_R2_001.fastq.gz" yields
# 'Healthy3-3021_S33' as sample and 'R2' as num
file_name_pattern: "{sample}_L001_{num}_001"
# Fastq file naming config
extension: .fastq.gz
prefix: _L001_
r1_suf: R1
r2_suf: R2
# Trimmomatic config
file_r1: "{sample}_L001_R1_001"
file_r2: "{sample}_L001_R2_001"
threads: 20
trimm_params: CROP:200
## 16S adapters from Zackular et al., 2014
primerF: GTGCCAGCMGCCGCGGTAA
primerR: GGACTACHVGGGTWTCTAAT
primer_err: 0.4
primer_overlap: 3
## Reference database
database: reference_db/silva-138.1-ssu-nr99-seqs-515f-806r.qza
database_classifier: reference_db/silva_classifier.qza
database_tax: reference_db/silva-138.1-ssu-nr99-tax.qza
## DADA2 - ASV flags
truncation_err: 2
truncation_len-f: 150
truncation_len-r: 140
quality_err: 2
## Diversity metrics
sampling_depth: 500
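
To illustrate how `file_name_pattern` and `extension` relate to an actual file name, the sketch below turns the pattern into a regular expression and extracts the `sample` and `num` fields. It is only an illustration under stated assumptions: the helpers `pattern_to_regex` and `parse_file_name` are hypothetical, and the workflow's own pattern matching may behave differently.

```python
import re


def pattern_to_regex(pattern):
    """Convert a pattern such as '{sample}_L001_{num}_001' to a compiled regex."""
    # re.split with a capturing group alternates literal text and field names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)        # literal text between fields
        else:
            regex += f"(?P<{part}>.+)"      # named group for each {field}
    return re.compile(regex)


def parse_file_name(file_name, pattern, extension=".fastq.gz"):
    """Return the fields parsed from `file_name`, or None if it does not match."""
    if not file_name.endswith(extension):
        return None
    stem = file_name[: len(file_name) - len(extension)]
    match = pattern_to_regex(pattern).fullmatch(stem)
    return match.groupdict() if match else None


if __name__ == "__main__":
    fields = parse_file_name(
        "Healthy3-3021_S33_L001_R2_001.fastq.gz",
        "{sample}_L001_{num}_001",
    )
    print(fields)  # {'sample': 'Healthy3-3021_S33', 'num': 'R2'}
```

A file whose name does not fit the pattern, or whose extension differs from the configured one, returns `None`, which makes it easy to spot inputs that the configuration would not pick up.
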