A workflow is a series of tasks or programs executed in a specific order to achieve a goal. To automate the execution of these tasks, we use a workflow manager. This post introduces and provides a quick startup guide to Snakemake, a widely used workflow management system in bioinformatics.
🐍 Snakemake is a workflow manager that simplifies the creation and execution of workflows. Moreover, it offers robustness and scalability features.
🛠️ Setup
You can install Snakemake using the following command (make sure you have conda installed):
conda install bioconda::snakemake
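To confirm that the installation worked, you can print the installed version:
snakemake --version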
📝 Writing your first rule
To illustrate the use of Snakemake, we will write a rule to combine two CSV files into a single file.
A rule to concatenate two CSV files
We will see how to write a Snakemake rule to concatenate two files: one.csv and two.csv.
file: one.csv
Name,Age,City
Alice,25,New York
Bob,30,London
Charlie,28,Sydney
David,35,Toronto
Emma,22,Berlin
file: two.csv
Name,Age,City
Frank,27,Paris
Grace,32,Rome
Hannah,29,Tokyo
Ian,40,Madrid
Jack,23,Dublin
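If you want to follow along, the sketch below writes both sample files (the script name make_inputs.py is just a suggestion):
make_inputs.py
# Write the two sample CSV files used in this post
one = """Name,Age,City
Alice,25,New York
Bob,30,London
Charlie,28,Sydney
David,35,Toronto
Emma,22,Berlin
"""

two = """Name,Age,City
Frank,27,Paris
Grace,32,Rome
Hannah,29,Tokyo
Ian,40,Madrid
Jack,23,Dublin
"""

with open('one.csv', 'w') as f:
    f.write(one)
with open('two.csv', 'w') as f:
    f.write(two)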
We start by writing our first Snakemake rule to concatenate two CSV files. Each Snakemake rule follows a common syntax, which is shown below. We will then break down the rule and explain it in detail.
rule <rule_name>:
    input:
        <input_file_1>,
        <input_file_2>
    output:
        <output_file_1>,
        ...
    run/shell:
        """
        <commands to execute>
        """
Declare your rule name
The first line declares a Snakemake rule with the specified name. Every rule name must be unique (i.e., it must not conflict with other rule names in your workflow).
rule concatenate_csv:
Specify your input files
Next, we will specify the input & output files for the rule. Snakemake uses this information to determine dependencies among rules (e.g., to decide which rule should be executed next).
input:
'one.csv',
'two.csv'
Do not forget the comma after each input file when you list multiple input files.
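As a side note, Snakemake also supports named inputs, which can make larger rules easier to read; a minimal sketch:
input:
    first='one.csv',
    second='two.csv'
The files are then available as input.first and input.second inside the rule.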
Specify your output files
We will now specify our output file, i.e., third.csv.
✅ Snakemake only executes a rule when the output files are not available. In our case, when we run the workflow, Snakemake will automatically decide whether to execute a rule based on the availability of output files. 📂
🔄 If we execute the workflow a second time, Snakemake will not run the rule again because the output file is already there. 🎯
output: 'third.csv'
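If you do want to rerun a rule even though its output already exists, Snakemake provides force flags; for example:
# Re-execute the whole workflow despite existing outputs
snakemake --cores 1 --forceall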
Specify your log file
It is good practice to have a log file. It comes in very handy for troubleshooting errors when running a workflow with several rules.
log: 'concatenate.log'
Specify rule logic
This is where we execute the commands that achieve the goal of the rule. We can write Python code or shell commands here. In our case, we need logic to concatenate two CSV files; we will illustrate the use of both Python and shell.
run/shell
Use run for Python code and shell for shell commands.
run:
    import pandas as pd
    first = pd.read_csv('one.csv')
    second = pd.read_csv('two.csv')
    third = pd.concat([first, second])
    third.to_csv('third.csv')
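For comparison, here is a minimal sketch of the same rule written with shell instead of run (the rule name concatenate_csv_shell is just for illustration). It keeps the header row from the first file and appends only the data rows of both files:
rule concatenate_csv_shell:
    input:
        'one.csv',
        'two.csv'
    output:
        'third.csv'
    shell:
        """
        head -n 1 {input[0]} > {output}
        tail -n +2 {input[0]} >> {output}
        tail -n +2 {input[1]} >> {output}
        """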
Complete Snakefile
Our final rule will look like the following.
First version of Snakefile
This is our first version of the Snakefile.
Snakefile
rule concatenate_csv:
    input:
        'one.csv',
        'two.csv'
    output:
        'third.csv'
    run:
        import pandas as pd
        # Load csv files
        first = pd.read_csv('one.csv')
        second = pd.read_csv('two.csv')
        # Concatenate files
        third = pd.concat([first, second])
        # Save output file
        third.to_csv('third.csv')
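One small improvement worth noting: inside a run block, Snakemake exposes the rule's input and output objects, so the filenames do not have to be repeated. A sketch of the same rule using them:
rule concatenate_csv:
    input:
        'one.csv',
        'two.csv'
    output:
        'third.csv'
    run:
        import pandas as pd
        # input and output are provided by Snakemake inside run blocks
        first = pd.read_csv(input[0])
        second = pd.read_csv(input[1])
        third = pd.concat([first, second])
        # index=False avoids writing pandas row numbers into the CSV
        third.to_csv(output[0], index=False)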
Drawbacks of the Rule
Error Handling: Any error occurring during the execution of the Python code (e.g., File Not Found) is displayed only in the terminal. It would be better to store these errors in dedicated log files for each rule.
Flexibility: If we need to run the same workflow for different input files, we must manually modify multiple parts of the workflow, making it less adaptable. 🔄
Second version of Snakefile
In the second version, we will improve the workflow by making the following changes:
Modularizing the code – We will move the Python code to a separate script file and execute it using shell. Additionally, we will redirect both standard error and standard output to a log file for better error tracking.
Enhancing flexibility – Instead of hardcoding file names, we will store them in a configuration file and import them dynamically. This makes the workflow adaptable to different input files with minimal modifications.
Separate the logic from the Snakefile
We will now prepare a separate Python script for concatenating the two files.
concatenate.py
import sys
import pandas as pd

# Get filenames from command-line arguments
file1 = sys.argv[1]
file2 = sys.argv[2]
output_file = sys.argv[3]

# Load csv files
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

# Concatenate files
df_combined = pd.concat([df1, df2])
df_combined.to_csv(output_file, index=False)

print(f"Successfully merged {file1} and {file2} into {output_file}")
Storing all filenames in a config file
We will now write a configuration file that stores all varying information, such as the input and output file names.
config.yaml
first: 'one.csv'
second: 'two.csv'
output-file: 'third.csv'
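A useful side effect of a config file is that values can be overridden at the command line without editing any file. For example (alpha.csv is a hypothetical replacement input):
snakemake --cores 1 --config first=alpha.csv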
Workflow updates
We will enhance the workflow with the following changes:
Import config.yaml 📄 – We will extract file names dynamically from a configuration file instead of hardcoding them.
Modify input and output 🔄 – The workflow will now use extracted filenames from config.yaml, making it more flexible.
Add a log component 📜 – We will log all execution details for better debugging and tracking.
Execute concatenate.py via shell 🖥️ – The script will be executed with input and output filenames passed as arguments. ➡️ Additionally, both standard output and standard error will be redirected to the log file using &>{log}.
Snakefile
# Import config file
configfile: "config.yaml"

# Fetch file names
file1 = config['first']
file2 = config['second']
result = config['output-file']

rule concatenate_csv:
    input:
        file1,
        file2
    output:
        result
    log:
        'concatenate.log'
    shell:
        """
        python3 concatenate.py {file1} {file2} {result} &>{log}
        """
🚀 Execution
It is good practice to perform a sanity check of your rules. This can be done using the following command, known as a dry run.
snakemake -n
Executing this command will produce output like the following:
snakemake -n
Building DAG of jobs...
Job stats:
job count
--------------- -------
concatenate_csv 1
total 1
[Fri Jan 31 16:33:56 2025]
rule concatenate_csv:
input: one.csv, two.csv
output: third.csv
log: concatenate.log
jobid: 0
reason: Code has changed since last execution
resources: tmpdir=/var/folders/hh/gyd1cnc93nj8sffbhmnpbrfr0000gn/T
Job stats:
job count
--------------- -------
concatenate_csv 1
total 1
Reasons:
(check individual jobs above for details)
code has changed since last execution:
concatenate_csv
Some jobs were triggered by provenance information, see 'reason' section in the rule displays above.
If you prefer that only modification time is used to determine whether a job shall be executed, use the command line option '--rerun-triggers mtime' (also see --help).
If you are sure that a change for a certain output file (say, <outfile>) won't change the result (e.g. because you just changed the formatting of a script or environment definition), you can also wipe its metadata to skip such a trigger via 'snakemake --cleanup-metadata <outfile>'.
Rules with provenance triggered jobs: concatenate_csv
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Now we have our Snakefile ready. We will execute our workflow using the following command.
Running the workflow
To run our workflow, we need to specify the number of cores. We do not need to specify the Snakefile because Snakemake automatically searches for a file named Snakefile in the current directory.
snakemake --cores 1
On successful execution, a new file third.csv will be created, containing the records from both one.csv and two.csv.
You can explicitly specify the workflow file on the command line using the -s flag.
snakemake -s Snakefile_version2 --cores 1