A workflow is a series of tasks or programs executed in a specific order to achieve a goal. To automate the execution of these tasks, we use a workflow manager. This post introduces and provides a quick startup guide to Snakemake, a widely used workflow management system in bioinformatics.
🐍 Snakemake is a workflow manager that simplifies the creation and execution of workflows. Moreover, it offers robustness and scalability features.
🛠️ Setup
You can install Snakemake using the following command (make sure you have conda installed):
conda install bioconda::snakemake
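To confirm that the installation worked, you can print the installed version:
snakemake --version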
📝 Writing your first rule
To illustrate the use of Snakemake, we will write a rule to combine two CSV files into a single file.
A rule to concatenate two CSV files
We will see how to write a Snakemake rule to concatenate two files: one.csv and two.csv.
file: one.csv
Name,Age,City
Alice,25,New York
Bob,30,London
Charlie,28,Sydney
David,35,Toronto
Emma,22,Berlin
file: two.csv
Name,Age,City
Frank,27,Paris
Grace,32,Rome
Hannah,29,Tokyo
Ian,40,Madrid
Jack,23,Dublin
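If you want to follow along, the sketch below writes both sample files (the script name make_inputs.py is just a suggestion):
make_inputs.py
# Write the two sample CSV files used in this post
one = """Name,Age,City
Alice,25,New York
Bob,30,London
Charlie,28,Sydney
David,35,Toronto
Emma,22,Berlin
"""

two = """Name,Age,City
Frank,27,Paris
Grace,32,Rome
Hannah,29,Tokyo
Ian,40,Madrid
Jack,23,Dublin
"""

with open('one.csv', 'w') as f:
    f.write(one)
with open('two.csv', 'w') as f:
    f.write(two)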
We start by writing our first Snakemake rule to concatenate two CSV files. Each Snakemake rule follows a common syntax, which is shown below. We will then break down the rule and explain it in detail.
rule <rule_name>:
    input:
        <input_file_1>,
        <input_file_2>
    output:
        <output_file_1>,
        ...
    run/shell:
        """
        <commands to execute>
        """
Declare your rule name
The first line declares a Snakemake rule with the specified name. Every rule name must be unique (i.e., it must not conflict with other rule names in your workflow).
rule concatenate_csv:
Specify your input files
Next, we will specify the input & output files for the rule. Snakemake uses this information to determine dependencies among rules (e.g., to decide which rule should be executed next).
input:
'one.csv',
'two.csv'
Do not forget the comma after each input file when you list multiple input files.
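As a side note, Snakemake also supports named inputs, which can make larger rules easier to read; a minimal sketch:
input:
    first='one.csv',
    second='two.csv'
The files are then available as input.first and input.second inside the rule.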
Specify your output files
We will now specify our output file, i.e., third.csv.
✅ Snakemake only executes a rule when the output files are not available. In our case, when we run the workflow, Snakemake will automatically decide whether to execute a rule based on the availability of output files. 📂
🔄 If we execute the workflow a second time, Snakemake will not run the rule again because the output file is already there. 🎯
output: 'third.csv'
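If you do want to rerun a rule even though its output already exists, Snakemake provides force flags; for example:
# Re-execute the whole workflow despite existing outputs
snakemake --cores 1 --forceall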
Specify your log file
It is good practice to have a log file. It comes in very handy for troubleshooting errors when running a workflow with several rules.
log: 'concatenate.log'
Specify rule logic
This is where we execute the commands that achieve the goal of the rule. We can write Python code or shell commands here. In our case, we need logic to concatenate two CSV files; we will illustrate the use of both Python and shell.
run/shell
Use run for Python code and shell for shell commands.
run:
    import pandas as pd
    first = pd.read_csv('one.csv')
    second = pd.read_csv('two.csv')
    third = pd.concat([first, second])
    third.to_csv('third.csv')
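For comparison, here is a minimal sketch of the same rule written with shell instead of run (the rule name concatenate_csv_shell is just for illustration). It keeps the header row from the first file and appends only the data rows of both files:
rule concatenate_csv_shell:
    input:
        'one.csv',
        'two.csv'
    output:
        'third.csv'
    shell:
        """
        head -n 1 {input[0]} > {output}
        tail -n +2 {input[0]} >> {output}
        tail -n +2 {input[1]} >> {output}
        """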
Complete Snakefile
Our final rule will look like the following.
First version of Snakefile
This is our first version of the Snakefile.
Snakefile
rule concatenate_csv:
    input:
        'one.csv',
        'two.csv'
    output:
        'third.csv'
    run:
        import pandas as pd
        # Load csv files
        first = pd.read_csv('one.csv')
        second = pd.read_csv('two.csv')
        # Concatenate files
        third = pd.concat([first, second])
        # Save output file
        third.to_csv('third.csv')
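One small improvement worth noting: inside a run block, Snakemake exposes the rule's input and output objects, so the filenames do not have to be repeated. A sketch of the same rule using them:
rule concatenate_csv:
    input:
        'one.csv',
        'two.csv'
    output:
        'third.csv'
    run:
        import pandas as pd
        # input and output are provided by Snakemake inside run blocks
        first = pd.read_csv(input[0])
        second = pd.read_csv(input[1])
        third = pd.concat([first, second])
        # index=False avoids writing pandas row numbers into the CSV
        third.to_csv(output[0], index=False)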
Drawbacks of the Rule
Error Handling: Any error occurring during the execution of the Python code (e.g., File Not Found) is displayed only in the terminal. It would be better to store these errors in dedicated log files for each rule.
Flexibility: If we need to run the same workflow for different input files, we must manually modify multiple parts of the workflow, making it less adaptable. 🔄
Second version of Snakefile
In the second version, we will improve the workflow by making the following changes:
Modularizing the code – We will move the Python code to a separate script file and execute it using shell. Additionally, we will redirect both standard error and standard output to a log file for better error tracking.
Enhancing flexibility – Instead of hardcoding file names, we will store them in a configuration file and import them dynamically. This makes the workflow adaptable to different input files with minimal modifications.
Separate the logic from the Snakefile
We will now prepare a separate Python script for concatenating the two files.
concatenate.py
import sys
import pandas as pd

# Get filenames from command-line arguments
file1 = sys.argv[1]
file2 = sys.argv[2]
output_file = sys.argv[3]

# Load csv files
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)

# Concatenate files
df_combined = pd.concat([df1, df2])
df_combined.to_csv(output_file, index=False)

print(f"Successfully merged {file1} and {file2} into {output_file}")
Storing all filenames in a config file
We will now write a configuration file that stores all varying information, such as the input and output file names.
config.yaml
first: 'one.csv'
second: 'two.csv'
output-file: 'third.csv'
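A useful side effect of a config file is that values can be overridden at the command line without editing any file. For example (alpha.csv is a hypothetical replacement input):
snakemake --cores 1 --config first=alpha.csv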
Workflow updates
We will enhance the workflow with the following changes:
Import config.yaml 📄 – We will extract file names dynamically from a configuration file instead of hardcoding them.
Modify input and output 🔄 – The workflow will now use extracted filenames from config.yaml, making it more flexible.
Add a log component 📜 – We will log all execution details for better debugging and tracking.
Execute concatenate.py via shell 🖥️ – The script will be executed with input and output filenames passed as arguments. ➡️ Additionally, both standard output and standard error will be redirected to the log file using &>{log}.
Snakefile
# Import config file
configfile: "config.yaml"

# Fetch file names
file1 = config['first']
file2 = config['second']
result = config['output-file']

rule concatenate_csv:
    input:
        file1,
        file2
    output:
        result
    log:
        'concatenate.log'
    shell:
        """
        python3 concatenate.py {file1} {file2} {result} &>{log}
        """
🚀 Execution
It is good practice to perform a sanity check of your rules. This can be done using the following command, known as a dry run.
snakemake -n
Executing this command will produce output like the following:
snakemake -n
Building DAG of jobs...
Job stats:
job count
--------------- -------
concatenate_csv 1
total 1
[Fri Jan 31 16:33:56 2025]
rule concatenate_csv:
input: one.csv, two.csv
output: third.csv
log: concatenate.log
jobid: 0
reason: Code has changed since last execution
resources: tmpdir=/var/folders/hh/gyd1cnc93nj8sffbhmnpbrfr0000gn/T
Job stats:
job count
--------------- -------
concatenate_csv 1
total 1
Reasons:
(check individual jobs above for details)
code has changed since last execution:
concatenate_csv
Some jobs were triggered by provenance information, see 'reason' section in the rule displays above.
If you prefer that only modification time is used to determine whether a job shall be executed, use the command line option '--rerun-triggers mtime' (also see --help).
If you are sure that a change for a certain output file (say, <outfile>) won't change the result (e.g. because you just changed the formatting of a script or environment definition), you can also wipe its metadata to skip such a trigger via 'snakemake --cleanup-metadata <outfile>'.
Rules with provenance triggered jobs: concatenate_csv
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Now we have our Snakefile ready. We will execute our workflow using the following command.
Running the workflow
To run our workflow, we need to specify the number of cores. We do not need to specify the Snakefile because Snakemake automatically searches for a file named Snakefile in the current directory.
snakemake --cores 1
On successful execution, a new file third.csv will be created, containing the records from both one.csv and two.csv.
You can explicitly specify the workflow file on the command line using the -s flag.
snakemake -s Snakefile_version2 --cores 1