Reproducible bioinformatics with `Snakemake`

The why and the how

Sibbe Bakker

AG Walther

2024-07-05

Introduction

Why reproducibility?

Is this familiar?

You see a computational tool you want to use.
You spend a full day installing it.

How can we all solve this?

by starting small

Workflow languages

There are programming languages to solve this exact problem.

Snakemake [2]
Nextflow [3].
CLW [4]
Many others \(\dots\)

Snakemake

Why would you use it

You work in bioinformatics
You know python
You want to start quickly.
Your work is based on files.
Good documentation.

Why wouldn’t you use it

You do not need to make a workflow
If your dependencies can’t be installed using conda.
If your programme needs to be used by non experts.

What is a snakemake workflow?

## sha
rule sha:
  input: "data/{filename}.{ex}"
  output: "results/{filename}.{ex}.sha"
  shell: "sha1sum  {input} | cut -d ' ' -f 1 > {output}"

snakemake results/test.txt.sha
# --> Snakemake does: 
# sha1sum data/test.txt | cut -d ' ' -f 1 > results/test.txt.sha

Getting started\(\dots\)

Get Snakemake with the following commands¹

# Global instalation, assuming a bashlike environment...
conda install -n base -c conda-forge mamba
mamba create -c conda-forge--c bioconda -n snakemakes snakemake
mamba activate snakemake

Lets get started with an example problem…

Read mapping with Snakemake

Two strains in an evolution experiment.

Experimental work

We have a culture of Ancestor bacteria, we apply an evolution experiment to two cultures.

Initialisation

We work on a git repository.

We work with a set file structure.

We specificy dependencies via conda.

$ git init snakemake-example
$ mkdir -p workflow/envs workflow/scripts
$ touch workflow/Snakefile
$ mkdir data results
$ tree
├── LICENSE # How can you use it
├── README.md # How to use it
|── results/
|── data/
└── workflow/
    ├── envs/
    |── scripts/
    └── Snakefile

First rule: QC (i)

$ cd data
$ wget https://osf.io/2jc4a/download -O download.tar.gz
$ tar -xzf download.tar.gz
$ ls data
anc_R1.fastq.gz  anc_R2.fastq.gz  evol1_R1.fastq.gz
evol1_R2.fastq.gz  evol2_R1.fastq.gz  evol2_R2.fastq.gz

Now quality control:

We need fastqc and multiqc.
How?
From prefix.dev.

First rule: QC (ii)

$ touch workflow/envs/qc.yml
$ open workflow/envs/qc.yml

Then add the following contents \(\dots\)

name: quality-control
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc # quality scoring
  - multiqc # for reporting
  - fastp # Trimming reads

From now we need to run with the --use-conda parameter.

First rule: QC (iii)

Now we type the rule\(\dots\)

$ open workflow/Snakefile

Let’s try \(\dots\)

rule fastqc:
  conda: "envs/qc.yml"
  input: "data/data/{seqname}.fastq.gz"
  output: directory("results/quality-control")
  shell: "mkdir -p {output}; fastqc -i {input} -o {output}"

Test it \(\dots\)

$ snakemake --use-conda results/quality-control
WildcardError in rule fastqc in
snakemake-tutorial/workflow/Snakefile,
line 1: Wildcards in input files cannot be determined from output files:
(rule fastqc, line 4, snakemake-tutorial/workflow/Snakefile)
'seqname'

First rule: QC (iv)

That did not work

rule fastqc:
  conda: "envs/qc.yml"
  input: expand("data/data/{seqname}.fastq.gz", 
      seqname=glob_wildcards("data/data/{seqname}.fastq.gz").seqname)
  output: directory("results/quality-control/fastqc")
  shell: "mkdir {output}; fastqc {input} -o {output}"

This does!

$ tree results
results/
└── quality-control
    ├── anc_R1_fastqc.html
    ├── evol1_R1_fastqc.html
    ...
    ├── evol2_R2_fastqc.html
    └── evol2_R2_fastqc.zip

Intermission: why didn’t that work?

Wildcards are determined from the output!

The Directed acyclic graph [5].

Summarising QC

rule multiqc:
  conda: "envs/qc.yml"
  input: rules.fastqc.output
  output: directory("results/quality-control/multiqc")
  shell: "multiqc {input} -o {output}"

Multiqc output

Advanced usage

see the documentation for more information

Usefull `cli` options to Snakemake.

killall -TERM snakemake: Stops the submission of new tasks and kills snakemake when tasks have finished.
snakemake --dry-run --printshellcmds: Prints the rules that will be executed, with the command. Without running the rules.
snakemake -c 1 --forceall --dag | dot -Tpng > dag.png: Write a png file of the DAG to disk.

Specification of `Pypi` dependencies

name: my_snakemake_env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - your_other_conda_dependencies
  - pip:
    - metapub

Introduction to the `pixi` package manager

$ pixi init project
✔ Initialized project in /home/user/project

➜ cd project
$ pixi add python numpy cmake
✔ Added python 3.11.4.*
✔ Added numpy 1.25.2.*
✔ Added cmake 3.26.4.*

$ pixi run python --version
Python 3.11.4

$ pixi add python==3.10
✔ Added python ==3.10.0.*

$ pixi run python --version
Python 3.10.

The `toml` file

[project]
name = "project"
version = "0.1.0"
description = "Example project"
authors = ["Pixi <hi@prefix.dev>"]
channels = ["conda-forge"]
platforms = ["linux-64", "win-64", "osx-arm64", "osx-64"]

[tasks]
start = "python main.py"

[dependencies]
python = "3.11.4.*"
numpy = "1.25.2.*"
cmake = "3.26.4.*"

Where are things stored?

$ tree
├── LICENSE
├── pixi.toml
├── pixi.lock
├── README.md
|── results/
|── data/
|── workflow/
|   ├── envs/
|   |── scripts/
|   └── Snakefile
|──── .snakemake/
└──── .pixi/

Questions that may remain

How well does CHATGPT understand Snakemake?

Well enough to explain code and design simple rules.

What if I want to have a command with braces?

you escape them as such:

# wrong ->  grep "{a}"
grep "{{a}}" # correct

I want to use a file in a directory?

Problem

rule random:
  output: directory("results/random-{n}")
  shell:
    """
    mkdir {output} -p;
    for i in $(seq 1 {wildcards.n});  do
     filename="$((1 + $RANDOM % 10))-number"
     touch "{output}/$filename";  done
    """

$ ls
results/random-2/3-number
results/random-2/6-number

We want the sha string.

Solution

import glob
def get_random(wildcards):
  out=(checkpoints
    .random
    .get(**wildcards))
  f = out.output
  files = (glob
    .glob(f"{f}/*-number"))
  return files

rule sha:
  input: get_random
  output: 
    "results/sha-{n}.ssv"
  shell: 
    "shasum \
      {input} > {output}"

Further information

What I based my presentation on

Reproducible Data Analytic Workflows for Economics

Repositories

This presentation:
mpi-snakemake-overview.
The tutorial for readmapping:
snakemake tutorial.
Example workflow for the raven cluster:
snakemake cluster.

Cited works

Kim Y-M, Poline J-B, Dumas G (2017) Experimenting with reproducibility in bioinformatics. https://www.biorxiv.org/content/10.1101/143503v2. Accessed 3 Jul 2024

Köster J, Rahmann S (2012) Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28(19):2520–2522. https://doi.org/10.1093/bioinformatics/bts480

Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C (2017) Nextflow enables reproducible computational workflows. Nat Biotechnol 35(4):316–319. https://doi.org/10.1038/nbt.3820

Crusoe MR, Abeln S, Iosup A, et al (2022) Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language. Commun ACM 65(6):54–63. https://doi.org/10.1145/3486897

Snakemake for Bioinformatics: How Snakemake plans its jobs. https://carpentries-incubator.github.io/snakemake-novice-bioinformatics/04-the_dag.html. Accessed 3 Jul 2024