Reproducible bioinformatics with Snakemake

The why and the how

Sibbe Bakker

AG Walther

2024-07-05

Introduction

Why reproducibility?

Is this familiar?

  1. You see a computational tool you want to use.

  2. You spend a full day installing it.

How can we all solve this?

by starting small

The reproducibility iceberg [1]

Workflow languages

There are programming languages to solve this exact problem.

  • Snakemake [2]

  • Nextflow [3].

  • CLW [4]

  • Many others \(\dots\)

Snakemake

Why would you use it

  • You work in bioinformatics

  • You know python

  • You want to start quickly.

  • Your work is based on files.

  • Good documentation.

Why wouldn’t you use it

  • You do not need to make a workflow

  • If your dependencies can’t be installed using conda.

  • If your programme needs to be used by non experts.

What is a snakemake workflow?

## sha
rule sha:
  input: "data/{filename}.{ex}"
  output: "results/{filename}.{ex}.sha"
  shell: "sha1sum  {input} | cut -d ' ' -f 1 > {output}"
snakemake results/test.txt.sha
# --> Snakemake does: 
# sha1sum data/test.txt | cut -d ' ' -f 1 > results/test.txt.sha

Getting started\(\dots\)

Get Snakemake with the following commands1

# Global instalation, assuming a bashlike environment...
conda install -n base -c conda-forge mamba
mamba create -c conda-forge--c bioconda -n snakemakes snakemake
mamba activate snakemake

Lets get started with an example problem…

Read mapping with Snakemake

Two strains in an evolution experiment.


Experimental work

We have a culture of Ancestor bacteria, we apply an evolution experiment to two cultures.

Initialisation

  • We work on a git repository.
  • We work with a set file structure.
  • We specificy dependencies via conda.
$ git init snakemake-example
$ mkdir -p workflow/envs workflow/scripts
$ touch workflow/Snakefile
$ mkdir data results
$ tree
├── LICENSE # How can you use it
├── README.md # How to use it
|── results/
|── data/
└── workflow/
    ├── envs/
    |── scripts/
    └── Snakefile

First rule: QC (i)

$ cd data
$ wget https://osf.io/2jc4a/download -O download.tar.gz
$ tar -xzf download.tar.gz
$ ls data
anc_R1.fastq.gz  anc_R2.fastq.gz  evol1_R1.fastq.gz
evol1_R2.fastq.gz  evol2_R1.fastq.gz  evol2_R2.fastq.gz

Now quality control:

  • We need fastqc and multiqc.

  • How?

  • From prefix.dev.

First rule: QC (ii)

$ touch workflow/envs/qc.yml
$ open workflow/envs/qc.yml

Then add the following contents \(\dots\)

name: quality-control
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc # quality scoring
  - multiqc # for reporting
  - fastp # Trimming reads

From now we need to run with the --use-conda parameter.

First rule: QC (iii)

Now we type the rule\(\dots\)

$ open workflow/Snakefile

Let’s try \(\dots\)

rule fastqc:
  conda: "envs/qc.yml"
  input: "data/data/{seqname}.fastq.gz"
  output: directory("results/quality-control")
  shell: "mkdir -p {output}; fastqc -i {input} -o {output}"

Test it \(\dots\)

$ snakemake --use-conda results/quality-control
WildcardError in rule fastqc in
snakemake-tutorial/workflow/Snakefile,
line 1: Wildcards in input files cannot be determined from output files:
(rule fastqc, line 4, snakemake-tutorial/workflow/Snakefile)
'seqname'

First rule: QC (iv)

  • That did not work
rule fastqc:
  conda: "envs/qc.yml"
  input: expand("data/data/{seqname}.fastq.gz", 
      seqname=glob_wildcards("data/data/{seqname}.fastq.gz").seqname)
  output: directory("results/quality-control/fastqc")
  shell: "mkdir {output}; fastqc {input} -o {output}"
  • This does!
$ tree results
results/
└── quality-control
    ├── anc_R1_fastqc.html
    ├── evol1_R1_fastqc.html
    ...
    ├── evol2_R2_fastqc.html
    └── evol2_R2_fastqc.zip

Intermission: why didn’t that work?

Wildcards are determined from the output!

image/svg+xml asample:ref1_2 countreads indir: trimmed 4 trimreads asample: ref1_2 2 kallisto_quant sample: ref1 0 countreads indir: trimmed asample:ref1_1 1 trimreads asample: ref1_1 3 kallisto_index strain: Saccharomyces_cerevisiae.R64-1-1 transcriptome/Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz reads/ref1_2.fq reads/ref1_1.fq trimmed.ref1_1.fq.count trimmed.ref1_2.fq.count kallisto.ref1/abundance.h5kallisto.ref1/abundance.tsvkallisto.ref1/run_info.json

The Directed acyclic graph [5].

Summarising QC

rule multiqc:
  conda: "envs/qc.yml"
  input: rules.fastqc.output
  output: directory("results/quality-control/multiqc")
  shell: "multiqc {input} -o {output}"

Multiqc output

Advanced usage

see the documentation for more information

Usefull cli options to Snakemake.

killall -TERM snakemake
Stops the submission of new tasks and kills snakemake when tasks have finished.
snakemake --dry-run --printshellcmds
Prints the rules that will be executed, with the command. Without running the rules.
snakemake -c 1 --forceall --dag | dot -Tpng > dag.png
Write a png file of the DAG to disk.

Specification of Pypi dependencies

name: my_snakemake_env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - your_other_conda_dependencies
  - pip:
    - metapub

Sharing your workflow

So now we have snakemake, conda and possibly other dependencies not yet handled \(\dots\).

curl -fsSL https://pixi.sh/install.sh | bash

Introduction to the pixi package manager

$ pixi init project
 Initialized project in /home/user/project

 cd project
$ pixi add python numpy cmake
 Added python 3.11.4.*
 Added numpy 1.25.2.*
 Added cmake 3.26.4.*

$ pixi run python --version
Python 3.11.4

$ pixi add python==3.10
 Added python ==3.10.0.*

$ pixi run python --version
Python 3.10.

The toml file

[project]
name = "project"
version = "0.1.0"
description = "Example project"
authors = ["Pixi <hi@prefix.dev>"]
channels = ["conda-forge"]
platforms = ["linux-64", "win-64", "osx-arm64", "osx-64"]

[tasks]
start = "python main.py"

[dependencies]
python = "3.11.4.*"
numpy = "1.25.2.*"
cmake = "3.26.4.*"

Where are things stored?

$ tree
├── LICENSE
├── pixi.toml
├── pixi.lock
├── README.md
|── results/
|── data/
|── workflow/
|   ├── envs/
|   |── scripts/
|   └── Snakefile
|──── .snakemake/
└──── .pixi/

Questions that may remain

How well does CHATGPT understand Snakemake?

Well enough to explain code and design simple rules.

What if I want to have a command with braces?

you escape them as such:

# wrong ->  grep "{a}"
grep "{{a}}" # correct

I want to use a file in a directory?

Problem

rule random:
  output: directory("results/random-{n}")
  shell:
    """
    mkdir {output} -p;
    for i in $(seq 1 {wildcards.n});  do
     filename="$((1 + $RANDOM % 10))-number"
     touch "{output}/$filename";  done
    """
$ ls
results/random-2/3-number
results/random-2/6-number

We want the sha string.

Solution

import glob
def get_random(wildcards):
  out=(checkpoints
    .random
    .get(**wildcards))
  f = out.output
  files = (glob
    .glob(f"{f}/*-number"))
  return files

rule sha:
  input: get_random
  output: 
    "results/sha-{n}.ssv"
  shell: 
    "shasum \
      {input} > {output}"

Further information

What I based my presentation on

Repositories

Cited works

1.
Kim Y-M, Poline J-B, Dumas G (2017) Experimenting with reproducibility in bioinformatics. https://www.biorxiv.org/content/10.1101/143503v2. Accessed 3 Jul 2024
2.
Köster J, Rahmann S (2012) Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28(19):2520–2522. https://doi.org/10.1093/bioinformatics/bts480
3.
Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C (2017) Nextflow enables reproducible computational workflows. Nat Biotechnol 35(4):316–319. https://doi.org/10.1038/nbt.3820
4.
Crusoe MR, Abeln S, Iosup A, et al (2022) Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language. Commun ACM 65(6):54–63. https://doi.org/10.1145/3486897
5.
Snakemake for Bioinformatics: How Snakemake plans its jobs. https://carpentries-incubator.github.io/snakemake-novice-bioinformatics/04-the_dag.html. Accessed 3 Jul 2024