Snakemake
The why and the how
AG Walther
2024-07-05
Why reproducibility?
You see a computational tool you want to use.
You spend a full day installing it.
How can we all solve this?
by starting small
There are programming languages to solve this exact problem.
Why would you use it
You work in bioinformatics
You know python
You want to start quickly.
Your work is based on files.
Good documentation.
Why wouldn’t you use it
You do not need to make a workflow
If your dependencies can’t be installed using conda.
If your programme needs to be used by non experts.
Get Snakemake with the following commands1
# Global instalation, assuming a bashlike environment...
conda install -n base -c conda-forge mamba
mamba create -c conda-forge--c bioconda -n snakemakes snakemake
mamba activate snakemake
Lets get started with an example problem…
Two strains in an evolution experiment.
We have a culture of Ancestor bacteria, we apply an evolution experiment to two cultures.
$ cd data
$ wget https://osf.io/2jc4a/download -O download.tar.gz
$ tar -xzf download.tar.gz
$ ls data
anc_R1.fastq.gz anc_R2.fastq.gz evol1_R1.fastq.gz
evol1_R2.fastq.gz evol2_R1.fastq.gz evol2_R2.fastq.gz
Now quality control:
We need fastqc
and multiqc
.
How?
From prefix.dev
.
Then add the following contents \(\dots\)
name: quality-control
channels:
- conda-forge
- bioconda
dependencies:
- fastqc # quality scoring
- multiqc # for reporting
- fastp # Trimming reads
From now we need to run with the --use-conda
parameter.
Now we type the rule\(\dots\)
$ open workflow/Snakefile
Let’s try \(\dots\)
rule fastqc:
conda: "envs/qc.yml"
input: "data/data/{seqname}.fastq.gz"
output: directory("results/quality-control")
shell: "mkdir -p {output}; fastqc -i {input} -o {output}"
Test it \(\dots\)
rule fastqc:
conda: "envs/qc.yml"
input: expand("data/data/{seqname}.fastq.gz",
seqname=glob_wildcards("data/data/{seqname}.fastq.gz").seqname)
output: directory("results/quality-control/fastqc")
shell: "mkdir {output}; fastqc {input} -o {output}"
Wildcards are determined from the output!
rule multiqc:
conda: "envs/qc.yml"
input: rules.fastqc.output
output: directory("results/quality-control/multiqc")
shell: "multiqc {input} -o {output}"
Multiqc output
see the documentation for more information
cli
options to Snakemake.killall -TERM snakemake
snakemake --dry-run --printshellcmds
snakemake -c 1 --forceall --dag | dot -Tpng > dag.png
Pypi
dependenciesSo now we have snakemake, conda and possibly other dependencies not yet handled \(\dots\).
pixi
package manager$ pixi init project
✔ Initialized project in /home/user/project
➜ cd project
$ pixi add python numpy cmake
✔ Added python 3.11.4.*
✔ Added numpy 1.25.2.*
✔ Added cmake 3.26.4.*
$ pixi run python --version
Python 3.11.4
$ pixi add python==3.10
✔ Added python ==3.10.0.*
$ pixi run python --version
Python 3.10.
toml
file[project]
name = "project"
version = "0.1.0"
description = "Example project"
authors = ["Pixi <hi@prefix.dev>"]
channels = ["conda-forge"]
platforms = ["linux-64", "win-64", "osx-arm64", "osx-64"]
[tasks]
start = "python main.py"
[dependencies]
python = "3.11.4.*"
numpy = "1.25.2.*"
cmake = "3.26.4.*"
$ tree
├── LICENSE
├── pixi.toml
├── pixi.lock
├── README.md
|── results/
|── data/
|── workflow/
| ├── envs/
| |── scripts/
| └── Snakefile
|──── .snakemake/
└──── .pixi/
Well enough to explain code and design simple rules.
you escape them as such:
This presentation:
mpi-snakemake-overview
.
The tutorial for readmapping:
snakemake tutorial
.
Example workflow for the raven cluster:
snakemake cluster
.