7 Project Organization
Note: this .Rmd is run in bash because it uses several bash-specific commands and no zsh-specific ones. If you are running zsh, type bash in your terminal to switch shells before you begin.
Getting your project started
Project organization is one of the most important parts of a sequencing project, and yet is often overlooked amidst the excitement of getting a first look at new data. Of course, while it’s best to get yourself organized before you even begin your analyses, it’s never too late to start, either.
You should approach your sequencing project similarly to how you do a biological experiment and this ideally begins with experimental design. We’re going to assume that you’ve already designed a beautiful sequencing experiment to address your biological question, collected appropriate samples, and that you have enough statistical power to answer the questions you’re interested in asking. These steps are all incredibly important, but beyond the scope of our course. For all of those steps (collecting specimens, extracting DNA, prepping your samples) you’ve likely kept a lab notebook that details how and why you did each step. However, the process of documentation doesn’t stop at the sequencer!
Genomics projects can quickly accumulate hundreds of files across tens of folders. Every computational analysis you perform over the course of your project is going to create many files, which becomes a particular problem when you inevitably want to run some of those analyses again. For instance, you might have made significant headway into your project, but then have to remember the PCR conditions you used to create your sequencing library months prior.
Other questions might arise along the way:
- What were your best alignment results?
- Which folder were they in: Analysis1, AnalysisRedone, or AnalysisRedone2?
- Which quality cutoff did you use?
- What version of a given program did you implement your analysis in?
Good documentation is key to avoiding this issue, and luckily enough, recording your computational experiments is even easier than recording lab data. Copy/Paste will become your best friend, sensible file names will make your analysis understandable by you and your collaborators, and writing the methods section for your next paper will be easy! Remember that in any given project of yours, it’s worthwhile to consider a future version of yourself as an entirely separate collaborator. The better your documentation is, the more this ‘collaborator’ will feel indebted to you!
With this in mind, let’s have a look at the best practices for documenting your genomics project. Your future self will thank you.
In this exercise we will set up a file system for the project we will be working on for the variant section.
We will start by creating a directory that we can use for the variant section. First navigate to your home directory. Then confirm that you are in the correct directory using the pwd command.
cd ~
pwd
## /Users/ggiaever
You should see the output:
/Users/your username
7.1 Tip
If you aren’t in your home directory, the easiest way to get there is to enter the command cd ~, which always returns you to home.
7.2 Exercise
Use the mkdir command to make the following directories:
dc_workshop
dc_workshop/docs
dc_workshop/data
dc_workshop/results
Note: if you’ve already downloaded the fastq files into ~/dc_workshop/data/untrimmed_fastq, be careful not to remake the data directory.
7.3 Solution
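One way to create these directories from your home directory is sketched below. The -p option is an addition that the exercise itself does not ask for: it creates missing parent directories and does not complain if a directory already exists, so an existing data directory and its contents are left untouched.

```bash
# Start from the home directory so the paths below are relative to it
cd ~

# -p creates dc_workshop first if needed and leaves existing directories alone
mkdir -p dc_workshop/docs dc_workshop/data dc_workshop/results
```

Running mkdir once per directory, without -p, works equally well as long as dc_workshop is created before its subdirectories.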
Use ls -R to verify that you have created these directories. The -R option for ls stands for recursive. This option causes ls to list the contents of each subdirectory within the directory, and of every subdirectory below those in turn.
cd ~
ls -R dc_workshop
## data
## docs
## fastqc_html
## files_for_igv
## results
## scripts
##
## dc_workshop/data:
## ref_genome
## trimmed_fastq
## trimmed_fastq_small
## untrimmed_fastq
## untrimmed_fastq.zip
##
## dc_workshop/data/ref_genome:
## ecoli_rel606.fasta
## ecoli_rel606.fasta.amb
## ecoli_rel606.fasta.ann
## ecoli_rel606.fasta.bwt
## ecoli_rel606.fasta.fai
## ecoli_rel606.fasta.pac
## ecoli_rel606.fasta.sa
##
## dc_workshop/data/trimmed_fastq:
## SRR2584863_1.trim.fastq
## SRR2584863_1.trim.fastq.gz
## SRR2584863_1un.trim.fastq.gz
## SRR2584863_2.trim.fastq
## SRR2584863_2.trim.fastq.gz
## SRR2584863_2un.trim.fastq.gz
## SRR2584866_1.trim.fastq.gz
## SRR2584866_1un.trim.fastq.gz
## SRR2584866_2.trim.fastq.gz
## SRR2584866_2un.trim.fastq.gz
## SRR2589044_1.trim.fastq.gz
## SRR2589044_1un.trim.fastq.gz
## SRR2589044_2.trim.fastq.gz
## SRR2589044_2un.trim.fastq.gz
##
## dc_workshop/data/trimmed_fastq_small:
## SRR2584863_1.trim.sub.fastq
## SRR2584863_2.trim.sub.fastq
## SRR2584866_1.trim.sub.fastq
## SRR2584866_2.trim.sub.fastq
## SRR2589044_1.trim.sub.fastq
## SRR2589044_2.trim.sub.fastq
##
## dc_workshop/data/untrimmed_fastq:
## NexteraPE-PE.fa
## SRR2584863_1.fastq.gz
## SRR2584863_2.fastq.gz
## SRR2584866_1.fastq.gz
## SRR2584866_2.fastq.gz
## SRR2589044_1.fastq.gz
## SRR2589044_2.fastq.gz
##
## dc_workshop/docs:
## fastqc_summaries.txt
##
## dc_workshop/fastqc_html:
## trimmed
## untrimmed
##
## dc_workshop/fastqc_html/trimmed:
## SRR2584863_1.trim_fastqc.html
## SRR2584863_2.trim_fastqc.html
## SRR2584866_1.trim_fastqc.html
## SRR2584866_2.trim_fastqc.html
## SRR2589044_1.trim_fastqc.html
## SRR2589044_2.trim_fastqc.html
##
## dc_workshop/fastqc_html/untrimmed:
## SRR2584863_1_fastqc.html
## SRR2584863_2_fastqc.html
## SRR2584866_1_fastqc.html
## SRR2584866_2_fastqc.html
## SRR2589044_1_fastqc.html
## SRR2589044_2_fastqc.html
##
## dc_workshop/files_for_igv:
## SRR2584866.aligned.sorted.bam
## SRR2584866.aligned.sorted.bam.bai
## SRR2584866_final_variants.vcf
## ecoli_rel606.fasta
## ecoli_rel606.fasta.fai
##
## dc_workshop/results:
## bam
## bcf
## fastqc_untrimmed_reads
## sam
## trimmed_fastqc
## vcf
##
## dc_workshop/results/bam:
## SRR2584863.aligned.bam
## SRR2584863.aligned.sorted.bam
## SRR2584863.aligned.sorted.bam.bai
## SRR2584866.aligned.bam
## SRR2584866.aligned.sorted.bam
## SRR2584866.aligned.sorted.bam.bai
## SRR2589044.aligned.bam
## SRR2589044.aligned.sorted.bam
## SRR2589044.aligned.sorted.bam.bai
##
## dc_workshop/results/bcf:
## SRR2584863_raw.bcf
## SRR2584863_variants.vcf
## SRR2584866_raw.bcf
## SRR2584866_variants.vcf
## SRR2589044_raw.bcf
## SRR2589044_variants.vcf
##
## dc_workshop/results/fastqc_untrimmed_reads:
## SRR2584863_1_fastqc
## SRR2584863_1_fastqc.html
## SRR2584863_1_fastqc.zip
## SRR2584863_2_fastqc
## SRR2584863_2_fastqc.html
## SRR2584863_2_fastqc.zip
## SRR2584866_1_fastqc
## SRR2584866_1_fastqc.html
## SRR2584866_1_fastqc.zip
## SRR2584866_2_fastqc
## SRR2584866_2_fastqc.html
## SRR2584866_2_fastqc.zip
## SRR2589044_1_fastqc
## SRR2589044_1_fastqc.html
## SRR2589044_1_fastqc.zip
## SRR2589044_2_fastqc
## SRR2589044_2_fastqc.html
## SRR2589044_2_fastqc.zip
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2584863_1_fastqc:
## Icons
## Images
## fastqc.fo
## fastqc_data.txt
## fastqc_report.html
## summary.txt
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2584863_1_fastqc/Icons:
## error.png
## fastqc_icon.png
## tick.png
## warning.png
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2584863_1_fastqc/Images:
## adapter_content.png
## duplication_levels.png
## per_base_n_content.png
## per_base_quality.png
## per_base_sequence_content.png
## per_sequence_gc_content.png
## per_sequence_quality.png
## per_tile_quality.png
## sequence_length_distribution.png
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2584863_2_fastqc:
## Icons
## Images
## fastqc.fo
## fastqc_data.txt
## fastqc_report.html
## summary.txt
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2584863_2_fastqc/Icons:
## error.png
## fastqc_icon.png
## tick.png
## warning.png
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2584863_2_fastqc/Images:
## adapter_content.png
## duplication_levels.png
## per_base_n_content.png
## per_base_quality.png
## per_base_sequence_content.png
## per_sequence_gc_content.png
## per_sequence_quality.png
## per_tile_quality.png
## sequence_length_distribution.png
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2584866_1_fastqc:
## Icons
## Images
## fastqc.fo
## fastqc_data.txt
## fastqc_report.html
## summary.txt
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2584866_1_fastqc/Icons:
## error.png
## fastqc_icon.png
## tick.png
## warning.png
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2584866_1_fastqc/Images:
## adapter_content.png
## duplication_levels.png
## per_base_n_content.png
## per_base_quality.png
## per_base_sequence_content.png
## per_sequence_gc_content.png
## per_sequence_quality.png
## per_tile_quality.png
## sequence_length_distribution.png
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2584866_2_fastqc:
## Icons
## Images
## fastqc.fo
## fastqc_data.txt
## fastqc_report.html
## summary.txt
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2584866_2_fastqc/Icons:
## error.png
## fastqc_icon.png
## tick.png
## warning.png
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2584866_2_fastqc/Images:
## adapter_content.png
## duplication_levels.png
## per_base_n_content.png
## per_base_quality.png
## per_base_sequence_content.png
## per_sequence_gc_content.png
## per_sequence_quality.png
## per_tile_quality.png
## sequence_length_distribution.png
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2589044_1_fastqc:
## Icons
## Images
## fastqc.fo
## fastqc_data.txt
## fastqc_report.html
## summary.txt
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2589044_1_fastqc/Icons:
## error.png
## fastqc_icon.png
## tick.png
## warning.png
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2589044_1_fastqc/Images:
## adapter_content.png
## duplication_levels.png
## per_base_n_content.png
## per_base_quality.png
## per_base_sequence_content.png
## per_sequence_gc_content.png
## per_sequence_quality.png
## per_tile_quality.png
## sequence_length_distribution.png
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2589044_2_fastqc:
## Icons
## Images
## fastqc.fo
## fastqc_data.txt
## fastqc_report.html
## summary.txt
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2589044_2_fastqc/Icons:
## error.png
## fastqc_icon.png
## tick.png
## warning.png
##
## dc_workshop/results/fastqc_untrimmed_reads/SRR2589044_2_fastqc/Images:
## adapter_content.png
## duplication_levels.png
## per_base_n_content.png
## per_base_quality.png
## per_base_sequence_content.png
## per_sequence_gc_content.png
## per_sequence_quality.png
## per_tile_quality.png
## sequence_length_distribution.png
##
## dc_workshop/results/sam:
## SRR2584863.aligned.sam
## SRR2584866.aligned.sam
## SRR2589044.aligned.sam
##
## dc_workshop/results/trimmed_fastqc:
## SRR2584863_1.trim_fastqc.html
## SRR2584863_1.trim_fastqc.zip
## SRR2584863_2.trim_fastqc.html
## SRR2584863_2.trim_fastqc.zip
## SRR2584866_1.trim_fastqc.html
## SRR2584866_1.trim_fastqc.zip
## SRR2584866_2.trim_fastqc.html
## SRR2584866_2.trim_fastqc.zip
## SRR2589044_1.trim_fastqc.html
## SRR2589044_1.trim_fastqc.zip
## SRR2589044_2.trim_fastqc.html
## SRR2589044_2.trim_fastqc.zip
##
## dc_workshop/results/vcf:
## SRR2584863_final_variants.vcf
## SRR2584866_final_variants.vcf
## SRR2584866_variants.vcf
## SRR2589044_final_variants.vcf
##
## dc_workshop/scripts:
## read_qc.sh
## run_full_variant_calling.sh
## run_variant_calling.sh
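The long listing above comes from a machine where later workshop steps have already been run. If you have only just created the directories, you should see the following output: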
dc_workshop/:
data docs results
dc_workshop/data:
dc_workshop/docs:
dc_workshop/results:
7.4 Organizing your files
Before beginning any analysis, it’s important to save a copy of your raw data. The raw data should never be changed. Regardless of how sure you are that you want to carry out a particular data cleaning step, there’s always the chance that you’ll change your mind later or that there will be an error in carrying out the data cleaning and you’ll need to go back a step in the process. Having a raw copy of your data that you never modify guarantees that you will always be able to start over if something goes wrong with your analysis. When starting any analysis, you can make a copy of your raw data file and do your manipulations on that file, rather than the raw version. We learned in a previous episode how to prevent overwriting our raw data files by setting restrictive file permissions.
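As a reminder of what that looks like, here is a minimal sketch that uses a throwaway directory and hypothetical file names (sample_raw.fastq, sample_working.fastq) rather than your real data. It removes write permission from the "raw" file, then makes a working copy and restores write permission on the copy only.

```bash
# Throwaway directory and hypothetical file names, so no real data is touched
demo_dir=$(mktemp -d)
echo "placeholder read" > "$demo_dir/sample_raw.fastq"

# Remove write permission for the owner, group, and everyone else
chmod a-w "$demo_dir/sample_raw.fastq"

# Work on a copy; cp carries over the read-only mode, so re-enable writing
cp "$demo_dir/sample_raw.fastq" "$demo_dir/sample_working.fastq"
chmod u+w "$demo_dir/sample_working.fastq"

# Compare the permission strings: only the working copy should show a 'w'
ls -l "$demo_dir"
```

For your own project, the same chmod a-w command would be applied to the files in your raw data directory.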
You can store any results that are generated from your analysis in the results folder. This guarantees that you won’t confuse results files and data files in six months or two years when you are looking back through your files in preparation for publishing your study.
The docs folder is the place to store any written analysis of your results, notes about how your analyses were carried out, and documents related to your eventual publication.
7.5 Documenting your activity on the project
When carrying out wet-lab analyses, most scientists work from a written protocol and keep a hard copy of written notes in their lab notebook, including anything they did differently from the written protocol. This detailed record-keeping process is just as important when doing computational analyses. Luckily, it’s even easier to record the steps you’ve carried out computationally than it is when working at the bench.
The history command is a convenient way to document all the commands you have used while analyzing and manipulating your project files. Let’s document the work we have done on our project so far.
View the commands that you have used so far during this session using history:
history | head -n 10
The history likely contains many more commands than you have used for the current project. Let’s view the last several commands that focus on just what we need for this project.
View the last n lines of your history (where n = approximately the last few lines you think relevant). For our example, we will use the last 7:
history | tail -n 7
7.6 Exercise
Using your knowledge of the shell, use the append redirect >> to save the relevant lines of your history in a file called dc_workshop_log_XXXX_XX_XX.sh (use the four-digit year, two-digit month, and two-digit day, e.g. dc_workshop_log_2022_11_27.sh).
7.7 Solution
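A minimal sketch of one possible solution, assuming the last 7 commands are the relevant ones and using the example date from the exercise. The >> operator appends to the file if it already exists and creates it otherwise. Because bash only records history in an interactive shell, type this at the prompt rather than running it from a script, where the log would come out empty.

```bash
# Append the 7 most recent commands to the log file (created if missing)
history | tail -n 7 >> dc_workshop_log_2022_11_27.sh
```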
Note that we used the last 7 lines as an example; the number of lines may vary.
You may have noticed that your history contains the history command itself. To remove this redundancy from our log, let’s use the nano text editor to fix the file:
nano dc_workshop_log_2022_11_27.sh
(Remember to replace the date in the file name with your workshop date.)
From the nano screen, you can use your cursor to navigate, type, and delete any redundant lines.
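If you would rather not edit by hand, the same clean-up can be sketched non-interactively with grep -v, which prints only the lines that do not match a pattern. The log contents and file names below are hypothetical stand-ins for your real log, and the pattern would also drop any other command that happens to contain the word history.

```bash
# Hypothetical log contents, standing in for real `history` output
log_file=$(mktemp)
printf '%s\n' \
  '  501  cd ~' \
  '  502  mkdir -p dc_workshop/docs' \
  '  503  history | tail -n 7' > "$log_file"

# Keep every line that does not mention the history command
grep -v 'history' "$log_file" > "$log_file.clean"

# Show the cleaned log
cat "$log_file.clean"
```

Writing the result to a new file, rather than back onto the original, keeps the unedited log available in case the pattern removed more than intended.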