2 Data Generation
2.1 BAM slicing
The relevant reads can be extracted for efficient analysis on a subset of genes.
STAR and samtools enable output sorted alignments to Aligned.out.sort.bam
file.
For simple explanation of data generating steps, suppose that we only consider a case of gene KEAP1 and 5 samples in the bamfiles
folder.
A comprehensive gene annotation file (GTF) gencode.v36.annotation.gtf
can be downloaded at GENCODE.
gtftools is able to save gene coordination and ID mappings in bed format as gencode.v36.genes.bed
from the GTF file, and then genomic region for KEAP1 be obtained as genic_region.bed
.
bedfile="$base/gencode.v36.genes.bed"
genome_region=`grep -P $genename"\t" $bedfile | cut -f5`
awk -v gene="$genename" -F'\t' '$7 == gene {print $1, $2, $3}' OFS='\t' "$bedfile" > $geneDir/bamslice/genic_region.bed
samtools
and BEDTools
are required, and inputs to get a sliced BAM for the specific gene and sample are given:
- Gene of interest
- Genomic coordinates to extract reads
- Input directory for BAM files
- Output directory for sliced BAM files
- Prefix for output BAM files
genename="KEAP1"
genome_region="chr19:10486124-10503558"
ibampath="/bamfiles/STAR_S000004-37933-003_Aligned.out.sort.bam"
bamsliceDir="/bamslice"
ibamslice="KEAP1_S000004-37933-003_slice"
We define 3 cases to select reads within genome region of a gene:
*_contained.bam
: both paired reads are in the genome region (black line)*_overlap_Sonly.bam
: one of paired reads overlaps with the start genomic position (red line)*_overlap_Eonly.bam
: one of paired reads overlaps with the end genomic position (green line)
samtools
can extract paired-end reads in *_1.bam
and *_2.bam
from BAM files and merge as *_temp.bam
.
Based on overlap with the gene, intersectBed
from BEDTools
identifies reads and start and end positions of the gene are created in *_genes_S.bed
and *_genes_E.bed
.
samtools view -b -f 147 ${ibampath} $genome_region > $bamsliceDir/${ibamslice}_1.bam
samtools view -b -f 99 ${ibampath} $genome_region > $bamsliceDir/${ibamslice}_2.bam
samtools merge -f $bamsliceDir/${ibamslice}_temp.bam $bamsliceDir/${ibamslice}_1.bam $bamsliceDir/${ibamslice}_2.bam
intersectBed -a $bamsliceDir/${ibamslice}_temp.bam -b $bamsliceDir/genic_region.bed > $bamsliceDir/${ibamslice}_overlap.bam
intersectBed -a $bamsliceDir/${ibamslice}_temp.bam -b $bamsliceDir/genic_region.bed -f 1.0 > $bamsliceDir/${ibamslice}_contained.bam
awk '{print $1"\t"$2"\t"$2}' $bamsliceDir/genic_region.bed > $bamsliceDir/${ibamslice}_genes_S.bed
awk '{print $1"\t"$3"\t"$3}' $bamsliceDir/genic_region.bed > $bamsliceDir/${ibamslice}_genes_E.bed
After identifying reads overlapping only the start or end of gene, we can merge with reads fully contained within the gene to make a preliminary BAM
file as *_pre.bam
.
The final files are a sorted BAM file *.bam
and a BAM index file *.bam.bai
, and removing all intermediate files are recommended to optimize storage.
intersectBed -a $bamsliceDir/${ibamslice}_overlap.bam -b $bamsliceDir/${ibamslice}_genes_S.bed | intersectBed -a stdin -b $bamsliceDir/${ibamslice}_genes_E.bed -v > $bamsliceDir/${ibamslice}_overlap_Sonly.bam
intersectBed -a $bamsliceDir/${ibamslice}_overlap.bam -b $bamsliceDir/${ibamslice}_genes_E.bed | intersectBed -a stdin -b $bamsliceDir/${ibamslice}_genes_S.bed -v > $bamsliceDir/${ibamslice}_overlap_Eonly.bam
samtools merge -o $bamsliceDir/${ibamslice}_pre.bam $bamsliceDir/${ibamslice}_contained.bam $bamsliceDir/${ibamslice}_overlap_Sonly.bam $bamsliceDir/${ibamslice}_overlap_Eonly.bam
samtools sort $bamsliceDir/${ibamslice}_pre.bam > $bamsliceDir/${ibamslice}.bam
samtools index $bamsliceDir/${ibamslice}.bam $bamsliceDir/${ibamslice}.bam.bai
rm $bamsliceDir/${ibamslice}_1.bam $bamsliceDir/${ibamslice}_2.bam $bamsliceDir/${ibamslice}_temp.bam $bamsliceDir/${ibamslice}_overlap.bam $bamsliceDir/${ibamslice}_contained.bam $bamsliceDir/${ibamslice}_pre.bam $bamsliceDir/${ibamslice}_genes_S.bed $bamsliceDir/${ibamslice}_genes_E.bed $bamsliceDir/${ibamslice}_overlap_Sonly.bam $bamsliceDir/${ibamslice}_overlap_Eonly.bam
2.2 Pileup
SCISSOR
package1 can be applied to generate pileup from BAM file, region info, and coverage plots based on its tutorial.
From the GTF file, build_gaf
function creates a region info file named SCISSOR_gaf.txt
and shows the full path of the file. The file contains regions
by gene_name
and gene_id
.
For instance, we can make pileup and compare coverage plots of part_intron and only_exon using gene KEAP1 and the 5 samples.
If 5 sliced BAM files are saved in the bamslice
folder, the full file path of them can be set as BAMfiles
.
gen_pileup
function saves pileup
, regions
, and Ranges
for Gene
, which become inputs to plot pileup figures.
BAMfiles <- list.files(path=paste0("../bamslice"), pattern="\\.sort.bam$", full.names=TRUE)
caseIDs <- substr(basename(BAMfiles), start=6, stop=22)
gen_pileup(Gene="KEAP1",
regionsFile=data.table::fread(file="./dataPrep/SCISSOR_gaf.txt"),
BAMfiles=BAMfiles,
caseIDs=caseIDs,
outputdir=paste0("../pileup/"))
par(mfrow=c(1,2))
# part_intron
load(paste0("../pileup/KEAP1_pileup_part_intron.RData"))
plot_pileup(Pileup=log10(pileup+1), case=caseIDs, Ranges=Ranges, logcount=1, main="part_intron")
# only_exon
pileupData = build_pileup(Pileup=pileup, case=caseIDs, regions=regions, inputType="part_intron", outputType="only_exon")
geneRanges = get_Ranges(Gene=Ranges$Gene, regions=regions, outputType="only_exon")
plot_pileup(Pileup=log10(pileupData+1), case=caseIDs, Ranges=geneRanges, logcount=1, main="only_exon")
The only_exon figure has 10 exons after filtering introns from part_intron. The start and end positions for exon (e) and intronic part (ip) per exon are matched with the pileup plot and we only consider exon to measure sample quality.
## ip.start e.start e.end ip.end
## exon1 10503558 10503558 10503241 10503187
## exon2 10503186 10503186 10503093 10502897
## exon3 10502896 10502896 10502697 10502688
## exon4 10502687 10502687 10502388 10502039
## exon5 10500429 10500080 10499395 10499046
## exon6 10493066 10492717 10492573 10492263
## exon7 10492262 10492262 10491577 10491228
## exon8 10490202 10489853 10489648 10489436
## exon9 10489435 10489435 10489192 10488843
## exon10 10487167 10486818 10486125 10486125
Choi, H.Y., Jo, H., Zhao, X. et al. SCISSOR: a framework for identifying structural changes in RNA transcripts. Nat Commun 12, 286 (2021). https://doi.org/10.1038/s41467-020-20593-3↩︎