5 Sample Quality

5.1 Sample quality index

A mean coverage depth (MCD) and a window coefficient of variation (wCV) can be calculated from the get_MCD and get_wCV functions. Both functions return the same dimension of a matrix, which is (the number of genes) x (the number of samples).

MCD.mat = get_MCD(genelist, pileupPath, sampleInfo)
wCV.mat = get_wCV(genelist, pileupPath, sampleInfo, rnum=100, method=1, winSize=20, egPct=10)
dim(MCD.mat); dim(wCV.mat)
## [1] 1000  171
## [1] 1000  171

The get_SQI function gives the AUC of the fitted lines in the regression of wCV and log transformed MCD, and normalizes it using projection depth (PD)1. Finally, we can determine the quality of the samples by defining them as Bad if they have PD> 3. Nineteen out of 171 samples are classified as bad quality samples. The SQI plot shows the distribution of AUC and PD and the shape of the regression by sample quality.

result = get_SQI(MCD=MCD.mat, wCV=wCV.mat, rstPct=20, obsPct=50)

auc.vec <- result$auc.vec
table(auc.vec$SQI)
## 
##  Bad Good 
##   19  152
plot_SQI(SQIresult=result)

5.2 Updated gene body coverage

The gene body coverage plot can be updated after removing bad samples using the plot_GBCg function. The coverage patterns become much more stable, especially in the long genes. A continuous legend can be selected among ratio intron and PD for the plot.

GBCg0 = plot_GBCg(stat=2, plot=TRUE, sampleInfo, GBCresult=GBC0, auc.vec=result$auc.vec)
GBCg5 = plot_GBCg(stat=2, plot=TRUE, sampleInfo, GBCresult=GBC5, auc.vec=result$auc.vec)

pg0 <- GBCg0$plotRI +
  coord_cartesian(ylim=c(0, 0.017)) +
  ggtitle("0~5 kb")

pg5 <- GBCg5$plotRI +
  coord_cartesian(ylim=c(0, 0.017)) +
  ggtitle("5+ kb")

ggpubr::ggarrange(pg0, pg5, common.legend=TRUE, legend="bottom", nrow=1)

pg0 <- GBCg0$plotPD +
  coord_cartesian(ylim=c(0, 0.017)) +
  ggtitle("0~5 kb")

pg5 <- GBCg5$plotPD +
  coord_cartesian(ylim=c(0, 0.017)) +
  ggtitle("5+ kb")

ggpubr::ggarrange(pg0, pg5, common.legend=TRUE, legend="bottom", nrow=1)


  1. Choi, H. (2018). Scissor for finding outliers in RNA-seq. https://doi.org/10.17615/dv2e-7a29↩︎