1 Data Description

1.1 Inputs

Three inputs are required to compute the sample quality index outputs and plots: genelist, the coverage pileup data, and sampleInfo. geneInfo is optional and is only needed to compare results by gene properties. The input data and their variables are described below:

  • genelist: a vector of gene names
  • pileupPath: a vector of file paths to the coverage pileup data, including the .RData file names
  • geneInfo: a data frame of gene information, including gene ID and properties, based on GENCODE v36
    • gene_id: Ensembl gene ID
    • geneSymbol: gene names
    • merged: gene length
    • exon.wtpct_gc: weighted percentage of GC from exon level data
    • subcategory: protein coding or lncRNA
  • sampleInfo: a data frame of sample information including sample ID and properties from Picard RnaSeqMetrics
    • SampleID: sample ID
    • PF_BASES: the total number of bases within the PF_READS of the SAM or BAM file to be examined
    • PF_ALIGNED_BASES: the total number of aligned bases, in all mapped PF reads, that are aligned to the reference sequence
    • RIBOSOMAL_BASES: number of bases in primary alignments that align to ribosomal sequence
    • CODING_BASES: number of bases in primary alignments that align to a non-UTR coding base for some gene, and not ribosomal sequence
    • UTR_BASES: number of bases in primary alignments that align to a UTR base for some gene, and not a coding base
    • INTRONIC_BASES: number of bases in primary alignments that align to an intronic base for some gene, and not a coding or UTR base
    • INTERGENIC_BASES: number of bases in primary alignments that do not align to any gene
    • RINs: RNA integrity number (RIN) value
  • TPM: a data frame of TPM-normalized values for protein coding and lncRNA genes
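
Before running an analysis, it can help to confirm that sampleInfo carries the columns listed above. The following is a minimal sketch in base R. The toy values, the object sampleInfo_toy, and the helper check_sample_info() are hypothetical and are not part of the package; the uniqueness check on SampleID is an assumption about how the table is used.

```r
# Hypothetical toy sampleInfo with the columns listed above (values are made up).
sampleInfo_toy <- data.frame(
  SampleID         = c("S1", "S2", "S3"),
  PF_BASES         = c(100, 200, 300),
  PF_ALIGNED_BASES = c(90, 180, 270),
  RIBOSOMAL_BASES  = c(0, 1, 2),
  CODING_BASES     = c(30, 60, 90),
  UTR_BASES        = c(20, 40, 60),
  INTRONIC_BASES   = c(25, 50, 75),
  INTERGENIC_BASES = c(15, 29, 43),
  RINs             = c(7.1, NA, 5.3),
  stringsAsFactors = FALSE
)

# Column names taken from the sampleInfo description above.
required_cols <- c("SampleID", "PF_BASES", "PF_ALIGNED_BASES", "RIBOSOMAL_BASES",
                   "CODING_BASES", "UTR_BASES", "INTRONIC_BASES",
                   "INTERGENIC_BASES", "RINs")

# check_sample_info() is a hypothetical helper, not a package function.
check_sample_info <- function(df, required = required_cols) {
  # Stop if any expected column is absent.
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("sampleInfo is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Assumption: each sample appears in exactly one row.
  if (anyDuplicated(df$SampleID) > 0) {
    stop("SampleID values must be unique")
  }
  invisible(TRUE)
}

check_sample_info(sampleInfo_toy)
```

The same pattern could be applied to geneInfo by swapping in its column names (gene_id, geneSymbol, merged, exon.wtpct_gc, subcategory).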

1.2 Alliance

This example consists of 1,000 genes selected from protein coding and lncRNA genes and 171 fresh frozen, total RNA-seq (FFT) samples, which can be found in the data folder. Of these samples, 156 are tumor and the remaining 15 are normal. You can construct your own structure for the bamfiles, bamslice, and dataPrep folders as described in the data generation chapter.

RNAdegrProjR/
        dataPrep/
                gencode.v36.annotation.gtf
                gencode.v36.genes.bed
                SCISSOR_gaf.txt
                TPM.rda
        data/
                genelist.rda
                geneInfo.rda
                sampleInfo.rda
bamfiles/
        STAR_S000004-37933-003_Aligned.out.sort.bam
        STAR_S000004-37935-002_Aligned.out.sort.bam
        STAR_S000004-37937-002_Aligned.out.sort.bam
        STAR_S000004-37939-002_Aligned.out.sort.bam
        STAR_S000004-37941-002_Aligned.out.sort.bam
        ...
bamslice/
        genic_region.bed
        KEAP1_S000004-37933-003_slice.bam
        KEAP1_S000004-37935-002_slice.bam
        KEAP1_S000004-37937-002_slice.bam
        KEAP1_S000004-37939-002_slice.bam
        KEAP1_S000004-37941-002_slice.bam
        ...
pileup/
        LINC01772_pileup_part_intron.RData
        ...
        MIR133A1HG_pileup_part_intron.RData
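
Given the file naming shown in the pileup folder above, the pileupPath vector could be assembled from genelist roughly as follows. This is only a sketch: genelist_toy, pileup_dir, and pileupPath_toy are hypothetical names, the two genes are the ones visible in the tree, and the folder location is an assumption that should be adjusted to your own layout.

```r
# Hypothetical gene names taken from the tree above; the example genelist has 1,000 entries.
genelist_toy <- c("LINC01772", "MIR133A1HG")

# Assumed location of the pileup folder; adjust to your own layout.
pileup_dir <- "pileup"

# One path per gene, following the <gene>_pileup_part_intron.RData pattern.
pileupPath_toy <- file.path(pileup_dir,
                            paste0(genelist_toy, "_pileup_part_intron.RData"))

# Collect any paths that do not exist on disk before running downstream steps.
missing_files <- pileupPath_toy[!file.exists(pileupPath_toy)]

print(pileupPath_toy)
```

Checking missing_files first avoids a failure partway through a long run when a gene has no pileup file.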

The summary table from summarytools shows descriptive statistics for reviewing the distributions and missing values in the provided datasets.

library(summarytools)
library(dplyr)

descr(sampleInfo %>% select(-c(SampleID)),
      stats     = c("min", "med", "max", "n.valid"),
      transpose = TRUE,
      headings  = FALSE)
## 
##                                    Min           Median              Max   N.Valid
## ---------------------- --------------- ---------------- ---------------- ---------
##           CODING_BASES    360227386.00    3208673516.00    7201831541.00    171.00
##       INTERGENIC_BASES   1273275706.00    2567603596.00   16945505453.00    171.00
##         INTRONIC_BASES    434293079.00    5593128001.00   10076495286.00    171.00
##       PF_ALIGNED_BASES   5448216119.00   14566862147.00   23291961710.00    171.00
##               PF_BASES   6481274100.00   16315945500.00   26148083100.00    171.00
##        RIBOSOMAL_BASES            0.00           150.00          6600.00    171.00
##                   RINs            1.10             5.30             9.20    167.00
##              UTR_BASES    381863999.00    2783953295.00    4803719959.00    171.00
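
In the table above, N.Valid for RINs is 167 rather than 171, so four samples have no RIN value. A short base R sketch for locating such samples is shown below; the toy data frame and its values are hypothetical and stand in for sampleInfo.

```r
# Hypothetical toy data standing in for sampleInfo (values are made up).
toy <- data.frame(
  SampleID     = c("S1", "S2", "S3", "S4"),
  RINs         = c(7.1, NA, 5.3, NA),
  CODING_BASES = c(30, 60, 90, 120),
  stringsAsFactors = FALSE
)

# Number of missing values in each column other than the sample identifier.
n_missing <- colSums(is.na(toy[, setdiff(names(toy), "SampleID")]))

# Sample IDs whose RIN value is missing.
missing_ids <- toy$SampleID[is.na(toy$RINs)]

print(n_missing)
print(missing_ids)
```

Whether to drop such samples or keep them for analyses that do not use RIN is a choice left to the analyst.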