1 Data Description

1.1 Inputs

Computing the sample quality index outputs and plots requires genelist, coverage pileup data, and sampleInfo. geneInfo is optional and is needed only to compare results by gene properties. The input data and variable names are described below:

  • genelist: a vector of gene names
  • pileupPath: a vector of file paths to the coverage pileup data, including the .RData file names
  • geneInfo: a data frame of gene information, including gene ID and gene properties, based on GENCODE v36
    • gene_id: Ensembl gene ID
    • geneSymbol: gene names
    • merged: gene length
    • exon.wtpct_gc: weighted GC percentage computed from exon-level data
    • subcategory: protein coding or lncRNA
  • sampleInfo: a data frame of sample information including sample ID and properties from Picard RnaSeqMetrics
    • SampleID: sample ID
    • PF_BASES: the total number of bases within the PF_READS of the SAM or BAM file to be examined
    • PF_ALIGNED_BASES: the total number of aligned bases, in all mapped PF reads, that are aligned to the reference sequence
    • RIBOSOMAL_BASES: number of bases in primary alignments that align to ribosomal sequence
    • CODING_BASES: number of bases in primary alignments that align to a non-UTR coding base for some gene, and not ribosomal sequence
    • UTR_BASES: number of bases in primary alignments that align to a UTR base for some gene, and not a coding base
    • INTRONIC_BASES: number of bases in primary alignments that align to an intronic base for some gene, and not a coding or UTR base
    • INTERGENIC_BASES: number of bases in primary alignments that do not align to any gene
    • RINs: RNA integrity number (RIN)
  • TPM: a data frame of TPM-normalized expression values for protein coding and lncRNA genes
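
Before running an analysis, it can help to confirm that the input data frames carry the columns listed above. The following is a minimal sketch in base R. The helper missing_cols and the toy values are hypothetical and are not part of the package; only the column names come from the description above.

```r
# Column names taken from the input description above.
required_sample_cols <- c(
  "SampleID", "PF_BASES", "PF_ALIGNED_BASES", "RIBOSOMAL_BASES",
  "CODING_BASES", "UTR_BASES", "INTRONIC_BASES", "INTERGENIC_BASES", "RINs"
)
required_gene_cols <- c("gene_id", "geneSymbol", "merged",
                        "exon.wtpct_gc", "subcategory")

# Hypothetical helper (not a package function): returns the names of
# required columns that are absent from a data frame.
missing_cols <- function(df, required) {
  setdiff(required, colnames(df))
}

# Toy sampleInfo with made-up values, used only to exercise the helper.
toy_sampleInfo <- data.frame(
  SampleID         = c("S1", "S2"),
  PF_BASES         = c(1000, 2000),
  PF_ALIGNED_BASES = c(900, 1800),
  RIBOSOMAL_BASES  = c(0, 10),
  CODING_BASES     = c(300, 600),
  UTR_BASES        = c(200, 400),
  INTRONIC_BASES   = c(250, 500),
  INTERGENIC_BASES = c(150, 290),
  RINs             = c(7.5, NA)
)

# An empty result means every required column is present.
missing_sample <- missing_cols(toy_sampleInfo, required_sample_cols)

# Dropping a column shows how a gap would be reported.
missing_after_drop <- missing_cols(
  toy_sampleInfo[, colnames(toy_sampleInfo) != "RINs"],
  required_sample_cols
)
```

The same helper could be applied to geneInfo with required_gene_cols when that optional input is supplied.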

1.2 Alliance

This example consists of 1,000 genes selected from protein coding and lncRNA genes and 171 fresh frozen, total RNA-seq (FFT) samples, which can be found in data. Of the 171 samples, 156 are tumor and the remaining 15 are normal.

RNAdegrProjR/
        data/
                genelist.rda
                geneInfo.rda
                sampleInfo.rda
        dataPrep/
                SCISSOR_gaf.txt
                pileup/
                        LINC01772_pileup_part_intron.RData
                        ...
                        MIR133A1HG_pileup_part_intron.RData
                TPM.rda
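
Given the layout above, a pileupPath vector could be assembled from gene names as sketched below. The relative directory and the file-name pattern <gene>_pileup_part_intron.RData are assumptions inferred from the two file names shown in the tree, and the paths need to be adjusted to wherever the files actually live.

```r
# Assumed location of the pileup files, relative to the working directory.
pileup_dir <- file.path("RNAdegrProjR", "dataPrep", "pileup")

# Two gene names that appear in the directory tree above.
genes <- c("LINC01772", "MIR133A1HG")

# Assumed naming pattern: <gene>_pileup_part_intron.RData
pileupPath <- file.path(pileup_dir,
                        paste0(genes, "_pileup_part_intron.RData"))

# Keep only the paths that exist on disk; this is empty when the
# sketch is run outside the package directory.
existing <- pileupPath[file.exists(pileupPath)]
```

Filtering with file.exists before loading avoids errors for genes that have no pileup file.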

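library(summarytools)  # provides descr()
library(dplyr)         # provides %>% and select()

# Min, median, max, and number of non-missing values per numeric column
descr(sampleInfo %>% select(-c(SampleID)),
      stats     = c("min", "med", "max", "n.valid"),
      transpose = TRUE,
      headings  = FALSE)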
## 
##                                    Min           Median              Max   N.Valid
## ---------------------- --------------- ---------------- ---------------- ---------
##           CODING_BASES    360227386.00    3208673516.00    7201831541.00    171.00
##       INTERGENIC_BASES   1273275706.00    2567603596.00   16945505453.00    171.00
##         INTRONIC_BASES    434293079.00    5593128001.00   10076495286.00    171.00
##       PF_ALIGNED_BASES   5448216119.00   14566862147.00   23291961710.00    171.00
##               PF_BASES   6481274100.00   16315945500.00   26148083100.00    171.00
##        RIBOSOMAL_BASES            0.00           150.00          6600.00    171.00
##                   RINs            1.10             5.30             9.20    167.00
##              UTR_BASES    381863999.00    2783953295.00    4803719959.00    171.00
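
In the table above, RINs has 167 valid values out of 171, so four samples have a missing RIN and any summary of that column has to skip missing values. The sketch below shows one way to compute the same four statistics in base R and to express a base count as a fraction of PF_ALIGNED_BASES. The toy data frame and its values are made up for illustration, and the choice of PF_ALIGNED_BASES as the denominator is an assumption rather than something the package prescribes.

```r
# Toy data with made-up values; the real sampleInfo has 171 rows.
toy <- data.frame(
  SampleID         = c("S1", "S2", "S3"),
  PF_ALIGNED_BASES = c(1000, 2000, 4000),
  CODING_BASES     = c(250, 1000, 1000),
  RINs             = c(2.0, NA, 8.0)
)

# Drop the identifier column and keep the numeric ones.
num <- toy[, setdiff(colnames(toy), "SampleID")]

# One row per variable: min, median, max, and count of non-missing values.
summary_tab <- t(sapply(num, function(x) {
  c(Min     = min(x, na.rm = TRUE),
    Median  = median(x, na.rm = TRUE),
    Max     = max(x, na.rm = TRUE),
    N.Valid = sum(!is.na(x)))
}))

# Fraction of aligned bases that fall in coding regions, per sample.
coding_fraction <- toy$CODING_BASES / toy$PF_ALIGNED_BASES
```

Expressing the base counts as fractions makes samples with very different sequencing depths comparable.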