Section 3 Identifying Gene Signature using MINT

3.1 Identifying Molecular Signatures using MINT sPLSDA

This section follows up on the Data Integration vignette. The aim here is to find the gene signature characterising the cell lines using the MINT sPLS-DA model and also to make relevant comparisons.

The following topics are covered in details in the previous vignette and thus will not be expanded:

The instructions on installation of the required packages
Data normalisation method using scran package

3.2 R Scripts

The R scripts are available at this link.

3.3 Libraries

# load the required libraries
library(SingleCellExperiment)
library(mixOmics)
library(scran)
library(scater)
library(knitr)
library(VennDiagram)
library(tibble)

3.4 Data

The data need to be loaded only if the vignette files are being run separately (and not as a book). The raw and normalised data are loaded from the Data Integration vignette. You can load them directly from github:

# load from github
# raw
RawURL='https://tinyurl.com/sincell-with-class-RData-LuyiT'
load(url(RawURL))

Alternatively, you can download and load directly into R environment:

# or load from local directory, change to your own
load('data/sincell_with_class.RData')

## normalise the QC'ed count matrices
sc10x.norm =  computeSumFactors(sce10x_qc) ## deconvolute using size factors
sc10x.norm =  normalize(sc10x.norm) ## normalise expression values
## DROP-seq
scdrop.norm = computeSumFactors(scedrop_qc_qc)
scdrop.norm = normalize(scdrop.norm)
## CEL-seq2
sccel.norm =  computeSumFactors(sce4_qc)
sccel.norm =  normalize(sccel.norm)

3.5 PLS-DA on Each Protocol Individually

Previously, we combined the datasets using MINT sPLS-DA. We then performed variable selection using optimum number of variables for each MINT PLS-DA component to tune the model parameters using key predictors. In this vignette, we will also carry out (s)PLS-DA on each individual dataset (from every protocol) to compare the signatures from individual and combined studies. Next, we will compare the signatures against differentially expressed genes from a univariate analysis from CellBench study.

3.5.1 The Method

As mentioned in previous vignette, PLS-DA finds the molecular signature that drives the association of cells to their cell lines. Here we use mixOmics’ plsda function to find the PLSDA components in each dataset, refer to the documentation for details.

3.5.2 PLSDA

Since our aim is to find the signature from each dataset and compare, we need to find the data corresponding to common genes across all datasets:

# find the intersect of the genes
list.intersect = Reduce(intersect, list(
# the rownames of the original (un-transposed) count matrix will output the genes
  rownames(logcounts(sc10x.norm)),
  rownames(logcounts(sccel.norm)),
  rownames(logcounts(scdrop.norm))
))

Initially, we apply the function to the 10X data and perform cross validation:

# extract the normalised count matrix from the SCE object (transposed)
normalised.10x = t(logcounts(sc10x.norm))
# keep the common genes only for comparability
normalised.10x = normalised.10x[,list.intersect]
# form a factor variable of cell lines
Y.10x = as.factor(sc10x.norm[list.intersect,]$cell_line)
# PLS-DA on the dataset with 5 components
plsda.10x.res = plsda(X = normalised.10x, Y = Y.10x, ncomp = 5)
# perform cross validation and find the classification error rates
start = Sys.time()
perf.plsda.10x = perf(plsda.10x.res, progressBar=FALSE )
run.time = Sys.time()-start

The run took less than 2 mins. We can now plot the error rate profile:

# optimal number of components
plot(perf.plsda.10x, col = color.mixo(5:7))

Figure 3.1: The classification error rate for different number of PLSDA components of 10X dataset.

Similar to the combined dataset, ncomp=2 leads to optimum error rate. We now apply plsda to the other 2 datasets:

# extract the normalised count matrix from the SCE object (transposed)
# CEL-seq2
normalised.cel = t(logcounts(sccel.norm))
normalised.cel = normalised.cel[,list.intersect]
# Drop-seq
normalised.drop= t(logcounts(scdrop.norm))
normalised.drop = normalised.drop[,list.intersect]
# factor variable of cell lines
Y.cel = as.factor(sccel.norm[list.intersect,]$cell_line)
Y.drop = as.factor(scdrop.norm[list.intersect,]$cell_line)
# PLS-DA on each dataset with 2 components
plsda.10x.res = plsda(X = normalised.10x, Y = Y.10x, ncomp = 2)
plsda.cel.res = plsda(X = normalised.cel, Y = Y.cel, ncomp = 2)
plsda.drop.res = plsda(X = normalised.drop, Y = Y.drop, ncomp = 2)

We plot the plsda plots for the 10X and CEL-seq2 data:

# mint.plsda plot for 10X
plotIndiv(plsda.10x.res,
          legend  = TRUE, title     = 'PLSDA 10X', 
          ellipse = TRUE, legend.title = 'Cell Line',
          X.label = 'PLSDA component 1', 
          Y.label = 'PLSDA component 2', pch=1)

Figure 3.2: PLSDA plot for the 10x dataset.

# mint.plsda plot for CEL-seq2
plotIndiv(plsda.cel.res,
          legend  = TRUE, title     = 'PLSDA CEL-seq2', 
          ellipse = TRUE, legend.title = 'Cell Line',
          X.label = 'PLSDA component 1', 
          Y.label = 'PLSDA component 2', pch=2)

Figure 3.3: PLSDA plot for the CEL-seq2 dataset.

The PLSDA figures show clustering by cell line, while the CEL-seq2 data are more scattered compared to 10X. In interpreting the relations, it is important to note that the apparent change of clusters is simply the result of an inversion on second PLSDA component and does not imply disparity per se.

3.5.3 Sparse PLSDA (sPLSDA) For Each Protocol

For consistency and better comparison, we will use same number of optimum components as MINT for the sparse PLS-DA on each protocol:

# run sparse PLSDA on individual studies with MINT tuned parameters
keepX = c(35,10)
splsda.10x.res = splsda( X =normalised.10x, Y = Y.10x, ncomp = 2,  
                              keepX = keepX)
splsda.cel.res = splsda( X =normalised.cel, Y = Y.cel, ncomp = 2,  
                              keepX = keepX)
splsda.drop.res = splsda( X =normalised.drop, Y = Y.drop, ncomp = 2,  
                              keepX = keepX)

And visualise the output to see how the clusters change with variable selection:

# splsda plots with tuned number of variables for each sPLSDA component
# 10X
plotIndiv(splsda.10x.res, group = Y.10x,
          legend  = TRUE, title     = 'sPLSDA - 10X',
          ellipse = FALSE,legend.title = 'Cell Line',
          pch=1,
          X.label = 'sPLSDA component 1',
          Y.label = 'sPLSDA component 2')

Figure 3.4: The sPLSDA plots of the 10X dataset.

# CEL-seq2
plotIndiv(splsda.cel.res, group = Y.cel,
          legend  = TRUE, title     = 'sPLSDA - CEL-seq2',
          ellipse = FALSE,legend.title = 'Cell Line',
          pch=2,
          X.label = 'sPLSDA component 1',
          Y.label = 'sPLSDA component 2')

Figure 3.5: The sPLSDA plots of the CEL-seq2 dataset.

The variable selection has reduced the noise and refined the clusters compared to the non-sparse model.

# The signature genes from each sPLS-DA study
Chromium.10X.vars = unique(c(selectVar(splsda.10x.res, comp=1)$name,
                             selectVar(splsda.10x.res, comp=2)$name))
Cel.seq.vars =      unique(c(selectVar(splsda.cel.res, comp=1)$name,
                             selectVar(splsda.cel.res, comp=2)$name))
Drop.seq.vars =     unique(c(selectVar(splsda.drop.res, comp=1)$name,
                             selectVar(splsda.drop.res, comp=2)$name))

# create a venn diagram from signatures
vennProtocols <- venn.diagram(
    x = list(
        Chr.10X= Chromium.10X.vars ,
        Cel.seq= Cel.seq.vars,
        Drop.seq = Drop.seq.vars),
    filename = NULL,
    cex=1.5, cat.cex=1.5,
    fill = c('green', 'darkblue',  'yellow')
    )
png(filename = 'figures/vennProtocols.png')
grid.draw(vennProtocols)
dev.off()

Figure 3.6: The venn diagram of common sPLSDA signatures among individual datasets.

The signatures from the 3 studies share 25 common genes.

3.6 Compare with MINT-Combined

In case the vignette is being run separately, the MINT sPLSDA object from previous vignette runs can be loaded directly or from github:

# load from github
MintURL='https://tinyurl.com/Sparse-MINT-Object'
load(url(MintURL))

Or directly from your local directory:

## change to your own
load('data/mint.splsda.tuned.res.RData')

3.6.1 Loading Plots

The consistency of the selected variables across individual and the MINT combined studies can be evaluated by plotting the loadings for each study using the plotLoadings function, which produces the barplot of variable loadings for each component.

The inputs to plotLoadings depends on the analysis object (mint.splsda, mint.pls, etc.) and consists of the plot object plus:

contrib: Whether to show the class in which the expression of the features is maximum or minimum. One of ‘min’ or ‘max’.
method: The criterion to assess the contribution. One of ‘mean’ or ‘median’ (recommended for skewed data).
study: The studies to be plotted (for combined data). all.partial or global for all.

For a complete list of the arguments for any object please refer to the documentation.

We plot the loadings for both components for maximum contribution. Colours indicate the class (cell line) in which the gene is positively/negatively expressed:

# 10X
plotLoadings(splsda.10x.res, contrib='max', method = 'mean', comp=1, 
             study='all.partial', legend=FALSE, title=NULL, 
             subtitle = '10X')

The loading plots for the first component of sPLS-DA on 10X data. The orange colour corresponds to H2228 cells, and the grey colour belongs to HCC827 cell line.

Figure 3.7: The loading plots for the first component of sPLS-DA on 10X data. The orange colour corresponds to H2228 cells, and the grey colour belongs to HCC827 cell line.

# CEL-seq2
plotLoadings(splsda.cel.res, contrib='max', method = 'mean', comp=1, 
             study='all.partial', legend=FALSE, title=NULL,
             subtitle = 'CEL-seq2')

Figure 3.8: The loading plots for the first component of sPLS-DA on CEL-seq2 data.

# Drop-seq
plotLoadings(splsda.drop.res, contrib='max', method = 'mean', comp=1, 
             study='all.partial', legend=FALSE, title=NULL,
             subtitle = 'Drop-seq')

Figure 3.9: The loading plots for the first component of sPLS-DA on Drop-seq data.

These genes are differentiating cells along the first sPLSDA component. Majority of signature genes are postively expressed in H2228 cells (orange). There is not a clear consensus among individual datasets in terms of the signature genes and their weight (loading).

## MINT Comp. 1
plotLoadings(mint.splsda.tuned.res, contrib='max', method = 'mean', comp=1, 
             study='all.partial', legend=FALSE, title=NULL, 
             subtitle = c('10X', 'CEL-seq2', 'Drop-seq') )

Figure 3.10: The loading plots for the first component of the Sparse MINT on the combined dataset.

MINT has produced signature that is consistent across studies. Similarly, we can plot the loadings on the second component variables.

## MINT Comp. 2
plotLoadings(mint.splsda.tuned.res, contrib='max', method = 'mean', comp=2, 
             study='all.partial', legend=FALSE, title=NULL, 
             subtitle = c('10X', 'CEL-seq2', 'Drop-seq') )

Figure 3.11: The loading plots for the second component of the Sparse MINT on the combined dataset.

3.6.2 Signature Comparison

The selectVar function in mixOmics outputs the key predictors on each component along with their loadings. We can create a set of Venn diagrams to visualise the overlap of the signature found using PLS-DA in individual datasets and the combined dataset. For all datasets, we take all the variables on both components as the signature.

## MINT signature
MINT.Combined.vars = unique(c(selectVar(mint.splsda.tuned.res, comp=1)$name,
                             selectVar(mint.splsda.tuned.res, comp=2)$name))

vennMINT <- venn.diagram(
    x = list(
        Chr.10X= Chromium.10X.vars ,
        Cel.seq= Cel.seq.vars,
        Drop.seq = Drop.seq.vars,
        MINT.Combined=MINT.Combined.vars),
    filename = NULL,
    cex=1.5, cat.cex=1.5,
    fill = c('green', 'darkblue',  'yellow', 'red')
    )
png(filename = 'figures/vennMINT.png')
grid.draw(vennMINT)
dev.off()

Figure 3.12: The venn diagram of sPLSDA signatures among individual datasets and the MINT combined dataset.

# the signature genes identified by all studies
common.sig = Reduce(intersect, list(MINT.Combined.vars, Chromium.10X.vars, Cel.seq.vars, Drop.seq.vars))

MINT has successfully detected the core 25 signature genes shared by all individual PLSDA models, plus 10 genes identied in 2 out of 3 studies.

# the 10 signature genes with highest loadings
common.sig[1:10]

##  [1] "ENSG00000196126" "ENSG00000154277" "ENSG00000173432"
##  [4] "ENSG00000019582" "ENSG00000204287" "ENSG00000198502"
##  [7] "ENSG00000231389" "ENSG00000196139" "ENSG00000179344"
## [10] "ENSG00000108602"

3.6.3 Variable Plots

Using plotVar function, it is possible to display the selected variables on a correlation circle plot to find the correlation between gene expressions:

## correlation circle plot
plotVar(mint.splsda.tuned.res, cex = 3)

Figure 3.13: The variable plot highlighting the contribution of each selected variable to each component and their correlation.

The figure above shows that majority of the signature genes are positively expressed in component 1.

we can assess the expression profiles for the genes on the extremes of both components:

## component 1 - most positively and negatively expressed between cell lines
var.c1 = selectVar(mint.splsda.tuned.res, comp=1)$value
positive.gene.c1 = rownames(var.c1)[which.max(var.c1$value.var)]
negative.gene.c1 = rownames(var.c1)[which.min(var.c1$value.var)]

## component 2 - most positively and negatively expressed between cell lines
var.c2 = selectVar(mint.splsda.tuned.res, comp=2)$value
positive.gene.c2 = rownames(var.c2)[which.max(var.c2$value.var)]
negative.gene.c2 = rownames(var.c2)[which.min(var.c2$value.var)]

# a function to create violin + box plots for this specific dataset
violinPlot = function(mint.object, gene){
  cols = c("H2228" = "orange", "H1975" = "dodgerblue3", "HCC827" = "grey")
  ggplot() + 
    geom_boxplot(aes(mint.object$Y,  mint.object$X[,gene],
                     fill= mint.object$Y), alpha=1)+
    geom_violin(aes(mint.object$Y, mint.object$X[,gene],
                     fill= mint.object$Y), alpha=0.7) +
    labs(x = "Cell Line", y="Standardised Expression Value" )+
    guides(fill=guide_legend(title="Cell Line") ) +
    scale_fill_manual(values=cols ) 
}

## violin + box plots for the most positively expressed gene on component 1
violinPlot(mint.splsda.tuned.res, positive.gene.c1)

Figure 3.14: Violin plots of expression profile of the most positively expressed gene on component 1 in different cell lines

## violin + box plots for the most negatively expressed gene on component 1
violinPlot(mint.splsda.tuned.res, negative.gene.c1)

Figure 3.15: Violin plots of expression profile of the most negatively expressed gene on component 1 in different cell lines

As expected, the expression profile of the above genes for H2228 cell line has opposite correlation to the other two. We can also evaluate the genes of the second sPLS-DA component:

## violin + box plots for the most positively expressed gene on component 2
violinPlot(mint.splsda.tuned.res, positive.gene.c2)

Figure 3.16: Violin plots of expression profile of the most positively expressed gene on component 2 in different cell lines

## violin + box plots for the most negatively expressed gene on component 2
violinPlot(mint.splsda.tuned.res, negative.gene.c2)

Figure 3.17: Violin plots of expression profile of the most negatively expressed gene on component 2 in different cell lines

The expression profiles show that the selected gene tends to be negatively expressed in HCC827 cell line and positively expressed in H1975 cell line.

A Clustered Image Map including the final signature can be plotted. The argument comp can be also be specified to highlight only the variables selected on specific components.

cim(mint.splsda.tuned.res, comp = c(1,2), margins=c(10,5), 
    row.sideColors = color.mixo(as.numeric(mint.splsda.tuned.res$Y)), row.names = FALSE,
    title = 'MINT sPLS-DA', save='png', name.save = 'heatmap')

Figure 3.18: The heatmap showing the cell grouping based on the gene expression profiles.

The heatmap shows that the cells from each cell line tend to group together based on their gene expression profile. It can be reiterated that the cells from the H2228 cell line show positive expression in majority of signature genes.

3.7 Signature Comparison with Bulk Assay

3.7.1 Cell Mixtures Data

The experimental design for CellBench data also included a cell mixture design for a pseudo-trajectory analysis. Briefly, in this study 9 cells were pooled from different cell types in different proportions and then reduced to 1/9th to mimic single cells and were sequenced using CEL-seq2 protocol. A mixture of 90 cells in equal amounts was used as the control group for a Differential Expression (DE) analysis of the pure cell lines at the extreme of trajectories. The results available on CellBench repository.

Figure 3.19: The cell mixture design. Adopted from Luyi Tian slides

We will take DE genes from the cell mixture study as reference for comparison. The DE genes are available on CellBench respository.

# load the DEGs from cell mix data from the 'data' folder
DE_URL <- 'https://tinyurl.com/DEtable-90cells'
load(url(DE_URL))

## or load the DEGs from the 'data' folder
load('data/DEtable_90cells.RData')

3.7.2 DE Genes vs MINT signature

We now compare the unique signature genes from MINT sPLS-DA analysis to the ones from cell mixture DE analysis:

## keep the genes present in the sPLS-DA analysis
HCC827_DEtable = HCC827_DEtable[row.names(HCC827_DEtable) %in% list.intersect,]
H2228_DEtable = H2228_DEtable[row.names(H2228_DEtable) %in% list.intersect,]
H1975_DEtable = H1975_DEtable[row.names(H1975_DEtable) %in% list.intersect,]

## create a column of genes for ease of merging
HCC827_DE = rownames_to_column(HCC827_DEtable, 'gene')
H2228_DE = rownames_to_column(H2228_DEtable, 'gene')
H1975_DE = rownames_to_column(H1975_DEtable, 'gene')

## create a combined DEtable from the 3 cell lines
DEtable = rbind(HCC827_DE,H2228_DE, H1975_DE)
## order by gene name and FDR (increasing)
DEtable = DEtable[order(DEtable[,'gene'],DEtable[,'FDR']),]
## keep the ones with FDR<0.05
DEtable = DEtable[DEtable$FDR<0.05,]
## remove duplicate gene names
DEtable = DEtable[!duplicated(DEtable$gene),]
## overlap with MINT
DE.MINT = DEtable[(DEtable$gene %in% MINT.Combined.vars),]
## sort in increasing FDR order
DE.MINT = DE.MINT[order(DE.MINT$FDR),]
## number of MINT signature genes that are differentially expressed
dim(DE.MINT)[1]

## [1] 45

## geometric mean of the FDR of signature
exp(mean(log(DE.MINT$FDR)))

## [1] 1.548651e-18

All of signature genes identified by MINT are differentially expressed at FDR<0.05.