The algorithm takes a list of two or more digital gene expression (DGE) matrices as input. Genes should be in rows and cells in columns. Before running the factorization, we need to normalize the data to account for different numbers of UMIs per cell, select variable genes, and scale the data. Note that we do not center the data because nonnegative matrix factorization accepts only positive values. The selectGenes function performs variable gene selection on each of the datasets separately, then takes the union. Note that corresponding genes in each dataset need to have the same names (though the genes do not need to be in the same order in each dataset). For cross-species analysis, it may be convenient to convert all gene names to uppercase; you can do this using the capitalize=T option of the selectGenes function.
= readRDS("dge1.RDS") #genes in rows, cells in columns, rownames and colnames included. Sparse matrix format is
dge1
recommended.= readRDS("dge2.RDS")
dge2 = createLiger(list(name1 = dge1, name2 = dge2)) #Can also pass in more than 2 datasets
ligerex = normalize(ligerex)
ligerex = selectGenes(ligerex, var.thresh = 0.1)
ligerex = scaleNotCenter(ligerex) ligerex
liger
also has functions for reading data generated by
10X’s cellranger count
pipeline (including from 10X V3). We
can merge data by data type (most commonly Gene Expression) across
multiple samples and then use this as a single dataset in a new object
for integration.
# 10X data to be combined for sample 1
= "/path/to/10X/output/dir1"
sample1_data1 = "/path/to/10X/output/dir2"
sample1_data2 = read10X(sample.dirs = list(sample1_data1, sample1_data2),
sample1_dge sample.names = c("s1_data1", "s1_data2"), min.umis = 500)
# 10X data for sample 2
= "/path/to/10X/output/dir3"
sample2_data = read10X(sample.dirs = list(sample2_data),
sample2_dge sample.names = c("s2_data"), min.umis = 500)
= createLiger(list(sample1 = sample1_dge, sample2 = sample2_dge))
liger10X # continue with other preprocessing steps
Next we perform the factorization using an alternating least squares algorithm. After performing the factorization, we identify cells that load on corresponding cell factors and quantile normalize their factor loadings across datasets. The key parameters here are the number of factors (k), the penalty parameter (lambda), and the clustering resolution. In most cases, the default settings of lambda=5.0 and resolution=1.0 provide reasonable results.
= optimizeALS(ligerex, k = 20)
ligerex = quantileAlignSNF(ligerex) #SNF clustering and quantile alignment ligerex
We can visualize the results by using dimensionality reduction
techniques like t-SNE or UMAP (recommended for larger datasets).
Visualizations can be colored by dataset of origin, cluster assignment,
or any feature included in or added to the cell metadata
(cell.data
).
plotWordClouds
and plotGeneLoadings
are
useful ways to visualize the most highly loading genes (both shared and
dataset specific) for each factor, in conjunction with the factor
loadings across cells. These functions are very similar, but
plotGeneLoadings
displays the top loading genes in ordered
scatter plots instead of word clouds.
= runTSNE(ligerex)
ligerex # for larger datasets, may want to use UMAP instead
= runUMAP(ligerex)
ligerex plotByDatasetAndCluster(ligerex) #Can also pass in different set of cluster labels to plot
plotFeature(ligerex, "nUMI")
pdf("word_clouds.pdf")
plotWordClouds(ligerex)
dev.off()
pdf("gene_loadings.pdf")
plotGeneLoadings(ligerex)
dev.off()
Another way to examine factor loadings across cells, and to help
visualize the alignment process is through plotFactors
;
this can be helpful for seeing significant scale differences between the
datasets. We can also visualize the correspondence between clusters and
factors in the data with plotClusterProportions
and
plotClusterFactors
. These can be especially useful for
identifying clusters associated with certain factors and corresponding
marker genes.
plotFactors(ligerex)
plotClusterProportions(ligerex)
plotClusterFactors(ligerex, use.aligned = T)
We can use the factorization to more explicitly identify shared and
dataset-specific markers. The function getFactorMarkers
returns a list, where the first element contains dataset-specific
markers for dataset 1, the second element contains sharedmarkers, the
third element contains dataset-specific markers for dataset 2, and the
last 2 elements indicate the number of factors in which each marker is
found. This information allows the identification of ubiquitous
vs. cell-type-specific dataset differences.
= getFactorMarkers(ligerex, num.genes = 10)
markers plotGene(ligerex, gene = "Malat1")
plotGeneViolin(ligerex, gene = "Malat1")
We can compare and visualize liger
cluster assignments
with existing clusterings (if cluster assignments are available for the
individual datasets).
# published cluster assignments for all cells in dataset 1 and 2
= readRDS("clusters1.RDS")
clusters_published1 = readRDS("clusters2.RDS")
clusters_published2 # calculate adjusted Rand Index between liger cluster assignments and another assignment
calcARI(ligerex, c(clusters_published1, clusters_published2))
# calculate purity between liger cluster assignments and another assignment for just dataset 1
calcPurity(ligerex, clusters_published1)
# visualize joint cluster assignment as related to individual dataset cluster assignments
makeRiverplot(ligerex, clusters_published1, clusters_published2)
liger
includes methods for estimating the level of
integration (alignment) between datasets and the level of distortion of
structure for each individual dataset after factorization and alignment
(compared to before). In general, datasets which have been factorized
and aligned in a meaningful way should show a high degree of integration
(high alignment metric), while maintaining a low degree of distortion
(high agreement metric).
calcAlignment(ligerex)
calcAgreement(ligerex)
# see if certain clusters are more integrated than others
calcAlignmentPerCluster(ligerex)
The suggestK
and suggestLambda
functions
can aid in selecting k and lambda. We want to find the smallest k for
which the increase in entropy metric begins to level off (an “elbow” in
the plot). Similarly, we want the smallest lambda for which the
alignment metric stabilizes.
suggestK(ligerex) # plot entropy metric to find an elbow that can be used to select the number of factors
suggestLambda(ligerex, k) # plot alignment metric to find an elbow that can be used to select the value of lambda
If we want to add new data, change k or lambda, or re-analyze a subset of the data, the functions below provide an efficient method of updating. This is much faster than the naive approach of simply re-running the optimizeALS algorithm.
= optimizeNewK(ligerex, k = 15) #Can also decrease K
ligerex #Add new batches from the same condition/technology/species/protocol
= optimizeNewData(ligerex, newdata = list(name1 = dge1.new, name2 = dge2.new),
ligerex which.datasets = list(name1, name2), add.to.existing = T)
#Add completely new datasets. Specify which existing datasets are most similar.
= optimizeNewData(ligerex, newdata = list(name3 = dge3, name4 = dge4),
ligerex which.datasets = list(name1, name2), add.to.existing = F)
#cell.subset is a list of cells to retain from each dataset
= optimizeSubset(ligerex, cell.subset) ligerex
We can easily create liger
objects from Seurat (V2 or
V3) objects (and vice versa), while keeping calculated features from the
original objects.
# Create liger object from two separate Seurat objects, keeping union of top 2000 highly variable
# genes from each object
= seuratToLiger(list(seurat1, seurat2), names = c('name1', 'name2'), num.hvg.info = 2000)
ligerex # Create liger object from single integrated Seurat V2 object, keeping CCA factorization,
# splitting datasets by Seurat meta.var column "original"
= seuratToLiger(seurat_obj, combined.seurat = T, meta.var = 'original', cca.to.H = T)
ligerex # Create liger object from single integrated Seurat V3 object, splitting datasets by two
# available assays in Seurat
= seuratToLiger(seurat_obj, combined.seurat = T, assays.use = c('RNA', 'ADT'))
ligerex # Create Seurat object from liger object, keeping liger highly variable genes
= ligerToSeurat(ligerex, use.liger.genes = T) seurat_obj