UMAP and SOM from Distance Matrices

C.E. Coombes, Zachary B. Abrams, Kevin R. Coombes

2024-10-22

Introduction

We recently added visualizations based on Self-Organizing Maps (SOM) and Uniform Manifold Approximation and Projection (UMAP) for objects in the Mercator package. This addition relies on the core implementations in the kohonen and umap packages, respectively. The main challenge that we faced was that both of those implementations want to receive the raw data as inputs, while Mercator objects only store a distance matrix computed from the raw data. The purpose of this vignette is, first, to describe how we worked around this restriction, and second, to illustrate how to use these methods in the package.

The Mercator Class

First we load the package.

suppressMessages( suppressWarnings( library(Mercator) ) )

Next, we load a “fake” set of synthetic continuous data that comes with the Mercator package.

set.seed(36766)
data(fakedata)
ls()
##  [1] "C"         "Cl4"       "Delta"     "J"         "N"         "P"        
##  [7] "SV"        "X"         "fakeclin"  "fakedata"  "filename"  "hclass"   
## [13] "jacc.Vis"  "kk"        "mercury"   "my.binmat" "my.clust"  "my.data"  
## [19] "neptune"   "pts"       "sokal.Vis" "tab"       "temp.Vis"  "venus"
dim(fakedata)
## [1] 776 300
dim(fakeclin)
## [1] 300   4

UMAP

Let’s create a UMAP visualization directly from the raw data.

library(umap)
udirect <- umap(t(fakedata))
plot(udirect$layout, main="Original UMAP")

UMAP visualization of fake data. We aren’t going to do anything fancy with labels or colors in this plot; we just want an idea of the main structure in the data. In particular, it looks like there are seven clusters, divided into one group three and two groups of two each.

Next, we are going to create a Meractor object with a multi-dimensional scaling (MDS) visualization. Inspired by the first plot, we will arbitrarily assign seven groups. (This assignment uses the PAM algorithm internally.)

mercury <- Mercator(dist(t(fakedata)), "euclid", "mds", 7)
plot(mercury, main = "MDS")

MDS visualization from a distance matrix. Interestingly, here we see a possibly different number of groups. But that’s a separate issue, and we don’t want to distracted from our main thread.

We want to confirm that the actual raw data is not contained in mercury, just the distance matrix.

summary(mercury)
## An object of the 'Mercator' class, using the ' euclid ' metric, of size
## [1] 300 300
## Contains these visualizations:  mds
slotNames(mercury)
## [1] "metric"   "distance" "view"     "palette"  "symbols"  "clusters"
mercury@metric
## [1] "euclid"
mercury@palette
##       NC1       NC2       NC3       NC4       NC5       NC6       NC7       NC8 
## "#2E91E5" "#E15F99" "#1CA71C" "#FB0D0D" "#DA16FF" "#222A2A" "#B68100" "#750D86" 
##       NC9      NC10      NC11      NC12      NC13      NC14      NC15      NC16 
## "#EB663B" "#511CFB" "#00A08B" "#FB00D1" "#FC0080" "#B2828D" "#6C7C32" "#778AAE" 
##      NC17      NC18      NC19      NC20      NC21      NC22      NC23      NC24 
## "#862A16" "#A777F1" "#620042" "#1616A7" "#DA60CA" "#6C4516" "#0D2A63" "#AF0038"
mercury@symbols
## [1] 16 15 17 18 10  7 11  9
table(mercury@clusters)
## 
##  1  2  3  4  5  6  7 
## 44 68 38 38 37 39 36
class(mercury@view)
## [1] "list"
names(mercury@view)
## [1] "mds"
class(mercury@view[["hclust"]])
## [1] "NULL"
class(D <- mercury@distance)
## [1] "dist"

Next, we add a “umap” visualization to this object.

mercury <- addVisualization(mercury, "umap")
plot(mercury, view = "umap", main="UMAP from distance matrix")

UMAP visualization from a distance matrix. Notice that the plot method knows that we want to see the internal “layout” component of the umap object. It also automatically assigns axis names that (subliminally?) remind us that we computed this visualization with UMAP. And it colors the points using the same values from the initial PAM clustering assignments.

As an aside, we point out that the PAM clustering doesn’t really match the implicit UMAP clustering of the data.

Euclideanization

We obviously constructed this plot just from the distance matrix, not from the raw data. It neverheless has a structure essentially the same as the original one. But how?

Well, internally, we use this function:

Mercator:::makeEuclidean
## function (D) 
## {
##     M <- as.matrix(D)
##     X <- scale(t(scale(t(M^2), scale = FALSE)), scale = FALSE)
##     E <- eigen(-X/2, symmetric = TRUE)$values
##     R <- min(sum(E > 1e-10), nrow(M) - 1)
##     cmdscale(D, k = R)
## }
## <bytecode: 0x000001b8d32067f0>
## <environment: namespace:Mercator>

The first four lines of code inside this function are basically the algorithm used by the cmdscale function to compute the eigenvalues need to create its dimension reduction for visualization. The count of the number of nonzero eigenvalues (which defines R) is careful to stay away from zero. We do this to avoid round-off errors. If we miss by one or two eignevalues, than the final call to cmdscale will generate a confusing warning. That final line, by the way, creates an embedding into Euclidean space that should preserve almost all of the distances between pairs of points in the original data set.

To see that, let’s carry out that step explicitly.

X <- Mercator:::makeEuclidean(D)
dim(X)
## [1] 300 299

Now we compute distances between pairs of points in this new embedding.

D2 <- dist(X)
delta <- as.matrix(D) - as.matrix(D2)
summary(as.vector(delta))
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.386e-13 -1.421e-14  0.000e+00  2.740e-17  1.421e-14  9.237e-14

We see that the absolute errors between any pair of points in the two distance computations are less than \(10^{-12}\). And that is the fundamental reason that we expect equivalent structures whether we apply UMAP to the original data or to the re-embedded data arising from the distance matrix.

Self-Organizing Maps

The same approach works to compute self-organizing maps from a distance matrix. Here is the result using the data matrix.

mercury <- addVisualization(mercury, "som",
                            grid = kohonen::somgrid(6, 5, "hexagonal"))
plot(mercury, view = "som")

SOM visualizations from a distance matrix. And here is the corresponding result using the original data.

library(kohonen)
mysom <- som(t(fakedata), grid = somgrid(6, 5, "hexagonal"))
plot(mysom, type = "mapping")

This time, we will work a little harder to color the plot in the same way.

plot(mysom, type = "mapping", pchs=16, col=Mercator:::colv(mercury))

SOM visualizations from a distance matrix.

Warning

Qualitatively, most of the SOM plots should be similar. The fundamental exception, however, is the “codes” plot. This plot shows the patterns of intensities in each of the underlying dimensions (averaged over the assigned objects in each node). Because we have performed a multi-dimensional scaling analysis, we have changed the underlying coordinate system. Here are the corresponding plots.

plot(mysom, type = "codes", main="Original", shape = "straight")

SOM codes.

plot(mercury, view = "som", type="codes", main="After MDS")

SOM codes.

Conclusions

As we have seen, when we start with a Euclidean distance matrix, we can get qualitatively equivalent results from the original data or from creating a “realization” of the distance matrix in a large Euclidean space. The major advantage, however, arises when we start with a completely different distance metric. In that case, we cannot apply SOM or UMAP at all. But we can still embed that distance matrix into a Eucldean space and then run SOM or UMAP.

Appendix

This analaysis was performed in the following environment:

sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=C                          
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kohonen_3.0.12       umap_0.2.10.0        Mercator_1.1.5      
## [4] Thresher_1.1.4       PCDimension_1.1.13   ClassDiscovery_3.4.4
## [7] oompaBase_3.2.9      cluster_2.1.6       
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5         xfun_0.46            bslib_0.8.0         
##  [4] ggplot2_3.5.1        lattice_0.22-6       vctrs_0.6.5         
##  [7] tools_4.4.1          generics_0.1.3       stats4_4.4.1        
## [10] flexmix_2.3-19       Polychrome_1.5.1     tibble_3.2.1        
## [13] fansi_1.0.6          highr_0.11           pkgconfig_2.0.3     
## [16] Matrix_1.7-0         KernSmooth_2.23-24   scatterplot3d_0.3-44
## [19] lifecycle_1.0.4      compiler_4.4.1       munsell_0.5.1       
## [22] clue_0.3-65          movMF_0.2-8          htmltools_0.5.8.1   
## [25] sass_0.4.9           yaml_2.3.10          pillar_1.9.0        
## [28] jquerylib_0.1.4      MASS_7.3-60.2        openssl_2.2.0       
## [31] cachem_1.1.0         viridis_0.6.5        mclust_6.1.1        
## [34] RSpectra_0.16-2      cpm_2.3              tidyselect_1.2.1    
## [37] digest_0.6.36        Rtsne_0.17           slam_0.1-52         
## [40] dplyr_1.1.4          kernlab_0.9-32       changepoint_2.2.4   
## [43] ade4_1.7-22          fastmap_1.2.0        grid_4.4.1          
## [46] oompaData_3.1.3      colorspace_2.1-1     cli_3.6.3           
## [49] magrittr_2.0.3       utf8_1.2.4           scales_1.3.0        
## [52] rmarkdown_2.27       skmeans_0.2-16       igraph_2.0.3        
## [55] nnet_7.3-19          reticulate_1.38.0    gridExtra_2.3       
## [58] png_0.1-8            askpass_1.2.0        zoo_1.8-12          
## [61] modeltools_0.2-23    evaluate_0.24.0      knitr_1.48          
## [64] viridisLite_0.4.2    rlang_1.1.4          Rcpp_1.0.13         
## [67] dendextend_1.17.1    glue_1.7.0           jsonlite_1.8.8      
## [70] R6_2.5.1