New parameter: rng_type. This will be used in favor of
the boolean pcg_rand parameter, although
pcg_rand will still work for backwards compatibility.

Set rng_type = "deterministic" to use a deterministic sampling
of vertices during the optimization phase. This should give
qualitatively similar results to using a real PRNG, but has the
advantage of being faster and giving more reproducible output. This
feature was inspired by a comment by Leland
McInnes on Reddit.
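As a minimal sketch of the new parameter (iris is used purely for illustration):

```r
library(uwot)

# Request the deterministic vertex sampler instead of a real PRNG during
# the optimization phase.
X <- as.matrix(iris[, 1:4])
emb <- umap2(X, rng_type = "deterministic")
```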
Setting num_threads directly in umap2 did not result in the number of SGD threads being updated to that value when
batch = TRUE, which it should have been.

umap_transform continued to return the fuzzy graph in
transposed form. Thank you PedroMilanezAlmeida
for reopening the issue (https://github.com/jlmelville/uwot/issues/118).

repulsion_strength was silently ignored if used with
tumap or umap2 with a = 1, b = 1.
Ignoring the setting was on purpose, but it was not documented anywhere.
repulsion_strength is now compatible with these
settings.

The pca argument is now ignored if
the input data has a maximum rank smaller than the value of
pca. No PCA is applied in this case. If
verbose = TRUE, a message will be printed to inform the
user.

RSpectra is now a required dependency (again). It was a
required dependency up until version 0.1.12, when it became optional
(irlba was used in its place). However, problems with
interactions of the current version of irlba with an ABI
change in the Matrix package mean that it’s hard for
downstream packages and users to build uwot without
re-installing Matrix and irlba from source,
which may not be an option for some people. Also it was causing a CRAN
check error. I have changed some tests, examples and vignettes to use
RSpectra explicitly, and to only test irlba
code-paths where necessary. See https://github.com/jlmelville/uwot/issues/115 and links
therein for more details.

New optional nearest neighbor search method via the RcppHNSW package: set nn_method = "hnsw" to use it. The behavior of the
method can be controlled by the new nn_args parameter, a
list which may contain M, ef_construction and
ef. See the hnswlib library’s ALGO_PARAMS
documentation for details on these parameters. Although typically
faster than Annoy (for a given accuracy), be aware that the only
supported metric values are "euclidean",
"cosine" and "correlation". Finally, RcppHNSW
is only a suggested package, not a requirement, so you need to install
it yourself (e.g. via install.packages("RcppHNSW")). Also
see the article
on HNSW in uwot in the documentation.

New optional nearest neighbor search method via the rnndescent package: set nn_method = "nndescent" to use it. The
behavior of the method can be controlled by the new nn_args
parameter. There are many supported metrics and possible parameters that
can be set in nn_args, so please see the article
on nearest neighbor descent in uwot in the documentation, and also
the rnndescent package’s documentation
for details. rnndescent is only a suggested package, not a
requirement, so you need to install it yourself (e.g. via
install.packages("rnndescent")).umap2, which acts like umap
New function: umap2, which acts like umap but with modified defaults, reflecting my experience with UMAP and
correcting some small mistakes. See the umap2
article for more details.

Fixed a bug where init_sdev = "range" caused an error with a
user-supplied init matrix.

The correlation metric was
actually using the cosine metric if you saved and reloaded
the model. Thank you Holly Hall
for the report and helpful detective work (https://github.com/jlmelville/uwot/issues/117).

umap_transform could fail if the new data to be
transformed had the scaled:center and
scaled:scale attributes set (e.g. from applying the
scale function).

When asking umap_transform to return the fuzzy graph (
ret_extra = c("fgraph")), it was transposed when
batch = TRUE, n_epochs = 0. Thank you PedroMilanezAlmeida
for reporting (https://github.com/jlmelville/uwot/issues/118).

Using n_sgd_threads = "auto" with
umap_transform caused a crash.

Fixed an issue with input data of dist class that may have been particularly
affecting Seurat users. Thank you AndiMunteanu for reporting
(and suggesting a solution) (https://github.com/jlmelville/uwot/issues/121).

New function: optimize_graph_layout. Use this to
produce optimized output coordinates that reflect an input similarity
graph (such as that produced by the similarity_graph
function). similarity_graph followed by
optimize_graph_layout is the same as running
umap, so the purpose of these functions is to allow for
more flexibility and decoupling between generating the nearest neighbor
graph and optimizing the low-dimensional approximation to it. Based on a
request by user Chengwei94
(https://github.com/jlmelville/uwot/issues/98).
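For example, a sketch of the two-step workflow on random data (extra arguments to optimize_graph_layout are omitted here):

```r
library(uwot)

set.seed(42)
X <- matrix(rnorm(100 * 10), nrow = 100)

# Step 1: the high-dimensional fuzzy simplicial set only.
graph <- similarity_graph(X, n_neighbors = 15)

# Step 2: optimize low-dimensional coordinates that reflect the graph;
# roughly equivalent to having run umap(X, n_neighbors = 15) directly.
emb <- optimize_graph_layout(graph)
```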
New functions: simplicial_set_union and simplicial_set_intersect. These allow for the combination
of different fuzzy graph representations of a dataset into a single
fuzzy graph using the UMAP simplicial set operations. Based on a request
in the Python UMAP issues tracker by user Dhar xion.
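An illustrative sketch (random data standing in for two views of the same 100 observations; only the two graph arguments are shown, as the remaining argument names are not documented here):

```r
library(uwot)

set.seed(42)
view1 <- matrix(rnorm(100 * 5), nrow = 100)
view2 <- matrix(rnorm(100 * 8), nrow = 100)

g1 <- similarity_graph(view1)
g2 <- similarity_graph(view2)

# Combine the two fuzzy graph representations into one; the result can then
# be embedded with optimize_graph_layout if desired.
g_and <- simplicial_set_intersect(g1, g2)
g_or <- simplicial_set_union(g1, g2)
```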
New parameter for umap_transform: ret_extra. This works like the equivalent parameter for
umap, and should be a character vector specifying the extra
information you would like returned in addition to the embedding, in
which case a list will be returned with an embedding member
containing the optimized coordinates. Supported values are
"fgraph", "nn", "sigma" and
"localr". Based on a request by user PedroMilanezAlmeida
(https://github.com/jlmelville/uwot/issues/104).

New parameter for umap, tumap and
umap_transform: seed. This will do the
equivalent of calling set.seed internally, and hence will
help with reproducibility. The chosen seed is exported if
ret_model = TRUE and umap_transform will use
that seed if present, so you only need to specify it in
umap_transform if you want to change the seed. The default
behavior remains to not modify the random number state. Based on a
request by SuhasSrinivasan (https://github.com/jlmelville/uwot/issues/110).
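A sketch combining the seed and ret_extra additions (iris used purely for illustration):

```r
library(uwot)

X <- as.matrix(iris[, 1:4])
train <- X[1:100, ]
test <- X[101:150, ]

# The seed is stored in the model (ret_model = TRUE), so umap_transform
# reuses it unless you pass a different one.
model <- umap(train, ret_model = TRUE, seed = 42)

# Ask for extras: the result is now a list with an "embedding" member.
res <- umap_transform(test, model, ret_extra = c("fgraph", "nn"))
head(res$embedding)
```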
init_sdev = "range" and initial coordinates will be
range-scaled so each column takes values between 0 and 10. This
pre-processing was added to the Python UMAP package at some point after
uwot began development and so should probably always be
used with the default init = "spectral" setting. However,
it is not set by default to maintain backwards compatibility with older
versions of uwot.

ret_extra = c("sigma") is now supported by
lvish. The Gaussian bandwidths are returned in a
sigma vector. In addition, a vector of intrinsic
dimensionalities estimated for each point using an analytical expression
of the finite difference method given by Lee and
co-workers is returned in the dint vector.

The min_dist and spread parameters are now
returned in the model when umap is run with
ret_model = TRUE. This is just for documentation purposes,
these values are not used directly by the model in
umap_transform. If the parameters a and
b are set directly when invoking umap, then
both min_dist and spread will be set to
NULL in the returned model. This feature was added in
response to a question from kjiang18 (https://github.com/jlmelville/uwot/issues/95).

A warning is now issued if n_components seems to have been
set too high.

If n_components was greater than
n_neighbors then umap_transform would crash
the R session. Thank you to ChVav
for reporting this (https://github.com/jlmelville/uwot/issues/102).

Using umap_transform with a model where
dens_scale was set could cause a segmentation fault,
destroying the session. Even if it didn’t, it could give an entirely
artifactual “ring” structure. Thank you FemkeSmit for reporting this and
providing assistance in diagnosing the underlying cause (https://github.com/jlmelville/uwot/issues/103).

If binary_edge_weights = TRUE was used, this setting was
not exported when ret_model = TRUE, and was therefore not
respected by umap_transform. This has now been fixed, but
you will need to regenerate any models that used binary edge
weights.

The documentation for the init param said that if there were
multiple disconnected components, a spectral initialization would
attempt to merge multiple sub-graphs. Not true: actually, spectral
initialization is abandoned in favor of PCA. The documentation has been
updated to reflect the true state of affairs. No idea what I was
thinking of there.

load_model and save_model didn’t work on
Windows 7 due to how the version of tar there handles drive
letters. Thank you mytarmail
for the report (https://github.com/jlmelville/uwot/issues/109).

New function: similarity_graph. If you are more
interested in the high-dimensional graph/fuzzy simplicial set
representation of your input data, and don’t care about the low
dimensional approximation, the similarity_graph function
offers a similar API to umap, but neither the
initialization nor optimization of low-dimensional coordinates will be
performed. The return value is the same as that which would be returned
in the results list as the fgraph member if you had
provided ret_extra = c("fgraph"). Compared to getting the
same result via running umap, this function is a bit more
convenient to use, makes your intention clearer if you would be
discarding the embedding, and saves a small amount of time. A
t-SNE/LargeVis similarity graph can be returned by setting
method = "largevis".umap_transform with
pre-generated nearest neighbors (also the error message was completely
useless). Thank you to AustinHartman for reporting
this (https://github.com/jlmelville/uwot/issues/97).

The behavior of some internal functions (notably fuzzy_simplicial_set) was refactored to behave more like that
of previous versions. This change was breaking the behavior of the CRAN
package bbknnR.

New parameter: dens_weight. If set to a value between 0
and 1, an attempt is made to include the relative local densities of the
input data in the output coordinates. This is an approximation to the densMAP method. A
large value of dens_weight will use a larger range of
output densities to reflect the input data. If the data is too spread
out, reduce the value of dens_weight. For more information
see the documentation
at the uwot repo.

New parameter: binary_edge_weights. If set to
TRUE, instead of smoothed knn distances, non-zero edge
weights all have a value of 1. This is how PaCMAP works and
there are practical and theoretical
reasons to believe this won’t have a big effect on UMAP but you can try
it yourself.

New values for ret_extra:
"sigma": the return value will contain a
sigma entry, a vector of the smooth knn distance scaling
normalization factors, one for each observation in the input data. A
small value indicates a high density of points in the local neighborhood
of that observation. For lvish the equivalent bandwidths
calculated for the input perplexity is returned.

A vector rho will also be exported, which is the
distance to the nearest neighbor after the number of neighbors specified
by the local_connectivity. Only applies for
umap and tumap."localr": exports a vector of the local radii, the sum
of sigma and rho, which is used to scale the output
coordinates when dens_weight is set. Even if not using
dens_weight, visualizing the output coordinates using a
color scale based on the value of localr can reveal regions
of the input data with different densities.

For umap and tumap only: new
data type for precomputed nearest neighbor data passed as the
nn_method parameter: you may use a sparse distance matrix
of format dgCMatrix with dimensions N x N
where N is the number of observations in the input data.
Distances should be arranged by column, i.e. a non-zero entry in row
j of the ith column indicates that the
jth observation in the input data is a nearest neighbor of
the ith observation with the distance given by the value of
that element. Note that this is a different format to the sparse
distance matrix that can be passed as input to X: notably,
the matrix is not assumed to be symmetric. Unlike other input formats,
you may have a different number of neighbors for each observation (but
there must be at least one neighbor defined per observation).

umap_transform can also take a sparse distance matrix
as its nn_method parameter if precomputed nearest neighbor
data is used to generate an initial model. The format is the same as for
the nn_method with umap. Because distances are
arranged by columns, the expected dimensions of the sparse matrix are
N_model x N_new where N_model is the number of
observations in the original data and N_new is the number
of observations in the data to be transformed.
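A small sketch of the column-oriented sparse format (random data, brute-force neighbors, k = 5):

```r
library(uwot)
library(Matrix)

set.seed(42)
n <- 100
k <- 5
X <- matrix(rnorm(n * 10), nrow = n)

# Column i holds the distances from observation i to its neighbors:
# a non-zero entry in row j of column i means j is a neighbor of i.
d <- as.matrix(dist(X))
idx <- apply(d, 2, function(col) order(col)[2:(k + 1)])  # skip self
nn_sp <- sparseMatrix(i = as.vector(idx),
                      j = rep(seq_len(n), each = k),
                      x = d[cbind(as.vector(idx), rep(seq_len(n), each = k))],
                      dims = c(n, n))

emb <- umap(X, nn_method = nn_sp)
```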
For large values of n_components (e.g. n_components = 100 or higher), RSpectra is recommended and will likely out-perform irlba even if you
have installed a good linear algebra library.

init = "laplacian" returned the wrong coordinates
because of a slightly subtle issue around how to order the eigenvectors
when using the random walk transition matrix rather than normalized
graph laplacians.

The init_sdev parameter was ignored when the
init parameter was a user-supplied matrix. Now the input
will be scaled.

The behavior of the bandwidth parameter has been
changed to give results more like the current version (0.5.2) of the
Python UMAP implementation. This is likely to be a breaking change for
non-default settings of bandwidth, but this is not a
parameter which is actually exposed by the Python UMAP public API any
more, so is on the road to deprecation in uwot too and I don’t recommend
you change this.

New parameter: batch. If TRUE, then
results are reproducible when n_sgd_threads > 1 (as long
as you use set.seed). The price to be paid is that the
optimization is slightly less efficient (because coordinates are not
updated as quickly and hence gradients are staler for longer), so it is
highly recommended to set n_epochs = 500 or higher. Thank
you to Aaron Lun who not only came
up with a way to implement this feature, but also wrote an entire C++ implementation of UMAP
which does it (https://github.com/jlmelville/uwot/issues/83).

New parameter: opt_args. The default optimization
method when batch = TRUE is Adam. You can control its
parameters by passing them in the opt_args list. As Adam is
a momentum-based method it requires extra storage of previous gradient
data. To avoid the extra memory overhead you can also use
opt_args = list(method = "sgd") to use a stochastic
gradient descent method like that used when
batch = FALSE.

New parameter: epoch_callback. You may now pass a
function which will be invoked at the end of each epoch. Mainly useful
for producing an image of the state of the embedding at different points
during the optimization. This is another feature taken from umappp.
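A sketch pulling these three additions together (random data; the callback arguments shown here follow an epoch/n_epochs/coordinates pattern and should be checked against the documentation):

```r
library(uwot)

set.seed(42)
X <- matrix(rnorm(500 * 10), nrow = 500)

emb <- umap(X,
            # reproducible with multiple SGD threads (given set.seed),
            # so use more epochs than usual
            batch = TRUE, n_epochs = 500, n_sgd_threads = 2,
            # plain SGD instead of the default Adam, to save memory
            opt_args = list(method = "sgd"),
            # report progress every 100 epochs
            epoch_callback = function(epoch, n_epochs, coords) {
              if (epoch %% 100 == 0) message("epoch ", epoch)
            })
```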
New parameter: pca_method, used when the pca parameter is supplied to reduce the initial
dimensionality of the data. This controls which method is used to carry
out the PCA and can be set to one of:
"irlba" which uses irlba::irlba to
calculate a truncated SVD. If this routine deems that you are trying to
extract 50% or more of the singular vectors, you will see a warning to
that effect logged to the console.

"rsvd", which uses irlba::svdr for
truncated SVD. This method uses a small number of iterations which
should give an accuracy/speed up trade-off similar to that of the scikit-learn
TruncatedSVD method. This can be much faster than using
"irlba" but potentially at a cost in accuracy. However, for
the purposes of dimensionality reduction as input to nearest neighbor
search, this doesn’t seem to matter much.

"bigstatsr", which uses the bigstatsr
package. Note that this is not a
dependency of uwot. If you want to use
bigstatsr, you must install it yourself. On platforms
without easy access to fast linear algebra libraries (e.g. Windows),
using bigstatsr may give a speed up to PCA
calculations."svd", which uses base::svd.
Warning: this is likely to be very slow for most
datasets and exists as a fallback for small datasets where the
"irlba" method would print a warning."auto" (the default) which uses "irlba" to
calculate a truncated SVD, unless you are attempting to extract 50% or
more of the singular vectors, in which case "svd" is
used.
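For example (random high-dimensional data standing in for something larger):

```r
library(uwot)

set.seed(42)
X <- matrix(rnorm(1000 * 200), nrow = 1000)

# Reduce to 50 dimensions before the nearest neighbor search, using the
# faster randomized SVD backend.
emb <- umap(X, pca = 50, pca_method = "rsvd")
```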
Row names from the input data (or other inputs such as precomputed nearest neighbor data or an initialization matrix) are now retained in the output, including the nearest neighbor data returned with ret_nn = TRUE. If the names exist in more than one of the input data parameters listed above,
but are inconsistent, no guarantees are made about which names will be
used. Thank you jwijffels for
reporting this.

In umap_transform, the learning rate is now down-scaled
by a factor of 4, consistent with the Python implementation of UMAP. If
you need the old behavior back, use the (newly added)
learning_rate parameter in umap_transform to
set it explicitly. If you used the default value in umap
when creating the model, the correct setting in
umap_transform is learning_rate = 1.0.

Using nn_method = "annoy" and
verbose = TRUE would lead to an error with datasets with
fewer than 50 items in them.

Pre-calculated nearest neighbor data could not be used with umap_transform (this was incorrectly documented to work).

The documentation for nearest neighbor data in umap_transform was wrong in other ways: it has now been
corrected to indicate that there should be neighbor data for each item
in the test data, but the neighbors and distances should refer to items
in training data (i.e. the data used to build the model).

The n_neighbors parameter is now correctly ignored in model
generation if pre-calculated nearest neighbor data is provided.

Setting grain_size didn’t do
anything.

This release is mainly to allow for some internal changes to keep compatibility with RcppAnnoy, used for the nearest neighbor calculations.

The man pages for
umap and
tumap now note that the contents of the model
list are subject to change and not intended to be part of the uwot
public API. I recommend not relying on the structure of the
model, especially if your package is intended to appear on
CRAN or Bioconductor, as any breakages will delay future releases of
uwot to CRAN.metric = "correlation" a distance based on
the Pearson correlation (https://github.com/jlmelville/uwot/issues/22).
Supporting this required a change to the internals of how nearest
neighbor data is stored. Backwards compatibility with models generated
by previous versions using ret_model = TRUE should have
been preserved.

New parameter: nn_method, for
umap_transform: pass a list containing pre-computed nearest
neighbor data (identical to that used in the umap
function). You should not pass anything to the X parameter
in this case. This extends the functionality for transforming new points
to the case where nearest neighbor data between the original data and
new data can be calculated external to uwot. Thanks to Yuhan Hao for contributing the PR
(https://github.com/jlmelville/uwot/issues/63 and https://github.com/jlmelville/uwot/issues/64).

New parameter: init, for umap_transform:
provides a variety of options for initializing the output coordinates,
analogously to the same parameter in the umap function (but
without as many options currently). This is intended to replace
init_weighted, which should be considered deprecated, but
won’t be removed until uwot 1.0 (whenever that is). Instead of
init_weighted = TRUE, use init = "weighted";
replace init_weighted = FALSE with
init = "average". Additionally, you can pass a matrix to
init to act as the initial coordinates.

In umap_transform: previously, setting
n_epochs = 0 was ignored: at least one iteration of
optimization was applied. Now, n_epochs = 0 is respected,
and will return the initialized coordinates without any further
optimization.

When verbose = TRUE, the progress bar calculations were taking up a detectable amount of time; this has now been fixed. With
very small data sets (< 50 items) the progress bar will no longer
appear when building the index.

The default for n_threads is now NULL to
provide a bit more protection from changing dependencies.

The grain_size parameter has been undeprecated. As the
version that deprecated this never made it to CRAN, this is unlikely to
have affected many people.

The grain_size parameter is now ignored and remains to
avoid breaking backwards compatibility only.

New parameter: ret_extra, a vector which can contain
any combination of: "model" (same as
ret_model = TRUE), "nn" (same as
ret_nn = TRUE) and fgraph (see below).

If the ret_extra vector contains
"fgraph", the returned list will contain an
fgraph item representing the fuzzy simplicial input graph
as a sparse N x N matrix. For lvish, use "P"
instead of "fgraph” (https://github.com/jlmelville/uwot/issues/47). Note that
there is a further sparsifying step where edges with a very low
membership are removed if there is no prospect of the edge being sampled
during optimization. This is controlled by n_epochs: the
smaller the value, the more sparsifying will occur. If you are only
interested in the fuzzy graph and not the embedded coordinates, set
n_epochs = 0.
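A sketch of extracting the full fuzzy graph without optimizing an embedding (random data):

```r
library(uwot)

set.seed(42)
X <- matrix(rnorm(200 * 10), nrow = 200)

# n_epochs = 0 skips optimization, so no low-membership edges are dropped.
res <- umap(X, ret_extra = c("fgraph"), n_epochs = 0)
dim(res$fgraph)  # sparse 200 x 200 matrix
```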
New function: unload_uwot, to unload the Annoy nearest neighbor indices in a model. This prevents the model from being used in
umap_transform, but allows for the temporary working
directory created by both save_uwot and
load_uwot to be deleted. Previously, both
load_uwot and save_uwot were attempting to
delete the temporary working directories they used, but would always
silently fail because Annoy is making use of files in those
directories.init = "spca", fixed values of a and
b (rather than allowing them to be calculated through
setting min_dist and spread) and
approx_pow = TRUE. Using the tumap method with
init = "spca" is probably the most robust approach.n_epochs = 0. This used to behave
like (n_epochs = NULL) and gave a default number of epochs
(dependent on the number of vertices in the dataset). Now it more
usefully carries out all calculations except optimization, so the
returned coordinates are those specified by the init
parameter, so this is an easy way to access e.g. the spectral or PCA
initialization coordinates. If you want the input fuzzy graph
(ret_extra vector contains "fgraph"), this
will also prevent edges with very low membership from being
removed. You still get the old default epochs behavior by setting
n_epochs = NULL or to a negative value.

save_uwot and load_uwot have been updated
with a verbose parameter so it’s easier to see what
temporary files are being created.

save_uwot has a new parameter, unload,
which if set to TRUE will delete the working directory for
you, at the cost of unloading the model, i.e. it can’t be used with
umap_transform until you reload it with
load_uwot.

save_uwot now returns the saved model with an extra
field, mod_dir, which points to the location of the
temporary working directory, so you should now assign the result of
calling save_uwot to the model you saved, e.g.
model <- save_uwot(model, "my_model_file"). This field
is intended for use with unload_uwot.

load_uwot also returns the model with a
mod_dir item for use with unload_uwot.
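A sketch of the save/load/unload workflow described above (the file path is just an example):

```r
library(uwot)

X <- as.matrix(iris[, 1:4])
model <- umap(X, ret_model = TRUE)

# Re-assign the result: the returned model gains the mod_dir field.
model_file <- tempfile("uwot_model_")
model <- save_uwot(model, model_file)

# Later, or in another session:
model <- load_uwot(model_file)
emb2 <- umap_transform(as.matrix(iris[1:10, 1:4]), model)

# Unload the Annoy indices so the temporary working directory can be
# deleted; the model can no longer be used with umap_transform afterwards.
unload_uwot(model)
```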
save_uwot and load_uwot were not correctly handling relative paths.

The changes made to load_uwot in uwot 0.1.4 to work
with newer versions of RcppAnnoy (https://github.com/jlmelville/uwot/issues/31) failed in
the typical case of a single metric for the nearest neighbor search
using all available columns, giving an error message along the lines of:
Error: index size <size> is not a multiple of vector size <size>.
This has now been fixed, but required changes to both
save_uwot and load_uwot, so existing saved
models must be regenerated. Thank you to reporter OuNao.

A non-integer value of n_threads caused a crash. This was particularly insidious
if running with a system with only one default thread available as the
default n_threads becomes 0.5. Now
n_threads (and n_sgd_threads) are rounded to
the nearest integer.

Fixed the error: ERROR: there is already an InterruptableProgressMonitor instance defined.

If verbose = TRUE, the a, b
curve parameters are now logged.

Even with a fix for the bug mentioned above, if the nearest neighbor
index file is larger than 2GB in size, Annoy may not be able to read the
data back in. This should only occur with very large or high-dimensional
datasets. The nearest neighbor search will fail under these conditions.
A work-around is to set n_threads = 0, because the index
will not be written to disk and re-loaded under these circumstances, at
the cost of a longer search time. Alternatively, set the
pca parameter to reduce the dimensionality or lower
n_trees, both of which will reduce the size of the index on
disk. However, either may lower the accuracy of the nearest neighbor
results.
Initial CRAN release.
New parameter: tmpdir, which allows the user to specify
the temporary directory where nearest neighbor indexes will be written
during Annoy nearest neighbor search. The default is
base::tempdir(). Only used if n_threads > 1
and nn_method = "annoy".Fixed an issue with lvish where there was an
off-by-one error when calculating input probabilities.
Added a safe-guard to lvish to prevent the gaussian
precision, beta, becoming overly large when the binary search fails
during perplexity calibration.
The lvish perplexity calibration uses the
log-sum-exp trick to avoid numeric underflow if beta becomes
large.
New parameter: pcg_rand. If TRUE (the
default), then a random number generator from the PCG family is used during the
stochastic optimization phase. The old PRNG, a direct translation of an
implementation of the Tausworthe “taus88” PRNG used in the Python
version of UMAP, can be obtained by setting
pcg_rand = FALSE. The new PRNG is slower, but is likely
superior in its statistical randomness. This change in behavior will
break backwards compatibility: you will now get slightly different
results even with the same seed.

New parameter: fast_sgd. If TRUE, then the
following combination of parameters are set:
n_sgd_threads = "auto", pcg_rand = FALSE and
approx_pow = TRUE. These will result in a substantially
faster optimization phase, at the cost of being slightly less accurate
and results not being exactly repeatable. fast_sgd = FALSE
by default but if you are only interested in visualization, then
fast_sgd gives perfectly good results. For more generic
dimensionality reduction and reproducibility, keep
fast_sgd = FALSE.

New parameter: init_sdev, which specifies how large the
standard deviation of each column of the initial coordinates should be.
This will scale any input coordinates (including user-provided matrix
coordinates). init = "spca" can now be thought of as an
alias of init = "pca", init_sdev = 1e-4. This may be too
aggressive scaling for some datasets. The typical UMAP spectral
initializations tend to result in standard deviations of around
2 to 5, so this might be more appropriate in
some cases. If spectral initialization detects multiple components in
the affinity graph and falls back to scaled PCA, it uses
init_sdev = 1.

As a result of adding init_sdev, the init
options sspectral, slaplacian and
snormlaplacian have been removed (they weren’t around for
very long anyway). You can get the same behavior by e.g.
init = "spectral", init_sdev = 1e-4.
init = "spca" is sticking around because I use it a
lot.init = "spca".<random> header. This
breaks backwards compatibility even if you set
pcg_rand = FALSE.metric = "cosine" results were incorrectly using the
unmodified Annoy angular distance.

Fixed a bug in the categorical metric (fixes https://github.com/jlmelville/uwot/issues/20).

Optimization performance improvements for larger values of n_components (e.g. approximately 50% faster optimization
time with MNIST and n_components = 50).

New parameter: pca_center, which controls whether to
center the data before applying PCA. It would be typical to set this to
FALSE if you are applying PCA to binary data (although note
you can’t use this setting with
metric = "hamming").

PCA is now applied when the metric is
"manhattan" and "cosine". It’s still
not applied when using "hamming" (data still needs
to be in binary format, not real-valued).

You can now override the pca and
pca_center parameter values for a given data block by using
a list for the value of the metric, with the column ids/names as an
unnamed item and the overriding values as named items, e.g. instead of
manhattan = 1:100, use
manhattan = list(1:100, pca_center = FALSE) to turn off PCA
centering for just that block. This functionality exists mainly for the
case where you have mixed binary and real-valued data and want to apply
PCA to both data types. It’s normal to apply centering to real-valued
data but not to binary data.

Fixed a bug in umap_transform, where negative
sampling was over the size of the test data (should be the training
data).

If verbose = TRUE, the Annoy recall accuracy is logged,
which may help tune values of n_trees and
search_k.

New parameter: n_sgd_threads, which controls the number
of threads used in the stochastic gradient descent. By default this is
now single-threaded and should result in reproducible results when using
set.seed. To get back the old, less consistent, but faster
settings, set n_sgd_threads = "auto".

The alpha parameter is now learning_rate.

The gamma parameter is now repulsion_strength.

New init options: Laplacian Eigenmap (laplacian and normlaplacian).

New init options: sspectral,
snormlaplacian and slaplacian. These are like
spectral, normlaplacian,
laplacian respectively, but scaled so that each dimension
has a standard deviation of 1e-4. This is like the difference between
the pca and spca options.

New parameter pca: set this to a positive integer to
reduce matrices or data frames to that number of columns using PCA. Only
works if metric = "euclidean". If you have > 100
columns, this can substantially improve the speed of the nearest
neighbor search. t-SNE implementations often set this value to 50.

Extended syntax for metric:
instead of specifying a single metric name
(e.g. metric = "euclidean"), you can pass a list, where the
name of each item is the metric to use and the value is a vector of the
names of the columns to use with that metric, e.g.
metric = list("euclidean" = c("A1", "A2"), "cosine" = c("B1", "B2", "B3"))
treats columns A1 and A2 as one block, using
the Euclidean distance to find nearest neighbors, whereas
B1, B2 and B3 are treated as a
second block, using the cosine distance.
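A sketch using a small data frame with two blocks of made-up columns:

```r
library(uwot)

set.seed(42)
df <- data.frame(A1 = rnorm(100), A2 = rnorm(100),
                 B1 = rnorm(100), B2 = rnorm(100), B3 = rnorm(100))

# Euclidean nearest neighbors for the A block, cosine for the B block;
# the two blocks are combined into a single fuzzy graph.
emb <- umap(df,
            metric = list("euclidean" = c("A1", "A2"),
                          "cosine" = c("B1", "B2", "B3")))
```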
New metric for y: categorical.

y may now be a data frame or matrix if multiple target data is available.

New parameter: target_metric, to specify the distance
metric to use with numerical y. This has the same
capabilities as metric.scale = "Z" To Z-scale each column of input (synonym
for scale = TRUE or scale = "scale").scale = "colrange" to scale columns
in the range (0, 1).

For y, you may pass
nearest neighbor data directly, in the same format as that supported by
X-related nearest neighbor data. This may be useful if you
don’t want to use Euclidean distances for the y data, or if
you have missing data (and have a way to assign nearest neighbors for
those cases, obviously). See the Nearest
Neighbor Data Format section for details.

New parameter ret_nn: when TRUE returns
nearest neighbor matrices as a nn list: indices in item
idx and distances in item dist. Embedded
coordinates are in embedding. Both ret_nn and
ret_model can be TRUE, and should not cause
any compatibility issues with supervised embeddings.

nn_method can now take precomputed nearest neighbor
data. Must be a list of two matrices: idx, containing
integer indexes, and dist containing distances. By no
coincidence, this is the format returned by ret_nn.
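A sketch of the precomputed nearest neighbor list format (brute force on random data, with each observation itself as its first neighbor, mirroring what ret_nn returns):

```r
library(uwot)

set.seed(42)
n <- 100
k <- 15
X <- matrix(rnorm(n * 10), nrow = n)

d <- as.matrix(dist(X))
idx <- t(apply(d, 1, function(row) order(row)[1:k]))   # n x k indices
dmat <- t(apply(d, 1, function(row) sort(row)[1:k]))   # n x k distances

emb <- umap(X, nn_method = list(idx = idx, dist = dmat))
```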
modified, in defiance of basic R pass-by-copy semantics.metric = "cosine" is working again for
n_threads greater than 0 (https://github.com/jlmelville/uwot/issues/5)August 5 2018. You can now use an existing embedding to
add new points via umap_transform. See the example section
below.
August 1 2018. Numerical vectors are now supported for supervised dimension reduction.
July 31 2018. (Very) initial support for supervised
dimension reduction: categorical data only at the moment. Pass in a
factor vector (use NA for unknown labels) as the
y parameter and edges with bad (or unknown) labels are
down-weighted, hopefully leading to better separation of classes. This
works remarkably well for the Fashion MNIST dataset.
July 22 2018. You can now use the cosine and Manhattan
distances with the Annoy nearest neighbor search, via
metric = "cosine" and metric = "manhattan",
respectively. Hamming distance is not supported because RcppAnnoy
doesn’t yet support it.