The hierBAPS algorithm was introduced in (Cheng et al. 2013) and provides a method for hierarchically clustering DNA sequence data to reveal nested population structure. Previously the algorithm was available as a compiled MATLAB binary. We provide a convenient R implementation and include a number of useful additional options including the ability to use multiple cores, save the log marginal likelihood scores and run the algorithm until local convergence. Furthermore, we provide a wrapper to a ggtree plotting function allowing for easy exploration of sub-clusters.
Things to keep in mind before running hierBAPS
n.extra.rounds
parameter.library(rhierbaps)
library(ggtree)
library(phytools)
library(ape)
set.seed(1234)
We first need to load a multiple sequence alignment in fasta format. We can then generate the required SNP matrix.
fasta.file.name <- system.file("extdata", "seqs.fa", package = "rhierbaps")
snp.matrix <- load_fasta(fasta.file.name)
If you wish to include singleton SNPs (those that appear in only one
isolate) then set keep.singletons=FALSE
. However, this is
currently advised against as these SNPs lead to a higher number of
parameters in the model and do not provide information about shared
ancestry.
It is also possible to load an ape DNAbin object. Here me make use of the woodmouse dataset in ape.
data(woodmouse)
woodmouse.snp.matrix <- load_fasta(woodmouse)
We now need to decide how many levels of clustering we are interested
in and the number of initial clusters to start from. It is a good idea
to choose n.pops
to be significantly larger than the number
of clusters you expect.
To run hierBAPS with \(2\) levels and \(20\) initial clusters we run
hb.results <- hierBAPS(snp.matrix, max.depth = 2, n.pops = 20, quiet = TRUE)
head(hb.results$partition.df)
## Isolate level 1 level 2
## 1 1 1 1
## 2 2 1 1
## 3 3 1 1
## 4 4 2 5
## 5 5 3 9
## 6 6 3 9
This produces a list which includes a data frame indicating the resulting partition of the isolates at the difference levels. The isolate names in this data frame are taken from the fasta headers and thus for plotting it is important that these match the isolate names in any tree used later. This function also outputs the log marginal likelihoods at the different levels of clustering.
hierBAPS can also be run until the algorithm converges to a local optimum as
hb.results <- hierBAPS(snp.matrix, max.depth = 2, n.pops = 20, n.extra.rounds = Inf,
quiet = TRUE)
We can also check how long hierBAPS takes to run on the test dataset of 515 samples and 744 SNPs.
system.time(hierBAPS(snp.matrix, max.depth = 2, n.pops = 20, quiet = TRUE))
## user system elapsed
## 84.334 15.494 100.946
To plot the results it is useful to consider a tree of the same isolates. We clustered the example isolates using Iqtree (Kalyaanamoorthy et al. 2017). The ggtree (Yu et al. 2017) package then allows us to plot the results.
First we need to load the newick file.
newick.file.name <- system.file("extdata", "seqs.fa.treefile", package = "rhierbaps")
iqtree <- phytools::read.newick(newick.file.name)
A simple coloured tree allows us to see the top level cluster assignment from hierBAPS.
gg <- ggtree(iqtree, layout = "circular")
gg <- gg %<+% hb.results$partition.df
gg <- gg + geom_tippoint(aes(color = factor(`level 1`)))
gg
As there are many more clusters at the second level using colours to distinguish them can get confusing. Instead we can label the tips with their corresponding clusters.
gg <- ggtree(iqtree, layout = "circular", branch.length = "none")
gg <- gg %<+% hb.results$partition.df
gg <- gg + geom_tippoint(aes(color = factor(`level 1`)))
gg <- gg + theme(legend.position = "right")
gg <- gg + geom_tiplab(aes(label = `level 2`), size = 1, offset = 1)
gg
We can also zoom in on a particular top level cluster to get a better idea of how it is partitioned at the lower level. As an example we zoom in on sub cluster 9 at level 1.
plot_sub_cluster(hb.results, iqtree, level = 1, sub.cluster = 9)
Finally, we can inspect the log marginal likelihoods given for each level.
hb.results$lml.list
## $`Depth 0`
## 1
## -50858.92
##
## $`Depth 1`
## 1 2 3 4 5 6 7
## -2121.8599 -4012.3594 -4237.7639 -3095.1865 -1525.7356 -3180.7572 -4015.5020
## 8 9 10 11 12 13
## -2104.5277 -1736.0192 -780.0635 -810.7793 -688.5214 -163.3198
We can also calculate the individual probabilities of assignment to each cluster. Here we make use of the woodmouse dataset loaded earlier.
hb.results.woodmouse <- hierBAPS(woodmouse.snp.matrix, max.depth = 2, n.extra.rounds = Inf,
quiet = TRUE, assignment.probs = TRUE)
head(hb.results.woodmouse$cluster.assignment.prob[[1]])
## Cluster 1 Cluster 2 Cluster 3
## No305 9.997868e-01 2.104112e-04 2.805482e-06
## No304 5.620947e-06 9.999944e-01 1.699254e-11
## No306 8.996214e-03 9.910038e-01 9.626735e-09
## No0906S 9.965743e-01 3.425673e-03 1.902359e-08
## No0908S 9.911304e-01 8.869359e-03 2.068655e-07
## No0909S 2.615477e-09 1.105831e-10 1.000000e+00
For runs that take a long time it is a good idea to save the output. We can save the partition file as
write.csv(hb.results$partition.df, file = file.path(tempdir(), "hierbaps_partition.csv"),
col.names = TRUE, row.names = FALSE)
save_lml_logs(hb.results, file.path(tempdir(), "hierbaps_logML.txt"))
If you use rhierbaps in a research publication please cite both
Tonkin-Hill, Gerry, John A. Lees, Stephen D. Bentley, Simon D. W. Frost, and Jukka Corander. 2018. “RhierBAPS: An R Implementation of the Population Clustering Algorithm hierBAPS.” Wellcome Open Research 3 (July): 93.
Cheng, Lu, Thomas R. Connor, Jukka Sirén, David M. Aanensen, and Jukka Corander. 2013. “Hierarchical and Spatially Explicit Clustering of DNA Sequences with BAPS Software.” Molecular Biology and Evolution 30 (5): 1224–28.
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] phytools_1.2-0 maps_3.4.0 ape_5.6-2 ggtree_3.4.0
## [5] rhierbaps_1.1.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 lattice_0.20-45 tidyr_1.2.0
## [4] assertthat_0.2.1 digest_0.6.29 utf8_1.2.2
## [7] R6_2.5.1 evaluate_0.15 coda_0.19-4
## [10] highr_0.9 ggplot2_3.3.6 pillar_1.7.0
## [13] ggfun_0.0.6 yulab.utils_0.0.4 rlang_1.0.5
## [16] lazyeval_0.2.2 rstudioapi_0.13 phangorn_2.9.0
## [19] jquerylib_0.1.4 Matrix_1.4-1 combinat_0.0-8
## [22] rmarkdown_2.14 labeling_0.4.2 stringr_1.4.0
## [25] igraph_1.3.4 munsell_0.5.0 compiler_4.2.0
## [28] numDeriv_2016.8-1.1 xfun_0.31 pkgconfig_2.0.3
## [31] gridGraphics_0.5-1 mnormt_2.1.0 optimParallel_1.0-2
## [34] htmltools_0.5.2 tidyselect_1.1.2 tibble_3.1.7
## [37] expm_0.999-6 matrixStats_0.62.0 codetools_0.2-18
## [40] quadprog_1.5-8 fansi_1.0.3 crayon_1.5.1
## [43] dplyr_1.0.9 MASS_7.3-56 grid_4.2.0
## [46] nlme_3.1-157 jsonlite_1.8.0 gtable_0.3.0
## [49] lifecycle_1.0.1 DBI_1.1.2 magrittr_2.0.3
## [52] formatR_1.12 scales_1.2.0 tidytree_0.3.9
## [55] cli_3.3.0 stringi_1.7.6 farver_2.1.0
## [58] scatterplot3d_0.3-42 bslib_0.3.1 ellipsis_0.3.2
## [61] generics_0.1.2 vctrs_0.4.1 fastmatch_1.1-3
## [64] tools_4.2.0 treeio_1.20.0 ggplotify_0.1.0
## [67] glue_1.6.2 purrr_0.3.4 plotrix_3.8-2
## [70] parallel_4.2.0 fastmap_1.1.0 yaml_2.3.5
## [73] colorspace_2.0-3 aplot_0.1.6 clusterGeneration_1.3.7
## [76] knitr_1.39 patchwork_1.1.1 sass_0.4.1