Getting Started with BioMoR

BioMoR: Bioinformatics Modeling with Recursion and Autoencoder-Based Ensembles

BioMoR is an R package for bioinformatics modeling that integrates:
• Recursive Transformer architectures via Mixture-of-Recursions (MoR) (Bae et al. 2025, doi:10.48550/arXiv.2507.10524)
• Autoencoder-based representation learning (Hinton & Salakhutdinov 2006, doi:10.1126/science.1127647)
• Random Forests for robust tree-based modeling (Breiman 2001, doi:10.1023/A:1010933404324)
• XGBoost for efficient gradient boosting (Chen & Guestrin 2016, doi:10.1145/2939672.2939785)
• Stacked ensembles to combine diverse models for stronger predictive power

It is designed as a benchmarking framework for predictive workflows in bioinformatics, enabling consistent cross-validation, calibration, and threshold optimization.

Motivation

Modern bioinformatics involves high-dimensional and noisy data such as genomics, transcriptomics, and proteomics. BioMoR addresses these challenges by:
• Using Mixture-of-Recursions (MoR) for adaptive recursive depth and computational efficiency.
• Learning latent embeddings through autoencoders to improve classifier generalization.
• Leveraging ensemble methods (RF, XGB) for robustness.
• Providing a standardized benchmarking interface to evaluate models on ROC-AUC, PR-AUC, F1, Balanced Accuracy, Brier score, calibration, and threshold optimization.
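
To make a few of the metrics listed above concrete, the following base R sketch computes the Brier score, Balanced Accuracy, and F1 by hand. It is not BioMoR's implementation: the toy vectors `y_true` and `p_hat` and the fixed 0.5 cutoff are assumptions chosen purely for illustration.

```r
# Toy data (hypothetical): true labels (1 = positive class) and
# predicted probabilities for the positive class.
y_true <- c(1, 1, 1, 0, 0, 0, 0, 1)
p_hat  <- c(0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7)

# Brier score: mean squared difference between predicted probability
# and observed outcome (lower is better).
brier <- mean((p_hat - y_true)^2)

# Hard class predictions at an assumed 0.5 cutoff.
y_pred <- as.integer(p_hat >= 0.5)

# Confusion-matrix counts.
tp <- sum(y_pred == 1 & y_true == 1)
fp <- sum(y_pred == 1 & y_true == 0)
fn <- sum(y_pred == 0 & y_true == 1)
tn <- sum(y_pred == 0 & y_true == 0)

# Balanced Accuracy: mean of sensitivity and specificity.
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

# F1: harmonic mean of precision and recall (sensitivity).
precision <- tp / (tp + fp)
f1 <- 2 * precision * sensitivity / (precision + sensitivity)

print(c(brier = brier, balanced_accuracy = balanced_accuracy, f1 = f1))
```

The Brier score is computed directly from the probabilities, whereas Balanced Accuracy and F1 depend on the chosen cutoff, which is why threshold optimization is treated as a separate step.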

Example Workflow

We illustrate the workflow with the classic iris dataset, recoding Species into a binary label (setosa vs. the rest) for simplicity:

library(BioMoR)

# Prepare dataset: recode labels to binary
data(iris)
iris$Label <- ifelse(iris$Species == "setosa", "Active", "Inactive")

# Cross-validation control
ctrl <- get_cv_control(cv = 3)

# Train a Random Forest
fit <- train_rf(iris, outcome_col = "Label", ctrl = ctrl)

# Benchmark the model
results <- biomor_benchmark(fit, iris, outcome_col = "Label")
#> Warning in bake(object$recipe, new_data = newdata, all_predictors()): ! There was 1 column that was a factor when the recipe was prepped:
#> • `Label`
#> ℹ This may cause errors when processing new data.
#> Warning in confusionMatrix.default(y_pred, y_true): Levels are not in the same
#> order for reference and data. Refactoring data to match.

# Print metrics
results$metrics
#> NULL

Visualization

# ROC Curve
results$plots$ROC
#> NULL
# Precision-Recall Curve
results$plots$PR
#> NULL
# Threshold Optimization
results$plots$Thresholds
#> NULL
# Calibration Curve
results$plots$Calibration
#> NULL
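
As a rough illustration of what a threshold-optimization plot summarizes, the base R sketch below sweeps candidate cutoffs and keeps the one with the highest F1. This is a generic technique and not necessarily how BioMoR selects thresholds: the helper `f1_at()`, the toy vectors, and the 0.05-step grid are all hypothetical.

```r
# Toy data (hypothetical): true labels and predicted probabilities.
y_true <- c(1, 1, 1, 0, 0, 0, 0, 1)
p_hat  <- c(0.92, 0.81, 0.43, 0.33, 0.22, 0.61, 0.12, 0.72)

# Hypothetical helper: F1 score obtained when classifying at `threshold`.
f1_at <- function(threshold, y_true, p_hat) {
  y_pred <- as.integer(p_hat >= threshold)
  tp <- sum(y_pred == 1 & y_true == 1)
  fp <- sum(y_pred == 1 & y_true == 0)
  fn <- sum(y_pred == 0 & y_true == 1)
  if (tp == 0) return(0)  # F1 is taken as 0 when nothing is correctly flagged
  2 * tp / (2 * tp + fp + fn)
}

# Candidate thresholds 0.05, 0.10, ..., 0.95.
thresholds <- seq_len(19) / 20

# F1 at every candidate threshold.
f1_scores <- vapply(thresholds, f1_at, numeric(1),
                    y_true = y_true, p_hat = p_hat)

# which.max() returns the first maximum, so ties resolve to the lowest threshold.
best_threshold <- thresholds[which.max(f1_scores)]

print(data.frame(threshold = thresholds, f1 = round(f1_scores, 3)))
print(best_threshold)
```

Optimizing a different criterion (for example Balanced Accuracy) only requires swapping the helper. To avoid optimistic estimates, the threshold would normally be tuned on held-out or cross-validated predictions.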

Extending BioMoR

• Replace train_rf() with train_xgb_caret() for XGBoost.
• Incorporate autoencoder features via train_autoencoder() and get_embeddings().
• Use train_biomor() to stack multiple models.
• Benchmark across models to compare pipelines in one consistent framework.