| Type: | Package |
| Title: | Missing Data Imputation via Language Models and Statistics |
| Version: | 0.1.0 |
| Description: | Provides missing data imputation through two complementary engines: a large language model engine that communicates with the 'Anthropic' 'Claude' application programming interface for context-aware semantic imputation, and a fully self-contained offline engine implementing nineteen statistical and machine learning algorithms entirely in base R with no additional package dependencies. Offline methods include mean, median, mode, last observation carried forward, next observation carried backward, hot-deck, predictive mean matching, k-nearest neighbours, ordinary least-squares regression, Lasso with coordinate descent, Ridge with closed-form solution, Bayesian Ridge regression with evidence approximation following MacKay (1992), support vector regression with a radial basis function kernel, classification and regression trees, random forests, gradient boosting, iterative random forest imputation, principal component analysis imputation via iterative singular value decomposition, and nuclear-norm minimisation via singular value thresholding. When no API key is available the package automatically falls back to the offline engine, ensuring full operation in environments without internet access. Every imputed value is accompanied by a confidence score and a plain-language reasoning string, producing reproducible audit trails. The automatic method selector chooses the best algorithm per column based on data type, skewness, missingness rate, and inter-column correlations. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Language: | en-US |
| Config/Needs/check: | spelling |
| Depends: | R (≥ 4.1.0) |
| Imports: | httr2 (≥ 1.0.0), methods, jsonlite (≥ 1.8.0), cli (≥ 3.6.0) |
| Suggests: | testthat (≥ 3.0.0), knitr (≥ 1.40), rmarkdown (≥ 2.14), withr (≥ 2.5.0) |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
| BugReports: | https://cran.r-project.org/submit.html |
| NeedsCompilation: | no |
| Packaged: | 2026-06-18 06:25:03 UTC; acer |
| Maintainer: | Sadikul Islam <sadikul.islamiasri@gmail.com> |
| Author: | Sadikul Islam |
| Repository: | CRAN |
| Date/Publication: | 2026-06-23 13:40:02 UTC |
llmimpute: Missing Data Imputation via Language Models and Statistics
Description
Provides missing data imputation through two complementary engines: a large language model engine that works with any supported language model provider (cloud or local) for context-aware semantic imputation, and a fully self-contained offline engine implementing nineteen statistical and machine learning algorithms entirely in base R with no additional package dependencies.
Supported cloud providers include Anthropic 'Claude', OpenAI, Google
Gemini, Groq, Mistral AI, Cohere, OpenRouter, Together AI, Fireworks AI,
DeepSeek, Perplexity, xAI Grok, AI21 Labs, and Cerebras. Supported local
servers include Ollama, LM Studio, Jan, llama.cpp, KoboldCpp, and Text
Generation WebUI. Any other OpenAI-compatible endpoint can be used via
the "custom" provider option.
When no API key is available the package automatically falls back to the offline engine, ensuring full operation in environments without internet access.
Quick start
library(llmimpute)
df <- data.frame(
age = c(25L, NA, 35L, 40L),
income = c(50000, 60000, NA, 80000),
edu = c("BSc", NA, "MSc", "BSc"),
stringsAsFactors = FALSE
)
# Offline - no key needed
result <- lmi_impute(df)
# Free cloud (Groq)
# lmi_set_api_key("gsk_...", provider = "groq")
# Free cloud (Gemini)
# lmi_set_api_key("AIza...", provider = "gemini")
# Free local (Ollama)
# lmi_set_api_key(provider = "ollama")
# Free local (LM Studio)
# lmi_set_api_key(provider = "lmstudio")
# Any OpenAI-compatible server
# lmi_set_api_key("key", provider = "custom",
# base_url = "http://localhost:8000/v1/chat/completions")
Main functions
lmi_providers: List all supported LLM providers.
lmi_set_api_key: Configure provider, key, and URL.
lmi_set_model: Choose the model for LLM imputation.
lmi_diagnose: Inspect missingness without any API call.
lmi_impute: Unified imputation (LLM or offline fallback).
lmi_impute_offline: Offline-only imputation, 19 algorithms.
lmi_models: List recommended models per provider.
lmi_export: Write imputed data and audit trail to files.
Free options
Free cloud: Groq (https://console.groq.com), Google Gemini (https://aistudio.google.com), OpenRouter (https://openrouter.ai), Cerebras (https://cloud.cerebras.ai). Free local (no key): Ollama (https://ollama.com), LM Studio (https://lmstudio.ai), Jan (https://www.jan.ai/).
Author(s)
Maintainer: Sadikul Islam sadikul.islamiasri@gmail.com (ORCID)
Authors:
Rajesh Kaushal
See Also
Useful links:
Report bugs at https://cran.r-project.org/submit.html
Call the configured LLM API for imputation
Description
Internal function. Routes the request to the correct provider endpoint using the active protocol (anthropic, openai-compatible, gemini, cohere, or ollama), builds the prompt, and parses the JSON response.
Usage
.lmi_call_api(chunk, domain, confidence, reasoning, flag_suspicious)
Arguments
chunk |
A data.frame chunk of rows to impute. |
domain |
Character string domain label. |
confidence, reasoning, flag_suspicious |
Logical flags from lmi_impute. |
Value
Named list with elements imputations and suspicious.
Extract the imputed data frame from an lmi_result
Description
Convenience S3 method allowing as.data.frame(result) to extract
the imputed data frame from an lmi_result object directly.
Usage
## S3 method for class 'lmi_result'
as.data.frame(x, ...)
Arguments
x |
An object of class lmi_result. |
... |
Currently unused. Included for S3 compatibility. |
Value
The imputed data.frame.
Examples
df <- data.frame(x = c(1, NA, 3))
result <- lmi_impute_offline(df, verbose = FALSE)
clean <- as.data.frame(result)
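The extraction method above follows the usual S3 pattern. As a minimal sketch, assuming a result object is a classed list whose data element holds the imputed data frame (the class name demo_result and its constructor are hypothetical stand-ins, not the package's internals):

```r
# Hypothetical stand-in for an lmi_result: a classed list with a `data` element.
new_demo_result <- function(data) {
  structure(list(data = data), class = "demo_result")
}

# S3 method: as.data.frame() dispatches here for objects of class "demo_result".
as.data.frame.demo_result <- function(x, ...) {
  x$data
}

res <- new_demo_result(data.frame(x = c(1, 2, 3)))
clean <- as.data.frame(res)
```

Defining the method on the base generic means callers do not need to know the internal layout of the result object.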
Diagnose missing data in a data frame
Description
Analyses a data frame and prints a report on missing values, column types,
and the number of unique observed values per column. No API calls are made.
Use this function before lmi_impute to preview what will be
imputed and to choose an appropriate method.
Usage
lmi_diagnose(data, na_strings = NULL)
Arguments
data |
A data.frame. |
na_strings |
Character vector of additional strings to treat as
NA. Default NULL. |
Value
Invisibly returns a data.frame with one row per column and
columns column, type, n_missing, pct_missing,
n_unique.
See Also
lmi_impute, lmi_impute_offline
Examples
df <- data.frame(
age = c(25L, NA, 35L, 40L),
income = c(50000, 60000, NA, 80000),
edu = c("BSc", NA, "MSc", "BSc"),
stringsAsFactors = FALSE
)
lmi_diagnose(df)
lmi_diagnose(df, na_strings = "N/A")
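The kind of report described above can be approximated in a few lines of base R. The helper below is a hypothetical sketch (diagnose_sketch is not part of the package) that produces the documented columns column, type, n_missing, pct_missing, and n_unique:

```r
# Sketch of a per-column missingness report (hypothetical helper, base R only).
diagnose_sketch <- function(data, na_strings = NULL) {
  rows <- lapply(names(data), function(nm) {
    col <- data[[nm]]
    # Treat user-supplied strings (e.g. "N/A") as missing in character columns.
    if (is.character(col) && !is.null(na_strings)) {
      col[col %in% na_strings] <- NA
    }
    data.frame(
      column      = nm,
      type        = class(col)[1],
      n_missing   = sum(is.na(col)),
      pct_missing = 100 * mean(is.na(col)),
      n_unique    = length(unique(col[!is.na(col)])),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

df <- data.frame(
  age    = c(25L, NA, 35L, 40L),
  income = c(50000, 60000, NA, 80000),
  edu    = c("BSc", "N/A", "MSc", "BSc"),
  stringsAsFactors = FALSE
)
report <- diagnose_sketch(df, na_strings = "N/A")
```

Here the string "N/A" in edu is counted as missing only because it was passed through na_strings.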
Export imputed data and audit trail to files
Description
Writes the imputed data frame and the imputation audit trail to disk.
Supports CSV (default) and RDS output formats. When
flag_suspicious = TRUE was used in lmi_impute, the
suspicious-values table is also written.
Usage
lmi_export(
result,
path = tempdir(),
prefix = "llmimpute",
format = c("csv", "rds"),
overwrite = FALSE
)
Arguments
result |
An object of class lmi_result. |
path |
Character string. Output directory. Created recursively if it
does not exist. Default tempdir(). |
prefix |
Character string prepended to output file names.
Default "llmimpute". |
format |
Character string. Either "csv" (default) or "rds". |
overwrite |
Logical. Overwrite existing files? Default FALSE. |
Value
Invisibly returns a named character vector of the file paths written.
See Also
lmi_impute, lmi_impute_offline
Examples
df <- data.frame(
age = c(25L, NA, 35L),
income = c(50000, 60000, NA),
stringsAsFactors = FALSE
)
result <- lmi_impute_offline(df, verbose = FALSE)
## Not run:
lmi_export(result, path = tempdir(), prefix = "my_study")
## End(Not run)
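The overwrite guard and recursive directory creation described in the arguments can be sketched as follows. The helper write_guarded and the file name it builds are assumptions for illustration; the names of the files the package actually writes are not documented here.

```r
# Hypothetical helper: write one data frame as CSV or RDS, refusing to
# clobber an existing file unless overwrite = TRUE.
write_guarded <- function(x, path, name, format = c("csv", "rds"),
                          overwrite = FALSE) {
  format <- match.arg(format)
  if (!dir.exists(path)) dir.create(path, recursive = TRUE)
  file <- file.path(path, paste0(name, ".", format))
  if (file.exists(file) && !overwrite) {
    stop("File exists and overwrite = FALSE: ", file)
  }
  if (format == "csv") {
    utils::write.csv(x, file, row.names = FALSE)
  } else {
    saveRDS(x, file)
  }
  invisible(file)
}

out_dir <- file.path(tempdir(), "export_sketch")
f <- write_guarded(data.frame(a = 1:2), out_dir, "demo_data", overwrite = TRUE)
# A second call without overwrite = TRUE raises an error, caught here.
second <- tryCatch(
  write_guarded(data.frame(a = 1:2), out_dir, "demo_data"),
  error = function(e) "blocked"
)
```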
Impute missing values using LLM or built-in statistical methods
Description
Primary entry point for the llmimpute package. When an API key for a
supported provider is configured (Anthropic by default), missing values are
filled using the configured large language model, which reasons about each
missing cell using the semantic meaning of column names, inter-column
relationships, and domain knowledge.
When no API key is available (or when offline = TRUE), the function
transparently delegates to lmi_impute_offline, which runs
entirely in base R without internet access.
Usage
lmi_impute(
data,
domain = c("general", "healthcare", "financial", "hr", "survey", "scientific"),
offline = FALSE,
offline_fallback = TRUE,
offline_method = c("auto", "mean", "median", "mode", "locf", "nocb", "hotdeck", "pmm",
"knn", "linear", "lasso", "ridge", "bayesian_ridge", "svr", "decision_tree",
"random_forest", "gradient_boost", "missforest", "pca_impute", "softimpute"),
na_strings = NULL,
cols = NULL,
confidence = TRUE,
reasoning = TRUE,
flag_suspicious = FALSE,
max_rows = 50L,
knn_k = 5L,
seed = 42L,
verbose = TRUE
)
Arguments
data |
A data.frame. |
domain |
Character string describing the data domain. Guides LLM
reasoning. One of "general", "healthcare", "financial", "hr", "survey", "scientific". |
offline |
Logical. If TRUE, always use the offline engine. Default FALSE. |
offline_fallback |
Logical. If TRUE, fall back to the offline engine when no API key is
available; if FALSE, stop with an error instead. Default TRUE. |
offline_method |
Character. Offline imputation strategy passed to
lmi_impute_offline. Default "auto". |
na_strings |
Character vector of additional strings to treat as
NA. Default NULL. |
cols |
Character vector of column names to impute. Default NULL. |
confidence |
Logical. Include a confidence score (0-100) per imputed
cell. Default TRUE. |
reasoning |
Logical. Include a one-sentence explanation per imputed
cell. Default TRUE. |
flag_suspicious |
Logical. Ask the LLM to flag anomalous existing
values (LLM mode only). Default FALSE. |
max_rows |
Integer. Maximum rows per API call chunk (LLM mode only).
Default 50L. |
knn_k |
Integer. Neighbours for the "knn" offline method. Default 5L. |
seed |
Integer. Random seed for reproducible offline imputation.
Default 42L. |
verbose |
Logical. Print progress messages. Default TRUE. |
Value
An object of class lmi_result, a named list with:
data: The imputed data.frame.
imputations: data.frame audit trail: one row per imputed cell with columns row, col, original, imputed, confidence, reasoning.
suspicious: data.frame of flagged existing values (LLM mode, flag_suspicious = TRUE only). NULL otherwise.
summary: Named list of imputation statistics.
call: The matched call.
Engine selection
| Situation | Behaviour |
| API key present, offline = FALSE | LLM imputation (configured provider, Anthropic by default) |
| No key, offline_fallback = TRUE | Offline engine (silent) |
| No key, offline_fallback = FALSE | Error with guidance |
| offline = TRUE | Offline engine always |
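The table above amounts to a short decision rule. A minimal sketch, assuming the presence of a key is already known as a logical (choose_engine is a hypothetical helper, not a package function):

```r
# Hypothetical decision helper mirroring the engine selection table.
choose_engine <- function(has_key, offline = FALSE, offline_fallback = TRUE) {
  if (offline) return("offline")            # offline = TRUE always wins
  if (has_key) return("llm")                # key present, offline = FALSE
  if (offline_fallback) return("offline")   # no key, silent fallback
  stop("No API key configured and offline_fallback = FALSE.")
}

choose_engine(has_key = TRUE)
choose_engine(has_key = FALSE)
choose_engine(has_key = TRUE, offline = TRUE)
```

Checking offline first keeps the forced-offline case independent of whether a key happens to be configured.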
See Also
lmi_impute_offline, lmi_diagnose,
lmi_export, print.lmi_result
Examples
# Offline imputation (works with no API key)
df <- data.frame(
age = c(25L, NA, 35L, 40L, NA),
income = c(50000, 60000, NA, 80000, 55000),
edu = c("BSc", NA, "MSc", "BSc", "PhD"),
stringsAsFactors = FALSE
)
result <- lmi_impute(df)
result$data
result$imputations
summary(result)
# Force a specific offline method
result2 <- lmi_impute(df, offline = TRUE, offline_method = "random_forest")
## Not run:
# LLM mode requires a valid Anthropic API key
lmi_set_api_key() # reads ANTHROPIC_API_KEY from environment
result3 <- lmi_impute(df, domain = "hr")
## End(Not run)
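The max_rows argument limits how many rows go into each API call. One plausible way to split a data frame into such chunks is sketched below; chunk_rows is a hypothetical helper and the package's actual chunking logic may differ.

```r
# Split a data frame into consecutive chunks of at most max_rows rows.
chunk_rows <- function(data, max_rows = 50L) {
  idx <- seq_len(nrow(data))
  groups <- ceiling(idx / max_rows)   # 1 for rows 1..max_rows, 2 for the next block, ...
  split(data, groups)
}

df <- data.frame(id = 1:120, value = rnorm(120))
chunks <- chunk_rows(df, max_rows = 50L)
sizes <- vapply(chunks, nrow, integer(1))
```

With 120 rows and max_rows = 50L this yields three chunks, the last one shorter than the others.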
Impute missing values using built-in statistical and ML methods (no API required)
Description
Fully offline imputation using 19 algorithms plus an automatic selector ("auto"), all implemented from scratch in base R. No API key, internet connection, or third-party package is needed.
Usage
lmi_impute_offline(
data,
method = c("auto", "mean", "median", "mode", "locf", "nocb", "hotdeck", "pmm", "knn",
"linear", "lasso", "ridge", "bayesian_ridge", "svr", "decision_tree",
"random_forest", "gradient_boost", "missforest", "pca_impute", "softimpute"),
cols = NULL,
na_strings = NULL,
knn_k = 5L,
n_trees = 50L,
max_depth = 4L,
shrinkage = 0.1,
n_iter = 10L,
lambda = 1,
rank = 3L,
seed = 42L,
verbose = TRUE
)
Arguments
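data |
A data.frame. |
method |
One of: "auto", "mean", "median", "mode", "locf", "nocb", "hotdeck", "pmm", "knn", "linear", "lasso", "ridge", "bayesian_ridge", "svr", "decision_tree", "random_forest", "gradient_boost", "missforest", "pca_impute", "softimpute". Default "auto". |
cols |
Column names to impute. Default NULL. |
na_strings |
Extra strings treated as NA. Default NULL. |
knn_k |
Neighbours for knn. Default 5L. |
n_trees |
Trees for rf/gb/missforest. Default 50L. |
max_depth |
Tree depth. Default 4L. |
shrinkage |
Learning rate for gradient_boost. Default 0.1. |
n_iter |
Iterations for iterative methods. Default 10L. |
lambda |
Regularisation for ridge/lasso/bayesian_ridge. Default 1. |
rank |
SVD rank for pca_impute/softimpute. Default 3L. |
seed |
Random seed. Default 42L. |
verbose |
Print progress. Default TRUE. |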
Value
An object of class lmi_result.
Examples
df <- data.frame(
age = c(25, NA, 35, 40, NA),
income = c(50000, 60000, NA, 80000, 55000),
edu = c("BSc", NA, "MSc", "BSc", "PhD"),
stringsAsFactors = FALSE
)
lmi_impute_offline(df)
lmi_impute_offline(df, method = "random_forest")
lmi_impute_offline(df, method = "softimpute")
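The package description characterises the softimpute method as nuclear-norm minimisation via singular value thresholding. The sketch below illustrates that general technique on a numeric matrix in base R. The function svt_impute is hypothetical; its lambda and n_iter parameters only echo the argument names above, and nothing here should be read as the package's own implementation.

```r
# Sketch of iterative singular value thresholding on a numeric matrix.
svt_impute <- function(X, lambda = 1, n_iter = 10L) {
  miss <- is.na(X)
  # Start from the column means of the observed entries.
  col_means <- colMeans(X, na.rm = TRUE)
  Z <- X
  Z[miss] <- col_means[col(X)[miss]]
  for (i in seq_len(n_iter)) {
    s <- svd(Z)
    d <- pmax(s$d - lambda, 0)          # soft-threshold the singular values
    low_rank <- s$u %*% (d * t(s$v))    # U diag(d) V'
    Z[miss] <- low_rank[miss]           # update only the missing cells
  }
  Z
}

set.seed(42)
X <- matrix(rnorm(40), nrow = 10, ncol = 4)
X[c(3, 17, 28)] <- NA
filled <- svt_impute(X, lambda = 0.5, n_iter = 10L)
```

Observed entries are never modified; only the missing cells are replaced by the current low-rank reconstruction at each iteration.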
List all available offline imputation methods
Description
Prints all 20 methods grouped by category with usage guidance.
Usage
lmi_methods()
Value
Invisibly returns a data.frame of method metadata.
Examples
lmi_methods()
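For orientation, three of the simplest strategies in the method list (mean, mode, and last observation carried forward) can be written in a few lines of base R. These helpers are hypothetical illustrations of the techniques, not the functions the package uses internally.

```r
# Replace missing numeric values with the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Replace missing values in a character vector with the most frequent value.
impute_mode <- function(x) {
  counts <- table(x)                       # table() drops NA by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Last observation carried forward; leading NAs stay NA.
impute_locf <- function(x) {
  idx <- cumsum(!is.na(x))                 # position of the last observed value so far
  observed <- x[!is.na(x)]
  out <- x
  out[idx > 0] <- observed[idx[idx > 0]]
  out
}

impute_mean(c(1, NA, 3))
impute_mode(c("BSc", NA, "MSc", "BSc"))
impute_locf(c(NA, 2, NA, NA, 5, NA))
```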
List recommended models for each supported LLM provider
Description
Prints recommended model identifiers for every provider supported by
lmi_impute, including free-tier options. Use
lmi_set_model to activate a model after choosing a provider
with lmi_set_api_key.
Usage
lmi_models()
Value
Invisibly returns a character vector of all model identifiers across every provider.
See Also
lmi_set_model, lmi_providers,
lmi_impute
Examples
lmi_models()
List all supported LLM providers
Description
Prints every LLM provider supported by lmi_impute, grouped
into cloud APIs, local servers, and custom endpoints, with free-tier
indicators and required environment variables.
Usage
lmi_providers()
Value
Invisibly returns a character vector of provider names.
Examples
lmi_providers()
Configure the API key and LLM provider for llmimpute
Description
Sets the API key and LLM provider used by LLM-mode imputation. Supports
cloud APIs (Anthropic 'Claude', OpenAI, Google Gemini, Groq, Mistral,
Cohere, OpenRouter, Together AI, Fireworks AI, DeepSeek, Perplexity,
xAI Grok, AI21 Labs, Cerebras) and local servers (Ollama, LM Studio,
Jan, llama.cpp, KoboldCpp, Text Generation WebUI). Any other
OpenAI-compatible endpoint can be used via provider = "custom"
with a base_url.
If no API key is configured, lmi_impute automatically falls
back to the offline statistical engine. The package is fully functional
without any API key.
Usage
lmi_set_api_key(
api_key = NULL,
provider = "anthropic",
base_url = NULL,
.session = TRUE
)
Arguments
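api_key |
Character string. Your API key. Not required for local
providers (Ollama, LM Studio, Jan, llama.cpp, KoboldCpp,
Text Generation WebUI). If NULL, the key is read from the provider's environment variable (for example ANTHROPIC_API_KEY). |
provider |
Character string. Provider name. One of the names returned by lmi_providers, for example "anthropic", "openai", "gemini", "groq", "ollama", or "custom". Default "anthropic". |
base_url |
Character string. Required only for provider = "custom"; may also be set to override a local server's default address. Default NULL. |
.session |
Logical. Store for the current session only (default
TRUE). |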
Value
Invisibly returns the API key string (or "" for keyless
local providers).
See Also
lmi_providers, lmi_set_model,
lmi_impute
Examples
## Not run:
## Cloud APIs
lmi_set_api_key("sk-ant-...", provider = "anthropic")
lmi_set_api_key("sk-...", provider = "openai")
lmi_set_api_key("AIza...", provider = "gemini") # free tier
lmi_set_api_key("gsk_...", provider = "groq") # free tier
lmi_set_api_key("sk-or-...", provider = "openrouter") # free models
lmi_set_api_key("...", provider = "deepseek")
lmi_set_api_key("...", provider = "cerebras") # free tier
## Local servers (no key needed)
lmi_set_api_key(provider = "ollama")
lmi_set_api_key(provider = "lmstudio")
lmi_set_api_key(provider = "jan")
lmi_set_api_key(provider = "llamacpp")
lmi_set_api_key(provider = "koboldcpp")
lmi_set_api_key(provider = "textgenwebui")
## Any OpenAI-compatible endpoint
lmi_set_api_key("mykey", provider = "custom",
base_url = "http://my-server:8000/v1/chat/completions")
## Override local port
lmi_set_api_key(provider = "ollama",
base_url = "http://localhost:11435/api/chat")
## End(Not run)
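The examples elsewhere in this manual note that calling lmi_set_api_key() with no key reads ANTHROPIC_API_KEY from the environment. A minimal sketch of that kind of lookup, assuming an explicit key takes precedence over the environment (resolve_key and DEMO_LLM_KEY are made up for illustration):

```r
# Hypothetical resolver: prefer an explicit key, else an environment variable.
resolve_key <- function(api_key = NULL, env_var = "ANTHROPIC_API_KEY") {
  if (!is.null(api_key) && nzchar(api_key)) return(api_key)
  Sys.getenv(env_var, unset = "")
}

Sys.setenv(DEMO_LLM_KEY = "demo-123")    # made-up variable for illustration
from_env <- resolve_key(env_var = "DEMO_LLM_KEY")
explicit <- resolve_key("explicit-key", env_var = "DEMO_LLM_KEY")
```

Returning an empty string when nothing is found matches the documented return value for keyless local providers.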
Set the LLM model used for imputation
Description
Sets or retrieves the model identifier used by lmi_impute
in LLM mode. The default model is determined by the active provider.
Use lmi_models to see recommended models per provider.
Usage
lmi_set_model(model = NULL)
lmi_get_model()
Arguments
model |
Character string. A valid model identifier for the active
provider. If |
Value
Invisibly returns the active model string.
Examples
lmi_get_model()
## Not run:
lmi_set_model("gpt-4o-mini")
lmi_set_model("gemini-1.5-flash")
lmi_set_model("llama-3.3-70b-versatile")
lmi_set_model("deepseek-chat")
lmi_set_model("llama3.2") # Ollama local
## End(Not run)
Print an lmi_result object
Description
Displays a formatted summary of an imputation result in the console,
including overall statistics, per-column imputation counts, and the first
n imputed values with their confidence scores and reasoning.
Usage
## S3 method for class 'lmi_result'
print(x, n = 10L, ...)
Arguments
x |
An object of class lmi_result. |
n |
Integer. Number of individual imputation rows to display.
Default 10L. |
... |
Currently unused. Included for S3 compatibility. |
Value
Invisibly returns x.
Examples
df <- data.frame(
age = c(25L, NA, 35L),
income = c(50000, 60000, NA),
stringsAsFactors = FALSE
)
result <- lmi_impute_offline(df, verbose = FALSE)
print(result)
Summarise an lmi_result object
Description
Returns a data.frame summarising imputation counts and confidence
statistics per column, suitable for further analysis or reporting.
Usage
## S3 method for class 'lmi_result'
summary(object, ...)
Arguments
object |
An object of class lmi_result. |
... |
Currently unused. Included for S3 compatibility. |
Value
A data.frame with columns column, n_imputed,
mean_confidence, min_confidence, max_confidence.
Returns NULL invisibly when no imputations were performed.
Examples
df <- data.frame(
age = c(25L, NA, 35L, 40L),
income = c(50000, 60000, NA, 80000),
stringsAsFactors = FALSE
)
result <- lmi_impute_offline(df, verbose = FALSE)
summary(result)
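The per-column statistics returned by this method can be derived from an audit trail with col and confidence columns. The sketch below uses a hand-made audit table and a hypothetical helper, summarise_audit, to show one way such a summary could be computed in base R.

```r
# Hand-made audit trail using the documented column names col and confidence.
audit <- data.frame(
  col        = c("age", "age", "income"),
  confidence = c(80, 60, 90),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row of confidence statistics per imputed column.
summarise_audit <- function(audit) {
  if (nrow(audit) == 0L) return(invisible(NULL))   # nothing was imputed
  parts <- split(audit$confidence, audit$col)
  data.frame(
    column          = names(parts),
    n_imputed       = vapply(parts, length, integer(1)),
    mean_confidence = vapply(parts, mean, numeric(1)),
    min_confidence  = vapply(parts, min, numeric(1)),
    max_confidence  = vapply(parts, max, numeric(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

per_col <- summarise_audit(audit)
```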