querying data that won’t fit in memory
dplyr-style queries on larger-than-RAM data, backed by a pull-based columnar engine written in C11.
Point vectra at a file too big to load and query it with the verbs you already use. Data flows through the engine one row group at a time, so peak memory stays bounded no matter how large the file gets. Arrow needs compiled binaries that match your platform, DuckDB links a bundled library, Spark wants a JVM. vectra is a standard R extension with no external dependencies: it compiles where R compiles.
library(vectra)
# Lazy scan over a multi-GB CSV; nothing runs until collect()
tbl_csv("measurements.csv") |>
filter(temperature > 30, year >= 2020) |>
group_by(station) |>
summarise(avg_temp = mean(temperature), n = n()) |>
collect().vtr (vectra’s own columnar format), CSV, SQLite, and
GeoTIFF all open into the same lazy query nodes, so the same pipeline
runs against any of them:
# GeoTIFF climate raster as tidy data
tbl_tiff("worldclim_bio1.tif") |>
filter(band1 > 0) |>
mutate(temp_c = band1 / 10) |>
collect()
# SQLite without DBI
tbl_sqlite("survey.db", "responses") |>
filter(year == 2025) |>
left_join(tbl_sqlite("survey.db", "sites"), by = "site_id") |>
collect()Convert to .vtr once for repeated queries, then read it
back fast:
write_vtr(big_df, "data.vtr", batch_size = 100000)
tbl("data.vtr") |>
filter(x > 0, region == "EU") |>
group_by(region) |>
summarise(total = sum(value), n = n()) |>
collect()The optimizer rewrites the plan before any data moves, and execution streams the result rather than materializing it:
explain() shows the optimized plan, including which
columns and row groups were pruned:
tbl("data.vtr") |>
filter(x > 0) |>
select(id, x) |>
explain()
#> vectra execution plan
#>
#> ProjectNode [streaming]
#> FilterNode [streaming]
#> ScanNode [streaming, 2/5 cols (pruned), predicate pushdown, v3 stats]String distance runs inside the C engine, with no round-trip to R per row:
tbl("taxa.vtr") |>
filter(levenshtein(species, "Quercus robur") <= 2) |>
mutate(similarity = jaro_winkler(species, "Quercus robur")) |>
arrange(desc(similarity)) |>
collect()levenshtein(), dl_dist()
(Damerau-Levenshtein), and jaro_winkler() are available in
filter() and mutate(), alongside
nchar(), substr(), grepl(),
gsub() and the rest of the string toolkit.
Register dimension tables once, then pull columns from any of them; joins are built for you, and unmatched keys are reported:
s <- vtr_schema(
fact = tbl("observations.vtr"),
species = link("sp_id", tbl("species.vtr")),
site = link("site_id", tbl("sites.vtr"))
)
lookup(s, count, species$name, site$habitat) |> collect()
#> species: all 500 keys matched
#> site: 3/500 unmatched keys (X1, X2, X3)Append new rows as a new row group without rewriting the file, or take a key-based diff between two snapshots:
append_vtr(new_rows_df, "data.vtr")
d <- diff_vtr("snapshot_old.vtr", "snapshot_new.vtr", key_col = "id")
collect(d$added) # rows present in new but not old
d$deleted # key values present in old but not newcollect() brings the whole result into memory.
collect_chunked() folds a function over a query one batch
at a time, holding a single batch plus the accumulator, so a result
larger than memory reduces to a small summary in one pass: a running
count, per-group sufficient statistics, or the cross-products behind a
linear fit.
# Accumulate X'X and X'y for an exact OLS fit, one streaming batch at a time
acc <- tbl("survey.vtr") |>
select(mpg, wt, hp) |>
collect_chunked(
function(acc, chunk) {
X <- cbind(1, chunk$wt, chunk$hp)
list(XtX = acc$XtX + crossprod(X),
Xty = acc$Xty + crossprod(X, chunk$mpg))
},
.init = list(XtX = matrix(0, 3, 3), Xty = matrix(0, 3, 1))
)
solve(acc$XtX, acc$Xty)For models that re-read the data on every iteration,
chunk_feeder() exposes a query as a resettable generator
that biglm::bigglm() drives directly, fitting a GLM on data
too large to hold in memory:
src <- function() tbl("occurrences.vtr") |> select(presence, bio1, bio12)
biglm::bigglm(presence ~ bio1 + bio12, data = chunk_feeder(src),
family = binomial())offload() spills a prepared query to disk once and
streams it back, so each reweighted pass reads the prepared columns from
a file rather than rebuilding the pipeline. With a by key
it instead splits the query into per-key shards on disk, turning a model
that couples within a group into a set of independent per-shard fits.
group_map() runs a function on each shard and returns the
results keyed by shard; group_modify() recombines per-shard
data.frames into one table.
# Prepare once, then let bigglm re-read the spill on every pass
s <- offload(tbl("occurrences.vtr") |> select(presence, bio1, bio12))
biglm::bigglm(presence ~ bio1 + bio12, data = chunk_feeder(s),
family = binomial())
# Per-region fits: each shard fits in memory on its own
p <- offload(tbl("occurrences.vtr"), by = "region")
fits <- group_map(p, function(d, region)
glm(presence ~ bio1 + bio12, data = d, family = binomial()))vectra covers the dplyr surface most analysis pipelines use:
filter(), select(), mutate(),
group_by() / summarise() with the common
aggregations, all the joins (left_join() through
cross_join(), plus fuzzy_join()),
arrange() and the slice_*() family, and window
functions (row_number(), rank(),
lag(), lead(), cumsum(),
ntile(), and more). Date/time and string functions evaluate
in the engine. select(), rename(),
relocate(), and across() accept the full
tidyselect helper set (starts_with(), where(),
all_of(), and the rest).
The Function Reference lists every verb and expression with examples.
install.packages("vectra") # CRAN
install.packages("pak") # development version
pak::pak("gcol33/vectra")“Software is like sex: it’s better when it’s free.” — Linus Torvalds
I’m a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.
If this package saved you some time, buying me a coffee is a nice way to say thanks. It helps with my coffee addiction.
MIT (see the LICENSE.md file)
@software{vectra,
author = {Colling, Gilles},
title = {vectra: Columnar Query Engine for Larger-Than-RAM Data},
year = {2026},
url = {https://github.com/gcol33/vectra}
}