README

dplyr-style queries on larger-than-RAM data, backed by a pull-based columnar engine written in C11.

Point vectra at a file too big to load and query it with the verbs you already use. Data flows through the engine one row group at a time, so peak memory stays bounded no matter how large the file gets. Arrow needs compiled binaries that match your platform, DuckDB links a bundled library, Spark wants a JVM. vectra is a standard R extension with no external dependencies: it compiles where R compiles.

library(vectra)

# Lazy scan over a multi-GB CSV; nothing runs until collect()
tbl_csv("measurements.csv") |>
  filter(temperature > 30, year >= 2020) |>
  group_by(station) |>
  summarise(avg_temp = mean(temperature), n = n()) |>
  collect()

One engine, several file formats

.vtr (vectra’s own columnar format), CSV, SQLite, and GeoTIFF all open into the same lazy query nodes, so the same pipeline runs against any of them:

# GeoTIFF climate raster as tidy data
tbl_tiff("worldclim_bio1.tif") |>
  filter(band1 > 0) |>
  mutate(temp_c = band1 / 10) |>
  collect()

# SQLite without DBI
tbl_sqlite("survey.db", "responses") |>
  filter(year == 2025) |>
  left_join(tbl_sqlite("survey.db", "sites"), by = "site_id") |>
  collect()

write_vtr(big_df, "data.vtr", batch_size = 100000)

tbl("data.vtr") |>
  filter(x > 0, region == "EU") |>
  group_by(region) |>
  summarise(total = sum(value), n = n()) |>
  collect()

How the engine stays bounded

The optimizer rewrites the plan before any data moves, and execution streams the result rather than materializing it:

explain() shows the optimized plan, including which columns and row groups were pruned:

tbl("data.vtr") |>
  filter(x > 0) |>
  select(id, x) |>
  explain()
#> vectra execution plan
#>
#> ProjectNode [streaming]
#>   FilterNode [streaming]
#>     ScanNode [streaming, 2/5 cols (pruned), predicate pushdown, v3 stats]

Fuzzy matching in the engine

tbl("taxa.vtr") |>
  filter(levenshtein(species, "Quercus robur") <= 2) |>
  mutate(similarity = jaro_winkler(species, "Quercus robur")) |>
  arrange(desc(similarity)) |>
  collect()

levenshtein(), dl_dist() (Damerau-Levenshtein), and jaro_winkler() are available in filter() and mutate(), alongside nchar(), substr(), grepl(), gsub() and the rest of the string toolkit.

Spatial data, out of core

Same idea, applied to geometry: read and write layers larger than memory, running the GIS operations you would otherwise do in sf. Geometry rides through as WKB in an ordinary column, so the spatial verbs stream one batch at a time like the rest. Tag a billion-point stream with the polygon each point falls in, holding the polygons plus one batch in memory:

library(sf)

tbl("gps_pings.vtr") |>
  spatial_join(zones_sf, join = st_intersects) |>
  count(zone_id) |>
  collect()

st_* geometry functions evaluate inside filter() and mutate(), computed in C on the GEOS library straight off the geometry column:

tbl("parcels.vtr") |>
  filter(st_area(geometry) > 1e6) |>
  mutate(centroid = st_centroid(geometry)) |>
  collect_sf()

Raster operations stream strip by strip over the tiled .vec format, so a grid larger than memory passes through zonal(), focal(), terrain(), warp(), and map algebra one tile-row at a time:

r <- vec_open_raster("dem.vec")
terrain(r, c("slope", "aspect", "hillshade"))     # Horn's method over haloed strips
zonal(r, watersheds_sf, fun = c("mean", "sd"))    # per-zone, one strip resident

The toolbox reaches further: vector-raster bridges (rasterize(), polygonize(), contours()), planar overlay and coverage cleaning (spatial_overlay(), spatial_dissolve(), spatial_eliminate()), and routing on a line network (spatial_network(), spatial_route(), spatial_service_area()). Geometry stays with GEOS through libgeos; sf is needed only for vector format I/O and reprojection.

Star schemas instead of flat-table column creep

Register dimension tables once, then pull columns from any of them; joins are built for you, and unmatched keys are reported:

s <- vtr_schema(
  fact    = tbl("observations.vtr"),
  species = link("sp_id", tbl("species.vtr")),
  site    = link("site_id", tbl("sites.vtr"))
)

lookup(s, count, species$name, site$habitat) |> collect()
#> species: all 500 keys matched
#> site: 3/500 unmatched keys (X1, X2, X3)

Incremental updates

Append new rows as a new row group without rewriting the file, or take a key-based diff between two snapshots:

append_vtr(new_rows_df, "data.vtr")

d <- diff_vtr("snapshot_old.vtr", "snapshot_new.vtr", key_col = "id")
collect(d$added)   # rows present in new but not old
d$deleted          # key values present in old but not new

Streaming results into a model fit

collect() brings the whole result into memory. collect_chunked() folds a function over a query one batch at a time, holding a single batch plus the accumulator, so a result larger than memory reduces to a small summary in one pass: a running count, per-group sufficient statistics, or the cross-products behind a linear fit.

# Accumulate X'X and X'y for an exact OLS fit, one streaming batch at a time
acc <- tbl("survey.vtr") |>
  select(mpg, wt, hp) |>
  collect_chunked(
    function(acc, chunk) {
      X <- cbind(1, chunk$wt, chunk$hp)
      list(XtX = acc$XtX + crossprod(X),
           Xty = acc$Xty + crossprod(X, chunk$mpg))
    },
    .init = list(XtX = matrix(0, 3, 3), Xty = matrix(0, 3, 1))
  )
solve(acc$XtX, acc$Xty)

For models that re-read the data on every iteration, chunk_feeder() exposes a query as a resettable generator that biglm::bigglm() drives directly, fitting a GLM on data too large to hold in memory:

src <- function() tbl("occurrences.vtr") |> select(presence, bio1, bio12)
biglm::bigglm(presence ~ bio1 + bio12, data = chunk_feeder(src),
              family = binomial())

offload() spills a prepared query to disk once and streams it back, so each reweighted pass reads the prepared columns from a file rather than rebuilding the pipeline. With a by key it splits the query into per-key shards on disk, each small enough to pull into memory on its own. group_map() then runs your function on each shard, passing the shard as an ordinary data.frame along with its group key, and returns the results keyed by shard. That function is arbitrary R: a model fit, a clustering call, a custom summary, or any computation that needs the whole group in memory at once. group_modify() recombines per-shard data.frames into one table.

# Prepare once, then let bigglm re-read the spill on every pass
s <- offload(tbl("occurrences.vtr") |> select(presence, bio1, bio12))
biglm::bigglm(presence ~ bio1 + bio12, data = chunk_feeder(s),
              family = binomial())

# Per-region: group_map runs any function on each in-memory shard (here a GLM)
p <- offload(tbl("occurrences.vtr"), by = "region")
fits <- group_map(p, function(d, region)
  glm(presence ~ bio1 + bio12, data = d, family = binomial()))

Verbs

vectra covers the dplyr surface most analysis pipelines use: filter(), select(), mutate(), group_by() / summarise() with the common aggregations, all the joins (left_join() through cross_join(), plus fuzzy_join()), arrange() and the slice_*() family, and window functions (row_number(), rank(), lag(), lead(), cumsum(), ntile(), and more). Date/time and string functions evaluate in the engine. select(), rename(), relocate(), and across() accept the full tidyselect helper set (starts_with(), where(), all_of(), and the rest).

Installation

install.packages("vectra")            # CRAN

install.packages("pak")               # development version
pak::pak("gcol33/vectra")

Documentation

Support

I’m a PhD student who builds R packages in my free time because I believe good tools should be free and open. I started these projects for my own work and figured others might find them useful too.

If this package saved you some time, buying me a coffee is a nice way to say thanks. It helps with my coffee addiction.

License

Citation

@software{vectra,
  author = {Colling, Gilles},
  title  = {vectra: Columnar Query Engine for Larger-Than-RAM Data},
  year   = {2026},
  url    = {https://github.com/gcol33/vectra}
}