duckspatial is an R package that simplifies reading and writing vector spatial data (e.g.,
sf
objects) to and from a DuckDB
database. The package is designed for users working with geospatial
data who want to leverage DuckDB’s fast analytical capabilities while
maintaining compatibility with R’s spatial data ecosystem.
You can install the development version of duckspatial from GitHub with:
# install.packages("pak")
pak::pak("Cidree/duckspatial")
This is a basic example which shows how to set up DuckDB for spatial data manipulation, and how to write/read vector data.
library(duckdb)
#> Cargando paquete requerido: DBI
library(duckspatial)
library(sf)
#> Linking to GEOS 3.13.1, GDAL 3.10.2, PROJ 9.5.1; sf_use_s2() is TRUE
First, we create a connection to a DuckDB database (in this case, an in-memory database), make sure that the spatial extension is installed, and load it:
## create connection
conn <- dbConnect(duckdb())
## install and load spatial extension
ddbs_install(conn)
#> ℹ spatial extension version <2905968> is already installed in this database
ddbs_load(conn)
#> ✔ Spatial extension loaded
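With the extension loaded, the connection understands DuckDB’s spatial SQL functions directly. As a quick sanity check (a minimal sketch using plain DBI; ST_Point() and ST_AsText() are functions provided by DuckDB’s spatial extension), you can round-trip a point through SQL:

```r
## run a spatial SQL function on the connection to confirm the
## extension works (ST_Point/ST_AsText come from the spatial extension)
dbGetQuery(conn, "SELECT ST_AsText(ST_Point(1, 2)) AS wkt")
```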
Now we can get some data to insert into the database. We are creating 10,000,000 random points.
## random word generator
random_word <- function(length = 5) {
  paste0(sample(letters, length, replace = TRUE), collapse = "")
}
## create n points
n <- 10000000
random_points <- data.frame(
  id = 1:n,
  x = runif(n, min = -180, max = 180),
  y = runif(n, min = -90, max = 90),
  a = sample(1:1000000, size = n, replace = TRUE),
  b = sample(replicate(10, random_word(7)), size = n, replace = TRUE),
  c = sample(replicate(10, random_word(9)), size = n, replace = TRUE)
)

## convert to sf
sf_points <- st_as_sf(random_points, coords = c("x", "y"), crs = 4326)
## view first rows
head(sf_points)
#> Simple feature collection with 6 features and 4 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -117.7598 ymin: -34.15453 xmax: 113.8518 ymax: 89.68161
#> Geodetic CRS: WGS 84
#> id a b c geometry
#> 1 1 709998 bvwprwa izlhlvspq POINT (-100.8183 -34.15453)
#> 2 2 650017 jfgrvgp ikchdbklp POINT (68.39046 25.59802)
#> 3 3 957513 vwmhulb tjevpihjs POINT (-64.22538 42.72978)
#> 4 4 593853 elthvjo tqucqfpuu POINT (-117.7598 16.73306)
#> 5 5 188177 elthvjo ddzbekmdx POINT (113.8518 89.68161)
#> 6 6 245843 yksarig sjksxdtdg POINT (28.08287 -19.54068)
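Note that the b and c columns are sampled from fixed pools of only 10 random words each, so they behave like low-cardinality categorical columns no matter how large n gets. In isolation, the word pool looks like this (a small base-R sketch; word_pool is an illustrative name):

```r
## the generator from above, shown on its own
random_word <- function(length = 5) {
  paste0(sample(letters, length, replace = TRUE), collapse = "")
}

## a pool of 10 seven-letter words; sampling from it yields
## at most 10 distinct values regardless of sample size
word_pool <- replicate(10, random_word(7))
length(unique(word_pool))
```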
Now we can insert the data into the database using the
ddbs_write_vector()
function. We use the
proc.time()
function to measure how long it takes,
and we compare it with writing the data to a GeoPackage with the
write_sf()
function:
## write data monitoring processing time
start_time <- proc.time()
ddbs_write_vector(conn, sf_points, "test_points")
#> ✔ Table test_points successfully imported
end_time <- proc.time()

## print elapsed time
elapsed_duckdb <- end_time["elapsed"] - start_time["elapsed"]
print(elapsed_duckdb)
#> elapsed
#> 18.64
## write data monitoring processing time
start_time <- proc.time()
gpkg_file <- tempfile(fileext = ".gpkg")
write_sf(sf_points, gpkg_file)
end_time <- proc.time()

## print elapsed time
elapsed_gpkg <- end_time["elapsed"] - start_time["elapsed"]
print(elapsed_gpkg)
#> elapsed
#> 244.23
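The proc.time() bookkeeping above can also be wrapped with base R’s system.time(), which times a single expression and reports the same “elapsed” figure in one call (a small convenience helper for these comparisons, not part of duckspatial; elapsed_of is an illustrative name):

```r
## time any expression and return only the elapsed seconds
elapsed_of <- function(expr) {
  unname(system.time(expr)["elapsed"])
}

## e.g. elapsed_of(ddbs_write_vector(conn, sf_points, "test_points"));
## here demonstrated with a cheap stand-in
elapsed_of(Sys.sleep(0.1))
```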
In this case, DuckDB was about 13 times faster (244.23 s vs. 18.64 s). Now we will do the same exercise but reading the data back into R:
## read data monitoring processing time
start_time <- proc.time()
sf_points_ddbs <- ddbs_read_vector(conn, "test_points")
#> ✔ Table test_points successfully imported.
end_time <- proc.time()

## print elapsed time
elapsed_duckdb <- end_time["elapsed"] - start_time["elapsed"]
print(elapsed_duckdb)
#> elapsed
#> 61.91
## read data monitoring processing time
start_time <- proc.time()
sf_points_ddbs <- read_sf(gpkg_file)
end_time <- proc.time()

## print elapsed time
elapsed_gpkg <- end_time["elapsed"] - start_time["elapsed"]
print(elapsed_gpkg)
#> elapsed
#> 58.58
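Since the table lives in DuckDB, you can also filter it with spatial SQL before any rows reach R. As an illustration (a sketch assuming the geometry column is named geom, which you should verify with dbListFields(conn, "test_points"); ST_Within() and ST_GeomFromText() are functions from DuckDB’s spatial extension), a bounding-box count can be pushed down to the database:

```r
## count points inside a 20x20 degree box entirely in DuckDB
## NOTE: 'geom' is an assumed geometry column name -- check it with
## dbListFields(conn, "test_points") before running
dbGetQuery(conn, "
  SELECT COUNT(*) AS n
  FROM test_points
  WHERE ST_Within(geom, ST_GeomFromText(
    'POLYGON ((-10 -10, 10 -10, 10 10, -10 10, -10 -10))'))
")
```

Pushing the filter into the database avoids materializing all 10,000,000 rows in R when only a subset is needed.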
For reading, we got similar results. Finally, don’t forget to disconnect from the database:
dbDisconnect(conn)
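For duckdb connections specifically, dbDisconnect() also accepts a shutdown argument that closes the underlying database instance and frees its resources, not just the connection handle:

```r
## shutdown = TRUE also closes the underlying DuckDB instance,
## releasing its memory and file handles
dbDisconnect(conn, shutdown = TRUE)
```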