This document introduces TileDB via several simple examples. A corresponding document with more complete API documentation is also available.
Once the TileDB R package is installed, it can be loaded via library(tiledb). Installation is supported on Windows, Linux and macOS via the official CRAN package, on Linux and macOS via the conda package, and from source.
Documentation for the TileDB R package is available via the help() function from within R, as well as via the package documentation and an introductory notebook. Documentation about TileDB itself is also available.
Several “quickstart” examples that are discussed on the website are available in the examples directory. This vignette discusses similar examples.
In the following examples, the URIs describing arrays point to local file system objects. When TileDB has been built with S3 support, and proper AWS credentials are set in the usual environment variables, URIs such as s3://some/data/bucket can be used wherever a local file would be used. See the script ex_S3.R for an example.
These illustrations use the array created by the file ex_1.R, which can be run from within R or on the command line. To follow the discussion below, it helps to run the example once to create the array, after possibly adjusting the array location path from its default value (the current directory or, if set as an option, an override).
The file ex_1.R in the examples directory is a simple yet complete example extending quickstart_dense.R by adding a second and third attribute. In this as well as the following examples we will use tiledb_array() to access the array; the older variants tiledb_dense() and tiledb_sparse() remain supported but are deprecated and may be removed at some point in the future.
Read 1-D
The first example extracts rows 1 to 2 and column 2 from an array. It also limits the selection to just one attribute (via attrs), asks for the return to be a data.frame (instead of a simpler list), and asks for the indices (row and column, if present as here) not to be printed (via extended=FALSE).
> A <- tiledb_array(uri = uri, attrs = "b",
+ return_as = "data.frame", extended=FALSE)
> A[1:2,2]
[1] 101.5 104.0
>
Note that the examples create three two-dimensional attributes. The attributes can be selected via the attrs argument, or the attrs() method on the array object. The square-bracket indexing then selects within the 2-d attribute object.
If multiple objects are returned (as list or data.frame), subsetting on the returned object works via [[var]] or $var. A numeric index also works (but needs to account for rows and cols).
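The numeric-index case can be illustrated in base R without an array at hand. This is only a sketch: the data.frame res is a hypothetical stand-in for an extended result, and it assumes that rows and cols come first, followed by the attributes. The values for a and b are the ones printed in this section.

```r
# Hypothetical stand-in for an extended result of A[1:2, 2]; no TileDB
# call is made here. Column order (rows, cols, a, b) is an assumption.
res <- data.frame(rows = c(1L, 2L),
                  cols = c(2L, 2L),
                  a    = c(2L, 7L),
                  b    = c(101.5, 104.0))

by_name   <- res[["a"]]  # selection by name
by_dollar <- res$a       # equivalent shorthand
by_pos    <- res[[3]]    # 'a' is the third column, after rows and cols

identical(by_name, by_pos)
```

With extended=FALSE the index columns are absent, so the same attribute would sit at a different position; selecting by name avoids that dependency.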
> A <- tiledb_array(uri = uri, attrs = c("a","b"),
+ return_as = "data.frame")
> A[1:2,2][["a"]]
[1] 2 7
> A[1:2,2]$a
[1] 2 7
>
Read 2-D
This works analogously. Note that the results are generally returned as vectors, or as columns of a data.frame object if that option was set.
> A[6:9,3:4]
$a
[1] 28 29 33 34 38 39 43 44

$b
[1] 114.5 115.0 117.0 117.5 119.5 120.0 122.0 122.5

$c
[1] "fox" "A"   "E"   "F"   "J"   "K"   "O"   "P"

>
We can restrict the selection to a subset of attributes when opening the array.
> A <- tiledb_array(uri = uri, attrs = c("b","c"),
+ return_as = "data.frame", extended=FALSE)
> A[6:9,2:4]
       b     c
1  114.0 brown
2  114.5   fox
3  115.0     A
4  116.5     D
5  117.0     E
6  117.5     F
7  119.0     I
8  119.5     J
9  120.0     K
10 121.5     N
11 122.0     O
12 122.5     P
>
This also illustrates the effect of setting return_as = "data.frame" when opening the array.
This scheme can be generalized to variable-length cells, or cells where N>1, as we can expand each (atomistic) value over the corresponding row and column indices.
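The expansion of values over row and column indices can be sketched in base R. This is an illustration, not the package's implementation: it assumes the row-major ordering visible in the A[6:9,3:4] output above, and the formula for a (five columns, values counting up row by row) is inferred from the printed values rather than taken from ex_1.R.

```r
# Index pairs for rows 6:9 and columns 3:4. expand.grid varies its first
# argument fastest, so listing cols first yields row-major order.
idx <- expand.grid(cols = 3:4, rows = 6:9)

# Assumed layout (inferred from the printed output, not from ex_1.R):
# five columns, with 'a' counting upwards row by row.
ncols <- 5L
a <- (idx$rows - 1L) * ncols + idx$cols

# One row per cell: each value is paired with its row and column index.
flat <- data.frame(rows = idx$rows, cols = idx$cols, a = a)
flat
```

Under these assumptions the a column reproduces the values 28, 29, 33, 34, 38, 39, 43, 44 shown earlier, one per (row, column) pair.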
The column types correspond to the attribute types in the array schema, subject to the constraint mentioned above on R types. (The char attribute comes in as a factor variable, as is still the default in R 3.6.*, although that default is about to change. We can also override this, and so can users.)
> A <- tiledb_array("/tmp/tiledb/ex_1/", attrs=c("b","c"),
+ return_as = "data.frame", extended=TRUE)
> sapply(A[6:9, 3:4], "class")
       rows        cols           b           c
  "integer"   "integer"   "numeric" "character"
>
Consistent with the data.frame semantics, requesting a named column now reduces to a vector, as this happens on the R side:
> A[6:9, 3:4]$b
[1] 114.5 115.0 117.0 117.5 119.5 120.0 122.0 122.5
>
Simple Examples
Basic reading returns the coordinates and any attributes. The following examples use the array created by the quickstart_sparse example.
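The quickstart_sparse script is not reproduced here. For readers who want to follow along without it, the sketch below shows how an array of that shape might be created and filled. It is a hedged reconstruction rather than a copy of the script: the temporary URI, the 4x4 INT32 domain, and the tile extent are assumptions chosen to be compatible with the output shown in this section.

```r
library(tiledb)

# Assumption: a throwaway location; the 'uri' used in the vignette may differ.
uri <- tempfile(pattern = "quickstart_sparse_")

# Two INT32 dimensions named 'rows' and 'cols' (assumed domain 1..4, tile 4).
dom <- tiledb_domain(dims = c(tiledb_dim("rows", c(1L, 4L), 4L, "INT32"),
                              tiledb_dim("cols", c(1L, 4L), 4L, "INT32")))

# One INT32 attribute 'a'; sparse = TRUE makes this a sparse array.
schema <- tiledb_array_schema(dom,
                              attrs = c(tiledb_attr("a", type = "INT32")),
                              sparse = TRUE)
invisible(tiledb_array_create(uri, schema))

# Write three cells: (1,1) = 1, (2,4) = 2 and (2,3) = 3.
A <- tiledb_array(uri = uri)
A[c(1L, 2L, 2L), c(1L, 4L, 3L)] <- c(1L, 2L, 3L)

# Read everything back as a data.frame.
B <- tiledb_array(uri = uri, return_as = "data.frame")
res <- B[]
```

Only the three written cells are stored, which is why a full read of a sparse array returns coordinates alongside the attribute values.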
> A <- tiledb_array(uri = uri, is.sparse = TRUE)
> A[]
$rows
[1] 1 2 2

$cols
[1] 1 3 4

$a
[1] 1 3 2

>
We can also request a data.frame object, either when opening or by changing this object characteristic on the fly:
> return.data.frame(A) <- TRUE
> A[]
  a rows cols
1 1    1    1
2 3    2    3
3 2    2    4
For sparse arrays, the return type is by default ‘extended’, showing row and column indices, but this can be overridden.
Assignment works similarly:
> A[4,2] <- 42L
> A[]
  rows cols  a
1    1    1  1
2    2    3  3
3    2    4  2
4    4    2 42
>
Reads can select rows and/or columns:
> A[2,]
  rows cols a
1    2    3 3
2    2    4 2
> A[,2]
  rows cols  a
1    4    2 42
>
Attributes can be selected similarly.
Similar to the dense array case described earlier, the file ex_2.R illustrates some basic operations on sparse arrays. It also shows date and datetime types instead of just integer and double precision floats.
> A <- tiledb_array(uri = uri, return_as = "data.frame")
> A[1577858403:1577858408]
        rows cols a   b          d                       e
1 1577858403    1 3 103 2020-01-11 2020-01-02 18:24:33.844
2 1577858404    1 4 104 2020-01-15 2020-01-05 02:28:36.214
3 1577858405    1 5 105 2020-01-19 2020-01-05 00:44:04.805
4 1577858406    1 6 106 2020-01-21 2020-01-06 12:58:51.770
5 1577858407    1 7 107 2020-01-25 2020-01-09 04:29:56.309
6 1577858408    1 8 108 2020-01-26 2020-01-07 13:55:10.240
>
The row coordinate is currently a floating point representation of the underlying time type. We can both select attributes (here we excluded the “a” column) and select rows by time (as the time stamps get converted to the required floating point value).
> attrs(A) <- c("b", "d", "e")
> A[as.POSIXct("2020-01-01"):as.POSIXct("2020-01-01 00:00:03")]
        rows cols   b          d                       e
1 1577858401    1 101 2020-01-05 2020-01-01 03:03:07.548
2 1577858402    1 102 2020-01-10 2020-01-02 21:02:19.747
3 1577858403    1 103 2020-01-11 2020-01-02 18:24:33.844
>
More extended examples are available showing indexing by date(time) as well as by character dimensions.
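The conversion of time stamps to numeric row values described above can be checked in base R. The sketch assumes the America/Chicago time zone, which is an inference from the printed values (1577858400 is midnight on 2020-01-01 at UTC-6) and not something this vignette states; in another time zone the same calls yield different numbers.

```r
# Assumption: the example data were generated in US Central time.
tz <- "America/Chicago"

start <- as.POSIXct("2020-01-01 00:00:00", tz = tz)
end   <- as.POSIXct("2020-01-01 00:00:03", tz = tz)

# POSIXct stores seconds since 1970-01-01 UTC as a double.
as.numeric(start)   # 1577858400

# The colon operator works on the underlying numeric values, so this is
# the numeric range that ends up being used as the row index.
rng <- start:end
rng
```

The range covers four consecutive seconds; the read shown earlier returns three rows, which suggests that no cell is stored at the first of them.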
The TileDB R package is documented via R help functions (e.g. help("tiledb_array") shows information for the tiledb_array() function) as well as via a website that gathers all documentation. Extended API documentation is available, as is an examples/ directory.
TileDB itself has extensive installation and general documentation, as well as a support forum.