library(SCDB)
#>
#> Attaching package: 'SCDB'
#> The following object is masked from 'package:base':
#>
#> nrow
The basic principle of the SCDB package is to enable the user to easily implement and maintain a database of time-versioned data.
In practice, this is done by labeling each record in the data with three additional fields:
This strategy of time versioning is often called “type 2” history (Kimball and Ross 2013).
Note that identical records may be removed and introduced more than once; for example, in a table of names and addresses, a person may change their address (or name) back to a previous value.
The SCDB package provides the function update_snapshot
to handle the insertion and deactivation of records using this strategy.
It further includes several functions to improve the Quality of life for
working with database data.
A simple example of usage is shown below.
For this example, we
use a temporary, on-disk SQLite database. Note that
get_connection
tries to establish connection using
DBI::dbConnect
with as few additional arguments as
possible. Different drivers may require authentication which can be read
from a configuration file.1
Our example data is datasets::mtcars
reduced to only two
columns: row names converted to a column car
, and
hp
if (!exists("conn")) conn <- get_connection()
example_data <- dplyr::tbl(conn, DBI::Id(table = "example_data"))
example_data
#> # Source: table<`example_data`> [?? x 2]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp
#> <chr> <dbl>
#> 1 Mazda RX4 110
#> 2 Mazda RX4 Wag 110
#> 3 Datsun 710 93
#> 4 Hornet 4 Drive 110
#> 5 Hornet Sportabout 175
#> 6 Valiant 105
#> 7 Duster 360 245
#> 8 Merc 240D 62
#> 9 Merc 230 95
#> 10 Merc 280 123
#> # ℹ more rows
Imagine on Day 1, in this case January 1st, 2020, our currently
available data is the first three records of the example_data. We then
store this data in a table mtcars
:
data <- head(example_data, 3)
update_snapshot(.data = data,
conn = conn,
db_table = "mtcars", # the name of the DB table to store the data in
timestamp = as.POSIXct("2020-01-01 11:00:00"))
#> NULL
We can then access out data using the get_table
function, and include information on data validity period using
include_slice_info = TRUE
:
get_table(conn, "mtcars")
#> # Source: SQL [3 x 2]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp
#> <chr> <dbl>
#> 1 Datsun 710 93
#> 2 Mazda RX4 110
#> 3 Mazda RX4 Wag 110
get_table(conn, "mtcars", include_slice_info = TRUE)
#> # Source: SQL [3 x 5]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp checksum from_ts until_ts
#> <chr> <dbl> <chr> <chr> <dbl>
#> 1 Datsun 710 93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 11:0… NA
#> 2 Mazda RX4 110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 11:0… NA
#> 3 Mazda RX4 Wag 110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 11:0… NA
Note that where e.g. dplyr::tbl
requires a more exact
specification of the table identity
(tbl(conn, DBI::Id(table = "mtcars"))
),
get_table
will parse any character to a
DBI::Id
object input using SCDB::id
.
The following day, the current data is now the first five rows of our example data. We then store this data in the database using update_snapshot:
# Let's say that the next day, our data set is now the first 5 of our example data
data <- head(example_data, 5)
update_snapshot(.data = data,
conn = conn,
db_table = "mtcars", # the name of the DB table to store the data in
timestamp = as.POSIXct("2020-01-02 12:00:00"))
#> NULL
We can again use the get_table
function to see the
latest available data, including time-keeping with
include_slice_info = TRUE
:
get_table(conn, "mtcars")
#> # Source: SQL [5 x 2]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp
#> <chr> <dbl>
#> 1 Datsun 710 93
#> 2 Mazda RX4 110
#> 3 Mazda RX4 Wag 110
#> 4 Hornet 4 Drive 110
#> 5 Hornet Sportabout 175
get_table(conn, "mtcars", include_slice_info = TRUE)
#> # Source: SQL [5 x 5]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp checksum from_ts until_ts
#> <chr> <dbl> <chr> <chr> <dbl>
#> 1 Datsun 710 93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 … NA
#> 2 Mazda RX4 110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 … NA
#> 3 Mazda RX4 Wag 110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 … NA
#> 4 Hornet 4 Drive 110 3c1b6c43b206dd93ee4f6c3d06e1b416 2020-01-02 … NA
#> 5 Hornet Sportabout 175 9355ed7a70e3ff73a4b6ee7f7129aa35 2020-01-02 … NA
Since our data is time-versioned, we can recover the data from the day before
get_table(conn, "mtcars", slice_ts = "2020-01-01 11:00:00")
#> # Source: SQL [3 x 2]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp
#> <chr> <dbl>
#> 1 Datsun 710 93
#> 2 Mazda RX4 110
#> 3 Mazda RX4 Wag 110
On day 3, we imagine that we have the same 5 records, but one of them is altered
data <- head(example_data, 5) |>
dplyr::mutate(hp = ifelse(car == "Mazda RX4", hp / 2, hp))
update_snapshot(.data = data,
conn = conn,
db_table = "mtcars", # the name of the DB table to store the data in
timestamp = as.POSIXct("2020-01-03 10:00:00"))
#> NULL
We can again access our data using the get_table
function and see that the currently available data (with the changed hp
value for Mazda RX4)
get_table(conn, "mtcars")
#> # Source: SQL [5 x 2]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp
#> <chr> <dbl>
#> 1 Datsun 710 93
#> 2 Mazda RX4 Wag 110
#> 3 Hornet 4 Drive 110
#> 4 Hornet Sportabout 175
#> 5 Mazda RX4 55
Finally, using slice_ts = NULL
, the full history (and
time-keeping information) is returned:
get_table(conn, "mtcars", slice_ts = NULL)
#> # Source: table<`main`.`mtcars`> [6 x 5]
#> # Database: sqlite 3.46.0 [:memory:]
#> car hp checksum from_ts until_ts
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 Datsun 710 93 08c864e3854eb5a1460d87b3360d636f 2020-01-01 … <NA>
#> 2 Mazda RX4 110 7cbe488757cc85aab6583dbc4226bf68 2020-01-01 … 2020-01…
#> 3 Mazda RX4 Wag 110 b82618e7f5dd30d5df68540cecc696c8 2020-01-01 … <NA>
#> 4 Hornet 4 Drive 110 3c1b6c43b206dd93ee4f6c3d06e1b416 2020-01-02 … <NA>
#> 5 Hornet Sportabout 175 9355ed7a70e3ff73a4b6ee7f7129aa35 2020-01-02 … <NA>
#> 6 Mazda RX4 55 1232f78f7befb3a765b91176eaacdbb0 2020-01-03 … <NA>
In the context of the SCDB package, this is most
notably RPostgres::Postgres()
, which may read from a
.pgpass
file. See also the PostgreSQL
documentation.↩︎