This article reviews two possible scenarios for “writing back” predictions to the database, and without importing the data into memory first. Both scenarios share a common model preparation method.
To keep the example reproducible, a SQLite database will be used to simulate a larger scale deployment.
First the data is prepared in memory. The article will use the
nycflights13::flights
data, with a couple of
modifications
library(dplyr)
library(tidypredict)
library(randomForest)
library(dbplyr)
<- nycflights13::flights %>%
flights_table mutate(
current_score = 0,
flight_id = row_number()
)
A new database is created using RSQLite
.
library(DBI)
<- dbConnect(RSQLite::SQLite(), path = ":memory:")
con <- copy_to(con,flights_table ) db_fligths
A sample is downloaded from the database for modeling. This example already selects the needed variables.
<- db_fligths %>%
df select(dep_delay, hour, distance) %>%
head(1000) %>%
collect()
A linear model is fitted using lm()
<- lm(dep_delay ~ ., data = df) model
It is highly recommendable to always run
tidypredict_test()
to make sure that the predictions are in
line with what the predict()
command returns.
The lm
and glm
models contain the data that
they were fitted with, so it’s easy to run a test.
tidypredict_test()
uses the model’s internal data set by
default
tidypredict_test(model)
#> tidypredict test results
#> Difference threshold: 1e-12
#>
#> All results are within the difference threshold
In cases where the model re-fitting is automated, the
tidypredict_test()
function returns an alert
in case the threshold is exceeded, which can be used to fail the
automated script.
if(tidypredict_test(model)$alert) stop("Threshold exceeded!")
In this scenario, The table that supplies the term values is also the
recipient of the new score. This is done by updating a specific field in
the table. For this example, the field is called
current_score
. The following SQL UPDATE statement should
work in most databases:
library(dbplyr)
<- build_sql("UPDATE flights_table SET current_score = ", tidypredict_sql(model, con = con), con = con)
update_statement
update_statement#> <SQL> UPDATE flights_table SET current_score = (-3.5984422918702 + (`hour` * 1.38710560882252)) + (`distance` * -0.00307606912118567)
This statement can be then passed on to the database team, via documentation or an automated process. In cases where the analyst has the responsibility to run the new SQL statement, or if R is being used to automate the scoring, the next line can be used:
dbSendQuery(con, update_statement)
#> <SQLiteResult>
#> SQL UPDATE flights_table SET current_score = (-3.5984422918702 + (`hour` * 1.38710560882252)) + (`distance` * -0.00307606912118567)
#> ROWS Fetched: 0 [complete]
#> Changed: 336776
Here is a sample of the newly populated field:
%>%
db_fligths select(current_score) %>%
head(10)
#> Warning: Closing open result set, pending rows
#> # Source: SQL [10 x 1]
#> # Database: sqlite 3.40.0 []
#> current_score
#> <dbl>
#> 1 -0.969
#> 2 -1.02
#> 3 -0.0128
#> 4 -1.51
#> 5 2.38
#> 6 1.13
#> 7 1.45
#> 8 4.02
#> 9 1.82
#> 10 2.47
There may be the need to retain not only the new score, but when it
was determined and its history. This is usually possible because the
source of record possesses a unique key identifier per entity, such as
transaction ID or customer ID. In the example, flights_id
is the unique identifier.
In this example, the new scores will be stored in a new table called
daily_scores
. The following code is part of the example
preparation, it creates the table and seeds it with a single row. This
is not the best way to create an empty table, but it’ll do for the
purposes of this example.
dbWriteTable(con, "daily_scores",
tibble(
flight_id = 0,
score = 0,
date = ""
))
The plan is to use a SQL statement that most vendors support, is
called INSERT INTO SELECT. The idea is to use dplyr
laziness to prepare the data transformation and predictions, but it’s
not going to be executed until is parsed into SQL and sent as part of a
another statement. The INSERT INTO SELECT statement allows for the
results of a query to be saved in a table, and without leaving the
database.
In this example, predictions are going to be executed for just the
records in the month of December. The data is filtered, and then
tidypredict_to_column()
is used to create the new fit
field. The results are then transformed to match to the structure of the
new daily_scores
table.
<- db_fligths %>%
new_predictions filter(month == 12) %>%
tidypredict_to_column(model, vars = "score") %>%
select(
flight_id,%>%
score) mutate(date = "01/01/2018")
<- build_sql("INSERT INTO daily_scores ", sql_render(new_predictions, con = con), con = con)
insert_scores
insert_scores#> <SQL> INSERT INTO daily_scores SELECT *, '01/01/2018' AS `date`
#> FROM (
#> SELECT
#> `flight_id`,
#> (-3.5984422918702 + (`hour` * 1.38710560882252)) + (`distance` * -0.00307606912118567) AS `score`
#> FROM `flights_table`
#> WHERE (`month` = 12.0)
#> )
As in the first scenario, this statement can be then passed on to the database team, via documentation or an automated process. In cases where the analyst has the responsibility to run the new SQL statement, or if R is being used to automate the scoring, the next line can be used:
dbSendQuery(con, insert_scores)
#> <SQLiteResult>
#> SQL INSERT INTO daily_scores SELECT *, '01/01/2018' AS `date`
#> FROM (
#> SELECT
#> `flight_id`,
#> (-3.5984422918702 + (`hour` * 1.38710560882252)) + (`distance` * -0.00307606912118567) AS `score`
#> FROM `flights_table`
#> WHERE (`month` = 12.0)
#> )
#> ROWS Fetched: 0 [complete]
#> Changed: 28135
A simple table join can be used to confirm that the new update worked. For real life scenarios, a more sophisticated query should be performed in order to only get the latest score. For this example, we simple filter on the same date we inserted
tbl(con, "daily_scores") %>%
inner_join(tbl(con, "flights_table"), by = "flight_id") %>%
filter(date == "01/01/2018") %>%
select(dep_delay, hour, distance, score, date)
#> Warning: Closing open result set, pending rows
#> # Source: SQL [?? x 5]
#> # Database: sqlite 3.40.0 []
#> dep_delay hour distance score date
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 14 23 1617 23.3 01/01/2018
#> 2 18 23 1576 23.5 01/01/2018
#> 3 -7 5 529 1.71 01/01/2018
#> 4 5 5 1400 -0.969 01/01/2018
#> 5 -4 5 1089 -0.0128 01/01/2018
#> 6 -10 5 1576 -1.51 01/01/2018
#> 7 -4 5 569 1.59 01/01/2018
#> 8 1 5 1416 -1.02 01/01/2018
#> 9 -11 6 214 4.07 01/01/2018
#> 10 -10 6 1065 1.45 01/01/2018
#> # … with more rows