Chapter 9 Lazy Evaluation and Lazy Queries
This chapter:
- Reviews lazy loading, lazy evaluation and lazy query execution
- Demonstrates how
dplyr
code gets executed (and how R determines what is translated to SQL and what is processed locally by R)- Offers some further resources on lazy loading, evaluation, execution, etc.
9.1 Setup
The following packages are used in this chapter:
library(tidyverse)
library(DBI)
library(RPostgres)
library(dbplyr)
require(knitr)
library(bookdown)
library(sqlpetr)
library(connections)
sleep_default <- 3
Start your adventureworks
container:
Connect to the database:
# con <- connection_open( # use in an interactive session
con <- dbConnect( # use in other settings
RPostgres::Postgres(),
# without the previous and next lines, some functions fail with bigint data
# so change int64 to integer
bigint = "integer",
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
dbname = "adventureworks",
host = "localhost",
port = 5432)
9.2 R is lazy and comes with guardrails
By design, R is both a language and an interactive development environment (IDE). As a language, R tries to be as efficient as possible. As an IDE, R creates some guardrails to make it easy and safe to work with your data. For example getOption("max.print")
prevents R from printing more rows of data than you want to handle in an interactive session, with a default of 99999 lines, which may or may not suit you.
On the other hand SQL is a “Structured Query Language (SQL): a standard computer language for relational database management and data manipulation.”.1 SQL has various database-specific Interactive Development Environments (IDEs), such as pgAdmin for PostgreSQL. Roger Peng explains in R Programming for Data Science that:
R has maintained the original S philosophy, which is that it provides a language that is both useful for interactive work, but contains a powerful programming language for developing new tools.
This is complicated when R interacts with SQL. In a vignette for dbplyr Hadley Wickham explains:
The most important difference between ordinary data frames and remote database queries is that your R code is translated into SQL and executed in the database on the remote server, not in R on your local machine. When working with databases, dplyr tries to be as lazy as possible:
It never pulls data into R unless you explicitly ask for it.
It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.
Exactly when, which, and how much data is returned from the dbms is the topic of this chapter. Exactly how the data is represented in the dbms and then translated to a data frame is discussed in the DBI specification.
Eventually, if you are interacting with a dbms from R you will need to understand the differences between lazy loading, lazy evaluation, and lazy queries.
9.2.1 Lazy loading
“Lazy loading is always used for code in packages but is optional (selected by the package maintainer) for datasets in packages.”2 Lazy loading means that the code for a particular function doesn’t actually get loaded into memory until the last minute – when it’s actually being used.
9.2.2 Lazy evaluation
Essentially “Lazy evaluation is a programming strategy that allows a symbol to be evaluated only when needed.”3 That means that lazy evaluation is about symbols such as function arguments4 when they are evaluated. Tidy evaluation complicates lazy evaluation.5
9.2.3 Lazy Queries
“When you create a "lazy" query, you’re creating a pointer to a set of conditions on the database, but the query isn’t actually run and the data isn’t actually loaded until you call "next" or some similar method to actually fetch the data and load it into an object.”6
9.3 Lazy evaluation and lazy queries
When does a lazy query trigger data retrieval? It depends on a lot of factors, as we explore below:
9.3.1 Create a black box query for experimentation
Define the three tables discussed in the previous chapter to build a black box query:
sales_person_table <- tbl(con, in_schema("sales", "salesperson")) %>%
select(-rowguid) %>%
rename(sale_info_updated = modifieddate)
employee_table <- tbl(con, in_schema("humanresources", "employee")) %>%
select(-modifieddate, -rowguid)
person_table <- tbl(con, in_schema("person", "person")) %>%
select(-modifieddate, -rowguid)
Here is a typical string of dplyr
verbs strung together with the magrittr %>%
pipe command that will be used to tease out the several different behaviors that a lazy query has when passed to different R functions. This query joins three connection objects into a query we’ll call Q
:
Q <- sales_person_table %>%
dplyr::left_join(employee_table, by = c("businessentityid" = "businessentityid")) %>%
dplyr::left_join(person_table , by = c("businessentityid" = "businessentityid")) %>%
dplyr::select(firstname, lastname, salesytd, birthdate)
The str
function gives us a hint at how R is collecting information that can be used to construct and execute a query later on:
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'PqConnection' [package "RPostgres"] with 3 slots
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:4] "src_PqConnection" "src_dbi" "src_sql" "src"
## $ ops:List of 4
## ..$ name: chr "select"
## ..$ x :List of 4
## .. ..- attr(*, "class")= chr [1:3] "op_join" "op_double" "op"
## ..$ dots: list()
## ..$ args:List of 1
## ..- attr(*, "class")= chr [1:3] "op_select" "op_single" "op"
## - attr(*, "class")= chr [1:5] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
9.3.2 Experiment overview
Think of Q
as a black box for the moment. The following examples will show how Q
is interpreted differently by different functions. It’s important to remember in the following discussion that the “and then” operator (%>%
) actually wraps the subsequent code inside the preceding code so that Q %>% print()
is equivalent to print(Q)
.
Notation
A single green check indicates that some rows are returned.
Two green checks indicate that all the rows are returned.
The red X indicates that no rows are returned.
R code Result Q %>% print()
Prints x rows; same as just entering Q
Q %>% dplyr::as_tibble()
Forces Q
to be a tibbleQ %>% head()
Prints the first 6 rows Q %>% tail()
Error: tail() is not supported by sql sources Q %>% length()
Counts the rows in Q
Q %>% str()
Shows the top 3 levels of the object Q
Q %>% nrow()
Attempts to determine the number of rows Q %>% dplyr::tally()
Counts all the rows – on the dbms side Q %>% dplyr::collect(n = 20)
Prints 20 rows Q %>% dplyr::collect(n = 20) %>% head()
Prints 6 rows Q %>% ggplot
Plots a barchart Q %>% dplyr::show_query()
Translates the lazy query object into SQL
The next chapter will discuss how to build queries and how to explore intermediate steps. But first, the following subsections provide a more detailed discussion of each row in the preceding table.
9.3.3 Q %>% print()
Remember that Q %>% print()
is equivalent to print(Q)
and the same as just entering Q
on the command line. We use the magrittr pipe operator here, because chaining functions highlights how the same object behaves differently in each use.
## # Source: lazy query [?? x 4]
## # Database: postgres [postgres@localhost:5432/adventureworks]
## firstname lastname salesytd birthdate
## <chr> <chr> <dbl> <date>
## 1 Stephen Jiang 559698. 1951-10-17
## 2 Michael Blythe 3763178. 1968-12-25
## 3 Linda Mitchell 4251369. 1980-02-27
## 4 Jillian Carson 3189418. 1962-08-29
## 5 Garrett Vargas 1453719. 1975-02-04
## 6 Tsvi Reiter 2315186. 1974-01-18
## 7 Pamela Ansman-Wolfe 1352577. 1974-12-06
## 8 Shu Ito 2458536. 1968-03-09
## 9 José Saraiva 2604541. 1963-12-11
## 10 David Campbell 1573013. 1974-02-11
## # … with more rows
R retrieves 10 observations and 3 columns. In its role as IDE, R has provided nicely formatted output that is similar to what it prints for a tibble, with descriptive information about the dataset and each column:
9.3.3 Source: lazy query [?? x 4]
9.3.3 Database: postgres
9.3.3 [postgres@localhost:5432/adventureworks]
firstname lastname salesytd birthdate
R has not determined how many rows are left to retrieve as it shows with [?? x 4]
and ... with more rows
in the data summary.
9.3.4 Q %>% dplyr::as_tibble()
In contrast to print()
, the as_tibble()
function causes R to download the whole table, using tibble’s default of displaying only the first 10 rows.
## # A tibble: 17 x 4
## firstname lastname salesytd birthdate
## <chr> <chr> <dbl> <date>
## 1 Stephen Jiang 559698. 1951-10-17
## 2 Michael Blythe 3763178. 1968-12-25
## 3 Linda Mitchell 4251369. 1980-02-27
## 4 Jillian Carson 3189418. 1962-08-29
## 5 Garrett Vargas 1453719. 1975-02-04
## 6 Tsvi Reiter 2315186. 1974-01-18
## 7 Pamela Ansman-Wolfe 1352577. 1974-12-06
## 8 Shu Ito 2458536. 1968-03-09
## 9 José Saraiva 2604541. 1963-12-11
## 10 David Campbell 1573013. 1974-02-11
## 11 Tete Mensa-Annan 1576562. 1978-01-05
## 12 Syed Abbas 172524. 1975-01-11
## 13 Lynn Tsoflias 1421811. 1977-02-14
## 14 Amy Alberts 519906. 1957-09-20
## 15 Rachel Valdez 1827067. 1975-07-09
## 16 Jae Pak 4116871. 1968-03-17
## 17 Ranjit Varkey Chudukatil 3121616. 1975-09-30
9.3.5 Q %>% head()
The head()
function is very similar to print but has a different “max.print
” value.
## # Source: lazy query [?? x 4]
## # Database: postgres [postgres@localhost:5432/adventureworks]
## firstname lastname salesytd birthdate
## <chr> <chr> <dbl> <date>
## 1 Stephen Jiang 559698. 1951-10-17
## 2 Michael Blythe 3763178. 1968-12-25
## 3 Linda Mitchell 4251369. 1980-02-27
## 4 Jillian Carson 3189418. 1962-08-29
## 5 Garrett Vargas 1453719. 1975-02-04
## 6 Tsvi Reiter 2315186. 1974-01-18
9.3.6 Q %>% tail()
Produces an error, because Q
does not hold all of the data, so it is not possible to list the last few items from the table:
## Error : tail() is not supported by sql sources
9.3.7 Q %>% length()
Because the Q
object is relatively complex, using str()
on it prints many lines. You can glimpse what’s going on with length()
:
## [1] 2
9.3.8 Q %>% str()
Looking inside shows some of what’s going on (three levels deep):
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'PqConnection' [package "RPostgres"] with 3 slots
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:4] "src_PqConnection" "src_dbi" "src_sql" "src"
## $ ops:List of 4
## ..$ name: chr "select"
## ..$ x :List of 4
## .. ..$ name: chr "join"
## .. ..$ x :List of 2
## .. .. ..- attr(*, "class")= chr [1:5] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
## .. ..$ y :List of 2
## .. .. ..- attr(*, "class")= chr [1:5] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
## .. ..$ args:List of 4
## .. ..- attr(*, "class")= chr [1:3] "op_join" "op_double" "op"
## ..$ dots: list()
## ..$ args:List of 1
## .. ..$ vars:List of 4
## ..- attr(*, "class")= chr [1:3] "op_select" "op_single" "op"
## - attr(*, "class")= chr [1:5] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
9.3.9 Q %>% nrow()
Notice the difference between nrow()
and tally()
. The nrow
functions returns NA
and does not execute a query:
## [1] NA
9.3.10 Q %>% dplyr::tally()
The tally
function actually counts all the rows.
## # Source: lazy query [?? x 1]
## # Database: postgres [postgres@localhost:5432/adventureworks]
## n
## <int>
## 1 17
The nrow()
function knows that Q
is a list. On the other hand, the tally()
function tells SQL to go count all the rows. Notice that Q
results in 1,000 rows – the same number of rows as film
.
9.3.11 Q %>% dplyr::collect()
The dplyr::collect function triggers a call to the DBI:dbFetch()
function behind the scenes, which forces R to download a specified number of rows:
## # A tibble: 17 x 4
## firstname lastname salesytd birthdate
## <chr> <chr> <dbl> <date>
## 1 Stephen Jiang 559698. 1951-10-17
## 2 Michael Blythe 3763178. 1968-12-25
## 3 Linda Mitchell 4251369. 1980-02-27
## 4 Jillian Carson 3189418. 1962-08-29
## 5 Garrett Vargas 1453719. 1975-02-04
## 6 Tsvi Reiter 2315186. 1974-01-18
## 7 Pamela Ansman-Wolfe 1352577. 1974-12-06
## 8 Shu Ito 2458536. 1968-03-09
## 9 José Saraiva 2604541. 1963-12-11
## 10 David Campbell 1573013. 1974-02-11
## 11 Tete Mensa-Annan 1576562. 1978-01-05
## 12 Syed Abbas 172524. 1975-01-11
## 13 Lynn Tsoflias 1421811. 1977-02-14
## 14 Amy Alberts 519906. 1957-09-20
## 15 Rachel Valdez 1827067. 1975-07-09
## 16 Jae Pak 4116871. 1968-03-17
## 17 Ranjit Varkey Chudukatil 3121616. 1975-09-30
## # A tibble: 6 x 4
## firstname lastname salesytd birthdate
## <chr> <chr> <dbl> <date>
## 1 Stephen Jiang 559698. 1951-10-17
## 2 Michael Blythe 3763178. 1968-12-25
## 3 Linda Mitchell 4251369. 1980-02-27
## 4 Jillian Carson 3189418. 1962-08-29
## 5 Garrett Vargas 1453719. 1975-02-04
## 6 Tsvi Reiter 2315186. 1974-01-18
The dplyr::collect
function triggers the creation of a tibble and controls the number of rows that the DBMS sends to R. Notice that head
only prints 6 of the 20 rows that R has retrieved.
If you do not provide a value for the n
argument, all of the rows will be retrieved into your R workspace.
9.3.12 Q %>% ggplot
Passing the Q
object to ggplot
executes the query and plots the result.
* Rewrite previous query and this comment with adventureworks in mind.
Comment on the plot…
9.3.13 Q %>% dplyr::show_query()
## <SQL>
## SELECT "firstname", "lastname", "salesytd", "birthdate"
## FROM (SELECT "LHS"."businessentityid" AS "businessentityid", "LHS"."territoryid" AS "territoryid", "LHS"."salesquota" AS "salesquota", "LHS"."bonus" AS "bonus", "LHS"."commissionpct" AS "commissionpct", "LHS"."salesytd" AS "salesytd", "LHS"."saleslastyear" AS "saleslastyear", "LHS"."sale_info_updated" AS "sale_info_updated", "LHS"."nationalidnumber" AS "nationalidnumber", "LHS"."loginid" AS "loginid", "LHS"."jobtitle" AS "jobtitle", "LHS"."birthdate" AS "birthdate", "LHS"."maritalstatus" AS "maritalstatus", "LHS"."gender" AS "gender", "LHS"."hiredate" AS "hiredate", "LHS"."salariedflag" AS "salariedflag", "LHS"."vacationhours" AS "vacationhours", "LHS"."sickleavehours" AS "sickleavehours", "LHS"."currentflag" AS "currentflag", "LHS"."organizationnode" AS "organizationnode", "RHS"."persontype" AS "persontype", "RHS"."namestyle" AS "namestyle", "RHS"."title" AS "title", "RHS"."firstname" AS "firstname", "RHS"."middlename" AS "middlename", "RHS"."lastname" AS "lastname", "RHS"."suffix" AS "suffix", "RHS"."emailpromotion" AS "emailpromotion", "RHS"."additionalcontactinfo" AS "additionalcontactinfo", "RHS"."demographics" AS "demographics"
## FROM (SELECT "LHS"."businessentityid" AS "businessentityid", "LHS"."territoryid" AS "territoryid", "LHS"."salesquota" AS "salesquota", "LHS"."bonus" AS "bonus", "LHS"."commissionpct" AS "commissionpct", "LHS"."salesytd" AS "salesytd", "LHS"."saleslastyear" AS "saleslastyear", "LHS"."sale_info_updated" AS "sale_info_updated", "RHS"."nationalidnumber" AS "nationalidnumber", "RHS"."loginid" AS "loginid", "RHS"."jobtitle" AS "jobtitle", "RHS"."birthdate" AS "birthdate", "RHS"."maritalstatus" AS "maritalstatus", "RHS"."gender" AS "gender", "RHS"."hiredate" AS "hiredate", "RHS"."salariedflag" AS "salariedflag", "RHS"."vacationhours" AS "vacationhours", "RHS"."sickleavehours" AS "sickleavehours", "RHS"."currentflag" AS "currentflag", "RHS"."organizationnode" AS "organizationnode"
## FROM (SELECT "businessentityid", "territoryid", "salesquota", "bonus", "commissionpct", "salesytd", "saleslastyear", "modifieddate" AS "sale_info_updated"
## FROM sales.salesperson) "LHS"
## LEFT JOIN (SELECT "businessentityid", "nationalidnumber", "loginid", "jobtitle", "birthdate", "maritalstatus", "gender", "hiredate", "salariedflag", "vacationhours", "sickleavehours", "currentflag", "organizationnode"
## FROM humanresources.employee) "RHS"
## ON ("LHS"."businessentityid" = "RHS"."businessentityid")
## ) "LHS"
## LEFT JOIN (SELECT "businessentityid", "persontype", "namestyle", "title", "firstname", "middlename", "lastname", "suffix", "emailpromotion", "additionalcontactinfo", "demographics"
## FROM person.person) "RHS"
## ON ("LHS"."businessentityid" = "RHS"."businessentityid")
## ) "dbplyr_009"
Hand-written SQL code to do the same job will probably look a lot nicer and could be more efficient, but functionally dplyr
does the job.
## Disconnect from the database and stop Docker
9.4 Other resources
- Benjamin S. Baumer. 2017. A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data. https://arxiv.org/abs/1708.07073
- dplyr Reference documentation: Remote tables. https://dplyr.tidyverse.org/reference/index.html#section-remote-tables
- Data Carpentry. SQL Databases and R. https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html