Chapter 7 Introduction to DBMS queries

This chapter demonstrates how to:

Download all or part of a table from the DBMS, including different kinds of subsets

See how dplyr code is translated into SQL commands and how they can be mixed

Get acquainted with some useful functions and packages for investigating a single table

Begin thinking about how to divide the work between your local R session and the DBMS

7.1 Setup

The following packages are used in this chapter:

library(tidyverse)
library(DBI)
library(RPostgres)
library(dbplyr)
require(knitr)
library(bookdown)
library(sqlpetr)
library(skimr)
library(connections)
sleep_default <- 3

Assume that the Docker container with PostgreSQL and the adventureworks database are ready to go. If not go back to [Chapter 6][#chapter_setup-adventureworks-db]

sqlpetr::sp_docker_start("adventureworks")
Sys.sleep(sleep_default)

Connect to the database:

# con <- connection_open(  # use in an interactive session
con <- dbConnect(          # use in other settings
  RPostgres::Postgres(),
  # without the previous and next lines, some functions fail with bigint data 
  #   so change int64 to integer
  bigint = "integer",  
  host = "localhost",
  user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
  password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
  dbname = "adventureworks",
  port = 5432
)

7.2 Methods for downloading a single table

For the moment, assume you know something about the database and specifically what table you need to retrieve. We return to the topic of investigating the whole database later on.

dbExecute(con, "set search_path to sales, humanresources;")

## [1] 0

7.2.1 Read the entire table

There are a few different methods of getting data from a DBMS, and we’ll explore the different ways of controlling each one of them.

DBI::dbReadTable will download an entire table into an R tibble.

salesorderheader_tibble <- DBI::dbReadTable(con, "salesorderheader")

str(salesorderheader_tibble)

## 'data.frame':    31465 obs. of  25 variables:
##  $ salesorderid          : int  43659 43660 43661 43662 43663 43664 43665 43666 43667 43668 ...
##  $ revisionnumber        : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ orderdate             : POSIXct, format: "2011-05-31" "2011-05-31" ...
##  $ duedate               : POSIXct, format: "2011-06-12" "2011-06-12" ...
##  $ shipdate              : POSIXct, format: "2011-06-07" "2011-06-07" ...
##  $ status                : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ onlineorderflag       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ purchaseordernumber   : chr  "PO522145787" "PO18850127500" "PO18473189620" "PO18444174044" ...
##  $ accountnumber         : chr  "10-4020-000676" "10-4020-000117" "10-4020-000442" "10-4020-000227" ...
##  $ customerid            : int  29825 29672 29734 29994 29565 29898 29580 30052 29974 29614 ...
##  $ salespersonid         : int  279 279 282 282 276 280 283 276 277 282 ...
##  $ territoryid           : int  5 5 6 6 4 1 1 4 3 6 ...
##  $ billtoaddressid       : int  985 921 517 482 1073 876 849 1074 629 529 ...
##  $ shiptoaddressid       : int  985 921 517 482 1073 876 849 1074 629 529 ...
##  $ shipmethodid          : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ creditcardid          : int  16281 5618 1346 10456 4322 806 15232 13349 10370 1566 ...
##  $ creditcardapprovalcode: chr  "105041Vi84182" "115213Vi29411" "85274Vi6854" "125295Vi53935" ...
##  $ currencyrateid        : int  NA NA 4 4 NA NA NA NA NA 4 ...
##  $ subtotal              : num  20566 1294 32726 28833 419 ...
##  $ taxamt                : num  1971.5 124.2 3153.8 2775.2 40.3 ...
##  $ freight               : num  616.1 38.8 985.6 867.2 12.6 ...
##  $ totaldue              : num  23153 1457 36866 32475 472 ...
##  $ comment               : chr  NA NA NA NA ...
##  $ rowguid               : chr  "79b65321-39ca-4115-9cba-8fe0903e12e6" "738dc42d-d03b-48a1-9822-f95a67ea7389" "d91b9131-18a4-4a11-bc3a-90b6f53e9d74" "4a1ecfc0-cc3a-4740-b028-1c50bb48711c" ...
##  $ modifieddate          : POSIXct, format: "2011-06-07" "2011-06-07" ...

That’s very simple, but if the table is very large it may not be a problem, since R is designed to keep the entire table in memory. The tables that are found in an enterprise database such as adventureworks may be large, they are most often records kept by people. That somewhat limits their size (relative to data generated by machines) and expands the possibilities for human error.

Note that the first line of the str() output reports the total number of observations.

Later on we’ll use this tibble to demonstrate several packages and functions, but use only the first 13 columns for simplicity.

salesorderheader_tibble <- salesorderheader_tibble[,1:13]

7.2.2 Create a pointer to a table that can be reused

The dplyr::tbl function gives us more control over access to a table by enabling control over which columns and rows to download. It creates an object that might look like a data frame, but it’s actually a list object that dplyr uses for constructing queries and retrieving data from the DBMS.

salesorderheader_table <- dplyr::tbl(con, "salesorderheader")
class(salesorderheader_table)

## [1] "tbl_PqConnection" "tbl_dbi"          "tbl_sql"          "tbl_lazy"        
## [5] "tbl"

7.2.3 Controlling the number of rows returned with `collect()`

The collect function triggers the creation of a tibble and controls the number of rows that the DBMS sends to R. For more complex queries, the dplyr::collect() function provides a mechanism to indicate what’s processed on on the DBMS server and what’s processed by R on the local machine. The chapter on Lazy Evaluation and Execution Environment discusses this issue in detail.

salesorderheader_table %>% dplyr::collect(n = 3) %>% dim()

## [1]  3 25

salesorderheader_table %>% dplyr::collect(n = 500) %>% dim()

## [1] 500  25

7.2.4 Retrieving random rows from the DBMS

When the DBMS contains many rows, a sample of the data may be plenty for your purposes. Although dplyr has nice functions to sample a data frame that’s already in R (e.g., the sample_n and sample_frac functions), to get a sample from the DBMS we have to use dbGetQuery to send native SQL to the database. To peek ahead, here is one example of a query that retrieves 20 rows from a 1% sample:

one_percent_sample <- DBI::dbGetQuery(
  con,
  "SELECT orderdate, subtotal, taxamt, freight, totaldue
  FROM salesorderheader TABLESAMPLE BERNOULLI(3) LIMIT 20;
  "
)

one_percent_sample

##     orderdate   subtotal    taxamt   freight   totaldue
## 1  2011-06-22   699.0982   55.9279   17.4775   772.5036
## 2  2011-06-25  3578.2700  286.2616   89.4568  3953.9884
## 3  2011-06-29  3374.9900  269.9992   84.3748  3729.3640
## 4  2011-06-30  3578.2700  286.2616   89.4568  3953.9884
## 5  2011-07-01 32492.6040 3118.7048  974.5952 36585.9040
## 6  2011-07-03  3578.2700  286.2616   89.4568  3953.9884
## 7  2011-07-22  3578.2700  286.2616   89.4568  3953.9884
## 8  2011-08-01  2039.9940  195.8394   61.1998  2297.0332
## 9  2011-08-01  1362.3067  130.1463   40.6707  1533.1237
## 10 2011-08-07  3578.2700  286.2616   89.4568  3953.9884
## 11 2011-08-07  3578.2700  286.2616   89.4568  3953.9884
## 12 2011-08-14  3578.2700  286.2616   89.4568  3953.9884
## 13 2011-09-06  3578.2700  286.2616   89.4568  3953.9884
## 14 2011-09-08  3374.9900  269.9992   84.3748  3729.3640
## 15 2011-09-08   699.0982   55.9279   17.4775   772.5036
## 16 2011-09-10  3578.2700  286.2616   89.4568  3953.9884
## 17 2011-09-11  3578.2700  286.2616   89.4568  3953.9884
## 18 2011-09-12  3578.2700  286.2616   89.4568  3953.9884
## 19 2011-09-19  3578.2700  286.2616   89.4568  3953.9884
## 20 2011-10-01 35651.0339 3424.4400 1070.1375 40145.6114

Exact sample of 100 records

This technique depends on knowing the range of a record index, such as the businessentityid in the salesorderheader table of our adventureworks database.

Start by finding the min and max values.

DBI::dbListFields(con, "salesorderheader")

##  [1] "salesorderid"           "revisionnumber"         "orderdate"             
##  [4] "duedate"                "shipdate"               "status"                
##  [7] "onlineorderflag"        "purchaseordernumber"    "accountnumber"         
## [10] "customerid"             "salespersonid"          "territoryid"           
## [13] "billtoaddressid"        "shiptoaddressid"        "shipmethodid"          
## [16] "creditcardid"           "creditcardapprovalcode" "currencyrateid"        
## [19] "subtotal"               "taxamt"                 "freight"               
## [22] "totaldue"               "comment"                "rowguid"               
## [25] "modifieddate"

salesorderheader_df <- DBI::dbReadTable(con, "salesorderheader")

(max_id <- max(salesorderheader_df$salesorderid))

## [1] 75123

(min_id <- min(salesorderheader_df$salesorderid))

## [1] 43659

Set the random number seed and draw the sample.

set.seed(123)
sample_rows <- sample(1:max(salesorderheader_df$salesorderid), 10)
salesorderheader_table <- dplyr::tbl(con, "salesorderheader")

Run query with the filter verb listing the randomly sampled rows to be retrieved:

salesorderheader_sample <- salesorderheader_table %>% 
  dplyr::filter(salesorderid %in% sample_rows) %>% 
  dplyr::collect()

str(salesorderheader_sample)

## Classes 'tbl_df', 'tbl' and 'data.frame':    7 obs. of  25 variables:
##  $ salesorderid          : int  45404 46435 51663 57870 62555 65161 68293
##  $ revisionnumber        : int  8 8 8 8 8 8 8
##  $ orderdate             : POSIXct, format: "2012-01-10" "2012-05-06" ...
##  $ duedate               : POSIXct, format: "2012-01-22" "2012-05-18" ...
##  $ shipdate              : POSIXct, format: "2012-01-17" "2012-05-13" ...
##  $ status                : int  5 5 5 5 5 5 5
##  $ onlineorderflag       : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  $ purchaseordernumber   : chr  NA NA NA NA ...
##  $ accountnumber         : chr  "10-4030-011217" "10-4030-012251" "10-4030-016327" "10-4030-018572" ...
##  $ customerid            : int  11217 12251 16327 18572 13483 29799 13239
##  $ salespersonid         : int  NA NA NA NA NA 281 NA
##  $ territoryid           : int  1 9 8 4 1 4 6
##  $ billtoaddressid       : int  19321 24859 19265 16902 15267 997 27923
##  $ shiptoaddressid       : int  19321 24859 19265 16902 15267 997 27923
##  $ shipmethodid          : int  1 1 1 1 1 5 1
##  $ creditcardid          : int  8241 13188 16357 1884 4409 12582 1529
##  $ creditcardapprovalcode: chr  "332581Vi42712" "635144Vi68383" "420152Vi84562" "1224478Vi9772" ...
##  $ currencyrateid        : int  NA 4121 NA NA NA NA 11581
##  $ subtotal              : num  3578 3375 2466 14 57 ...
##  $ taxamt                : num  286.26 270 197.31 1.12 4.56 ...
##  $ freight               : num  89.457 84.375 61.658 0.349 1.424 ...
##  $ totaldue              : num  3954 3729.4 2725.3 15.4 63 ...
##  $ comment               : chr  NA NA NA NA ...
##  $ rowguid               : chr  "358f91b2-dadd-4014-8d4f-7f9736cb664e" "eb312409-fcd5-4bac-bd3b-16d4bd7889db" "ddc60552-af98-4166-9249-d09d424d8430" "fe46e631-47b9-4e14-9da5-1e4a4a135364" ...
##  $ modifieddate          : POSIXct, format: "2012-01-17" "2012-05-13" ...

7.2.5 Sub-setting variables

A table in the DBMS may not only have many more rows than you want, but also many more columns. The select command controls which columns are retrieved.

salesorderheader_table %>% dplyr::select(orderdate, subtotal, taxamt, freight, totaldue) %>% 
  head()

## # Source:   lazy query [?? x 5]
## # Database: postgres [postgres@localhost:5432/adventureworks]
##   orderdate           subtotal taxamt freight totaldue
##   <dttm>                 <dbl>  <dbl>   <dbl>    <dbl>
## 1 2011-05-31 00:00:00   20566. 1972.    616.    23153.
## 2 2011-05-31 00:00:00    1294.  124.     38.8    1457.
## 3 2011-05-31 00:00:00   32726. 3154.    986.    36866.
## 4 2011-05-31 00:00:00   28833. 2775.    867.    32475.
## 5 2011-05-31 00:00:00     419.   40.3    12.6     472.
## 6 2011-05-31 00:00:00   24433. 2345.    733.    27510.

That’s exactly equivalent to submitting the following SQL commands directly:

DBI::dbGetQuery(
  con,
  'SELECT "orderdate", "subtotal", "taxamt", "freight", "totaldue"
    FROM "salesorderheader"
    LIMIT 6')

##    orderdate   subtotal    taxamt  freight   totaldue
## 1 2011-05-31 20565.6206 1971.5149 616.0984 23153.2339
## 2 2011-05-31  1294.2529  124.2483  38.8276  1457.3288
## 3 2011-05-31 32726.4786 3153.7696 985.5530 36865.8012
## 4 2011-05-31 28832.5289 2775.1646 867.2389 32474.9324
## 5 2011-05-31   419.4589   40.2681  12.5838   472.3108
## 6 2011-05-31 24432.6088 2344.9921 732.8100 27510.4109

We won’t discuss dplyr methods for sub-setting variables, deriving new ones, or sub-setting rows based on the values found in the table, because they are covered well in other places, including:

Comprehensive reference: https://dplyr.tidyverse.org/
Good tutorial: https://suzan.rbind.io/tags/dplyr/

In practice we find that, renaming variables is often quite important because the names in an SQL database might not meet your needs as an analyst. In “the wild”, you will find names that are ambiguous or overly specified, with spaces in them, and other problems that will make them difficult to use in R. It is good practice to do whatever renaming you are going to do in a predictable place like at the top of your code. The names in the adventureworks database are simple and clear, but if they were not, you might rename them for subsequent use in this way:

tbl(con, "salesorderheader") %>%
  dplyr::rename(order_date = orderdate, sub_total_amount = subtotal,
              tax_amount = taxamt, freight_amount = freight, total_due_amount = totaldue) %>% 
  dplyr::select(order_date, sub_total_amount, tax_amount, freight_amount, total_due_amount ) %>%
  show_query()

## <SQL>
## SELECT "orderdate" AS "order_date", "subtotal" AS "sub_total_amount", "taxamt" AS "tax_amount", "freight" AS "freight_amount", "totaldue" AS "total_due_amount"
## FROM "salesorderheader"

That’s equivalent to the following SQL code:

DBI::dbGetQuery(
  con,
    'SELECT "orderdate" AS "order_date", 
    "subtotal" AS "sub_total_amount", 
    "taxamt" AS "tax_amount", 
    "freight" AS "freight_amount", 
    "totaldue" AS "total_due_amount"
    FROM "salesorderheader"' ) %>% 
  head()

##   order_date sub_total_amount tax_amount freight_amount total_due_amount
## 1 2011-05-31       20565.6206  1971.5149       616.0984       23153.2339
## 2 2011-05-31        1294.2529   124.2483        38.8276        1457.3288
## 3 2011-05-31       32726.4786  3153.7696       985.5530       36865.8012
## 4 2011-05-31       28832.5289  2775.1646       867.2389       32474.9324
## 5 2011-05-31         419.4589    40.2681        12.5838         472.3108
## 6 2011-05-31       24432.6088  2344.9921       732.8100       27510.4109

The one difference is that the SQL code returns a regular data frame and the dplyr code returns a tibble. Notice that the seconds are grayed out in the tibble display.

7.3 Translating `dplyr` code to `SQL` queries

Where did the translations we’ve shown above come from? The show_query function shows how dplyr is translating your query to the dialect of the target DBMS.

The show_query() function shows you what dplyr is sending to the DBMS. It might be handy for inspecting what dplyr is doing or for showing your code to someone who is more SQL- than R-literate. In general we have used the function extensively in writing this book but in the final product we will not use it unless there is something in the SQL or the translation process that needs to be explained.

salesorderheader_table %>%
  dplyr::tally() %>%
  dplyr::show_query()

## <SQL>
## SELECT COUNT(*) AS "n"
## FROM "salesorderheader"

Here is an extensive discussion of how dplyr code is translated into SQL:

https://dbplyr.tidyverse.org/articles/sql-translation.html

If you prefer to use SQL directly, rather than dplyr, you can submit SQL code to the DBMS through the DBI::dbGetQuery function:

DBI::dbGetQuery(
  con,
  'SELECT COUNT(*) AS "n"
     FROM "salesorderheader"   '
)

##       n
## 1 31465

When you create a report to run repeatedly, you might want to put that query into R markdown. That way you can also execute that SQL code in a chunk with the following header:

{sql, connection=con, output.var = "query_results"}

SELECT COUNT(*) AS "n"
     FROM "salesorderheader";

R markdown stores that query result in a tibble which can be printed by referring to it:

query_results

##       n
## 1 31465

7.4 Mixing dplyr and SQL

When dplyr finds code that it does not know how to translate into SQL, it will simply pass it along to the DBMS. Therefore you can interleave native commands that your DBMS will understand in the middle of dplyr code. Consider this example that’s derived from (Ruiz 2019):

salesorderheader_table %>%
  dplyr::select_at(vars(subtotal, contains("date"))) %>% 
  dplyr::mutate(today = now()) %>%
  dplyr::show_query()

## <SQL>
## SELECT "subtotal", "orderdate", "duedate", "shipdate", "modifieddate", CURRENT_TIMESTAMP AS "today"
## FROM "salesorderheader"

That is native to PostgreSQL, not ANSI standard SQL.

Verify that it works:

salesorderheader_table %>%
  dplyr::select_at(vars(subtotal, contains("date"))) %>% 
  head() %>% 
  dplyr::mutate(today = now()) %>%
  dplyr::collect()

## # A tibble: 6 x 6
##   subtotal orderdate           duedate             shipdate           
##      <dbl> <dttm>              <dttm>              <dttm>             
## 1   20566. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 2    1294. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 3   32726. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 4   28833. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 5     419. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 6   24433. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## # … with 2 more variables: modifieddate <dttm>, today <dttm>

7.5 Examining a single table with R

Dealing with a large, complex database highlights the utility of specific tools in R. We include brief examples that we find to be handy:

Base R structure: str
Printing out some of the data: datatable, kable, and View
Summary statistics: summary
glimpse in the tibble package, which is included in the tidyverse
skim in the skimr package

7.5.1 `str` - a base package workhorse

str is a workhorse function that lists variables, their type and a sample of the first few variable values.

str(salesorderheader_tibble)

## 'data.frame':    31465 obs. of  13 variables:
##  $ salesorderid       : int  43659 43660 43661 43662 43663 43664 43665 43666 43667 43668 ...
##  $ revisionnumber     : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ orderdate          : POSIXct, format: "2011-05-31" "2011-05-31" ...
##  $ duedate            : POSIXct, format: "2011-06-12" "2011-06-12" ...
##  $ shipdate           : POSIXct, format: "2011-06-07" "2011-06-07" ...
##  $ status             : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ onlineorderflag    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ purchaseordernumber: chr  "PO522145787" "PO18850127500" "PO18473189620" "PO18444174044" ...
##  $ accountnumber      : chr  "10-4020-000676" "10-4020-000117" "10-4020-000442" "10-4020-000227" ...
##  $ customerid         : int  29825 29672 29734 29994 29565 29898 29580 30052 29974 29614 ...
##  $ salespersonid      : int  279 279 282 282 276 280 283 276 277 282 ...
##  $ territoryid        : int  5 5 6 6 4 1 1 4 3 6 ...
##  $ billtoaddressid    : int  985 921 517 482 1073 876 849 1074 629 529 ...

7.5.2 Always look at your data with `head`, `View`, or `kable`

There is no substitute for looking at your data and R provides several ways to just browse it. The head function controls the number of rows that are displayed. Note that tail does not work against a database object. In every-day practice you would look at more than the default 6 rows, but here we wrap head around the data frame:

sqlpetr::sp_print_df(head(salesorderheader_tibble))

7.5.3 The `summary` function in `base`

The base package’s summary function provides basic statistics that serve a unique diagnostic purpose in this context. For example, the following output shows that:

* `businessentityid` is a number from 1 to 16,049. In a previous section, we ran the `str` function and saw that there are 16,044 observations in this table. Therefore, the `businessentityid` seems to be sequential from 1:16049, but there are 5 values missing from that sequence. _Exercise for the Reader_: Which 5 values from 1:16049 are missing from `businessentityid` values in the `salesorderheader` table? (_Hint_: In the chapter on SQL Joins, you will learn the functions needed to answer this question.)
* The number of NA's in the `return_date` column is a good first guess as to the number of DVDs rented out or lost as of 2005-09-02 02:35:22.

summary(salesorderheader_tibble)

##   salesorderid   revisionnumber    orderdate                  
##  Min.   :43659   Min.   :8.000   Min.   :2011-05-31 00:00:00  
##  1st Qu.:51525   1st Qu.:8.000   1st Qu.:2013-06-20 00:00:00  
##  Median :59391   Median :8.000   Median :2013-11-03 00:00:00  
##  Mean   :59391   Mean   :8.001   Mean   :2013-08-21 12:05:04  
##  3rd Qu.:67257   3rd Qu.:8.000   3rd Qu.:2014-02-28 00:00:00  
##  Max.   :75123   Max.   :9.000   Max.   :2014-06-30 00:00:00  
##                                                               
##     duedate                       shipdate                       status 
##  Min.   :2011-06-12 00:00:00   Min.   :2011-06-07 00:00:00   Min.   :5  
##  1st Qu.:2013-07-02 00:00:00   1st Qu.:2013-06-27 00:00:00   1st Qu.:5  
##  Median :2013-11-15 00:00:00   Median :2013-11-10 00:00:00   Median :5  
##  Mean   :2013-09-02 12:05:41   Mean   :2013-08-28 12:06:06   Mean   :5  
##  3rd Qu.:2014-03-13 00:00:00   3rd Qu.:2014-03-08 00:00:00   3rd Qu.:5  
##  Max.   :2014-07-12 00:00:00   Max.   :2014-07-07 00:00:00   Max.   :5  
##                                                                         
##  onlineorderflag purchaseordernumber accountnumber        customerid   
##  Mode :logical   Length:31465        Length:31465       Min.   :11000  
##  FALSE:3806      Class :character    Class :character   1st Qu.:14432  
##  TRUE :27659     Mode  :character    Mode  :character   Median :19452  
##                                                         Mean   :20170  
##                                                         3rd Qu.:25994  
##                                                         Max.   :30118  
##                                                                        
##  salespersonid    territoryid     billtoaddressid
##  Min.   :274.0   Min.   : 1.000   Min.   :  405  
##  1st Qu.:277.0   1st Qu.: 4.000   1st Qu.:14080  
##  Median :279.0   Median : 6.000   Median :19449  
##  Mean   :280.6   Mean   : 6.091   Mean   :18263  
##  3rd Qu.:284.0   3rd Qu.: 9.000   3rd Qu.:24678  
##  Max.   :290.0   Max.   :10.000   Max.   :29883  
##  NA's   :27659

So the summary function is surprisingly useful as we first start to look at the table contents.

7.5.4 The `glimpse` function in the `tibble` package

The tibble package’s glimpse function is a more compact version of str:

tibble::glimpse(salesorderheader_tibble)

## Observations: 31,465
## Variables: 13
## $ salesorderid        <int> 43659, 43660, 43661, 43662, 43663, 43664, 43665, …
## $ revisionnumber      <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
## $ orderdate           <dttm> 2011-05-31, 2011-05-31, 2011-05-31, 2011-05-31, …
## $ duedate             <dttm> 2011-06-12, 2011-06-12, 2011-06-12, 2011-06-12, …
## $ shipdate            <dttm> 2011-06-07, 2011-06-07, 2011-06-07, 2011-06-07, …
## $ status              <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ onlineorderflag     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ purchaseordernumber <chr> "PO522145787", "PO18850127500", "PO18473189620", …
## $ accountnumber       <chr> "10-4020-000676", "10-4020-000117", "10-4020-0004…
## $ customerid          <int> 29825, 29672, 29734, 29994, 29565, 29898, 29580, …
## $ salespersonid       <int> 279, 279, 282, 282, 276, 280, 283, 276, 277, 282,…
## $ territoryid         <int> 5, 5, 6, 6, 4, 1, 1, 4, 3, 6, 1, 3, 1, 6, 2, 6, 3…
## $ billtoaddressid     <int> 985, 921, 517, 482, 1073, 876, 849, 1074, 629, 52…

7.5.5 The `skim` function in the `skimr` package

The skimr package has several functions that make it easy to examine an unknown data frame and assess what it contains. It is also extensible.

skimr::skim(salesorderheader_tibble)

Table 7.1: Data summary
Name	salesorderheader_tibble
Number of rows	31465
Number of columns	13
_______________________
Column type frequency:
character	2
logical	1
numeric	7
POSIXct	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
purchaseordernumber	27659	0.12	10	13	0	3806	0
accountnumber	0	1.00	14	14	0	19119	0

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
onlineorderflag	0	1	0.88	TRU: 27659, FAL: 3806

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
salesorderid	0	1.00	59391.00	9083.31	43659	51525	59391	67257	75123	▇▇▇▇▇
revisionnumber	0	1.00	8.00	0.03	8	8	8	8	9	▇▁▁▁▁
status	0	1.00	5.00	0.00	5	5	5	5	5	▁▁▇▁▁
customerid	0	1.00	20170.18	6261.73	11000	14432	19452	25994	30118	▇▆▅▅▇
salespersonid	27659	0.12	280.61	4.85	274	277	279	284	290	▇▅▅▂▃
territoryid	0	1.00	6.09	2.96	1	4	6	9	10	▃▅▃▅▇
billtoaddressid	0	1.00	18263.15	8210.07	405	14080	19449	24678	29883	▃▁▇▇▇

Variable type: POSIXct

skim_variable	complete_rate	min	max	median	n_unique
orderdate	1	2011-05-31	2014-06-30	2013-11-03	1124
duedate	1	2011-06-12	2014-07-12	2013-11-15	1124
shipdate	1	2011-06-07	2014-07-07	2013-11-10	1124

skimr::skim_to_wide(salesorderheader_tibble) #skimr doesn't like certain kinds of columns

## Warning: 'skimr::skim_to_wide' is deprecated.
## Use 'skim()' instead.
## See help("Deprecated")

Table 7.1: Data summary
Name	.data
Number of rows	31465
Number of columns	13
_______________________
Column type frequency:
character	2
logical	1
numeric	7
POSIXct	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
purchaseordernumber	27659	0.12	10	13	0	3806	0
accountnumber	0	1.00	14	14	0	19119	0

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
onlineorderflag	0	1	0.88	TRU: 27659, FAL: 3806

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
salesorderid	0	1.00	59391.00	9083.31	43659	51525	59391	67257	75123	▇▇▇▇▇
revisionnumber	0	1.00	8.00	0.03	8	8	8	8	9	▇▁▁▁▁
status	0	1.00	5.00	0.00	5	5	5	5	5	▁▁▇▁▁
customerid	0	1.00	20170.18	6261.73	11000	14432	19452	25994	30118	▇▆▅▅▇
salespersonid	27659	0.12	280.61	4.85	274	277	279	284	290	▇▅▅▂▃
territoryid	0	1.00	6.09	2.96	1	4	6	9	10	▃▅▃▅▇
billtoaddressid	0	1.00	18263.15	8210.07	405	14080	19449	24678	29883	▃▁▇▇▇

Variable type: POSIXct

skim_variable	complete_rate	min	max	median	n_unique
orderdate	1	2011-05-31	2014-06-30	2013-11-03	1124
duedate	1	2011-06-12	2014-07-12	2013-11-15	1124
shipdate	1	2011-06-07	2014-07-07	2013-11-10	1124

7.6 Disconnect from the database and stop Docker

dbDisconnect(con)
# or if using the connections package, use:
# connection_close(con)

sp_docker_stop("adventureworks")

7.7 Additional reading

(Wickham 2018)
(Baumer 2018)

References

Baumer, Benjamin S. 2018. “A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data.” Journal of Computational and Graphical Statistics, August. Informa UK Limited, 1–9. https://doi.org/10.1080/10618600.2018.1512867.

Ruiz, Edgar. 2019. “Big Data with R: RStudio 2019 Workshop.” RStudio. January 16, 2019. https://github.com/rstudio/bigdataclass.

Wickham, Hadley. 2018. “A Grammar of Data Manipulation * Dplyr.” . https://dplyr.tidyverse.org/.