Chapter 7 Introduction to DBMS queries

This chapter demonstrates how to:

  • Download all or part of a table from the DBMS, including different kinds of subsets
  • See how dplyr code is translated into SQL commands and how they can be mixed
  • Get acquainted with some useful functions and packages for investigating a single table
  • Begin thinking about how to divide the work between your local R session and the DBMS

7.2 Methods for downloading a single table

For the moment, assume you know something about the database and specifically what table you need to retrieve. We return to the topic of investigating the whole database later on.

## [1] 0

7.2.1 Read the entire table

There are a few different methods of getting data from a DBMS, and we’ll explore the different ways of controlling each one of them.

DBI::dbReadTable will download an entire table into an R tibble.

## 'data.frame':    31465 obs. of  25 variables:
##  $ salesorderid          : int  43659 43660 43661 43662 43663 43664 43665 43666 43667 43668 ...
##  $ revisionnumber        : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ orderdate             : POSIXct, format: "2011-05-31" "2011-05-31" ...
##  $ duedate               : POSIXct, format: "2011-06-12" "2011-06-12" ...
##  $ shipdate              : POSIXct, format: "2011-06-07" "2011-06-07" ...
##  $ status                : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ onlineorderflag       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ purchaseordernumber   : chr  "PO522145787" "PO18850127500" "PO18473189620" "PO18444174044" ...
##  $ accountnumber         : chr  "10-4020-000676" "10-4020-000117" "10-4020-000442" "10-4020-000227" ...
##  $ customerid            : int  29825 29672 29734 29994 29565 29898 29580 30052 29974 29614 ...
##  $ salespersonid         : int  279 279 282 282 276 280 283 276 277 282 ...
##  $ territoryid           : int  5 5 6 6 4 1 1 4 3 6 ...
##  $ billtoaddressid       : int  985 921 517 482 1073 876 849 1074 629 529 ...
##  $ shiptoaddressid       : int  985 921 517 482 1073 876 849 1074 629 529 ...
##  $ shipmethodid          : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ creditcardid          : int  16281 5618 1346 10456 4322 806 15232 13349 10370 1566 ...
##  $ creditcardapprovalcode: chr  "105041Vi84182" "115213Vi29411" "85274Vi6854" "125295Vi53935" ...
##  $ currencyrateid        : int  NA NA 4 4 NA NA NA NA NA 4 ...
##  $ subtotal              : num  20566 1294 32726 28833 419 ...
##  $ taxamt                : num  1971.5 124.2 3153.8 2775.2 40.3 ...
##  $ freight               : num  616.1 38.8 985.6 867.2 12.6 ...
##  $ totaldue              : num  23153 1457 36866 32475 472 ...
##  $ comment               : chr  NA NA NA NA ...
##  $ rowguid               : chr  "79b65321-39ca-4115-9cba-8fe0903e12e6" "738dc42d-d03b-48a1-9822-f95a67ea7389" "d91b9131-18a4-4a11-bc3a-90b6f53e9d74" "4a1ecfc0-cc3a-4740-b028-1c50bb48711c" ...
##  $ modifieddate          : POSIXct, format: "2011-06-07" "2011-06-07" ...

That’s very simple, but if the table is very large it may not be a problem, since R is designed to keep the entire table in memory. The tables that are found in an enterprise database such as adventureworks may be large, they are most often records kept by people. That somewhat limits their size (relative to data generated by machines) and expands the possibilities for human error.

Note that the first line of the str() output reports the total number of observations.

Later on we’ll use this tibble to demonstrate several packages and functions, but use only the first 13 columns for simplicity.

7.2.2 Create a pointer to a table that can be reused

The dplyr::tbl function gives us more control over access to a table by enabling control over which columns and rows to download. It creates an object that might look like a data frame, but it’s actually a list object that dplyr uses for constructing queries and retrieving data from the DBMS.

## [1] "tbl_PqConnection" "tbl_dbi"          "tbl_sql"          "tbl_lazy"        
## [5] "tbl"

7.2.3 Controlling the number of rows returned with collect()

The collect function triggers the creation of a tibble and controls the number of rows that the DBMS sends to R. For more complex queries, the dplyr::collect() function provides a mechanism to indicate what’s processed on on the DBMS server and what’s processed by R on the local machine. The chapter on Lazy Evaluation and Execution Environment discusses this issue in detail.

## [1]  3 25
## [1] 500  25

7.2.4 Retrieving random rows from the DBMS

When the DBMS contains many rows, a sample of the data may be plenty for your purposes. Although dplyr has nice functions to sample a data frame that’s already in R (e.g., the sample_n and sample_frac functions), to get a sample from the DBMS we have to use dbGetQuery to send native SQL to the database. To peek ahead, here is one example of a query that retrieves 20 rows from a 1% sample:

##     orderdate   subtotal    taxamt   freight   totaldue
## 1  2011-06-22   699.0982   55.9279   17.4775   772.5036
## 2  2011-06-25  3578.2700  286.2616   89.4568  3953.9884
## 3  2011-06-29  3374.9900  269.9992   84.3748  3729.3640
## 4  2011-06-30  3578.2700  286.2616   89.4568  3953.9884
## 5  2011-07-01 32492.6040 3118.7048  974.5952 36585.9040
## 6  2011-07-03  3578.2700  286.2616   89.4568  3953.9884
## 7  2011-07-22  3578.2700  286.2616   89.4568  3953.9884
## 8  2011-08-01  2039.9940  195.8394   61.1998  2297.0332
## 9  2011-08-01  1362.3067  130.1463   40.6707  1533.1237
## 10 2011-08-07  3578.2700  286.2616   89.4568  3953.9884
## 11 2011-08-07  3578.2700  286.2616   89.4568  3953.9884
## 12 2011-08-14  3578.2700  286.2616   89.4568  3953.9884
## 13 2011-09-06  3578.2700  286.2616   89.4568  3953.9884
## 14 2011-09-08  3374.9900  269.9992   84.3748  3729.3640
## 15 2011-09-08   699.0982   55.9279   17.4775   772.5036
## 16 2011-09-10  3578.2700  286.2616   89.4568  3953.9884
## 17 2011-09-11  3578.2700  286.2616   89.4568  3953.9884
## 18 2011-09-12  3578.2700  286.2616   89.4568  3953.9884
## 19 2011-09-19  3578.2700  286.2616   89.4568  3953.9884
## 20 2011-10-01 35651.0339 3424.4400 1070.1375 40145.6114

Exact sample of 100 records

This technique depends on knowing the range of a record index, such as the businessentityid in the salesorderheader table of our adventureworks database.

Start by finding the min and max values.

##  [1] "salesorderid"           "revisionnumber"         "orderdate"             
##  [4] "duedate"                "shipdate"               "status"                
##  [7] "onlineorderflag"        "purchaseordernumber"    "accountnumber"         
## [10] "customerid"             "salespersonid"          "territoryid"           
## [13] "billtoaddressid"        "shiptoaddressid"        "shipmethodid"          
## [16] "creditcardid"           "creditcardapprovalcode" "currencyrateid"        
## [19] "subtotal"               "taxamt"                 "freight"               
## [22] "totaldue"               "comment"                "rowguid"               
## [25] "modifieddate"
## [1] 75123
## [1] 43659

Set the random number seed and draw the sample.

Run query with the filter verb listing the randomly sampled rows to be retrieved:

## Classes 'tbl_df', 'tbl' and 'data.frame':    7 obs. of  25 variables:
##  $ salesorderid          : int  45404 46435 51663 57870 62555 65161 68293
##  $ revisionnumber        : int  8 8 8 8 8 8 8
##  $ orderdate             : POSIXct, format: "2012-01-10" "2012-05-06" ...
##  $ duedate               : POSIXct, format: "2012-01-22" "2012-05-18" ...
##  $ shipdate              : POSIXct, format: "2012-01-17" "2012-05-13" ...
##  $ status                : int  5 5 5 5 5 5 5
##  $ onlineorderflag       : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  $ purchaseordernumber   : chr  NA NA NA NA ...
##  $ accountnumber         : chr  "10-4030-011217" "10-4030-012251" "10-4030-016327" "10-4030-018572" ...
##  $ customerid            : int  11217 12251 16327 18572 13483 29799 13239
##  $ salespersonid         : int  NA NA NA NA NA 281 NA
##  $ territoryid           : int  1 9 8 4 1 4 6
##  $ billtoaddressid       : int  19321 24859 19265 16902 15267 997 27923
##  $ shiptoaddressid       : int  19321 24859 19265 16902 15267 997 27923
##  $ shipmethodid          : int  1 1 1 1 1 5 1
##  $ creditcardid          : int  8241 13188 16357 1884 4409 12582 1529
##  $ creditcardapprovalcode: chr  "332581Vi42712" "635144Vi68383" "420152Vi84562" "1224478Vi9772" ...
##  $ currencyrateid        : int  NA 4121 NA NA NA NA 11581
##  $ subtotal              : num  3578 3375 2466 14 57 ...
##  $ taxamt                : num  286.26 270 197.31 1.12 4.56 ...
##  $ freight               : num  89.457 84.375 61.658 0.349 1.424 ...
##  $ totaldue              : num  3954 3729.4 2725.3 15.4 63 ...
##  $ comment               : chr  NA NA NA NA ...
##  $ rowguid               : chr  "358f91b2-dadd-4014-8d4f-7f9736cb664e" "eb312409-fcd5-4bac-bd3b-16d4bd7889db" "ddc60552-af98-4166-9249-d09d424d8430" "fe46e631-47b9-4e14-9da5-1e4a4a135364" ...
##  $ modifieddate          : POSIXct, format: "2012-01-17" "2012-05-13" ...

7.2.5 Sub-setting variables

A table in the DBMS may not only have many more rows than you want, but also many more columns. The select command controls which columns are retrieved.

## # Source:   lazy query [?? x 5]
## # Database: postgres [postgres@localhost:5432/adventureworks]
##   orderdate           subtotal taxamt freight totaldue
##   <dttm>                 <dbl>  <dbl>   <dbl>    <dbl>
## 1 2011-05-31 00:00:00   20566. 1972.    616.    23153.
## 2 2011-05-31 00:00:00    1294.  124.     38.8    1457.
## 3 2011-05-31 00:00:00   32726. 3154.    986.    36866.
## 4 2011-05-31 00:00:00   28833. 2775.    867.    32475.
## 5 2011-05-31 00:00:00     419.   40.3    12.6     472.
## 6 2011-05-31 00:00:00   24433. 2345.    733.    27510.

That’s exactly equivalent to submitting the following SQL commands directly:

##    orderdate   subtotal    taxamt  freight   totaldue
## 1 2011-05-31 20565.6206 1971.5149 616.0984 23153.2339
## 2 2011-05-31  1294.2529  124.2483  38.8276  1457.3288
## 3 2011-05-31 32726.4786 3153.7696 985.5530 36865.8012
## 4 2011-05-31 28832.5289 2775.1646 867.2389 32474.9324
## 5 2011-05-31   419.4589   40.2681  12.5838   472.3108
## 6 2011-05-31 24432.6088 2344.9921 732.8100 27510.4109

We won’t discuss dplyr methods for sub-setting variables, deriving new ones, or sub-setting rows based on the values found in the table, because they are covered well in other places, including:

In practice we find that, renaming variables is often quite important because the names in an SQL database might not meet your needs as an analyst. In “the wild”, you will find names that are ambiguous or overly specified, with spaces in them, and other problems that will make them difficult to use in R. It is good practice to do whatever renaming you are going to do in a predictable place like at the top of your code. The names in the adventureworks database are simple and clear, but if they were not, you might rename them for subsequent use in this way:

## <SQL>
## SELECT "orderdate" AS "order_date", "subtotal" AS "sub_total_amount", "taxamt" AS "tax_amount", "freight" AS "freight_amount", "totaldue" AS "total_due_amount"
## FROM "salesorderheader"

That’s equivalent to the following SQL code:

##   order_date sub_total_amount tax_amount freight_amount total_due_amount
## 1 2011-05-31       20565.6206  1971.5149       616.0984       23153.2339
## 2 2011-05-31        1294.2529   124.2483        38.8276        1457.3288
## 3 2011-05-31       32726.4786  3153.7696       985.5530       36865.8012
## 4 2011-05-31       28832.5289  2775.1646       867.2389       32474.9324
## 5 2011-05-31         419.4589    40.2681        12.5838         472.3108
## 6 2011-05-31       24432.6088  2344.9921       732.8100       27510.4109

The one difference is that the SQL code returns a regular data frame and the dplyr code returns a tibble. Notice that the seconds are grayed out in the tibble display.

7.3 Translating dplyr code to SQL queries

Where did the translations we’ve shown above come from? The show_query function shows how dplyr is translating your query to the dialect of the target DBMS.

The show_query() function shows you what dplyr is sending to the DBMS. It might be handy for inspecting what dplyr is doing or for showing your code to someone who is more SQL- than R-literate. In general we have used the function extensively in writing this book but in the final product we will not use it unless there is something in the SQL or the translation process that needs to be explained.

## <SQL>
## SELECT COUNT(*) AS "n"
## FROM "salesorderheader"

Here is an extensive discussion of how dplyr code is translated into SQL:

If you prefer to use SQL directly, rather than dplyr, you can submit SQL code to the DBMS through the DBI::dbGetQuery function:

##       n
## 1 31465

When you create a report to run repeatedly, you might want to put that query into R markdown. That way you can also execute that SQL code in a chunk with the following header:

{sql, connection=con, output.var = "query_results"}

R markdown stores that query result in a tibble which can be printed by referring to it:

##       n
## 1 31465

7.4 Mixing dplyr and SQL

When dplyr finds code that it does not know how to translate into SQL, it will simply pass it along to the DBMS. Therefore you can interleave native commands that your DBMS will understand in the middle of dplyr code. Consider this example that’s derived from (Ruiz 2019):

## <SQL>
## SELECT "subtotal", "orderdate", "duedate", "shipdate", "modifieddate", CURRENT_TIMESTAMP AS "today"
## FROM "salesorderheader"

That is native to PostgreSQL, not ANSI standard SQL.

Verify that it works:

## # A tibble: 6 x 6
##   subtotal orderdate           duedate             shipdate           
##      <dbl> <dttm>              <dttm>              <dttm>             
## 1   20566. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 2    1294. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 3   32726. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 4   28833. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 5     419. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 6   24433. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## # … with 2 more variables: modifieddate <dttm>, today <dttm>

7.5 Examining a single table with R

Dealing with a large, complex database highlights the utility of specific tools in R. We include brief examples that we find to be handy:

  • Base R structure: str
  • Printing out some of the data: datatable, kable, and View
  • Summary statistics: summary
  • glimpse in the tibble package, which is included in the tidyverse
  • skim in the skimr package

7.5.1 str - a base package workhorse

str is a workhorse function that lists variables, their type and a sample of the first few variable values.

## 'data.frame':    31465 obs. of  13 variables:
##  $ salesorderid       : int  43659 43660 43661 43662 43663 43664 43665 43666 43667 43668 ...
##  $ revisionnumber     : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ orderdate          : POSIXct, format: "2011-05-31" "2011-05-31" ...
##  $ duedate            : POSIXct, format: "2011-06-12" "2011-06-12" ...
##  $ shipdate           : POSIXct, format: "2011-06-07" "2011-06-07" ...
##  $ status             : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ onlineorderflag    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ purchaseordernumber: chr  "PO522145787" "PO18850127500" "PO18473189620" "PO18444174044" ...
##  $ accountnumber      : chr  "10-4020-000676" "10-4020-000117" "10-4020-000442" "10-4020-000227" ...
##  $ customerid         : int  29825 29672 29734 29994 29565 29898 29580 30052 29974 29614 ...
##  $ salespersonid      : int  279 279 282 282 276 280 283 276 277 282 ...
##  $ territoryid        : int  5 5 6 6 4 1 1 4 3 6 ...
##  $ billtoaddressid    : int  985 921 517 482 1073 876 849 1074 629 529 ...

7.5.2 Always look at your data with head, View, or kable

There is no substitute for looking at your data and R provides several ways to just browse it. The head function controls the number of rows that are displayed. Note that tail does not work against a database object. In every-day practice you would look at more than the default 6 rows, but here we wrap head around the data frame:

7.5.3 The summary function in base

The base package’s summary function provides basic statistics that serve a unique diagnostic purpose in this context. For example, the following output shows that:

* `businessentityid` is a number from 1 to 16,049. In a previous section, we ran the `str` function and saw that there are 16,044 observations in this table. Therefore, the `businessentityid` seems to be sequential from 1:16049, but there are 5 values missing from that sequence. _Exercise for the Reader_: Which 5 values from 1:16049 are missing from `businessentityid` values in the `salesorderheader` table? (_Hint_: In the chapter on SQL Joins, you will learn the functions needed to answer this question.)
* The number of NA's in the `return_date` column is a good first guess as to the number of DVDs rented out or lost as of 2005-09-02 02:35:22.
##   salesorderid   revisionnumber    orderdate                  
##  Min.   :43659   Min.   :8.000   Min.   :2011-05-31 00:00:00  
##  1st Qu.:51525   1st Qu.:8.000   1st Qu.:2013-06-20 00:00:00  
##  Median :59391   Median :8.000   Median :2013-11-03 00:00:00  
##  Mean   :59391   Mean   :8.001   Mean   :2013-08-21 12:05:04  
##  3rd Qu.:67257   3rd Qu.:8.000   3rd Qu.:2014-02-28 00:00:00  
##  Max.   :75123   Max.   :9.000   Max.   :2014-06-30 00:00:00  
##                                                               
##     duedate                       shipdate                       status 
##  Min.   :2011-06-12 00:00:00   Min.   :2011-06-07 00:00:00   Min.   :5  
##  1st Qu.:2013-07-02 00:00:00   1st Qu.:2013-06-27 00:00:00   1st Qu.:5  
##  Median :2013-11-15 00:00:00   Median :2013-11-10 00:00:00   Median :5  
##  Mean   :2013-09-02 12:05:41   Mean   :2013-08-28 12:06:06   Mean   :5  
##  3rd Qu.:2014-03-13 00:00:00   3rd Qu.:2014-03-08 00:00:00   3rd Qu.:5  
##  Max.   :2014-07-12 00:00:00   Max.   :2014-07-07 00:00:00   Max.   :5  
##                                                                         
##  onlineorderflag purchaseordernumber accountnumber        customerid   
##  Mode :logical   Length:31465        Length:31465       Min.   :11000  
##  FALSE:3806      Class :character    Class :character   1st Qu.:14432  
##  TRUE :27659     Mode  :character    Mode  :character   Median :19452  
##                                                         Mean   :20170  
##                                                         3rd Qu.:25994  
##                                                         Max.   :30118  
##                                                                        
##  salespersonid    territoryid     billtoaddressid
##  Min.   :274.0   Min.   : 1.000   Min.   :  405  
##  1st Qu.:277.0   1st Qu.: 4.000   1st Qu.:14080  
##  Median :279.0   Median : 6.000   Median :19449  
##  Mean   :280.6   Mean   : 6.091   Mean   :18263  
##  3rd Qu.:284.0   3rd Qu.: 9.000   3rd Qu.:24678  
##  Max.   :290.0   Max.   :10.000   Max.   :29883  
##  NA's   :27659

So the summary function is surprisingly useful as we first start to look at the table contents.

7.5.4 The glimpse function in the tibble package

The tibble package’s glimpse function is a more compact version of str:

## Observations: 31,465
## Variables: 13
## $ salesorderid        <int> 43659, 43660, 43661, 43662, 43663, 43664, 43665, …
## $ revisionnumber      <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
## $ orderdate           <dttm> 2011-05-31, 2011-05-31, 2011-05-31, 2011-05-31, …
## $ duedate             <dttm> 2011-06-12, 2011-06-12, 2011-06-12, 2011-06-12, …
## $ shipdate            <dttm> 2011-06-07, 2011-06-07, 2011-06-07, 2011-06-07, …
## $ status              <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ onlineorderflag     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ purchaseordernumber <chr> "PO522145787", "PO18850127500", "PO18473189620", …
## $ accountnumber       <chr> "10-4020-000676", "10-4020-000117", "10-4020-0004…
## $ customerid          <int> 29825, 29672, 29734, 29994, 29565, 29898, 29580, …
## $ salespersonid       <int> 279, 279, 282, 282, 276, 280, 283, 276, 277, 282,…
## $ territoryid         <int> 5, 5, 6, 6, 4, 1, 1, 4, 3, 6, 1, 3, 1, 6, 2, 6, 3…
## $ billtoaddressid     <int> 985, 921, 517, 482, 1073, 876, 849, 1074, 629, 52…

7.5.5 The skim function in the skimr package

The skimr package has several functions that make it easy to examine an unknown data frame and assess what it contains. It is also extensible.

Table 7.1: Data summary
Name salesorderheader_tibble
Number of rows 31465
Number of columns 13
_______________________
Column type frequency:
character 2
logical 1
numeric 7
POSIXct 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
purchaseordernumber 27659 0.12 10 13 0 3806 0
accountnumber 0 1.00 14 14 0 19119 0

Variable type: logical

skim_variable n_missing complete_rate mean count
onlineorderflag 0 1 0.88 TRU: 27659, FAL: 3806

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
salesorderid 0 1.00 59391.00 9083.31 43659 51525 59391 67257 75123 ▇▇▇▇▇
revisionnumber 0 1.00 8.00 0.03 8 8 8 8 9 ▇▁▁▁▁
status 0 1.00 5.00 0.00 5 5 5 5 5 ▁▁▇▁▁
customerid 0 1.00 20170.18 6261.73 11000 14432 19452 25994 30118 ▇▆▅▅▇
salespersonid 27659 0.12 280.61 4.85 274 277 279 284 290 ▇▅▅▂▃
territoryid 0 1.00 6.09 2.96 1 4 6 9 10 ▃▅▃▅▇
billtoaddressid 0 1.00 18263.15 8210.07 405 14080 19449 24678 29883 ▃▁▇▇▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
orderdate 0 1 2011-05-31 2014-06-30 2013-11-03 1124
duedate 0 1 2011-06-12 2014-07-12 2013-11-15 1124
shipdate 0 1 2011-06-07 2014-07-07 2013-11-10 1124
## Warning: 'skimr::skim_to_wide' is deprecated.
## Use 'skim()' instead.
## See help("Deprecated")
Table 7.1: Data summary
Name .data
Number of rows 31465
Number of columns 13
_______________________
Column type frequency:
character 2
logical 1
numeric 7
POSIXct 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
purchaseordernumber 27659 0.12 10 13 0 3806 0
accountnumber 0 1.00 14 14 0 19119 0

Variable type: logical

skim_variable n_missing complete_rate mean count
onlineorderflag 0 1 0.88 TRU: 27659, FAL: 3806

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
salesorderid 0 1.00 59391.00 9083.31 43659 51525 59391 67257 75123 ▇▇▇▇▇
revisionnumber 0 1.00 8.00 0.03 8 8 8 8 9 ▇▁▁▁▁
status 0 1.00 5.00 0.00 5 5 5 5 5 ▁▁▇▁▁
customerid 0 1.00 20170.18 6261.73 11000 14432 19452 25994 30118 ▇▆▅▅▇
salespersonid 27659 0.12 280.61 4.85 274 277 279 284 290 ▇▅▅▂▃
territoryid 0 1.00 6.09 2.96 1 4 6 9 10 ▃▅▃▅▇
billtoaddressid 0 1.00 18263.15 8210.07 405 14080 19449 24678 29883 ▃▁▇▇▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
orderdate 0 1 2011-05-31 2014-06-30 2013-11-03 1124
duedate 0 1 2011-06-12 2014-07-12 2013-11-15 1124
shipdate 0 1 2011-06-07 2014-07-07 2013-11-10 1124

7.7 Additional reading

References

Baumer, Benjamin S. 2018. “A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data.” Journal of Computational and Graphical Statistics, August. Informa UK Limited, 1–9. https://doi.org/10.1080/10618600.2018.1512867.

Ruiz, Edgar. 2019. “Big Data with R: RStudio 2019 Workshop.” RStudio. January 16, 2019. https://github.com/rstudio/bigdataclass.

Wickham, Hadley. 2018. “A Grammar of Data Manipulation * Dplyr.” . https://dplyr.tidyverse.org/.