Chapter 7 Introduction to DBMS queries
This chapter demonstrates how to:
- Download all or part of a table from the DBMS, including different kinds of subsets
- See how
dplyr
code is translated intoSQL
commands and how they can be mixed- Get acquainted with some useful functions and packages for investigating a single table
- Begin thinking about how to divide the work between your local R session and the DBMS
7.1 Setup
The following packages are used in this chapter:
library(tidyverse)
library(DBI)
library(RPostgres)
library(dbplyr)
require(knitr)
library(bookdown)
library(sqlpetr)
library(skimr)
library(connections)
sleep_default <- 3
Assume that the Docker container with PostgreSQL and the adventureworks database are ready to go. If not go back to [Chapter 6][#chapter_setup-adventureworks-db]
Connect to the database:
# con <- connection_open( # use in an interactive session
con <- dbConnect( # use in other settings
RPostgres::Postgres(),
# without the previous and next lines, some functions fail with bigint data
# so change int64 to integer
bigint = "integer",
host = "localhost",
user = Sys.getenv("DEFAULT_POSTGRES_USER_NAME"),
password = Sys.getenv("DEFAULT_POSTGRES_PASSWORD"),
dbname = "adventureworks",
port = 5432
)
7.2 Methods for downloading a single table
For the moment, assume you know something about the database and specifically what table you need to retrieve. We return to the topic of investigating the whole database later on.
## [1] 0
7.2.1 Read the entire table
There are a few different methods of getting data from a DBMS, and we’ll explore the different ways of controlling each one of them.
DBI::dbReadTable
will download an entire table into an R tibble.
## 'data.frame': 31465 obs. of 25 variables:
## $ salesorderid : int 43659 43660 43661 43662 43663 43664 43665 43666 43667 43668 ...
## $ revisionnumber : int 8 8 8 8 8 8 8 8 8 8 ...
## $ orderdate : POSIXct, format: "2011-05-31" "2011-05-31" ...
## $ duedate : POSIXct, format: "2011-06-12" "2011-06-12" ...
## $ shipdate : POSIXct, format: "2011-06-07" "2011-06-07" ...
## $ status : int 5 5 5 5 5 5 5 5 5 5 ...
## $ onlineorderflag : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ purchaseordernumber : chr "PO522145787" "PO18850127500" "PO18473189620" "PO18444174044" ...
## $ accountnumber : chr "10-4020-000676" "10-4020-000117" "10-4020-000442" "10-4020-000227" ...
## $ customerid : int 29825 29672 29734 29994 29565 29898 29580 30052 29974 29614 ...
## $ salespersonid : int 279 279 282 282 276 280 283 276 277 282 ...
## $ territoryid : int 5 5 6 6 4 1 1 4 3 6 ...
## $ billtoaddressid : int 985 921 517 482 1073 876 849 1074 629 529 ...
## $ shiptoaddressid : int 985 921 517 482 1073 876 849 1074 629 529 ...
## $ shipmethodid : int 5 5 5 5 5 5 5 5 5 5 ...
## $ creditcardid : int 16281 5618 1346 10456 4322 806 15232 13349 10370 1566 ...
## $ creditcardapprovalcode: chr "105041Vi84182" "115213Vi29411" "85274Vi6854" "125295Vi53935" ...
## $ currencyrateid : int NA NA 4 4 NA NA NA NA NA 4 ...
## $ subtotal : num 20566 1294 32726 28833 419 ...
## $ taxamt : num 1971.5 124.2 3153.8 2775.2 40.3 ...
## $ freight : num 616.1 38.8 985.6 867.2 12.6 ...
## $ totaldue : num 23153 1457 36866 32475 472 ...
## $ comment : chr NA NA NA NA ...
## $ rowguid : chr "79b65321-39ca-4115-9cba-8fe0903e12e6" "738dc42d-d03b-48a1-9822-f95a67ea7389" "d91b9131-18a4-4a11-bc3a-90b6f53e9d74" "4a1ecfc0-cc3a-4740-b028-1c50bb48711c" ...
## $ modifieddate : POSIXct, format: "2011-06-07" "2011-06-07" ...
That’s very simple, but if the table is very large it may not be a problem, since R is designed to keep the entire table in memory. The tables that are found in an enterprise database such as adventureworks
may be large, they are most often records kept by people. That somewhat limits their size (relative to data generated by machines) and expands the possibilities for human error.
Note that the first line of the str() output reports the total number of observations.
Later on we’ll use this tibble to demonstrate several packages and functions, but use only the first 13 columns for simplicity.
7.2.2 Create a pointer to a table that can be reused
The dplyr::tbl
function gives us more control over access to a table by enabling control over which columns and rows to download. It creates an object that might look like a data frame, but it’s actually a list object that dplyr
uses for constructing queries and retrieving data from the DBMS.
## [1] "tbl_PqConnection" "tbl_dbi" "tbl_sql" "tbl_lazy"
## [5] "tbl"
7.2.3 Controlling the number of rows returned with collect()
The collect
function triggers the creation of a tibble and controls the number of rows that the DBMS sends to R. For more complex queries, the dplyr::collect()
function provides a mechanism to indicate what’s processed on on the DBMS server and what’s processed by R on the local machine. The chapter on Lazy Evaluation and Execution Environment discusses this issue in detail.
## [1] 3 25
## [1] 500 25
7.2.4 Retrieving random rows from the DBMS
When the DBMS contains many rows, a sample of the data may be plenty for your purposes. Although dplyr
has nice functions to sample a data frame that’s already in R (e.g., the sample_n
and sample_frac
functions), to get a sample from the DBMS we have to use dbGetQuery
to send native SQL to the database. To peek ahead, here is one example of a query that retrieves 20 rows from a 1% sample:
one_percent_sample <- DBI::dbGetQuery(
con,
"SELECT orderdate, subtotal, taxamt, freight, totaldue
FROM salesorderheader TABLESAMPLE BERNOULLI(3) LIMIT 20;
"
)
one_percent_sample
## orderdate subtotal taxamt freight totaldue
## 1 2011-06-22 699.0982 55.9279 17.4775 772.5036
## 2 2011-06-25 3578.2700 286.2616 89.4568 3953.9884
## 3 2011-06-29 3374.9900 269.9992 84.3748 3729.3640
## 4 2011-06-30 3578.2700 286.2616 89.4568 3953.9884
## 5 2011-07-01 32492.6040 3118.7048 974.5952 36585.9040
## 6 2011-07-03 3578.2700 286.2616 89.4568 3953.9884
## 7 2011-07-22 3578.2700 286.2616 89.4568 3953.9884
## 8 2011-08-01 2039.9940 195.8394 61.1998 2297.0332
## 9 2011-08-01 1362.3067 130.1463 40.6707 1533.1237
## 10 2011-08-07 3578.2700 286.2616 89.4568 3953.9884
## 11 2011-08-07 3578.2700 286.2616 89.4568 3953.9884
## 12 2011-08-14 3578.2700 286.2616 89.4568 3953.9884
## 13 2011-09-06 3578.2700 286.2616 89.4568 3953.9884
## 14 2011-09-08 3374.9900 269.9992 84.3748 3729.3640
## 15 2011-09-08 699.0982 55.9279 17.4775 772.5036
## 16 2011-09-10 3578.2700 286.2616 89.4568 3953.9884
## 17 2011-09-11 3578.2700 286.2616 89.4568 3953.9884
## 18 2011-09-12 3578.2700 286.2616 89.4568 3953.9884
## 19 2011-09-19 3578.2700 286.2616 89.4568 3953.9884
## 20 2011-10-01 35651.0339 3424.4400 1070.1375 40145.6114
Exact sample of 100 records
This technique depends on knowing the range of a record index, such as the businessentityid
in the salesorderheader
table of our adventureworks
database.
Start by finding the min and max values.
## [1] "salesorderid" "revisionnumber" "orderdate"
## [4] "duedate" "shipdate" "status"
## [7] "onlineorderflag" "purchaseordernumber" "accountnumber"
## [10] "customerid" "salespersonid" "territoryid"
## [13] "billtoaddressid" "shiptoaddressid" "shipmethodid"
## [16] "creditcardid" "creditcardapprovalcode" "currencyrateid"
## [19] "subtotal" "taxamt" "freight"
## [22] "totaldue" "comment" "rowguid"
## [25] "modifieddate"
salesorderheader_df <- DBI::dbReadTable(con, "salesorderheader")
(max_id <- max(salesorderheader_df$salesorderid))
## [1] 75123
## [1] 43659
Set the random number seed and draw the sample.
set.seed(123)
sample_rows <- sample(1:max(salesorderheader_df$salesorderid), 10)
salesorderheader_table <- dplyr::tbl(con, "salesorderheader")
Run query with the filter verb listing the randomly sampled rows to be retrieved:
salesorderheader_sample <- salesorderheader_table %>%
dplyr::filter(salesorderid %in% sample_rows) %>%
dplyr::collect()
str(salesorderheader_sample)
## Classes 'tbl_df', 'tbl' and 'data.frame': 7 obs. of 25 variables:
## $ salesorderid : int 45404 46435 51663 57870 62555 65161 68293
## $ revisionnumber : int 8 8 8 8 8 8 8
## $ orderdate : POSIXct, format: "2012-01-10" "2012-05-06" ...
## $ duedate : POSIXct, format: "2012-01-22" "2012-05-18" ...
## $ shipdate : POSIXct, format: "2012-01-17" "2012-05-13" ...
## $ status : int 5 5 5 5 5 5 5
## $ onlineorderflag : logi TRUE TRUE TRUE TRUE TRUE FALSE ...
## $ purchaseordernumber : chr NA NA NA NA ...
## $ accountnumber : chr "10-4030-011217" "10-4030-012251" "10-4030-016327" "10-4030-018572" ...
## $ customerid : int 11217 12251 16327 18572 13483 29799 13239
## $ salespersonid : int NA NA NA NA NA 281 NA
## $ territoryid : int 1 9 8 4 1 4 6
## $ billtoaddressid : int 19321 24859 19265 16902 15267 997 27923
## $ shiptoaddressid : int 19321 24859 19265 16902 15267 997 27923
## $ shipmethodid : int 1 1 1 1 1 5 1
## $ creditcardid : int 8241 13188 16357 1884 4409 12582 1529
## $ creditcardapprovalcode: chr "332581Vi42712" "635144Vi68383" "420152Vi84562" "1224478Vi9772" ...
## $ currencyrateid : int NA 4121 NA NA NA NA 11581
## $ subtotal : num 3578 3375 2466 14 57 ...
## $ taxamt : num 286.26 270 197.31 1.12 4.56 ...
## $ freight : num 89.457 84.375 61.658 0.349 1.424 ...
## $ totaldue : num 3954 3729.4 2725.3 15.4 63 ...
## $ comment : chr NA NA NA NA ...
## $ rowguid : chr "358f91b2-dadd-4014-8d4f-7f9736cb664e" "eb312409-fcd5-4bac-bd3b-16d4bd7889db" "ddc60552-af98-4166-9249-d09d424d8430" "fe46e631-47b9-4e14-9da5-1e4a4a135364" ...
## $ modifieddate : POSIXct, format: "2012-01-17" "2012-05-13" ...
7.2.5 Sub-setting variables
A table in the DBMS may not only have many more rows than you want, but also many more columns. The select
command controls which columns are retrieved.
## # Source: lazy query [?? x 5]
## # Database: postgres [postgres@localhost:5432/adventureworks]
## orderdate subtotal taxamt freight totaldue
## <dttm> <dbl> <dbl> <dbl> <dbl>
## 1 2011-05-31 00:00:00 20566. 1972. 616. 23153.
## 2 2011-05-31 00:00:00 1294. 124. 38.8 1457.
## 3 2011-05-31 00:00:00 32726. 3154. 986. 36866.
## 4 2011-05-31 00:00:00 28833. 2775. 867. 32475.
## 5 2011-05-31 00:00:00 419. 40.3 12.6 472.
## 6 2011-05-31 00:00:00 24433. 2345. 733. 27510.
That’s exactly equivalent to submitting the following SQL commands directly:
DBI::dbGetQuery(
con,
'SELECT "orderdate", "subtotal", "taxamt", "freight", "totaldue"
FROM "salesorderheader"
LIMIT 6')
## orderdate subtotal taxamt freight totaldue
## 1 2011-05-31 20565.6206 1971.5149 616.0984 23153.2339
## 2 2011-05-31 1294.2529 124.2483 38.8276 1457.3288
## 3 2011-05-31 32726.4786 3153.7696 985.5530 36865.8012
## 4 2011-05-31 28832.5289 2775.1646 867.2389 32474.9324
## 5 2011-05-31 419.4589 40.2681 12.5838 472.3108
## 6 2011-05-31 24432.6088 2344.9921 732.8100 27510.4109
We won’t discuss dplyr
methods for sub-setting variables, deriving new ones, or sub-setting rows based on the values found in the table, because they are covered well in other places, including:
- Comprehensive reference: https://dplyr.tidyverse.org/
- Good tutorial: https://suzan.rbind.io/tags/dplyr/
In practice we find that, renaming variables is often quite important because the names in an SQL database might not meet your needs as an analyst. In “the wild”, you will find names that are ambiguous or overly specified, with spaces in them, and other problems that will make them difficult to use in R. It is good practice to do whatever renaming you are going to do in a predictable place like at the top of your code. The names in the adventureworks
database are simple and clear, but if they were not, you might rename them for subsequent use in this way:
tbl(con, "salesorderheader") %>%
dplyr::rename(order_date = orderdate, sub_total_amount = subtotal,
tax_amount = taxamt, freight_amount = freight, total_due_amount = totaldue) %>%
dplyr::select(order_date, sub_total_amount, tax_amount, freight_amount, total_due_amount ) %>%
show_query()
## <SQL>
## SELECT "orderdate" AS "order_date", "subtotal" AS "sub_total_amount", "taxamt" AS "tax_amount", "freight" AS "freight_amount", "totaldue" AS "total_due_amount"
## FROM "salesorderheader"
That’s equivalent to the following SQL code:
DBI::dbGetQuery(
con,
'SELECT "orderdate" AS "order_date",
"subtotal" AS "sub_total_amount",
"taxamt" AS "tax_amount",
"freight" AS "freight_amount",
"totaldue" AS "total_due_amount"
FROM "salesorderheader"' ) %>%
head()
## order_date sub_total_amount tax_amount freight_amount total_due_amount
## 1 2011-05-31 20565.6206 1971.5149 616.0984 23153.2339
## 2 2011-05-31 1294.2529 124.2483 38.8276 1457.3288
## 3 2011-05-31 32726.4786 3153.7696 985.5530 36865.8012
## 4 2011-05-31 28832.5289 2775.1646 867.2389 32474.9324
## 5 2011-05-31 419.4589 40.2681 12.5838 472.3108
## 6 2011-05-31 24432.6088 2344.9921 732.8100 27510.4109
The one difference is that the SQL
code returns a regular data frame and the dplyr
code returns a tibble
. Notice that the seconds are grayed out in the tibble
display.
7.3 Translating dplyr
code to SQL
queries
Where did the translations we’ve shown above come from? The show_query
function shows how dplyr
is translating your query to the dialect of the target DBMS.
The
show_query()
function shows you what dplyr is sending to the DBMS. It might be handy for inspecting what dplyr is doing or for showing your code to someone who is more SQL- than R-literate. In general we have used the function extensively in writing this book but in the final product we will not use it unless there is something in the SQL or the translation process that needs to be explained.
## <SQL>
## SELECT COUNT(*) AS "n"
## FROM "salesorderheader"
Here is an extensive discussion of how dplyr
code is translated into SQL:
If you prefer to use SQL directly, rather than dplyr
, you can submit SQL code to the DBMS through the DBI::dbGetQuery
function:
## n
## 1 31465
When you create a report to run repeatedly, you might want to put that query into R markdown. That way you can also execute that SQL code in a chunk with the following header:
{sql, connection=con, output.var = "query_results"
}
R markdown stores that query result in a tibble which can be printed by referring to it:
## n
## 1 31465
7.4 Mixing dplyr and SQL
When dplyr finds code that it does not know how to translate into SQL, it will simply pass it along to the DBMS. Therefore you can interleave native commands that your DBMS will understand in the middle of dplyr code. Consider this example that’s derived from (Ruiz 2019):
salesorderheader_table %>%
dplyr::select_at(vars(subtotal, contains("date"))) %>%
dplyr::mutate(today = now()) %>%
dplyr::show_query()
## <SQL>
## SELECT "subtotal", "orderdate", "duedate", "shipdate", "modifieddate", CURRENT_TIMESTAMP AS "today"
## FROM "salesorderheader"
That is native to PostgreSQL, not ANSI standard SQL.
Verify that it works:
salesorderheader_table %>%
dplyr::select_at(vars(subtotal, contains("date"))) %>%
head() %>%
dplyr::mutate(today = now()) %>%
dplyr::collect()
## # A tibble: 6 x 6
## subtotal orderdate duedate shipdate
## <dbl> <dttm> <dttm> <dttm>
## 1 20566. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 2 1294. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 3 32726. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 4 28833. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 5 419. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## 6 24433. 2011-05-31 00:00:00 2011-06-12 00:00:00 2011-06-07 00:00:00
## # … with 2 more variables: modifieddate <dttm>, today <dttm>
7.5 Examining a single table with R
Dealing with a large, complex database highlights the utility of specific tools in R. We include brief examples that we find to be handy:
- Base R structure:
str
- Printing out some of the data:
datatable
,kable
, andView
- Summary statistics:
summary
glimpse
in thetibble
package, which is included in thetidyverse
skim
in theskimr
package
7.5.1 str
- a base package workhorse
str
is a workhorse function that lists variables, their type and a sample of the first few variable values.
## 'data.frame': 31465 obs. of 13 variables:
## $ salesorderid : int 43659 43660 43661 43662 43663 43664 43665 43666 43667 43668 ...
## $ revisionnumber : int 8 8 8 8 8 8 8 8 8 8 ...
## $ orderdate : POSIXct, format: "2011-05-31" "2011-05-31" ...
## $ duedate : POSIXct, format: "2011-06-12" "2011-06-12" ...
## $ shipdate : POSIXct, format: "2011-06-07" "2011-06-07" ...
## $ status : int 5 5 5 5 5 5 5 5 5 5 ...
## $ onlineorderflag : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ purchaseordernumber: chr "PO522145787" "PO18850127500" "PO18473189620" "PO18444174044" ...
## $ accountnumber : chr "10-4020-000676" "10-4020-000117" "10-4020-000442" "10-4020-000227" ...
## $ customerid : int 29825 29672 29734 29994 29565 29898 29580 30052 29974 29614 ...
## $ salespersonid : int 279 279 282 282 276 280 283 276 277 282 ...
## $ territoryid : int 5 5 6 6 4 1 1 4 3 6 ...
## $ billtoaddressid : int 985 921 517 482 1073 876 849 1074 629 529 ...
7.5.2 Always look at your data with head
, View
, or kable
There is no substitute for looking at your data and R provides several ways to just browse it. The head
function controls the number of rows that are displayed. Note that tail does not work against a database object. In every-day practice you would look at more than the default 6 rows, but here we wrap head
around the data frame:
7.5.3 The summary
function in base
The base
package’s summary
function provides basic statistics that serve a unique diagnostic purpose in this context. For example, the following output shows that:
* `businessentityid` is a number from 1 to 16,049. In a previous section, we ran the `str` function and saw that there are 16,044 observations in this table. Therefore, the `businessentityid` seems to be sequential from 1:16049, but there are 5 values missing from that sequence. _Exercise for the Reader_: Which 5 values from 1:16049 are missing from `businessentityid` values in the `salesorderheader` table? (_Hint_: In the chapter on SQL Joins, you will learn the functions needed to answer this question.)
* The number of NA's in the `return_date` column is a good first guess as to the number of DVDs rented out or lost as of 2005-09-02 02:35:22.
## salesorderid revisionnumber orderdate
## Min. :43659 Min. :8.000 Min. :2011-05-31 00:00:00
## 1st Qu.:51525 1st Qu.:8.000 1st Qu.:2013-06-20 00:00:00
## Median :59391 Median :8.000 Median :2013-11-03 00:00:00
## Mean :59391 Mean :8.001 Mean :2013-08-21 12:05:04
## 3rd Qu.:67257 3rd Qu.:8.000 3rd Qu.:2014-02-28 00:00:00
## Max. :75123 Max. :9.000 Max. :2014-06-30 00:00:00
##
## duedate shipdate status
## Min. :2011-06-12 00:00:00 Min. :2011-06-07 00:00:00 Min. :5
## 1st Qu.:2013-07-02 00:00:00 1st Qu.:2013-06-27 00:00:00 1st Qu.:5
## Median :2013-11-15 00:00:00 Median :2013-11-10 00:00:00 Median :5
## Mean :2013-09-02 12:05:41 Mean :2013-08-28 12:06:06 Mean :5
## 3rd Qu.:2014-03-13 00:00:00 3rd Qu.:2014-03-08 00:00:00 3rd Qu.:5
## Max. :2014-07-12 00:00:00 Max. :2014-07-07 00:00:00 Max. :5
##
## onlineorderflag purchaseordernumber accountnumber customerid
## Mode :logical Length:31465 Length:31465 Min. :11000
## FALSE:3806 Class :character Class :character 1st Qu.:14432
## TRUE :27659 Mode :character Mode :character Median :19452
## Mean :20170
## 3rd Qu.:25994
## Max. :30118
##
## salespersonid territoryid billtoaddressid
## Min. :274.0 Min. : 1.000 Min. : 405
## 1st Qu.:277.0 1st Qu.: 4.000 1st Qu.:14080
## Median :279.0 Median : 6.000 Median :19449
## Mean :280.6 Mean : 6.091 Mean :18263
## 3rd Qu.:284.0 3rd Qu.: 9.000 3rd Qu.:24678
## Max. :290.0 Max. :10.000 Max. :29883
## NA's :27659
So the summary
function is surprisingly useful as we first start to look at the table contents.
7.5.4 The glimpse
function in the tibble
package
The tibble
package’s glimpse
function is a more compact version of str
:
## Observations: 31,465
## Variables: 13
## $ salesorderid <int> 43659, 43660, 43661, 43662, 43663, 43664, 43665, …
## $ revisionnumber <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
## $ orderdate <dttm> 2011-05-31, 2011-05-31, 2011-05-31, 2011-05-31, …
## $ duedate <dttm> 2011-06-12, 2011-06-12, 2011-06-12, 2011-06-12, …
## $ shipdate <dttm> 2011-06-07, 2011-06-07, 2011-06-07, 2011-06-07, …
## $ status <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ onlineorderflag <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ purchaseordernumber <chr> "PO522145787", "PO18850127500", "PO18473189620", …
## $ accountnumber <chr> "10-4020-000676", "10-4020-000117", "10-4020-0004…
## $ customerid <int> 29825, 29672, 29734, 29994, 29565, 29898, 29580, …
## $ salespersonid <int> 279, 279, 282, 282, 276, 280, 283, 276, 277, 282,…
## $ territoryid <int> 5, 5, 6, 6, 4, 1, 1, 4, 3, 6, 1, 3, 1, 6, 2, 6, 3…
## $ billtoaddressid <int> 985, 921, 517, 482, 1073, 876, 849, 1074, 629, 52…
7.5.5 The skim
function in the skimr
package
The skimr
package has several functions that make it easy to examine an unknown data frame and assess what it contains. It is also extensible.
Name | salesorderheader_tibble |
Number of rows | 31465 |
Number of columns | 13 |
_______________________ | |
Column type frequency: | |
character | 2 |
logical | 1 |
numeric | 7 |
POSIXct | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
purchaseordernumber | 27659 | 0.12 | 10 | 13 | 0 | 3806 | 0 |
accountnumber | 0 | 1.00 | 14 | 14 | 0 | 19119 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
onlineorderflag | 0 | 1 | 0.88 | TRU: 27659, FAL: 3806 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
salesorderid | 0 | 1.00 | 59391.00 | 9083.31 | 43659 | 51525 | 59391 | 67257 | 75123 | ▇▇▇▇▇ |
revisionnumber | 0 | 1.00 | 8.00 | 0.03 | 8 | 8 | 8 | 8 | 9 | ▇▁▁▁▁ |
status | 0 | 1.00 | 5.00 | 0.00 | 5 | 5 | 5 | 5 | 5 | ▁▁▇▁▁ |
customerid | 0 | 1.00 | 20170.18 | 6261.73 | 11000 | 14432 | 19452 | 25994 | 30118 | ▇▆▅▅▇ |
salespersonid | 27659 | 0.12 | 280.61 | 4.85 | 274 | 277 | 279 | 284 | 290 | ▇▅▅▂▃ |
territoryid | 0 | 1.00 | 6.09 | 2.96 | 1 | 4 | 6 | 9 | 10 | ▃▅▃▅▇ |
billtoaddressid | 0 | 1.00 | 18263.15 | 8210.07 | 405 | 14080 | 19449 | 24678 | 29883 | ▃▁▇▇▇ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
orderdate | 0 | 1 | 2011-05-31 | 2014-06-30 | 2013-11-03 | 1124 |
duedate | 0 | 1 | 2011-06-12 | 2014-07-12 | 2013-11-15 | 1124 |
shipdate | 0 | 1 | 2011-06-07 | 2014-07-07 | 2013-11-10 | 1124 |
## Warning: 'skimr::skim_to_wide' is deprecated.
## Use 'skim()' instead.
## See help("Deprecated")
Name | .data |
Number of rows | 31465 |
Number of columns | 13 |
_______________________ | |
Column type frequency: | |
character | 2 |
logical | 1 |
numeric | 7 |
POSIXct | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
purchaseordernumber | 27659 | 0.12 | 10 | 13 | 0 | 3806 | 0 |
accountnumber | 0 | 1.00 | 14 | 14 | 0 | 19119 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
onlineorderflag | 0 | 1 | 0.88 | TRU: 27659, FAL: 3806 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
salesorderid | 0 | 1.00 | 59391.00 | 9083.31 | 43659 | 51525 | 59391 | 67257 | 75123 | ▇▇▇▇▇ |
revisionnumber | 0 | 1.00 | 8.00 | 0.03 | 8 | 8 | 8 | 8 | 9 | ▇▁▁▁▁ |
status | 0 | 1.00 | 5.00 | 0.00 | 5 | 5 | 5 | 5 | 5 | ▁▁▇▁▁ |
customerid | 0 | 1.00 | 20170.18 | 6261.73 | 11000 | 14432 | 19452 | 25994 | 30118 | ▇▆▅▅▇ |
salespersonid | 27659 | 0.12 | 280.61 | 4.85 | 274 | 277 | 279 | 284 | 290 | ▇▅▅▂▃ |
territoryid | 0 | 1.00 | 6.09 | 2.96 | 1 | 4 | 6 | 9 | 10 | ▃▅▃▅▇ |
billtoaddressid | 0 | 1.00 | 18263.15 | 8210.07 | 405 | 14080 | 19449 | 24678 | 29883 | ▃▁▇▇▇ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
orderdate | 0 | 1 | 2011-05-31 | 2014-06-30 | 2013-11-03 | 1124 |
duedate | 0 | 1 | 2011-06-12 | 2014-07-12 | 2013-11-15 | 1124 |
shipdate | 0 | 1 | 2011-06-07 | 2014-07-07 | 2013-11-10 | 1124 |
7.6 Disconnect from the database and stop Docker
References
Baumer, Benjamin S. 2018. “A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data.” Journal of Computational and Graphical Statistics, August. Informa UK Limited, 1–9. https://doi.org/10.1080/10618600.2018.1512867.
Ruiz, Edgar. 2019. “Big Data with R: RStudio 2019 Workshop.” RStudio. January 16, 2019. https://github.com/rstudio/bigdataclass.
Wickham, Hadley. 2018. “A Grammar of Data Manipulation * Dplyr.” . https://dplyr.tidyverse.org/.