Chapter 8 Asking Business Questions From a Single Table

This chapter explores:

  • Issues that come up when investigating a single table from a business perspective
  • Show the multiple data anomalies found in a single AdventureWorks table (salesorderheader)
  • The interplay between “data questions” and “business questions”

The previous chapter has demonstrated some of the automated techniques for showing what’s in a table using some standard R functions and packages. Now we demonstrate a step-by-step process of making sense of what’s in one table with more of a business perspective. We illustrate the kind of detective work that’s often involved as we investigate the organizational meaning of the data in a table. We’ll investigate the salesorderheader table in the sales schema in this example to understand the sales profile of the “AdventureWorks” business. We show that there are quite a few interpretation issues even when we are examining just 3 out of the 25 columns in one table.

For this kind of detective work we are seeking to understand the following elements separately and as they interact with each other:

  • What data is stored in the database
  • How information is represented
  • How the data is entered at a day-to-day level to represent business activities
  • How the business itself is changing over time

8.1 Setup our standard working environment

Use these libraries:

Connect to adventureworks. In an interactive session we prefer to use connections::connection_open instead of dbConnect

Some queries generate big integers, so we need to include RPostgres::Postgres() and bigint = "integer" in the connections statement because some functions in the tidyverse object to the bigint datatype.

8.2 A word on naming

You will find that many tables will have columns with the same name in an enterprise database. For example, in the AdventureWorks database, almost all tables have columns named rowguid and modifieddate and there are many other examples of names that are reused throughout the database. Duplicate columns are best renamed or deliberately dropped. The meaning of a column depends on the table that contains it, so as you pull a column out of a table, when renaming it the collumns provenance should be reflected in the new name.

Naming columns carefully (whether retrieved from the database or calculated) will pay off, especially as our queries become more complex. Using soh as an abbreviation of sales order header to tag columns or statistics that are derived from the salesorderheader table, as we do in this book, is one example of an intentional naming strategy: it reminds us of the original source of the data. You, future you, and your collaborators will appreciate the effort no matter what naming convention you adopt. And a naming convention when rigidly applied can yield some long and ugly names.

In the following example soh appears in different positions in the column name but it is easy to guess at a glance that the data comes from the salesorderheader table.

Naming derived tables is just as important as naming derived columns.

8.3 The overall AdventureWorks sales picture

We begin by looking at Sales on a yearly basis, then consider monthly sales. We discover that half way through the period represented in the database, the business appears to begin selling online, which has very different characteristics than sales by Sales Reps. We then look at the details of how Sales Rep sales are recorded in the system and discover a data anomaly that we can correct.

8.4 Annual sales

On an annual basis, are sales dollars trending up, down or flat? We begin with annual revenue and number of orders.

Note that all of this query is running on the server since the collect() statement is at the very end.

## Classes 'tbl_df', 'tbl' and 'data.frame':    4 obs. of  6 variables:
##  $ year                 : chr  "2011" "2012" "2013" "2014"
##  $ min_soh_orderdate    : POSIXct, format: "2011-05-31" "2012-01-01" ...
##  $ max_soh_orderdate    : POSIXct, format: "2011-12-31" "2012-12-31" ...
##  $ total_soh_dollars    : num  12641672 33524301 43622479 20057929
##  $ avg_total_soh_dollars: num  7867 8563 3076 1705
##  $ soh_count            : int  1607 3915 14182 11761

We hang on to some date information for later use in plot titles.

8.4.1 Annual summary of sales, number of transactions and average sale

Both 2011 and 2014 turn out to be are shorter time spans than the other two years, making comparison interpretation difficult. Still, it’s clear that 2013 was the best year for annual sales dollars.

Comparing the number of orders per year has roughly the same overall pattern (2013 ranks highest, etc.) but the proportions between the years are quite different.

Although 2013 was the best year in terms of total number of orders, there were many more in 2014 compared with 2012. That suggests looking at the average dollars per sale for each year.

8.4.2 Average dollars per sale

AdventureWorks sales performance

Figure 8.1: AdventureWorks sales performance

That’s a big drop between average sale of more than $7,000 in the first two years down to the $3,000 range in the last two. There has been a remarkable change in this business. At the same time the total number of orders shot up from less than 4,000 a year to more than 14,000. Why are the number of orders increasing, but the average dollar amount of a sale is dropping?

Perhaps monthly monthly sales has the answer. We adapt the first query to group by month and year.

8.5 Monthly Sales

Our next iteration drills down from annual sales dollars to monthly sales dollars. For that we download the orderdate as a date, rather than a character variable for the year. R handles the conversion from the PostgreSQL date-time to an R date-time. We then convert it to a simple date with a lubridate function.

The following query uses the postgreSQL function date_trunc, which is equivalent to lubridate’s round_date function in R. If you want to push as much of the processing as possible onto the database server and thus possibly deal with smaller datasets in R, interleaving postgreSQL functions into your dplyr code will help.

## <SQL>
## SELECT "orderdate", ROUND((SUM("subtotal")) :: numeric, 2) AS "total_soh_dollars", ROUND((AVG("subtotal")) :: numeric, 2) AS "avg_total_soh_dollars", COUNT(*) AS "soh_count"
## FROM (SELECT date_trunc('month', "orderdate") AS "orderdate", "subtotal"
## FROM sales.salesorderheader) "dbplyr_004"
## GROUP BY "orderdate"

Note that date_trunc('month', orderdate) gets passed through exactly “as is.”

In many cases we don’t really care whether our queries are executed by R or by the SQL server, but if we do care we need to substitute the postgreSQL equivalent for the R functions we might ordinarily use. In those cases we have to check whether functions from R packages like lubridate and the equivalent postgreSQL functions are exactly alike. Often they are subtly different: in the previous query the postgreSQL function produces a POSIXct column, not a Date so we need to tack on a mutate function once the data is on the R side as shown here:

Next let’s plot the monthly sales data:

Total Monthly Sales

Figure 8.2: Total Monthly Sales

That graph doesn’t show how the business might have changed, but it is remarkable how much variation there is from one month to another – particularly in 2012 and 2014.

8.5.1 Check lagged monthly data

Because of the month-over-month sales variation. We’ll use dplyr::lag to help find the delta and later visualize just how much month-to-month difference there is.

## [1] -221690.505
##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
## -5879806.05 -1172995.19  -221690.51    11968.42  1159252.70  5420357.17

The average month over month change in sales looks OK ($ 11,968) although the Median is negative: $ 11,968. There is a very wide spread in our month-over-month sales data between the lower and upper quartile. We can plot the variation as follows:

Monthly Sales Change

Figure 8.3: Monthly Sales Change

It looks like the big change in the business occurred in the summer of 2013 when the number of orders jumped but the dollar volume just continued to bump along.

8.6 The effect of online sales

We suspect that the business has changed a lot with the advent of online orders so we check the impact of onlineorderflag on annual sales. The onlineorderflag indicates which sales channel accounted for the sale, Sales Reps or Online.

8.7 Impact of order type on monthly sales

To dig into the difference between Sales Rep and Online sales we can look at monthly data.

8.7.1 Retrieve monthly sales with the onlineorderflag

This query puts the collect statement earlier than the previous queries.

Sales Channel unique_dates start_date end_date total_sales days_span
Online 38 2011-05-01 2014-06-01 29358677 1127 days
Sales Rep 34 2011-05-01 2014-05-01 80487704 1096 days

As this table shows, the Sales Rep dates don’t match the Online dates. They start on the same date, but have a different end. The Online dates include 2 months that are not included in the Sales Rep sales (which are the main sales channel by dollar volume).

8.7.2 Monthly variation compared to a trend line

Jumping to the trend line comparison, we see that the big the source of variation is on the Sales Rep side.

Monthly Sales Trend

Figure 8.7: Monthly Sales Trend

The monthly gyrations are much larger on the Sales Rep side, amounting to differences in a million dollars compared to small monthly variations of around $25,000 for the Online orders.

8.7.3 Compare monthly lagged data by Sales Channel

First consider month-to-month change.

The following table shows some wild changes in dollar amounts and number of sales from one month to the next.

orderdate Channel $ this Month $ last Month # this Month # last Month $ change # change
2011-06-01 Online 458,911 14,477 141 5 307,000% 272,000%
2013-07-01 Online 847,139 860,141 1564 533 −200% 19,300%
2011-07-01 Sales Rep 1,538,408 489,329 75 38 21,400% 9,700%
2012-01-01 Sales Rep 3,356,069 713,117 143 40 37,100% 25,800%
2012-03-01 Sales Rep 2,269,117 882,900 85 37 15,700% 13,000%
2014-03-01 Sales Rep 5,526,352 3,231 271 3 17,096,000% 893,300%
2014-05-01 Sales Rep 3,415,479 1,285 179 2 26,573,900% 885,000%

We suspect that the business has changed a lot with the advent of Online orders.

8.8 Detect and diagnose the day of the month problem

There have been several indications that Sales Rep sales are recorded once a month while Online sales are recorded on a daily basis.

8.8.1 Sales Rep Orderdate Distribution

Look at the dates when sales are entered for sales by Sales Reps. The following query / plot combination shows this pattern. and the exception for transactions entered on the first day of the month.

## Joining, by = "orderday"
## Warning: Removed 26 rows containing missing values (position_stack).
Days of the month with Sales Rep activity recorded

Figure 8.8: Days of the month with Sales Rep activity recorded

We can check on which months have orders entered on the first of the month.

##   year month day_1 day_28 day_29 day_30 day_31
## 1 2011     7    75     NA     NA     NA     NA
## 2 2011     8    60     NA     NA     NA     40
## 3 2011    10    90     NA     NA     NA     63
## 4 2011    12    40     NA     NA     NA     NA
## 5 2012     1    79     NA     64     NA     NA
## 6 2014     3    91     NA     NA      2    178
## 7 2014     5   179     NA     NA     NA     NA

There are two months with multiple sales rep order days for 2011, (11/08 and 11/10), one for 2012, (1201), and two in 2014, (14/01 and 14/03). The 14/03 is the only three day sales rep order month.

Are there months where there were no sales recorded for the sales reps?

There are two approaches. The first is to generate a list of months between the beginning and end of history and compare that to the Sales Rep records

## Classes 'tbl_df', 'tbl' and 'data.frame':    34 obs. of  2 variables:
##  $ orderdate: Date, format: "2011-05-01" "2011-07-01" ...
##  $ n        : int  1 1 1 1 1 1 1 1 1 1 ...
## # A tibble: 4 x 2
##   month_date date_exists
##   <date>     <lgl>      
## 1 2011-06-01 FALSE      
## 2 2011-09-01 FALSE      
## 3 2011-11-01 FALSE      
## 4 2014-06-01 FALSE
  • June, September, and November are missing for 2011.
  • June for 2014

The second approach is to use the dates found in the database for online orders. Defining “complete” may not always be as simple as generating a complete list of months.

## Joining, by = "orderdate"
## <SQL>
## SELECT COALESCE("LHS"."orderdate", "RHS"."orderdate") AS "orderdate", "LHS"."sales_rep_count" AS "sales_rep_count", "RHS"."online_count" AS "online_count"
## FROM (SELECT "orderdate", COUNT(*) AS "sales_rep_count"
## FROM (SELECT "salesorderid", "revisionnumber", date_trunc('month', "orderdate") AS "orderdate", "duedate", "shipdate", "status", "onlineorderflag", "purchaseordernumber", "accountnumber", "customerid", "salespersonid", "territoryid", "billtoaddressid", "shiptoaddressid", "shipmethodid", "creditcardid", "creditcardapprovalcode", "currencyrateid", "subtotal", "taxamt", "freight", "totaldue", "comment", "rowguid", "modifieddate"
## FROM (SELECT *
## FROM sales.salesorderheader
## WHERE ("onlineorderflag" = FALSE)) "dbplyr_010") "dbplyr_011"
## GROUP BY "orderdate") "LHS"
## FULL JOIN (SELECT "orderdate", COUNT(*) AS "online_count"
## FROM (SELECT "salesorderid", "revisionnumber", date_trunc('month', "orderdate") AS "orderdate", "duedate", "shipdate", "status", "onlineorderflag", "purchaseordernumber", "accountnumber", "customerid", "salespersonid", "territoryid", "billtoaddressid", "shiptoaddressid", "shipmethodid", "creditcardid", "creditcardapprovalcode", "currencyrateid", "subtotal", "taxamt", "freight", "totaldue", "comment", "rowguid", "modifieddate"
## FROM (SELECT *
## FROM sales.salesorderheader
## WHERE ("onlineorderflag" = TRUE)) "dbplyr_012") "dbplyr_013"
## GROUP BY "orderdate") "RHS"
## ON ("LHS"."orderdate" = "RHS"."orderdate")
## Joining, by = "orderdate"
## # A tibble: 4 x 2
##   orderdate           online_count
##   <dttm>                     <int>
## 1 2011-06-01 00:00:00          141
## 2 2011-09-01 00:00:00          157
## 3 2011-11-01 00:00:00          230
## 4 2014-06-01 00:00:00          939
## Classes 'tbl_df', 'tbl' and 'data.frame':    4 obs. of  2 variables:
##  $ orderdate   : POSIXct, format: "2011-06-01" "2011-09-01" ...
##  $ online_count: int  141 157 230 939

And in this case they agree!

discuss February issues. and stuff.

look at each year sepraately as a diagnostic

Use the same pivot strategy on the corrected data.

difference between detective work with a graph and just print it out. “now I see what’s driving the hint.”

We have xx months when we add the month before and the month after the suspicious months. We don’t know whether the problem postings have been carried forward or backward. We check for and eliminate duplicates as well.

  • Most of the Sales Reps’ orders are entered on a single day of the month, unique days = 1. It is possible that these are monthly recurring orders that get released on a given day of the month. If that is the case, what are the Sales Reps doing the rest of the month?
  • ** ?? The lines with multiple days, unique_days > 1, have a noticeable higher number of orders, so_cnt, and associated so dollars.?? **

8.9 Correcting the order date for Sales Reps

8.9.1 Define a date correction function in R

This code does the date-correction work on the R side:

Inspect:

## Classes 'tbl_df', 'tbl' and 'data.frame':    36 obs. of  3 variables:
##  $ year_month       : Date, format: "2011-05-01" "2011-06-01" ...
##  $ total_soh_dollars: num  489329 1538408 1165897 844721 2324136 ...
##  $ soh_count        : int  38 75 60 40 90 63 40 79 64 37 ...
## # A tibble: 12 x 3
##    year_month total_soh_dollars soh_count
##    <date>                 <dbl>     <int>
##  1 2011-05-01           489329.        38
##  2 2011-06-01          1538408.        75
##  3 2011-07-01          1165897.        60
##  4 2011-08-01           844721         40
##  5 2011-09-01          2324136.        90
##  6 2011-10-01          1702945.        63
##  7 2011-11-01           713117.        40
##  8 2011-12-01          1900789.        79
##  9 2014-01-01          2738752.       175
## 10 2014-02-01          2207772.        94
## 11 2014-03-01          3321810.       180
## 12 2014-04-01          3416764.       181

8.9.3 Use the PostgreSQL function

If you can do the heavy lifting on the database side, that’s good. R can do it, but it’s best for finding the issues.

## # A tibble: 14 x 4
##    adjusted_orderdate total_soh_dollars soh_count year_month
##    <date>                         <dbl>     <int> <date>    
##  1 2011-05-31                   489329.        38 2011-05-01
##  2 2011-06-30                  1538408.        75 2011-06-01
##  3 2011-07-31                  1165897.        60 2011-07-01
##  4 2011-08-31                   844721         40 2011-08-01
##  5 2011-09-30                  2324136.        90 2011-09-01
##  6 2011-10-31                  1702945.        63 2011-10-01
##  7 2011-11-30                   713117.        40 2011-11-01
##  8 2011-12-31                  1900789.        79 2011-12-01
##  9 2014-01-28                     1565.         2 2014-01-01
## 10 2014-01-29                  2737188.       173 2014-01-01
## 11 2014-02-28                  2207772.        94 2014-02-01
## 12 2014-03-30                     7291.         2 2014-03-01
## 13 2014-03-31                  3314519.       178 2014-03-01
## 14 2014-04-30                  3416764.       181 2014-04-01

There’s one minor difference between the two:

## [1] "Cols in y but not x: `adjusted_orderdate`. "

8.9.4 Monthly Sales by Order Type with corrected dates – relative to a trend line

## # A tibble: 10 x 3
##    orderdate  total_soh_dollars soh_count
##    <date>                 <dbl>     <int>
##  1 2011-05-01           489329.        38
##  2 2011-07-01          1538408.        75
##  3 2011-08-01          2010618.       100
##  4 2011-10-01          4027080.       153
##  5 2011-12-01           713117.        40
##  6 2014-01-01          2738752.       175
##  7 2014-02-01             3231.         3
##  8 2014-03-01          5526352.       271
##  9 2014-04-01             1285.         2
## 10 2014-05-01          3415479.       179
Comparing monthly_sales_rep_adjusted and monthly_sales_rep_as_is

Figure 8.9: Comparing monthly_sales_rep_adjusted and monthly_sales_rep_as_is

Sales still seem to gyrate! We have found that sales rep sales data is often very strange.

8.10 Disconnect from the database and stop Docker

## Warning in connection_release(conn@ptr): Already disconnected