Chapter 1 Introduction

This chapter introduces:

The motivation for this book and the strategies we have adopted

Our approach to exploring issues “behind the enterprise firewall” using Docker to demonstrate access to a service like PostgreSQL from R

Our team and how this project came about

1.1 Using R to query a DBMS in your organization

Many R users (or useRs) live a dual life: in the vibrant open-source R community where R is created, improved, discussed, and taught. And then they go to work in a secured, complex, closed organizational environment where they may be on their own. Here is a request on the Rstudio community site for help that has been lightly edited to emphasize the generality that we see:

I’m trying to migrate some inherited scripts that […] to connect to a […] database to […] instead. I’ve reviewed the https://db.rstudio.com docs and tried a number of configurations but haven’t been able to connect. I’m in uncharted territory within my org, so haven’t been able to get much help internally.

This book will help you create a hybrid environment on your machine that can mimic some of the uncharted territory in your organization. It goes far beyond the basic connection issues and covers issues that you face when you are finding your way around or writing queries to your organization’s databases, not just when maintaining inherited scripts.

Technology hurdles. The interfaces (passwords, packages, etc.) and gaps between R and a back end database are hidden from public view as a matter of security, so pinpointing exactly where a problem is can be difficult. A simulated environment such as we offer here can be an important learning resource.
Scale issues. We see at least two types of scale issues. Handling large volumes of data so that performance issues must be a consideration requires a basic understanding of what’s happening in “the back end” (which is necessarily hidden from view). Therefore mastering techniques for drawing samples or small batches of data are essential. In addition to their size, your organization’s databases will often have structural characteristics that are complex and obscure. Data documentation is often incomplete and emphasizes operational characteristics, rather than analytic opportunities. A careful useR often needs to confirm the documentation on the fly and de-normalize data carefully.
Use cases. R users frequently need to make sense of an organization’s complex data structures and coding schemes to address incompletely formed questions so that informal exploratory data analysis has to be intuitive and fast. The technology details should not get in the way. Sharing and discussing exploratory and diagnostic retrieval techniquesis best in public, but is constrained by organizational requirements.

We have found that PostgreSQL in a Docker container solves many of the foregoing problems.

1.2 Docker as a tool for UseRs

Noam Ross’s “Docker for the UseR” (Ross 2018 a) suggests that there are four distinct Docker use-cases for useRs.

Make a fixed working environment for reproducible analysis
Access a service outside of R (e.g., PostgreSQL)
Create an R based service (e.g., with plumber)
Send our compute jobs to the cloud with minimal reconfiguration or revision

This book explores #2 because it allows us to work on the database access issues described above and to practice on an industrial-scale DBMS.

Docker is a comparatively easy way to simulate the relationship between an R/RStudio session and a database – all on on your machine (provided you have Docker installed and running).
Running PostgreSQL on a Docker container avoids OS or system dependencies or conflicts that cause confusion and limit reproducibility.
A Docker environment consumes relatively few resources. Our sandbox does much less but only includes PostgreSQL and sample data, so it takes up about 5% of the space taken up by the Vagrant environment that inspired this project. (Makubuya 2018)
A simple Docker container such as the one used in our sandbox is easy to use and could be extended for other uses.
Docker is a widely used technology for deploying applications in the cloud, so for many useRs it’s worth mastering.

1.3 Alternatives to Docker

We have found Docker to be a great tool for simulating the complexities of an enterprise environment. However, installing Docker can be challenging, especially for Windows users. Therefore the code in this book depends on PostgreSQL(Group 2019) in a Docker container, but it can all be readily adapted to either SQLite(Consortium 2019), PostgreSQL running natively on your computer, or even PostgreSQL running in the cloud. The technical details of these alternatives are all in separate chapters.

1.4 Packages used in this book

The following packages are used in this book:

bookdown
DBI
dbplyr
devtools
DiagrammeR
downloader
glue
here
knitr
RPostgres
skimr
sqlpetr (installs with: remotes::install_github("smithjd/sqlpetr", force = TRUE, quiet = TRUE, build = TRUE, build_opts = ""))
tidyverse

Note that when you install sqlpetr, it will install all the other packages you need as dependencies.

1.5 Who are we?

We have been collaborating on this book since the Summer of 2018, each of us chipping into the project as time permits:

Ian Franz - @ianfrantz
Jim Tyhurst - @jimtyhurst
John David Smith - @smithjd
M. Edward (Ed) Borasky - @znmeb
Maryanne Thygesen @maryannet
Scott Came - @scottcame
Sophie Yang - @SophieMYang

1.6 How did this project come about?

We trace this book back to the June 2, 2018 Cascadia R Conf where Aaron Makubuya gave a presentation using Vagrant hosting (Makubuya 2018). After that John Smith, Ian Franz, and Sophie Yang had discussions after the monthly Data Discussion Meetups about the difficulties around setting up Vagrant (a virtual environment), connecting to an enterprise database, and having realistic public environment to demo or practice the issues that come up behind corporate firewalls. Scott Came’s tutorial on R and Docker (Came 2018) (an alternative to Vagrant) at the 2018 UseR Conference in Melbourne was provocative and it turned out he lived nearby. We re-connected with M. Edward (Ed) Borasky who had done extensive development for a Hack Oregon data science containerization project (Borasky 2018).

1.7 Navigation

If this is the first bookdown (Xie 2016) book you’ve read, here’s how to navigate the website.

The controls on the upper left: there are four controls on the upper left.
- A “hamburger” menu: this toggles the table of contents on the left side of the page on or off.
- A magnifying glass: this toggles a search box on or off.
- A letter “A”: this lets you pick how you want the site to display. You have your choice of small or large text, a serif or sans-serif font, and a white, sepia or night theme.
- A pencil: this is the “Edit” button. This will take you to a GitHub edit dialog for the chapter you’re reading. If you’re a committer to the repository, you’ll be able to edit the source directly.
  
  If not, GitHub will fork a copy of the repository to your own account and you’ll be able to edit that version. Then you can make a pull request.
The share buttons in the upper right hand corner. There’s one for Twitter, one for Facebook, and one that gives a menu of options, including LinkedIn.

References

Borasky, M. Edward (Ed). 2018. “Data Science Pet Containers.” 2018. https://github.com/znmeb/data-science-pet-containers.

Came, Scott. 2018. “Running R in Docker - Part 1 - Getting Started.” July 21, 2018. http://www.cascadia-analytics.com/2018/07/21/docker-r-p1.html.

Consortium, The SQLite. 2019. “SQLite.” https://sqlite.com/index.html.

Group, The PostgreSQL Global Development. 2019. https://www.postgresql.org.

Makubuya, Aaron. 2018. “Using R with Databases.” 2018. https://github.com/Cascadia-R/Using_R_With_Databases.

Ross, Noam. 2018a. “Docker for the UseR - Slides.” July 10, 2018. https://github.com/noamross/nyhackr-docker-talk/blob/master/Noam_Ross_DockerForTheUseR_nyhackr_2018-07-10.pdf.

Xie, Yihui. 2016. Bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman; Hall/CRC. https://github.com/rstudio/bookdown.

Exploring Enterprise Databases with R: A Tidyverse Approach