Exploring Enterprise Databases with R: A Tidyverse Approach
Chapter 1 Introduction
This chapter introduces:
- The motivation for this book and the strategies we have adopted
- Our approach to exploring issues “behind the enterprise firewall” using Docker to demonstrate access to a service like PostgreSQL from R
- Our team and how this project came about
1.1 Using R to query a DBMS in your organization
Many R users (or useRs) live a dual life: in the vibrant open-source R community where R is created, improved, discussed, and taught. And then they go to work in a secured, complex, closed organizational environment where they may be on their own. Here is a request on the Rstudio community site for help that has been lightly edited to emphasize the generality that we see:
I’m trying to migrate some inherited scripts that […] to connect to a […] database to […] instead. I’ve reviewed the https://db.rstudio.com docs and tried a number of configurations but haven’t been able to connect. I’m in uncharted territory within my org, so haven’t been able to get much help internally.
This book will help you create a hybrid environment on your machine that can mimic some of the uncharted territory in your organization. It goes far beyond the basic connection issues and covers issues that you face when you are finding your way around or writing queries to your organization’s databases, not just when maintaining inherited scripts.
Technology hurdles. The interfaces (passwords, packages, etc.) and gaps between R and a back end database are hidden from public view as a matter of security, so pinpointing exactly where a problem is can be difficult. A simulated environment such as we offer here can be an important learning resource.
Scale issues. We see at least two types of scale issues. Handling large volumes of data so that performance issues must be a consideration requires a basic understanding of what’s happening in “the back end” (which is necessarily hidden from view). Therefore mastering techniques for drawing samples or small batches of data are essential. In addition to their size, your organization’s databases will often have structural characteristics that are complex and obscure. Data documentation is often incomplete and emphasizes operational characteristics, rather than analytic opportunities. A careful useR often needs to confirm the documentation on the fly and de-normalize data carefully.
Use cases. R users frequently need to make sense of an organization’s complex data structures and coding schemes to address incompletely formed questions so that informal exploratory data analysis has to be intuitive and fast. The technology details should not get in the way. Sharing and discussing exploratory and diagnostic retrieval techniquesis best in public, but is constrained by organizational requirements.
We have found that PostgreSQL in a Docker container solves many of the foregoing problems.
1.2 Docker as a tool for UseRs
- Make a fixed working environment for reproducible analysis
- Access a service outside of R (e.g., PostgreSQL)
- Create an R based service (e.g., with
- Send our compute jobs to the cloud with minimal reconfiguration or revision
This book explores #2 because it allows us to work on the database access issues described above and to practice on an industrial-scale DBMS.
- Docker is a comparatively easy way to simulate the relationship between an R/RStudio session and a database – all on on your machine (provided you have Docker installed and running).
- Running PostgreSQL on a Docker container avoids OS or system dependencies or conflicts that cause confusion and limit reproducibility.
- A Docker environment consumes relatively few resources. Our sandbox does much less but only includes PostgreSQL and sample data, so it takes up about 5% of the space taken up by the Vagrant environment that inspired this project. (Makubuya 2018)
- A simple Docker container such as the one used in our sandbox is easy to use and could be extended for other uses.
- Docker is a widely used technology for deploying applications in the cloud, so for many useRs it’s worth mastering.
1.3 Alternatives to Docker
We have found Docker to be a great tool for simulating the complexities of an enterprise environment. However, installing Docker can be challenging, especially for Windows users. Therefore the code in this book depends on PostgreSQL(Group 2019) in a Docker container, but it can all be readily adapted to either SQLite(Consortium 2019), PostgreSQL running natively on your computer, or even PostgreSQL running in the cloud. The technical details of these alternatives are all in separate chapters.
1.4 Packages used in this book
The following packages are used in this book:
- sqlpetr (installs with:
remotes::install_github("smithjd/sqlpetr", force = TRUE, quiet = TRUE, build = TRUE, build_opts = ""))
Note that when you install
sqlpetr, it will install all the other packages you need as dependencies.
1.5 Who are we?
We have been collaborating on this book since the Summer of 2018, each of us chipping into the project as time permits:
1.6 How did this project come about?
We trace this book back to the June 2, 2018 Cascadia R Conf where Aaron Makubuya gave a presentation using Vagrant hosting (Makubuya 2018). After that John Smith, Ian Franz, and Sophie Yang had discussions after the monthly Data Discussion Meetups about the difficulties around setting up Vagrant (a virtual environment), connecting to an enterprise database, and having realistic public environment to demo or practice the issues that come up behind corporate firewalls. Scott Came’s tutorial on R and Docker (Came 2018) (an alternative to Vagrant) at the 2018 UseR Conference in Melbourne was provocative and it turned out he lived nearby. We re-connected with M. Edward (Ed) Borasky who had done extensive development for a Hack Oregon data science containerization project (Borasky 2018).
Borasky, M. Edward (Ed). 2018. “Data Science Pet Containers.” 2018. https://github.com/znmeb/data-science-pet-containers.
Came, Scott. 2018. “Running R in Docker - Part 1 - Getting Started.” July 21, 2018. http://www.cascadia-analytics.com/2018/07/21/docker-r-p1.html.
Consortium, The SQLite. 2019. “SQLite.” https://sqlite.com/index.html.
Group, The PostgreSQL Global Development. 2019. https://www.postgresql.org.
Makubuya, Aaron. 2018. “Using R with Databases.” 2018. https://github.com/Cascadia-R/Using_R_With_Databases.
Ross, Noam. 2018a. “Docker for the UseR - Slides.” July 10, 2018. https://github.com/noamross/nyhackr-docker-talk/blob/master/Noam_Ross_DockerForTheUseR_nyhackr_2018-07-10.pdf.
Xie, Yihui. 2016. Bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman; Hall/CRC. https://github.com/rstudio/bookdown.