Chapter 12 Getting metadata about and from PostgreSQL

This chapter demonstrates:

  • What kind of data about the database is contained in a dbms
  • Several methods for obtaining metadata from the dbms

The following packages are used in this chapter:

Assume that the Docker container with PostgreSQL and the dvdrental database are ready to go.

Connect to the database:

12.1 Views trick parked here for the time being

12.1.1 Explore the vsalelsperson and vsalespersonsalesbyfiscalyearsdata views

The following trick goes later in the book, where it’s used to prove the finding that to make sense of othe data you need to

##  SELECT s.businessentityid,
##     p.title,
##     p.firstname,
##     p.middlename,
##     p.lastname,
##     p.suffix,
##     e.jobtitle,
##     pp.phonenumber,
##     pnt.name AS phonenumbertype,
##     ea.emailaddress,
##     p.emailpromotion,
##     a.addressline1,
##     a.addressline2,
##     a.city,
##     sp.name AS stateprovincename,
##     a.postalcode,
##     cr.name AS countryregionname,
##     st.name AS territoryname,
##     st."group" AS territorygroup,
##     s.salesquota,
##     s.salesytd,
##     s.saleslastyear
##    FROM sales.salesperson s
##      JOIN humanresources.employee e ON e.businessentityid = s.businessentityid
##      JOIN person.person p ON p.businessentityid = s.businessentityid
##      JOIN person.businessentityaddress bea ON bea.businessentityid = s.businessentityid
##      JOIN person.address a ON a.addressid = bea.addressid
##      JOIN person.stateprovince sp ON sp.stateprovinceid = a.stateprovinceid
##      JOIN person.countryregion cr ON cr.countryregioncode::text = sp.countryregioncode::text
##      LEFT JOIN sales.salesterritory st ON st.territoryid = s.territoryid
##      LEFT JOIN person.emailaddress ea ON ea.businessentityid = p.businessentityid
##      LEFT JOIN person.personphone pp ON pp.businessentityid = p.businessentityid
##      LEFT JOIN person.phonenumbertype pnt ON pnt.phonenumbertypeid = pp.phonenumbertypeid;
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           pg_get_viewdef
## 1  SELECT granular.salespersonid,\n    granular.fullname,\n    granular.jobtitle,\n    granular.salesterritory,\n    sum(granular.subtotal) AS salestotal,\n    granular.fiscalyear\n   FROM ( SELECT soh.salespersonid,\n            ((p.firstname::text || ' '::text) || COALESCE(p.middlename::text || ' '::text, ''::text)) || p.lastname::text AS fullname,\n            e.jobtitle,\n            st.name AS salesterritory,\n            soh.subtotal,\n            date_part('year'::text, soh.orderdate + '6 mons'::interval) AS fiscalyear\n           FROM sales.salesperson sp\n             JOIN sales.salesorderheader soh ON sp.businessentityid = soh.salespersonid\n             JOIN sales.salesterritory st ON sp.territoryid = st.territoryid\n             JOIN humanresources.employee e ON soh.salespersonid = e.businessentityid\n             JOIN person.person p ON p.businessentityid = sp.businessentityid) granular\n  GROUP BY granular.salespersonid, granular.fullname, granular.jobtitle, granular.salesterritory, granular.fiscalyear;

12.2 Database contents and structure

After just looking at the data you seek, it might be worthwhile stepping back and looking at the big picture.

12.2.1 Database structure

For large or complex databases you need to use both the available documentation for your database (e.g., the dvdrental database) and the other empirical tools that are available. For example it’s worth learning to interpret the symbols in an Entity Relationship Diagram:

The information_schema is a trove of information about the database. Its format is more or less consistent across the different SQL implementations that are available. Here we explore some of what’s available using several different methods. PostgreSQL stores a lot of metadata.

12.2.2 Contents of the information_schema

For this chapter R needs the dbplyr package to access alternate schemas. A schema is an object that contains one or more tables. Most often there will be a default schema, but to access the metadata, you need to explicitly specify which schema contains the data you want.

12.2.3 What tables are in the database?

The simplest way to get a list of tables is with … NO LONGER WORKS:

### Digging into the information_schema

We usually need more detail than just a list of tables. Most SQL databases have an information_schema that has a standard structure to describe and control the database.

The information_schema is in a different schema from the default, so to connect to the tables table in the information_schema we connect to the database in a different way:

The information_schema is large and complex and contains 343 tables. So it’s easy to get lost in it.

This query retrieves a list of the tables in the database that includes additional detail, not just the name of the table.

In this context table_catalog is synonymous with database.

Notice that VIEWS are composites made up of one or more BASE TABLES.

The SQL world has its own terminology. For example rs is shorthand for result set. That’s equivalent to using df for a data frame. The following SQL query returns the same information as the previous dplyr code.

12.3 What columns do those tables contain?

Of course, the DBI package has a dbListFields function that provides the simplest way to get the minimum, a list of column names:

But the information_schema has a lot more useful information that we can use.

Since the information_schema contains 2961 columns, we are narrowing our focus to just one table. This query retrieves more information about the rental table:

## Observations: 2,961
## Variables: 7
## $ table_catalog            <chr> "adventureworks", "adventureworks", "adventu…
## $ table_name               <chr> "pg_proc", "pg_proc", "pg_proc", "pg_proc", …
## $ column_name              <chr> "proname", "pronamespace", "proowner", "prol…
## $ data_type                <chr> "name", "oid", "oid", "oid", "real (24,2)", …
## $ ordinal_position         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ character_maximum_length <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ column_default           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

12.3.1 What is the difference between a VIEW and a BASE TABLE?

The BASE TABLE has the underlying data in the database

Probably should explore how the VIEW is made up of data from BASE TABLEs.

12.3.2 What data types are found in the database?

12.4 Characterizing how things are named

Names are the handle for accessing the data. Tables and columns may or may not be named consistently or in a way that makes sense to you. You should look at these names as data.

12.4.1 Counting columns and name reuse

Pull out some rough-and-ready but useful statistics about your database. Since we are in SQL-land we talk about variables as columns.

this is wrong!

How many column names are shared across tables (or duplicated)?

## # A tibble: 6 x 2
##   column_name          n
##   <chr>            <int>
## 1 modifieddate       140
## 2 rowguid             61
## 3 id                  60
## 4 name                59
## 5 businessentityid    49
## 6 productid           32

How many column names are unique?

## # A tibble: 1 x 1
##       n
##   <int>
## 1   882

12.5 Database keys

12.5.1 Direct SQL

How do we use this output? Could it be generated by dplyr?

## Observations: 467,838
## Variables: 3
## $ table_name <chr> "adventureworks.hr.d", "adventureworks.hr.d", "adventurewo…
## $ conname    <chr> "FK_SalesOrderDetail_SpecialOfferProduct_SpecialOfferIDPro…
## $ condef     <chr> "FOREIGN KEY (specialofferid, productid) REFERENCES sales.…

The following is more compact and looks more useful. What is the difference between the two?

## Observations: 0
## Variables: 3
## $ table_from           <chr> 
## $ conname              <chr> 
## $ pg_get_constraintdef <chr>
## [1] 0

12.5.2 Database keys with dplyr

This query shows the primary and foreign keys in the database.

## Observations: 190
## Variables: 6
## $ table_name       <chr> "address", "address", "addresstype", "billofmaterial…
## $ table_type       <chr> "BASE TABLE", "BASE TABLE", "BASE TABLE", "BASE TABL…
## $ constraint_name  <chr> "FK_Address_StateProvince_StateProvinceID", "PK_Addr…
## $ constraint_type  <chr> "FOREIGN KEY", "PRIMARY KEY", "PRIMARY KEY", "FOREIG…
## $ column_name      <chr> "stateprovinceid", "addressid", "addresstypeid", "co…
## $ ordinal_position <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 1, 1, 1, 2…

What do we learn from the following query? How is it useful?

##                        conname connamespace contype condeferrable condeferred
## 1 cardinal_number_domain_check        12771       c         FALSE       FALSE
## 2              yes_or_no_check        12771       c         FALSE       FALSE
## 3        CK_Employee_BirthDate        16386       c         FALSE       FALSE
## 4           CK_Employee_Gender        16386       c         FALSE       FALSE
## 5         CK_Employee_HireDate        16386       c         FALSE       FALSE
## 6    CK_Employee_MaritalStatus        16386       c         FALSE       FALSE
##   convalidated conrelid contypid conindid conparentid confrelid confupdtype
## 1         TRUE        0    12785        0           0         0            
## 2         TRUE        0    12797        0           0         0            
## 3         TRUE    16450        0        0           0         0            
## 4         TRUE    16450        0        0           0         0            
## 5         TRUE    16450        0        0           0         0            
## 6         TRUE    16450        0        0           0         0            
##   confdeltype confmatchtype conislocal coninhcount connoinherit conkey confkey
## 1                                 TRUE           0        FALSE   <NA>    <NA>
## 2                                 TRUE           0        FALSE   <NA>    <NA>
## 3                                 TRUE           0        FALSE    {5}    <NA>
## 4                                 TRUE           0        FALSE    {7}    <NA>
## 5                                 TRUE           0        FALSE    {8}    <NA>
## 6                                 TRUE           0        FALSE    {6}    <NA>
##   conpfeqop conppeqop conffeqop conexclop
## 1      <NA>      <NA>      <NA>      <NA>
## 2      <NA>      <NA>      <NA>      <NA>
## 3      <NA>      <NA>      <NA>      <NA>
## 4      <NA>      <NA>      <NA>      <NA>
## 5      <NA>      <NA>      <NA>      <NA>
## 6      <NA>      <NA>      <NA>      <NA>
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    conbin
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         {OPEXPR :opno 525 :opfuncid 150 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({COERCETODOMAINVALUE :typeId 23 :typeMod -1 :collation 0 :location 195} {CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :location 204 :constvalue 4 [ 0 0 0 0 0 0 0 0 ]}) :location 201}
## 2                                                                                                                                                                                 {SCALARARRAYOPEXPR :opno 98 :opfuncid 67 :useOr true :inputcollid 100 :args ({RELABELTYPE :arg {COERCETODOMAINVALUE :typeId 1043 :typeMod 7 :collation 100 :location 121} :resulttype 25 :resulttypmod -1 :resultcollid 100 :relabelformat 2 :location -1} {ARRAYCOERCEEXPR :arg {ARRAY :array_typeid 1015 :array_collid 100 :element_typeid 1043 :elements ({CONST :consttype 1043 :consttypmod -1 :constcollid 100 :constlen -1 :constbyval false :constisnull false :location 131 :constvalue 7 [ 28 0 0 0 89 69 83 ]} {CONST :consttype 1043 :consttypmod -1 :constcollid 100 :constlen -1 :constbyval false :constisnull false :location 138 :constvalue 6 [ 24 0 0 0 78 79 ]}) :multidims false :location -1} :elemexpr {RELABELTYPE :arg {CASETESTEXPR :typeId 1043 :typeMod -1 :collation 0} :resulttype 25 :resulttypmod -1 :resultcollid 100 :relabelformat 2 :location -1} :resulttype 1009 :resulttypmod -1 :resultcollid 100 :coerceformat 2 :location -1}) :location 127}
## 3     {BOOLEXPR :boolop and :args ({OPEXPR :opno 1098 :opfuncid 1090 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 5 :vartype 1082 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 5 :location 804} {CONST :consttype 1082 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :location 817 :constvalue 4 [ 33 -100 -1 -1 -1 -1 -1 -1 ]}) :location 814} {OPEXPR :opno 2359 :opfuncid 2352 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 5 :vartype 1082 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 5 :location 842} {OPEXPR :opno 1329 :opfuncid 1190 :opresulttype 1184 :opretset false :opcollid 0 :inputcollid 0 :args ({FUNCEXPR :funcid 1299 :funcresulttype 1184 :funcretset false :funcvariadic false :funcformat 0 :funccollid 0 :inputcollid 0 :args <> :location 856} {CONST :consttype 1186 :consttypmod -1 :constcollid 0 :constlen 16 :constbyval false :constisnull false :location 864 :constvalue 16 [ 0 0 0 0 0 0 0 0 0 0 0 0 -40 0 0 0 ]}) :location 862}) :location 852}) :location 837}
## 4                                                                                                                                                                                                                  {SCALARARRAYOPEXPR :opno 98 :opfuncid 67 :useOr true :inputcollid 100 :args ({FUNCEXPR :funcid 871 :funcresulttype 25 :funcretset false :funcvariadic false :funcformat 0 :funccollid 100 :inputcollid 100 :args ({FUNCEXPR :funcid 401 :funcresulttype 25 :funcretset false :funcvariadic false :funcformat 1 :funccollid 100 :inputcollid 100 :args ({VAR :varno 1 :varattno 7 :vartype 1042 :vartypmod 5 :varcollid 100 :varlevelsup 0 :varnoold 1 :varoattno 7 :location 941}) :location 948}) :location 934} {ARRAY :array_typeid 1009 :array_collid 100 :element_typeid 25 :elements ({CONST :consttype 25 :consttypmod -1 :constcollid 100 :constlen -1 :constbyval false :constisnull false :location 969 :constvalue 5 [ 20 0 0 0 77 ]} {CONST :consttype 25 :consttypmod -1 :constcollid 100 :constlen -1 :constbyval false :constisnull false :location 980 :constvalue 5 [ 20 0 0 0 70 ]}) :multidims false :location 963}) :location 956}
## 5 {BOOLEXPR :boolop and :args ({OPEXPR :opno 1098 :opfuncid 1090 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 8 :vartype 1082 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 8 :location 1042} {CONST :consttype 1082 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :location 1054 :constvalue 4 [ 1 -5 -1 -1 -1 -1 -1 -1 ]}) :location 1051} {OPEXPR :opno 2359 :opfuncid 2352 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 8 :vartype 1082 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 8 :location 1079} {OPEXPR :opno 1327 :opfuncid 1189 :opresulttype 1184 :opretset false :opcollid 0 :inputcollid 0 :args ({FUNCEXPR :funcid 1299 :funcresulttype 1184 :funcretset false :funcvariadic false :funcformat 0 :funccollid 0 :inputcollid 0 :args <> :location 1092} {CONST :consttype 1186 :consttypmod -1 :constcollid 0 :constlen 16 :constbyval false :constisnull false :location 1100 :constvalue 16 [ 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ]}) :location 1098}) :location 1088}) :location 1074}
## 6                                                                                                                                                                                                           {SCALARARRAYOPEXPR :opno 98 :opfuncid 67 :useOr true :inputcollid 100 :args ({FUNCEXPR :funcid 871 :funcresulttype 25 :funcretset false :funcvariadic false :funcformat 0 :funccollid 100 :inputcollid 100 :args ({FUNCEXPR :funcid 401 :funcresulttype 25 :funcretset false :funcvariadic false :funcformat 1 :funccollid 100 :inputcollid 100 :args ({VAR :varno 1 :varattno 6 :vartype 1042 :vartypmod 5 :varcollid 100 :varlevelsup 0 :varnoold 1 :varoattno 6 :location 1181}) :location 1195}) :location 1174} {ARRAY :array_typeid 1009 :array_collid 100 :element_typeid 25 :elements ({CONST :consttype 25 :consttypmod -1 :constcollid 100 :constlen -1 :constbyval false :constisnull false :location 1216 :constvalue 5 [ 20 0 0 0 77 ]} {CONST :consttype 25 :consttypmod -1 :constcollid 100 :constlen -1 :constbyval false :constisnull false :location 1227 :constvalue 5 [ 20 0 0 0 83 ]}) :multidims false :location 1210}) :location 1203}
##                                                                                       consrc
## 1                                                                               (VALUE >= 0)
## 2 ((VALUE)::text = ANY ((ARRAY['YES'::character varying, 'NO'::character varying])::text[]))
## 3      ((birthdate >= '1930-01-01'::date) AND (birthdate <= (now() - '18 years'::interval)))
## 4                                (upper((gender)::text) = ANY (ARRAY['M'::text, 'F'::text]))
## 5           ((hiredate >= '1996-07-01'::date) AND (hiredate <= (now() + '1 day'::interval)))
## 6                         (upper((maritalstatus)::text) = ANY (ARRAY['M'::text, 'S'::text]))
##                                                                                         condef
## 1                                                                           CHECK (VALUE >= 0)
## 2 CHECK (VALUE::text = ANY (ARRAY['YES'::character varying, 'NO'::character varying]::text[]))
## 3      CHECK (birthdate >= '1930-01-01'::date AND birthdate <= (now() - '18 years'::interval))
## 4                              CHECK (upper(gender::text) = ANY (ARRAY['M'::text, 'F'::text]))
## 5           CHECK (hiredate >= '1996-07-01'::date AND hiredate <= (now() + '1 day'::interval))
## 6                       CHECK (upper(maritalstatus::text) = ANY (ARRAY['M'::text, 'S'::text]))

12.6 Creating your own data dictionary

If you are going to work with a database for an extended period it can be useful to create your own data dictionary. This can take the form of keeping detaild notes as well as extracting metadata from the dbms. Here is an illustration of the idea.

This probably doens’t work anymore

12.7 Save your work!

The work you do to understand the structure and contents of a database can be useful for others (including future-you). So at the end of a session, you might look at all the data frames you want to save. Consider saving them in a form where you can add notes at the appropriate level (as in a Google Doc representing table or columns that you annotate over time).

##  [1] "columns_info_schema_info"  "columns_info_schema_table"
##  [3] "con"                       "constraint_column_usage"  
##  [5] "cranex"                    "key_column_usage"         
##  [7] "keys"                      "public_tables"            
##  [9] "referential_constraints"   "rs"                       
## [11] "schema_list"               "table_constraints"        
## [13] "table_info"                "table_info_schema_table"  
## [15] "tables"

``` ## Cleaning up

Always have R disconnect from the database when you’re done and stop the Adventureworks Container