1 Introduction

The goal of this manual is to provide the basic foundations needed to analyze and visualize Open Contracting data using the R programming language to anyone interested in pursuing this challenging but worthwhile endeavor. A basic understanding of computer programming will come in handy when reading this document; however, enough guidance is provided to ensure that less tech-savvy readers can also follow along.

As a part of this guide, we acquire, clean, analyze and plot Open Contracting data from four members of the partnership: Paraguay, Mexico, Uruguay and Colombia. This document was written using R Notebooks, a nifty feature which allows convenient integration of Markdown and R code, as well as exporting to HTML and PDF. The source code of this project is available here.

The rest of this document is organized as follows: the remainder of this chapter introduces the Open Contracting Data Standard and the R programming language, and explains how to install R and other dependencies; section 2 covers data acquisition, section 3 the analysis of contracting data, and section 4 data visualization.

1.1 An introduction to the Open Contracting Data Standard

By requiring data-sharing in a structured, re-usable and machine-readable form, open data opens up new opportunities for analysis and citizen engagement and participation. The Open Contracting Data Standard was created to apply these principles to the complete contracting lifecycle including planning, tender, award, contract and implementation.

The data standard, designed and developed through an open process, allows governments and cities around the world to share their contracting data, enabling greater transparency in public contracting, and supporting accessible and in-depth analysis of the efficiency, effectiveness, fairness, and integrity of public contracting systems. Additionally, the help desk team, staffed by Open Data Services Co-operative, is available to assist prospective users in their journey towards adoption of the standard.

The intention of this section is to introduce the reader to the standard, the use cases it was designed for and the basic concepts needed to apply it. Most of the content was taken from the official documentation of the standard; for a more thorough introductory walkthrough, please refer to the getting started guide.

1.1.1 Users and use cases

The standard was designed around four main groups of user needs:

  • Achieving value for money for government
  • Strengthening the transparency, accountability and integrity of public contracting
  • Enabling the private sector to fairly compete for public contracts
  • Monitoring the effectiveness of service delivery

To find out about who is using OCDS-compliant data around the globe and how they are doing it, have a look at the Open Contracting Partnership website. Four potential use cases for open contracting data are:

  • Value for money in procurement: helping officials get good value for money during the procurement process, and analyzing whether this goal was achieved afterwards.
  • Detecting fraud and corruption: identifying red flags that might indicate corruption by studying individual procurements or networks based on funding, ownership and interests.
  • Competing for public contracts: allowing private firms to understand the potential pipeline of procurement opportunities by looking at information related to past and current procurements.
  • Monitoring Service Delivery: helping interested actors to leverage traceability in the procurement process for monitoring purposes, linking budgets and donor data to the contracts and results.

1.1.2 The contracting process

The standard defines a contracting process as:

All the planning, tendering information, awards, contracts and contract implementation information related to a single initiation process.

The standard covers all the stages of a contracting process, even though some processes might not involve all possible steps. The stages of the procurement process, with example objects that might be associated with each one, are depicted in figure 1.1.

Figure 1.1: Stages of the procurement process.

For identification purposes, all contracting processes are assigned a unique Open Contracting ID (ocid), which can be used to join data from different stages. In order to avoid ocid clashes between publishers, a publisher can prepend a prefix to locally generated identifiers. Publishers are encouraged to register their prefix here.

1.1.3 Documents

Contracting processes are represented as documents in the Open Contracting Data Standard (OCDS from now on, for brevity). Each document is made up of several sections, mentioned below:

  • release metadata: contextual information about each release of data;
    • parties: information about the organizations and other participants involved in the contracting process;
    • planning: information about the goals, budgets and projects a contracting process relates to;
    • tender: information about how a tender will take place, or has taken place;
    • awards: information on awards made as part of a contracting process;
    • contract: information on contracts signed as part of a contracting process;
      • implementation: information on the progress of each contract towards completion.

An example JSON snippet compliant with this structure looks as follows:

{
   "language": "en",
   "ocid": "contracting-process-identifier",
   "id": "release-id",
   "date": "ISO-date",
   "tag": ["tag-from-codelist"],
   "initiationType": "tender",
   "parties": {},
   "buyer": {},
   "planning": {},
   "tender": {},
   "awards": [ {} ],
   "contracts":[ {
       "implementation":{}
   }]
}

There are two types of documents defined in the standard:

  • Releases are immutable and represent updates on the contracting process. For example, they can be used to notify users of new tenders, awards, contracts and other updates. As such, a single contracting process can have many releases.

  • Records are snapshots of the current state of a contracting process. A record should be updated every time a new release associated to its contracting process is published; hence, there should only be a single record per contracting process.

1.1.4 Fields

Each section may contain several fields specified in the standard, which are used to represent data. Some of these fields are re-usable objects that can appear several times in different sections of the same document; for example, items can occur in tender (to indicate the items that a buyer wishes to buy), in an award object (to indicate the items that an award has been made for) and in a contract object (to indicate the items listed in the contract). Some example fields, accompanied by corresponding JSON snippets, are presented below.

1.1.4.1 Parties (Organizations)

{
    "address": {
        "countryName": "United Kingdom",
        "locality": "London",
        "postalCode": "N11 1NP",
        "region": "London",
        "streetAddress": "4, North London Business Park, Oakleigh Rd S"
    },
    "contactPoint": {
        "email": "procurement-team@example.com",
        "faxNumber": "01234 345 345",
        "name": "Procurement Team",
        "telephone": "01234 345 346",
        "url": "http://example.com/contact/"
    },
    "id": "GB-LAC-E09000003",
    "identifier": {
        "id": "E09000003",
        "legalName": "London Borough of Barnet",
        "scheme": "GB-LAC",
        "uri": "http://www.barnet.gov.uk/"
    },
    "name": "London Borough of Barnet",
    "roles": [ ... ]
}

1.1.4.2 Amounts

{
    "amount": 11000000,
    "currency": "GBP"
}

1.1.4.3 Items

{
    "additionalClassifications": [
       {
            "description": "Cycle path construction work",
            "id": "45233162-2",
            "scheme": "CPV",
            "uri": "http://cpv.data.ac.uk/code-45233162.html"
        }
    ],
    "classification": {
        "description": "Construction work for highways",
        "id": "45233130",
        "scheme": "CPV",
        "uri": "http://cpv.data.ac.uk/code-45233130"
    },
    "description": "string",
    "id": "0001",
    "quantity": 8,
    "unit": {
        "name": "Miles",
        "value": {
            "amount": 137000,
            "currency": "GBP"
        }
    }
}

1.1.4.4 Time Periods

{
    "endDate": "2011-08-01T23:59:00Z",
    "startDate": "2010-07-01T00:00:00Z"
}

1.1.4.5 Documents

{
    "datePublished": "2010-05-10T10:30:00Z",
    "description": "Award of contract to build new cycle lanes to AnyCorp Ltd.",
    "documentType": "notice",
    "format": "text/html",
    "id": "0007",
    "language": "en",
    "title": "Award notice",
    "url": "http://example.com/tender-notices/ocds-213czf-000-00001-04.html"
}

1.1.4.6 Milestones

{
    "description": "A consultation period is open for citizen input.",
    "dueDate": "2015-04-15T17:00:00Z",
    "id": "0001",
    "title": "Consultation Period"
}

1.1.5 Extensions and codelists

In addition to regular fields, the OCDS schema defines some fields that can only be used in certain sections, e.g. titles and descriptions of tenders, awards and contracts. In some cases, publishers may require fields that are not provided by the core schema; an extension allows defining new fields that can be used in these cases. A list of extensions is available here; if no existing extension addresses a publisher’s needs, the publisher is encouraged to collaborate on the creation of a new community extension.

Another concept worth mentioning is that of codelists. Codelists are sets of case-sensitive strings with associated labels, available in each language OCDS has been translated into. Publishers should use codelist values whenever possible to map their existing classification systems; if needed, detail fields can be used to provide more detailed classification information. There are two types of codelists:

  • Closed codelists are fixed sets of values. If a field is associated with a closed codelist, it should only accept an option from the published list.
  • Open codelists are sets of recommended values. If a field is associated with an open codelist, it accepts options from the list but also other values.

The Open Contracting Data Standard is maintained using JSON Schema. In this section we have introduced and described the main sections and common objects used in the schema, providing JSON snippets as examples of these basic building blocks. If you are interested in the full JSON schema reference, please refer to the official documentation.

1.2 An introduction to the R programming language

R is a multi-paradigm interpreted programming language and software environment focused on statistical computing, commonly used for data analysis. It is published under the GPL v2 license and maintained by the R Foundation, with binaries available for GNU/Linux, macOS and Windows. While the basic installer comes bundled with a command line interface, several graphical integrated development environments are available, of which RStudio and RStudio Server are the most noteworthy.

In this section we will introduce some of R’s basic features and syntax; after going through it, the reader should be better prepared to understand the code used for analysis in the rest of the guide. Having said that, several features are omitted, as a complete reference of R is outside the scope of this document. Should the reader feel inclined to learn more about R, out of curiosity or need, a list of manuals maintained by the R Development Core Team is available here.

For completeness and reproducibility purposes, we include an excerpt of system information below.

R.version
##                _                           
## platform       x86_64-apple-darwin15.6.0   
## arch           x86_64                      
## os             darwin15.6.0                
## system         x86_64, darwin15.6.0        
## status                                     
## major          3                           
## minor          3.2                         
## year           2016                        
## month          10                          
## day            31                          
## svn rev        71607                       
## language       R                           
## version.string R version 3.3.2 (2016-10-31)
## nickname       Sincere Pumpkin Patch

1.2.1 Operators

R’s arithmetic and logical operators should be familiar to most programmers, as they are common to other programming languages. It is worth mentioning that arithmetic operators work on scalars and collections alike; a short example follows the tables below.

Some R arithmetic operators

  Operator  Description
  +         Addition
  -         Subtraction
  *         Multiplication
  /         Division
  ** or ^   Exponentiation
  %%        Modulus

Some R logical operators

  Operator  Description
  >         Greater than
  >=        Greater than or equal to
  <         Lesser than
  <=        Lesser than or equal to
  ==        Equal to
  !=        Not equal to
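
Since these operators are vectorized, applying them to vectors operates element-wise. A minimal sketch (the vectors and expected results are our own illustration):

a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b   # element-wise addition: 11 22 33
a * 2   # the scalar is recycled over the vector: 2 4 6
a ^ 2   # exponentiation: 1 4 9
b %% 7  # modulus: 3 6 2
a > 2   # comparisons are also element-wise: FALSE FALSE TRUE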

1.2.2 Variables

R supports several data types including scalars, vectors (of numbers, strings, booleans, etc.), matrices, data frames and tibbles, among others. The <- operator is used to assign a value to a variable. A few examples of these types are shown below:

a_scalar <- 4

a_number_vector <- c(1, 2, 3, 4, 5) # all elements of a vector must be of the same type
a_string_vector <- c("a1", "b2", "c3", "d4")
a_boolean_vector <- c(TRUE, FALSE)

# lists can have elements of different types, associated to a key
a_list <- list(name = "John", last_name = "Deer", age = 42, married = FALSE)

# there are several ways to access an element from a collection
a_number_vector[0] # R uses 1-based indexing, so index 0 returns an empty vector
## numeric(0)
a_string_vector[[1]]
## [1] "a1"
a_list$name
## [1] "John"
# a matrix is a special kind of vector, with the number of rows and columns as attributes
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
m
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

1.2.3 Data Frames and Tibbles

Data types to store data tables are core to R’s purpose and functionality, hence they deserve their own subsection. The most common data table type in R is the data frame, which can be thought of as a list of vectors of equal length.

name <- c("Mike", "Lucy", "John")
age <- c(20, 25, 30)
student <- c(TRUE, FALSE, TRUE)
df <- data.frame(name, age, student)
df
##   name age student
## 1 Mike  20    TRUE
## 2 Lucy  25   FALSE
## 3 John  30    TRUE

The access operators presented for other data types can also be used to get data cells from a data frame.

df[1, ] # R has 1-based indexing, i.e. collections start at 1.
##   name age student
## 1 Mike  20    TRUE
df[1, ]$name # Values from a row can be accessed by column name
## [1] Mike
## Levels: John Lucy Mike
df$name # Columns can also be accessed directly from the data frame.
## [1] Mike Lucy John
## Levels: John Lucy Mike

Another type of data table available is the tibble. Tibbles belong to the Tidyverse, a set of libraries to perform data analysis in R following best practices, which will be discussed at length in another section. For now, just think of tibbles as data frames with some quirkiness fixed. As an example, printing a data frame displays all of its rows, which may be problematic for large datasets; printing a tibble displays only the first 10 rows by default.

Data frames can be converted to tibbles and vice versa; the latter can be useful when dealing with older libraries that are not Tidyverse-ready. The author advises the use of tibbles and favors (as do many users in the community) leveraging Tidyverse features and adhering to its guidelines as closely as possible.

# Packrat should have installed dependencies for you;
# if not, you can do it manually by running the command below
# install.packages("tidyverse")
library(tidyverse) # Side note: this is how you import a library in R
tb <- as_tibble(df)
class(as.data.frame(tb))
## [1] "data.frame"

1.2.4 Functions

The syntax for function definition in R is pretty straightforward and resembles that of other programming languages. A function receives zero, one or multiple arguments and runs the code included in its body:

function ( arglist ) {body}
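
For instance, a small function that applies a tax rate to an amount could be defined and called as follows; the name, default rate and logic are hypothetical, purely for illustration:

add_vat <- function(amount, rate = 0.1) {
  # return the amount with value-added tax applied
  amount * (1 + rate)
}
add_vat(100)        # uses the default rate: 110
add_vat(100, 0.21)  # overrides the default: 121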

Because R targets a specific niche (statistical computing), it offers a very rich set of built-in functions and libraries readily available for installation. Thus, function calls tend to be much more common in R than function definitions.

Most of the R ecosystem is focused on dealing with (cleaning, plotting, modeling) tabular data. As a simple display of basic statistical features, we can use the summary function to get descriptive statistics for a table.

summary(tb)
##    name        age        student       
##  John:1   Min.   :20.0   Mode :logical  
##  Lucy:1   1st Qu.:22.5   FALSE:1        
##  Mike:1   Median :25.0   TRUE :2        
##           Mean   :25.0   NA's :0        
##           3rd Qu.:27.5                  
##           Max.   :30.0

Reading data that is already tabular is also very simple, as shown in the example below, which reads a CSV file as a data frame. The file corresponds to the iris dataset, a very popular toy dataset for data science that is available online.

iris <- as_tibble(read.table("iris.csv", header = TRUE, sep = ","))
iris
## # A tibble: 150 x 5
##    SepalLength SepalWidth PetalLength PetalWidth Name       
##          <dbl>      <dbl>       <dbl>      <dbl> <fct>      
##  1         5.1        3.5         1.4        0.2 Iris-setosa
##  2         4.9        3           1.4        0.2 Iris-setosa
##  3         4.7        3.2         1.3        0.2 Iris-setosa
##  4         4.6        3.1         1.5        0.2 Iris-setosa
##  5         5          3.6         1.4        0.2 Iris-setosa
##  6         5.4        3.9         1.7        0.4 Iris-setosa
##  7         4.6        3.4         1.4        0.3 Iris-setosa
##  8         5          3.4         1.5        0.2 Iris-setosa
##  9         4.4        2.9         1.4        0.2 Iris-setosa
## 10         4.9        3.1         1.5        0.1 Iris-setosa
## # ... with 140 more rows

One last useful operator worth mentioning is the pipe, %>%. Pipes allow function chaining in R, which favors readability and (one might say) elegance when a sequence of function calls is needed.

# For now, there is no need to understand what each function of this snippet actually does
iris %>%
  group_by(Name) %>%
  summarize_if(is.numeric, mean) %>%
  ungroup()
## # A tibble: 3 x 5
##   Name            SepalLength SepalWidth PetalLength PetalWidth
##   <fct>                 <dbl>      <dbl>       <dbl>      <dbl>
## 1 Iris-setosa            5.01       3.42        1.46      0.244
## 2 Iris-versicolor        5.94       2.77        4.26      1.33 
## 3 Iris-virginica         6.59       2.97        5.55      2.03

The pipe operator passes the value of its left-hand-side operand as the first argument of its right-hand-side operand. Using it avoids declaring additional variables to store intermediate results (thus cluttering the namespace) or nesting function calls (all those parentheses can be hard to read).
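
To see the difference, compare a nested call with its piped equivalent; both versions below should produce the same summary we computed earlier:

# nested calls: read inside-out
ungroup(summarize_if(group_by(iris, Name), is.numeric, mean))

# pipes: read top-to-bottom
iris %>%
  group_by(Name) %>%
  summarize_if(is.numeric, mean) %>%
  ungroup()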

1.2.5 Style

As a final remark, although there is no official code style guideline for the R programming language, the author (not surprisingly, if you have been paying attention) recommends following the Tidyverse style guide. The styler package is a handy linter and code formatter that can help keep source code compliant; it includes an RStudio add-in, the easiest way to re-style existing code.
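
As a quick, illustrative taste of the package, style_text re-styles a code snippet; the file name in the commented line is hypothetical:

library(styler)
style_text("a<-c(1,2, 3)")   # prints the re-styled code: a <- c(1, 2, 3)
# style_file("analysis.R")   # re-styles a whole file in place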

1.3 Installing R and other dependencies

Assuming a machine using Ubuntu 18.04 Bionic Beaver as its operating system, a user with sudo access can install the R programming language and RStudio by following these steps:

  1. Update system packages to make sure already installed dependencies are up to date.
sudo apt update
  2. Install the R programming language.
sudo apt -y install r-base
  3. RStudio for Ubuntu comes bundled as a .deb package. The easiest way to install a .deb package in Ubuntu is with the gdebi command, which is not installed by default. It can be installed by running the following command.
sudo apt install gdebi-core
  4. Go to the official RStudio download page and download the file corresponding to your operating system. In our current scenario, that would be RStudio 1.1.456 - Ubuntu 16.04+/Debian 9+ (64-bit).
  5. From the directory in which you downloaded the package in the last step, install RStudio using gdebi.
sudo gdebi rstudio-xenial-1.1.456-amd64.deb
  6. Profit! You should now be able to access RStudio by running the rstudio command or by clicking its icon in the apps menu.

Besides RStudio Desktop, which can be installed by following the previous set of steps, RStudio is also available in a remote-access flavor. RStudio Server is an application that can be installed on a web server, offering the same features as the desktop version through a web browser. To find out more about this alternative and how to install it, please refer to the official documentation.

The project this document is generated from uses Packrat to keep track of all the libraries that it depends on. This ensures the portability of the software (on import, the appropriate version for the corresponding OS will be installed) and the reproducibility of the analysis. Therefore, assuming you have Git installed, playing around with this notebook should be as easy as:

  1. Cloning the project.
git clone https://github.com/rparrapy/ocds-r-manual.git
  2. Opening it with RStudio by using the menu bar option File -> Open Project…

We will also be using a Postgres database to run our analysis, hence we need to install Postgres in case it is missing in our system. We can do so by running:

sudo apt install postgresql postgresql-contrib

The last command installs the Postgres database engine and creates a postgres user as the default account to access our databases.

2 Data acquisition

This section is focused on data loading from diverse sources, which is the first step for any analysis process. We will read the data from the four countries this document is concerned with, using them as examples for different data acquisition methods:

2.1 From a REST API

The open contracting dataset from Uruguay is available via an RSS endpoint, which includes URLs to the latest releases. It is worth noting that since only the latest releases are accessible, not much in-depth analysis can be done using this dataset. Having said that, it is still a perfectly suitable example to showcase R’s HTTP-fetching capabilities; we start by fetching the RSS feed and creating a data frame where each row corresponds to a release.

library(XML)

xml_url <- "http://www.comprasestatales.gub.uy/ocds/rss"
xml_file <- xmlParse(xml_url)
xml_top <- xmlRoot(xml_file)[[1]]

index_tb <- as_tibble(xmlToDataFrame(xmlElementsByTagName(xml_top, "item")))
index_tb
## # A tibble: 500 x 5
##    title             pubDate      link                  guid      category
##    <fct>             <fct>        <fct>                 <fct>     <fct>   
##  1 id_compra:674909~ Mon, 17 Sep~ http://www.comprases~ ajuste_a~ awardUp~
##  2 id_compra:675263~ Mon, 17 Sep~ http://www.comprases~ aclar_ll~ tenderU~
##  3 id_compra:675836~ Mon, 17 Sep~ http://www.comprases~ llamado-~ tender  
##  4 id_compra:675839~ Mon, 17 Sep~ http://www.comprases~ llamado-~ tender  
##  5 id_compra:675838~ Mon, 17 Sep~ http://www.comprases~ adjudica~ award   
##  6 id_compra:675529~ Mon, 17 Sep~ http://www.comprases~ aclar_ll~ tenderU~
##  7 id_compra:675272~ Mon, 17 Sep~ http://www.comprases~ aclar_ll~ tenderU~
##  8 id_compra:675279~ Mon, 17 Sep~ http://www.comprases~ aclar_ll~ tenderU~
##  9 id_compra:675837~ Mon, 17 Sep~ http://www.comprases~ adjudica~ award   
## 10 id_compra:663987~ Mon, 17 Sep~ http://www.comprases~ aclar_ll~ tenderU~
## # ... with 490 more rows

In the snippet above, we start by importing R’s XML library, since Uruguay’s API responses come in this data format. The xmlParse method fetches the content at the RSS feed URL and parses it, returning a tree-like nested structure. We then get the root of the parsed structure and collect every element with the item tag by calling the xmlElementsByTagName method. We convert the result to a data frame and then to a tibble using xmlToDataFrame and as_tibble, respectively.

We then proceed to fetch the JSON content corresponding to each release, using the values of the link column defined above. There are many libraries for handling JSON in R, each with its list of pros and cons; because of its focus on performance and web API consumption, we use jsonlite in this guide.

library(jsonlite)
# we run fromJSON (which fetches the url) for every row of the data frame
releases <- apply(index_tb["link"], 1, function(r) {
  return(as_tibble(fromJSON(r, flatten = TRUE)[["releases"]]))
})
releases[[1]]
## # A tibble: 1 x 10
##   ocid  id    date  tag   language initiationType parties awards buyer.name
## * <chr> <chr> <chr> <lis> <chr>    <chr>          <list>  <list> <chr>     
## 1 ocds~ ajus~ 2018~ <chr~ es       tender         <data.~ <data~ "Jefatura~
## # ... with 1 more variable: buyer.id <chr>

The main function in the snippet above is fromJSON, which is provided by jsonlite. We use it to fetch the JSON files from the URLs contained in the link column of index_tb. The flatten flag indicates that we want the function to flatten the nested JSON structure as much as possible, making it suitable for a tabular format. We repeat the process for each row of the tibble using the apply function, which applies the function it receives as a parameter to each row and returns the results as a list.

At this point we have a list of tibbles, each tibble representing a release. As our next step, we can merge all those tibbles, filling missing values with NA (R’s constant for ‘Not Available’); for example, depending on the stage they correspond to, some releases include tender information while others do not. Merging several tibbles row-wise is achieved by calling the bind_rows function.

releases_tb <- bind_rows(releases)

Because the data from Uruguay includes only the latest 500 releases, actual compilation of records is not possible (not enough data is available). Therefore, this is as far as we will go with their open contracting data.

2.2 From JSON files

We can also read records or releases from a set of JSON files, published by one or more official publishers associated with a country. That is the case of Paraguay, the next country we will be adding to our analysis. Paraguayan open contracting data are published by the National Directorate of Public Procurement (Dirección Nacional de Contrataciones Públicas, DNCP) and the Ministry of Finance (Ministerio de Hacienda); as they share ocids, we should be able to use their published data together.

To continue with this guide, you should now download a copy of these datasets to your local machine, and store them under the ./data directory (which should be empty if you just cloned this project). That is:

  1. Head over to https://mega.nz/#F!PCQFzAyY!m54DS0hA3pyKXqergJUBFQ.
  2. Download the data.zip file and extract its content inside the ./data directory.
  3. If successful, you should now have two new folders containing .json files on your system: ./data/records_dncp and ./data/releases_mh.

Before actually loading the data, let’s take a small detour to make our data munging more efficient. The size of the datasets we use in this guide ranges from non-trivial (processing takes a few minutes) to challenging (processing takes several hours). This is aggravated by the fact that R runs in a single thread by default, not leveraging the multicore capabilities that most modern computers offer. To change that, we can use the parallel library, which allows us to run apply-like functions in parallel.

library(glue) # This is just a utility library to concatenate strings
library(parallel)
library(lubridate) # if you ever have to work with dates, do yourself a favor and use this


# Calculate the number of cores
no_cores <- detectCores() - 1
# Initiate cluster
cl <- makeCluster(no_cores)
clusterExport(cl, c("fromJSON", "paste", "as_tibble", "ymd_hms", "select"))

To start using parallel, we define a cluster with as many workers as cores available in our machine minus one (not leaving a single core available for other computation can freeze the computer) using the makeCluster function. Each worker of our newly created cluster runs a fresh R interpreter with nothing loaded; the clusterExport function copies the functions we are going to need into each worker’s environment.
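
One caveat worth keeping in mind: a cluster holds on to its workers until it is explicitly shut down. We keep ours alive for the rest of this guide, but once all parallel work is done it should be released:

# run this once all parallel processing is finished
# stopCluster(cl)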

We are now ready to start with DNCP data, which is stored under data/records_dncp:

files <- list.files("data/records_dncp/", "*.json")
records <- parLapply(cl, files, function(r) {
  file <- paste("data/records_dncp/", r, sep = "")
  parsed <- fromJSON(file, flatten = TRUE)
  records <- parsed[["records"]]
  publishedDate <- ymd_hms(parsed[['publishedDate']])
  records$publishedDate <- rep(publishedDate, nrow(records))
  return(as_tibble(records))
})
records[[1]]
## # A tibble: 1 x 39
##   releases ocid  compiledRelease~ compiledRelease~ compiledRelease~
## * <list>   <chr> <list>           <chr>            <list>          
## 1 <data.f~ ocds~ <list [0]>       ocds-03ad3f-134~ <list [0]>      
## # ... with 34 more variables: compiledRelease.date <chr>,
## #   compiledRelease.language <chr>, compiledRelease.initiationType <chr>,
## #   compiledRelease.id <chr>, compiledRelease.tag <list>,
## #   compiledRelease.planning.url <chr>,
## #   compiledRelease.planning.budget.description <chr>,
## #   compiledRelease.planning.budget.amount.amount <lgl>,
## #   compiledRelease.planning.budget.amount.currency <chr>,
## #   compiledRelease.buyer.name <chr>,
## #   compiledRelease.buyer.contactPoint.email <chr>,
## #   compiledRelease.buyer.contactPoint.name <chr>,
## #   compiledRelease.buyer.contactPoint.telephone <chr>,
## #   compiledRelease.tender.title <chr>,
## #   compiledRelease.tender.hasEnquiries <lgl>,
## #   compiledRelease.tender.submissionMethod <list>,
## #   compiledRelease.tender.lots <list>,
## #   compiledRelease.tender.documents <list>,
## #   compiledRelease.tender.status <chr>, compiledRelease.tender.id <chr>,
## #   compiledRelease.tender.items <list>, compiledRelease.tender.url <chr>,
## #   compiledRelease.tender.procuringEntity.name <chr>,
## #   compiledRelease.tender.procuringEntity.contactPoint.email <chr>,
## #   compiledRelease.tender.procuringEntity.contactPoint.name <chr>,
## #   compiledRelease.tender.procuringEntity.contactPoint.telephone <chr>,
## #   compiledRelease.tender.awardPeriod.endDate <lgl>,
## #   compiledRelease.tender.awardPeriod.startDate <chr>,
## #   compiledRelease.tender.enquiryPeriod.endDate <chr>,
## #   compiledRelease.tender.tenderPeriod.endDate <chr>,
## #   compiledRelease.tender.tenderPeriod.startDate <lgl>,
## #   compiledRelease.tender.value.amount <lgl>,
## #   compiledRelease.tender.value.currency <chr>, publishedDate <dttm>

The code snippet above uses several of our already familiar functions such as fromJSON, as_tibble and so on. There are three newcomers that demand explanation:

  • list.files returns a list of paths corresponding to files under the directory it receives as its first parameter that match the pattern it receives as its second parameter.
  • paste simply concatenates every string it receives as a parameter, placing the sep string between them.
  • parLapply is the parallel version of lapply, scheduling executions of the function it receives as a parameter across the workers of our cluster. This can be done because reading different JSON files is an embarrassingly parallel task.

With the list of tibbles in hand, we can merge them row-wise and drop the intermediate result:
dncp_records_tb <- bind_rows(records)
remove(records)
dncp_records_tb
## # A tibble: 138,398 x 39
##    releases ocid  compiledRelease~ compiledRelease~ compiledRelease~
##    <list>   <chr> <list>           <chr>            <list>          
##  1 <data.f~ ocds~ <list [0]>       ocds-03ad3f-134~ <list [0]>      
##  2 <data.f~ ocds~ <list [0]>       ocds-03ad3f-134~ <list [0]>      
##  3 <data.f~ ocds~ <data.frame [1 ~ ocds-03ad3f-134~ <data.frame [1 ~
##  4 <data.f~ ocds~ <list [0]>       ocds-03ad3f-134~ <list [0]>      
##  5 <data.f~ ocds~ <data.frame [1 ~ ocds-03ad3f-134~ <data.frame [1 ~
##  6 <data.f~ ocds~ <NULL>           <NA>             <NULL>          
##  7 <data.f~ ocds~ <NULL>           <NA>             <NULL>          
##  8 <data.f~ ocds~ <data.frame [1 ~ ocds-03ad3f-136~ <data.frame [3 ~
##  9 <data.f~ ocds~ <data.frame [1 ~ ocds-03ad3f-136~ <data.frame [1 ~
## 10 <data.f~ ocds~ <list [0]>       ocds-03ad3f-136~ <list [0]>      
## # ... with 138,388 more rows, and 34 more variables:
## #   compiledRelease.date <chr>, compiledRelease.language <chr>,
## #   compiledRelease.initiationType <chr>, compiledRelease.id <chr>,
## #   compiledRelease.tag <list>, compiledRelease.planning.url <chr>,
## #   compiledRelease.planning.budget.description <chr>,
## #   compiledRelease.planning.budget.amount.amount <dbl>,
## #   compiledRelease.planning.budget.amount.currency <chr>,
## #   compiledRelease.buyer.name <chr>,
## #   compiledRelease.buyer.contactPoint.email <chr>,
## #   compiledRelease.buyer.contactPoint.name <chr>,
## #   compiledRelease.buyer.contactPoint.telephone <chr>,
## #   compiledRelease.tender.title <chr>,
## #   compiledRelease.tender.hasEnquiries <lgl>,
## #   compiledRelease.tender.submissionMethod <list>,
## #   compiledRelease.tender.lots <list>,
## #   compiledRelease.tender.documents <list>,
## #   compiledRelease.tender.status <chr>, compiledRelease.tender.id <chr>,
## #   compiledRelease.tender.items <list>, compiledRelease.tender.url <chr>,
## #   compiledRelease.tender.procuringEntity.name <chr>,
## #   compiledRelease.tender.procuringEntity.contactPoint.email <chr>,
## #   compiledRelease.tender.procuringEntity.contactPoint.name <chr>,
## #   compiledRelease.tender.procuringEntity.contactPoint.telephone <chr>,
## #   compiledRelease.tender.awardPeriod.endDate <chr>,
## #   compiledRelease.tender.awardPeriod.startDate <chr>,
## #   compiledRelease.tender.enquiryPeriod.endDate <chr>,
## #   compiledRelease.tender.tenderPeriod.endDate <chr>,
## #   compiledRelease.tender.tenderPeriod.startDate <chr>,
## #   compiledRelease.tender.value.amount <dbl>,
## #   compiledRelease.tender.value.currency <chr>, publishedDate <dttm>

To free up some space, we use the remove function to explicitly clear intermediate results from memory. Data from the Ministry of Finance can be read likewise:

files <- list.files("data/releases_mh/", "*.json")
releases <- parLapply(cl, files, function(r) {
  file <- paste("data/releases_mh/", r, sep = "")
  parsed <- fromJSON(file, flatten = TRUE)
  releases <- parsed[["releases"]]
  publishedDate <- ymd_hms(parsed[['publishedDate']])
  # there are some files that don't include any release (?)
  if (!is.null(nrow(releases)) && nrow(releases) > 0) {
      releases$publishedDate <- rep(publishedDate, nrow(releases))
  }
  return(as_tibble(releases))
})

mh_releases_tb <- bind_rows(releases)
remove(releases)
mh_releases_tb
## # A tibble: 25,839 x 27
##    recordPackageURI initiationType language contracts id    tag   parties
##    <chr>            <chr>          <chr>    <list>    <chr> <lis> <list> 
##  1 https://www.con~ tender         es       <data.fr~ 1567~ <chr~ <data.~
##  2 https://www.con~ tender         es       <data.fr~ 1848~ <chr~ <data.~
##  3 https://www.con~ tender         es       <data.fr~ 1881~ <chr~ <data.~
##  4 https://www.con~ tender         es       <data.fr~ 1881~ <chr~ <data.~
##  5 https://www.con~ tender         es       <data.fr~ 1888~ <chr~ <data.~
##  6 https://www.con~ tender         es       <data.fr~ 1890~ <chr~ <data.~
##  7 https://www.con~ tender         es       <data.fr~ 1891~ <chr~ <data.~
##  8 https://www.con~ tender         es       <data.fr~ 1898~ <chr~ <data.~
##  9 https://www.con~ tender         es       <data.fr~ 1911~ <chr~ <data.~
## 10 https://www.con~ tender         es       <data.fr~ 1911~ <chr~ <data.~
## # ... with 25,829 more rows, and 20 more variables: date <chr>,
## #   awards <list>, ocid <chr>, planning.documents <list>,
## #   planning.budget.project <chr>, planning.budget.budgetBreakdown <list>,
## #   planning.budget.description <chr>,
## #   planning.budget.amount.currency <chr>,
## #   planning.budget.amount.amount <dbl>, tender.id <chr>,
## #   tender.procurementMethodDetails <chr>, tender.title <chr>,
## #   tender.procuringEntity.id <chr>, tender.procuringEntity.name <chr>,
## #   tender.tenderPeriod.startDate <chr>,
## #   tender.tenderPeriod.maxExtentDate <lgl>, buyer.id <chr>,
## #   buyer.name <chr>, publishedDate <dttm>, value <lgl>

2.3 From a Postgres database with JSONB columns

For Mexico and Colombia we process data as a stream, i.e. one record at a time. Data frames and tibbles are held entirely in memory by R, while stream processing only requires the current record (or page of records) to be in memory. For this reason, this data processing paradigm is particularly useful when dealing with big data.

To continue with this guide, you should now download the backup files of these databases to your local machine, and restore them to your local Postgres instance. To do so:

  1. Again, head over to https://mega.nz/#F!PCQFzAyY!m54DS0hA3pyKXqergJUBFQ.
  2. Download the files ocds_colombia.dump and ocds_mexico.dump to your local hard drive.
  3. Create the databases by running sudo -u postgres createdb ocds_colombia and sudo -u postgres createdb ocds_mexico from a terminal session.
  4. Within a terminal session, from the directory where you downloaded the files, run pg_restore -d ocds_colombia ocds_colombia.dump and pg_restore -d ocds_mexico ocds_mexico.dump.
  5. Check the databases you just created; they should contain several tables created by the restore process. In particular, the data table should contain multiple records that we will read for our analysis.

The library we are using (jsonlite) includes the stream_in function to support ndjson, a convenient format for storing multiple JSON records in a single file. Since our records are stored in a Postgres database instead, we have to write a similar feature ourselves. Let’s write a function to do precisely that in the next code snippets:

library(RPostgreSQL)

db_engine <- "PostgreSQL"
host <- "localhost"
user <- "postgres" # if you are concerned about security
password <- "" # you should probably change this 2 lines
port <- 5433 # set the port of your Postgres instance, default is 5432
query <- "SELECT id, data FROM data"

drv <- dbDriver(db_engine)
con_colombia <- dbConnect(drv, host = host, port = port,
                          dbname = "ocds_colombia", user = user, password = password)

con_mexico <- dbConnect(drv, host = host, port = port,
                        dbname = "ocds_mexico", user = user, password = password)

We connect and query Postgres through DBI, a database interface definition for communication between R and relational database management systems. In particular, we use RPostgreSQL, a DBI implementation for PostgreSQL databases.
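
As with the cluster defined earlier, connections hold resources until explicitly released; we keep ours open for the remainder of the analysis, but once done they should be closed:

# run these once the analysis is complete
# dbDisconnect(con_colombia)
# dbDisconnect(con_mexico)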

stream_in_db <- function(con, query, page_size = 1000, acc = 0) {
  current_id <- 0

  return(function(handler) {
    repeat{
      # keyset pagination: resume right after the last id seen so far
      paged_query <- paste(query, "WHERE id > ", current_id, "ORDER BY id ASC LIMIT", page_size)
      data <- dbGetQuery(con, paged_query)

      # stop once the result set is exhausted
      if (dim(data)[1] == 0) {
        break
      }

      # let the handler process the page and update the accumulator
      acc <- handler(data[['data']], acc)
      current_id <- tail(data[['id']], n = 1)
    }
    return(acc)
  })
}

stream_in_colombia <- stream_in_db(con_colombia, query)

stream_in_mexico <- stream_in_db(con_mexico, query)

Let’s recap what we just did:

  1. First, we defined a function stream_in_db that receives a database connection and a query, and runs the query against the connected database.
  2. stream_in_db returns a function, which applies a handler received as a parameter to every page of the result set.
  3. Finally, we used stream_in_db to get functions pointing at the databases of Mexico and Colombia.

Pagination using LIMIT and OFFSET is known to degrade in performance as we scan pages farther from the start. To speed things up as much as possible we use an approach known as keyset pagination, which leverages the fact that we have an index defined on the primary key. For a more detailed comparison of pagination approaches please read this article.

Now let’s take our streaming functions for a ride. To test them out, we can define a dummy handler to count the number of rows of data that result from our query.

sanity_checker <- function(data, acc) {
  m <- parLapply(cl, data, function(e) {
    t <- fromJSON(e, flatten = TRUE)
    return(1)
  })
  
  return(acc + Reduce("+", m))
}
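
With the handler in place, counting the rows of each database is just a matter of passing it to our streaming functions. Note that this performs a full pass over the data, so it may take a while on large databases:

row_count_colombia <- stream_in_colombia(sanity_checker)
row_count_mexico <- stream_in_mexico(sanity_checker)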

3 Analysis of contracting data

This section introduces a set of useful tools to perform basic data analysis using R, accompanied by some basic examples that showcase their features.

3.1 Introducing the Tidyverse

The Tidyverse is a collection of R packages aimed at data science tasks; besides the libraries, it shares a common set of data structures, a style guide, and an underlying philosophy for data analysis. The core Tidyverse packages are:

  • ggplot2: a system for declaratively creating graphics, based on The Grammar of Graphics, a tool that enables the user to concisely describe most of the graphical mappings used in data visualization. You provide the data, tell ggplot2 how to map variables to aesthetics and what graphical primitives to use, and it takes care of the details.
  • dplyr: an implementation of a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.
  • tidyr: provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.
  • readr: provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf).
  • purrr: enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors.
  • tibble: a modern reimplementation of R’s star data structure: the data frame. At this point, the user should be fairly familiar with tibbles.
  • stringr: provides a cohesive set of functions designed to make working with strings as easy as possible.
  • forcats: provides a suite of useful tools that solve common problems with factors, which are used by R to handle categorical variables.

For the purposes of this guide, the first three packages above are the most important ones. Hence, tidyr and dplyr will be covered in the remainder of this section and ggplot2 will be described in the next one.

3.2 Data cleaning functions

According to the Tidyverse philosophy, data cleaning is the process of making data tidy. Data being tidy means:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

Every package from the Tidyverse is designed to work with tidy data; when dealing with an untidy dataset, the first step of our analysis must therefore be to clean it up using tidyr. There are three main verbs provided by this package to help make data tidy: gather, spread and separate.

3.2.1 Gathering

A frequent issue with untidy data is a dataset where some column names are not variable names but variable values. For example, consider the code snippet below:

world_population = tibble(
  country = c("Paraguay", "Uruguay", "Colombia", "Mexico"),
  "2017" = c(7000000, 3000000, 45000000, 127000000),
  "2018" = c(7200000, 3200000, 46000000, 128000000)
)
world_population
## # A tibble: 4 x 3
##   country     `2017`    `2018`
##   <chr>        <dbl>     <dbl>
## 1 Paraguay   7000000   7200000
## 2 Uruguay    3000000   3200000
## 3 Colombia  45000000  46000000
## 4 Mexico   127000000 128000000

In the example above, the data is untidy because 2017 and 2018 are values of the implicit year variable. This means each row corresponds to two observations, not one. We can fix this problem by applying the gather function as follows:

world_population %>% gather(`2017`, `2018`, key = "year", value = "inhabitants")
## # A tibble: 8 x 3
##   country  year  inhabitants
##   <chr>    <chr>       <dbl>
## 1 Paraguay 2017      7000000
## 2 Uruguay  2017      3000000
## 3 Colombia 2017     45000000
## 4 Mexico   2017    127000000
## 5 Paraguay 2018      7200000
## 6 Uruguay  2018      3200000
## 7 Colombia 2018     46000000
## 8 Mexico   2018    128000000

Gather receives as parameters the names of the columns we want to pivot and the names of the two new columns we want to create. Much better! In the final result, the problematic columns are dropped and our dataset is now 100% tidy.

3.2.2 Spreading

Gathering is useful when one row corresponds to more than one observation. Spreading works in the opposite scenario, when a single observation is scattered across multiple rows. Consider a simple extension of our example dataset:

world_count = tibble(
  country = c("Paraguay", "Uruguay", "Colombia", "Mexico", "Paraguay", "Uruguay", "Colombia", "Mexico"),
  year = 2018,
  type = c("inhabitants", "inhabitants", "inhabitants", "inhabitants", "cars", "cars", "cars", "cars"),
  count = c(7000000, 3000000, 45000000, 127000000, 1000000, 500000, 10000000, 75000000)
)
world_count
## # A tibble: 8 x 4
##   country   year type            count
##   <chr>    <dbl> <chr>           <dbl>
## 1 Paraguay  2018 inhabitants   7000000
## 2 Uruguay   2018 inhabitants   3000000
## 3 Colombia  2018 inhabitants  45000000
## 4 Mexico    2018 inhabitants 127000000
## 5 Paraguay  2018 cars          1000000
## 6 Uruguay   2018 cars           500000
## 7 Colombia  2018 cars         10000000
## 8 Mexico    2018 cars         75000000

In this case we have variable names stored as cell values: inhabitants and cars. How do we fix it? Just watch the spread function do its magic below:

world_count %>% spread(key = type, value = count)
## # A tibble: 4 x 4
##   country   year     cars inhabitants
##   <chr>    <dbl>    <dbl>       <dbl>
## 1 Colombia  2018 10000000    45000000
## 2 Mexico    2018 75000000   127000000
## 3 Paraguay  2018  1000000     7000000
## 4 Uruguay   2018   500000     3000000

Spread takes every distinct value in the column given by its key parameter and turns it into a separate column, filling in the cells with the corresponding value from the column given by its value parameter.

3.2.3 Separating

Our final tidyr function, separate, allows us to solve cases in which multiple variable values are stored in the same cell. As an example, consider the result of parsing the Uruguayan RSS feed that we downloaded in a previous section.

index_tb
## # A tibble: 500 x 5
##    title             pubDate      link                  guid      category
##    <fct>             <fct>        <fct>                 <fct>     <fct>   
##  1 id_compra:674909~ Mon, 17 Sep~ http://www.comprases~ ajuste_a~ awardUp~
##  2 id_compra:675263~ Mon, 17 Sep~ http://www.comprases~ aclar_ll~ tenderU~
##  3 id_compra:675836~ Mon, 17 Sep~ http://www.comprases~ llamado-~ tender  
##  4 id_compra:675839~ Mon, 17 Sep~ http://www.comprases~ llamado-~ tender  
##  5 id_compra:675838~ Mon, 17 Sep~ http://www.comprases~ adjudica~ award   
##  6 id_compra:675529~ Mon, 17 Sep~ http://www.comprases~ aclar_ll~ tenderU~
##  7 id_compra:675272~ Mon, 17 Sep~ http://www.comprases~ aclar_ll~ tenderU~
##  8 id_compra:675279~ Mon, 17 Sep~ http://www.comprases~ aclar_ll~ tenderU~
##  9 id_compra:675837~ Mon, 17 Sep~ http://www.comprases~ adjudica~ award   
## 10 id_compra:663987~ Mon, 17 Sep~ http://www.comprases~ aclar_ll~ tenderU~
## # ... with 490 more rows

It is clear that the title column contains values for two variables: id_compra and release_id, which are separated by a comma. Let’s use separate to make things tidy:

index_tb %>% 
  separate(title, into = c("id_compra", "release_id"), sep = ",") %>%
  transform(id_compra=str_replace(id_compra,"id_compra:","")) %>%  
  transform(release_id=str_replace(release_id,"release_id:",""))   %>%
  # the last 2 lines just remove the needless prefixes
  head(5)
##   id_compra               release_id                   pubDate
## 1    674909 ajuste_adjudicacion-3428 Mon, 17 Sep 2018 18:10:06
## 2    675263   aclar_llamado-675263-4 Mon, 17 Sep 2018 17:56:01
## 3    675836           llamado-675836 Mon, 17 Sep 2018 17:40:06
## 4    675839           llamado-675839 Mon, 17 Sep 2018 17:40:06
## 5    675838      adjudicacion-675838 Mon, 17 Sep 2018 17:35:06
##                                                                       link
## 1 http://www.comprasestatales.gub.uy/ocds/release/ajuste_adjudicacion-3428
## 2   http://www.comprasestatales.gub.uy/ocds/release/aclar_llamado-675263-4
## 3           http://www.comprasestatales.gub.uy/ocds/release/llamado-675836
## 4           http://www.comprasestatales.gub.uy/ocds/release/llamado-675839
## 5      http://www.comprasestatales.gub.uy/ocds/release/adjudicacion-675838
##                       guid     category
## 1 ajuste_adjudicacion-3428  awardUpdate
## 2   aclar_llamado-675263-4 tenderUpdate
## 3           llamado-675836       tender
## 4           llamado-675839       tender
## 5      adjudicacion-675838        award

Separate splits the value of a single column by the character given by the sep parameter; the names of the newly defined columns are given by the into parameter, which should be a collection of strings. The last two lines of the code snippet above are just a sneak peek of one of the features of the package we will cover next: dplyr.

3.3 Data analysis functions

Once we have tidied up our data with tidyr, the next steps of data analysis usually involve data manipulation in one way or another. Selecting specific columns and rows based on a condition, adding composite columns, summarizing data, etc. are among the operations frequently performed as part of the analytical process. The Tidyverse includes dplyr as the go-to tool for this kind of data manipulation; we will now cover some of its basic functions.

3.3.1 Mutating

The mutate function allows the user to add a new column to a tibble based on the values of one (or more) already existing columns. Let’s use the dataset of releases from Uruguay to illustrate a case where this function might come in handy:

releases_tb
## # A tibble: 500 x 28
##    ocid  id    date  tag   language initiationType parties awards
##    <chr> <chr> <chr> <lis> <chr>    <chr>          <list>  <list>
##  1 ocds~ ajus~ 2018~ <chr~ es       tender         <data.~ <data~
##  2 ocds~ acla~ 2018~ <chr~ es       tender         <data.~ <NULL>
##  3 ocds~ llam~ 2018~ <chr~ es       tender         <data.~ <NULL>
##  4 ocds~ llam~ 2018~ <chr~ es       tender         <data.~ <NULL>
##  5 ocds~ adju~ 2018~ <chr~ es       tender         <data.~ <data~
##  6 ocds~ acla~ 2018~ <chr~ es       tender         <data.~ <NULL>
##  7 ocds~ acla~ 2018~ <chr~ es       tender         <data.~ <NULL>
##  8 ocds~ acla~ 2018~ <chr~ es       tender         <data.~ <NULL>
##  9 ocds~ adju~ 2018~ <chr~ es       tender         <data.~ <data~
## 10 ocds~ acla~ 2018~ <chr~ es       tender         <data.~ <NULL>
## # ... with 490 more rows, and 20 more variables: buyer.name <chr>,
## #   buyer.id <chr>, tender.id <chr>, tender.hasEnquiries <lgl>,
## #   tender.documents <list>, tender.amendments <list>, tender.title <chr>,
## #   tender.description <chr>, tender.status <chr>, tender.items <list>,
## #   tender.procurementMethod <chr>, tender.procurementMethodDetails <chr>,
## #   tender.submissionMethod <list>, tender.submissionMethodDetails <chr>,
## #   tender.procuringEntity.name <chr>, tender.procuringEntity.id <chr>,
## #   tender.tenderPeriod.startDate <chr>,
## #   tender.tenderPeriod.endDate <chr>,
## #   tender.enquiryPeriod.startDate <chr>,
## #   tender.enquiryPeriod.endDate <chr>

Let’s suppose we need a column with the name of the month when a release was made. This value is not readily available in the dataset, but the information is contained in the date column and can be extracted with some help from mutate:

uruguay_releases_with_month = releases_tb %>% mutate(month = month.name[month(date)])
# month.name is a built-in vector with the name of each month
uruguay_releases_with_month[c('ocid', 'month')]
## # A tibble: 500 x 2
##    ocid               month    
##    <chr>              <chr>    
##  1 ocds-yfs5dr-674909 September
##  2 ocds-yfs5dr-675263 September
##  3 ocds-yfs5dr-675836 September
##  4 ocds-yfs5dr-675839 September
##  5 ocds-yfs5dr-675838 September
##  6 ocds-yfs5dr-675529 September
##  7 ocds-yfs5dr-675272 September
##  8 ocds-yfs5dr-675279 September
##  9 ocds-yfs5dr-675837 September
## 10 ocds-yfs5dr-663987 September
## # ... with 490 more rows

Mutate receives a tibble as its first parameter and a name-value expression as the second; the expression is evaluated row-wise to produce the new column.

3.3.2 Selecting

In our code snippet above we selected certain columns of our tibble using array-indexing notation. dplyr provides a function to do the same thing, select:

select(uruguay_releases_with_month, ocid, month)
## # A tibble: 500 x 2
##    ocid               month    
##    <chr>              <chr>    
##  1 ocds-yfs5dr-674909 September
##  2 ocds-yfs5dr-675263 September
##  3 ocds-yfs5dr-675836 September
##  4 ocds-yfs5dr-675839 September
##  5 ocds-yfs5dr-675838 September
##  6 ocds-yfs5dr-675529 September
##  7 ocds-yfs5dr-675272 September
##  8 ocds-yfs5dr-675279 September
##  9 ocds-yfs5dr-675837 September
## 10 ocds-yfs5dr-663987 September
## # ... with 490 more rows

Select receives a variable number of parameters: the first one must be a tibble or data frame, and the rest are the columns to be included in the selection.
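
Select also accepts helper functions that match columns by name patterns; for instance, the sketch below (output omitted) should keep every tender-related column of the Uruguayan releases:

# select all columns whose names start with "tender"
releases_tb %>% select(starts_with("tender"))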

3.3.3 Filtering

A very common operation when dealing with data is to select a subset of rows that are of interest for our analysis. For example, let’s assume we are only interested in releases that are related to awards. We can identify the stage of the contracting process to which a release is related through the tag column.

releases_tb[1,]$tag
## [[1]]
## [1] "awardUpdate"

Knowing this, it is easy to get the collection of releases that correspond to awards by using filter as shown below:

awards_tb = releases_tb %>% filter(tag == 'award')
awards_tb
## # A tibble: 230 x 28
##    ocid  id    date  tag   language initiationType parties awards
##    <chr> <chr> <chr> <lis> <chr>    <chr>          <list>  <list>
##  1 ocds~ adju~ 2018~ <chr~ es       tender         <data.~ <data~
##  2 ocds~ adju~ 2018~ <chr~ es       tender         <data.~ <data~
##  3 ocds~ adju~ 2018~ <chr~ es       tender         <data.~ <data~
##  4 ocds~ adju~ 2018~ <chr~ es       tender         <data.~ <data~
##  5 ocds~ adju~ 2018~ <chr~ es       tender         <data.~ <data~
##  6 ocds~ adju~ 2018~ <chr~ es       tender         <data.~ <data~
##  7 ocds~ adju~ 2018~ <chr~ es       tender         <data.~ <data~
##  8 ocds~ adju~ 2018~ <chr~ es       tender         <data.~ <data~
##  9 ocds~ adju~ 2018~ <chr~ es       tender         <data.~ <data~
## 10 ocds~ adju~ 2018~ <chr~ es       tender         <data.~ <data~
## # ... with 220 more rows, and 20 more variables: buyer.name <chr>,
## #   buyer.id <chr>, tender.id <chr>, tender.hasEnquiries <lgl>,
## #   tender.documents <list>, tender.amendments <list>, tender.title <chr>,
## #   tender.description <chr>, tender.status <chr>, tender.items <list>,
## #   tender.procurementMethod <chr>, tender.procurementMethodDetails <chr>,
## #   tender.submissionMethod <list>, tender.submissionMethodDetails <chr>,
## #   tender.procuringEntity.name <chr>, tender.procuringEntity.id <chr>,
## #   tender.tenderPeriod.startDate <chr>,
## #   tender.tenderPeriod.endDate <chr>,
## #   tender.enquiryPeriod.startDate <chr>,
## #   tender.enquiryPeriod.endDate <chr>
awards_tb[10, ]$tag
## [[1]]
## [1] "award"

Filter receives a tibble as its first parameter and a boolean expression as the second; the expression is evaluated for each row and only rows which evaluate to TRUE are included in the final result.
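
Several conditions can be passed to filter at once, in which case only rows satisfying all of them are kept. Revisiting the iris tibble for a quick sketch:

# setosa observations with a sepal longer than 5
iris %>% filter(Name == "Iris-setosa", SepalLength > 5)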

3.3.4 Summarizing

Summarization allows us to aggregate variables based on a preset grouping. To illustrate this, we can revisit our example using the iris dataset:

iris %>%
  group_by(Name) %>%
  summarize_if(is.numeric, mean)
## # A tibble: 3 x 5
##   Name            SepalLength SepalWidth PetalLength PetalWidth
##   <fct>                 <dbl>      <dbl>       <dbl>      <dbl>
## 1 Iris-setosa            5.01       3.42        1.46      0.244
## 2 Iris-versicolor        5.94       2.77        4.26      1.33 
## 3 Iris-virginica         6.59       2.97        5.55      2.03

In the snippet above, we start by grouping the dataset by species name and then summarize every numeric column using the mean as the aggregator function.
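
Other aggregator functions can be plugged in the same way; for instance, n() counts the observations in each group (in the classic iris dataset, each species should have 50):

iris %>%
  group_by(Name) %>%
  summarize(count = n())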

4 Data visualization

Data visualization can be defined as the mapping of data to visual cues such as size, shape and color to represent the variation of the values of the variables in a dataset. It is a good way to communicate complex information, since patterns and comparisons are much easier to spot in a chart than in raw data.

This section introduces the reader to some basic concepts of data visualization, providing a few guidelines on how to choose an appropriate visual representation of a dataset. Although it is definitely helpful to acquire such theoretical background, the good news for the reader is that R gently guides you towards the right choice (if you let it). The last statement will become clearer once we cover ggplot2, R’s main data visualization library and a very important component of the Tidyverse.

Finally, we will test what we just learned by plotting OCDS data using ggplot2.

4.1 A data visualization primer

Whenever we visualize data, we are encoding variables using visual attributes such as size, shape or color. Consider, for example, a quantitative variable: differences among the values associated with each observation can be represented in several ways, as displayed in figure 4.1:

Figure 4.1: Potential attribute mapping for a quantitative variable.

As the reader might have noticed, many mappings are possible, but not all of them seem equally appropriate. In fact, statisticians William Cleveland and Robert McGill explored this intuition further by running experiments with human volunteers to determine which visual attributes encode quantitative information more accurately. Their findings are summarized in figure 4.2:

Figure 4.2: Preattentive attributes sorted by their suitability to encode quantitative variables.

While this preference order is a good guideline for quantitative variables, other types of data map differently to visual attributes. For example, while color hue is a poor choice to encode a quantitative variable, it works well to encode a categorical one.

Different combinations of encoding choices render different types of plots. In this manual we will use four types of plots, which will be introduced as we use them to describe OCDS data in a later subsection: bar charts, line charts, box plots and histograms. There are several other types of charts, and additional considerations that can be taken into account when visualizing data; for a more in-depth tour of these topics we recommend the great visualization course by Peter Aldhous, available online at http://paldhous.github.io/ucb/2016/dataviz/week2.html (figures from this section were taken from the course site).

4.2 An introduction to ggplot2

ggplot2 is the Tidyverse package for data visualization. It is based on the grammar of graphics, a formal grammar for declaratively describing the most common charts used when visualizing data.

Graphs are described in ggplot by using a concise set of elements which can be combined according to a basic structure defined by the grammar of graphics. A simplified version of the grammar can be read below:

ggplot(data = [DATA]) + 
  [GEOM_FUNCTION](mapping = aes([MAPPINGS]))

You can consider the snippet above as a template for graphics. To make a plot, you simply fill in the elements within brackets with actual values:

  • [DATA] is pretty self-explanatory: it is the placeholder for the dataset that we want to plot.
  • [GEOM_FUNCTION] should be replaced by any geom function provided by ggplot2; geoms are geometrical objects used to represent data, such as points, lines, etc.
  • Finally, [MAPPINGS] should be a collection of ggplot2 aesthetics, which are basically visual encoding specifications for the variables we want to visualize.

Let’s see this template in action by plotting our well-known Iris dataset as a scatter plot:

ggplot(data = iris) + geom_point(mapping = aes(x = PetalLength, y = PetalWidth, color = Name))

As the reader might have noticed, ggplot2 favors convention over configuration and any plot that we draw with it includes many good practices out of the box. For example, in the scatter plot above we got axis labelling and a legend for free just by specifying our set of aesthetic mappings.
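
Of course, these defaults can be overridden by appending further components; a sketch using the labs function to relabel the axes and the legend (the centimeter unit is our assumption, for illustration only):

ggplot(data = iris) +
  geom_point(mapping = aes(x = PetalLength, y = PetalWidth, color = Name)) +
  labs(x = 'Petal length (cm)', y = 'Petal width (cm)', color = 'Species')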

There are several other components of the ggplot2 grammar, but for this brief introduction our simplified template will do. To check which other elements are available, along with an exhaustive list of geoms and aesthetics, please visit the official website of the library.
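
As a small teaser of those additional elements, the sketch below uses facet_wrap, a grammar component that splits a plot into one panel per value of a categorical variable:

# One panel per species instead of a color legend.
ggplot(data = iris) +
  geom_point(mapping = aes(x = PetalLength, y = PetalWidth)) +
  facet_wrap(~ Name)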

4.3 Visualizing Open Contracting data

Now that we know the basics of ggplot2 we can go ahead and start making some plots …almost. Since we are mostly interested in comparing amounts spent by year and buyer, we can start off by extracting all rows of our dataset that contain at least one contract. We achieve exactly that by building a boolean index with one element per row of our dataset, indicating whether its corresponding row passes our condition or not.

contract_indices <- parApply(cl, dncp_records_tb, 1, function(r) {
  contracts <- r['compiledRelease.contracts'][[1]]
  return(!is.null(contracts) && !is.null(dim(contracts)))
})

Once we have our boolean index, we can use it to filter out the records that we are not interested in and keep the ones that are useful for our analysis, which we achieve by doing dncp_records_tb[contract_indices, ] in the snippet below. After that, we extract the contracts and the buyer name from the records we kept.

contracts <- parApply(cl, dncp_records_tb[contract_indices, ], 1, function(r) {
  result <- r['compiledRelease.contracts'][[1]]
  result['buyer.name'] <- r['compiledRelease.buyer.name']
  result['publishedDate'] <- r['publishedDate']
  return(result)
})

The contracts object defined above is a nested collection of contracts. To make it a single tibble we call the bind_rows function. We explicitly free the memory we are no longer going to use by removing the contracts object.

contracts_dncp <- bind_rows(contracts)
rm(contracts)

We now have a contracts dataset from DNCP ready for plotting and analysis. We can follow similar steps to get one from the Finance Ministry.

contracts <- parApply(cl, mh_releases_tb, 1, function(r) {
  result <- r['contracts'][[1]]
  result['buyer.name'] <- r['buyer.name']
  result['publishedDate'] <- r['publishedDate']
  return(result)
})
contracts_hacienda <- bind_rows(contracts)
rm(contracts)

We know there is significant overlap between the publications of DNCP and the Ministry of Finance, so naively merging both tibbles would result in contract duplication, invalidating our analysis. First, let’s make sure the overlap actually exists by selecting the contracts that are present in both the DNCP and the Ministry of Finance datasets. The subset function selects rows from a tibble based on a condition; in this case, a non-empty subset would confirm the existence of duplicates.

overlap <- subset(
  contracts_dncp,
  (dncpContractCode %in% contracts_hacienda[['dncpContractCode']]))['dncpContractCode']
dim(overlap)
## [1] 43351     1

Having confirmed the presence of duplicate contracts, we need to come up with a smarter merging strategy: let’s merge both tibbles together, group them by dncpContractCode (this should group duplicates together), sort the rows within each group by publishedDate in ascending order, and finally pick the last row from each group. This preprocessing step, coded in the snippet below, ensures that we only include the most current version of a contract in our analysis.

contracts_paraguay <- bind_rows(contracts_dncp, contracts_hacienda) %>%
  group_by(dncpContractCode) %>%
  arrange(publishedDate) %>%
  slice(n()) %>%
  ungroup

contracts_paraguay
## # A tibble: 147,196 x 24
##    awardID title dncpContractCode lots  documents dateSigned id    items
##    <chr>   <chr> <chr>            <lis> <list>    <chr>      <chr> <lis>
##  1 195388~ Cont~ AC-11001-10-0155 <NUL~ <data.fr~ 2010-09-2~ 1953~ <NUL~
##  2 156467~ Cont~ AC-11001-10-0290 <NUL~ <data.fr~ 2010-10-2~ 1564~ <NUL~
##  3 156467~ Cont~ AC-11001-10-0292 <NUL~ <data.fr~ 2010-10-2~ 1564~ <NUL~
##  4 156467~ Cont~ AC-11001-10-0293 <NUL~ <data.fr~ 2010-10-2~ 1564~ <NUL~
##  5 185054~ "Con~ AC-11001-10-0319 <NUL~ <data.fr~ 2010-10-2~ 1850~ <NUL~
##  6 185054~ "Con~ AC-11001-10-0361 <NUL~ <data.fr~ 2010-11-0~ 1850~ <NUL~
##  7 185054~ "Con~ AC-11001-10-0362 <NUL~ <data.fr~ 2010-11-0~ 1850~ <NUL~
##  8 201352~ Cont~ AC-11001-10-0363 <NUL~ <data.fr~ 2010-11-0~ 2013~ <NUL~
##  9 186382~ Cont~ AC-11001-10-0408 <NUL~ <data.fr~ 2010-11-0~ 1863~ <NUL~
## 10 189787~ "Con~ AC-11001-10-0421 <NUL~ <data.fr~ 2010-11-0~ 1897~ <NUL~
## # ... with 147,186 more rows, and 16 more variables: url <chr>,
## #   status <chr>, value.amount <dbl>, value.currency <chr>,
## #   suppliers.name <chr>, suppliers.identifier.scheme <chr>,
## #   suppliers.identifier.legalName <chr>, suppliers.identifier.id <chr>,
## #   suppliers.identifiers.key <chr>, period.endDate <chr>,
## #   period.startDate <chr>, buyer.name <chr>, publishedDate <dbl>,
## #   extendsContractID <chr>, dncpAmendmentType <chr>,
## #   implementation.transactions <list>

Next, there is the issue of contract currency. Let’s have a look at which currencies are used and how many contracts were signed for each one.

contracts_paraguay %>%
  group_by(value.currency) %>%
  summarise(count = n())
## # A tibble: 3 x 2
##   value.currency  count
##   <chr>           <int>
## 1 PYG            146700
## 2 USD               495
## 3 <NA>                1

Results presented in the table above show that the vast majority of contracts were signed with amounts in Paraguayan Guaranies (PYG), a not-so-surprising fact given that we are looking at Paraguayan contracts. We could include contracts with amounts in US dollars by using a given exchange rate, but to keep things simple we will consider only the former currency.
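
For the curious reader, a sketch of what such a conversion could look like; the exchange rate below is purely hypothetical, and a rigorous analysis should use the rate in effect at each contract’s signing date:

usd_to_pyg <- 5500  # hypothetical rate, for illustration only
contracts_converted <- contracts_paraguay %>%
  mutate(value.amount = ifelse(value.currency == 'USD',
                               value.amount * usd_to_pyg, value.amount),
         value.currency = ifelse(value.currency == 'USD', 'PYG', value.currency))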

To continue with our analysis, let’s get a tibble of our contract dataset grouped by year, along with some interesting summaries. First, we extract the year from the dateSigned field using the mutate function; then we filter out some invalid rows (NA values) and outliers while keeping only the contracts in PYG, using the filter function. Finally, we group the remaining contracts by year with the group_by function and obtain the total count and amount per group using the summarise function.

by_year <- contracts_paraguay %>%
  mutate(signed_year = year(dateSigned)) %>%
  filter(!is.na(signed_year), !is.na(value.amount),
         signed_year < 2019, signed_year > 2009, value.currency == 'PYG') %>%
  group_by(signed_year) %>%
  summarise(count = n(), amount = sum(value.amount))

With our dataset defined above, we can draw line plots with a few lines of ggplot2 magic. Let’s plot the number of contracts and the total amount for all contracts per year:

ggplot(data=by_year, aes(x=signed_year, y=count)) +
  geom_line(color="blue") +
  geom_point() +
  labs(x = 'Year', y = 'Number of contracts')

ggplot(data=by_year, aes(x=signed_year, y=amount)) +
  geom_line(color='springgreen4') +
  geom_point() +
  labs(x = 'Year', y = 'Total Amount (PYG)')

We have drawn some line plots; what about some bars? Let’s draw the number of contracts per buyer for a given year (2017). You should be noticing a pattern by now: we start by getting our data into the right shape using dplyr verbs, and only then do we plot. Bar charts require an additional intermediate step, at least to give them the typical top-n look: buyer names are character vectors and hence sorted lexicographically by default, so we need to redefine the column using the factor type (used for categorical values) and set a custom ordering using the order function.

count_by_buyer <- contracts_paraguay %>%
  mutate(signed_year = year(dateSigned)) %>%
  filter(value.currency == 'PYG', signed_year == 2017) %>%
  group_by(buyer.name) %>%
  summarise(ccount = n()) %>%
  arrange(desc(ccount)) %>%
  head(5)

count_by_buyer$buyer.name <- factor(
  count_by_buyer$buyer.name,
  levels = count_by_buyer$buyer.name[order(count_by_buyer$ccount)])


ggplot(data=count_by_buyer, aes(x=buyer.name, y=ccount)) +
  geom_bar(stat="identity", fill="steelblue") +
  coord_flip() +
  labs(y = 'Number of contracts', x = 'Buyer')

We can follow similar steps to draw a bar plot considering the total amount by buyer:

amount_by_buyer <- contracts_paraguay %>%
  mutate(signed_year = year(dateSigned)) %>%
  filter(value.currency == 'PYG', signed_year == 2017) %>%
  group_by(buyer.name) %>%
  summarise(amount = sum(value.amount)) %>%
  arrange(desc(amount)) %>%
  head(5)

top_buyers <- amount_by_buyer$buyer.name
amount_by_buyer$buyer.name <- factor(
  amount_by_buyer$buyer.name,
  levels = amount_by_buyer$buyer.name[order(amount_by_buyer$amount)])


ggplot(data=amount_by_buyer, aes(x=buyer.name, y=amount)) +
  geom_bar(stat="identity", fill="springgreen4") +
  coord_flip() +
  labs(y = 'Total Amount (PYG)', x = 'Buyer')

The plots we have drawn so far are useful to show grouped summaries, but they do not give us any clue about the underlying distribution of the plotted numerical variable. Box plots let us visualize distributions grouped by a categorical variable, together with some very useful statistics; figure 4.3 (taken from a stats course by the BBC) shows how to interpret each component of a box plot:

Figure 4.3: Elements of a box plot.

The beauty of ggplot2 lies in its declarative nature. Take a look at the code snippet below, which draws a box plot: the only significant difference with the previous examples is the usage of another geom (geom_boxplot).

contracts_top5_buyers <- contracts_paraguay %>%
  mutate(signed_year = year(dateSigned)) %>%
  filter(value.currency == 'PYG', signed_year == 2017, buyer.name %in% top_buyers) %>%
  arrange(desc(value.amount)) %>%
  tail(1000)

contracts_top5_buyers$buyer.name <- factor(contracts_top5_buyers$buyer.name)


ggplot(data=contracts_top5_buyers, aes(x=buyer.name, y=value.amount, color=buyer.name)) +
  geom_boxplot(outlier.colour="black", outlier.shape=NA, notch=FALSE) +
  labs(y = 'Total Amount (PYG)', x = 'Buyer') +
  theme(axis.text.y=element_blank()) +
  coord_flip()

There was a tiny lie in the previous paragraph, as there is another difference in the snippet above: since there is a buyer with contracts much larger than the others, our box plot would degenerate if we plotted all contracts. To avoid this ugly scenario, we restrict ourselves to the thousand contracts at the lower end of our dataset in terms of amount.
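
An alternative worth sketching (not the approach taken above) is to keep all the contracts of the top buyers and draw the amount axis on a logarithmic scale instead; note that a log scale requires strictly positive amounts:

contracts_top5_all <- contracts_paraguay %>%
  mutate(signed_year = year(dateSigned)) %>%
  filter(value.currency == 'PYG', signed_year == 2017,
         buyer.name %in% top_buyers, value.amount > 0) %>%
  mutate(buyer.name = factor(buyer.name))

# Same box plot as above, with a log10 amount axis to tame the outliers.
ggplot(data=contracts_top5_all, aes(x=buyer.name, y=value.amount, color=buyer.name)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(y = 'Amount (PYG, log scale)', x = 'Buyer') +
  theme(axis.text.y=element_blank()) +
  coord_flip()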

A histogram is an appropriate choice if we want to represent the distribution of a quantitative variable such as contract amount without any grouping. A histogram splits the entire domain of the variable into multiple equally sized ranges (usually called bins) and counts how many of the samples (contracts, in this case) fall into each one. The code snippet below draws a histogram of contract amounts for 2017; again, to filter outliers and get a meaningful plot, we consider only the bottom 10,000 contracts.

selected_contracts <- contracts_paraguay %>%
  mutate(signed_year = year(dateSigned)) %>%
  filter(value.currency == 'PYG', signed_year == 2017, !is.na(value.amount)) %>%
  arrange(desc(value.amount)) %>%
  tail(10000)

ggplot(data=selected_contracts, aes(value.amount)) +
  geom_histogram(fill='steelblue', col='black') +
  labs(x = 'Amount (PYG)', y = 'Number of Contracts')
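
By default geom_histogram uses 30 bins and prints a message suggesting to pick a better value; a sketch of how the bin count can be set explicitly:

# Same histogram with an explicit bin count; tweak to taste.
ggplot(data=selected_contracts, aes(value.amount)) +
  geom_histogram(bins = 50, fill='steelblue', col='black') +
  labs(x = 'Amount (PYG)', y = 'Number of Contracts')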

In the last two examples we have been restricting our analysis to a certain tail of the distribution. This is not uncommon, especially in the presence of outliers, since the difference in magnitudes makes it hard to extract meaningful conclusions from an overall plot. It is easy to change the part of the distribution we are looking at by changing the parameter of the tail function, or by using its complement, the head function. If you are running this notebook in interactive mode, take a few minutes to do so and try to find some interesting patterns in the data.
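
For instance, a sketch of the complementary view, keeping the 10,000 largest contracts instead of the smallest:

largest_contracts <- contracts_paraguay %>%
  mutate(signed_year = year(dateSigned)) %>%
  filter(value.currency == 'PYG', signed_year == 2017, !is.na(value.amount)) %>%
  arrange(desc(value.amount)) %>%
  head(10000)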

As our last plot we will draw a histogram for Mexican contracts, which will be read from a PostgreSQL database with the streaming helper function we implemented in previous sections. We do so by defining a callback function that will extract the contract amounts for every contract in every published compiledRelease. In summary, the callback defined in the snippet below does the following:

  • It parses a release using jsonlite helper functions.
  • Given a release, it extracts the contracts if there are any (checked using the dim function).
  • For a given contract, if no amount is available it sets a dummy amount to allow filtering later on.
  • It merges contracts from the current page with contracts from previously processed pages (stored in the acc variable).

amount_getter <- function(data, acc) {
  m <- parLapply(cl, data, function(e) {
    t <- as_tibble(fromJSON(e, flatten = TRUE)$compiledRelease$contracts)
    if (dim(t)[[1]] > 0) {
      if (!("valueWithTax.amount" %in% colnames(t))) {
        t$valueWithTax.amount = -1
        t$valueWithTax.currency = "BTC"
      }
      return(select(t, valueWithTax.amount, valueWithTax.currency))
    } else {
      return(NULL)
    }
  })
  
  if (is.tibble(acc)) {
    return(bind_rows(acc, m))
  } else {
    return(bind_rows(m))
  }
}

contracts <- stream_in_mexico(amount_getter)

Notice how cumbersome this à-la-map-reduce processing is, no matter how efficient, when compared to the dplyr functions we used in previous examples. The moral of the story: while R can handle out-of-core computation, several of its tools are not really prepared for such a case. Having said that, once we have a tibble that does fit into memory, things get much friendlier again, as seen in the following snippet that draws our histogram.

contracts_mx <- contracts %>%
  filter(valueWithTax.currency == 'MXN') %>%
  arrange(desc(valueWithTax.amount)) %>%
  tail(10000)
ggplot(data=contracts_mx, aes(valueWithTax.amount)) +
  geom_histogram(fill='forestgreen', col='black') +
  labs(x = 'Amount (MXN)', y = 'Number of Contracts')

stopCluster(cl)

Since we are done with our heavy data munging, let’s not forget to clean up after ourselves. The stopCluster function releases all the resources used by our parallel processing nodes.

5 Recommendations and conclusions

In the pages of this guide we have introduced the reader to the Open Contracting Data Standard and the R programming language; furthermore, we have shown how to use R, and more specifically tools available as a part of the Tidyverse, to perform basic data fetching, cleaning, analysis and visualization.

The specifics of the data format and the volume of the available dataset have given us enough reason to introduce more advanced concepts such as parallel processing. Having said that, publishers could greatly improve the experience of prospective data analysts by addressing some of the issues we encountered along the way, such as the duplicate publications and missing amount fields we had to work around.

Having reached this point, the reader should be able to conduct their own analysis of OCDS datasets using the R programming language. Throughout the guide, we have provided several links that point to valuable resources to keep on learning about the language and its rich tool ecosystem. Now it’s time to use what we learned to pursue the goal of the OCDS: to increase contracting transparency, and allow deeper analysis of contracting data.


  1. The author recommends RStudio as the go-to way to run R code and believes it is one major reason behind R’s spike in popularity. Consequently, this guide was written using RStudio v1.1.453 on macOS High Sierra.

  2. This notebook was written with the Tidyverse code style guide in mind. Please blame any non-conformant snippet on the lack of good coffee at that point of the writing process :)