Chapter 6 Data Wrangling

Contributors: Kunal Mishra, Jade Benjamin-Chung, Ben Arnold

6.1 Overview

Processing data from raw inputs to final data for analysis is the foundation of the data science workflow. “Management” is a bad word in both academia and hip parts of the private sector, so what used to be called “data management” is now called “data wrangling” or “data munging”.

By any name, it is by far the most tedious, time-consuming, and error-prone step in the data analysis workflow. Errors are likely because data processing usually requires hundreds or thousands of lines of code. Generating final datasets often requires programmers to merge and append multiple raw datasets and, often, to reshape the data from wide format to long format.

Read the chapters on Data Wrangling in R for Data Science. That book covers the foundational skills and know-how.

Here, we provide some additional recommendations for best practices related to data wrangling and highlight a few topics not covered in the excellent R for Data Science book.

6.2 Cardinal rule


Cardinal rule of data wrangling: NEVER alter untouched datasets


Untouched datasets include data output by data collection software, data distributed by the data team or outside collaborators, and data downloaded from a data repository. They should be left untouched. All data cleaning and processing activities should read in untouched data and output either temporary files or final datasets. Never, ever overwrite an untouched dataset.

6.3 Data input/output (I/O)

Read the Data Import section of R for Data Science.

6.3.1 Excel files

Often collaborators share data as Excel files (.xlsx). In this case, we recommend making a copy of the file in the untouched repository and converting it into .csv files (one per tab) that serve as input to the data processing workflow.

This step should mark the end of Excel’s role in your data analysis, aside from occasionally inspecting .csv files. Excel is extremely error prone because it handles variable types poorly (e.g., silently reformatting dates and stripping leading zeros) and has no real programmatic interface (VBA notwithstanding). Avoid it. We can find no compelling case for its use in data science.
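As a concrete illustration of the per-tab conversion described above, here is a minimal sketch using the readxl and readr packages; the file paths are hypothetical examples.

```r
# A minimal sketch: convert each tab of an .xlsx file to its own .csv.
# The file paths here are hypothetical examples.
library(readxl)
library(readr)

xlsx_path <- "untouched/collaborator_data.xlsx"

# write one .csv per sheet (tab) in the workbook
for (sheet in excel_sheets(xlsx_path)) {
  dat <- read_excel(xlsx_path, sheet = sheet)
  write_csv(dat, file.path("untouched", paste0(sheet, ".csv")))
}
```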

6.3.2 .RDS vs .RData Files

One of the most common ways to load and save data in Base R is with the load() and save() functions, which serialize multiple objects into a single .RData file. The biggest problems with this practice are that you cannot control the names of the objects being loaded, which makes older code confusing to follow, and that you cannot load individual elements of a saved file. For these reasons, we recommend the RDS format: save R objects with saveRDS() and read them back with its complement, readRDS().

  • Note: if you have many related R objects that you would otherwise have saved together using the save function, the functional equivalent with RDS is to create a (named) list containing each of these objects and save the list (see the sketch after these notes).

  • Note: there is an important caveat for .rds files: they are not automatically backward compatible across different versions of R! So, while they are very useful in general, beware. See, for example, this thread on StackExchange. .csv files embed slightly less information (typically), but are more stable across different versions of R.
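Here is a minimal sketch of the named-list pattern mentioned in the first note above; the object and file names are hypothetical examples.

```r
# instead of save(enrollment, followup, file = "study.RData"), bundle the
# related objects into a named list and save a single .rds file
enrollment <- data.frame(id = 1:3)
followup   <- data.frame(id = 1:3, visit = 1)

study <- list(enrollment = enrollment, followup = followup)
saveRDS(study, file = "study.rds")

# readRDS() returns the object, so you control the name it is assigned to
study2 <- readRDS("study.rds")
head(study2$enrollment)
```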

6.3.3 .CSV Files

Once again, the readr package, part of the Tidyverse, is great: its read_csv() is much faster than Base R’s read.csv(). For massive CSVs (> 5 GB), you’ll find data.table::fread() to be the fastest CSV reader in any data science language out there. For writing CSVs, readr::write_csv() and data.table::fwrite() outclass Base R’s write.csv() by a significant margin as well.
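A minimal sketch of these readers and writers, assuming the readr and data.table packages are installed; the file paths are hypothetical examples.

```r
library(readr)
library(data.table)

# readr: fast, returns a tibble, parses column types with sensible defaults
d1 <- read_csv("untouched/survey.csv")

# data.table::fread(): typically the fastest option for very large files
d2 <- fread("untouched/survey.csv")

# writing: both are much faster than base R's write.csv()
write_csv(d1, "temp/survey_readr.csv")
fwrite(d2, "temp/survey_dt.csv")
```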

6.4 Documenting datasets

Datasets need to have metadata (documentation) associated with them to help people understand them. Well-documented datasets save an enormous amount of time because they avoid a lot of back-and-forth as new people orient themselves to the data. This applies to both private and public data used in your workflow.

Each raw and final dataset should include a codebook. Sometimes survey instruments or electronic data capture schematics can stand in for raw dataset codebooks, as long as that information is stored alongside the raw data!

The file asembo_analysis_codebook.txt provides one example of what a codebook for a simple, final dataset could contain.

For complex studies with multiple, relational data files, it is exceptionally helpful to also include a README overview in plain text or markdown that explains the relationships between the datasets. Here is an example from the WASH Benefits Bangladesh trial primary outcomes analysis: README-WBB-primary-outcomes-datasets.md.

6.5 Mapping data from untouched -> final

An important step in data processing is to plan which final datasets you want to create.

Best practice is to scope out the flow of data from the original forms/modules/tables at the time of data capture to the final datasets planned for the analysis, and to ensure that there is a key linking each table needed to create the final analysis datasets.

These are probably the two most common models for the end product:

  • Single massive dataset: a single, large, final dataset for the whole project with potentially multiple levels of data (e.g., household, individual) and hundreds or thousands of variables.

  • Multiple relational datasets: a set of smaller datasets, each tailored to a specific type of data collected in the study, that can be recombined as needed for analyses.

We strongly recommend using the second model, relational data tables, in almost every case. The chapter on Relational Data in R for Data Science explains the details.

Rationale: a single massive dataset that includes information at multiple levels creates several challenges. First, such datasets are unwieldy, often consisting of hundreds or thousands of variables; this makes it difficult to find variables, difficult to view the data through a software browser, and easy to forget about variables. But there are bigger problems. Datasets with an exceptionally large number of variables or multiple levels of data in the same file make it difficult to visually detect pathological problems in the data (e.g., unexpected missing values). For studies that include data at multiple levels (e.g., communities, households, individuals), combining types of data in the same file makes analyses error prone because the programmer must keep track of which variables include duplicated observations. For example, if household characteristics are stored in an individual-level data file and there are multiple individuals per household, then simple summaries (e.g., means) of household characteristics cannot be calculated without first restricting the file to unique observations at the household level.
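To make the last point concrete, here is a minimal sketch using dplyr; the variable names (hhid, n_rooms) are hypothetical examples.

```r
library(dplyr)

# individual-level data with a household characteristic (n_rooms) repeated
# on every child's row
ind <- data.frame(
  hhid    = c(1, 1, 2),
  childid = c(1, 2, 1),
  n_rooms = c(2, 2, 4)
)

# misleading: averaging over individual rows double-counts household 1
mean(ind$n_rooms)   # 2.67

# correct: restrict to one row per household before summarizing
ind %>%
  distinct(hhid, n_rooms) %>%
  summarize(mean_n_rooms = mean(n_rooms))   # 3
```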

6.6 Relational data

Refer to Relational Data in R for Data Science for an example of what a well-designed set of relational tables can look like.

Important: for a set of relational tables to work, you must ensure that each table relates to another through a unique key. For example, a child-level dataset must also include household and cluster IDs so it can be merged with household- or cluster-level data.

Practical advice on key variables (IDs) used to link datasets: numeric IDs can occasionally cause problems with merges due to machine rounding errors, so character values are often a good alternative. Whatever format you use, ensure that IDs do not encode identifying information (e.g., a social security number or medical record number would be a poor choice of ID for a final dataset).
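One simple safeguard is to confirm that a key uniquely identifies rows in its table before relying on it in a merge. Here is a minimal sketch using dplyr; the dataset and key names are hypothetical examples.

```r
library(dplyr)

hh <- readRDS("temp/household.rds")   # expected: one row per household

# any key value appearing more than once signals a problem to investigate
hh %>%
  count(hhid) %>%
  filter(n > 1)
```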

6.7 Be careful with joins / merges

Experience shows that merging datasets is one of the most error prone steps in data wrangling. Merges can be particularly tricky for less experienced programmers. Errors are more likely early in the data processing workflow, and less likely when merging final datasets with clean key variables (IDs) and clear relationships between datasets. Carefully check each merge to ensure it is working as you expect.

ESSENTIAL READING: Join Problems in R for Data Science

The Stata software has a really nice feature in its merge command that allows you to specify the type of merge/join and then examine diagnostics of the result.

Although R does not have a similarly nice feature at this time, the different join functions in the dplyr package provide all of the tools you need to correctly join your data.

In our experience, the two most common types of joins are mutating joins and filtering joins (definitions from the R for Data Science book):

Mutating joins: add new variables to one data frame from matching observations in another.

Filtering joins: filter observations from one data frame based on whether or not they match an observation in the other table.

Familiarize yourself with these types of operations. The Relational data chapter of R for Data Science includes details.
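Here is a minimal sketch of a mutating join followed by filtering joins used as diagnostics, loosely mimicking the Stata merge report mentioned above; the dataset and ID names are hypothetical examples.

```r
library(dplyr)

child <- readRDS("temp/child.rds")      # one row per child, includes hhid
hh    <- readRDS("temp/household.rds")  # one row per household

# mutating join: add household variables to the child-level data
child_hh <- left_join(child, hh, by = "hhid")

# filtering joins as diagnostics: which rows failed to match?
children_without_hh <- anti_join(child, hh, by = "hhid")
hh_without_children <- anti_join(hh, child, by = "hhid")

nrow(children_without_hh)  # should usually be 0
nrow(hh_without_children)  # investigate any non-zero count
```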

6.8 Reshaping data

The tidyr package introduced the pivot_wider() and pivot_longer() functions in 2019. They make reshaping data dramatically easier than any previous functions, including reshape() (base R) or gather()/spread() (earlier tidyr). This excellent article includes the details:

https://tidyr.tidyverse.org/dev/articles/pivot.html
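A minimal sketch of both pivot directions; the variable names are hypothetical examples.

```r
library(tidyr)

# wide: one column per measurement round
wide <- data.frame(
  childid   = 1:2,
  weight_r1 = c(10.2, 9.8),
  weight_r2 = c(11.0, 10.4)
)

# wide -> long: one row per child per round
long <- pivot_longer(
  wide,
  cols = starts_with("weight_"),
  names_to = "round",
  names_prefix = "weight_",
  values_to = "weight"
)

# long -> wide: back to one row per child
pivot_wider(long, names_from = round, values_from = weight, names_prefix = "weight_")
```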

6.9 Data cleaning

Data cleaning is typically an iterative process that you should conduct on a variable-by-variable basis. Occasionally it is useful to clean an entire questionnaire module at once, since variables in the same module can be related to one another and since it helps to proceed through the cleaning process in well-defined chunks. We find it helpful to have a copy of the survey questionnaire (if relevant) on hand while cleaning specific variables; this lets you check skip patterns and codes in the survey against the dataset.

We don’t have a lot of general advice for data cleaning except the following. First, there is something to be said for “trolling” (in the fishing sense), or visually scrolling through parts of the dataset, to see what the data look like and to make sure you have a feel for how information is stored in different variables (e.g., are missing values coded, or simply recorded as “.” or NA?). Use the View() utility in RStudio to look at your data often during the cleaning process. Second, it can be helpful to consolidate all data cleaning into a single program for each dataset. The data cleaning program corrects entry errors, labels variables, formats variables, and outputs either a temporary or final dataset (but does not, ever, overwrite the untouched data). As noted in the Workflows chapter, key data processing should be done once, at the earliest possible place in the workflow.
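Here is a minimal sketch of a single cleaning program for one dataset, following the pattern above (read untouched data, clean, write a separate output); the paths, variable names, and correction are hypothetical examples.

```r
library(dplyr)
library(readr)

# read the untouched data (never overwritten)
raw <- read_csv("untouched/enrollment.csv")

clean <- raw %>%
  # correct a known entry error identified during fieldwork
  mutate(age_years = if_else(childid == 1047 & age_years == 99, 9, age_years)) %>%
  # label and format variables
  mutate(
    sex = factor(sex, levels = c(1, 2), labels = c("male", "female")),
    dob = as.Date(dob)
  )

# output a temporary or final dataset; the untouched file is left alone
saveRDS(clean, "final/enrollment.rds")
```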