Chapter 9 Working with Big Data
Contributors: Eric Kim, Kunal Mishra and Jade Benjamin-Chung
9.1 Basics
A pitfall of working in R is that all objects are stored in memory, which makes it very difficult to work with datasets larger than 1-2 GB on most standard computers. Here, we'll explore some strategies for working with big data.
The Berkeley Statistical Computing Facility also has many good training resources.
9.2 Using downsampled data
In studies with very large datasets, we save "downsampled" data that usually includes a 1% random sample stratified by any important variables, such as year or household id. This allows us to efficiently write and test our code without having to load large, slow datasets that can cause RStudio to freeze. Be very careful to keep track of which dataset you are working with and to label results output accordingly.
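For example, a 1% sample stratified by year and household id might be created as in the minimal sketch below, which uses `dplyr`; the `full_data` object and the `year` and `hhid` variables are hypothetical placeholders.

```r
library(dplyr)

set.seed(12345)   # make the downsample reproducible

full_data_1pct <- full_data %>%
  group_by(year, hhid) %>%        # stratify by the important variables
  slice_sample(prop = 0.01) %>%   # keep a 1% random sample within each stratum
  ungroup()

saveRDS(full_data_1pct, "full_data_1pct.RDS")
```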
9.3 Unix
Though bash is most commonly used for managing your file system (see Chapter 11), it is also very capable of basic data manipulation on big data. Because the data stays on disk and bash commands work with the files directly, you avoid overloading memory. By default, these commands print their results to standard output (probably your terminal screen), but you can redirect the results to other files on disk to save them. These commands can also be chained via pipes (represented as `|`, similar to `%>%` in the tidyverse). All of them accept arguments passed in via flags (check the `man` page of each for more details).
| Command | Description |
|---|---|
| `head` / `tail` | Displays the first few or last few rows of a file |
| `cat` | Concatenates files and prints them |
| `sort` | Sorts the lines of a file |
| `cut` | Cuts out portions of each line and prints them |
| `grep` | Finds lines of a file that match a given pattern |
| `sed` | Find and replace |
| `awk` | Similar to `grep` and `sed` but with extra programmatic functionality |
| `uniq` | Collapses repeated adjacent lines (combine with `sort` to get unique rows) |
| `wget` / `curl` | Downloads data/files from websites |
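You can also call these commands from within R. Below is a minimal sketch, assuming a Unix-like shell and a hypothetical comma-separated file `visits.csv` whose first column is `year`; the shell does the filtering on disk, so only the matching rows ever reach R's memory.

```r
# Preview the file without loading it into R
system("head -n 5 visits.csv")

# Count rows per year: cut out the first field, drop the header row,
# sort, and tally the unique values
system("cut -d',' -f1 visits.csv | tail -n +2 | sort | uniq -c")

# Read only the 2019 rows into R, keeping the header line
rows_2019 <- read.csv(
  pipe("awk -F',' 'NR == 1 || $1 == \"2019\"' visits.csv")
)
```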
9.4 SQL and dbplyr
SQL databases are relational databases: collections of tables made up of fields (or attributes), each containing a single data type. If you use `dplyr` a lot, you will find that it is heavily inspired by SQL. Formally, data gets loaded into a database system and stored on disk. This alone makes working with the data fast, but the real efficiency gain comes from indexing. If you are curious, most SQL databases implement their indexes with B-trees or B+ trees, which give logarithmic time complexity for search operations in the average and worst cases and constant time in the best case.
The basic structure of a SQL query is as follows:
SELECT [DISTINCT] (attributes)
FROM (table)
[WHERE (conditions)]
[GROUP BY (attributes) [HAVING (conditions)]]
[ORDER BY (attributes) [DESC]]
The equivalent `dplyr` pipeline looks like this:
table %>%
  filter(conditions) %>%      # WHERE
  group_by(attributes) %>%    # GROUP BY
  filter(conditions) %>%      # HAVING (a filter applied after grouping)
  select(attributes) %>%      # SELECT; use distinct(attributes) for SELECT DISTINCT
  arrange(attributes)         # ORDER BY; arrange(desc(attributes)) for descending
There is ample support for connecting to databases in R; in particular, the `dbplyr` package lets you query a database with `dplyr` code instead of writing SQL.
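As a minimal sketch (assuming the `DBI` and `RSQLite` packages are installed; the table and column names below are made up), you can point `dplyr` verbs at a database table and let `dbplyr` translate them into SQL:

```r
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "visits", data.frame(year = c(2018, 2019), n_visits = c(10, 12)))

visits_db <- tbl(con, "visits")   # a lazy reference; no data is pulled into R yet

query <- visits_db %>%
  filter(year == 2019) %>%
  summarise(total = sum(n_visits, na.rm = TRUE))

show_query(query)   # inspect the SQL that dbplyr generates
collect(query)      # run the query and bring the result into R

dbDisconnect(con)
```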
9.5 data.table and dtplyr
It is often possible to load large datasets into memory in R, but computations on them consume even more memory and can be very slow. One way around this is to use `data.table`. You will find that operations on the data are much faster than in base R or `dplyr`, even though the data is still loaded into memory. This comes from clever programming in C and from `data.table`'s support for keys (the SQL equivalent of an index). You can speed things up further by setting keys on the variables you know you will filter or join on.
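A minimal sketch, reusing the hypothetical `visits.csv` file and its `year`, `hhid`, and `n_visits` columns from earlier:

```r
library(data.table)

dt <- fread("visits.csv")   # fast, multi-threaded read from disk
setkey(dt, year, hhid)      # set keys on variables used for filtering and joins

# Filter on the key and aggregate by group, all in data.table's dt[i, j, by] syntax
dt[year == 2019, .(total_visits = sum(n_visits)), by = hhid]
```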
More recently, the tidyverse team has implemented `dtplyr`, which allows you to use `dplyr` syntax on `data.table` objects.
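For example (reusing the hypothetical `dt` object from the sketch above), `dtplyr` translates `dplyr` verbs into `data.table` calls:

```r
library(dtplyr)
library(dplyr)

dt_lazy <- lazy_dt(dt)   # wrap the data.table; nothing is computed yet

dt_lazy %>%
  filter(year == 2019) %>%
  group_by(hhid) %>%
  summarise(total_visits = sum(n_visits)) %>%
  as_tibble()            # computation happens when the result is collected
```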
An overview of the `dplyr` vs. `data.table` debate can be found in this stackoverflow post; all three answers are worth a read.
9.6 ff, bigmemory, biglm
Sometimes it may be impossible to load the data into memory at all. Because of R's overhead, loading a sufficiently large dataset can require at least twice as much memory as the file takes up on disk, and with everything else your computer needs to keep in RAM just to run, you will run out of space. One way around this is to keep the data on disk and instead create clever data structures that allow natural interfacing with the data by mapping operations to the data on disk. Two R packages that implement these ideas are `ff` and `bigmemory`.
With these packages we can interface with the data without loading it into memory, but we run into issues when we try to fit models. For an \(n \times p\) dataset, linear regression has a time complexity of \(O(np^2 + p^3)\) and a space complexity of \(O(np + p^2)\) (this just means it will take a long time and a lot of space for large \(n\), and even more so for large \(p\)). So even with the data kept on disk, fitting these models in one pass is out of the question. This is where standard solutions used in machine learning (iterative algorithms like stochastic gradient descent) can help. The idea is to take a smaller portion of the data (which fits in memory), fit the regression (which takes a reasonable amount of time), then update the coefficients with a fit on another small portion of the data, repeating until we have passed through all of the data. For GLMs, this can be done with the `biglm` package, which integrates with `ff` and `bigmemory`.
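A minimal sketch of the chunked approach using `biglm` alone (the file name, variable names, and model formula below are made up; in practice you may prefer the `ff`/`bigmemory` integrations):

```r
library(biglm)

chunk_size <- 1e5
con <- file("visits.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]

fit <- NULL
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  if (is.null(fit)) {
    fit <- biglm(n_visits ~ age + year, data = chunk)   # fit on the first chunk
  } else {
    fit <- update(fit, chunk)   # update coefficients with the next chunk
  }
}
close(con)
summary(fit)
```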
9.7 Parallel computing
9.7.1 Embarrassingly Parallel Problems
Sometimes we have to do something in a loop-like structure where each iteration is independent of the others, such as simulations or the bootstrap. These are referred to as embarrassingly parallel problems. In a standard loop, each iteration must wait for the previous one to finish because the loop operates as a queue, which gives a very obvious way to parallelize (hence "embarrassingly"). Every computer these days comes with at least 2 CPU cores, and each core can operate independently, so after some overhead we can speed up our loop by roughly a factor of the number of cores our computer has.
9.7.2 Packages
In R, the popular packages for this are `parallel`, `foreach`, and `doParallel` (the backend that connects `foreach` and `parallel`). More modern parallel computing packages in R are `future` and `furrr` (a blend of `future` and `purrr`, it allows for `purrr`-like syntax using the `future` data structure from its namesake package). In Python, the `Dask` library has similar functionality to R's `future`. Note: the `parallel` package comes with a `detectCores` function, but I sometimes find that it is not accurate. On a Mac, you can manually check the number of cores by going to About This Mac, then System Report, then checking the Total Number of Cores in the Hardware tab.
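As a minimal sketch of an embarrassingly parallel bootstrap using `future` and `furrr` (the statistic and the number of workers here are arbitrary, and the built-in `mtcars` data stands in for real data):

```r
library(future)
library(furrr)

plan(multisession, workers = 4)   # launch 4 background R sessions

boot_means <- future_map_dbl(
  1:1000,
  ~ mean(sample(mtcars$mpg, replace = TRUE)),
  .options = furrr_options(seed = TRUE)   # reproducible parallel random numbers
)

quantile(boot_means, c(0.025, 0.975))

plan(sequential)   # return to sequential processing
```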
9.7.3 GPUs
For most everyday tasks the CPU will be sufficient, but for large problems even an 8x speed boost from a computer with 8 cores might not be enough. This is where GPUs come into play. While CPU cores are good at complex operations, GPU cores are good at performing many small operations, like matrix multiplication, at once. GPUs have hundreds of cores on cheaper graphics cards and thousands on top-end ones, so they are ideal for training machine learning models, particularly neural networks. However, because GPU cores were designed for rendering graphics, we cannot easily access their computing power from R or Python without some translation in between. Graphics manufacturers have been catching up to this market; one of the most popular platforms for parallel computing on GPUs is Nvidia's CUDA, for use with Nvidia graphics cards.
9.7.4 The MapReduce paradigm
The idea of the MapReduce paradigm is that we can distribute the data across many nodes and do the computation on each node's piece of the data. One benefit is that if our data is too large to fit on disk on a single machine, we can spread it across many machines, do our operations in parallel, and aggregate the results back together. We can formalize this paradigm into three steps:
- Map: Split the data into sub-datasets and perform an operation on each entry in each sub-dataset thereby creating key-value pairs.
- Shuffle: Merge the key-value pairs and sort them.
- Reduce: Apply an operation on the associated values for each key.
An excellent example is included at the bottom of this link. A similar paradigm that is implemented in the tidyverse is the split-apply-combine strategy.
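As a toy, in-memory illustration of the three steps, here is the classic word-count example in base R on a made-up character vector; a real MapReduce framework would distribute these steps across nodes.

```r
lines <- c("big data big memory", "memory on disk", "big disk")

# Map: for every line, emit a (word, 1) key-value pair for each word
pairs <- unlist(lapply(strsplit(lines, " "), function(words) {
  setNames(rep(1, length(words)), words)
}))

# Shuffle: group the values by their key (the word)
grouped <- split(unname(pairs), names(pairs))

# Reduce: sum the values associated with each key
counts <- vapply(grouped, sum, numeric(1))
counts
```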
The popular infrastructures for parallel computing with the MapReduce paradigm are Hadoop and Spark (think of Spark as an in-memory version of Hadoop). Spark is easier to interface with than Hadoop: through Python via `PySpark`, and through R via `SparkR` (from Apache) or `sparklyr` (from RStudio). However, note that because Spark is natively implemented in Java and Scala, the serialization between R/Python and Java/Scala can be a time-expensive operation.
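A minimal `sparklyr` sketch (this assumes a local Spark installation, which `sparklyr::spark_install()` can set up; the built-in `mtcars` data stands in for a genuinely large dataset):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")           # connect to a local Spark instance
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()                                     # bring the aggregated result into R

spark_disconnect(sc)
```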
9.8 Optimal RStudio setup
Using the following settings will help ensure a smooth experience when working with big data. In RStudio, go to the “Tools” menu, then select “Global Options”. Under “General”:
Workspace
- Uncheck "Restore .RData into workspace at startup"
- Set "Save workspace to .RData on exit" to "Never"
History
- Uncheck "Always save history"
Unfortunately, RStudio often gets slow and/or freezes after hours of working with big datasets. Sometimes it is much more efficient to run code from the Terminal / Git Bash and make updates in git from there.