Chapter 4 Coding Practices
Contributors: Kunal Mishra, Jade Benjamin-Chung, Ben Arnold
4.1 Organizing scripts
Just as your data “flows” through your project, data should flow naturally through a script. Very generally, you want to:
- describe the work completed in the script in a comment header
- source your configuration file (
0-config.R
) - load all of your data
- do all your analysis/computation in order
- save your results
Each section should be “chunked together” using comments, often with many chunks in a single section. See this file for a good example of how to cleanly organize a file in a way that follows this “flow” and functionally separate pieces of code that are doing different things. This is another example in a .Rmd
format, where chunking is made even more obvious by the interleaving of markdown text and R code in the same notebook file.
4.2 Documenting your code
4.2.1 File headers
Every file in a project should have a header that allows it to be interpreted on its own. It should include the name of the project and a short description for what this file (among the many in your project) does specifically. You may optionally wish to include the inputs and outputs of the script as well, though the next section makes this significantly less necessary. It can be very helpful to include your name and email address as well so others can identify who wrote the code. This is unnecessary if you are using version control (git/GitHub
) becuase that information will be tracked by commits.
#-------------------------------------------------------------------------------
# @Organization - Example Organization
# @Project - Example Project
# @Author - Your name, and possibly email address (if appropriate)
# @Description - This file is responsible for [...]
#-------------------------------------------------------------------------------
Consider using RStudio’s code folding feature to collapse and expand different sections of your code. Any comment line with at least four trailing dashes (-), equal signs (=), or pound signs (#) automatically creates a code section. Delimiters for chunks are a personal preference and all work equally well. For example:
# Section 1 ------------------
works equally well as
##############################
# Section 1 ##################
##############################
4.2.3 Function documentation
Every function you write must include a header to document its purpose, inputs, and outputs. For any reproducible workflows, they are essential, because R is dynamically typed. This means, you can pass a string
into an argument that is meant to be a data.table
, or a list
into an argument meant for a tibble
. It is the responsibility of a function’s author to document what each argument is meant to do and its basic type. This is an example for documenting a function (inspired by JavaDocs, R’s Plumber API docs, and Roxygen2):
#-----------------------------------------------------------------------------
# Documentation: calc_fluseas_mean
# Usage: calc_fluseas_mean(data, yname)
# @description: Make a dataframe with rows for flu season and site
# and the number of patients with an outcome, the total patients,
# and the percent of patients with the outcome
# Arguments/Options:
# @param data: a data frame with variables flu_season, site, studyID, and yname
# @param yname: a string for the outcome name
# @param silent: a boolean specifying whether the function shouldn't output anything to the console (DEFAULT: TRUE)
# @return: the dataframe as described above
# @output: prints the data frame described above if silent is not True
#-----------------------------------------------------------------------------
calc_fluseas_mean = function(data, yname, silent = TRUE) {
### function code here
}
The header tells you what the function does, its various inputs, and how you might go about using the function to do what you want. Also notice that all optional arguments (i.e. ones with pre-specified defaults) follow arguments that require user input.
Note: As someone trying to call a function, it is possible to access a function’s documentation (and internal code) by
CMD-Left-Click
ing the function’s name in RStudioNote: Depending on how important your function is, the complexity of your function code, and the complexity of different types of data in your project, you can also add “type-checking” to your function with the
assertthat::assert_that()
function. You can, for example,assert_that(is.data.frame(statistical_input))
, which will ensure that collaborators or reviewers of your project attempting to use your function are using it in the way that it is intended by calling it with (at the minimum) the correct type of arguments. You can extend this to ensure that certain assumptions regarding the inputs are fulfilled as well (i.e. thattime_column
,location_column
,value_column
, andpopulation_column
all exist within thestatistical_input
tibble).
4.3 Object naming
Generally we recommend using nouns for objects and verbs for functions. This is because functions are performing actions, while objects are not.
Try to make your variable names both more expressive and more explicit. Being a bit more verbose is useful and easy in the age of autocompletion! For example, instead of naming a variable vaxcov_1718
, try naming it vaccination_coverage_2017_18
. Similarly, flu_res
could be named absentee_flu_residuals
, making your code more readable and explicit.
- For more help, check out Be Expressive: How to Give Your Variables Better Names
We recommend you use Snake_Case.
Base R allows
.
in variable names and functions (such asread.csv()
), but this goes against best practices for variable naming in many other coding languages. For consistency’s sake,snake_case
has been adopted across languages, and modern packages and functions typically use it (i.e.readr::read_csv()
). As a very general rule of thumb, if a package you’re using doesn’t usesnake_case
, there may be an updated version or more modern package that does, bringing with it the variety of performance improvements and bug fixes inherent in more mature and modern software.Note: you may also see
camelCase
throughout the R code you come across. This is okay but not ideal – try to stay consistent across all your code withsnake_case
.Note: there is nothing inherently wrong with using
.
in variable names, just that it goes against style best practices that are cropping up in data science, so its worth getting rid of these bad habits now.
4.4 Function calls
In a function call, use “named arguments” and separate arguments by to make your code more readable.
Here’s an example of what not to do when calling the function a function calc_fluseas_mean
(defined above):
mean_Y = calc_fluseas_mean(flu_data, "maari_yn", FALSE)
And here it is again using the best practices we’ve outlined:
mean_Y = calc_fluseas_mean(
data = flu_data,
yname = "maari_yn",
silent = FALSE
)
4.5 The here
package
The here
package is one great R package that helps multiple collaborators deal with the mess that is working directories within an R project structure. Let’s say we have an R project at the path /home/oski/Some-R-Project
. My collaborator might clone the repository and work with it at some other path, such as /home/bear/R-Code/Some-R-Project
. Dealing with working directories and paths explicitly can be a very large pain, and as you might imagine, setting up a Config with paths requires those paths to flexibly work for all contributors to a project. This is where the here
package comes in and this a great vignette describing it.
For more motivation on why you should use the here
and R projects (.Rproj
), read this excellent blog post from Tidyverse.
4.6 Tidyverse
Throughout this document there have been references to the Tidyverse, but this section is to explicitly show you how to transform your Base R tendencies to Tidyverse (or Data.Table, Tidyverse’s performance-optimized competitor). For most of our work that does not utilize very large datasets, we recommend that you code in Tidyverse rather than Base R. Tidyverse is quickly becoming the gold standard in R data analysis and modern data science packages and code should use Tidyverse style and packages unless there’s a significant reason not to (i.e. big data pipelines that would benefit from Data.Table’s performance optimizations).
The package author has published a great textbook on R for Data Science, which leans heavily on many Tidyverse packages and may be worth checking out.
The following list is not exhaustive, but is a compact overview to begin to translate Base R into something better:
Base R | Better Style, Performance, and Utility |
---|---|
_ | _ |
read.csv() |
readr::read_csv() or data.table::fread() |
write.csv() |
readr::write_csv() or data.table::fwrite() |
readRDS |
readr::read_rds() |
saveRDS() |
readr::write_rds() |
_ | _ |
data.frame() |
tibble::tibble() or data.table::data.table() |
rbind() |
dplyr::bind_rows() |
cbind() |
dplyr::bind_cols() |
df$some_column |
df %>% dplyr::pull(some_column) |
df$some_column = ... |
df %>% dplyr::mutate(some_column = ...) |
df[get_rows_condition,] |
df %>% dplyr::filter(get_rows_condition) |
df[,c(col1, col2)] |
df %>% dplyr::select(col1, col2) |
merge(df1, df2, by = ..., all.x = ..., all.y = ...) |
df1 %>% dplyr::left_join(df2, by = ...) or dplyr::full_join or dplyr::inner_join or dplyr::right_join |
_ | _ |
str() |
dplyr::glimpse() |
grep(pattern, x) |
stringr::str_which(string, pattern) |
gsub(pattern, replacement, x) |
stringr::str_replace(string, pattern, replacement) |
ifelse(test_expression, yes, no) |
if_else(condition, true, false) |
Nested: ifelse(test_expression1, yes1, ifelse(test_expression2, yes2, ifelse(test_expression3, yes3, no))) |
case_when(test_expression1 ~ yes1, test_expression2 ~ yes2, test_expression3 ~ yes3, TRUE ~ no) |
proc.time() |
tictoc::tic() and tictoc::toc() |
stopifnot() |
assertthat::assert_that() or assertthat::see_if() or assertthat::validate_that() |
For a more extensive set of syntactical translations to Tidyverse, you can check out this document.
Working with Tidyverse within functions can be somewhat of a pain due to non-standard evaluation (NSE) semantics. If you’re an avid function writer, we’d recommend checking out the following resources:
- Tidy Eval in 5 Minutes (video)
- Tidy Evaluation (e-book)
- Data Frame Columns as Arguments to Dplyr Functions (blog)
- Standard Evaluation for *_join (stackoverflow)
- Programming with dplyr (package vignette)
4.7 Coding with R and Python
If you’re using both R and Python, you may wish to check out the Feather package for exchanging data between the two languages extremely quickly.
4.2.2 Comments in the body of your script
Commenting your code is an important part of reproducibility and helps document your code for the future. When things change or break, you’ll be thankful for comments. When you revisit code you wrote two years earlier, you’ll be thankful for comments. There’s no need to comment excessively or unnecessarily, but a comment describing what a large or complex chunk of code does is always helpful. See this file for an example of how to comment your code and notice that comments are always in the form of:
# This is a comment -- first letter is capitalized and spaced away from the pound sign