Proctor Data Science Handbook
Welcome!
1 Introduction: Work Flow and Reproducible Analyses
1.1 Workflow
1.2 Reproducibility
1.3 Automation
2 Workflows
3 Directory Structure and Code Repositories
3.1 Small and large projects
3.2 Directory Structure
3.2.1 First level: data and analyses
3.2.2 Second level: data
3.2.3 Second level: projects
3.3 Code Repositories
3.3.1 .Rproj files
3.3.2 Configuration (‘config’) File
3.3.3 Shared Functions File
3.3.4 Order Files and Subdirectories
3.3.5 Use Bash scripts or R scripts to ensure reproducibility
3.3.6 Alternative approach for code repos
4 Coding Practices
4.1 Organizing scripts
4.2 Documenting your code
4.2.1 File headers
4.2.2 Comments in the body of your script
4.2.3 Function documentation
4.3 Object naming
4.4 Function calls
4.5 The here package
4.6 Tidyverse
4.7 Coding with R and Python
5 Coding Style
5.1 Line breaks
5.2 Automated Tools for Style and Project Workflow
5.2.1 Styling
6 Data Wrangling
6.1 Overview
6.2 Cardinal rule
6.3 Data input/output (I/O)
6.3.1 Excel files
6.3.2 .RDS vs .RData Files
6.3.3 .CSV Files
6.4 Documenting datasets
6.5 Mapping data from untouched -> final
6.6 Relational data
6.7 Be careful with joins / merges
6.8 Reshaping data
6.9 Data cleaning
7 Reproducible Environments
7.1 Basics
7.2 renv
7.3 Docker
7.4 Putting renv and Docker together
8 Making Data Public
8.1 Overview
8.2 Removing PHI
8.2.1 Personal information
8.2.2 Dates
8.2.3 Geographic information
8.3 Create public IDs
8.3.1 Rationale
8.3.2 A single set of public IDs for each study
8.3.3 Example scripts
8.4 Create a data repository
8.4.1 Steps for creating an Open Science Framework (OSF) repository:
8.4.2 Steps for creating a Dryad data repository:
8.5 Edit and test analysis scripts
8.6 Create a public GitHub page for public scripts
8.7 Go live
9 Working with Big Data
9.1 Basics
9.2 Using downsampled data
9.3 Unix
9.4 SQL and dbplyr
9.5 data.table and dtplyr
9.6 ff, bigmemory, biglm
9.7 Parallel computing
9.7.1 Embarrassingly Parallel Problems
9.7.2 Packages
9.7.3 GPUs
9.7.4 The MapReduce paradigm
9.8 Optimal RStudio set up
10 GitHub and Version Control
10.1 Basics
10.2 Git Branching
10.3 Example Workflow
10.4 Commonly Used Git Commands
10.5 How often should I commit?
10.6 What should be pushed to GitHub?
10.7 How should I describe my commit?
11 UNIX Commands
11.1 Environment
11.2 Basics
11.3 Syntax for both Mac/Windows
11.4 Running Bash Scripts
11.5 Running Rscripts in Windows
11.5.1 Common Mistakes
11.6 Checking tasks and killing jobs
11.7 Running big jobs
12 Building Automated Markdown Reports
12.1 Setting up the markdown
12.1.1 YAML
12.1.2 Directory structure
12.2 Automating the data export process
12.3 Loading the data
12.4 Data cleaning/processing
12.5 Data monitoring
12.5.1 Data validation
12.5.2 Data presentation (DSMC-style)
12.5.3 Appendix (if necessary)
12.6 Summary
13 Communication and Coordination
13.1 Slack
13.2 Email
13.3 Trello
13.4 Google Drive
13.5 Calendar / Meetings
14 Code of conduct
14.1 Group culture
14.2 Protecting human subjects
14.3 Authorship
14.4 Work hours
15 Additional Resources
References
Proctor Foundation Data Science Handbook