Chapter 10 GitHub and Version Control

Contributors: Stephanie Djajadi, Nolan Pokpongkiat, and Ben Arnold

10.1 Basics

Git is a version control system. It has very good documentation online: https://git-scm.com/doc

If you get into trouble with Git, the docs will help a lot!

Git is often used in conjunction with GitHub, which is an online platform to help teams collaborate while using Git for version control: https://github.com.

  • A detailed tutorial of Git can be found here on UC Berkeley’s CS61B website.
  • If you are already familiar with Git, you can reference the summary at the end of Section B.
  • If you have made a mistake in Git, you can refer to this article to undo, fix, or remove commits in git.

There have been some great articles in PLOS Computational Biology that describe using GitHub for academic research. These papers are great introductions and summary of best practices!

Blischak JD, Davenport ER, Wilson G. A Quick Introduction to Version Control with Git and GitHub. PLoS Comput Biol. 2016;12: e1004668. https://www.ncbi.nlm.nih.gov/pubmed/26785377

Perez-Riverol Y, Gatto L, Wang R, Sachsenberg T, Uszkoreit J, Leprevost F da V, et al. Ten Simple Rules for Taking Advantage of Git and GitHub. PLoS Comput Biol. 2016;12: e1004947. https://www.ncbi.nlm.nih.gov/pubmed/27415786

10.2 Git Branching

A terrific overview of branching workflow and its rationale is here: https://guides.github.com/introduction/flow/

Branches allow you to keep track of multiple versions of your work simultaneously, and you can easily switch between versions and merge branches together once you’ve finished working on a section and want it to join the rest of your code. Here are some cases when it may be a good idea to branch:

  • You may want to make a dramatic change to your existing code (called refactoring) but it will break other parts of your project. But you want to be able to simultaneously work on other parts or you are collaborating with others, and you don’t want to break the code for them.
  • You want to start working on a new part of the project, but you aren’t sure yet if your changes will work and make it to the final product.
  • You are working with others and don’t want to mix up your current work with theirs, even if you want to bring your work together later in the future.

A detailed tutorial on Git Branching can be found here. You can also find instructions on how to handle merge conflicts when joining branches together.

10.3 Example Workflow

A standard workflow when starting on a new project and contributing code looks like this:

Command Description
SETUP: FIRST TIME ONLY: git clone <url> <directory_name> Clone the repo. This copies of all the project files in its current state on Github to your local computer.
1. git pull origin master update the state of your files to match the most current version on GitHub
2. git checkout -b <new_branch_name> create new branch that you’ll be working on and go to it
3. Make some file changes work on your feature/implementation
4. git add <filename> add file to stage for commit
5. git commit -m <commit message> commit file with a message
6. git push -u origin <branch_name> push branch to remote and set to track (-u only works if this is first push)
7. Repeat step 4-5. work and commit often
8. git push push work to remote branch for others to view
9. Follow the link given from the git push command to submit a pull request (PR) on GitHub online PR merges in work from your branch into master
(10.) Your changes and PR get approved, your reviewer deletes your remote branch upon merging
11. git fetch --all --prune clean up your local git by untracking deleted remote branches

Other helpful commands are listed below.

10.4 Commonly Used Git Commands

Command Description
git clone <url> <directory_name> clone a repository, only needs to be done the first time
git pull origin master pull before making any changes
git branch check what branch you are on
git branch -a check what branch you are on + all remote branches
git checkout -b <new_branch_name> create new branch and go to it (only necessary when you create a new branch)
git checkout <branch name> switch to branch
git add <file name> add file to stage for commit
git commit -m <commit message> commit file with a message
git push -u origin <branch_name> push branch to remote and set to track (-u only works if this is first push)
git branch --set-upstream-to origin <branch_name> set upstream to origin/ (use if you forgot -u on first push)
git push origin <branch_name> push work to branch
git checkout --track origin/<branch_name> pulls a remote branch and creates a local branch to track it (use when trying to pull someone else’s branch onto your local computer)
git push --delete <remote_name> <branch_name> delete remote branch
git branch -d <branch_name> deletes local branch, -D to force
git fetch --all --prune untrack deleted remote branches

10.5 How often should I commit?

Stephanie and Nolan (trained in CS and Data Science) suggest it is good practice to commit every 15 minutes (a time-based guideline), or every time you make a significant change (progress-based guideline). Ben’s perspective aligns with this view, but is weighted toward committing around completion of discrete chunks of work; for him, a discrete chunk of work will often take quite a bit longer than 15 minutes time. Take home message: It is better to commit more rather than less.

10.6 What should be pushed to Github?

In general, it is better to track text-based files (.R, .Rmd, .md,.txt,etc…) compared with binary files (.pdf,.png,.docx, etc…) because Git will store changes to a binary file as a completely new file in your Git directory. If you store 100 versions of the same binary file, your directory will quickly become very bloated. If the binary files don’t change often, then you could consider including them under version control, but it is usually cleaner to keep them under a separate version control, such as through an Open Science Framework project with a specific data component.

Be careful before you push .Rout log files! If someone else runs an R script and creates an .Rout file at the same time and both of you try to push to github, it is incredibly difficult to reconcile these two logs. If you run logs, keep them on your own system or (preferably) set up a shared directory where all logs are name and date timestamped.

There is a standardized .gitignore for R which you can download and add to your project. This ensures you’re not committing log files or things that would otherwise best be left ignored to GitHub. This is a great discussion of project-oriented workflows, extolling the virtues of a self-contained, portable projects, for your reference.

10.7 How should I describe my commit?

When you commit, always include a short commit message that describes what the commit does. In the command line, you can achieve this after you have staged files to commit with the commit -m <"your commit message here"> syntax. This helps track-back through work flow. For example, if the commit message is new, that doesn’t provide any information about what the commit includes. A more descriptive commit message would be Create first draft of Fig 1 distribution plot or Change color scheme for distribution plot.

For more lengthy and detailed commit messages, which go beyond the simple, single line commit -m <your commit message here> synatax, this Medium post on the anatomy of a good commit message includes additional discussion. (Note, we don’t typically use commits that are this detailed!)