Why care about project organization?

Project organization is unglamorous but extremely important. Organizing your project files well from the start will make life much easier for you and anyone else who might be collaborating on the project. Many times in my life I’ve come back to a project that I previously looked at weeks, months or years ago, and had a very difficult time figuring out what I was doing and where I’d put critical files. A lot of this pain could have been avoided by following the suggestions in this notebook!

Starting a new project

Let’s say that you’re about to start a project on memory for pitch, and you’d like to create a new repository called pitchmemory. Danielle Navarro has created a template that you can use to initialize your new repository. Go to https://github.com/djnavarro/newproject/ and click on the green button labelled “Use this template”:

A window will come up allowing you to name the new repository. Call it pitchmemory. Now navigate into your Desktop/CHDSS folder and clone the repository. For example, I’d type:

git clone https://github.com/cskemp/pitchmemory.git

In your own command, replace cskemp with your GitHub username.

Folder structure

If you open up the pitchmemory folder you’ll see something like:

Each folder includes a README, so poking around and reading them will give you a sense of the intended purpose of each folder. Here we’ll give a high-level overview.

experiments

If this is an experimental project, the experiments folder will contain a sub-folder for each experiment. These sub-folders include source code for the experiment (if it is computerized) and raw data.

preprocessing

Scripts for cleaning the data are in preprocessing. These scripts read from the raw data files in experiments and write to data. You should never edit the raw data files manually: all preprocessing (e.g. dropping incomplete records) should be done by scripts, so that others can reproduce exactly what you did.
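As a rough illustration of this flow (read from experiments, write to data), here is a minimal shell sketch. In practice your preprocessing scripts will probably be written in R. The file names (exp1, raw.csv, exp1_clean.csv) and the column layout are made up for the example, and the project lives in a temporary folder so that the sketch can be run anywhere.

```shell
# Hypothetical project skeleton in a temporary folder (names are illustrative only)
proj="$(mktemp -d)"
mkdir -p "$proj/experiments/exp1" "$proj/data"
raw="$proj/experiments/exp1/raw.csv"
clean="$proj/data/exp1_clean.csv"

# Made-up raw data: participant 2 has a missing pitch value
printf '%s\n' 'id,pitch,response' '1,440,same' '2,,different' '3,523,same' > "$raw"

# Keep the header plus every row in which all three fields are non-empty.
# The raw file is only read; the cleaned copy is written to data/.
awk -F',' 'NR == 1 || ($1 != "" && $2 != "" && $3 != "")' "$raw" > "$clean"

cat "$clean"
```

Because the raw file is never modified, anyone can rerun the script and arrive at the same cleaned file.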

data

The data folder includes cleaned data files only.

analysis

Scripts for analyzing the cleaned data and running statistical tests go in the analysis folder.

model

If the project includes one or more computational models, they go in the model folder.

writeup

The writeup folder includes talks, posters and manuscripts that describe the project.

docs

The “docs” folder on GitHub is special: if your project is set up to allow it, any files in this folder will appear as a GitHub Pages website. We therefore suggest reserving this folder in case you’d like to put the project on the web at some stage, and placing project-related documents in other directories (e.g. writeup).
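To make the layout concrete, the folders described above could be recreated by hand as follows. This is only a sketch: it uses the folder names from this section, places an empty README.md in each folder (the file name is an assumption), and works in a temporary directory rather than in a real clone of the template.

```shell
# Work in a throwaway location so nothing in a real project is touched
root="$(mktemp -d)/pitchmemory"

for dir in experiments preprocessing data analysis model writeup docs; do
  mkdir -p "$root/$dir"
  # Placeholder README; describe the purpose of the folder here
  : > "$root/$dir/README.md"
done

ls "$root"
```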

Other organisation ideas

The newproject template is a good starting point but it’s worth thinking about how you could add additional structure for your own projects. For example,

For more thoughts about project organisation, see this guide to reproducible code.

File names

File names matter, and folders and files should be named according to a consistent scheme that is easily readable by both humans and machines. For human readability,

For machine readability.
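One common convention for machine-readable names is to stick to lowercase letters, digits, dots, hyphens and underscores, with no spaces. This convention is an assumption here rather than a rule stated in this notebook. A small shell check along those lines might look like the sketch below; the function name is hypothetical.

```shell
# Use the C locale so that a-z matches ASCII lowercase letters only
LC_ALL=C
export LC_ALL

# Hypothetical helper: succeeds if the name contains only a-z, 0-9, '.', '_' and '-'
is_machine_friendly() {
  case "$1" in
    "" | *[!a-z0-9._-]*) return 1 ;;
    *) return 0 ;;
  esac
}

for name in "exp1_raw-data.csv" "Final Data (v2).csv"; do
  if is_machine_friendly "$name"; then
    echo "ok:     $name"
  else
    echo "rename: $name"
  fi
done
```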

Making your repository public

Some people make their project repositories public from the very beginning — this approach is consistent with the philosophy behind born-open data.

In my group we normally start with a private repository, but aim to make the repository public when a project is submitted or accepted. If you follow this approach it’s important to remember that GitHub keeps a record of all the changes you’ve made along the way as you work on your repository. This is a feature, not a bug! It means that you can easily go back to previous versions of your files if you realise that you’ve deleted something that you need. But it does mean that you should be careful about what goes into the repository.

For example, suppose that you add a data file to the repository that includes personal information about your participants. Removing all of the sensitive information before you make the repository public is somewhat tricky: scrubbing the current version of the files is not enough, because you’d also need to remove the information from the repository history. Besides, if the information were really sensitive it should probably not have been uploaded to the cloud in the first place.
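The point about history can be checked with a small experiment. The sketch below assumes git is installed. It builds a throwaway repository in a temporary folder, commits a made-up file called participants.csv, deletes it in a second commit, and then shows that the file is still recoverable from the history. The name and email passed to git are placeholders.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Placeholder identity so the commits work without any global git configuration
commit() {
  git -c user.name="Example" -c user.email="example@example.com" commit -q -m "$1"
}

# Made-up file containing personal information
printf '%s\n' 'id,name' '1,Jane Doe' > participants.csv
git add participants.csv
commit "Add participant data"

# Deleting the file only removes it from the current version
git rm -q participants.csv
commit "Remove participant data"

# Both commits still mention the file...
git log --oneline -- participants.csv

# ...and the old contents can still be printed from the previous commit
git show HEAD~1:participants.csv
```

Getting rid of the file for good would mean rewriting the history itself, which is why it is much easier to keep such files out of the repository from the start.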

RStudio projects

The folder for the pitchmemory project is now well on the way, and normally you will also have folders for several other projects that you work on from time to time. It’s useful to turn each of these projects into an RStudio project. According to the developers, these RStudio projects

make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents

Normally, R keeps track of various different events that have happened in your previous work, but it has no idea which events are associated with which “project”. An RStudio project creates a .Rproj file that links the different scripts, data sets, etc within a particular folder on your computer. So, why work in projects? As I see it, there are two kinds of reason, both of which are valid:

Creating a project

The pitchmemory folder already has a file called newproject.Rproj that came from Dani’s template. Rename this file to pitchmemory.Rproj to match the name of your repository.
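From a terminal, the rename might look like the following sketch. It uses a temporary stand-in for the cloned pitchmemory folder so that it can be run safely. In a real clone you would run only the mv line (or use git mv instead, so that git records the rename).

```shell
# Hypothetical stand-in for the cloned repository
workdir="$(mktemp -d)"
mkdir -p "$workdir/pitchmemory"
cd "$workdir/pitchmemory"
: > newproject.Rproj

# Rename the project file to match the repository name
mv newproject.Rproj pitchmemory.Rproj

ls
```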

To see how to create an RStudio project from scratch, let’s add one to the summerschool repository you created when learning about Git. Go to the little blue menu in the top right corner of RStudio, click on the dropdown menu, and select “New Project”.

This will bring up a dialog box that provides a few different options. Because we’re going to work in the existing folder that we created for the git tutorial (~/Desktop/CHDSS/summerschool), select “existing directory” and then browse for the correct location.

Once you’ve created the project, if you have a look at the folder in Windows Explorer / Mac Finder, you’ll see a new file called summerschool.Rproj:

Done. You now have an RStudio project. Any time you want to switch between projects, use the drop down menu. RStudio will automatically change the working directory, start a new R session for you, and open up whatever files you had open last time you were using it.

R Markdown documents

Documentation is often written using markdown — for example, you used markdown earlier today when writing the README for your summerschool repository. R Markdown allows you to combine documentation and code in a way that is extremely convenient.

Let’s create an R Markdown document in your summerschool repository. First switch to the summerschool project in RStudio if you are not there already. Then use the RStudio file menu to create a new R Markdown document:

Give your document a title and choose an output type, which we’ll assume to be HTML.

This creates an untitled R Markdown document in the source pane:

Save it in your summerschool repository with the name exploration.Rmd. So now you have this:

Render your R Markdown document by clicking the Knit button. The result is an .html file that can be seen in the Viewer pane on the right side of RStudio.

Structure of an R Markdown document

  • the bit at the top is the “yaml header” (ignore it for now)
  • anything shaded in grey (between the backticks) is a code chunk and is treated just like an R script
  • anything in white is treated like Markdown.

Play around with this for a bit, and get a feel for how it works!

Using an R Markdown document

If you’re working interactively with an R Markdown document, it’s often useful to run a single code chunk. You can do this by clicking the “Run Current Chunk” button (small right-facing green arrow at the top of the chunk).

First, though, you’ll often need to run the previous chunks so that the variables they compute are available. You can achieve this using the “Run All Chunks Above” button (the one with the arrow pointing downward).

Suppose you Knit a document and see an error — e.g. because one chunk uses a function from a library that you haven’t installed. If you don’t want to fix the problem you can set eval=FALSE at the top of the chunk so that the chunk is not run when Knit is invoked. Other chunk options are described here. One worth knowing about is cache=TRUE, which means that a chunk gets rerun only if it has changed since the last time you ran it. Caching a chunk is handy if the chunk takes a long time to run.

When to use R Markdown documents

R Markdown can be used for many purposes, and we give only a handful of examples here.

Resources