Good Workflow Practices for Reproducible Data Science

Dr. Mine Dogucu

Naming files

Three principles of naming files

  • machine readable
  • human readable
  • plays well with default ordering (e.g. alphabetical and numerical ordering)

(Jenny Bryan)

for the purposes of this course an additional principle is that file names follow

  • tidyverse style (all lower case letters, words separated by HYPHEN)

README.md

  • README file is the first file users read. In our case a user might be our future self, a teammate, or (if open source) anyone.

  • There can be multiple README files within a single directory: e.g. for the general project folder and then for a data subfolder. Data folder README’s can possibly contain codebook (data dictionary).

  • It should be brief but detailed enough to help user navigate.

  • a README should be up-to-date (can be updated throughout a project’s lifecycle as needed).

  • On GitHub we use markdown for README file (README.md). Good news: emojis are supported.

README examples

R Packages

Default

Microsoft products have Copyright. Images used based on fair use for educational purposes.

R Packages

Optional

R packages

  • When you download R, you actually download base R.

  • But there are MANY optional packages you can download.

  • Good part: There is an R package for (almost) everything, from complex statistical modeling packages to baby names.

  • Bad part: At the beginning it can feel overwhelming.

  • All this time we have actually been using R packages.

R packages

What do R packages have? All sorts of things but mainly

  • functions

  • datasets

Exercise

Refer back to older slides and identify the packages for each of the functions we commonly use in this class.

Importing .csv Data

readr::read_csv("dataset.csv")

Importing Excel Data

readxl::read_excel("dataset.xlsx")

Importing Excel Data

readxl::read_excel("dataset.xlsx", sheet = 2)

Importing SAS, SPSS, Stata Data

library(haven)
# SAS
read_sas("dataset.sas7bdat")
# SPSS
read_sav("dataset.sav")
# Stata
read_dta("dataset.dta")

Where is the dataset file?

Importing data will depend on where the dataset is on your computer. However we use the help of here::here() function. This function sets the working directory to the project folder (i.e. where the .Rproj file is).

read_csv(here::here("data/dataset.csv"))

Collaboration on GitHub

Collaboration on GitHub

Collaboration on GitHub

Collaboration on GitHub

Collaboration on GitHub

Collaboration on GitHub

Collaboration on GitHub

Collaboration on GitHub

If each change is made by one collaborator at a time, this would not be an efficient workflow.

Collaboration on GitHub

Collaboration on GitHub

Collaboration on GitHub

Collaboration on GitHub

1 - commit

2 - pull (very important)

3 - push

Collaboration on GitHub

Collaboration on GitHub

Collaboration on GitHub

Opening an issue

We can create an issue to keep a list of mistakes to be fixed, ideas to check with teammates, or note a to-do task. You can assign tasks to yourself or teammates.

Closing an issue

If you are working on an issue, it makes sense to refer to issue number in your commit message (e.g. “add first draft of alternate texts for #4”). If your commit resolves the issue then you can use key words such as “fixes #4” or “closes #4” to close the issue. Issues can also be manually closed.

.gitignore

A .gitignore file contains the list of files which Git has been explicitly told to ignore.

For instance README.html can be git ignored.

You may consider git ignoring confidential files (e.g. some datasets) so that they would not be pushed by mistake to GitHub.

A file can be git ignored either by point-and-click using RStudio’s Git pane or by adding the file path to the .gitignore file. For instance weather.csv data file in a data folder need to be added as data/weather.csv

Files with certain files (e.g. all .log files) can also be ignored. See git ignore patterns.

It is also a good practice to save session information as package versions change, in order to be able to reproduce results from an analysis we need to know under what technical conditions the analysis was conducted.

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS 15.1.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_4.4.1    fastmap_1.2.0     cli_3.6.3         tools_4.4.1      
 [5] htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10       rmarkdown_2.29   
 [9] knitr_1.49        jsonlite_1.8.9    xfun_0.49         digest_0.6.37    
[13] rlang_1.1.4       evaluate_1.0.1