Lessons learned from working reproducibly with others
In September 2022, the Open Pediatric Brain Tumor Atlas (OpenPBTA) project culminated (for now) in a preprint on bioRxiv. This project, started in late 2019 and co-organized with the Center for Data Driven Discovery in Biomedicine (D3b) at Children’s Hospital of Philadelphia (CHOP), is a collaborative effort to comprehensively describe the Pediatric Brain Tumor Atlas (PBTA), a collection of multiple data types from tens of tumor types (read more about why crowdsourcing expertise for the study of pediatric brain tumors is important here). The project is designed to allow for contributions from experts across multiple institutions. We’ve conducted analysis and drafting of the manuscript openly on the version-control platform GitHub from the project’s inception to facilitate those contributions.
We’ve written about OpenPBTA in our blog before, documenting some of what we found challenging. And we’ve shared some of how we think about processes and what we think goes into setting up data-intensive research for success. In this post, we’ll reflect on how carrying out the work of OpenPBTA influenced the Data Lab’s own philosophy and processes in the time since its inception.
Previously unknown mechanisms for collaboration
We wanted to allow people beyond the core group of organizers to participate in the project, but how did we facilitate their contributions? An overview can be found in the figure below.
There are three main steps a contributor takes (although it’s not always so linear):
1. Proposing an analysis
2. Implementing an analysis
3. Recording the analysis (i.e., the methods and results) in the manuscript
Different technologies or platforms underlie each of the steps illustrated in the figure above, and contributors receive feedback throughout this process. To propose and ultimately discuss an analysis, we take advantage of the fact that GitHub allows you to file and comment on issues alongside the code it hosts in a repository. Contributors write analytical code (e.g., R Notebooks) in their own fork of the repository hosted on GitHub (a fork is a copy of the repository that a contributor manages themselves). The analytical code is reviewed using GitHub functionality called a pull request. You can think of a pull request as a contributor asking to add the code additions or changes they wrote in their fork to the main repository. Pull requests allow reviewers to comment on individual sections of code as well as on the changes overall (more on what else happens during pull requests below). Finally, we use Manubot to manage the “source text” of the manuscript itself on GitHub, and the same issue and pull request workflow applies.
Between you and me, the arrows in the diagram are really doing some heavy lifting! Working on a project that spans multiple years, involves multiple contributors, and relies on data you expect to change over time presents some obstacles. For the purpose of this post, we’ll focus on the analysis portion of the project. Some considerations, in no particular order:
- Contributors need to be onboarded asynchronously, i.e., through documentation.
- All contributors need access to the most recent version of the data, including the output of analytical code (e.g., putative oncogenic fusions, consensus copy number, tumor mutation burden).
- Ideally, contributors would have a consistent computing environment and software versions. At the very least, the code that produces the data and figures underlying or included in the manuscript must be run in a consistent environment as a final step.
- The underlying code must be robust to changes that arise when new data is released.
- Code needs to be read, understood, and very possibly fixed by someone who didn’t write it.
We fully expect that some of these challenges overlap with challenges documented in open-source software development, but our science team had little to no experience in that realm! Said another way, these mechanisms were new to us, but they’re not new overall.
Documentation for you (contributor) and me (organizer)
For the first point – onboarding new contributors at any time day or night – we need to talk about the documentation included in the project. Some of the components for the analysis repository include (not at all an exhaustive list):
- A friendly introduction (i.e., what’s the project all about?)
- Information about the data themselves
- How to participate, including how to propose an analysis
- How to add an analysis (e.g., what are the folder and script or notebook conventions)
- How to file a pull request
- How peer review is intended to work
Individual analysis modules tend to have their own assumptions and ways they are intended to be run that are worth documenting. This practice is particularly handy when someone other than the main author of the module needs to make revisions. Let’s take an analysis module called [.inline-snippet]transcriptomic-dimension-reduction[.inline-snippet] as an example. As you may have guessed from the name, this module performs dimension reduction on the RNA-seq data included in the project and makes associated plots. The README for this module includes the following information:
- How the entire module is intended to be run. We assume that anyone interacting with this module is most likely going to take this action.
- How to run UMAP multiple times from the command line using different seeds.
- How to use functions in the module. For example, a contributor might want to plot more than just the first two Principal Components.
To summarize, the README covers how you expect the module to be run most of the time and how to use its constituent parts to do other things you might reasonably need to do over the course of the project.
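To give a flavor of the seed-setting piece, the core of a seeded UMAP run might look roughly like the following. This is a sketch only, not the module’s actual code, and the object and file names are placeholders:

```r
# Sketch only (not the module's actual code): run UMAP on an expression
# matrix with an explicit seed so the embedding is reproducible.
# The file name below is a placeholder.
library(uwot)

# Assume rows are samples and columns are genes/features
expression_matrix <- readRDS("data/rnaseq-expression-matrix.rds")

set.seed(2019)  # a different seed yields a different (but equally valid) embedding
umap_embedding <- umap(expression_matrix, n_neighbors = 15, n_components = 2)
```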
To be completely frank, we don’t know if the documentation throughout this project is adequate – we haven’t talked to everyone who has read it, so we don’t know who has found it challenging to follow and why. Like many things included in this post, we’ve taken some documentation lessons from OpenPBTA into projects we’ve started since.
Creating a single source of data truth
Let’s talk in more detail about what specific technical challenges arise during the analysis portion of a project of this nature and the solutions we implemented, starting with a “single source of truth” for the project data.
We need to make sure that anyone working on the project has access to the most up-to-date data. That most recent version serves as our single source of truth – all analyses, figures, etc. should use this data version as the starting point – which means we need to include any results that we expect will be used in multiple analyses by multiple contributors. For example, if the tumor mutation burden (TMB) metric calculated in one analysis module will be used in one or more additional analyses, those values need to be included in the single source of truth. Otherwise, a given contributor's results will be derived from whatever version of the TMB file they happen to have locally.
The main process we utilize, which you may be familiar with if you’ve worked on any large data projects, is data releases. In this case, a data release is a set of data files frozen at a particular point in time. Each data release comes with its own set of release notes documenting the version number, date of release, changes between releases, and included files. These releases are available via CAVATICA, as well as from a public AWS S3 bucket.
Data releases are the first step, but we need a way to uniformly distribute them. Put another way, we need to facilitate contributors starting from the same point on their local machines. We can’t track the data in the Git repository because the files are too large. Instead, we maintain a bash script that downloads the most recent data release to [.inline-snippet]data/release/[.inline-snippet] in someone’s local copy of the repository. (Each time there’s a new release, this script is updated accordingly.) The script also checks that the md5 checksums of the downloaded files match what is expected, i.e., that the files were not corrupted or truncated during download. One of the last steps is to symbolically link (or symlink) the files in [.inline-snippet]data/release/[.inline-snippet] to [.inline-snippet]data/[.inline-snippet]. By symlinking to [.inline-snippet]data/[.inline-snippet], contributors can write code that will work with any release.
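To make that a little more concrete, here’s a rough sketch of what such a script does. This is not the project’s actual script – the URL, release name, and file names below are placeholders:

```bash
#!/bin/bash
# Sketch only: download the current release, verify checksums, and symlink
# the files into data/. The URL, release name, and file names are placeholders.
set -euo pipefail

RELEASE="release-v14"
URL="https://example-bucket.s3.amazonaws.com/open-pbta/${RELEASE}"

mkdir -p "data/${RELEASE}"
for file in md5sum.txt histologies.tsv snv-mutation-tmb-coding.tsv; do
  curl -fsSL "${URL}/${file}" -o "data/${RELEASE}/${file}"
done

# Make sure nothing was corrupted or truncated during download
(cd "data/${RELEASE}" && md5sum --check md5sum.txt)

# Symlink the release files into data/ so analysis code can use stable paths
for file in "data/${RELEASE}"/*; do
  ln -sf "$(pwd)/${file}" "data/$(basename "${file}")"
done
```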
Let’s take a look at a concrete example of the benefit of symlinking, using the TMB file mentioned above. Say I started developing an analysis that used TMB when v14 was the most recent release. I could write R code that looked like the following (this assumes the working directory is the project root):
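(The snippet below is a reconstruction for illustration; the directory and file names are placeholders rather than the release’s actual ones.)

```r
# Illustrative only: read TMB values from a specific, versioned release
# directory. The directory and file names are placeholders.
tmb_df <- readr::read_tsv("data/release-v14/snv-mutation-tmb-coding.tsv")
```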
There are at least two problems with this:
1. For this code to run on someone else’s computer, they’d need a local copy of this older release.
2. If the TMB calculations change in a later release, my code would still use the outdated data unless I explicitly went in and changed it.
Instead, with symlinking, everyone can write R code that looks like the following (again assuming the working directory is the project root):
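(Again a reconstruction with a placeholder file name – but note that the path no longer mentions a release version.)

```r
# Illustrative only: read the same (placeholder) TMB file via the symlinked
# data/ directory, which always points at the most recent release.
tmb_df <- readr::read_tsv("data/snv-mutation-tmb-coding.tsv")
```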
The end result is code that should still work (well, at least this step 🤞) regardless of which release is the most current, and it always uses the most up-to-date version of the data someone has locally. It also means we can ensure that all the results in the submitted manuscript were generated from the same data release, because we rerun everything at the end.
Verifying that analyses are functional over the project’s lifespan
We must be able to run all the code required for the analyses that are included in the manuscript using the most up-to-date version of the data right before submission (as is good practice). That means that code written multiple months or years ago still needs to run without error. Individual differences in environments (e.g., versions of R and R packages used for analyses) can also produce different results, so we wanted to make sure that the code produces results in whatever environment will be used for the final run™. Without taking care to manage software dependencies throughout the life cycle of the project, we could end up in a situation where a contributor’s code ran on their local machine when they submitted their pull request, but we can’t reproduce their results when it’s time for submission. And remember, the underlying data are subject to change as pipelines or analyses are refined and new releases are cut.
We can address our concerns about dependencies and code robustness over time by using a few key technologies and practices. First, we use a single Docker image that contains all of the dependencies required to carry out the project. In this particular case, we build our image on top of the tidyverse image from the Rocker Project, which gives contributors access to an RStudio interface – convenient, since many of the analyses are written in R. We also leverage a continuous integration (CI) service to ensure that code will run at the time of pull request submission. To save on both memory and compute time in CI, we create smaller test files: for every file in a data release, the test version contains a subset of samples but the full set of features. The very first step in the CI pipeline is to ensure that the Docker image builds and that we can download the data. Subsequent steps are run inside the Docker container (we don’t need to rebuild the image at each step; we can rely on Docker layer caching) and on the test data.
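Strung together, each CI run boils down to steps along these lines. This is a sketch, not our actual CI configuration, and the script names are stand-ins:

```bash
# Sketch of a CI run, not the actual configuration. Script names are stand-ins.
set -euo pipefail

# 1. Confirm the project image builds (on top of rocker/tidyverse)
docker build -t open-pbta .

# 2. Download the subsetted test data instead of the full release
bash download-data.sh --test

# 3. Run an analysis module inside the container against the test data
docker run --rm -v "$(pwd)":/home/rstudio/OpenPBTA -w /home/rstudio/OpenPBTA open-pbta \
  bash analyses/transcriptomic-dimension-reduction/run-dimension-reduction.sh
```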
With this setup, we verify the following:
- The dependencies required for the analyses can be installed (we know this because the Docker image builds) and the analyses can be run in the container.
- An analysis that runs without error in CI is at least robust to running on the testing data, not just the current release of the full data. You can catch things like hardcoding a sample identifier or the number of samples at this step, which should be avoided since we expect the data to continue changing over time. You’d also find out if a variable had never been assigned a value or if a required library had not been loaded.
- Every analysis covered in CI is run every time a pull request to the default branch is filed, so we can detect if something breaks over time.
- Similarly, every time there’s a new data release, there’s a new release of testing data. At this stage, you may catch that something is wrong with the files in the data release or reveal code that is brittle with respect to the new release.
It’s worth noting that we’re only checking whether or not analyses run without error; we’re not using unit tests to check the functions contributors write for correctness, for example, but we could use this approach in future projects.
Every pull request is also subject to analytical code review, i.e., an organizer or another contributor with relevant expertise will review the code for things like correctness, generalizability, setting seeds for reproducibility, and even principles such as “don’t repeat yourself” (DRY). We’re big fans of analytical code review at the Data Lab, and will take this opportunity to point you to Parker 2017, which has influenced our thinking on the matter.
How we applied lessons from OpenPBTA to internal projects
Alt title: True Life - I Didn’t Know About Continuous Integration
We knew about and used some of these practices at the Data Lab before OpenPBTA – namely, analytical code review and Docker. Before we began working remotely – and before our data science team grew in size – it was a bit easier to make sure we were working from the same data during development (in the sense that we didn’t give it as much thought as we probably should have). Let’s dive into just a few examples of where we applied lessons from OpenPBTA to other Data Lab endeavors.
No more human spell check
In 2020, we revised refine.bio examples – the material we wrote to demonstrate different analyses you might perform using data downloaded from our product refine.bio – because usability evaluations revealed some problems we needed to address. After working on OpenPBTA, we knew we wanted to make sure everyone was working from the same source of data and to leverage our new favorite thing: CI. Here are some features of refine.bio examples development influenced by working on OpenPBTA:
- All datasets used during development were put into S3. That allowed Data Lab members reviewing each other's code to work from the same data, but it also allowed us to (re-)render all the R Notebooks when pull requests were merged.
- We wrote style guidelines documentation to stay on the same page as much as possible.
- We used a CI workflow to automatically spell check and style R Markdown files using the spelling and styler R packages, respectively.
As you can probably tell from the header of this section, the spell check workflow sparks the most joy.
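Concretely, that spell check and styling step boils down to a couple of R calls along these lines (a minimal sketch, not our exact workflow):

```r
# Minimal sketch of the CI step (not our exact workflow): spell check the
# prose in R Markdown files, then enforce a consistent code style.
rmd_files <- list.files(pattern = "\\.Rmd$", recursive = TRUE)

spelling::spell_check_files(rmd_files, lang = "en_US")  # flag misspelled words
styler::style_file(rmd_files)                           # restyle code chunks in place
```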
Training the trainers to automate
For our training workshops, we use R Notebooks for instruction. We use live coding for a portion of each workshop, but we want to be able to easily point training participants to completed notebooks for their reference. We do that by linking to the HTML versions of the completed notebooks, served via GitHub Pages, in our schedules. To minimize the burden of manually rendering notebooks or reviewing them to make sure they run, we rely on CI workflows to do the following:
- Ensure that notebooks will render without error during pull request review.
- Remove code that’s intended to be typed live during instruction.
- Did I mention that I love spell check? I do.
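For a flavor of the rendering step in the first bullet above, here’s a minimal sketch (the directory is a placeholder, and our actual workflows differ in the details):

```r
# Sketch only: render each instruction notebook to HTML so the completed
# versions can be published via GitHub Pages. The directory is a placeholder.
notebooks <- list.files("modules", pattern = "\\.Rmd$",
                        recursive = TRUE, full.names = TRUE)
purrr::walk(notebooks, rmarkdown::render)
```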
Functional testing of documentation
No documentation will ever be perfect, but we now perform functional testing of documentation. Functional testing means we have someone in the lab who isn’t directly involved in a project attempt to set up the dependencies or run an analysis by following our own projects’ documentation. We recently released a workflow designed to be used after downloading data from the Single-cell Pediatric Cancer Atlas Portal. We are performing usability evaluations of that pipeline (you can sign up here if you’re interested!), but in parallel, we set up this workflow on different machines to find any issues with the documentation (like this one).
We’re always eager to share what we’ve learned through projects like OpenPBTA. Subscribe to our blog to read more about the challenges we have encountered, the solutions we have implemented, and the processes that help the Data Lab run most efficiently!