Working reproducibly with others on OpenScPCA

September 30, 2024

Earlier this year, we launched the Open Single-cell Pediatric Cancer Atlas (OpenScPCA) project, a collaborative project to openly analyze the data in the Single-cell Pediatric Cancer Atlas Portal on GitHub. We hope this project will bring transparently and expertly assigned cell type labels to the data in the Portal, help the community understand the strengths and limitations of applying existing single-cell methods to pediatric cancer data, and, frankly, allow us to meet more scientists in our community working with single-cell data (maybe you? 😄).

From experience, we know the importance of ensuring reproducibility in an open analysis project. We’re also aware that this is challenging. Luckily, we weren’t starting entirely from scratch. OpenScPCA builds on the success of a project we published last year: the Open Pediatric Brain Tumor Atlas (OpenPBTA) project, which was co-organized with the Center for Data-Driven Discovery in Biomedicine (D3b) at the Children’s Hospital of Philadelphia. We’re not satisfied with just building on our success here–we want to make sure we’re applying the lessons learned along the way. In fact, Jo Lynne Rokita's blog post details what D3b learned, and I wrote about what the Data Lab learned in a blog post of my own.

Besides taking contributor-facing documentation more seriously—you can read more about it in Deepa Prasad’s most recent blog post—and making grants available, we think a few key differences in our approach to OpenScPCA vs. OpenPBTA are worth sharing. This blog post will focus on the techniques we're using to ensure reproducibility over time and on our testing strategy, which checks only that code can execute, not that it is correct (we use peer review to help with that!).

If it ain’t broken… wait, are you sure it's not broken?

If an analysis project is around long enough, something is bound to break. In our experience, open, collaborative projects are long-lived–we want the code to be useful for a while! Combine that with multiple people from different labs writing code to analyze the data, and you have created the perfect conditions for things to break in expected and unexpected ways. In this way, OpenScPCA and OpenPBTA are no different–we need to make sure the analysis code still runs as new data releases come out and that the underlying results are reproducible over time. (Reproducibility and its benefits and goals are discussed more extensively in Peng and Hicks, 2021.)

Are we doomed to play (and hopefully win 🤞) bug whack-a-mole when we go to publish parts of OpenScPCA? No! We can borrow a collection of strategies from our colleagues in software development to mitigate these problems, namely continuous integration/continuous delivery (CI/CD) and dependency management (i.e., tracking and pinning the versions of the libraries or packages our code depends on).
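To make the version-pinning idea concrete, here is a minimal sketch of what a pinned environment file can look like with conda, one of the tools we use in OpenScPCA (more on that below). The environment name and package versions here are purely illustrative, not an actual OpenScPCA environment:

```yaml
# Illustrative conda environment file with pinned package versions.
# The environment name and the specific versions are examples only.
name: example-cell-type-module
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.11
  - pandas=2.2.2
  - scanpy=1.10.1
```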

Ensuring reproducibility in OpenPBTA

In OpenPBTA, our strategy for ensuring reproducibility throughout the project was simple. We had one monolithic Docker image containing all the dependencies for every analysis and one CI/CD workflow to check that all code could be run. We created test data that included all features for a subset of samples. The workflow would first build the Docker image, run a container from the newly built image, download the test data, and then run every analysis module on the test data within the container. Before a pull request could be merged, the workflow had to pass, regardless of which files the pull request altered.
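For illustration, a monolithic workflow of this kind might look roughly like the sketch below. It is written as a GitHub Actions job purely for concreteness, the script names are hypothetical, and the actual OpenPBTA configuration differs in its details:

```yaml
# Sketch of a single "build one image, run everything" CI workflow.
# Script names (download-testing-data.sh, run-all-modules.sh) are hypothetical.
name: run-all-analyses
on: pull_request

jobs:
  run-everything:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build the project-wide Docker image
        run: docker build -t project-image .
      - name: Download test data and run every analysis module in the container
        run: |
          docker run --rm -v "$PWD":/home/project -w /home/project project-image \
            bash -c "bash download-testing-data.sh && bash run-all-modules.sh"
```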

Our approach mitigated, or at least made us aware of, some problems that arise in analytical projects. For example, running the code on test data ensured that the syntax was correct, and if code made assumptions about the presence of specific samples (e.g., by hardcoding sample identifiers), a workflow failure would let us know we needed to make the code more general. If an analysis module’s dependencies were missing from the Docker image, the workflow would fail, and we would know we needed to add them. Ultimately, we could successfully run all the steps contributing to the OpenPBTA publication in the project Docker container using the final data release.

The approach’s simplicity was appealing, but as the project continued, it created pain points. Large Docker images take a long time to build and download. As the number of analysis modules grew, the number of steps in the workflow also increased. In the latter stages of the project, the workflow would sometimes time out (i.e., fail because it took too long), and even when it finished in the required time, it could be frustrating to wait over an hour to find out that there was a syntax error in the pull request. Heading into OpenScPCA, we wanted to retain the benefits of our OpenPBTA reproducibility strategy while alleviating these pain points.

Ensuring reproducibility in OpenScPCA

Like in OpenPBTA, we expect analyses to be organized into subdirectories or modules. Often, these modules will be entirely independent of each other. For example, if we are adding cell type labels to two projects–particularly if we use different methods–the code can be organized into separate modules. If Project A’s cell typing code is broken, that has no bearing on Project B’s cell typing code. We can use this independence to our advantage when creating our dependency management and testing strategy.

In OpenScPCA, every module has its own environment–i.e., the set of dependencies required to run its code–and its own CI/CD workflows: one to run the module’s code on test data and another to build the module-specific Docker image and push it to the AWS Elastic Container Registry. To help contributors and project organizers with setup, we use a script to initialize new modules with the files required for managing environments and testing. We use a mix of renv and conda to manage dependencies, depending on the language used in the module, and we can use both of these technologies in conjunction with Docker. When we test a module’s code on the test data–simulated data with 100 cells for each library–we activate the module’s environment or run the code in its container, as applicable. These workflows are only triggered when a pull request changes relevant files: if Project A’s cell typing code changes, only Project A’s cell typing code gets tested. We hope this will allow us to develop and merge code more efficiently.
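To give a flavor of how that path filtering works, here is a simplified sketch of a module-specific GitHub Actions workflow that only runs when that module’s files change. The module name, paths, and run script are hypothetical examples, not the real OpenScPCA workflow files:

```yaml
# Sketch of a path-filtered workflow for a single (hypothetical) module.
name: test-cell-type-project-a
on:
  pull_request:
    paths:
      - "analyses/cell-type-project-a/**"

jobs:
  run-module:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up the module's conda environment
        uses: conda-incubator/setup-miniconda@v3
        with:
          environment-file: analyses/cell-type-project-a/environment.yml
      - name: Run the module on test data
        shell: bash -el {0}  # login shell so the conda environment is active
        run: bash analyses/cell-type-project-a/run-module.sh
```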

Running all the code on every pull request in OpenPBTA did have an advantage–if a module hadn’t been touched in a while but no longer worked with a new data release, we’d find out pretty quickly. If we only test OpenScPCA code when it changes, we could end up in a situation where code has been broken for months without our being aware. To guard against this possibility, we also build the Docker images and run the module code on the test data on a monthly schedule, as sketched below. That gives us an opportunity to fix any problems before we attempt to rerun something way down the line and are unpleasantly surprised.
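Adding that kind of schedule to a workflow like the one above only takes a few extra lines in its trigger block. The cron expression in this sketch (midnight UTC on the first day of each month) is just one example of a monthly cadence:

```yaml
# Sketch: trigger the same (hypothetical) module workflow on pull requests
# that touch the module, on a monthly schedule, and manually when needed.
on:
  pull_request:
    paths:
      - "analyses/cell-type-project-a/**"
  schedule:
    - cron: "0 0 1 * *"  # 00:00 UTC on the first day of every month
  workflow_dispatch: {}
```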

Of course, given our focus on cell type labels, we must also prepare for the possibility that modules will build off one another (i.e., an analysis uses cell type labels someone else generated as input). Keep your eyes peeled for Josh Shapiro’s part II of this blog post for more on how we’re thinking about this problem!

Join us for OpenScPCA!

Are you a researcher with experience or expertise in pediatric cancer, single-cell data, labeling cell types or cell states, and/or pan-cancer analyses? We invite you to explore and contribute your ideas to OpenScPCA!
