Building workflows for testing and reproducible results in OpenScPCA
In our last blog post, we shared some of the tools and methods we are using in the Open Single-cell Pediatric Cancer Atlas (OpenScPCA) project to ensure that the analysis code remains usable and runnable throughout the project. That post mainly focused on some of the most dynamic phases of the project, when contributors are adding new analysis modules and updating existing ones with more refined results. Here, we will discuss the test data that enables those methods, as well as our approach to running the full set of analyses on real data.
Eventually, analysis modules will become “stable,” no longer under active development, and we’ll want to run them repeatedly as we cut new data releases or write up the results. We need a system for running the entire set of analysis modules that ensures any module using another module’s output has those results available and up to date. What we really need is a workflow.
We had previously built a fairly large workflow with Nextflow to do the initial mapping and processing of single-cell data destined for the ScPCA portal, so we leveraged our experience there to design a new workflow, called [.inline-snippet]OpenScPCA-nf[.inline-snippet], that would allow us to reproducibly and efficiently run all of the OpenScPCA analyses.
Get it small: building synthetic datasets for testing
When analyses are in flux, tests have to be run frequently. Unfortunately, running the full analysis repeatedly is expensive; we can’t run all samples with full data sets through all analyses all the time!
Our first thought was to use a small subset of samples for testing, as we had in the OpenPBTA project, but that required constant effort to ensure we could test the parts of the analysis that relied on specific sets of samples. Instead, we decided to include all samples but replace the original data with small synthetic data sets. This also allows us to provide free access to the test data, so anyone can start building new analyses with it, even before formally joining the OpenScPCA project.
We started with the very useful [.inline-snippet]splatter[.inline-snippet] package, which let us quickly create synthetic data sets with the same sets of genes as our original data but only 100 cells each. The metadata for these simulated data sets was derived from the original data and permuted to break the relationships among fields. We also ensured that all cell type labels present in the original data were still represented in the simulated data sets, minimizing the chances that code expecting a particular cell type to be present would break when run on them.
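As a rough illustration, here is a minimal sketch of what such a simulation step could look like as a Nextflow process, assuming the original data are SingleCellExperiment objects stored as .rds files. This is not the actual OpenScPCA-nf code: the process and file names are hypothetical, and the real workflow does more, including preserving the original cell type labels.

```groovy
// Minimal sketch, not the actual OpenScPCA-nf code: simulate a 100-cell
// stand-in for one library with splatter, then permute its metadata.
process simulate_sce {
    input:
    path sce_file // an original SingleCellExperiment saved as an .rds file

    output:
    path 'simulated_sce.rds'

    script:
    """
    #!/usr/bin/env Rscript
    library(splatter)
    library(SingleCellExperiment)

    sce <- readRDS("${sce_file}")

    # estimate simulation parameters from the real data, then shrink to 100 cells
    params <- splatEstimate(as.matrix(counts(sce)))
    params <- setParam(params, "batchCells", 100)
    sim <- splatSimulate(params, verbose = FALSE)

    # keep the same gene identifiers as the original data
    rownames(sim) <- rownames(sce)

    # copy metadata from the original cells, permuting each column
    # independently to break the relationships among fields
    meta <- colData(sce)[sample(ncol(sce), ncol(sim), replace = TRUE), , drop = FALSE]
    for (col in colnames(meta)) {
        colData(sim)[[col]] <- sample(meta[[col]])
    }

    saveRDS(sim, "simulated_sce.rds")
    """
}
```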
All of this was implemented as the first element of the [.inline-snippet]OpenScPCA-nf[.inline-snippet] workflow. Whenever we release new data on the ScPCA Portal, we run the workflow to update the simulated data to match. This is the data used in our continuous integration (CI) tests back in the [.inline-snippet]OpenScPCA-analysis[.inline-snippet] repository, helping ensure that we catch places where changes in the underlying data might cause an analysis module to break.
Make a batch: transferring and optimizing analysis modules
Once an analysis module from [.inline-snippet]OpenScPCA-analysis[.inline-snippet] is tested and stable, we are ready to adapt its code to run as part of the [.inline-snippet]OpenScPCA-nf[.inline-snippet] workflow.
The first step is to move all of the scripts that make up the [.inline-snippet]OpenScPCA-analysis[.inline-snippet] module to a new module within the [.inline-snippet]OpenScPCA-nf[.inline-snippet] workflow. In doing so, we also have to make sure that all the dependencies the scripts require are available when running within the workflow. Luckily, we already solved that! As discussed in Part I, one of the principles of our approach in OpenScPCA was that each analysis module should manage its dependencies and run in its own Docker container. We can use the exact same Docker images in our Nextflow workflow and know that all of the dependencies that might be needed are already there.
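For instance, a Nextflow configuration file can pin each module's process to the image it was developed with. A sketch, with hypothetical image names:

```groovy
// nextflow.config sketch: each module's process runs in the same Docker
// image it used during development (image names are hypothetical)
docker.enabled = true

process {
    withName: 'simulate_sce' {
        container = 'ghcr.io/example/simulate-sce:latest'
    }
    withName: 'detect_doublets' {
        container = 'ghcr.io/example/doublet-detection:latest'
    }
}
```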
As we move the code over to Nextflow, we also have the opportunity to optimize for faster and more resilient runs. During development, we expect most modules to run via a primary script (usually a Bash script) that executes each step of the analysis in sequence, either running all samples through each step before proceeding to the next step or running each sample through all steps before proceeding to the next sample. Either way, each part runs one at a time. This has the advantage of simplicity, but if we break the steps up a bit, we can let Nextflow determine which steps need to be computed and in what order. Better yet, we can direct Nextflow to submit each step to a computing cluster (AWS Batch, in our case). This lets us run many jobs in parallel and customize the computing resources for each step: simple steps can run with few CPUs and little memory, while more computationally expensive tasks can be given many CPUs and/or lots of memory as needed.
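Here is a sketch of how that per-step tuning can be configured; the queue, region, bucket, and process names are all hypothetical:

```groovy
// nextflow.config sketch: route every task to AWS Batch, giving each
// step only the resources it needs
process {
    executor = 'awsbatch'
    queue    = 'openscpca-batch-queue' // hypothetical Batch queue

    // a lightweight step gets a small allocation...
    withName: 'summarize_metadata' {
        cpus   = 1
        memory = '2 GB'
    }

    // ...while an expensive step gets many CPUs and lots of memory
    withName: 'integrate_samples' {
        cpus   = 8
        memory = '64 GB'
    }
}

aws.region = 'us-east-1'
workDir    = 's3://example-work-bucket/nextflow-work' // intermediate files live on S3
```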
It’s a workflow: keeping results up to date
The Nextflow system also helps us handle a situation we struggled with in OpenPBTA: how do we handle rerunning modules when one analysis depends on another? In OpenPBTA, we mostly dealt with this on an ad hoc basis, tracking dependencies in a table and rerunning steps manually as required. With Nextflow, it is much easier to encode these dependencies within the workflow itself, ensuring that any updated results are calculated before another module needs them and that every module uses the same version of the data.
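A sketch of what this wiring looks like in Nextflow, assuming two processes, [.inline-snippet]detect_doublets[.inline-snippet] and [.inline-snippet]annotate_cell_types[.inline-snippet], defined elsewhere (the names are illustrative, not the actual OpenScPCA-nf modules):

```groovy
// Sketch: module dependencies expressed as channel plumbing. Nextflow will
// not start a task until all of its inputs exist.
workflow {
    // one channel of input files feeds every module
    sce_ch = Channel.fromPath(params.sce_files)

    // the upstream module produces the doublet tables...
    doublet_ch = detect_doublets(sce_ch)

    // ...and the downstream module automatically waits for them
    annotate_cell_types(sce_ch, doublet_ch)
}
```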
We also use the workflow to publish the results from each analysis module to an S3 bucket, where they can be used directly by contributors developing downstream modules. For example, the [.inline-snippet]doublet-detection[.inline-snippet] module produces a table classifying droplets that may contain more than one cell. Any downstream analysis can use those results during development and, eventually, once it is incorporated into the [.inline-snippet]OpenScPCA-nf[.inline-snippet] workflow itself. We also run all steps on the simulated data we built, so simulated results are available for testing as needed.
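In Nextflow, publishing results this way is a single directive on the process. A sketch using the doublet-detection example, where the bucket path and script name are hypothetical:

```groovy
// Sketch: copy a module's results to S3 so contributors can use them
// directly while developing downstream modules
process detect_doublets {
    publishDir 's3://example-results-bucket/doublet-detection', mode: 'copy'

    input:
    path sce_file

    output:
    path 'doublets.tsv'

    script:
    """
    run_doublet_detection.R --input ${sce_file} --output doublets.tsv
    """
}
```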
When everything is working as it should, we can run the [.inline-snippet]OpenScPCA-nf[.inline-snippet] workflow whenever we add or update a workflow analysis module or when new data is added to the Single-cell Pediatric Cancer Atlas (ScPCA) Portal, ensuring that all results stay up to date. If you are interested in contributing data to the portal or code to the analysis, we are always happy to hear from you!
Get started!
- Explore the [.inline-snippet]OpenScPCA-analysis[.inline-snippet] repository and join the conversation on GitHub Discussions.
- Contact us by filling out the OpenScPCA intake form or ScPCA data contribution interest form.