Choosing wisely: A behind-the-scenes look at how we selected cell type annotation platforms for the ScPCA Portal
So you recently did some single-cell RNA sequencing and are working on analyzing your data. You’ve already quantified the gene expression data, performed any filtering, and normalized your data, but now what? You know you want to perform differential expression analysis or that you need to annotate the cell types found in your data, but there are so many different tools and methods for performing these analyses. How do you know which one is the best method for your dataset? Don’t worry, we’ve all been there – even experts in the single-cell field have been there.
When we were building our Single-cell Pediatric Cancer Atlas (ScPCA) Portal, we too had to identify the best methods for processing the myriad of datasets we were working with. For each step in our pipeline, from quantifying gene expression data to normalizing data, we had to decide which method was most appropriate. This involved picking a few tools to compare, testing tools using different parameters, and using quantitative metrics to identify the best-performing tool. Read more in this blog about building [.inline-snippet]scpca-nf[.inline-snippet], our open-source pipeline for processing single-cell and single-nuclei RNA-seq data.
More recently, we have added new features to the Portal, including the addition of cell type annotations to all samples. To do that, we identified the method(s) that would perform well when annotating cell types across datasets from over 50 different cancer types. If you’ve ever tried to cell type a single-cell dataset, you know this isn’t an easy task! Ultimately, we identified two methods, SingleR and CellAssign, that we incorporated into our [.inline-snippet]scpca-nf[.inline-snippet] workflow to add cell type annotations to the output data files.
Pick your poison
Not every method is created equal, so we did some light benchmarking to find the cell type annotation tool and reference combination that worked best for our specific project. Before we could actually benchmark, we needed to identify a list of candidate tools for annotating cell types that could feasibly be added to our workflow.
At the Data Lab, we consider the following questions when choosing a tool to implement.
- Is this tool actively maintained? Picking a tool that is actively maintained can be really helpful in avoiding future bugs. When software has bugs, it's important to know that a team of developers will actually fix them.
- How easy is it to implement this tool? Sometimes tools have a long list of dependencies, meaning they require other packages to be able to work properly. This can determine how easy it is to install and run that tool. Also, how are those dependencies handled? Picking a tool that handles the dependencies internally will make things much easier on you, the user.
- How much does running this tool cost? Each tool is going to have its own set of resource requirements. We like to find tools that are both time- and memory-efficient, which can be helpful in reducing run time and processing costs!
- What additional information do I need to run this tool? Many tools will require additional input besides the data files. For cell type annotation in particular, a reference file, such as a previously annotated dataset, is usually required, so we needed tools that use publicly available references.
- What format is the output in? Be sure to look at what is included in the output and what format it's in. It may also be helpful if metrics are included in the output. For cell type annotation, it was important to us to use a tool that provides a metric we could use to evaluate the quality of the annotations.
When preparing to perform your analysis, find some tools that fit your criteria. Then you can test each tool on your datasets and find the one that’s best for you!
Pro Tip: All of our benchmarking is available in our public GitHub repositories. Visit [.inline-snippet]AlexsLemonade/sc-data-integration[.inline-snippet] to see our benchmarking for both integration of multiple single-cell datasets and cell-type annotation methods.
Design your experiment
After considering each of these questions, we identified two tools that fit our criteria for cell type annotation: SingleR and CellAssign.
- SingleR is a reference-based method that works seamlessly with [.inline-snippet]SingleCellExperiment[.inline-snippet] objects, uses publicly available reference datasets, is cost-efficient, and provides metrics that can be helpful in evaluating the quality of annotations (see the sketch after this list).
- CellAssign is a marker gene-based method that is well-maintained, easy to use, and provides helpful metrics for evaluating the quality of annotations.
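To make this concrete, here is a minimal sketch of how SingleR can be run against a publicly available reference from the celldex package. The object names ([.inline-snippet]sce[.inline-snippet] for a normalized [.inline-snippet]SingleCellExperiment[.inline-snippet]) and the choice of reference are placeholders for illustration, not the exact configuration used in [.inline-snippet]scpca-nf[.inline-snippet].

library(SingleR)
library(celldex)

# Load a publicly available annotated reference (the choice of reference is illustrative)
ref <- celldex::HumanPrimaryCellAtlasData()

# `sce` is assumed to be a normalized SingleCellExperiment with a "logcounts" assay
pred <- SingleR(test = sce, ref = ref, labels = ref$label.main)

# Assigned labels and per-cell annotation scores come back in a single DataFrame
table(pred$labels)
head(pred$scores)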
The biggest challenge with cell type annotation is not only finding the optimal tool for your datasets, but also finding the best reference for them. So most of our benchmarking focused on identifying the best reference to use with each tool we identified. We also compared the annotations from each tool to each other to look for consistency. When multiple tools give the same result, that result is often more reliable.
Use controls
Just like when conducting a wet lab experiment, we always need controls! It's important to be able to test whether a tool is doing what is expected. We therefore needed a positive control, where we expected the method to produce a specific result, and a negative control, where we expected the method to fail.
For each cell type annotation method, we created a nonsense reference, or a reference with cell types that were not expected in the dataset we were cell typing. This served as our negative control because we would expect the method to fail to assign cells given a reference that did not match the dataset. For our positive control, we used a dataset with known ground-truth annotations so that we could compare the annotations provided by each method to the known annotations. We were fortunate enough to have access to data that had already been annotated, but if we did not have any datasets like this readily available, we would have used a simulated dataset. Simulated data can be really helpful in making sure that a tool is performing as expected, but remember, real data always comes with a little mess.
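As an illustration, here is one way a nonsense reference could be constructed for SingleR by restricting a public reference to cell types that should not appear in the dataset being annotated. The specific labels and object names are hypothetical examples, not the exact controls we used.

library(SingleR)
library(celldex)

ref <- celldex::HumanPrimaryCellAtlasData()

# Keep only cell types we do NOT expect to see in the test dataset
# (labels shown are illustrative; pick ones absent from your tissue of origin)
unexpected <- c("Keratinocytes", "Hepatocytes", "Osteoblasts")
nonsense_ref <- ref[, ref$label.main %in% unexpected]

pred_neg <- SingleR(test = sce, ref = nonsense_ref, labels = nonsense_ref$label.main)

# With a mismatched reference we expect low-confidence calls:
# many pruned (NA) labels and small gaps between the best and next-best scores
mean(is.na(pred_neg$pruned.labels))
summary(pred_neg$delta.next)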
Identify a set of metrics
The first thing we did was identify a set of quantitative metrics and use them to evaluate each candidate tool on a subset of samples. For SingleR, we measured how confident SingleR was in each cell type annotation and compared this metric across multiple references to identify the most appropriate one. For CellAssign, we looked at the distribution of the prediction scores: the more cells with prediction scores near 1 for their assigned cell type, the more confident CellAssign was in those assignments. Making these comparisons across different references allowed us to identify robust references to use for our ScPCA datasets.
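For example, per-cell confidence measures from both tools can be summarized and compared across candidate references. The sketch below assumes [.inline-snippet]pred[.inline-snippet] is a SingleR result and [.inline-snippet]cellassign_probs[.inline-snippet] is a cells-by-cell-types matrix of prediction scores obtained from CellAssign; the exact metrics we used are documented in our benchmarking repository.

library(SingleR)

# SingleR: distribution of the gap between the best and second-best reference score;
# larger gaps indicate more confident annotations
plotDeltaDistribution(pred)

# Cells whose scores were too ambiguous are set to NA in the pruned labels
summary(is.na(pred$pruned.labels))

# CellAssign: scores near 1 for the assigned cell type indicate confident calls
max_prob <- apply(cellassign_probs, 1, max)
hist(max_prob, breaks = 50, main = "CellAssign maximum prediction score per cell")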
For a subset of samples, we also compared the annotations from both methods to ground-truth annotations provided by the original submitter of the ScPCA dataset. We found that both methods performed comparably well, depending on the choice of reference.
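As a simple illustration of that comparison, assuming the submitter-provided labels are stored in the object's [.inline-snippet]colData[.inline-snippet] under a hypothetical column name like [.inline-snippet]submitter_celltype[.inline-snippet], the agreement between a method's calls and the ground truth can be tabulated directly.

# Hypothetical column holding submitter-provided (ground-truth) annotations
truth <- sce$submitter_celltype

# Cross-tabulate one method's calls against the ground truth
table(SingleR = pred$pruned.labels, truth, useNA = "ifany")

# Overall agreement rate (labels must use a matching vocabulary for this to be meaningful)
mean(pred$pruned.labels == truth, na.rm = TRUE)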
Ultimately, we chose to provide cell type annotations from both methods and include a comparison between the two methods in a cell type report available with each sample on the Portal.
To the Portal
After identifying the best tools to use and the best references, we were able to implement a new module for annotating cell types in [.inline-snippet]scpca-nf[.inline-snippet]. We then processed all datasets from the Portal through the updated pipeline and released them to the community. Through this process, we learned that cell type annotation can be very tricky, especially when using references built from normal tissue to annotate samples from cancer tissue. To help you, the user, evaluate the cell type annotations we provided, we included a supplemental cell type report with each sample that displays many of the metrics that we mentioned above.
You can head to the ScPCA Portal to download samples with added cell type annotations and explore some of the metrics we discussed for yourselves!