Cataloging the CCDI Childhood Cancer Data Catalog (CCDC)
Here at the Data Lab, we're all about, well, data! We believe that data sharing and accessibility are key to accelerating the research process, and ultimately to improving outcomes for childhood cancer patients. So, we were excited to learn that one of the goals of the NCI/NIH initiative, the Childhood Cancer Data Initiative (CCDI), is to build up a Data Ecosystem that will facilitate pediatric cancer researchers' ability to explore and collect data from disparate resources. Although this Ecosystem is still in the early stages, several components are already being developed and are available for researchers to use! One component that is particularly interesting to us is the CCDI's Childhood Cancer Data Catalog (CCDC). According to NCI,
The CCDI Childhood Cancer Data Catalog is an inventory of childhood cancer data repositories from across the childhood cancer research community. This catalog will make it easier for researchers, doctors, and citizen scientists to find data that will help them. Each resource page includes a summary description, data content types, and links to access the data. The inventory includes childhood cancer repositories, registries, data commons, websites, tools, and catalogs that manage and refer to data.
First, what is the CCDC? It's a submission-based service, meaning researchers can use a template form to submit their data resources as a way to promote resource awareness and usage within the pediatric cancer research community. Importantly, as the name implies, the CCDC is a catalog, not a database: it does not host data directly, nor does it provide access to controlled data. Instead, the CCDC offers a centralized location for researchers to identify external data resources that may suit their needs, which researchers can then further investigate on their own. Therefore, the CCDC's primary role is to lower the barrier to identifying potential resources of interest.
We were eager to learn more about the CCDC's potential, so we went ahead and "cataloged the data catalog" to develop a deeper sense of the landscape of existing data in pediatric cancer. What types of data are commonly available, and what types of analyses can they support? How do available data resources relate to one another? Can we at the Data Lab use any existing data in the catalog to improve our products, resources and/or training workshops, and similarly, could the CCDC be a suitable place to share our data? Answering these questions gives us insight into how we can best position ourselves to support cancer researchers in the age of data-intensive research, and here we'll pass some of what we found along to you! We'll note that there's a lot of ground here that we can't cover in this one blog post, but we hope this gives you a sense of what you can expect from the CCDC.
Within the CCDC, resources are generally partitioned into one of five categories (catalogs, knowledge bases, programs, registries, and repositories) depending on what kind of datasets they contain and how that data was originally collected. The CCDC offers a comprehensive glossary of terms and a PDF User Guide to help get you started exploring these resources, as well as release notes to help you keep tabs on how the catalog is being updated over time.
What types of data are commonly available, and what types of analyses can they support?
As of September 2022, the CCDC lists 100 datasets across 33 different resources, which span a wide array of data types: "omics" datasets (both controlled-access and open summarized data), imaging, xenograft resources, cell lines, survivorship data, clinical trial results, and resources linking cancer genetic variants to clinically relevant outcomes. This breadth of data types means the CCDC caters to a broad range of pediatric cancer researchers. Further, some of these resources are not limited to data but contain internal analytics and/or visualization tools that support data exploration or hypothesis generation. For example, St. Jude Cloud is a repository resource listed in the CCDC that hosts both genomic data (much of which is controlled and must be requested) and clinical data, and also provides a robust cloud computing environment for interactively exploring and analyzing its hosted data. Another CCDC-listed registry resource, the National Childhood Cancer Registry Explorer (NCCR), offers data exports and options to make custom visualizations in the browser to explore incidence and survival statistics of pediatric and AYA (adolescent and young adult) cancer patients.
Importantly, not all resources are pediatric-specific, but all resources should contain at least some datasets of pediatric and/or AYA cancer. This information is described on a given dataset and/or resource's summary page in the CCDC. These summary pages also provide metadata about dataset contents, often including number of cases and/or samples, case disease diagnosis, sample tissue location, case sex, case age at diagnosis, and any publications associated with the data. This information is generally collected during the submission process, so what kind of metadata is available for a given dataset will necessarily vary resource-to-resource. But, based on our experience, the CCDC maintainers can help you identify relevant metadata to include with your submission that is in line with what they would like to track.
How do available data resources relate to one another?
Something the CCDC highlights about the landscape of pediatric data is that there can be a lot of overlap among existing resources. While some datasets are unique to one resource, other datasets are embedded across several resources. One example of this kind of overlap is data from the TARGET initiative, an NCI-sponsored cross-institutional program that genomically characterized several types of pediatric cancers. TARGET encompasses eleven sub-studies in a collaborative network of teams, each studying different diseases and/or disease stages, such as Wilms Tumor and neuroblastoma. TARGET data includes open clinical metadata about patients and samples, open summarized genomic data, and controlled raw genomics files whose access can be requested via dbGaP, an NIH-hosted repository for biological data. Although each TARGET sub-study has its own dbGaP entry (for example, the Rhabdoid tumor sub-study and the Osteosarcoma sub-study), there is also an overall dbGaP page for the whole TARGET project, where requests for most of the controlled data should be directed.
Within the CCDC, individual TARGET sub-studies are recorded as "datasets," meaning there are eleven total datasets associated with TARGET. In this case, a given TARGET dataset comprehensively refers to all clinical/patient metadata, controlled genomic data, and summarized genomic data associated with a given TARGET sub-study.
As of September 2022, these eleven datasets are present in several of the 33 CCDC-listed resources:
- TARGET itself is a program resource in the CCDC that links to the TARGET project homepage. This resource contains the eleven datasets (the TARGET sub-studies), and links are provided to the associated dbGaP studies where controlled data can be requested.
- The Genomic Data Commons (GDC) is a repository resource in the CCDC. The GDC is an NCI-sponsored database of cancer genomic studies and hosts both open and controlled-access datasets. Within the CCDC, this resource lists eleven curated pediatric cancer datasets, one of which is TARGET. In fact, the GDC is the repository that contains controlled-access TARGET data: once you acquire access through dbGaP, you will be able to obtain the data from the GDC!
- Kids First (Gabriella Miller Kids First Pediatric Research Program) is a repository resource in the CCDC that offers both data export and analysis/visualization tools for pediatric cancer datasets. Within the CCDC, this resource lists fifteen different datasets, two of which are TARGET sub-studies (TARGET AML and TARGET Neuroblastoma), and links for each are provided both to the GDC and to the relevant dbGaP study. These projects are listed because Kids First was a collaborator on these specific TARGET sub-studies.
- PedcBioPortal for Integrated Childhood Cancer Genomics is a repository resource within the CCDC that lists a single dataset: all data within the PedcBioPortal, merged into one overarching dataset for the purposes of reporting metadata (e.g. number of cases, diagnosis counts, case ages, etc.) within the CCDC. This CCDC entry provides a link to the PedcBioPortal home page, where users can find many distinct datasets which can be separately explored, visualized, and analyzed. Among these datasets are ten of the TARGET sub-studies. The PedcBioPortal primarily provides an opportunity to interactively explore these data in the browser, but it also provides a link back to the TARGET Data Matrix for further information.
Based on our exploration of these TARGET-associated resources, we've summarized how you can interact with TARGET data in each one (Figure 1).
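If you want to keep track of this kind of overlap yourself, one lightweight option is to record which datasets each resource lists and then invert that mapping. The snippet below is only a minimal sketch, not something the CCDC provides: the dictionary is hand-written, it includes only the TARGET sub-studies named in this post (so it is deliberately incomplete), and the variable and function names are our own illustrative choices.

```python
from collections import defaultdict

# Resource -> TARGET sub-studies it lists.
# ASSUMPTION: hand-transcribed and partial. Only sub-studies named in this
# post appear here; the TARGET program itself contains eleven in total.
resource_to_substudies = {
    "TARGET": {"AML", "Neuroblastoma", "Wilms Tumor", "Rhabdoid Tumor", "Osteosarcoma"},
    "Kids First": {"AML", "Neuroblastoma"},
}


def invert(mapping):
    """Return sub-study -> sorted list of resources that list it."""
    inverted = defaultdict(list)
    for resource, substudies in mapping.items():
        for substudy in substudies:
            inverted[substudy].append(resource)
    return {name: sorted(resources) for name, resources in inverted.items()}


substudy_to_resources = invert(resource_to_substudies)

# Sub-studies that show up in more than one resource.
shared = sorted(
    name for name, resources in substudy_to_resources.items() if len(resources) > 1
)
print(shared)  # ['AML', 'Neuroblastoma']
```

Inverting the mapping makes it easy to ask "where else does this dataset live?" before requesting access or downloading the same data twice.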
How can the CCDC help us help you?
To explore how we at the Data Lab might use the CCDC to support pediatric cancer researchers, we did a little spelunking through the catalog itself! Since many of our initiatives focus on transcriptomic data, we compiled a list of RNA-seq and/or microarray datasets from pediatric tumors and/or xenografts in CCDC-listed resources. Next up, we're planning to dig through these datasets in a bit more depth to see how we might leverage these data. For example, we'll be looking for any openly available datasets that might be used as example datasets for trainees to analyze in our workshops.
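For readers who want to build a similar shortlist, here is a rough sketch of how that kind of filtering could look once you have noted down dataset metadata by hand. Everything in it is an assumption made for illustration: the `DatasetRecord` fields, the data type labels, and the placeholder records are hypothetical and do not reflect an actual CCDC export format or real catalog entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    # Hypothetical fields, filled in by hand from dataset summary pages.
    resource: str
    dataset: str
    data_types: frozenset
    sample_types: frozenset
    open_access: bool


TRANSCRIPTOMIC = {"rna-seq", "microarray"}
SAMPLES_OF_INTEREST = {"tumor", "xenograft"}


def is_candidate(record):
    """True if the record has transcriptomic data from tumors or xenografts."""
    has_assay = bool(record.data_types & TRANSCRIPTOMIC)
    has_sample = bool(record.sample_types & SAMPLES_OF_INTEREST)
    return has_assay and has_sample


def shortlist(records, open_only=False):
    """Filter records, optionally keeping only openly available datasets."""
    return [r for r in records if is_candidate(r) and (r.open_access or not open_only)]


# Placeholder records, NOT real catalog entries.
records = [
    DatasetRecord("resource-a", "dataset-1", frozenset({"rna-seq", "wgs"}), frozenset({"tumor"}), True),
    DatasetRecord("resource-a", "dataset-2", frozenset({"imaging"}), frozenset({"tumor"}), True),
    DatasetRecord("resource-b", "dataset-3", frozenset({"microarray"}), frozenset({"xenograft"}), False),
    DatasetRecord("resource-b", "dataset-4", frozenset({"rna-seq"}), frozenset({"cell line"}), True),
]

all_hits = [r.dataset for r in shortlist(records)]
open_hits = [r.dataset for r in shortlist(records, open_only=True)]
print(all_hits)   # ['dataset-1', 'dataset-3']
print(open_hits)  # ['dataset-1']
```

The `open_only` flag mirrors the distinction that matters for workshop material: controlled-access datasets may still be useful for analysis, but only openly available ones can be handed directly to trainees.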
While thinking about how CCDC data might support our resources, we got to thinking more about how we can work with the CCDC too! Currently, we're in the process of submitting the Single-Cell Pediatric Cancer Atlas (ScPCA), an open-access database of uniformly processed single-cell transcriptomic data from pediatric cancer clinical samples and xenografts, as a resource to the CCDC. You can read more about the ScPCA in this blog post. Our goal is for the ScPCA datasets to be part of the next quarterly CCDC release, so keep your eyes peeled!
In conclusion, we're excited to see this increased emphasis on data sharing in pediatric cancer research, and we hope you have a better sense now of how the CCDC might support your work as well! You can also support the CCDC's growth by submitting your datasets, and based on our experience, the CCDC maintainers will be glad to help you navigate the submission process. We look forward to following along as the catalog continues to develop.