Reflections on the Childhood Cancer Data Initiative Symposium
Here at the CCDL we value putting publicly available data to work. For example, we are currently processing and normalizing 1.5 million publicly available gene expression samples, which together represent roughly $1.5 billion in research spending. This would not be possible if scientists didn’t share their data and if these samples were not deposited into accessible repositories. However, there are many barriers to data sharing in the childhood cancer community.
A Childhood Cancer Data Initiative (CCDI) has been proposed by the National Cancer Institute (NCI) to address these barriers and put into place a framework for the maximal use of existing data and for future data collection efforts. The CCDI aims to improve outcomes for children with cancer through better options for prevention, diagnosis, treatment and survivorship. The CCDI stems from President Trump’s proposal, in his February State of the Union address, to devote $500 million to childhood cancer research over the next 10 years, primarily for childhood cancer data sharing. This funding is not guaranteed; Congress will need to appropriate it.
At the end of July, the NCI kicked off the CCDI Symposium when Acting Director Douglas R. Lowy, MD convened researchers and patient advocates to discuss opportunities for enhanced data collection and usage in childhood cancer research. The CCDI meeting aimed to determine what data are necessary to advance treatment options for children with cancer and how to remove barriers to accessing and using those data. Removing these barriers could enable researchers to harness the cumulative power of these data, but it will require interoperability of databases and platforms, integration of various data types such as genomic, phenomic, histologic, and imaging data, and methods to extract knowledge from integrated data collections. Integrating these data could provide new insights into disease, treatment and outcomes. Moreover, because childhood cancers are individually rare, despite being collectively deadly, pooling data is required to study them effectively.
The CCDI meeting focused primarily on four areas of need, each of which was addressed in a breakout session:
- Prioritizing scientific and clinical research data needs for therapeutic progress
- Creating meaningful datasets for clinical care and associated research
- Infrastructure to enable federation among disparate pediatric data repositories
- Development of tools and resources to extract knowledge from data
We were invited to attend the meeting to share our perspectives on the efforts of the Childhood Cancer Data Lab (CCDL) sponsored by Alex’s Lemonade Stand Foundation. On the opening plenary panel, Dr. Taroni shared her thoughts on exactly what we mean when we say childhood cancer data (it might be broader than you think!). See the final slide of that presentation below! In the rest of this post, we’ll discuss some of our takeaways from the CCDI Symposium.
We were thrilled to hear attendees actively discussing the need to share data. The discussion wasn’t centered around whether data should be shared, but rather how data can be shared. This seems obvious on its face, but unfortunately data are not always shared. When they are, there may still be barriers to access and use, such as limited knowledge of data sharing standards or a desire to hold data back until the original investigators have published everything they can from them. These reasons expose two problems around data sharing that are not always talked about:
- Sharing data well requires a set of skills and sufficient resources. The staff required to share well aren't always funded and the required skills are not always emphasized or cultivated.
- The culture of science—specifically the state of academic advancement and promotion—does not always incentivize sharing.
We were encouraged to see that both of these problems came up during the CCDI Symposium discussion. This is certainly a step in the right direction in improving the practice of science and was a great note on which to start discussions at the CCDI.
Data sharing and interoperability are of course not enough. We must extract knowledge from these data to truly make an impact and satisfy the objectives of the CCDI: improve outcomes for children with cancer through better options for prevention, diagnosis, treatment and survivorship. This point was highlighted in the summaries of the breakout sessions, which is a measure of the community's awareness of this issue. Nonetheless, we wanted to share our perspective on the matter and on which discussions at the CCDI Symposium excited us the most.
Knowing what to do with data is, frankly, hard and requires specialized expertise. Each of the data modalities noted above has its own pitfalls. Training a machine learning model to predict some outcome could take minutes, but figuring out what data should be input for the model and evaluating whether the model is at all useful takes more time (usually by many orders of magnitude). Furthermore, even when combining available pediatric cancer data, approaches that were originally designed for common adult cancers are likely to be ill-suited for the pediatric cancer space. For these reasons and many more, some of our favorite points from the general and breakout group discussions we participated in were:
- We need to develop gold standard datasets for pediatric cancer. A gold standard dataset is a dataset that is well-characterized via input from experts in a domain or modality and where we know a “true answer” for some set of questions or outcomes. Datasets of this nature allow scientists and clinicians to develop and evaluate methods specifically designed for pediatric cancer or to examine the extent to which methods from other areas, like adult cancers, will work for pediatric cancer. Here’s an example problem that could be facilitated by gold standard datasets: we may find that imaging data is important for a particular problem, but not all pediatric cancer cohorts include imaging data. We can imagine a situation where we could infer the important features of imaging data from another type of data entirely. If a dataset existed where we had imaging data and the other informative type of data, we could design a method to solve the missing images problem.
- If we have high-quality data, researchers who want to design methods to extract knowledge need to be able to use the data. One of the small groups in the “Development of tools and resources to extract knowledge from data” breakout session brought up the idea of a pediatric cancer enclave or sandbox where researchers can compute on data without having to download the data themselves. This setup would alleviate the obstacles to obtaining access to and downloading data that are outlined in “Barriers to accessing public cancer genomic data” from some of the folks at the UCSC Treehouse Initiative (Learned et al. Scientific Data. 2019.). We encourage you to check that paper out if you’re interested in learning more about some of the barriers researchers currently face.
- Attracting the best talent to the field and increasing the computational skills of the pediatric cancer field will be important going forward. Both of the points above remove barriers that can stymie new entrants to the field.
This is a small sample of the “what” that was discussed at the CCDI Symposium, but the “how” is likely to be incredibly challenging. It will take interdisciplinary teams of scientists, engineers, user researchers, and research coordinators to make any or all of this happen. The kind of team that can realize the CCDI isn’t always found in academic science. We hope that this concept remains central as consideration of the CCDI advances.
We look forward to hearing more about the CCDI in the coming months. The Childhood Cancer Data Initiative Ideas site, where you can submit your own ideas, is open until August 23rd: https://cancerresearchideas.cancer.gov/