Collaborating with the Data Lab on OpenPBTA shaped how our team works reproducibly
This is a guest blog post by Jo Lynne Rokita, PhD, leader of the Bioinformatics Translational Pediatric Oncology Team at the Center for Data-Driven Discovery in Biomedicine (D3b) at Children’s Hospital of Philadelphia (CHOP).
At the Center for Data-Driven Discovery in Biomedicine (D3b), I lead the Bioinformatics Translational Pediatric Oncology Team, a team of bioinformatics scientists. Our mission is to advance pediatric oncology research and precision medicine through collaboration and development of open-source analytical tools, frameworks, and data resources. In 1998, I lost my four-year-old cousin John Matthew to a brain tumor we now know was likely a diffuse intrinsic pontine glioma. So, it was bittersweet for me to see the Open Pediatric Brain Tumor Atlas (OpenPBTA) manuscript published in Cell Genomics on the last day of Brain Tumor Awareness Month this past year [1]. But let’s rewind.
By the end of 2018, the Children’s Brain Tumor Network (CBTN), through D3b, had generated and released the Pediatric Brain Tumor Atlas (PBTA), a collection of genomic data for over 1,000 pediatric brain tumors. I remember being inspired by the open crowd-sourced Deep Review manuscript [2] and by a bit of Twitter brainstorming with Casey Greene, then Director of Alex’s Lemonade Stand Foundation’s (ALSF) Childhood Cancer Data Lab, about how to maximize the impact of the available PBTA data.
In 2019, just in time for Childhood Cancer Awareness Month that year, D3b, along with ALSF, CBTN, and many collaborators, announced the OpenPBTA, which we intended to be a “speedy” project. The COVID-19 pandemic (among other things) had other plans. But we powered through what became a fun, challenging, educational, and immensely rewarding project. What began as a “crazy idea” later became the first open and crowd-sourced analysis of pediatric brain tumor genomic data that allowed for contributions from experts across the globe. Aside from the manuscript and data distribution resource we have provided the pediatric oncology community, I frequently reflect on how grateful I am to Jaclyn Taroni, current Director of ALSF’s Childhood Cancer Data Lab and OpenPBTA collaborator, and the entire Data Lab team for their perseverance and commitment to training and maintaining standard processes throughout the duration of the project. The OpenPBTA has reshaped not only how my team works together, but also how we collaborate across teams at D3b.
Working on the OpenPBTA was fun, but as the first open analysis project of its kind, it was very challenging!
Turns out building things from scratch is hard. OpenPBTA was no different, and although we built the GitHub and Docker infrastructure in the open, we also had to build new processes behind the scenes at D3b. Some of the challenges included:
Learning the technologies. Until the OpenPBTA, D3b was focused on brain tumor specimen and data collection, model generation, sequencing, and harmonization. It was time to propel analysis of the PBTA forward. D3b’s bioinformatics analytical teams were in their infancy and, admittedly, we had no documentation in place for working collaboratively or reproducibly. Our bioinformatics scientists weren’t routinely and uniformly using GitHub or Docker, and we certainly were not doing code reviews. Combining the Data Lab’s technical expertise with D3b’s domain knowledge of brain tumors was an ideal pairing.
Building a framework for the histologies file. Ahhh, the histologies file. After my experience leading the genomic analysis of patient-derived xenograft (PDX) models from the multi-institutional Pediatric Preclinical Testing Program [3], my first goal with OpenPBTA was to create a histologies file with essential and harmonized fields. Looking back at some of my initial code to create this file, I was combining data from more than 25 files; nothing was in one place. Manually re-pulling and re-collating files was unsustainable and, in the spirit of OpenPBTA, we wanted something reproducible. Somewhere around the 15th data release, we decided with our ADAPT (Advance Data and Platform Technologies) Unit that this data needed to be in a database. Since then, we have built and currently maintain a data warehouse of tables written in dbt (data build tool). The data warehouse pulls together genomic file attributes from Kids First, specimen attributes from Nautilus, and patient attributes from REDCap into a histologies table used to generate `pbta-histologies-base.tsv` for OpenPBTA data releases. As we harmonize data and generate additional data modalities (single nucleus RNA-Seq, proteogenomics, etc.), we continue building upon the histologies table in the D3b data warehouse.
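To make the idea of collating three sources into one histologies table concrete, here is a minimal Python sketch. It is only an illustration of the join, not the D3b warehouse itself, which is written in dbt. The function names, the field names (`file_id`, `specimen_id`, `patient_id`, and so on), and the toy records are all hypothetical.

```python
import csv
import io


def build_histologies(genomic_files, specimens, patients):
    """Join file-, specimen-, and patient-level records into one row per genomic file.

    genomic_files: list of dicts, each with a "specimen_id" key (hypothetical schema)
    specimens:     dict of specimen_id -> dict that includes "patient_id"
    patients:      dict of patient_id -> dict of patient attributes
    """
    rows = []
    for f in genomic_files:
        specimen = specimens.get(f["specimen_id"], {})
        patient = patients.get(specimen.get("patient_id"), {})
        # On a key clash, file-level values win over specimen-level values,
        # which in turn win over patient-level values.
        rows.append({**patient, **specimen, **f})
    return rows


def to_tsv(rows, columns):
    """Serialize rows to TSV text; attributes a row lacks are written as NA."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter="\t",
        restval="NA",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy records; every identifier and value here is made up for illustration.
genomic_files = [
    {"file_id": "GF_1", "specimen_id": "SP_1", "experimental_strategy": "RNA-Seq"},
    {"file_id": "GF_2", "specimen_id": "SP_2", "experimental_strategy": "WGS"},
]
specimens = {
    "SP_1": {"patient_id": "PT_1", "tumor_descriptor": "Initial CNS Tumor"},
    "SP_2": {"patient_id": "PT_1", "tumor_descriptor": "Progressive"},
}
patients = {"PT_1": {"reported_gender": "Male"}}

histologies = build_histologies(genomic_files, specimens, patients)
columns = ["file_id", "specimen_id", "patient_id", "experimental_strategy",
           "tumor_descriptor", "reported_gender"]
print(to_tsv(histologies, columns))
```

Keeping the join in one function means a new source or field changes one place rather than a chain of manual re-collation steps, which is the same motivation that led us to a database.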
Developing systematic data releases. D3b serves as the Data Resource Center for Kids First, so we had the majority of somatic pipelines already benchmarked and had processed data ready to go. (Copy number was something we benchmarked during OpenPBTA.) We had to figure out how to merge and release the data systematically, such that if we added or removed data, we could easily update our merged matrices. We divided these tasks across people and did this mostly ad hoc in CAVATICA, using the `pbta-histologies-base.tsv` file as the sample cohort. These ad hoc requests were error-prone, and they often disrupted scrum sprint planning or had to be fit in among the center priorities. I’m not sure if we ever figured out how to do this well during OpenPBTA (heck, we had 23 data releases!), but in the time since, we developed an Apache Airflow DAG (directed acyclic graph) to schedule weekly merges for each primary analysis file (fusion calls, expression values, consensus SNVs, CNV calls, SV calls, splice variants). Further, we created a pre-release QC module to ensure that data release files neither omit samples nor include samples that are absent from the histologies file. This enables a more real-time integration of incoming samples into the merged matrices used for data releases and molecular tumor board analyses (more on that later).
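The kind of check a pre-release QC step performs can be sketched in a few lines of Python. This is not our actual QC module; the function name, file names, and sample IDs are hypothetical. It also makes the simplifying assumption that every sample in the histologies file is expected in every merged file, whereas a real check would first restrict the cohort to the samples relevant to each data type.

```python
def check_release_samples(histology_ids, release_files):
    """Compare the sample IDs in each merged file against the histologies cohort.

    histology_ids: iterable of sample IDs listed in the histologies file
    release_files: dict of file name -> iterable of sample IDs found in that file

    Returns a dict of file name -> {"missing": [...], "extra": [...]} containing
    only the files that disagree with the cohort.
    """
    expected = set(histology_ids)
    problems = {}
    for name, ids in release_files.items():
        observed = set(ids)
        missing = sorted(expected - observed)  # in histologies, absent from file
        extra = sorted(observed - expected)    # in file, absent from histologies
        if missing or extra:
            problems[name] = {"missing": missing, "extra": extra}
    return problems


# Toy cohort and merged files; all names and IDs are made up for illustration.
histology_ids = ["BS_001", "BS_002", "BS_003"]
release_files = {
    "fusion-calls.tsv": ["BS_001", "BS_002", "BS_003"],
    "expression.tsv": ["BS_001", "BS_003"],
    "cnv-calls.tsv": ["BS_001", "BS_002", "BS_003", "BS_999"],
}

report = check_release_samples(histology_ids, release_files)
for name, issue in report.items():
    print(name, issue)
```

Returning a report rather than raising on the first mismatch lets a release manager see every discrepancy at once; a scheduled pipeline could then block the release whenever the report is non-empty.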
Collaborating across D3b teams. D3b currently has nearly 100 full-time employees spread across seven multi-team units (panel A, below). In panel B, I highlight the steps within the OpenPBTA project in which each of five D3b units was integral to our data releases. Given the scale of D3b, we periodically strategize how best to align priorities across units and teams. The OpenPBTA taught us how to better communicate across teams. For example, we have created cross-team sprint planning meetings, regularly scheduled 1:1s between leads, synchronized unit and team priorities by quarter, and created resource management plans to ensure staffing across center projects.
Working on the OpenPBTA was a learning experience and improved collaboration within our team.
While the Data Lab had adopted the R `tidyverse` and its style guide, I was far from proficient in it. Turns out code review will get you up to speed quickly! Below are some ways that working on the OpenPBTA has influenced the standard practices in our Bioinformatics Translational Pediatric Oncology Team:
- GitHub and repository organization - we create research project-specific GitHub repositories following a similar structure to that of OpenPBTA, for example, the alternative lengthening of telomeres project
- Data releases - we maintain versioned project-specific data release files on AWS S3
- Docker - we require each project-specific repository to contain a Dockerfile with required packages used to run all analyses
- R tidyverse - we adopted R tidyverse and its style guide for writing code across repositories
- Pull requests (PRs) and code review - while each bioinformatics scientist leads 1-2 research projects and is responsible for maintenance of their repositories, all scripts and modules are added through a PR model and code is reviewed by 1-2 others in the group
- GitHub Actions - we use GitHub Actions as our continuous integration tool of choice for OpenPedCan, running test data through analysis modules on each PR (more on this project below!)
You can read more of the Data Lab’s thoughts on setting up data-intensive research for success and how carrying out the work of OpenPBTA influenced their own philosophy and processes.
Co-leading the OpenPBTA was a highlight of my career.
We have expanded OpenPBTA into an open pan-pediatric cancer analysis project, OpenPedCan, which now contains over 2,700 PBTA tumors harmonized with data from TARGET, GMKF Neuroblastoma, and CHOP’s Division of Genomic Diagnostics. Perhaps the most rewarding part about working on the OpenPBTA is that it permeated various domains within D3b in real time. Specifically, the data we released was simultaneously used as the foundation for research and molecular tumor board analyses (below).
Molecular Tumor Boards. The Clinical and Translational Data Sciences Unit (CTDS) at D3b leverages the automated merged output from the Airflow DAG and the histologies table from the D3b data warehouse as its underlying data for molecular tumor board-specific analysis modules. To support a more real-time analysis, D3b’s Bioinformatics Unit (BIXU) developer team scaled many OpenPBTA and OpenPedCan modules to work in CAVATICA and now routinely uses a separate Airflow DAG. CTDS works with the Pacific Pediatric Neuro-Oncology Consortium (PNOC) and CHOP’s Frontier precision medicine initiative to support evidence-based therapy recommendations.
D3b and Collaborative Research. We worked with internal and external collaborators to develop new tools (annoFuse and ShinyFuse [4], medulloClassifier [5]) and contribute to research projects [6–8]. OpenPBTA and OpenPedCan data releases were, and remain, the source data for research projects, and we are currently exploring Git submodules on GitHub as a way to pull OpenPedCan data into project-specific repositories to avoid duplicate storage on AWS S3.
Pediatric Molecular Targets Platform. We collaborated with Frederick National Laboratory to create the NCI’s Pediatric Molecular Targets Platform using OpenPedCan as the underlying data source.
A warm thank you to ALSF for supporting the OpenPBTA, to the Data Lab for partnering with D3b and investing in each contributor, and to each family whose specimen donations to the CBTN or PNOC made this possible. We are excited to learn how OpenPBTA and OpenPedCan are being used by others and truly hope these data resources will break down barriers to data access and empower researchers to make new discoveries and translate therapies across the pediatric oncology domain.
Jo Lynne Rokita can be contacted by email at rokita@chop.edu.
Bibliography
1. Shapiro, J. A. et al. OpenPBTA: The Open Pediatric Brain Tumor Atlas. Cell Genomics 3, (2023).
2. Ching, T. et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 15, (2018).
3. Rokita, J. L. et al. Genomic Profiling of Childhood Tumor Patient-Derived Xenograft Models to Enable Rational Clinical Trial Design. Cell Rep. 29, 1675–1689.e9 (2019).
4. Gaonkar, K. S. et al. annoFuse: an R Package to annotate, prioritize, and interactively explore putative oncogenic RNA fusions. BMC Bioinformatics 21, 577 (2020).
5. Rathi, K. S. et al. A transcriptome-based classifier to determine molecular subtypes in medulloblastoma. PLoS Comput. Biol. 16, e1008263 (2020).
6. Kline, C. et al. Upfront biology-guided therapy in diffuse intrinsic pontine glioma: therapeutic, molecular, and biomarker outcomes from PNOC003. Clin. Cancer Res. (2022) doi:10.1158/1078-0432.CCR-22-0803.
7. Dang, M. T. et al. Macrophages in SHH subgroup medulloblastoma display dynamic heterogeneity that varies with treatment modality. Cell Rep. 34, 108917 (2021).
8. Stundon, J. L. et al. Alternative lengthening of telomeres (ALT) in pediatric high-grade gliomas can occur without ATRX mutation and is enriched in patients with pathogenic germline mismatch repair (MMR) variants. Neuro. Oncol. 25, 1331–1342 (2023).