Setting your research up for success in a data-driven world
Before working as a Data Scientist at the Childhood Cancer Data Lab, I spent my PhD and post-doctoral fellowship in two very different research environments. Each had its own way of doing research. Some things worked really well; others were less successful. The one constant was that the processes I used to analyze my data were largely up to me and each individual member of the lab, with little standardization across lab members. I also spent a lot of time teaching myself how to perform analyses and write code, and Google became my best friend. This approach worked and I was able to produce results, but not always in the most efficient or effective manner.
Since joining the Data Lab, I have learned that time invested up front in standardized, efficient, and organized processes leads to more reproducible research and sustainable workflows, and saves time in the long run. Many of the processes the Data Lab uses could be transferred to any research environment to support more reproducible and efficient research.
Here are a few of the main struggles I experienced while analyzing data in my research before I joined the Data Lab:
- No open sharing of code across lab mates - In my previous research experience, I was mostly responsible for deciding how to organize my code. My files were stored either on my personal computer or in my personal folder on a shared drive or server, organized in a way that only I could understand. My colleagues each had their own folders housing their own version of the same workflow we were all running, but every script was slightly different from the others. Each of us had invested time writing and debugging that workflow, and it was unlikely that anyone else would be able to reuse it.
- Keeping a version-tracked history - How many times have you appended “_versionX” or “_final” to the end of a document, resulting in multiple copies of the exact same document that differ slightly from the previous version? I must admit that I used to be guilty of this, keeping my own private version of a document’s history on my personal computer.
- Starting from scratch (every time) - Every time I had to analyze a different dataset, I would open a brand new R notebook or R script and start from scratch. I would then perform the same analysis I had done on similar samples, copying and pasting lines of code from previous notebooks and replacing bits and pieces like a sample name or a path to a file. This resulted in multiple copies of an analysis that performed the same function but with slightly different code (see the sketch after this list). Each time I performed an analysis, all of the code related to it lived in one single script, no matter how long or complicated it was. No one was reviewing my code or telling me to break things up into smaller, digestible pieces.
- Code without documentation - Because I spent so much time on the data analysis itself, I tended to forget to write down exactly what I was doing and instructions for how to do it. This made things extremely difficult when I had to go back and redo something for paper revisions or regenerate a figure with a slightly different font size. Not documenting my code, either with comments or through README files, led to many long nights of digging through code trying to remember exactly what each script did.
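To make that copy-and-paste problem concrete, here is a minimal sketch in R, using hypothetical file names and columns, of how a repeated analysis can be wrapped in a parameterized function. Re-running it on a new sample then means changing an argument, not editing yet another copy of the script.

```r
# Minimal sketch (hypothetical example, not Data Lab code): wrap the repeated
# analysis in a function instead of copying and pasting a whole script.
summarize_sample <- function(counts_file, sample_name) {
  counts <- read.csv(counts_file)      # the sample's count table
  data.frame(
    sample = sample_name,
    n_genes = nrow(counts),            # example summary values
    total_counts = sum(counts$count)
  )
}

# The same analysis for each new sample is now one call, not a new script:
# summarize_sample("data/sampleA_counts.csv", "sampleA")
# summarize_sample("data/sampleB_counts.csv", "sampleB")
```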
Ensuring others can replicate your success
These struggles became even more apparent after joining the Data Lab and learning how the team performs research. One of the things we pride ourselves on at the Data Lab is our ability to create reproducible workflows. Every time we write code, we want to make sure that someone else can both interpret it and run it on their own. This not only helps other people who will want to use the code, it also helps future you. The more time you invest up front in creating reproducible code and workflows, the more time you will save analyzing similar data down the line. Below are the steps we take to ensure that our code is reproducible and efficient:
- Share It - At the Data Lab, we work on all of our projects collaboratively and everything is shared with our teammates. We use GitHub, a platform that allows for easy communication and collaboration among teammates, to store all of our code and workflows. Each project’s code is stored in its own GitHub repository, and all teammates have access to all of our repositories. Having a system that encourages open access to code, protocols, documentation, and data across lab members, as appropriate, allows our team to work in sync.
- Track It - We also track all of our work on GitHub, which keeps a versioned history of all documents. Many people assume that GitHub is just for code, but in fact, we use GitHub to keep track of documentation, too. With version control, we can continue to make edits on the same document, but still return to previous versions if we change our minds later! No more duplicate documents floating around with “_v2” appended to the filename. We also use issues on GitHub to keep track of to-do items for a project or any ideas that may come up while working on that project.
- Review It - One of the best practices in creating reproducible workflows is code review. Code review means that every time someone makes changes to an existing workflow or creates a new one, those changes have to be reviewed by another team member. Review ensures that the steps we are taking are accurate, and it also helps teach us how to write readable and reproducible code, saving us time in the long run!
- Document It - Having code or workflows without documentation is like having a recipe with a list of ingredients and no instructions on how to combine them. Taking the time to write documentation about each step of your process as you develop it means the steps you took are fresh in your mind! It also means that when you return to that workflow months (or years) later, you can refer to the documentation to remind yourself of exactly what you did and the decisions you made at the time (see the sketch just below this list).
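As a small illustration of what documenting as you go can look like, here is a hedged R sketch of a step function annotated with roxygen-style comments describing its inputs and outputs. The function name, arguments, and file layout are hypothetical, not an actual Data Lab workflow.

```r
# Sketch only: a hypothetical trimming-step wrapper with the kind of
# documentation we try to write while the details are still fresh.

#' Trim a pair of FASTQ files.
#'
#' @param fastq_1 Path to the R1 FASTQ file.
#' @param fastq_2 Path to the R2 FASTQ file.
#' @param output_dir Directory where trimmed FASTQ files will be written.
#' @return A character vector with the paths to the trimmed FASTQ files.
trim_fastq_pair <- function(fastq_1, fastq_2, output_dir) {
  dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
  trimmed <- file.path(output_dir, basename(c(fastq_1, fastq_2)))
  # ... call your trimming tool of choice here and write the trimmed files ...
  trimmed
}
```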
Keep reading below to see an example of how we would approach developing one of our own reproducible workflows.
Putting it all together
Let’s take a behind-the-scenes look at the steps we would take to develop one of our workflows and see how each of the tools we discussed plays an important role in maximizing success.
Say we are developing a workflow for processing RNA-sequencing data and we want to trim the FASTQ files, align them, and then perform quantification. The following is a series of smaller steps we would take to complete this larger task:
- Plan out your work - The first thing we should do before starting any coding is to plan out the steps we need in our workflow. Theoretically we could write the entire workflow in one step without much planning - that’s what I would have done before I joined the Data Lab. Since joining the team, I’ve learned that breaking up tasks into smaller units of work leaves less room for error.
- File Issues - Here we have three smaller steps that we will set up separately - trimming, alignment, and quantification - each with its own input and output. Developing each step of the workflow becomes its own task, and we would file each task as an issue on the corresponding GitHub repository and assign it to a member of our team.
- Develop - For each individual step of the workflow, a team member would develop the code, along with its documentation, that takes the necessary input, performs the specific task at hand (e.g., RNA-sequencing alignment), and outputs the desired files.
- Code Review - That piece of code is then submitted as a pull request on GitHub and reviewed by another team member to ensure accuracy and reproducibility before being merged into the main branch of the repository. Those changes (and the history of those changes) are immediately accessible to the entire team.
- Repeat for all parts of the workflow - This process is repeated for every step of the workflow until each task is completed. The end result is a single workflow made up of individual steps, each of which reproducibly performs its intended piece of the analysis.
After we finish these steps, we will have a complete RNA-sequencing workflow that takes FASTQ files as input, trims them, performs alignment, and then quantification. The beauty of this process is that we can now use this workflow over and over again for all of our RNA-sequencing analyses. We no longer need to create a new script for every single sample; instead, we can run these steps by supplying the FASTQ files for a specific sample, as in the sketch below. Check back soon for more information on using workflow managers to build workflows.
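To give a feel for what the finished product might look like, here is a minimal R sketch of a top-level function that strings the three steps together. The step functions are hypothetical stubs standing in for the individually developed and reviewed pieces, not our actual implementation.

```r
# Sketch only: composing individually developed, reviewed steps into one
# workflow. The step functions below are hypothetical placeholders.
trim_fastq_pair <- function(fastq_1, fastq_2, out_dir) file.path(out_dir, basename(c(fastq_1, fastq_2)))
align_reads     <- function(fastqs, out_dir)           file.path(out_dir, "aligned.bam")
quantify_counts <- function(bam, out_dir)              file.path(out_dir, "quant.tsv")

run_rnaseq_workflow <- function(fastq_1, fastq_2, output_dir) {
  trimmed   <- trim_fastq_pair(fastq_1, fastq_2, file.path(output_dir, "trimmed"))
  alignment <- align_reads(trimmed, file.path(output_dir, "aligned"))
  quantify_counts(alignment, file.path(output_dir, "quant"))
}

# One call per sample instead of one hand-edited script per sample:
# run_rnaseq_workflow("sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz", "results/sampleA")
```

Because each step was developed, documented, and reviewed on its own, improving or swapping out one piece later does not require rewriting the whole workflow.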
Read more about writing reproducible code and find more resources in our recent blog post, The Childhood Cancer Data Lab's not-so-secret sauce for efficient workflows — aka Philadelphia’s third most famous process, authored by a former Data Lab team member.
Sharing our knowledge
Are you interested in learning more about the processes we have in place and how they can be replicated in your lab setting? The Data Lab is eager to share information that will help you and your colleagues accelerate the pace of your work! Fill out this form to be notified about future materials and offerings like this.