I’m terrible with names…but I’m using ontologies to try to be better
There is an old joke in computer science about how there are only two hard things: cache invalidation, naming things, and off-by-one errors. I’ll leave aside the first one as beyond my own expertise, but the second comes up all the time in my work as a biological data scientist. Naming variables and functions in my code is a constant struggle, but one I have to deal with on my own or with my team. Much bigger problems come up when trying to deal with all the various ways that people across the world use names when talking about the diseases they work on, the types of cells they are looking at, the experimental methods they are using, and just about every other aspect of their studies.
There are many places in life where different people refer to the same object by different names: Is it a drinking fountain, a water fountain, a bubbler? We usually don’t have too much trouble with this, but it can certainly be confusing at times, and it can make searching for information or categorization difficult. To make things easier, we’d really like to have a single set of names that everybody uses with agreed-upon meanings. Then we would be able to search databases and articles for standard keywords and know that we were finding everything we needed. Well, that is the dream, anyway. As usual, the real world gets in the way.
This is why we can’t have nice things
Accepted names for cell types and diseases often change through time, and experimental methods are always being developed and evolving. For example, phenylketonuria was once known as Følling’s disease, or phenylalaninemia, and it is commonly referred to by the abbreviation PKU. Different biologists may have favorite names for their favorite cell types, or might use different levels of specificity depending on the study. We expect variation like this, but it can make searching across databases and papers for relevant information much more challenging!
So, if we can’t really expect everyone to agree on names, should we give up? Never! All we need to do is to create a database of all the possible names for each thing we care about and assign a code to track that set of names. Then we can look everything up in one place, and know that two different names really refer to the same cell type or the same disease. Easy! If we really wanted to go all-in we could even try to include in our database the relationships among the various entities that we cataloged: whether one cell type is a subtype of another, or ways in which diseases may be related depending on the pathways involved or the organs they affect.
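To make that idea a bit more concrete, here is a minimal sketch of such a database in Python. Everything in it is a stand-in: the `EX:` identifiers are made up, the parent term "metabolic disease" is only there to illustrate a relationship, and the helper names (`build_name_index`, `lookup`, `same_entity`) are hypothetical. Real ontologies are far larger and are distributed in dedicated formats, so treat this as a toy model of the concept rather than how any particular ontology is implemented.

```python
# Toy "ontology": identifiers and the parent term are invented for illustration.
TERMS = {
    "EX:0000001": {
        "label": "phenylketonuria",
        "synonyms": ["Følling's disease", "phenylalaninemia", "PKU"],
        "is_a": ["EX:0000002"],  # relationship to a broader term
    },
    "EX:0000002": {
        "label": "metabolic disease",
        "synonyms": [],
        "is_a": [],
    },
}


def build_name_index(terms):
    """Map every label and synonym (case-insensitively) to its identifier."""
    index = {}
    for term_id, term in terms.items():
        for name in [term["label"], *term["synonyms"]]:
            index[name.casefold()] = term_id
    return index


NAME_INDEX = build_name_index(TERMS)


def lookup(name):
    """Return the identifier for a name, or None if the name is unknown."""
    return NAME_INDEX.get(name.casefold())


def same_entity(name_a, name_b):
    """True only when both names are known and share one identifier."""
    id_a, id_b = lookup(name_a), lookup(name_b)
    return id_a is not None and id_a == id_b


print(same_entity("PKU", "phenylalaninemia"))
print(lookup("bubbler"))
```

The point of the design is that names are never compared to each other directly. Each name is resolved to a code first, and the codes are what get compared, stored, and searched.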
What I’ve just described is an ontology, and they are so choice. If you have the means, I highly recommend picking one up. Unfortunately, building and maintaining an ontology is a tremendous amount of work, for all of the reasons you might expect (remember “everyone agrees”?). Happily, many ontologies already exist, ready for use, complete with dedicated teams behind them maintaining their contents and keeping everything up to date. Now it is our job to start to use them where we can.
Some workers spend their career developing ontologies and ontology-related tools, while few researchers (biologists and physicians) know how ontologies can accelerate their research.
(Rubin et al., 2008)
Ontologies I have known
One of the most famous ontologies in biology is the Gene Ontology, which was developed to describe the biological information that we might have about any given gene. Somewhat confusingly, genes themselves are not part of the ontology itself (naming is hard!), but instead the ontology contains terms like “hexose metabolic process” that describe individual elements of function, location, or biological process.
Just to continue the theme of naming challenges, a term like “hexose metabolic process” is not quite ideal as a consistent way to refer to a concept: for one thing, it only really works if you are speaking English, and we’d like to make our terms more universally useful. Furthermore, we might change the precise words we use for a concept, so something a bit more abstract might be useful for consistency. In most ontologies, this is done by assigning identifiers to each term, which usually consist of a prefix of a few letters followed by a number. For example, the term “hexose metabolic process” (and its synonyms) has the identifier “GO:0019318”.
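If you want to work with these identifiers in code, splitting them into their two parts is a common first step. The sketch below assumes the simple "prefix, colon, digits" shape described above; the pattern and the function name `parse_term_id` are my own illustration, and some ontologies use identifier styles that this pattern would reject.

```python
import re

# Assumed shape: a prefix starting with a letter, a colon, then digits only.
TERM_ID_PATTERN = re.compile(r"([A-Za-z][A-Za-z0-9_]*):(\d+)")


def parse_term_id(term_id):
    """Split an identifier such as 'GO:0019318' into (prefix, local_id).

    The local part stays a string so that leading zeros are preserved.
    """
    match = TERM_ID_PATTERN.fullmatch(term_id.strip())
    if match is None:
        raise ValueError(f"not a prefix:number identifier: {term_id!r}")
    return match.group(1), match.group(2)


prefix, local_id = parse_term_id("GO:0019318")
print(prefix, local_id)
```

Keeping the numeric part as text matters here: converting "0019318" to an integer would drop the leading zeros and no longer match the identifier as it appears in the ontology.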
When we want to describe a gene we can then use those identifiers to annotate everything we know about the gene (potentially assigning multiple annotations to each gene). If everybody does this as they learn new information about their favorite genes, and/or we have dedicated curators to keep the database up to date, we will end up with a large, interconnected database where we can look up genes by annotations and annotations by genes, across a wide and diverse range of studies.
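As a rough sketch of what "genes by annotations and annotations by genes" means in practice, the two lookups can be built from one list of (gene, identifier) pairs. The gene names `geneA` and `geneB` and the `EX:0000010` identifier are hypothetical placeholders; only GO:0019318 comes from the example above, and real annotation databases carry much more information per annotation (evidence, source, and so on).

```python
# (gene, term identifier) pairs. Gene names and the "EX:" identifier are
# hypothetical; GO:0019318 is the "hexose metabolic process" example term.
ANNOTATIONS = [
    ("geneA", "GO:0019318"),
    ("geneA", "EX:0000010"),
    ("geneB", "GO:0019318"),
]


def build_indexes(annotations):
    """Build both lookup directions from a single list of annotations."""
    terms_by_gene = {}
    genes_by_term = {}
    for gene, term_id in annotations:
        terms_by_gene.setdefault(gene, set()).add(term_id)
        genes_by_term.setdefault(term_id, set()).add(gene)
    return terms_by_gene, genes_by_term


terms_by_gene, genes_by_term = build_indexes(ANNOTATIONS)

# Which genes are annotated with hexose metabolic process?
print(sorted(genes_by_term.get("GO:0019318", set())))

# What do we know about geneA?
print(sorted(terms_by_gene.get("geneA", set())))
```

Because both indexes are derived from the same list, a gene can carry any number of annotations and a term can be attached to any number of genes without the two views drifting apart.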
When you start looking for them, ontologies start to show up all over the place. Experimental conditions can be described by the Experimental Factor Ontology, cell types by the Cell Ontology, and diseases by the Disease Ontology, the Human Phenotype Ontology, or the Mondo Disease Ontology, among others.
That last example may be a bit disconcerting: If there is more than one ontology that describes diseases, which one should you pick? I’m afraid I can’t easily answer that, as it probably depends on what other people in your field are using, which may not even be something you know. But if you can pick one, it is far more useful to have some defined terms and ontology identifiers than to have none, and it is usually possible to convert between ontologies if necessary.
Keeping track of all of the ontologies is a challenge in itself. Helpfully, EMBL's European Bioinformatics Institute maintains a searchable catalog of ontologies and their contents via their Ontology Lookup Service. The Ontology Lookup Service also maintains links among the various ontology terms, making it easier to translate between different ontologies that describe the same set of concepts, such as the various disease ontologies I just mentioned.
The Data Lab is diving in
Now, I don’t want you to think that ontologies are always easy to adopt, or that I have all the answers. I’m really just getting started with thinking about how best to take advantage of all of the value that ontologies promise in my own work. But as we go forward, the Data Lab is trying to incorporate ontologies more and more into the tools and databases that we build to encourage interoperability and utility.
- Within refine.bio, our database of processed RNA expression data, we are working to incorporate ontology-based annotations from the MetaSRA project to allow for better search and filtering by disease, experimental factors, and cell lines.
- In the Single-cell Pediatric Cancer Atlas (ScPCA) project, we are working on adding individual cell annotations as Cell Ontology terms to allow for comparisons across different experiments and annotation methods. Because the Cell Ontology also encodes how different cell types are related to one another, this can allow us to more easily compare methods that might classify cells to different levels of specificity.
- We also continue to use Gene Ontology and other functional ontologies when performing gene set enrichment analyses of various types.
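To illustrate the point about comparing cell type annotations at different levels of specificity, here is a small sketch. It assumes a heavily simplified hierarchy written with plain labels; the real Cell Ontology uses identifiers and has many more intermediate terms, and the function names `ancestors` and `compatible` are hypothetical rather than part of any ScPCA tool.

```python
# Child -> parents. A simplified, illustrative hierarchy using labels in place
# of Cell Ontology identifiers.
IS_A = {
    "CD4-positive T cell": ["T cell"],
    "CD8-positive T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "lymphocyte": [],
}


def ancestors(term, is_a):
    """Return every broader term reachable by following is_a links upward."""
    found = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(is_a.get(parent, []))
    return found


def compatible(term_a, term_b, is_a):
    """True if the terms match or one is a more specific kind of the other."""
    return (
        term_a == term_b
        or term_a in ancestors(term_b, is_a)
        or term_b in ancestors(term_a, is_a)
    )


# One method says "T cell", another says "CD4-positive T cell": no conflict.
print(compatible("T cell", "CD4-positive T cell", IS_A))

# Sibling types share a parent but are genuinely different calls.
print(compatible("CD4-positive T cell", "CD8-positive T cell", IS_A))
```

With plain text labels, "T cell" and "CD4-positive T cell" would simply count as a mismatch. With the hierarchy available, the first comparison can be treated as agreement at a coarser level, while the second is flagged as a real disagreement.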
Hopefully I’ve convinced you that ontologies are, at the very least, an idea worth spending a bit of time exploring. Next time you are assigning cell types or categorizing samples, maybe you will take a bit of time to see if there is an ontology that is right for you and your data. It won’t solve all the challenges of naming things, but it is a good place to start.