It is not particularly easy to analyze any type of -omics data. The researcher has to handle technical variation, biological variation, sample quality, sequencing variability, and appropriate controls (and that is just on the experimental side). Expanding those sources of variability across multiple platforms makes the research exponentially more difficult. It seems that no experimental design can build in enough controls to alleviate concerns about every potential confounder. However, systems biology can and must address these issues. It then comes down to how the systems biology community will solve these problems. Will it be at the level of each nucleotide? Or maybe each gene? I would argue that the ideal way to connect different types of -omics data is at the level of cellular processes.
As human beings, we can only process so much information. Computational resources can certainly expand the way researchers look at data, but we are still limited by what people can process. For this reason, I am inclined to say that we cannot connect data at the level of the nucleotide, or even the gene. Tens of thousands of genes exist, each with different connections to other genes and to networks of genes. Developing a model that connects -omics data at the scale of an entire system is therefore not only intractable; it may also be counterproductive. As systems biologists, we are committed to looking at a system through a wider lens. That wider lens may encompass more information than other scientists care to think about, but it does not mean the information has to be considered at the same reductionist level (e.g. the nucleotide or the gene). Maybe by looking at the system in a coarser way, we can actually reveal information that would not be identified otherwise.
Looking at data at the process level lowers the number of variables. There may also be structure among those variables that reduces the space even further. A great example of this process level is the Gene Ontology, or GO terms. GO terms are organized as a network with different levels, ranging from broad to narrow, connected to one another in various ways. Different types of information can therefore be compared on the same playing field, just a smaller one than usual. Additionally, the relatedness of GO terms makes it clearer how to compare individual processes. Many genes and nucleotides may contribute to a single process, but summarizing them in a GO term retains the information while making it easier to comprehend.
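To make the idea of GO-term relatedness concrete, here is a minimal sketch of how similarity between two terms can be derived from the structure of the ontology itself. The tiny DAG and term names below are purely illustrative (real analyses use the full Gene Ontology and information-content-based measures such as Resnik or Wang); the point is that terms sharing more of their ancestry score as more related.

```python
# Minimal sketch: semantic similarity between two GO-like terms on a toy DAG.
# The DAG and term names below are hypothetical, not real GO identifiers.

# child -> set of direct parents ("is_a" edges), with broad terms at the top
TOY_GO = {
    "biological_process": set(),
    "signal_transduction": {"biological_process"},
    "cell_cycle": {"biological_process"},
    "MAPK_cascade": {"signal_transduction"},
    "mitotic_cell_cycle": {"cell_cycle"},
}

def ancestors(term):
    """All ancestors of a term in the DAG, including the term itself."""
    seen = {term}
    frontier = [term]
    while frontier:
        for parent in TOY_GO[frontier.pop()]:
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen

def similarity(a, b):
    """Jaccard overlap of ancestor sets: 1.0 for identical terms,
    shrinking as the terms sit farther apart in the hierarchy."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)

# Two narrow terms that share only the root score low:
print(similarity("MAPK_cascade", "mitotic_cell_cycle"))  # -> 0.2
```

The ancestor-overlap measure here is one of the simplest options; its virtue for this argument is that the similarity comes entirely from the ontology's structure, not from the underlying genes.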
A great example of this type of integration is the work I have been doing on connecting genetics and epigenetics. In this example, genetics refers to the key mutations that have an effect on phenotype (i.e. drug response), and epigenetics refers to the stable states of the gene regulatory networks that define cellular behaviors. By these definitions, the two types of data (and the -omics that produced them) are inextricably linked: any change in genetics will change the entire landscape of epigenetic states. Therefore, we needed to find a way to understand the effects of mutations on epigenetic states. The caveat, and the need for multi-omics integration in this case, stems from the fact that genetics and epigenetics do not relate in a 1:1 manner. Just because a sample acquires a mutation does not mean that the mutated gene will show a change in its epigenetics (measured by gene expression); in fact, that kind of direct connection is rarely the case. Much more likely, a mutation will change the signaling processes of the network, setting into motion a cascade of events that modifies the network's stable states. A method is therefore necessary to look more broadly than at the level of individual genes, or even a modeled network. Enter GO term semantic similarity. Semantic similarity is a method by which individual (and multiple) GO terms are compared based on the structure of the GO network. In my recent work, we take the differentially expressed genes between samples and the genes with unique mutations in those samples, and feed each list into a GO term enrichment analysis. The statistically significantly enriched terms from each -omics data type are then compared to one another via semantic similarity, yielding a single number that encapsulates the connection between genetics and epigenetics. When that number is large, there is a clear connection, and genetics are driving the phenotype.
In the opposite case, we can conclude that genetics have less influence and that phenotypic differences are driven more so by epigenetic differences. In this way, we take a systems biology approach to connecting these data types while retaining the simplicity of a single number corresponding to connectedness.
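The collapse from two enriched term sets to a single connectedness score can be sketched as follows. This is not the exact aggregation from my analysis; it shows one common scheme, the best-match average (BMA), and all term names and pairwise similarity values below are hypothetical placeholders for what an enrichment analysis and a semantic similarity measure would produce.

```python
# Sketch of the aggregation step: collapsing pairwise GO-term similarities
# between a genetics-derived and an epigenetics-derived enriched term set
# into one number. Term names and similarity values are illustrative only.

genetics_terms = ["MAPK_cascade", "apoptosis"]
epigenetics_terms = ["mitotic_cell_cycle", "signal_transduction"]

# Hypothetical pairwise semantic similarities, keyed as (genetics, epigenetics);
# in practice these come from a measure such as Resnik, Lin, or Wang.
pairwise = {
    ("MAPK_cascade", "mitotic_cell_cycle"): 0.20,
    ("MAPK_cascade", "signal_transduction"): 0.85,
    ("apoptosis", "mitotic_cell_cycle"): 0.30,
    ("apoptosis", "signal_transduction"): 0.25,
}

def best_match_average(set_a, set_b, sim):
    """For each term, take its best match in the other set, then average
    all best-match scores from both directions into one number."""
    best_a = [max(sim[(a, b)] for b in set_b) for a in set_a]
    best_b = [max(sim[(a, b)] for a in set_a) for b in set_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))

score = best_match_average(genetics_terms, epigenetics_terms, pairwise)
print(round(score, 3))  # -> 0.575
```

A high score would suggest the mutation-derived and expression-derived processes overlap (genetics driving phenotype); a low score would point toward epigenetically driven differences.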
There is no doubt that comparing multi-omics data is a tall task. Many approaches have been proposed to connect these data types, but a clear consensus has not been reached. Although it lacks the elegance of computationally expensive models, comparison at the process level has its own advantage in its simplicity. Invariably, comparison at this level will miss some fine-grained interactions between the data, but it may also capture broader connections that other approaches would have missed. Is it perfect? Of course not. Does it get the job done? I'll leave that up to you to decide.