I talked last time about the data tsunami, but what is it really, and why is it happening? And why should we care?
Biological research is a data-intensive endeavour.
The trends in genomics and related areas all show the same thing: the number of data sets being generated and published increases year on year, exponentially. The figure above is an example Google image search illustrating this idea, with many graphs of different omics data all going up and up. The same is true on the commercial side, with bioinformatics and genomics market reports predicting double-digit CAGRs driven by falling prices and rising demand. Meanwhile, omics technology continues to improve, giving us ever better, more detailed and more complex measurements, often at higher throughput to give more data points each time. The biological insights and knowledge gained from these experiments continually feed back into the system, allowing us to ask –and answer!– ever more sophisticated questions as the field evolves.
While new data generation is increasing, the old data does not go away. Journals and funding bodies generally require that data sets are uploaded to public databases after publication, meaning that many, or even most, of these sets are available for everyone to use. Even after it has been analysed, each data set still exists in its original form and retains its biological value. This value may even increase over time as we compile different types of data for a specific organism (i.e. the plant or animal being studied), allowing new combinations that give more complete information than any one experiment alone.
Taken together, this has resulted in the data tsunami: a gathering wave of biological measurements with rich potential for new ideas, new knowledge, and new answers to our most pressing problems.
Data integration and analysis across molecular layers are difficult.
So the data accumulates. How do we make the most of this potential? The ability to analyse different types of omics data from the same sample, all at once, can improve our understanding of the underlying interactions and regulation within and between the different molecular layers. We can then use this understanding to predict how a plant or animal will change in response to outside interventions, allowing us to optimise their growth and wellbeing. However, this is not an easy task, with several major challenges standing in the way.
The first problem is harmonising all of the data produced when measuring the different molecules (DNA, mRNA, proteins, metabolites) into a single multi-omics package ready for biological interrogation. Their different physical and chemical properties influence the types of measurements that can be taken, and different measurement platforms or software packages may output even similar data in different formats. Together this causes interoperability problems, where the data created by one system may not be read or understood by another. Add in differences in data cleaning, normalisation, and transformation, and just getting all the data in one place can be a significant hurdle.
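To make this concrete, here is a minimal, hypothetical sketch of what harmonisation can involve: two invented mini-tables, one of transcript counts and one of metabolite intensities, use different sample-naming conventions and sit on very different scales, so the sample IDs must be mapped and each feature normalised before the layers can be merged. All names and values below are made up for illustration; real pipelines use dedicated tools, but the same decisions have to be made somewhere.

```python
from statistics import mean, stdev

# Hypothetical mini-tables; platform names, sample IDs, and values
# are invented for illustration.
rna = {"S1": {"geneA": 120.0, "geneB": 30.0},   # transcript counts
       "S2": {"geneA": 95.0,  "geneB": 44.0},
       "S3": {"geneA": 130.0, "geneB": 21.0}}

metab = {"sample_S1": {"m1": 0.8},              # metabolite intensities:
         "sample_S2": {"m1": 1.1},              # same samples, but a
         "sample_S3": {"m1": 0.9}}              # different ID convention

def zscore(table):
    """Put every feature on a comparable scale: mean 0, sd 1 across samples."""
    features = {f for row in table.values() for f in row}
    out = {s: {} for s in table}
    for f in features:
        vals = [table[s][f] for s in table]
        mu, sd = mean(vals), stdev(vals)
        for s in table:
            out[s][f] = (table[s][f] - mu) / sd
    return out

# Step 1: map platform B's sample IDs onto platform A's convention.
metab_harmonised = {s.removeprefix("sample_"): row for s, row in metab.items()}

# Step 2: normalise each layer separately, then merge them per sample.
rna_z, metab_z = zscore(rna), zscore(metab_harmonised)
merged = {s: {**rna_z[s], **metab_z[s]} for s in rna}
print(merged["S1"])
```

Even in this toy case, the ID mapping and the choice of normalisation are explicit decisions; in practice each one is a potential source of the interoperability problems described above.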
Then we need to integrate the different data types into one analysis framework that can identify true causal interactions and associations. For example, while RNA-seq or microarrays can measure tens of thousands of transcripts with high coverage, mass spectrometry may only profile a few hundred or thousand proteins or metabolites from the same sample. This can introduce annotation bias, where the transcript data carries more weight simply because there is more of it, even though the proteins may be more functionally meaningful. Additionally, small changes in very low-concentration proteins or metabolites (e.g. cytokines, eicosanoids) may be lost among all the transcript data, even when they have strong effects on the plant or animal. On the microbiome side, a key challenge is moving from species or strain abundances (i.e. which microbes are there) to modelling the metabolic capacity of the whole microbiome, since it can be argued that the latter is what actually causes a phenotype. This becomes even more complex when adding in the host metabolism, as there are likely to be strong feedback loops within the whole system.
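One common way to counter this size imbalance, used for example in multi-block methods such as multiple factor analysis, is to scale each omics block so that it contributes equal total variance regardless of how many features it has. A tiny numerical sketch, with feature counts invented for illustration:

```python
import math

# Invented feature counts: a transcriptome block vs a proteome block.
n_transcripts, n_proteins = 20000, 500

# If every feature is standardised to variance 1, a block's total
# variance equals its feature count, so transcripts would dominate
# any distance- or PCA-style analysis by this factor:
raw_imbalance = n_transcripts / n_proteins

# Block scaling: weight each block's features by 1/sqrt(block size),
# so every block contributes the same total variance (here, 1.0).
w_rna = 1 / math.sqrt(n_transcripts)
w_prot = 1 / math.sqrt(n_proteins)

rna_block_var = n_transcripts * w_rna ** 2
prot_block_var = n_proteins * w_prot ** 2
print(raw_imbalance, rna_block_var, prot_block_var)
```

Equal block weights are only one choice, of course; the low-concentration-but-important molecules mentioned above are exactly the case where a biology-informed weighting may be needed instead.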
So what do we do about it?
Overall, data integration must be approached in a clever way to reduce noise, bias, and computational burden by focusing on only what is important or meaningful, without losing useful biological signals along the way. FindingPheno takes a range of approaches to this problem, including biology-agnostic unsupervised machine learning methods, biology-informed hierarchical modelling, and time-course or spatial dynamic modelling, with the overall aim of creating a holistic statistical model that can finally tame the tsunami.
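To give a flavour of the biology-agnostic unsupervised idea, here is a toy sketch: group samples by their combined (already scaled) omics features and let structure emerge without any biological labels. The sample names and values are invented, and a real pipeline would use established libraries rather than this toy k-means, but the principle is the same.

```python
import math
import random

# Invented samples: each row is a short vector of combined omics features.
samples = {
    "S1": [0.9, 1.1, 0.2],
    "S2": [1.0, 0.9, 0.1],
    "S3": [-1.1, -0.8, 2.0],
    "S4": [-0.9, -1.2, 1.9],
}

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: assign each sample to its nearest centre, then
    move each centre to the mean of its assigned samples."""
    random.seed(seed)
    names = list(points)
    centres = [list(points[n]) for n in random.sample(names, k)]
    groups = {}
    for _ in range(iters):
        groups = {i: [] for i in range(k)}
        for n in names:
            nearest = min(range(k), key=lambda i: math.dist(points[n], centres[i]))
            groups[nearest].append(n)
        for i, members in groups.items():
            if members:
                dims = range(len(centres[i]))
                centres[i] = [sum(points[n][d] for n in members) / len(members)
                              for d in dims]
    return [sorted(g) for g in groups.values() if g]

print(kmeans(samples, k=2))
```

No biology goes in: the two groups fall out of the numbers alone, which is both the strength of such methods (no prior assumptions) and the reason they are complemented by biology-informed models.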
Written by Shelley Edmunds