Integrating 'omic' information: a bridge between genomics and systems biology
Ge et al. (2003): Integrating 'omic' information: a bridge between genomics and systems biology
The theme of the paper is the integration of multiple independent datasets to increase the reliability of gene function annotation; the authors "review the recent development of strategies for such integration and [...] argue that these will be important for a systems approach to modular biology".
This paper introduces the term "modular biology", in which "biological processes of interest, or modules, are studied as complex systems of functionally interacting macromolecules". Various technologies have been developed "that allow the assignment of genes to particular biological modules". The examples mentioned are standardized high-throughput (HT) assays that analyze the transcriptome (microarrays, DNA chips or serial analysis of gene expression, SAGE) and the proteome; mapping of protein-protein, protein-DNA and other component-component interactions (interactome mapping); systematic phenotypic analyses (phenome mapping); and transcript or protein localization mapping (localizome mapping).
"[D]ata obtained from any single omic approach should be interpreted cautiously" as "information can be missing because of the occurrence of false negatives, and information can be misleading because of the presence of false positives. [...] In addition, data emerging from any single omic approach can only provide crude indications of gene or protein function."
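On integrating interactome and transcriptome data, the authors write (LT denotes low-throughput, that is, conventional small-scale experiments):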
In a study of omic data integration, HT, LT and combined HT-LT yeast interactome datasets were compared with three modular transcriptome datasets related to cell cycle, sporulation or environmental stresses. First, a protein interaction density (PID) value was calculated as the ratio of the number of observed protein–protein interaction pairs over the total number of possible pairwise combinations for a given set of gene products. PIDs were compared between sets of protein pairs encoded by genes belonging to common transcriptome clusters (or ‘intracluster’ pairs) and sets of protein pairs encoded by genes belonging to different clusters (or ‘intercluster’ pairs). In general, average intracluster PIDs are significantly greater than intercluster PIDs for interactome datasets, whereas the average intracluster and intercluster PIDs are similar for sets of random protein pairs. Also, LT interactome data give larger PIDs than HT data. This observation indicates a global correlation between transcriptome and interactome mapping data.
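The PID comparison described in this passage can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the protein names, the toy interaction set and the function names are hypothetical, and counting intercluster possibilities as all cross-cluster pairs is an assumption, since the quoted text only defines the ratio for "a given set of gene products".

```python
from itertools import combinations, product

def intracluster_pid(cluster, interactions):
    """Observed interacting pairs within one cluster / all possible pairs in it."""
    members = sorted(set(cluster))
    possible = [frozenset(pair) for pair in combinations(members, 2)]
    if not possible:
        return 0.0  # fewer than two proteins: no pair can be formed
    observed = sum(1 for pair in possible if pair in interactions)
    return observed / len(possible)

def intercluster_pid(cluster_a, cluster_b, interactions):
    """Observed interacting pairs across two clusters / all cross-cluster pairs.

    Assumption: the denominator is every pair with one member in each cluster.
    """
    possible = {
        frozenset((a, b))
        for a, b in product(set(cluster_a), set(cluster_b))
        if a != b
    }
    if not possible:
        return 0.0
    observed = sum(1 for pair in possible if pair in interactions)
    return observed / len(possible)

# Hypothetical toy data: undirected interactions stored as frozensets.
ppi = {
    frozenset(("p1", "p2")),
    frozenset(("p2", "p3")),
    frozenset(("p3", "p4")),
}
cluster_1 = ["p1", "p2", "p3"]  # genes from one transcriptome cluster
cluster_2 = ["p4", "p5"]        # genes from another cluster

print("intracluster PID, cluster 1:", intracluster_pid(cluster_1, ppi))
print("intracluster PID, cluster 2:", intracluster_pid(cluster_2, ppi))
print("intercluster PID:", intercluster_pid(cluster_1, cluster_2, ppi))
```

In this toy case cluster 1 has two of its three possible pairs observed, whereas only one of the six cross-cluster pairs is observed. That mirrors the direction of the reported result, though the numbers themselves are invented.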
Another study combined HT yeast two-hybrid (Y2H) and RNA interference (RNAi) analyses "to identify genes involved in the DNA damage response (DDR) in C. elegans":
Proteins corresponding to potential C. elegans orthologs of DDR genes in other organisms were subjected to Y2H interactome screens and potential protein–protein interactions were identified. Subsequently, HT Y2H interactors were subjected to RNAi and examined for DDR-related phenotypes. Approximately 10% of the interactors tested exhibited DDR-related phenotypes and two phenoclusters were established, distinguishing potential functions in either DNA damage checkpoints or DNA repair. A majority of the interactions identified belong to the intracluster category, whereas a few are intercluster, which provided supporting evidence for a correlation between the interactome and the phenome.
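The intracluster versus intercluster tally described here could be computed roughly as follows. This is a hedged sketch only: the protein identifiers, the phenocluster labels "checkpoint" and "repair", and the function name are hypothetical placeholders rather than data or methods from the study.

```python
from collections import Counter

def classify_interactions(interactions, phenocluster):
    """Count interactions whose partners share a phenocluster, differ, or lack one."""
    counts = Counter()
    for a, b in interactions:
        if a not in phenocluster or b not in phenocluster:
            counts["unassigned"] += 1      # at least one partner has no phenotype call
        elif phenocluster[a] == phenocluster[b]:
            counts["intracluster"] += 1    # both partners in the same phenocluster
        else:
            counts["intercluster"] += 1    # partners in different phenoclusters
    return counts

# Hypothetical phenocluster assignments derived from RNAi phenotypes.
phenocluster = {
    "p1": "checkpoint",
    "p2": "checkpoint",
    "p3": "repair",
    "p4": "repair",
}
# Hypothetical Y2H interactions; p5 showed no DDR-related phenotype.
y2h_interactions = [("p1", "p2"), ("p3", "p4"), ("p2", "p3"), ("p1", "p5")]

tally = classify_interactions(y2h_interactions, phenocluster)
print(dict(tally))
```

Keeping a separate "unassigned" count avoids silently treating interactors without a phenotype as intercluster pairs.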
These analyses can be used to formulate hypotheses:
Interactome–phenome correlation analyses also suggest that a combination of these two approaches can help to formulate hypotheses. [...] Further analysis of transcriptome and phenome data can also lead to the formulation of hypotheses. For example, the genes known by genetic means to be essential for yeast sporulation revealed two subgroups [...]. In one subgroup, expression was not responsive to sporulation and most of these genes were found to encode general cellular factors. In the other subgroup, genes were transcriptionally responsive to the sporulation process and many of those were found to encode proteins specifically required for sporulation. Therefore, uncharacterized genes that are both essential for and transcriptionally responsive to sporulation can be hypothesized to be sporulation-specific factors.
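The sporulation reasoning amounts to simple set operations over two gene lists. The sketch below uses invented gene identifiers and a hypothetical function name, and it assumes that each dataset has already been reduced to a set of gene names.

```python
def split_essential_genes(essential, responsive):
    """Split phenotypically essential genes by transcriptional responsiveness."""
    general = essential - responsive    # essential but not responsive
    specific = essential & responsive   # essential and responsive
    return general, specific

# Hypothetical inputs.
essential_for_sporulation = {"geneA", "geneB", "geneC", "geneD"}  # phenome data
responsive_to_sporulation = {"geneB", "geneC", "geneE"}           # transcriptome data
already_characterized = {"geneB"}                                 # known function

general, specific = split_essential_genes(
    essential_for_sporulation, responsive_to_sporulation
)
# Uncharacterized genes in the responsive subgroup become the hypotheses.
candidates = specific - already_characterized

print("likely general cellular factors:", sorted(general))
print("candidate sporulation-specific factors:", sorted(candidates))
```

With these toy sets, geneC is the only uncharacterized gene that is both essential for and responsive to sporulation, so it would be the one proposed as a sporulation-specific factor.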
What will the integration of multiple omic approaches eventually yield? The authors suggest:
With the increasing availability of various omic maps, more studies are likely to employ multidimensional integration of omic data. Over time, it should become increasingly clear whether a global correlation of omic datasets applies to different systems and modules and how biological hypotheses can be formulated based on data integration.
"[T]o provide visualization of integrated omic data", various "bioinformatic tools are being developed. For example, expression correlation of two genes encoding potentially interacting proteins can be visualized in web-accessible protein–protein interaction networks. [...] [B]ecause omic datasets are being constantly updated, visualization tools should allow the incorporation of constantly evolving data."
The authors conclude:
We propose that such data integration can be further applied to examine the topology of biological networks, to provide information on directionality of interactions, and to create wiring diagrams that better depict the functional outcome of component–component relationships. Together, these strategies should facilitate a systems approach to modular biology. [...] By applying a single omic approach, the knowledge of a system can be expanded from a single gene to a network of genes, which can be regarded as a basic model for the system. When genes or proteins in this network are systematically disrupted, responses from other parts of the network can be recorded and the data obtained can be incorporated into the basic model. [...] Real-life biological systems might contain more components and the wiring diagrams that depict the relationships between these components could be much more complex than currently appreciated. Also, the information available for biological systems is increasing as more omic datasets become available. Thus, HT data integration is needed in systems biology approaches, which should be achieved by the use of computational tools that apply the principles and methodologies discussed here to multiple sets of omic information in a dynamic manner. The biological networks or wiring diagrams modeled in this manner should shed light on the complexity of biological systems.