Enzyme function prediction with interpretable models
Syed et al. (2009): Enzyme function prediction with interpretable models
The goal of the authors was "to develop a new tool for functional prediction" of enzymes. To achieve this goal, they did the following:
In order to define this mapping we collected a set of 453 features and properties that characterize proteins and are believed to be related to structural and functional aspects of proteins. We introduce a mixture model of stochastic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam classification of proteins, which is based on sequence, and the EC classification, which is based on enzymatic function. The model is very effective in learning highly diverged protein families or families that are not defined on the basis of sequence. The resulting tree structures highlight the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.
The motivation for this effort is that "[o]ne of the main challenges [of researchers] is to decipher the intricate network of cellular pathways that exist in each genome":
Understanding this network is the key to understanding the functional role of individual genes, and the genetic variation across organisms with respect to key processes and mechanisms that are essential to sustain life. Of special interest are metabolic pathways that consist mostly of enzymatic reactions and are responsible for nucleotide and amino acid synthesis and degradation, energy metabolism, and other functions. [...]
Since experimental verification of pathways is a time consuming and expensive process, there is a great interest in computational methods that can extend the existing knowledge about pathways to newly sequenced genomes.
The authors list a couple of methods for identification of candidate enzymes, all of which "require a set of candidates for each [pathway] hole. And while they can use the trivial candidate set consisting of all of the proteins in the subject organism, their performance is likely to significantly improve if the candidate sets are small and focused. [...] Producing effective mappings from genes to enzyme families is the focus of this chapter."
Before going into details, the authors introduce the EC hierarchy:
Traditionally, enzymes have been organized into a systematic, hierarchical classification scheme devised by the IUB Enzyme Commission [EC]. The EC number of an enzyme has the form A.B.C.D, where the first digit A indicates the enzyme's major functional family: oxidoreductase, transferase, hydrolase, lyase, isomerase, or ligase. The second digit B places the enzyme in a functional subfamily, the third into a sub-subfamily, and so on. For example, the EC number 1.2.3.4 designates enzymes which are oxidoreductases (the first digit) that act on the aldehyde or oxo group of donors (the second digit) and have oxygen as an acceptor (the third digit). In this case, the last digit specifies the particular reaction that is catalyzed by this family of enzymes.
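The nested structure of an EC number can be made concrete with a small sketch (mine, not the paper's): expanding "1.2.3.4" into its four increasingly specific levels and looking up the top-level family name. Only the six major families are covered here.

```python
# Illustrative sketch: the four nested levels of an EC number.
# The mapping below covers only the six top-level functional families.
TOP_LEVEL = {
    "1": "oxidoreductase",
    "2": "transferase",
    "3": "hydrolase",
    "4": "lyase",
    "5": "isomerase",
    "6": "ligase",
}

def ec_levels(ec_number):
    """Return the increasingly specific prefixes of an EC number,
    e.g. '1.2.3.4' -> ['1', '1.2', '1.2.3', '1.2.3.4']."""
    digits = ec_number.split(".")
    return [".".join(digits[: i + 1]) for i in range(len(digits))]

levels = ec_levels("1.2.3.4")
print(levels)                # ['1', '1.2', '1.2.3', '1.2.3.4']
print(TOP_LEVEL[levels[0]])  # oxidoreductase
```

Classifying at a coarser level (e.g. only the first two digits) is simply a matter of truncating this prefix list.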
The difficulty in assigning the correct EC number lies in the fact that "in many cases sequences have diverged to the extent that their common ancestry cannot be detected even with the most powerful sequence comparison algorithms". However, "[s]ince structure is more conserved than sequence, structural similarity can suggest functional similarity even when sequences have diverged beyond detection". Therefore, "approaches based on alternate representation of proteins" have been devised:
Instead of traditional sequence or structural alignment techniques, several groups have proposed comparison algorithms that rely on an alternative representation of proteins. These representations are usually derived from sequence motifs (i.e. short patterns), simple physico-chemical properties that can easily be computed from sequence, or annotations that are independent of sequence but can be found in databases for many proteins, e.g. subcellular location.
Algorithms used for this purpose include Naive Bayes, Decision Trees, Nearest Neighbor and Support Vector Machines.
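To make the idea of classifying over such alternative representations concrete, here is a minimal sketch (my own, not from the paper) of one of the simpler algorithms listed above, Nearest Neighbor, applied to nominal feature vectors via Hamming distance. The feature values and EC labels are invented for illustration.

```python
# Hedged sketch: nearest-neighbor prediction over nominal protein
# features (subcellular location, tissue, motif presence). All data
# below is invented; it only illustrates the representation.

def hamming(a, b):
    """Count positions where two nominal feature vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

# (feature vector, EC label) training pairs -- purely illustrative.
train = [
    (("cytoplasm", "liver", "kinase_motif"), "EC 2.7"),
    (("membrane", "brain", "no_motif"), "EC 3.1"),
]

def nearest_neighbor(query):
    """Predict the label of the training example closest to `query`."""
    return min(train, key=lambda t: hamming(t[0], query))[1]

print(nearest_neighbor(("cytoplasm", "brain", "kinase_motif")))  # EC 2.7
```

Note that Hamming distance sidesteps the lack of ordering among nominal values, a point the authors return to below when motivating decision trees.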
The authors decided to consider not only sequence and structure but also "[f]eatures such as the domain content, subcellular location, tissue specificity, pairwise interactions and expression profiles". In this way they developed "a method to predict the function of a protein based on basic biochemical properties augmented with (partial) data available from database records of biological properties that may hint at the biological context of the protein". The goal is "to identify the most relevant and informative features and combination of features that best characterize protein families (e.g. enzyme families), and to create a model for each protein family that can be used for classification and function prediction". For this, they introduced "a mixture model of stochastic decision trees".
Each family is modeled by a collection of decision trees that capture the salient features that are common to the members of the family. Our choice of the model was motivated by the nature of the data. Decision trees are useful when the data is nominal, i.e. when there is no natural notion of similarity or even ordering between objects. Such is the case for some of the attributes we associate with proteins, for example, tissue specificity and subcellular location. These attributes rule out the use of many other popular machine learning models, such as Support Vector Machines. Other attractive aspects of decision trees are robustness to errors both in classification and attribute values. Moreover, decision trees can be used when some of the data is missing, and this is often the case with database attributes.
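As an illustration only (the paper's actual model is considerably more elaborate), a mixture of randomized one-level decision trees over nominal attributes might look like the sketch below. The attribute names, the toy data, the use of `None` for missing database fields, and the majority-vote combination are all assumptions of the sketch, not the authors' feature set or algorithm.

```python
import random
from collections import Counter

def train_stump(records, labels, attribute):
    """One-level tree: majority label per value of one nominal attribute."""
    by_value = {}
    for rec, lab in zip(records, labels):
        by_value.setdefault(rec.get(attribute), []).append(lab)
    default = Counter(labels).most_common(1)[0][0]
    table = {v: Counter(labs).most_common(1)[0][0]
             for v, labs in by_value.items()}
    return attribute, table, default

def stump_predict(stump, rec):
    attribute, table, default = stump
    # An unseen or missing attribute value falls back to the default,
    # mirroring the trees' tolerance of missing database fields.
    return table.get(rec.get(attribute), default)

def train_mixture(records, labels, attributes, n_trees=5, seed=0):
    """'Stochastic' element: each tree splits on a random attribute."""
    rng = random.Random(seed)
    return [train_stump(records, labels, rng.choice(attributes))
            for _ in range(n_trees)]

def mixture_predict(mixture, rec):
    votes = Counter(stump_predict(s, rec) for s in mixture)
    return votes.most_common(1)[0][0]

# Toy data: two hypothetical families separated by subcellular location.
records = [
    {"location": "mitochondrion", "tissue": "liver"},
    {"location": "mitochondrion", "tissue": "brain"},
    {"location": "cytoplasm", "tissue": "liver"},
    {"location": "cytoplasm", "tissue": None},  # missing value
]
labels = ["family_A", "family_A", "family_B", "family_B"]

mixture = train_mixture(records, labels, ["location", "tissue"])
print(mixture_predict(mixture, {"location": "mitochondrion"}))
```

The stump's lookup-table construction shows why nominal attributes pose no difficulty for tree models: no ordering or distance between values is ever needed, only equality.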
In addition, the authors "converged to a locally optimal learning strategy by conducting a rough greedy search over the space of decision tree learning algorithms".
The paper concludes with the following remark:
There are several modifications that may improve performance. One modification that we are considering is to modify the pruning algorithm and switch from a local approach to a global approach (where all nodes are considered before deciding on the node to be pruned). Another modification would be to assign probabilities to attribute values, since for some nominal attributes it is possible to quantify the likelihood of the different values. Clearly, integration of other features can refine the models, and finally, boosting techniques can also help to improve the performance.
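The local-versus-global pruning distinction mentioned in this remark can be sketched briefly. The node identifiers and validation-error deltas below are invented for the example; they are not taken from the paper.

```python
# Hedged illustration: a *local* pruning pass commits to the first node
# whose removal helps, while a *global* pass examines every candidate
# before choosing. Each tuple: (node id, change in validation error if
# that node's subtree is replaced by a leaf; negative = pruning helps).
candidates = [("n1", -0.01), ("n2", +0.02), ("n3", -0.05), ("n4", -0.02)]

def prune_local(candidates):
    """Greedy/local: take the first candidate that reduces error."""
    for node, delta in candidates:
        if delta < 0:
            return node
    return None

def prune_global(candidates):
    """Global: consider all candidates, take the best reduction."""
    best = min(candidates, key=lambda c: c[1])
    return best[0] if best[1] < 0 else None

print(prune_local(candidates))   # n1 (first improvement found)
print(prune_global(candidates))  # n3 (largest improvement overall)
```

The global pass is more expensive per pruning step but cannot be misled by the order in which nodes are visited, which is presumably the improvement the authors have in mind.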