Dr. Jan O. Korbel

Computational Biology & Genomics, Gerstein lab
Yale University School of Medicine

Protein Function Prediction

Many genome sequences have been completed over the last years and the pace of DNA sequencing is constantly increasing. When analyzing a newly sequenced genome it is typically impossible to assign functions to more than two thirds of all protein-coding genes, even if experimental approaches for function characterization as well as computational algorithms for homology detection (for example, BLAST [1]) are applied extensively. There is thus a great need for reliable tools for protein function prediction. A recently developed class of analysis approaches, the genomic context approaches, is often well-suited for function prediction, and can be utilized to infer functional aspects overlooked by BLAST. Genomic context approaches analyze in detail the genome a protein is encoded in: the genomic ‘background’, or context, points at the protein’s cellular role.

While I was a graduate student at the EMBL Heidelberg in the laboratory of Peer Bork, we both used previously existing genomic-context based methodologies for function prediction (i.e. Approaches I-III), and together with collaborators developed three novel genomic-context approaches (i.e. Approaches IV-VI):

Approach I: Conserved neighborhood, predicted operons. For example, prokaryotic proteins which participate in a common biochemical pathway or cellular process are often encoded by neighboring genes in the genome. Together, the genes form an operon, a group of genes that can be regulated and translated into proteins simultaneously. The ‘conserved operon approach’ (for illustrations, see Figure 1below) looks for operons in many completely sequenced genomes. If genes form a similar operon-structure over and over again (in this case the operon is presumably highly evolutionarily conserved), it is very likely that the respective proteins are functionally associated [2,3]: in other words, they are members of the same biochemical pathway or cellular process, or they form a protein complex together.

Approach II: Gene fusion. This approach searches for large fusions of protein-coding genes. If the open reading frames (ORFs) of two distinct genes frequently appear merged in different genomes, the two polypeptide halves of the resulting large fusion protein are likely to be functionally associated [4,5]. Moreover, also the corresponding, separately occurring proteins in other species are in such a case likely to interact.

Approach III: Phyletic pattern, co-occurrence. This approach uses the fact that groups of proteins involved in certain cellular pathways or complexes are usually together conserved in a species, if the pathway or complex is present in the organism [6,7]. In other words, the corresponding genes are usually either together present or absent from genomes (i.e. they co-occur).


Recently, we have developed three approaches that use novel signals in the genomic context for systematic function prediction (Figure 1). These not only predict functional associations between proteins, but yield further, specific information about the protein biological function:

Approach IV: Phyletic pattern, anticorrelation. If the presence of two genes across eukaryotic and prokaryotic genomes is mutually exclusive – in other words the gene occurrences 'anticorrelate' (one gene is present in a genome and the other one is absent) – it can be assumed that both genes encode proteins with the same function [8,9]. The underlying rationale is that there is usually no need for a genome to encode the same function twice. We have developed an approach that exploits gene occurrence anticorrelation in order to systematically identify functionally equivalent proteins [10]. We applied it for the discovery of analogous enzymes in the thiamin biosynthesis pathway . If catalytic function and cellular role are known for one protein, the approach can assign both properties to the novel protein. Both proteins do not have to be evolutionarily related.

Approach V: Conserved neighborhood, divergently transcribed genes. Not only gene neighbors that are located in operons are likely to functionally interact: Approach V uses the fact that also neighboring genes that point into diverging, opposite directions frequently encode functionally associated proteins in prokaryotes, especially if this gene organization has been conserved in evolution [11]. Often, one gene within such a so-called ‘divergently transcribed gene pair’ regulates the transcription of the other gene, as well as its own biosynthesis through ‘auto-regulation’.

Approach VI: Phyletic vs. phenotypic pattern. The common presence and absence of genes and phenotypic characteristics across species can be used to predict the genes involved in a the manifestation of a phenotype [12-14]. We have developed an novel approach that systematically connects genes and phenotypes by combining genomic context analysis with automated data-mining [15]. In particular, an automated literature analysis searches for documented phenotypic similarities between species. A phenotypic characteristic is then connected to genes that show a presence and absence across species similar to that characteristic.

 

 

Figure 1 — Protein function prediction from the genomic context of genes.
I. The conserved operon approach enables predicting functions for a hypothetical gene found in the genomes of several prokaryotic species [2,3]. In the illustration, red arrow symbols indicate the genomic location of the conserved hypothetical gene in different genomes. Here and in the following, the term gene is equated with a protein-coding open reading frame (ORF). Although the encoded protein does not share significant sequence similarity with well-characterised proteins of known function (and thus tools like BLAST will fail to identify a putative function), the repeated occurrence across species of the gene within operons belonging to a particular biochemical pathway reveals it presumable role. For instance, the blue, yellow, and green symbols may represent the genomic locations of genes (ORFs), which encode enzymes of a vitamin biosynthesis pathway: thus, the hypothetical gene is also likely to be belong to this pathway.
II. The fusion approach [4,5] allows predicting the function of a conserved hypothetical gene (red arrow symbols), if it is merged with a well-characterised gene in another species (see the combined red-yellow arrow symbols).
III. The co-occurrence approach [6,7] enables predicting the function of a conserved hypothetical gene (red arrow symbols), if it shows the same (or similar) occurrence across species as well-characterised genes (green and yellow).
IV. The anticorrelation approach [10] systemtaically identfies genes with anticorrelating (mutually exclusive) occurrence across species, which are predicted to code for proteins with functionally equivalent cellular role. The function of a conserved hypothetical gene (red arrow symbols) can thus be predicted, if the gene reveals anticorrelating occurrences with a gene of known function (green arrow symbols).
V. The divergently transcribed gene pair approach [11] predicts functions for a conserved hypothetical gene (red arrow symbols), if it is in a number of genomes divergently transcribed (i.e. pointing towards opposite directions on the genome) from a well-characterised gene (green arrow symbols), or from an operon of well-characterised genes (green, yellow and blue arrow symbols).
VI. This approach [15] allows to systematically connect genes (for example, hypothetical genes; see red arrow symbols) to the phenotypic characteristics (for instance, motility; see illustrated bacterial cells) they are likely to be responsible for. The rationale is that both gene and phenotypic characteristic typically show similar presence and absence across species.

 

References

[1] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., S25:3389-402. [Abstract]

[2] Dandekar T, Snel B, Huynen M, Bork P (1998). Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci., 23:324-8. [Abstract]

[3] Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N (1999). The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. U S A, 96:2896-901. [Abstract]

[4] Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA (1999). Protein interaction maps for complete genomes based on gene fusion events. Nature, 402:86-90. [Abstract]

[5] Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D (1999). Detecting protein function and protein-protein interactions from genome sequences. Science, 285:751-3. [Abstract]

[6] Huynen MA, Bork P. (1998). Measuring genome evolution. Proc. Natl. Acad. Sci. U S A, 95:5849-56. [Abstract]

[7] Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO (1999). Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. U S A, 96:4285-8. [Abstract]

[8] Galperin MY, Koonin EV (2000). Who's your ne ighbor? New computational approaches for functional genomics. Nat. Biotechnol., 18:609-13. [Abstract]

[9] Myllykallio H, Lipowski G, Leduc D, Filee J, Forterre P, Liebl U. (2002). An alternative flavin-dependent mechanism for thymidylate synthesis. Science, 297:105-7. [Abstract]

[10] Morett E, Korbel JO, Rajan E, Saab-Rincon G, Olvera L, Olvera M, Schmidt S, Snel B & Bork P (2003). Systematic discovery of analogous enzymes in thiamin biosynthesis. Nat. Biotechnol., 21:790-5. [PDF]

[11] Korbel JO, Jensen LJ, von Mering C, Bork P (2004). Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nat. Biotechnol., 22:911-7. [PDF]

[12] Levesque M, Shasha D, Kim W, Surette MG, Benfey PN (2003) Trait-to-gene: a computational method for predicting the function of uncharacterized genes. Curr. Biol. 13:129-33. [Abstract]

[13] Makarova KS, Wolf YI, Koonin EV (2003). Potential genomic determbnants of hyperthermophily. Trends Genet., 19:172-6. [Abstract]

[14] Jim K, Parmar K, Singh M, Tavazoie S. (2004). A cross-genomic approach for systematic mapping of phenotypic traits to genes. Genome Res., 14:109-15. [Abstract]

[15] Korbel JO, Doerks T, Jensen LJ, Perez-Iratxeta C, Kaczanowski S, Hooper S, Andrade M, Bork P (2005). Systematic Association of Genes to Phenotypes by Genome and Literature Mining. PLoS Biol., 3:e134. [PDF]

 

 

Utilizing the genomic context of genes