Protein complexes are dynamic structures that assemble, store and transduce biological information. A major post-genomic scientific and technological pursuit is to describe the functions performed by the proteins encoded by the genome. Within the cell, proteins assemble into complexes and dynamic macromolecular structures, and it is primarily as components of these complexes that they perform cellular functions. These macromolecular structures engage in tasks critical for cell survival, such as the regulation of metabolic pathways and the control of DNA replication and progression through the cell cycle, as well as a myriad of minor but important functions.
The implementation of large high-throughput sequencing projects constitutes a revolution in our approach to human genomics. The human genome project has provided the sequence for approximately 30 000 genes. This number does not differ substantially from the number of genes of the nematode worm Caenorhabditis elegans, suggesting that genomic complexity may partly rely on the contextual combination of gene products. Moreover, protein splice variants and post-translational modifications complicate matters by transforming these human genes into millions of proteins. Fortunately, proteomics (the study of the protein complement of the genome) offers opportunities to understand protein expression and protein function. While the genome is fixed, the proteome is much more dynamic: it changes during cellular development and in response to external stimuli. To fully understand the cellular machinery, simply identifying the proteins present is not enough; all of the interactions between them must also be delineated. The characterization of protein–protein interactions is essential to understanding, at the molecular level, how the cell executes its various biological functions. These interactions form the basis of phenomena such as DNA replication and transcription, metabolic pathways, signaling pathways and cell cycle control. The association of more than two partners in a single complex introduces levels of complexity and regulation beyond binary interactions. The networks of protein interactions described recently (e.g. Uetz et al. 2000, Ito et al. 2001, Gavin et al. 2002, Ho et al. 2002) represent a higher level of proteome organization that goes beyond simple representations of protein networks. They represent a first draft of the molecular integration and regulation of the activities of cellular machineries.
The challenge of mapping protein interactions is vast, and many novel approaches have recently been developed for this task in the fields of molecular biology, proteomics and bioinformatics.
The charting of genome-scale protein-interaction maps is a first step toward addressing this challenge. An effort to map the eukaryotic protein–protein interaction network, or interactome, has been initiated for Saccharomyces cerevisiae. However, many of the protein–protein interactions that are relevant for understanding human biology, diseases and development only take place in multicellular organisms. Recently, the interactome maps of multicellular model organisms have emerged (Walhout et al. 2000, Giot et al. 2003).
The term proteomics was introduced in 1995 (Wasinger et al. 1995). This domain has seen tremendous growth over the last nine years, as illustrated by the number of publications related to proteomics. The major goal of proteomics is to make an inventory of all proteins encoded in the genome and to analyze protein properties such as expression level, post-translational modifications and interactions. A number of recently described technologies have provided ways to approach these problems. The most common technologies used in proteomics today are two-dimensional sodium dodecylsulfate polyacrylamide gel electrophoresis (2D SDS-PAGE) for protein separation, mass spectrometry (MS), and protein identification through manual interpretation or database correlation of mass spectra. Integration of these steps is essential for a successful proteome experiment, yet it relies on accurate knowledge of the parameters influencing each step. The improvement of these techniques has led to large-scale research in proteomics (Fleischmann et al. 1995, Anderson et al. 2000). It is now possible to identify a large fraction of the proteins of a given proteome.
The final step in the characterization of proteins requires the application of bioinformatics tools to process existing experimental information. Bioinformatics tools provide sophisticated methods to answer the questions of biological interest. A number of different bioinformatics strategies have been proposed to predict protein–protein interactions. These include the use of information derived from reference maps of interacting domain profile pairs (Wojcik & Schachter 2001), conserved gene-pairs and correlated prokaryotic interacting gene products (Dandekar et al. 1998), clusters of orthologous proteins (Tatusov et al. 1997), phylogenetic profile (Pellegrini et al. 1999) or tree similarity (Pazos et al. 1997), gene fusion events (Marcotte et al. 1999), location within a functional cluster map (Schwikowski et al. 2000) and others.
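As an illustration of one of these strategies, the phylogenetic profile idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure used in the cited work: proteins whose presence/absence patterns across a set of genomes are highly similar are flagged as candidate functional partners. The protein names, the five-genome toy profiles, the use of Jaccard similarity and the 0.8 threshold are all hypothetical choices made for the example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two presence/absence profiles (sequences of 0/1)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def predict_links(profiles, threshold=0.8):
    """Return (protein, protein, score) for pairs with similar profiles.

    The threshold is an arbitrary illustrative cut-off, not a published value.
    """
    links = []
    for p, q in combinations(sorted(profiles), 2):
        score = jaccard(profiles[p], profiles[q])
        if score >= threshold:
            links.append((p, q, score))
    return links


# Hypothetical toy data: each column is one genome, 1 = homolog present.
profiles = {
    "protA": (1, 0, 1, 1, 0),
    "protB": (1, 0, 1, 1, 0),
    "protC": (0, 1, 0, 1, 1),
    "protD": (1, 0, 1, 0, 0),
}

links = predict_links(profiles)
print(links)
```

In this toy data set only protA and protB share an identical profile, so they are the only pair reported. A real analysis would need many more genomes and some correction for phylogenetic relatedness among them.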
Bioinformatics therefore has a critical role in the analysis of protein–protein interactions. Several databases that accumulate these data are currently available. These databases play an essential role in allowing researchers to visualize and integrate their own experimental data with the information about protein–protein interactions available in the Database of Interacting Proteins (DIP) (Xenarios et al. 2002), the General Repository for Interaction Datasets (GRID) (Breitkreutz et al. 2003a), the Bio-molecular Interaction Network Database (BIND) (Figeys 2003) and the Human Protein Reference Database (HPRD) (Peri et al. 2004).
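Integrating interaction data from several sources can be sketched as follows. This is a hedged illustration only: it does not reproduce the format or interface of any of the databases named above, and the source labels and protein identifiers are hypothetical. The point it shows is that a binary interaction is unordered, so A–B and B–A must be treated as the same record before counting how many independent sources support it.

```python
from collections import defaultdict


def merge_interactions(datasets):
    """Map each unordered protein pair to the set of sources reporting it.

    `datasets` is a dict of source name -> list of (protein, protein) tuples.
    """
    support = defaultdict(set)
    for source, pairs in datasets.items():
        for a, b in pairs:
            # frozenset makes (A, B) and (B, A) the same key
            support[frozenset((a, b))].add(source)
    return support


# Hypothetical interaction lists from two independent screens.
datasets = {
    "screen_1": [("P1", "P2"), ("P2", "P3")],
    "screen_2": [("P2", "P1"), ("P3", "P4")],
}

merged = merge_interactions(datasets)

# Pairs reported by more than one source, in a stable sorted form.
shared = sorted(
    tuple(sorted(pair)) for pair, sources in merged.items() if len(sources) > 1
)
print(shared)
```

Here only the P1–P2 interaction is supported by both screens. Keying on an unordered pair is a simple design choice for binary interactions; complexes with more than two partners would need a different representation.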
The focus of this paper is to describe experimental and bioinformatics approaches for determining protein–protein interactions (Fig. 1). We also discuss the interpretation of protein–protein interaction information to elucidate protein function.