The number of sequences available in databases, now a hundred times higher than ten years ago (http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html), makes accurate sequence annotation a great challenge. In contrast to global analyses of transcriptional activity, which aim to scan the genome for potential transcription units (Choudhary et al., 2001; Yamada et al., 2003), transcriptome and proteome studies require the structure and function of genes to be determined precisely. Transcriptome studies need arrays designed to follow the expression of specific collections of genes that are relevant to the biological question addressed. Proteomic approaches rely on identifying proteins by mass spectrometry, either from peptide sequencing or from peptide mass fingerprinting.
Bioinformatic analyses are becoming easier as the quality of the available software and of the annotations provided by databases continuously improves, especially for plant model organisms such as Arabidopsis thaliana and Oryza sativa. Information on gene structure and expression is available at NCBI (http://www.ncbi.nlm.nih.gov/Database/index.html). Genomic, cDNA, and EST sequences are compared to establish the exon/intron structure of genes. Bioinformatic predictions are checked and, where necessary, corrected using primary sequences (http://www.ncbi.nlm.nih.gov/RefSeq/) (Pruitt and Maglott, 2001). Information on protein function is provided in several databases, including Uniprot, where the origin of the proposed function (experimental or electronic annotation) is mentioned (http://www.expasy.uniprot.org/) (Apweiler et al., 2004), NCBI (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide), TAIR (http://www.arabidopsis.org/), TIGR (http://www.tigr.org/tdb/e2k1/ath1/), and MIPS (http://www.mips.biochem.mpg.de/proj/thal/db/index.html).
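The distinction between experimental and electronic annotation can be illustrated with a minimal sketch that separates the two kinds of record before they are used. It assumes each annotation carries an evidence code following the Gene Ontology convention (for example IDA for a direct assay, IEA for electronic annotation); this convention, the function name, and the gene records are illustrative assumptions and are not taken from the databases cited above.

```python
# Evidence codes that denote experimental support in the Gene Ontology
# convention. Using these codes here is an assumption made for illustration.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def split_by_evidence(annotations):
    """Separate experimentally supported annotations from all others.

    `annotations` is an iterable of (gene_id, function, evidence_code)
    tuples. Any code outside EXPERIMENTAL_CODES (such as IEA or ISS) is
    treated as inferred and therefore in need of closer checking.
    """
    experimental, inferred = [], []
    for gene_id, function, code in annotations:
        if code in EXPERIMENTAL_CODES:
            experimental.append((gene_id, function, code))
        else:
            inferred.append((gene_id, function, code))
    return experimental, inferred


# Hypothetical records; the identifiers and functions are made up.
annotations = [
    ("geneA", "peroxidase", "IDA"),
    ("geneB", "putative kinase", "IEA"),
    ("geneC", "putative transporter", "ISS"),
]
experimental, inferred = split_by_evidence(annotations)
```

In this sketch only geneA would be retained as experimentally supported, while geneB and geneC would be set aside for manual verification before being used in array design or data interpretation.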
This paper will provide some examples of misleading annotations with regard to putative protein function that may cause mistakes either in array design or in data interpretation. Examples will be taken mainly from A. thaliana and from published papers or databases such as Uniprot, NCBI, TAIR, TIGR, and MIPS.