Cell wall structural proteins provide instructive examples of poor-quality annotation because their sequences are rich in particular amino acids. Three classes of structural proteins have been clearly defined: extensins, characterized by the presence of numerous Ser-(Pro)n motifs (n ≥ 3) separated by Tyr-, Lys-, His-, and Val-rich regions (Kieliszewski and Lamport, 1994); Hydroxyproline/Proline-Rich proteins (H/PRPs), characterized by a high Pro content and Pro-Pro-X-Y-Lys motifs, where X, Y = Val, Tyr, His, or Glu (Showalter, 1993); and Glycine-Rich proteins (GRPs), characterized by a high Gly content (up to 70%) organized in repeats of the (Gly-X) motif, where X = Gly, Ala, or Ser (Showalter, 1993). Numerous proteins predicted to have a signal peptide by PSORT (http://psort.nibb.ac.jp/form.html) and TargetP (http://www.cbs.dtu.dk/services/TargetP/), but showing only short stretches of Pro or Gly, have been wrongly annotated as extensin-like, PRP, or GRP. This is notably the case for At2g33790 (14.6% Pro), At5g26070 (23.5% Pro), and At4g28300 (13.6% Pro), which are annotated as extensins or PRPs in the UniProt, NCBI, TAIR, and TIGR databases. At4g34300 (14.7% Gly), At4g33930 (14.6% Gly), and At2g15340 (17.6% Gly) are presently annotated as GRPs in the NCBI, TAIR, and TIGR databases, but as putative or unknown proteins in the UniProt and MIPS databases. Other examples are provided by a recent transcriptome study on peach by Trainotti et al. (2003). Contig 010 shows homology to cotton fiber protein 6 (GenBank S65062; John, 1996). Since this protein has only one short Ser-Gly motif, it cannot be classified among the structural proteins, contrary to what the authors suggest. In the same way, contig 125 shows homology to Arabidopsis thaliana At1g62510 (GenBank NP_176440). The primary sequence of the encoded protein has only one short X-Pro domain (with X = His, Lys, Asn, Thr, or Ser), which again is not sufficient to classify it among the structural proteins, as mentioned in the MIPS database. It actually contains a PFAM domain (PF00234) that defines the protease inhibitor/seed storage/LTP family (http://hits.isb-sib.ch/cgi-bin/PFSCAN), as clearly indicated in the NCBI, TAIR, and TIGR databases.
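The motif definitions given above can be turned into a simple screen that separates a biased amino acid composition from the actual presence of repeated structural motifs. The following Python sketch is only an illustration: the function names, the toy sequences, and the `min_motifs` cut-off are assumptions made for this example and are not taken from the cited studies or from any database pipeline. Hydroxyproline is a post-translational modification, so only Pro is scanned in the primary sequence.

```python
import re

# Patterns written from the class definitions in the text (one-letter codes).
EXTENSIN_MOTIF = re.compile(r"SP{3,}")        # Ser-(Pro)n with n >= 3
HPRP_MOTIF = re.compile(r"PP[VYHE][VYHE]K")   # Pro-Pro-X-Y-Lys, X, Y = Val, Tyr, His, Glu
GRP_RUN = re.compile(r"(?:G[GAS])+")          # run of (Gly-X) units, X = Gly, Ala, Ser


def percent(seq, residue):
    """Percentage of `residue` in `seq` (0.0 for an empty sequence)."""
    seq = seq.upper()
    return 100.0 * seq.count(residue) / len(seq) if seq else 0.0


def motif_report(seq):
    """Composition and non-overlapping motif counts for one protein sequence."""
    seq = seq.upper()
    return {
        "pro_pct": round(percent(seq, "P"), 1),
        "gly_pct": round(percent(seq, "G"), 1),
        "extensin_motifs": len(EXTENSIN_MOTIF.findall(seq)),
        "hprp_motifs": len(HPRP_MOTIF.findall(seq)),
        # Each matched run has an even length; count its (Gly-X) units.
        "grp_units": sum(len(m.group()) // 2 for m in GRP_RUN.finditer(seq)),
    }


def supports_structural_annotation(report, min_motifs=2):
    """Hypothetical rule: require more than one motif of at least one class.

    The cut-off is an assumption for illustration, not a published threshold.
    """
    keys = ("extensin_motifs", "hprp_motifs", "grp_units")
    return any(report[k] >= min_motifs for k in keys)


# Invented toy sequences, not real proteins.
toy_extensin_like = "SPPPPKYVSPPPPKHVSPPPP"        # three Ser-(Pro)4 motifs
toy_single_stretch = "MKTLAVLSPPPAGTDLKQWERTYIHFN"  # one short Pro stretch only

for name, seq in [("extensin-like", toy_extensin_like),
                  ("single stretch", toy_single_stretch)]:
    report = motif_report(seq)
    print(name, report, supports_structural_annotation(report))
```

With such a screen, a protein that contains a single short Pro or Gly stretch is reported with its composition but is not flagged as structural, which mirrors the argument made for contigs 010 and 125. Any real use would require a validated threshold and a check for known domains such as PF00234.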