Review
In the last decade, bioinformatics has been applied to microbial biotechnology in many ways: computational analysis of wet-lab data, genome sequencing, identification of protein-coding segments [6,24,41,64], genome comparison to identify gene function [4,5,11,25,35,46,53,70,71], development of genomic and proteomic databases [8,9,16,21,33,49,62,63], and inference of phenotypes (higher-level functions) from genotypes (gene-level functions). To understand higher-level functions, four major lines of study have been undertaken: (i) automated reconstruction and comparison of metabolic pathways [12,14,18,38,49,58,59,65], (ii) study of protein-protein and protein-DNA interactions to understand regulatory pathways [2,7,15,27,28,30,42-44,47,55,60,61,66], (iii) modeling of the 2D and 3D structure of proteins [10,31,52,57,67], and (iv) modeling of the docking of 3D protein models with drugs [34]. Understanding the 3D structure of proteins has a major impact on understanding protein-protein interactions. Knowledge of protein-protein and protein-DNA interactions will provide a good understanding of binding sites in signaling pathways, and understanding the interactions between proteins and chemical compounds has already facilitated drug design.
Three approaches have been used in bioinformatics: (i) computational search and alignment techniques [4,5,53,70] that compare a new genome against the set of known genes to annotate the structure and function of genes in a newly sequenced genome, (ii) mathematical modeling techniques such as data mining, statistical analysis, neural networks, genetic algorithms, and graph matching to identify common patterns, features, and high-level functions, and (iii) an integrated approach that combines search techniques with mathematical modeling.
Genome sequencing
The major contributions of bioinformatics to genome sequencing have been: (i) the development of automated sequencing techniques that integrate PCR- or BAC-based amplification, 2D gel electrophoresis, and automated reading of nucleotides, (ii) the joining of the sequences of smaller fragments (contigs) to form a complete genome sequence, and (iii) the prediction of promoters and protein-coding regions of the genome.
PCR (Polymerase Chain Reaction) or BAC (Bacterial Artificial Chromosome) based amplification techniques yield genome fragments of limited size. The available fragment sequences suffer from three problems: nucleotide reading errors; repeats, which are very small and very similar fragments that fit in two or more parts of a genome; and chimeras, which arise when two different parts of the genome, or artifacts caused by contamination, join end to end to give an artifactual fragment. The nucleotide reading error problem is solved by generating multiple copies of the fragments, aligning them, and taking a majority vote at each nucleotide position. Multiple experimental copies are needed to identify repeats and chimeras, which are removed before the final assembly of the genome fragments. The joining of the fragments is modeled as a weighted graph in which the nodes are fragments and the weight of an edge is the number of overlapping nucleotides between the two fragments it connects; the fragments are joined on the basis of maximum overlap using a greedy algorithm [46,70]. In a greedy algorithm, the nodes with the maximum (or minimum) score are collapsed first. To join contigs, therefore, the fragments with the largest nucleotide sequence overlap are joined first.
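The two computational steps described above, majority voting and greedy joining by maximum overlap, can be illustrated with a minimal Python sketch. It is not the implementation used in the cited work [46,70]. It assumes error-free exact suffix-prefix overlaps, reads that are already aligned and of equal length for the voting step, and repeats and chimeras that have already been removed. The function names (consensus, overlap, drop_contained, greedy_assemble) and the min_len parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import permutations


def consensus(aligned_reads):
    """Majority vote at each position of equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def drop_contained(frags):
    """Remove duplicates and fragments fully contained in another fragment."""
    unique = list(dict.fromkeys(frags))
    return [f for f in unique if not any(f != g and f in g for g in unique)]


def greedy_assemble(fragments, min_len=1):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = drop_contained(fragments)
    while len(frags) > 1:
        best_k, best_a, best_b = 0, None, None
        # Each ordered pair (a, b) is an edge weighted by its overlap length.
        for a, b in permutations(frags, 2):
            k = overlap(a, b, min_len)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break  # no remaining overlaps: the fragments stay as separate contigs
        frags.remove(best_a)
        frags.remove(best_b)
        frags.append(best_a + best_b[best_k:])  # collapse the heaviest edge
        frags = drop_contained(frags)
    return frags
```

For example, consensus(["ACGTAC", "ACGTAC", "ACCTAC"]) corrects the single reading error in the third copy, and greedy_assemble(["ATGCGT", "CGTACC", "ACCGGA"]) joins the three fragments into one contig. The greedy choice is simple and fast, but it is a heuristic: it is not guaranteed to find the shortest common superstring, and repeats that were not removed beforehand can mislead it.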