Identification of CNE-containing co-orthologous regions in the Fugu genome
An initial set of 2,330 CNEs with little or no evidence of transcription or RNA secondary structure were identified using a whole-genome comparison of the Fugu (assembly v4) and human genomes (assembly v.36) as described in [29]. CNEs in the human genome were grouped into clusters so that each CNE was no more than 400 kb in distance from another CNE in the cluster. Clusters of CNEs in the human genome made up of hits from two non-contiguous Fugu scaffolds (that is, two separate locations in the Fugu genome) were considered further. Previously, we reported that the genes found closest to CNEs are statistically over-represented for Gene Ontology (GO) annotations [70] relating to transcriptional regulation and/or development (trans-dev) [26]. Genes within each of these clusters in the human genome (including the closest gene either side of the cluster) were considered to be trans-dev if they contained any of these over-represented GO annotations. To avoid ambiguities in associating CNEs to genes, we selected only those regions containing a single trans-dev gene within the CNE cluster. Ten pairs of Fugu scaffolds conformed to these criteria. Seven regions containing the largest number of CNEs in addition to well defined orthologous sequence within each Fugu region (that is, those that contain genes neighboring the CNE cluster) were selected for further analysis. These scaffolds contain CNEs that are conserved in the vicinity of the following trans-dev genes in the human genome: BCL11A, EBF1, FIGN, PAX2, SOX1, UNC4.1 and ZNF503. Fugu protein sequences from the corresponding orthologs were obtained from Ensembl Compara (v36), except in the case of PAX2 where only partial sequences were present. In these cases, tBLASTn searches using known protein sequences from zebrafish pax2 co-orthologs pax2.1 (SPTR: PAX2_BRARE) and pax2.2 (SPTR: O93370) were used to identify the Fugu protein sequence.
To verify that these genes were phylogenetically co-orthologous to mammalian copies we carried out multiple alignments of each pair of Fugu proteins together with available orthologs from human, mouse, rat, dog and chicken using CLUSTALW (v1.83) [71] downloaded from Ensembl Compara (v36) unless otherwise stated. In addition we used all available orthologs from zebrafish. Two of these are previously experimentally characterized co-orthologs and were downloaded from the SwissProt protein database (pax2.1, PAX2_BRARE; pax2.2, O93370; sox1a, Q4V997; sox1b, Q2Z1R2). The remaining novel zebrafish orthologs were downloaded using Ensembl Compara. In the cases of BCL11A, FIGN and ZNF503 only a single ortholog was identified by Ensembl. In the case of EBF1, no zebrafish ortholog could be identified, and was, therefore, replaced by a single ortholog from the stickleback genome. The closest available invertebrate ortholog of each gene in either Ciona (ci), Drosophila (dm) or Caenorhabditis elegans (ce) was used as an outgroup (BCL11A, LD11946p (dm); EBF1, coe (ci); FIGN, CBG21866 (ce); PAX2, pax258 (ci); SOX1, soxNRA (dm); UNC4.1, unc4 (ci); and ZNF503, noc (dm)).
The closest paralog from human, mouse, rat, dog, chicken and Fugu in each case was also included as an outgroup in each alignment (that is, BCL11A:BCL11B, FIGN:FIGN1L, PAX2:PAX5, SOX1:SOX3, EBF1:EBF3, ZNF503:ZNF703). The closest related paralog was defined as the highest significantly scoring non-ortholog high scoring pair (HSP) in a BLASTp search of the human protein sequence against the SwissProt/trEMBL nr protein database. No closely related paralog could be identified for UNC4.1. A phylogenetic tree was created from each alignment using the neighbor joining (NJ) method and 1,000 bootstrap replicates using MEGA v3.1 [72].
Fugu co-ortholog gene nomenclature
F. rubripes is not a good experimental model organism due to difficulties in captive breeding and experimental manipulation. Consequently, few of its genes have been experimentally validated. Most gene predictions in the Fugu genome therefore remain novel, uncharacterized and have no current gene name. For the purposes of this study we decided upon a naming scheme for the Fugu co-orthologs that uses the same name as the human ortholog (for example, SOX1 = sox1) together with a number that denotes the specific co-ortholog (for example, sox1.1/sox1.2). This naming convention is similar to that used in early studies of zebrafish co-orthologs (for example, pax2.1 [63]) but which has now been superseded by a naming convention using letters (for example, pax2a) (ZFIN gene nomenclature guidelines [73]). We therefore used a number-based nomenclature to distinguish Fugu co-orthologs from zebrafish co-orthologs. For those genes in which zebrafish co-orthologs had previously been characterized (pax2a/pax2b, sox1a/sox1b) we named Fugu equivalents by their phylogenetic similarity to these characterized zebrafish genes as ascertained through phylogeny. So, as an example, PAX2 co-orthologs were identified on Fugu scaffolds 86 and 59 (assembly v4; Figure 1d). The phylogeny identified the protein encoded on S86 as closest to zebrafish pax2a and that encoded on S59 as closest to pax2b (Figure 1c); therefore, the gene on S86 was named pax2.1 and the gene on S59 was named pax2.2. The rest of the co-orthologous sets that did not have characterized zebrafish equivalents were assigned names randomly. It is important to note this nomenclature is used purely to distinguish the genes and has no functional significance.
Identification of CNEs in Fugu co-orthologous regions
CNE clusters derived from the whole-genome alignment were used to define the extent of sequence in both human and Fugu for use in more sensitive multiple alignments. Regions up to the next known gene from the most peripheral CNEs in each cluster were extracted in both human and Fugu using the Ensembl API [74]. Special attention was paid to include the same orthologous region between co-orthologous pairs to ensure equivalent comparison. In situations where the full extent of the region could not be identified in one of the co-orthologs due to the location of the region at the end of a scaffold (for example, scaffold_86, znf503.1; Additional data file 1), only CNEs identified up to the same orthologous region (estimated by the presence of nearby genes) in the second co-ortholog were used for comparative analyses. Orthologous sequences corresponding to each human region were similarly extracted in mouse (assembly v34) and rat (assembly v3.4). All genomic sequences were orientated prior to alignment so that the trans-dev gene was in positive orientation and masked for known repeats and low complexity regions using RepeatMasker and the relevant species-specific repeat library. Multiple alignments for the discovery of conserved sub-sequences located in the same relative order and orientation were carried out using the MLAGAN alignment toolkit [75] with translated anchoring and the phylogenetic guide tree '((human (mouse rat)) fugu)'. Pairwise glocal alignments to uncover conserved elements that may have undergone rearrangements (and are, therefore, no longer in the same relative order along the sequence) or inversions between Fugu and all other organisms utilised in the MLAGAN alignment were carried out in a pairwise fashion using Shuffle-LAGAN [76] with default parameters. Each pair of Fugu co-orthologous regions was aligned to the same orthologous mammalian sequence. CNEs were identified from the alignments using the VISTA program [77] as regions with at least 65% identity over 40 bp using Fugu as the baseline sequence. CNEs were filtered further to include only those that were conserved in human and at least one rodent.
Identification of overlapping and distinct CNEs between Fugu co-orthologous regions
The human sequence was used as a reference in order to ascertain whether CNEs identified from each co-ortholog region overlapped the same human sequence (termed 'overlapping') or were conserved in only one co-ortholog (termed 'distinct'). CNEs between co-orthologs were considered overlapping if the conserved sequence overlapped the same position in the human genome by at least 20 bp. Fugu CNEs that were defined as 'distinct' to one co-ortholog were used as query sequences against the alternative Fugu co-orthologous genomic region using the CHAOS local aligner [75] on both strands with the following parameters: word length 10, score cut-off 10, degeneracy tolerance 1, rescoring cut-off 1,000, and BLAST-like extension on. Resulting alignments were filtered to retain only those with at least 65% identity over 40 bp.
Evolution of overlapping CNEs
Element length
To ensure equivalent comparison, the length of the human CNE was used when measuring changes in element length between CNEs conserved in both co-orthologs. For each pair of overlapping CNEs with a one-to-one relationship (that is, each CNE overlapped one other CNE), the proportion (P) of the length of the overlap compared to the full length of each CNE was calculated using:
where ov is the length of the overlap between co-orthologous CNE (in bp) and len is the full length of the CNE. Values of P that tend towards 1 indicate all or the majority of the element is contained within the overlap while those tending towards 0 indicate only a small proportion of the element is contained within the overlapping region.
Sequence evolution
To compare the evolutionary rates of Fugu co-orthologous copies against the single human copy (representing the ancestral sequence) we used all 'overlapping' co-orthologous CNEs. For those CNEs that did not have a one-to-one relationship (for example, in cases where two or more CNEs in one region overlapped a single CNE in another) we treated each individual overlap region independently. Multiple alignments were created for each co-ortholog CNE individually together with orthologous sequence from human, mouse, rat, dog and chicken (where available) to produce the best mapping of orthologous bases. The human sequence from the overlap detected was used to extract corresponding sequence within each multiple alignment for each co-orthologous Fugu copy together with those of orthologous sequences from the other vertebrates. These sequences were then realigned together using DIALIGN (v2.2) [78] and all gapped positions removed using the Gblocks program (v0.91b) [79]. A Tajima relative rate test of each pair of Fugu co-orthologous copies against the single human sequence was carried out as described in [57]. Only sequences that showed at least 4 independent changes in one of the elements and a p value ≤ 0.05 were considered to have undergone significant change.
Identification of CNEs duplicated at the origin of vertebrates
All human CNE sequences were searched against the human genome using BLAST with sensitive parameters (word size 8, mismatch penalty -1) to identify CNEs that have more than a single match (e-value ≤ 1 × 10-4) in the human genome. Paralogs were identified within 1.5 Mb of each hit using the method set out in [29]. The probability of the enrichment for overlapping CNEs within the dCNE set was calculated using a χ2 test with expected numbers for each type of CNE (overlapping versus distinct) calculated from the proportion of each within the whole CNE dataset (381:430 = 0.469:0.531).