RE-ANNOTATED GENOMIC SEQUENCE
Release 4 Notes: Updated September 15, 2005
Release 4.2.1 of the euchromatin is now available at http://flybase.net/annot/.
Annotation Statistics for Release 4.2.1
Note that gene model statistics include heterochromatin annotations, but aligned feature counts are only for euchromatin.
* change is relative to Release 4.1 annotations
The sequence finishing and annotation of the heterochromatic region of the genome is being performed by the Drosophila Heterochromatin Genome Project (DHGP; see Hoskins et al. 2002). As sequence gaps are filled, and the heterochromatic scaffolds are finished to high quality and re-annotated, they will be contributed to GenBank and FlyBase and integrated into future releases of the Drosophila genomic sequence.
Release 3.2b annotations of the heterochromatic regions are available from FlyBase and the public data libraries (NCBI, EBI, DDBJ). At FlyBase, these data are available from FlyBase Gene Annotation reports, the FlyBase BLAST server, the batch query page, and the download site.
The Release 3.2b heterochromatin annotation represents the latest effort to describe the protein-coding genes, non-coding genes, and other features located in the heterochromatin sequence. In this update, the underlying sequence is the 20.7 Mb of Release 3 whole-genome-shotgun (WGS) scaffolds from Celera that could not be assembled into the euchromatin arms, together with a few BDGP-sequenced scaffolds.
The WGS3 heterochromatin consists of ~2600 scaffolds that still contain gaps and collapsed repeats, but are otherwise considered relatively high-quality sequence. Some of these have been mapped to particular chromosome arms (i.e. 2h, 3h, 4h, Xh, or Yh), while the remainder have been placed on chromosome U. It is important to note that scaffolds that have been mapped to a particular chromosome arm are provisionally ordered, but not oriented: they are ordered by their experimentally determined cytological locations, but their orientation and exact order remain unclear. Chromosome U consists of unordered, unoriented scaffolds. While the underlying sequence of the scaffolds annotated in Release 3.2 has not changed, the mapping and ordering of these scaffolds on chromosome arms (e.g. 2h, 3h...) may differ from previous releases.
The transition between the euchromatic and heterochromatic regions of the genome is thought to be a gradual one, and there are no objective rules to categorize the sequence in this transitional area as definitively euchromatic or heterochromatic. Currently the boundaries between the euchromatic and heterochromatic portions of the genome are based on cytological data, as described in Hoskins et al. 2002.
Annotation guidelines consistent with FlyBase and the overall Drosophila genome annotation were adhered to whenever possible. However, since these annotations are based on high-quality draft sequence, certain gene models may contain missing or premature stop codons, missing start codons, or gaps within their ORFs. Open reading frames corresponding to fragments of transposable elements are common in heterochromatin; every attempt was made to identify these and exclude them from the gene annotations.
Release 4 annotation of the heterochromatin should become available in Summer 2005.
As the DHGP adds new data and improves the quality of the underlying sequence and assembly in future releases, the quality of the annotations will also improve. The DHGP welcomes any feedback and data from the community that will assist in this effort.
KNOWN MUTATIONS IN THE SEQUENCED STRAIN
The sequenced strain, usually described as the y; cn bw sp strain, was known to carry mutations in those four genes. During annotation, mutations in other genes have been discovered (currently known are mutations in oc, LysC, MstProx, GstD5, Rh6, Gr22b, Or98b and CG8447). To allow compilation of a comprehensive proteome, wild-type protein sequences for these genes have been included in sequence entries to GenBank/EMBL/DDBJ. Wherever possible, a RefSeq accession based on an alternative wild-type sequence and curated as a FlyBase Annotated Genome Sequence (ARGS) has been provided.
GENOMIC SEQUENCE RELEASES vs. ANNOTATION RELEASES
The different releases of the D. melanogaster genomic sequence are designated by the whole number component of the release number. The first annotated genomic sequence was released on March 24, 2000, and constituted Release 1 (Adams et al. 2000). After Celera/BDGP filled 330 gaps and changed ~3000 annotations, Release 2 was made public in October 2000. This whole genome shotgun assembly had ~1300 gaps.
To produce the 116.8 Mb Release 3 euchromatic sequence, the BDGP closed almost all of the gaps in the euchromatic portion of the genome, and raised the sequence quality to an estimated error rate of less than one in 100,000 base pairs in the unique portion of the sequence, and less than one in 10,000 base pairs in the repetitive portion (Celniker et al. 2002). The accuracy of the assembly was verified by restriction digestion of BAC clones, and the composite sequences of transposable elements used in the previous releases were replaced in Release 3 with the true sequences of 1572 individual transposon insertions.
To create the 118.4 Mb Release 4 genomic sequence, 21 gaps were closed, and the assembly was validated in collaboration with the Genome Sciences Centre at the British Columbia Cancer Agency in Vancouver, Canada, using fingerprint analysis of a tiling path of BACs spanning the genome. This assembly has 23 gaps remaining.
The BDGP is continuing to improve the genomic sequence to high quality. Release 5 genomic sequence is being submitted as unannotated BACs to GenBank as it is finished.
Commencing with Release 3 and continuing into the future, changes to the gene models and other annotations will occur more often than changes to the underlying sequence. These changes are indicated by fractional release numbers; for example, 'Release 3.2' consists of the second update of annotations on the Release 3 genomic sequence. FlyBase will continue to increment release numbers across the entire genome.
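The numbering convention described above can be illustrated with a short Python sketch. The function names and the returned tuple are assumptions made for illustration only; they are not part of any FlyBase tool. The sketch treats the leading whole number as the genomic sequence release and everything after the first dot as the annotation update, and it tolerates a trailing letter as in 'Release 3.2b'.

```python
import re

# Hypothetical pattern: whole number = sequence release,
# remainder after the first dot = annotation update (e.g. "2", "2.1", "2b").
_RELEASE_RE = re.compile(r"^(\d+)(?:\.(\d+(?:\.\d+)*[a-z]?))?$")


def split_release(label):
    """Return (sequence_release, annotation_update) for a release label.

    annotation_update is None when the label names only the sequence release.
    """
    match = _RELEASE_RE.match(label.strip())
    if match is None:
        raise ValueError(f"not a recognised release number: {label!r}")
    return int(match.group(1)), match.group(2)


def same_underlying_sequence(label_a, label_b):
    """True when two labels share the whole-number (sequence) component."""
    return split_release(label_a)[0] == split_release(label_b)[0]
```

Under these assumptions, same_underlying_sequence("3.1", "3.2") is true because both labels refer to annotations on the Release 3 sequence, while same_underlying_sequence("3.2", "4.0") is false.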
In FlyBase, the release number will appear at the top of each annotation query and report page, and also at the FlyBase download sites for sequence. Please make a note of the release number you are working with.
The annotated sequence is submitted to GenBank as chromosome arms, and GenBank cuts these into segments of manageable size, averaging ~270 kb. When the underlying sequence for a given segment changes, GenBank increments the decimal version number. Note that this does not occur genome-wide, so some accession version numbers will change and others will not. On occasion, the underlying sequence has not changed, but the extent of a given segment may differ (to avoid dividing a gene model between two segments). Such a change in extent will also result in an increment of the version number. Changes to annotations are indicated by an updated date stamp.
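As a rough illustration of the versioning behavior described above, the following Python sketch splits a GenBank identifier of the form accession.version and compares two versions of the same segment. The helper names and the accession 'AE000000' used in the usage note are placeholders invented for this example, not real FlyBase or GenBank values.

```python
def split_accession(accession_version):
    """Split 'ACCESSION.VERSION' into (accession, integer version)."""
    accession, dot, version = accession_version.partition(".")
    if not dot or not accession or not version.isdecimal():
        raise ValueError(f"expected ACCESSION.VERSION, got {accession_version!r}")
    return accession, int(version)


def sequence_or_extent_changed(old, new):
    """True when the version of one segment has been incremented.

    Per the text above, an increment means the underlying sequence or the
    extent of the segment changed; annotation-only updates show up as a
    new date stamp instead and are not detected here.
    """
    old_accession, old_version = split_accession(old)
    new_accession, new_version = split_accession(new)
    if old_accession != new_accession:
        raise ValueError("identifiers refer to different segments")
    return new_version > old_version
```

For example, sequence_or_extent_changed("AE000000.2", "AE000000.3") returns True, whereas comparing "AE000000.3" with itself returns False even if the annotations on that segment were updated.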
Examples of release number changes and corresponding GenBank version numbers are shown in the table below.
RELEASE 4.0 ANNOTATION
Annotations from Release 3.2 were promoted to the Release 4 sequence without further assessment; this constituted Release 4.0, made public in November 2004 (euchromatic sequences only).
Very few annotations differed between Release 3.2 and Release 4.0. Forty-one gene models that fell within regions of underlying sequence change exhibited changes in transcript sequences; of these, 25 resulted in changes to the predicted proteins. Two entities were deleted: one gene model was merged with its neighbor, and one natural transposable element insertion was not present in the Release 4 sequence. In addition, the initiating amino acid of the CDSs with non-AUG starts (erroneously annotated in r3.2) was corrected, and one gene model omitted from Release 3.2 was reinstated.
RELEASE 3.2 ANNOTATION
The March 2004 Release 3.2 included new sequence features curated from the fly literature, such as mutational lesions, aberration breakpoints, and insertion sites of transgenic constructs. These new sequence features may be accessed via the Gene Annotation reports; however, they are not included in the Release 3.2 GenBank submissions. A major addition to annotated gene models in Release 3.2 was the inclusion of 100 5SrRNAs (of the estimated 160 genes in the 56F 5SrRNA gene cluster); this includes four 5SrRNA pseudogenes.
When Release 3 of the genomic sequence became available, FlyBase conducted a comprehensive review of all euchromatic annotations (Misra et al. 2002). The goals of this re-annotation were:
The Release 3 re-annotation improved the quality of the majority of gene models. The length of UTRs and the number of alternative transcripts increased, due to the increase in EST and complete cDNA sequences. The fine details of the exon-intron structure were significantly improved. Numerous genes were merged and/or split, based on the cDNA and BLASTX data; some genes predicted in earlier releases were deleted, and others were newly predicted. Genes were deleted if they overlapped transposons or if they fell below a minimum size cutoff (100aa) and had no experimental evidence beyond a computational gene prediction. Overall, these improved annotations resulted in changes in >45% of the predicted proteins.
Transposable elements (TEs) in the Release 4 sequence have been completely re-annotated, using a combined evidence approach described in Quesneville et al. (2005) (PLoS Comp. Biol. 1 (2):e22). Many more elements are now annotated in Release 4 relative to Release 3 because of improved methods, inclusion of more families, and the addition of more TE-dense sequence in pericentromeric regions. Many of these new elements are very short (a few hundred nucleotides) and/or from divergent copies including the large INE-1 family, not previously annotated in Release 3.
GENE AND TRANSCRIPT IDENTIFIERS
In Releases 3.0 and 3.1, protein-coding genes were given 'CG' identifiers of the form CGnnnnn. For non-protein-coding genes, such as tRNAs, snRNAs, snoRNAs, microRNAs, miscellaneous non-coding RNAs, and pseudogenes, 'CR' identifiers of the form CRnnnnn were assigned. Transposable elements were given TEnnnnn identifiers. Transcripts were assigned FlyBase transcript identifiers, for which the gene identifier is followed by a suffix -R[A-Z]; e.g., CG12345-RA, CG12345-RB. For peptides, the -R[A-Z] suffix is replaced by a -P[A-Z] suffix, with the second identifying letter always in agreement with that of the corresponding transcript; e.g., CG12345-PA, CG12345-PB.
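The identifier scheme described above lends itself to a small parsing sketch in Python. The function names, the returned dictionary, and the decision to reject peptide lookups for CR identifiers are assumptions made for this illustration and do not describe any FlyBase software; TEnnnnn identifiers are deliberately not handled.

```python
import re

# CG = protein-coding gene, CR = non-protein-coding gene;
# an optional -R[A-Z] marks a transcript and -P[A-Z] a peptide.
_ID_RE = re.compile(r"^(CG|CR)(\d+)(?:-([RP])([A-Z]))?$")

_FEATURE = {None: "gene", "R": "transcript", "P": "peptide"}


def parse_annotation_id(identifier):
    """Break an identifier such as 'CG12345-RB' into its parts."""
    match = _ID_RE.match(identifier)
    if match is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    prefix, number, kind, letter = match.groups()
    return {
        "gene": prefix + number,
        "feature": _FEATURE[kind],
        "isoform": letter,            # None for a bare gene identifier
        "protein_coding": prefix == "CG",
    }


def peptide_for(transcript_id):
    """Return the peptide identifier paired with a transcript identifier.

    Assumption: only CG (protein-coding) transcripts have peptides.
    """
    info = parse_annotation_id(transcript_id)
    if info["feature"] != "transcript" or not info["protein_coding"]:
        raise ValueError(f"no peptide identifier for {transcript_id!r}")
    return f"{info['gene']}-P{info['isoform']}"
```

With these definitions, peptide_for("CG12345-RB") yields "CG12345-PB", which reflects the rule that the transcript and peptide letters always agree.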
In Release 3.2, the standard symbols for gene annotation CGnnnnn were replaced with the accepted gene symbol (where available). For example, CG8094, CG8094-RA, and CG8094-PA become gene Hex-C, transcript Hex-C-RA, and protein Hex-C-PA. The CG8094 ID is still supported as a more computable alternative to this symbolic name, but will be less visible.
In Releases 1 and 2, only protein-coding genes were annotated, and CGnnnn identifiers were assigned to genes, CTnnnn identifiers to transcripts, and pp-CTnnnn identifiers to peptides. These old Release 1 and 2 CT identifiers are now obsolete, and there is no mapping between CT identifiers and the Release 3 CGnnnn-RA identifiers. However, in most cases the CT identifier has become a synonym of the gene, and can be queried using the FlyBase Gene Search page to find the gene it was associated with in Release 2. In some cases, a Release 2 gene may correspond to more than one Release 3 gene, e.g. if exons were redistributed or split between two new Release 3 genes.