Release 4 Notes

RE-ANNOTATED GENOMIC SEQUENCE

Release 4 Notes: Updated September 15, 2005

Release 4.2.1 of the euchromatin is now available at http://flybase.net/annot/.

RELEASE 4.2 ANNOTATION UPDATE

HETEROCHROMATIN

KNOWN MUTATIONS IN THE SEQUENCED STRAIN

GENOMIC SEQUENCE RELEASES vs. ANNOTATION RELEASES

RELEASE 4.1

RELEASE 4.0

RELEASE 3.2

RELEASE 3.1

TRANSPOSABLE ELEMENTS

GENE AND TRANSCRIPT IDENTIFIERS

RELEASE 4.2 ANNOTATION UPDATE

Release 4, initially comprised of genomic sequence only, was made public in April 2004 (euchromatic sequences). This release adds 1.4 Mb of high quality sequence, including 21 gaps that have been closed and two inverted regions that have been corrected (see BDGP Release 4 notes for details).

Release 4.2 is the third of regular updates that reflect gene-by-gene annotation assessments, rather than a comprehensive survey of the entire genome. Re-annotation of a gene model is triggered by new sequence data, data curated from the literature, or user communications. For updates that include changes to annotations only (and not the underlying sequence), the release numbers increase as decimal increments. These more frequent updates also include new supporting data, represented in the evidence tiers of the gene annotation reports (e.g., Sos), in the Gbrowse views, and in the Apollo annotation editor and viewer.

Release 4.2 annotations were available from FlyBase on July 13, 2005. A minor version update called Release 4.2.1 became available on September 13, 2005. This update added several new EST sequence alignments and transposable element insertion sites. A handful of redundant gene models were deleted in order to synchronize the FlyBase annotations with the public data libraries (NCBI, EBI, DDBJ), from which these annotations are also available. At FlyBase, these data are available from Gene Annotation reports, which are accessible from individual gene report pages or as a result of a query using the Basic Annotation Query Form, the FlyBase BLAST server, the batch query page, and the download site.

Previous releases, unannotated BAC-based sequences, and the WGS3 whole-genome shotgun sequence assembly continue to be available from GenBank. See the Heterochromatin section below for information about the release of Heterochromatin.

Tabulated information about features in this release is shown below, and a comparison of Release 4.2 to 4.1 with lists of new, split, merged, deleted, etc. genes may be found HERE. A major addition to gene models in Release 4.2 is the annotation of the histone gene clusters at 39D. Although the genomic sequence of this region is not fully resolved, the Release 4 sequence includes 23 histone gene clusters, including clusters that contain one or more pseudogenes.

In Release 4.2 the representation of dicistronic genes has been changed. Previously, a single annotated gene model (with a single CGnnnnn annotation id) existed for each dicistronic gene pair. In this release each component of the dicistronic pair has been treated as a separate annotated gene model. A dicistronic transcript is presented twice, with different start sites resulting in the two non-overlapping protein products; previously these two representations were considered alternative transcripts of a single gene model; in Release 4.2 they are presented as transcripts of two different gene models. Find more information on the new dicistronic identifiers HERE.

Annotation Statistics for Release 4.2.1

Note that gene model statistics include heterochromatin annotations, but aligned feature counts are for euchromatin only.

New Gene Models	270
Deleted Gene Models	0
Merged Gene Models	12
Split Gene Models	2
Unchanged peptides	18673

Annotated Gene Models	Count	Avg. size	Longest	Shortest	Change from previous release*
Genes	14715	4995	279927	16	+277
Protein coding genes	13987	5231	279927	144	+218
Protein coding transcripts	19608	2254	69571	132	+237
Exons	64813	48	27725	3	+550
Introns	48135	1189	185510	32	+240
5' Untranslated regions	17593	183	3391	1	+91
3' Untranslated regions	11949	373	5684	1	+104
Unique peptides	16968	556	23015	25	+110
rRNA	102	153	1995	29	0
tRNA**	294	75	186	71	-1
snRNA	46	111	255	36	+17
snoRNA	63	88	316	16	+35
miRNA	66	0	NA	NA	+66
miscellaneous non-coding RNA	107	1846	31065	19	-31
pseudogenes	50	1219	13064	53	+10

Transposable Elements Present in the Sequenced Strain	12794	1258	66001	21	+4434
Euchromatic transposable elements	6005	1258	66001	21	+4434
Heterochromatic repeat with transposon homology⁺	6189	NA	NA	NA	0

Other Annotated Gene Features	Count	Change from previous release
aberration junction	127	+41
enhancer	33	+6
point mutation	1000	+515
poly A site	126	+19
protein binding site	1370	+1278
regulatory region	219	+82
rescue fragment	207	+71
sequence variant	348	+116
signal peptide	1	0

Mapped reagent features	Count	Change from previous release
transposable element insertion site	33268	+16864
oligonucleotide	194086	0

Aligned evidence features⁺⁺	Algorithm	Count	Change from previous release
Nucleotide alignments
BAC	clone locator	710	0
D. melanogaster cDNA inserts	sim4tandem	10879	0
D. melanogaster EST (total)	sim4	308722	+87048
EST from sequenced strain	sim4	153900	+3674
EST from different strains	sim4	154822	+83374
Other melanogaster DNA sequences	sim4tandem	12707	+378

ab initio gene predictions
Genie prediction	Genie v2.2/flyGenie	11063	0
Genscan prediction	Genscan 1.0	17811	0
Augustus prediction	Augustus 1.0	12316	+12316

Proteins aligned
D. melanogaster proteins	WU-blastx 2.0	24086	0
Other Insect proteins	WU-blastx 2.0	7011	0
Nematode proteins	WU-blastx 2.0	6318	0
Yeast proteins	WU-blastx 2.0	2149	0
Plant proteins	WU-blastx 2.0	8319	0
Rodent proteins	WU-blastx 2.0	14732	0
Primate proteins	WU-blastx 2.0	13607	0
Other invertebrate proteins	WU-blastx 2.0	12991	0
Other vertebrate proteins	WU-blastx 2.0	10383	0

Translated nucleotide alignments
Insect ESTs	WU-tblastx 2.0	N.A.	N.A.
A. gambiae genomic	WU-tblastx 2.0	N.A.	N.A.
D. pseudoobscura genomic	WU-tblastx 2.0	N.A.	N.A.

* change is relative to Release 4.1 annotations
** 4 of the 294 tRNA genes are non-functional pseudogenes
⁺ Natural transposon insertions in heterochromatin are 'repeat_regions' with high TE homology. See Transposable Elements below.
⁺⁺ Aligned evidence feature counts are for euchromatin only.

The confidence we have in the annotated gene models varies considerably; improvements to the gene models will be ongoing, and will require the continued input of the community. If you notice a mistake in annotation, please submit an error report form (also accessed from the gene annotation reports) or write to flybase-updates AT morgan.harvard.edu. Updates may also be submitted as sequence records or as Apollo-generated XML files.

HETEROCHROMATIN

The sequence finishing and annotation of the heterochromatic region of the genome is being performed by the Drosophila Heterochromatin Genome Project (DHGP; see Hoskins et al. 2002). As sequence gaps are filled, and the heterochromatic scaffolds are finished to high quality and re-annotated, they will be contributed to GenBank and FlyBase and integrated into future releases of the Drosophila genomic sequence.

Release 3.2b annotations of the heterochromatic regions are available from FlyBase and the public data libraries (NCBI, EBI, DDBJ). At FlyBase, these data are available from FlyBase Gene Annotation reports, the FlyBase BLAST server, the batch query page, and the download site.

The Release 3.2b heterochromatin annotation represents the latest effort to describe the protein-coding genes, non-coding genes, and other features located in the heterochromatin sequence. In this update, the underlying sequence is the 20.7Mb of Release 3 whole-genome-shotgun (WGS) scaffolds from Celera that could not be assembled into the euchromatin arms as well as a few BDGP-sequenced scaffolds.

The WGS3 heterochromatin consists of ~2600 scaffolds that still contain gaps and collapsed repeats, but are otherwise considered relatively high-quality sequence. Some of these have been mapped to particular chromosome arms (i.e. 2h, 3h, 4h, Xh, or Yh), while the remainder have been placed on chromosome U. It is important to note that scaffolds that have been mapped to a particular chromosome arm are provisionally ordered, but not oriented: they are ordered by their experimentally determined cytological locations, but their orientation and exact order remain unclear. Chromosome U consists of unordered, unoriented scaffolds. While the underlying sequence of the scaffolds annotated in Release 3.2 has not changed, the mapping and ordering of these scaffolds on chromosome arms (e.g. 2h, 3h...) may differ from previous releases.

The transition between the euchromatic and heterochromatic regions of the genome is thought to be a gradual one, and there are no objective rules to categorize the sequence in this transitional area as definitively euchromatic or heterochromatic. Currently the boundaries between the euchromatic and heterochromatic portions of the genome are based on cytological data, as described in Hoskins et al. 2002.

Annotation guidelines consistent with FlyBase and the overall Drosophila genome annotation were adhered to whenever possible. However, since these annotations are based on high-quality draft sequence, certain gene models may contain missing or premature stop codons, missing start codons, or gaps within their ORFs. Open reading frames corresponding to fragments of transposable elements are common in heterochromatin; every attempt was made to identify these and exclude them from the gene annotations.

Release 4 annotation of the heterochromatin should become available in Summer 2005.

As the DHGP adds new data and improves the quality of the underlying sequence and assembly in future releases, the quality of the annotations will also improve. The DHGP welcomes any feedback and data from the community that will assist in this effort.

KNOWN MUTATIONS IN THE SEQUENCED STRAIN

The sequenced strain, usually described as the y[1]; cn[1] bw[1] sp[1] strain, was known to carry mutations in those four genes. During annotation, mutations in other genes have been discovered (currently known are mutations in oc, LysC, MstProx, GstD5, Rh6, Gr22b, Or98b and CG8447). To allow compilation of a comprehensive proteome, wild-type protein sequences for these genes have been included in sequence entries to GenBank/EMBL/DDBJ. Wherever possible, a RefSeq accession based on an alternative wild-type sequence and curated as a FlyBase Annotated Genome Sequence (ARGS) has been provided.

GENOMIC SEQUENCE RELEASES vs. ANNOTATION RELEASES

The different releases of the D. melanogaster genomic sequence are designated by the whole number component of the release number. The first annotated genomic sequence was released on March 24, 2000, and constituted Release 1 (Adams et al. 2000). After Celera/BDGP filled 330 gaps and changed ~3000 annotations, Release 2 was made public in October 2000. This whole genome shotgun assembly had ~1300 gaps.

To produce the 116.8 Mb Release 3 euchromatic sequence, the BDGP closed almost all of the gaps in the euchromatic portion of the genome, and raised the sequence quality to an estimated error rate of less than one in 100,000 base pairs in the unique portion of the sequence, and less than one in 10,000 base pairs in the repetitive portion (Celniker et al. 2002). The accuracy of the assembly was verified by restriction digestion of BAC clones, and the composite sequences of transposable elements in the previous releases were replaced in Release 3 with the true sequences of 1572 individual transposon insertions.

To create the 118.4 Mb Release 4 genomic sequence, 21 gaps were closed, and the assembly was validated in collaboration with the Genome Sciences Centre at the British Columbia Cancer Agency in Vancouver, Canada, using fingerprint analysis of a tiling path of BACs spanning the genome. This assembly has 23 gaps remaining.

The BDGP is continuing to improve the genomic sequence to high quality. Release 5 genomic sequence is being submitted as unannotated BACs to GenBank as it is finished.

Commencing with Release 3 and continuing into the future, changes to the gene models and other annotations will occur more often than changes to the underlying sequence. These changes are indicated by fractional release numbers; for example, 'Release 3.2' consists of the second update of annotations on the Release 3 genomic sequence. FlyBase will continue to increment release numbers across the entire genome.
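The numbering convention described above can be illustrated with a short sketch. It assumes release labels are plain dotted integers such as '4.2.1' or 'Release 3.2'; the names Release, parse_release, and same_sequence are hypothetical helpers introduced here for illustration and are not part of any FlyBase software. Labels with letter suffixes (for example 3.2b) are not handled.

```python
from typing import NamedTuple


class Release(NamedTuple):
    sequence: int    # whole-number part: genomic sequence release
    annotation: int  # first fractional part: annotation update on that sequence
    minor: int       # optional third part, e.g. the ".1" in 4.2.1


def parse_release(label: str) -> Release:
    """Split a label such as '4.2.1' or 'Release 3.2' into its components."""
    text = label.strip()
    if text.lower().startswith("release"):
        text = text[len("release"):].strip()
    parts = [int(p) for p in text.split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected release label: {label!r}")
    parts += [0] * (3 - len(parts))  # missing components default to 0
    return Release(*parts)


def same_sequence(a: str, b: str) -> bool:
    """True when two releases share the same underlying genomic sequence."""
    return parse_release(a).sequence == parse_release(b).sequence
```

Under these assumptions, same_sequence("4.0", "4.2.1") is true because both are annotation updates on the Release 4 sequence, while same_sequence("3.2", "4.0") is false.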

In FlyBase, the release number will appear at the top of each annotation query and report page, and also at the FlyBase download sites for sequence. Please make a note of the release number you are working with.

The annotated sequence is submitted to GenBank as chromosome arms, and GenBank cuts these into segments of manageable size, averaging ~270 kb. When the underlying sequence for a given segment changes, GenBank increments the decimal version number. Note that this does not occur genome-wide, so some accession version numbers will change and others will not. On occasion, the underlying sequence has not changed, but the extent of a given segment may differ (to avoid dividing a gene model between two segments). Such a change in extent will also result in an increment of the version number. Changes to annotations are indicated by an updated date stamp.

Examples of release number changes and corresponding GenBank version numbers are shown in the table below.

Date	Release	GenBank Version
March 2000	Release 1	AE003452.1
October 2000	Release 2	AE003452.2
June 2002	Release 3.0	AE003452.3
February 2003	Release 3.1	AE003452.4
March 2004	Release 3.2	AE003452.4
November 2004	Release 4.0	AE003452.5
February 2005	Release 4.1	AE003452.5

March 2000	Release 1	AE003463.1
October 2000	Release 2	AE003463.1
June 2002	Release 3.0	AE003463.2
February 2003	Release 3.1	AE003463.2
March 2004	Release 3.2	AE003463.2
November 2004	Release 4.0	AE003463.2
February 2005	Release 4.1	AE003463.2
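
The accession.version pattern in the table above can be compared mechanically. The following is a minimal sketch, assuming identifiers always take the form ACCESSION.VERSION with an integer version; split_accession and segment_changed are hypothetical helper names, not NCBI or FlyBase functions.

```python
from typing import Tuple


def split_accession(accession_version: str) -> Tuple[str, int]:
    """Split 'AE003452.5' into ('AE003452', 5)."""
    accession, _, version = accession_version.partition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"expected ACCESSION.VERSION, got {accession_version!r}")
    return accession, int(version)


def segment_changed(older: str, newer: str) -> bool:
    """True when the version suffix increased for the same segment,
    i.e. its sequence or extent changed between the two records."""
    old_acc, old_ver = split_accession(older)
    new_acc, new_ver = split_accession(newer)
    if old_acc != new_acc:
        raise ValueError("accessions refer to different segments")
    return new_ver > old_ver


# Rows taken from the table above (Release 3.2 versus Release 4.0).
print(segment_changed("AE003452.4", "AE003452.5"))  # True
print(segment_changed("AE003463.2", "AE003463.2"))  # False
```

An unchanged version suffix does not imply unchanged annotations; as noted above, annotation changes are indicated only by an updated date stamp.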

Links from FlyBase gene and annotation reports will go to the most recent release at NCBI. If you need access to a previous release, you can query at NCBI using the accession number including the version number suffix; click on 'revision history.'

RELEASE 4.1 ANNOTATION

Release 4.1 was the second of regular updates that reflect gene-by-gene annotation assessments, rather than a comprehensive survey of the entire genome. Re-annotation of a gene model was triggered by new sequence data, data curated from the literature, or user communications. For updates that included changes to annotations only (and not the underlying sequence), the release numbers increase as decimal increments. These more frequent updates also include new supporting data, represented in the evidence tiers of the gene annotation reports (e.g., Sos), in the Gbrowse views, and in the Apollo annotation editor and viewer.

RELEASE 4.0 ANNOTATION

Annotations from Release 3.2 were promoted to the Release 4 sequence without further assessment; this constituted Release 4.0, made public in November 2004 (euchromatic sequences only).

Very few annotations differed between Release 3.2 and Release 4.0. Forty-one gene models that fell within regions of underlying sequence change exhibited changes in transcript sequences; of these, 25 resulted in changes to the predicted proteins. Two entities were deleted: one gene model was merged with its neighbor, and one natural transposable element insertion was not present in the Release 4 sequence. In addition, the initiating amino acid of the CDSs with non-AUG starts (erroneously annotated in r3.2) was corrected, and one gene model omitted from Release 3.2 was reinstated.

RELEASE 3.2 ANNOTATION

The March 2004 Release 3.2 included new sequence features curated from the fly literature, such as mutational lesions, aberration breakpoints, and insertion sites of transgenic constructs. These new sequence features may be accessed via the Gene Annotation reports; however, they are not included in the Release 3.2 GenBank submissions. A major addition to annotated gene models in Release 3.2 was the inclusion of 100 5SrRNAs (of the estimated 160 genes in the 56F 5SrRNA gene cluster); this includes four 5SrRNA pseudogenes.

RELEASE 3.1

When Release 3 of the genomic sequence became available, FlyBase conducted a comprehensive review of all euchromatic annotations (Misra et al. 2002). The goals of this re-annotation were:

To manually inspect and synthesize the results of computational analysis of the entire euchromatic sequence into updated annotations, using a small group of human curators and a consistent set of rules.
To take advantage of the large numbers of new EST and full-length cDNA sequences from the BDGP (LBNL) and the community in improving gene models.
To add annotations of non-protein-coding genes, transposons, and pseudogenes.
To validate the results against published peptide sequences.

In order to address these goals, a new computational pipeline was created (Mungall et al. 2002) with an exhaustive list of Drosophila sequence datasets and SwissProt/trEMBL SWALL peptide datasets from other species. The results and datasets are stored in the new FlyBase genome annotation database, so that evidence for the annotations can be tracked and queried. A new graphical user interface, Apollo, was developed in a collaboration between FlyBase BDGP and Ensembl, to allow FlyBase biologist curators to easily view the results of computational analysis and efficiently edit the annotations (Lewis et al. 2002). A set of curation rules and a controlled vocabulary of comments were created to allow the group of ten curators to annotate consistently. Finally, a set of validation steps was created, including software to compare each predicted peptide to those curated peptides in SwissProt with experimental evidence.

The Release 3 re-annotation improved the quality of the majority of gene models. The length of UTRs and the number of alternative transcripts increased, due to the increase in EST and complete cDNA sequences. The fine details of the exon-intron structure were significantly improved. Numerous genes were merged and/or split, based on the cDNA and BLASTX data; some genes predicted in earlier releases were deleted, and others were newly predicted. Genes were deleted if they overlapped transposons or if they fell below a minimum size cutoff (100aa) and had no experimental evidence beyond a computational gene prediction. Overall, these improved annotations resulted in changes in >45% of the predicted proteins.

TRANSPOSABLE ELEMENTS

As a result of the whole genome shotgun assembly, the sequence of each transposon in Releases 1 and 2 was a composite derived from a number of elements of that transposon type. In Release 3, the sequence of each transposon insertion in the euchromatin of the y[1]; cn[1] bw[1] sp[1] strain was determined and characterized (Kaminker et al. 2002). See the BDGP Natural Transposable Element page for more information. The transposons in euchromatin were not updated in Release 3.1, Release 3.2, or Release 4.0; they were simply mapped forward.

Transposable elements (TEs) in the Release 4 sequence have been completely re-annotated, using a combined evidence approach described in Quesneville et al. (2005) (PLoS Comp. Biol. 1 (2):e22). Many more elements are now annotated in Release 4 relative to Release 3 because of improved methods, inclusion of more families, and the addition of more TE-dense sequence in pericentromeric regions. Many of these new elements are very short (a few hundred nucleotides) and/or from divergent copies, including the large INE-1 family, which was not previously annotated in Release 3.

The Drosophila heterochromatin sequence is extremely rich in repetitive satellite elements, simple repeats, and transposable element fragments. At the time of Release 3.1, greater than 55% of the Release 3 heterochromatin sequence was determined to have homology to a repetitive element of some type. Currently the Drosophila Heterochromatin Genome Project estimates that 75% of the Release 3 heterochromatin sequence consists of repetitive sequence. Since the repetitive regions in the heterochromatin are so fragmented and located in regions with many gaps and potential assembly errors, we did not rigorously curate and hand-identify transposable elements in the same manner as Kaminker et al. 2002 for the Release 3 euchromatin. Instead, we used the Kaminker et al. "Natural Transposable Element" dataset as a library for RepeatMasker to identify stretches of sequence that were likely to be a transposable element or repeat. Since these regions may not represent complete elements, or may contain many nested elements, the DHGP refers to these as 'repeat regions'. Essentially, 'repeat regions' are stretches of genomic sequence with a significant alignment to a known Drosophila transposable element or simple repeat. In most cases a repeat region consists of thousands of nested fragments of other transposable elements. Since our method relies on alignment to known elements, it is likely that some legitimate repeats remain to be identified.

The results of the heterochromatin repeat analysis can be seen as the 'Repeatmasker' result tier when using the Apollo genome viewer or obtained as FASTA, GFF, or GAME-XML from the DHGP FTP site.

GENE AND TRANSCRIPT IDENTIFIERS

In Releases 3.0 and 3.1, protein-coding genes were given 'CG' identifiers of the form CGnnnnn. For non-protein-coding genes, such as tRNAs, snRNAs, snoRNAs, microRNAs, miscellaneous non-coding RNAs, and pseudogenes, 'CR' identifiers of the form CRnnnnn were assigned. Transposable elements were given TEnnnnn identifiers. Transcripts were assigned FlyBase transcript identifiers, for which the gene identifier is followed by a suffix -R[A-Z]; e.g., CG12345-RA, CG12345-RB. For peptides, the -R[A-Z] suffix is replaced by a -P[A-Z] suffix, with the second identifying letter always in agreement with that of the corresponding transcript; e.g., CG12345-PA, CG12345-PB.
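
The identifier scheme above can be expressed as a regular expression. This is a minimal sketch, not FlyBase code: the names ParsedId, parse_annotation_id, and peptide_for are hypothetical, and the pattern accepts any number of digits (an assumption, since identifiers such as CG8094 have fewer than five) and a single suffix letter A-Z as stated in the text.

```python
import re
from typing import NamedTuple, Optional

# Prefix (CG, CR, or TE), digits, then an optional -R[A-Z] or -P[A-Z] suffix.
# Assumption: digit count is not fixed, and the suffix is a single letter.
_ID_PATTERN = re.compile(r"^(CG|CR|TE)(\d+)(?:-([RP])([A-Z]))?$")

_KINDS = {
    "CG": "protein-coding gene",
    "CR": "non-protein-coding gene or pseudogene",
    "TE": "transposable element",
}
_PRODUCTS = {"R": "transcript", "P": "peptide"}


class ParsedId(NamedTuple):
    gene_id: str             # e.g. 'CG12345'
    kind: str                # meaning of the prefix
    product: Optional[str]   # 'transcript', 'peptide', or None for a bare gene id
    letter: Optional[str]    # the A-Z letter shared by transcript and peptide


def parse_annotation_id(identifier: str) -> ParsedId:
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a CG/CR/TE identifier: {identifier!r}")
    prefix, number, product_code, letter = match.groups()
    return ParsedId(prefix + number, _KINDS[prefix], _PRODUCTS.get(product_code), letter)


def peptide_for(transcript_id: str) -> str:
    """Map a transcript id to its peptide id by swapping -R for -P."""
    parsed = parse_annotation_id(transcript_id)
    if parsed.product != "transcript":
        raise ValueError(f"not a transcript identifier: {transcript_id!r}")
    return f"{parsed.gene_id}-P{parsed.letter}"
```

For example, peptide_for("CG12345-RB") yields "CG12345-PB", reflecting the rule that the peptide letter always agrees with that of the corresponding transcript. The sketch does not check whether a suffix is meaningful for a given prefix, so it would also accept a suffixed TE identifier.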

In Release 3.2, the standard symbols for gene annotation CGnnnnn were replaced with the accepted gene symbol (where available). For example, CG8094, CG8094-RA, and CG8094-PA become gene Hex-C, transcript Hex-C-RA, and protein Hex-C-PA. The CG8094 ID is still supported as a more computable alternative to this symbolic name, but will be less visible.

In Releases 1 and 2, only protein-coding genes were annotated, and CGnnnn identifiers were assigned to genes, CTnnnn identifiers to transcripts, and pp-CTnnnn identifiers to peptides. These old Release 1 and 2 CT identifiers are now obsolete, and there is no mapping between CT identifiers and the Release 3 CGnnnn-RA identifiers. However, in most cases the CT identifier has become a synonym of the gene, and can be queried using the FlyBase Gene Search page to find the gene it was associated with in Release 2. In some cases, a Release 2 gene may correspond to more than one Release 3 gene, e.g. if exons were redistributed or split between two new Release 3 genes.