
2.0 Annotation items in H-InvDB

H-InvDB provides the following annotation items;

The detailed annotation pipelines will be described later.

2.0.1 H-InvDB identifiers

Annotations in H-InvDB are assigned to individual transcripts, clusters, or proteins. Gene families/groups also receive their own identifiers. We defined a unique identifier for each of these annotation units as follows:

HIT (H-Invitational transcript):

Prefix HIT plus 9 digit numbers plus version_number; e.g. HIT000000001.1

We defined an HIT ID for each H-Inv cDNA, mRNA or RNA entry as the stable and unique identifier of that H-Invitational transcript. To track modifications to the sequence or annotation of an H-Inv transcript entry, an HIT version is assigned to each HIT ID and is always stated with it.

*1) For eHIT gene models (refer to 2.0.3 for the details), we assigned an HIT ID with an additional "e" at the prefix;
Prefix eHIT plus 9 digit numbers plus version_number; e.g. eHIT000000001.1

*2) For pHIT gene models (refer to 2.0.4 for the details), we assigned an HIT ID with an additional "p" at the prefix;
Prefix pHIT plus 9 digit numbers plus version_number; e.g. pHIT000000001.1

*3) Transcripts that map to multiple locations on the genome are assigned, for each location, an HIT ID with an additional multi-location_number plus version_number; e.g. HIT000000001_01.1

HIX (H-Invitational cluster):

Prefix HIX plus 7 digit numbers plus version_number; e.g. HIX0000001.1

We defined an HIX ID for each H-Inv cluster as its stable and unique identifier. Each H-Inv transcript entry is assigned an HIX ID that identifies its location in the human genome or its unmapped cluster. To track modifications to the genomic location or annotation of an H-Inv cluster entry, an HIX version is assigned to each HIX ID and is always stated with it.

HIP (H-Invitational protein):

Prefix HIP plus 9 digit numbers plus version_number; e.g. HIP000000001.1

We defined an HIP ID for each unique translation, which is a stable and unique identifier for each H-Invitational protein.

HIF (H-Invitational gene family/group):

Prefix HIF plus 7 digit numbers; e.g. HIF0000001

We defined an HIF ID for each H-Inv gene family/group as its stable and unique identifier. A unique HIF ID is assigned to each group of H-Inv cluster entries, identifying groupings of human genes in the context of sequence similarity and conservation of functional motifs.
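
The identifier formats above can be checked mechanically. The following is a minimal Python sketch, not part of H-InvDB itself: the names HInvId and parse_hinv_id are hypothetical, and it assumes that the multi-location suffix applies only to plain HIT IDs and that HIF IDs never carry a version, as the descriptions above suggest.

```python
import re
from typing import NamedTuple, Optional


class HInvId(NamedTuple):
    """Parsed parts of an identifier (hypothetical helper type)."""
    prefix: str
    number: str
    location: Optional[int]  # multi-location number, if present
    version: Optional[int]   # None for HIF IDs


# Number of digits expected after each prefix, as listed in 2.0.1.
_DIGITS = {"HIT": 9, "eHIT": 9, "pHIT": 9, "HIX": 7, "HIP": 9, "HIF": 7}

# prefix, digits, optional "_NN" location suffix, optional ".N" version
_PATTERN = re.compile(
    r"^(eHIT|pHIT|HIT|HIX|HIP|HIF)(\d+)(?:_(\d+))?(?:\.(\d+))?$"
)


def parse_hinv_id(text: str) -> HInvId:
    """Split an identifier into its parts, or raise ValueError."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an H-InvDB identifier: {text!r}")
    prefix, number, location, version = match.groups()
    if len(number) != _DIGITS[prefix]:
        raise ValueError(f"{prefix} expects {_DIGITS[prefix]} digits: {text!r}")
    # Assumption: only plain HIT IDs carry a multi-location suffix.
    if location is not None and prefix != "HIT":
        raise ValueError(f"unexpected location suffix: {text!r}")
    if prefix == "HIF":
        if version is not None:
            raise ValueError(f"HIF IDs have no version: {text!r}")
    elif version is None:
        # The version is always stated with HIT, HIX and HIP IDs.
        raise ValueError(f"missing version: {text!r}")
    return HInvId(
        prefix,
        number,
        int(location) if location is not None else None,
        int(version) if version is not None else None,
    )
```

For example, parse_hinv_id("HIT000000001_01.1") yields the prefix HIT, multi-location number 1 and version 1, while an ID with the wrong number of digits is rejected.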

2.0.2 HIT, H-Invitational transcripts

HIT stands for H-Invitational transcript and is one of the main annotation units in H-InvDB. The nucleotide sequence dataset for each H-InvDB release is obtained from the DDBJ server (http://www.ddbj.nig.ac.jp/index.html) on the date of the sequence data-fix.

The human transcript dataset consists of the following:

All annotations for each HIT are provided in the "Transcript view" (refer to 4.1 for the details).

2.0.3 eHIT gene models

The eHIT entry is a computationally and manually annotated gene model, whose exon-intron structure is synthetically predicted by integrating the information of EST and mRNA sequences. The "eHIT" collection aims to complement the "HIT" (H-InvDB's main transcript entry) annotation and to cover a wider space of the transcriptome.

2.0.3.1 Sequence and annotation data used for the construction of eHIT gene models

EST and mRNA sequences deposited in DDBJ/GenBank were used for the prediction of gene structures. The freeze date was May 9, 2007. Because we had already obtained the mRNA-genome splice alignments during the construction of the "HIT" gene models, we reused that alignment data for the eHIT construction. For the EST-genome splice alignments, we used the UCSC annotation data. We used only the spliced ESTs' annotation (described as intronESTs in UCSC) to filter out unreliable sequences or alignments and to construct reliable gene models by checking the consistency of exon-intron structures.

2.0.3.2 Gene model prediction (exon-intron structure)

First, we performed positional clustering of all ESTs and mRNAs on the human genome. We clustered mRNAs and ESTs based on exon overlap on the same strand by using the single-linkage clustering method, and identified gene clusters on the genome. Second, we determined one gene model (exon-intron structure) from each gene cluster by applying the following merging process: (step 1) all cluster members were sorted by the degree of the splice pattern's consensus with other members and by the fullness of the mRNA/EST sequence; (step 2) starting from the first mRNA/EST structure, we merged another mRNA/EST's exon-intron structure into the gene model if an exon of the given mRNA/EST overlapped with the gene model without any inconsistencies in the intron structures. By repeating this step until no more merging occurred, we obtained one gene model for the given gene cluster.
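
The positional clustering step can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function names, the tuple layout (name, chromosome, strand, exon list) and the half-open exon coordinates are assumptions made for illustration, and the pairwise comparison is quadratic, so it is only suitable for small examples. The merging process (steps 1 and 2) is not covered here.

```python
from collections import defaultdict


def exons_overlap(exons_a, exons_b):
    """True if any exon in exons_a overlaps any exon in exons_b.

    Exons are (start, end) pairs, assumed half-open: start <= x < end.
    """
    return any(
        s1 < e2 and s2 < e1
        for s1, e1 in exons_a
        for s2, e2 in exons_b
    )


def cluster_alignments(alignments):
    """Single-linkage clustering of spliced alignments.

    alignments: list of (name, chrom, strand, exons) tuples.
    Two alignments are linked when they lie on the same chromosome and
    strand and share at least one overlapping exon; clusters are the
    connected components of those links (union-find).
    """
    parent = list(range(len(alignments)))

    def find(i):
        # Find the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(alignments)):
        _, chrom_i, strand_i, exons_i = alignments[i]
        for j in range(i + 1, len(alignments)):
            _, chrom_j, strand_j, exons_j = alignments[j]
            if (chrom_i == chrom_j and strand_i == strand_j
                    and exons_overlap(exons_i, exons_j)):
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    clusters = defaultdict(list)
    for i, (name, _, _, _) in enumerate(alignments):
        clusters[find(i)].append(name)
    return sorted(sorted(members) for members in clusters.values())
```

Because the linkage is single, two sequences that share no exon can still end up in the same gene cluster when a third sequence overlaps both of them.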

2.0.3.3 Filtering of eHIT gene models

To avoid contamination due to wrong prediction and redundancy, we filtered out gene models which were judged as unreliable, problematic, or identical to "HIT" gene models. As a result, 629 eHITs remained as an additional set of gene models, which are expected to support the transcriptome annotation by H-Inv "HIT" transcript entries.

2.0.4 pHIT gene models

pHIT transcripts are novel gene candidates predicted from human genome sequences by JBIRC (http://h-invitational.jp/) using Cap Analysis Gene Expression (CAGE) tags and several gene (coding region) prediction programs.

The prediction programs analyze a genome sequence as input, and detect protein coding regions and splice sites by statistical methods, predicting the entire gene structure. We used GENSCAN, FGENESH, and HMMGene. To provide more accurate predictions than those by each single program, we integrated the predictions using the JIGSAW program.

CAGE tags are experimentally detected tag sequences that imply the 5'-end sequences of transcripts. The positions to which CAGE tags are mapped are considered to be transcription start sites (TSSs), so the regions downstream of the mapped positions are likely to contain genes. We mapped CAGE tags, publicly available from The Genome Network Projects of JAPAN, onto the genome and predicted genes in the regions downstream of the CAGE-mapped points.

2.0.4.1 Determination of target regions and prediction of pHIT gene models

The mapped CAGE tags often form clusters. A cluster with many tags is considered to be more reliable. Thus, we selected tag clusters with more than ten tags. Furthermore, we took the single best cluster when several clusters were within 1 kb of each other. Using 17,725 clusters, we analyzed the 100 kb regions downstream of them and predicted 2,988 genes. We removed some predicted genes that were considered to be immunoglobulin or read-through transcripts (long transcripts that concatenate two tandem genes). We used the sequences of chromosomes 1 to 22 and the X chromosome (the genome sequences of NCBI build 36).
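
The cluster selection and target-region steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used: the function names and the (chromosome, strand, position, tag count) layout are hypothetical, the "best" cluster is taken to be the one with the most tags, proximity is checked per chromosome and strand, and the 1 kb rule is applied greedily from the largest cluster downwards.

```python
def select_tag_clusters(clusters, min_tags=10, window=1000):
    """Pick reliable CAGE tag clusters.

    clusters: list of (chrom, strand, position, tag_count) tuples.
    Keeps clusters with more than min_tags tags, then keeps only the
    cluster with the most tags among those lying within `window` bases
    of each other on the same chromosome and strand (greedy choice).
    """
    reliable = [c for c in clusters if c[3] > min_tags]
    # Visit clusters from the most to the fewest tags.
    reliable.sort(key=lambda c: -c[3])
    selected = []
    for chrom, strand, pos, count in reliable:
        too_close = any(
            chrom == s_chrom and strand == s_strand
            and abs(pos - s_pos) <= window
            for s_chrom, s_strand, s_pos, _ in selected
        )
        if not too_close:
            selected.append((chrom, strand, pos, count))
    return sorted(selected)


def downstream_region(cluster, length=100_000):
    """Return (chrom, start, end) of the region downstream of a cluster.

    Downstream means increasing coordinates on the "+" strand and
    decreasing coordinates on the "-" strand; the start is clipped at 0.
    """
    chrom, strand, pos, _ = cluster
    if strand == "+":
        return (chrom, pos, pos + length)
    return (chrom, max(0, pos - length), pos)
```

Each region returned by downstream_region would then be passed to the gene prediction programs; that step is outside the scope of this sketch.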

2.0.4.2 Available information for the pHIT gene models

pHIT numbers are assigned to predicted gene structures, which represent the coding sequence portions of transcripts. Of the 2,988 predicted genes, we provide only those predicted with high confidence, which makes the pHIT numbers discontinuous. We also provide the names of five CAGE tags picked at random from the CAGE tag cluster used for each prediction.

Revised: December 26, 2007