
2.3 Annotation policies: Proteins

2.3.1 Prediction of coding-sequence (CDS).

We predicted the coding sequences (CDSs) of H-Invitational transcript sequences using a computational approach. After removing redundant cDNAs as described previously, we classified loci as protein-coding or non-protein-coding; a subset of the protein-coding loci was annotated as pseudogene candidates (see 2.3.4).
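As a rough illustration of what computational CDS prediction involves, the sketch below finds the longest open reading frame (ATG to the first in-frame stop codon) on the forward strand of a transcript. This is an assumption made for illustration only: the text above does not state which prediction method H-InvDB used, and the function name `longest_orf` is hypothetical.

```python
# Minimal sketch: longest forward-strand ORF as a stand-in for CDS prediction.
# NOTE: this is NOT the H-InvDB method, which the text does not specify.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF, or None if there is none.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # close this ORF and look for the next ATG
    return best
```

A real pipeline would also have to handle incomplete ORFs at the transcript ends and sequencing errors, which this sketch ignores.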

2.3.2 Determination of 'H-Inv proteins'.

Because the protein products of alternative splicing isoforms are expected to be quite similar in structure and function, we selected one 'representative transcript' for each locus. The full set of protein-coding loci defines a set of human proteins, which we call 'H-Inv proteins'. All 'H-Inv proteins' were determined by a combination of computational prediction and careful human curation.

2.3.3 Procedure of standardized functional annotation

After determining the H-Inv proteins, we assigned a standardized functional annotation to each of them. Each 'H-Inv protein' was assigned the most suitable 'data source ID' based on the results of a similarity search and InterProScan. According to the level of sequence similarity, we classified the 'H-Inv proteins' into seven categories, as illustrated in Fig 2.3.1.

Fig 2.3.1 Scheme illustrating functional annotation of H-Inv proteins.

The diagram illustrates the human curation pipeline that classifies H-Inv proteins into seven similarity categories: Category I, II, III, IV, V, VI and VII proteins.

2.3.4 Annotation of transcribed pseudogene candidates

H-InvDB transcribed pseudogene candidates were predicted in the following two steps:
[Step1] Filtering of functional protein-coding genes and detection of frameshift and nonsense mutations
Based on the functional annotation, we filtered out functional protein-coding genes by targeting only representative Category II transcripts. We then identified the transcripts with a frameshift error or nonsense mutation based on their alignment with the target protein by FASTY.
[Step2] Prediction of transcribed pseudogene candidates with a support vector machine (SVM)
We applied a support vector machine (SVM) to predict transcribed pseudogene candidates using the selected parameters.
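
To make the notion of a nonsense mutation in Step1 concrete, the sketch below reports in-frame stop codons that occur before the final codon of a coding sequence. It is a simplified assumption for illustration: it scans the nucleotide sequence directly and does not reproduce the FASTY alignment against a target protein used in H-InvDB, and the function name `premature_stops` is hypothetical.

```python
# Minimal sketch: flag premature in-frame stop codons (nonsense candidates).
# NOTE: an illustration only; H-InvDB used FASTY alignments, not this scan.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def premature_stops(cds):
    """Return 0-based offsets of in-frame stop codons before the final codon.

    The sequence is read in frame 0; trailing bases that do not form a
    complete codon are ignored.
    """
    cds = cds.upper()
    last_codon_start = len(cds) - len(cds) % 3 - 3
    return [
        i
        for i in range(0, last_codon_start, 3)
        if cds[i:i + 3] in STOP_CODONS
    ]
```

An empty list means no premature stop was found in this frame; it says nothing about frameshifts, which require comparison with a reference protein.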

Revised: December 26, 2007