Integrated proteogenomics database

Interpreting identifiers

To create an iPtgxDB, the annotations from the different sources were collapsed into singletons (same sequence in all sources) or annotation clusters of two or more sequences with the same stop codon but different start sites. For each cluster, we define an anchor sequence from the annotation highest up in the hierarchy (Fig. 3A), e.g. RefSeq2015. We construct an informative and transparent protein identifier that integrates all relevant information: a code is added to the anchor sequence for each identical annotation (RefSeq2013=rso, Ensembl=ens, Genoscope=geno, Prodigal=prod, ChemGenome=chemg, in silico ORF=orf) separated by a pipe sign (e.g. BH_RS00220|rso|ens|geno). Identical in silico ORFs are not considered. For alternative start sites, the length difference compared to the anchor annotation is added prior to the code (e.g. ...|-17aa_prod|+6aa_chemg). Finally, chromosome, start and stop position, reading frame, start codon and CDS length complete the identifier. The anchor sequence identifier thus integrates relevant information of the genomic location and all annotation sources for this region, including possible reductions and extensions (Fig. 3A).
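The assembly rules above can be illustrated with a minimal Python sketch. It is not the code used to build the iPtgxDB; the function name build_identifier, its parameters, and the representation of cluster members as (source code, aa difference) pairs are assumptions made for illustration. Members are expected in hierarchy order, with a difference of 0 meaning an annotation identical to the anchor.

```python
# Source codes as listed in the text; the dict itself is only an illustrative helper.
SOURCE_CODES = {
    "RefSeq2013": "rso",
    "Ensembl": "ens",
    "Genoscope": "geno",
    "Prodigal": "prod",
    "ChemGenome": "chemg",
    "in silico ORF": "orf",
}


def build_identifier(anchor_id, members, chromosome, start, stop,
                     frame, start_codon, length_aa):
    """Assemble an identifier string from an anchor and its cluster members.

    members: list of (source_code, aa_difference) tuples in hierarchy order
             (hypothetical input format); aa_difference == 0 means the
             annotation is identical to the anchor.
    """
    parts = [anchor_id]
    for code, diff in members:
        if code == "orf" and diff == 0:
            # Identical in silico ORFs are not considered.
            continue
        if diff == 0:
            parts.append(code)
        else:
            # Signed length difference precedes the source code, e.g. "+6aa_chemg".
            parts.append(f"{diff:+d}aa_{code}")
    # Chromosome, start, stop, reading frame, start codon and CDS length.
    parts.append(f"{chromosome}_{start}_{stop}_{frame}_{start_codon}_{length_aa}")
    return "|".join(parts)
```

Called with the anchor BHGENO0898, the members [("chemg", -21), ("prod", 0)] and the location fields NC_005956, 922641, 922847, "+3", "ATG", 68, the sketch reproduces the identifier format described in the text.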


Identifiers for entries with alternative initiation sites contain a reference to the anchor annotation, the length difference and the annotation source (e.g. BH_RS00220_+6aa_chemg). Note that the CDS length for alternative initiation sites reflects the full protein sequence length up to the stop codon of the anchor sequence, although the corresponding protein sequence that is added to the search DB covers only the sequence up to the first tryptic cleavage site in the anchor sequence. For pseudogenes, we added the suffix "_p" (e.g. BH_RS02905_p) or "_fCDS_p" (e.g. BHGENO0333_fCDS_p; "fragmented CDS", for Genoscope pseudogenes) to the identifier, and added a sequence translated up to the first stop codon to the protein DB.


The following examples explain the identifiers of a few novelties uncovered for Bartonella henselae Houston-1. The iPtgxDB identifier for the protein BHGENO0898|-21aa_chemg|prod|NC_005956_922641_922847_+3_ATG_68 (Fig. 4A) indicates it is a novel ORF in Bhen (BH) predicted by Genoscope (GENO). The additional annotation sources encoded in the identifier imply that the corresponding annotation cluster consists of a shorter ChemGenome prediction (-21aa_chemg: 21 aa shorter) and an identical Prodigal prediction, but no annotation in RefSeq. The encoding gene is located on the NC_005956 chromosome, starting at 922,641 bp and ending at 922,847 bp in the +3 reading frame. The initiation codon is ATG and the protein is 68 aa long.

The iPtgxDB identifier for the protein BHORF|NC_005956_1305464_1305568_+2_ATG_34 (Fig. 4B) indicates it is a novel ORF in Bhen (BH), in this case solely predicted by in silico 6-frame translation (ORF), as no additional annotation sources are listed in the identifier. The encoding gene is located on the NC_005956 chromosome, starting at 1,305,464 bp and ending at 1,305,568 bp in the +2 reading frame. The initiation codon is ATG and the protein is 34 aa long.

For a highly expressed pseudogene, the iPtgxDB identifier for the protein BH_RS01070_p|rso|ens|geno|+8aa_chemg|prod|NC_005956_294440_292854_-3_ATG_528 (Fig. 4C) indicates it is a pseudogene in Bhen (BH), here in RefSeq2015 (_p). The additional annotation sources encoded in the identifier show that the annotation cluster consists of other, identical annotations that are not pseudogenes, plus a ChemGenome annotation that predicts a longer protein (+8aa_chemg: 8 aa longer). The encoding gene is located on the NC_005956 chromosome, starting at 294,440 bp and ending at 292,854 bp in the -3 reading frame. The initiation codon is ATG and the protein is 528 aa long.

For a novel extension annotated by ChemGenome, the iPtgxDB identifier for the protein BH_RS01750|-7aa_rso|+63aa_chemg|prod|NC_005956_416783_416364_-3_TTG_139 (Fig. 4D) indicates it is an annotated protein-coding ORF in Bhen (BH), here by RefSeq2015 (RS). The additional annotation sources encoded in the identifier imply that the corresponding annotation cluster consists of an alternative start site in the RefSeq2013 annotation (-7aa_rso: 7 aa shorter), an alternative start site in the ChemGenome annotation (+63aa_chemg: 63 aa longer), and an identical Prodigal prediction. The encoding gene is located on the NC_005956 chromosome, starting at 416,783 bp and ending at 416,364 bp in the -3 reading frame. The initiation codon is TTG and the protein is 139 aa long.
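The reading of identifiers shown in these examples can also be done programmatically. The Python sketch below splits an identifier into the fields discussed above. It is an illustration only, not part of the iPtgxDB tooling: the names parse_identifier and span_matches_length, the regular expressions, and the returned dictionary layout are all assumptions. The length check rests on an observation from the examples shown here (the coordinate span covers the protein plus one stop codon), not on a rule stated in the text.

```python
import re

# Trailing location field: chromosome_start_stop_frame_startcodon_length.
LOCATION_RE = re.compile(
    r"^(?P<chromosome>.+)_(?P<start>\d+)_(?P<stop>\d+)"
    r"_(?P<frame>[+-][1-3])_(?P<start_codon>[ACGT]{3})_(?P<length_aa>\d+)$"
)
# Cluster member: a bare source code ("prod") or a prefixed one ("+63aa_chemg").
MEMBER_RE = re.compile(r"^(?:(?P<diff>[+-]\d+)aa_)?(?P<code>[a-z]+)$")


def parse_identifier(identifier):
    """Split an identifier into anchor, cluster members and location fields."""
    fields = identifier.split("|")
    if len(fields) < 2:
        raise ValueError(f"expected at least anchor and location: {identifier!r}")
    anchor, members, location = fields[0], fields[1:-1], fields[-1]

    loc = LOCATION_RE.match(location)
    if loc is None:
        raise ValueError(f"unrecognised location field: {location!r}")

    sources = []
    for member in members:
        m = MEMBER_RE.match(member)
        if m is None:
            raise ValueError(f"unrecognised cluster member: {member!r}")
        # A missing prefix means the annotation is identical to the anchor (0 aa).
        sources.append((m.group("code"), int(m.group("diff") or 0)))

    return {
        "anchor": anchor,
        "pseudogene": anchor.endswith("_p"),  # covers "_p" and "_fCDS_p"
        "sources": sources,
        "chromosome": loc.group("chromosome"),
        "start": int(loc.group("start")),
        "stop": int(loc.group("stop")),
        "frame": loc.group("frame"),
        "start_codon": loc.group("start_codon"),
        "length_aa": int(loc.group("length_aa")),
    }


def span_matches_length(parsed):
    """Check that the coordinate span equals (length_aa + 1) codons.

    Assumption inferred from the examples: the span includes the stop codon.
    """
    span_bp = abs(parsed["stop"] - parsed["start"]) + 1
    return span_bp % 3 == 0 and span_bp // 3 - 1 == parsed["length_aa"]
```

For the Fig. 4D identifier, the span from 416,783 bp to 416,364 bp is 420 bp, i.e. 140 codons, which corresponds to the 139 aa protein plus the stop codon.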