ProteinKG65


Introduction


ProteinKG65 is a large-scale knowledge graph (KG) dataset in which GO term entities are aligned with text descriptions and protein entities are aligned with protein sequences. It contains about 614,099 entities and 5,620,437 triples (5,510,437 Protein-GO triplets and 110,000 GO-GO triplets). We provide both the inductive and the transductive settings used in the original paper.

Setting        Split   #protein entity   #GO entity   #relation   #triplet
Transductive   train   543,110           28,524       57          4,884,034
Transductive   valid   25,241            5,009        44          51,243
Transductive   test    217,463           17,908       57          575,160
Inductive      train   543,110           28,524       57          4,884,034
Inductive      valid   855               270          31          2,216
Inductive      test    3,085             1,062        50          11,127
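The text here does not define the two settings. Assuming the usual knowledge-graph meaning (in a transductive split every evaluation entity also appears in training, while an inductive split contains evaluation entities never seen in training), a split can be inspected with a sketch like the one below. The function name and the in-memory `(head, relation, tail)` tuples are illustrative assumptions, not part of the dataset tooling.

```python
def unseen_entities(train_triples, eval_triples):
    """Return the entities in eval_triples that never occur in train_triples.

    Each triple is a (head, relation, tail) tuple. This representation is an
    assumption made for illustration, not the dataset's file format.
    """
    seen = set()
    for head, _relation, tail in train_triples:
        seen.add(head)
        seen.add(tail)

    unseen = set()
    for head, _relation, tail in eval_triples:
        for entity in (head, tail):
            if entity not in seen:
                unseen.add(entity)
    return unseen
```

Under that assumption, an empty result is what one would expect for a transductive split and a non-empty result for an inductive one.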

We obtain GO term information from Gene Ontology, a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. It consists of a set of GO terms (or concepts) with relations that operate between them. The vocabulary of genes and gene products in Gene Ontology is divided into three categories, covering three aspects of biology: biological process, molecular function, and cellular component.

GO term


All entities in Gene Ontology belong to BPO, MFO, or CCO. The relationships between GO terms and proteins are mainly concentrated in three types.

To mitigate this severe long-tail effect, we refine some of these relationships to keep the data balanced. The total number of relationships after refinement is 65, compared with 25 before.
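A long-tailed relation distribution can be made visible by counting how often each relation occurs. The following is a minimal sketch, not the authors' refinement procedure: the function names, the tuple representation, and the relation names `rel_a`, `rel_b`, `rel_c` are all hypothetical.

```python
from collections import Counter


def relation_frequencies(triples):
    """Return (relation, count) pairs, most frequent first."""
    counts = Counter(relation for _head, relation, _tail in triples)
    return counts.most_common()


def top_k_share(triples, k):
    """Fraction of all triples covered by the k most frequent relations."""
    freqs = relation_frequencies(triples)
    total = sum(count for _relation, count in freqs)
    if total == 0:
        return 0.0
    return sum(count for _relation, count in freqs[:k]) / total


# Toy triples invented for illustration; they are not taken from ProteinKG65.
toy_triples = (
    [("P1", "rel_a", "GO:x")] * 6
    + [("P2", "rel_b", "GO:y")] * 3
    + [("P3", "rel_c", "GO:z")] * 1
)
```

A `top_k_share` close to 1 for a small `k` indicates that a few relations dominate the data, which is the kind of imbalance the refinement is meant to reduce.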

Data


Download ProteinKG65

Zenodo

Details


ProteinKG65 follows the Gene Ontology and Gene Annotations released in April 2020. Each entity (GO term entity or protein entity) is identified by a unique ID.

Protein-GO triplet example:

71090(Q14028)       36(enables_nucleotide_binding)       117(GO:0000166)

GO-GO triplet example:

0(GO:0000001)       0(is_a)       23558(GO:0048308)

Protein sequence example:

P0DQM8:       QKCCSGGSCPLYFRDRLICPCC
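The examples above can be parsed with a few lines of Python. This sketch assumes that lines look exactly like the examples, that is, whitespace-separated `ID(label)` fields for triplets and `ACCESSION: SEQUENCE` for sequences. The released files may use a different layout (for example, bare numeric IDs), and `Field`, `parse_triplet`, and `parse_sequence` are hypothetical helper names.

```python
import re
from typing import NamedTuple, Tuple

# One annotated field such as "71090(Q14028)": a numeric ID, then a label in parentheses.
_FIELD = re.compile(r"(\d+)\(([^()]*)\)")


class Field(NamedTuple):
    index: int  # numeric ID before the parentheses
    label: str  # accession, relation name, or GO ID inside the parentheses


def parse_triplet(line: str) -> Tuple[Field, Field, Field]:
    """Parse a 'head relation tail' line whose fields are written as ID(label)."""
    fields = [Field(int(index), label) for index, label in _FIELD.findall(line)]
    if len(fields) != 3:
        raise ValueError(f"expected 3 annotated fields, got {len(fields)}: {line!r}")
    return fields[0], fields[1], fields[2]


def parse_sequence(line: str) -> Tuple[str, str]:
    """Parse an 'ACCESSION: SEQUENCE' line into (accession, sequence)."""
    accession, separator, sequence = line.partition(":")
    if not separator:
        raise ValueError(f"missing ':' separator: {line!r}")
    return accession.strip(), sequence.strip()
```

For instance, `parse_triplet` applied to the Protein-GO example line yields a head with `index` 71090 and `label` `Q14028`.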

Publication


Cite


@article{cheng2022proteinkg65,
title={Multi-modal Protein Knowledge Graph Construction and Applications},
author={Cheng, Siyuan and Liang, Xiaozhuan and Bi, Zhen and Chen, Huajun and Zhang, Ningyu},
journal={arXiv preprint arXiv:2207.10080},
year={2022}
}