ProteinKG25


Introduction


ProteinKG25 is a large-scale knowledge graph (KG) dataset in which GO term entities are aligned with text descriptions and protein entities are aligned with protein sequences. It contains 612,483 entities and 4,990,097 triplets (4,879,951 protein-GO triplets and 110,146 GO-GO triplets). We provide both the inductive and the transductive settings used in the original paper.

Setting        Split   #protein entity   #go entity   #relation   #triplet
Transductive   train   541,916           28,610       52          4,861,576
Transductive   valid   7,834             1,865        44          12,354
Transductive   test    381,428           10,611       54          1,267,362
Inductive      train   541,916           28,610       52          4,861,576
Inductive      valid   608               118          12          1,170
Inductive      test    1,553             541          37          4,405
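
The difference between the two settings can be illustrated with a small check. The sketch below assumes the usual meaning of the terms in the knowledge-graph literature (transductive: every evaluation entity also occurs in the training triplets; inductive: evaluation may involve entities that training never saw) and that triplets are available as (head, relation, tail) tuples. The function names and the toy triplets are hypothetical and are not part of the released dataset or its tooling.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def entities(triples: Iterable[Triple]) -> Set[str]:
    """Collect every head and tail entity appearing in the triples."""
    found: Set[str] = set()
    for head, _relation, tail in triples:
        found.add(head)
        found.add(tail)
    return found


def unseen_entities(train: Iterable[Triple], evaluation: Iterable[Triple]) -> Set[str]:
    """Return evaluation entities that never occur in the training triples."""
    return entities(evaluation) - entities(train)


# Toy data, made up purely for illustration.
train = [("P1", "enables", "GO:A"), ("GO:A", "is_a", "GO:B")]
transductive_test = [("P1", "enables", "GO:B")]  # all entities seen in train
inductive_test = [("P2", "enables", "GO:A")]     # P2 is new

print(unseen_entities(train, transductive_test))  # set()
print(unseen_entities(train, inductive_test))     # {'P2'}
```

Running the same check on a real split would show whether its evaluation entities overlap with training, which is one quick way to confirm which setting a downloaded file corresponds to.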

We obtain GO term information from the Gene Ontology (GO), a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. It consists of a set of GO terms (or concepts) and the relations that operate between them. The vocabulary of genes and gene products in the Gene Ontology is divided into three categories, covering three aspects of biology: biological process (BPO), molecular function (MFO), and cellular component (CCO).

GO term


All entities in the Gene Ontology belong to one of BPO, MFO, or CCO. The relationships between GO terms and proteins are concentrated mainly in the following three types:

Data


Download ProteinKG25

Details


ProteinKG25 follows the Gene Ontology and Gene Annotations released in April 2020. Each entity (GO term entity or protein entity) is identified by a unique ID.

Protein-GO triplet example:

71090(Q14028)       36(enables_nucleotide_binding)       117(GO:0000166)

GO-GO triplet example:

0(GO:0000001)       0(is_a)                   23558(GO:0048308)

Protein sequence example:

P0DQM8:         QKCCSGGSCPLYFRDRLICPCC
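
The example lines above can be read programmatically. The following is a minimal sketch that parses the annotated form exactly as printed here, where each token is an integer index followed by a label in parentheses, and a sequence line is an accession, a colon, and the residues. The files in the actual download may be laid out differently (for example, as bare integer IDs), so this format is an assumption, and the names `Item`, `parse_triplet`, and `parse_sequence` are hypothetical helpers rather than part of any released code.

```python
import re
from typing import NamedTuple, Tuple

# One annotated token such as "71090(Q14028)": integer index, then a label in parentheses.
TOKEN = re.compile(r"(\d+)\(([^()]+)\)")


class Item(NamedTuple):
    index: int  # numeric ID used inside the dataset
    label: str  # human-readable name, e.g. a protein accession or GO ID


def parse_triplet(line: str) -> Tuple[Item, Item, Item]:
    """Split an annotated triplet line into (head, relation, tail)."""
    items = [Item(int(index), label) for index, label in TOKEN.findall(line)]
    if len(items) != 3:
        raise ValueError(f"expected 3 annotated tokens, got {len(items)}: {line!r}")
    return items[0], items[1], items[2]


def parse_sequence(line: str) -> Tuple[str, str]:
    """Split an 'ACCESSION:   SEQUENCE' line into its two parts."""
    accession, _, sequence = line.partition(":")
    accession, sequence = accession.strip(), sequence.strip()
    if not accession or not sequence.isalpha():
        raise ValueError(f"not a valid sequence line: {line!r}")
    return accession, sequence


head, relation, tail = parse_triplet(
    "71090(Q14028)       36(enables_nucleotide_binding)       117(GO:0000166)"
)
print(head.label, relation.label, tail.label)

accession, sequence = parse_sequence("P0DQM8:         QKCCSGGSCPLYFRDRLICPCC")
print(accession, len(sequence))
```

Keeping both the index and the label in `Item` makes it easy to move between the numeric IDs used for training and the accession or GO identifiers used for lookup.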

Publication


Cite


@inproceedings{
zhang2022ontoprotein,
title={OntoProtein: Protein Pretraining With Gene Ontology Embedding},
author={Ningyu Zhang and Zhen Bi and Xiaozhuan Liang and Siyuan Cheng and Haosen Hong and Shumin Deng and Qiang Zhang and Jiazhang Lian and Huajun Chen},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=yfe1VMYAXa4}
}