ProteinKG25 is a large-scale knowledge graph (KG) dataset in which GO term entities are aligned with textual descriptions and protein entities are aligned with protein sequences. It contains 612,483 entities and 4,990,097 triplets (4,879,951 Protein-GO triplets and 110,146 GO-GO triplets). We provide both the inductive and the transductive settings used in the original paper.
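As a quick sanity check, the triplet counts quoted above can be verified to add up. The snippet below is only an illustrative sketch: the constant and function names are made up for this example and are not part of the dataset or its tooling.

```python
# Counts copied from the dataset description above.
PROTEIN_GO_TRIPLETS = 4_879_951
GO_GO_TRIPLETS = 110_146
TOTAL_TRIPLETS = 4_990_097


def counts_are_consistent(protein_go: int, go_go: int, total: int) -> bool:
    """Return True if the two triplet kinds sum to the reported total."""
    return protein_go + go_go == total


consistent = counts_are_consistent(
    PROTEIN_GO_TRIPLETS, GO_GO_TRIPLETS, TOTAL_TRIPLETS
)

# Fraction of all triplets that link a protein to a GO term.
protein_go_share = PROTEIN_GO_TRIPLETS / TOTAL_TRIPLETS

print(f"counts consistent: {consistent}")
print(f"Protein-GO share of all triplets: {protein_go_share:.1%}")
```

The vast majority of triplets connect proteins to GO terms; the GO-GO triplets form a comparatively small ontology backbone.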
| Setting | Split | #protein entity | #go entity | #relation | #triplet |
|---|---|---|---|---|---|
| Transductive | train | 541,916 | 28,610 | 52 | 4,861,576 |
| Transductive | valid | 7,834 | 1,865 | 44 | 12,354 |
| Transductive | test | 381,428 | 10,611 | 54 | 1,267,362 |
| Inductive | train | 541,916 | 28,610 | 52 | 4,861,576 |
| Inductive | valid | 608 | 118 | 12 | 1,170 |
| Inductive | test | 1,553 | 541 | 37 | 4,405 |
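The difference between the two settings can be illustrated with a small check. This is a minimal sketch that assumes the usual meaning of the terms in knowledge graph completion: in the transductive setting every entity in the evaluation split also appears in training, while in the inductive setting the evaluation split contains entities unseen during training. The identifiers (`protein_A`, `go_X`, `rel_1`, and so on) and the helper names are hypothetical and do not come from ProteinKG25.

```python
from typing import Iterable, Set, Tuple

# A triplet is (head entity, relation, tail entity).
Triple = Tuple[str, str, str]


def entities(triples: Iterable[Triple]) -> Set[str]:
    """Collect every entity that appears as a head or a tail."""
    found: Set[str] = set()
    for head, _relation, tail in triples:
        found.add(head)
        found.add(tail)
    return found


def unseen_entities(
    train: Iterable[Triple], evaluation: Iterable[Triple]
) -> Set[str]:
    """Entities in the evaluation split that never occur in training."""
    return entities(evaluation) - entities(train)


# Toy data with hypothetical identifiers, for illustration only.
train = [
    ("protein_A", "rel_1", "go_X"),
    ("go_X", "rel_2", "go_Y"),
]
transductive_test = [("protein_A", "rel_1", "go_Y")]  # all entities seen
inductive_test = [("protein_B", "rel_1", "go_X")]     # protein_B is new

print(unseen_entities(train, transductive_test))
print(unseen_entities(train, inductive_test))
```

An empty result for a split is consistent with a transductive split under this assumption; a non-empty result indicates that a model must generalize to new entities.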
We obtain GO term information from Gene Ontology, a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. It consists of a set of GO terms (or concepts) and the relations that hold between them. The vocabulary of genes and gene products in Gene Ontology is divided into three categories, covering three aspects of biology: biological process (BPO), molecular function (MFO), and cellular component (CCO).
Every entity in Gene Ontology belongs to one of BPO, MFO, or CCO. The relations between GO terms and proteins are mainly concentrated in the following three types:
Protein-Go triplet example: