ProteinKG65 is a large-scale knowledge graph (KG) dataset in which GO term entities are aligned with text descriptions and protein entities are aligned with protein sequences. It contains 614,099 entities and 5,620,437 triplets (5,510,437 Protein-GO triplets and 110,000 GO-GO triplets). We provide both the inductive and the transductive settings used in the original paper.
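As a rough illustration of the two settings, the sketch below assumes the usual KG-completion convention: in a transductive split every entity in the evaluation triplets also appears in training, while in an inductive split at least some evaluation entities are unseen during training. The helper `unseen_entities` and the toy identifiers are hypothetical and are not taken from the dataset files.

```python
from typing import Iterable, Set, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)


def entities(triplets: Iterable[Triplet]) -> Set[str]:
    """Collect every head and tail entity that occurs in the triplets."""
    found: Set[str] = set()
    for head, _relation, tail in triplets:
        found.add(head)
        found.add(tail)
    return found


def unseen_entities(train: Iterable[Triplet], evaluation: Iterable[Triplet]) -> Set[str]:
    """Return evaluation entities that never occur in the training triplets."""
    return entities(evaluation) - entities(train)


# Toy data with made-up identifiers (not real ProteinKG65 entries).
train = [("protein_A", "rel_1", "go_X"), ("protein_B", "rel_2", "go_Y")]
transductive_test = [("protein_A", "rel_2", "go_Y")]
inductive_test = [("protein_C", "rel_1", "go_X")]

print(unseen_entities(train, transductive_test))  # empty set
print(unseen_entities(train, inductive_test))     # contains protein_C
```

An empty result is what a transductive evaluation split would look like under this assumption; a non-empty result signals entities the model has to handle without having seen them in training.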

| Setting | Split | #protein entity | #GO entity | #relation | #triplet |
|---|---|---|---|---|---|
| Transductive | train | 543,110 | 28,524 | 57 | 4,884,034 |
| Transductive | valid | 25,241 | 5,009 | 44 | 51,243 |
| Transductive | test | 217,463 | 17,908 | 57 | 575,160 |
| Inductive | train | 543,110 | 28,524 | 57 | 4,884,034 |
| Inductive | valid | 855 | 270 | 31 | 2,216 |
| Inductive | test | 3,085 | 1,062 | 50 | 11,127 |

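The counts above can be cross-checked with simple arithmetic. The sketch below hard-codes the triplet counts from the table and the totals quoted in the introduction; the variable names are illustrative only.

```python
# Triplet counts copied from the statistics table.
splits = {
    "transductive": {"train": 4_884_034, "valid": 51_243, "test": 575_160},
    "inductive": {"train": 4_884_034, "valid": 2_216, "test": 11_127},
}

# Totals quoted in the introduction.
protein_go_triplets = 5_510_437
go_go_triplets = 110_000
all_triplets = 5_620_437

totals = {name: sum(counts.values()) for name, counts in splits.items()}

# The two triplet types add up to the overall total.
print(protein_go_triplets + go_go_triplets == all_triplets)  # True
# The transductive splits together account for all Protein-GO triplets.
print(totals["transductive"] == protein_go_triplets)  # True
# The inductive splits cover fewer triplets overall.
print(totals["inductive"])  # 4897377
```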
We obtain GO term information from the Gene Ontology, a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. It consists of a set of GO terms (or concepts) and the relations that operate between them. The vocabulary of genes and gene products in the Gene Ontology is divided into three categories, covering three aspects of biology: biological process (BPO), molecular function (MFO), and cellular component (CCO).
All entities in the Gene Ontology belong to one of BPO, MFO, or CCO. The relations between GO terms and proteins are mainly concentrated in the following three types:
To mitigate this severe long-tail effect, we refine some of these relations to keep the data balanced. The total number of relations after refinement is 65, compared with 25 before.
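To make the long-tail effect concrete, the sketch below counts how often each relation occurs in a list of triplets and reports the share held by each one. The relation names and counts are made up for illustration and do not reflect the actual ProteinKG65 distribution; `relation_shares` is a hypothetical helper, not part of any released code.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (head, relation, tail)


def relation_shares(triplets: Iterable[Triplet]) -> List[Tuple[str, float]]:
    """Return (relation, share of all triplets) pairs, most frequent first."""
    counts = Counter(relation for _head, relation, _tail in triplets)
    total = sum(counts.values())
    return [(rel, n / total) for rel, n in counts.most_common()]


# Toy triplets: one dominant relation and two rare ones (made-up names).
toy = (
    [(f"p{i}", "rel_common", "go_1") for i in range(90)]
    + [(f"p{i}", "rel_rare_a", "go_2") for i in range(7)]
    + [(f"p{i}", "rel_rare_b", "go_3") for i in range(3)]
)

for relation, share in relation_shares(toy):
    print(f"{relation}: {share:.0%}")
```

In a distribution like this toy one, a single relation dominates, which is the kind of imbalance that splitting coarse relations into finer ones is meant to reduce.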
Protein-GO triplet example: