OncomAbKGEmbeddings
Therapeutic Monoclonal Antibodies Repurposing in Oncology Domain using IMGT/mAb-KG embeddings
The tool is freely available at Hypotheses_Generation Tool.
The code will be released once the manuscript is accepted.
Installation
This project was developed with Python 3.10.16 and uses the PyKEEN framework. The project's dependencies are listed in the requirements.txt file.
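After installing the dependencies (for example with pip install -r requirements.txt), a quick check like the one below can confirm which interpreter and which PyKEEN version are active. This is only a sketch: the helper name describe_environment is hypothetical and is not part of this repository.

```python
import sys
from importlib import metadata


def describe_environment(packages=("pykeen",)):
    """Return the running Python version and the installed version of each package.

    A package that is not installed maps to None instead of raising an error.
    """
    report = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    # Print one line per entry, e.g. the Python version and the pykeen version.
    for name, version in describe_environment().items():
        print(f"{name}: {version or 'not installed'}")
```

If the reported Python version differs from 3.10.16, the scripts may still run, but that configuration has not been described in this README.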
Usage
Three main Python scripts are provided:
- LaunchKGEModels.py trains all the KGE models available in PyKEEN. It accepts the following parameters:
  - kg_input (str): The path to the input knowledge graph file (e.g., a TTL file). This file should contain the triples that will be used for training and evaluation.
  - output_dir (str): The directory where all output files (trained models, logs, etc.) will be stored.
  - kg_format (str, default: "ttl"): The format of the input knowledge graph file. Common formats include "ttl", "nt", etc.
  - preprocess (bool, default: False): Whether to preprocess the knowledge graph before training. The preprocessing steps may vary depending on the use case and are specific to IMGT KG data.
  - is_only_onco (bool, default: False): If set to True, restricts the knowledge graph processing or filtering to oncology-related data only (if applicable).
  - create_inverse_triples (bool, default: False): Whether to automatically generate inverse relations for each triple, effectively doubling the number of relation types and potentially improving embedding performance.
  - ratio_train (float, default: 0.8): The proportion of triples to include in the training set.
  - ratio_test (float, default: 0.1): The proportion of triples to include in the test set.
  - ratio_val (float, default: 0.1): The proportion of triples to include in the validation set.
  - random_seed (int, default: 12346789): The seed used for random operations such as shuffling splits, ensuring reproducibility.
  - optimizer_class (str, optional): The class name of the optimizer to use (e.g., "Adam", "SGD"). If not provided, the default optimizer for the chosen KGE library may be used.
  - loss_function (str, optional): The name of the loss function to use (e.g., "margin_ranking", "cross_entropy"). If not provided, a default loss function may be used.
  - early_stopper_freq (int, default: 50): Frequency (in epochs) at which to evaluate performance and check whether the early stopping criteria are met.
  - early_stopper_metric (str, default: "hits@10"): The metric used to determine early stopping (e.g., "hits@1", "hits@10", "MRR").
  - early_stopper_patience (int, default: 10): Number of checks (based on early_stopper_freq) with no improvement after which training will be stopped early.
  - early_stopper_delta (float, default: 0.001): The minimum change in the monitored metric to qualify as an improvement for early stopping.
  - negative_sampler (str, default: "basic"): The sampling strategy used to generate negative examples (e.g., "basic", "bernoulli", etc.).
  - num_negs_per_pos (int, default: 1): Number of negative samples to generate for each positive triple.
  - embedding_dim (int, default: 200): The dimensionality of the embedding vectors.
  - num_epochs (int, default: 500): The number of training epochs.
  - batch_size (int, default: 1024): The size of each mini-batch during training.
  - use_restriction (bool, default: False): If True, apply any predefined restrictions (e.g., certain filtering or domain-specific constraints) during training or evaluation.
  - evaluation_relation_whitelist (optional): A list of relations to include in the evaluation. If None, all relations are evaluated.
  - evaluation_entity_whitelist (optional): A list of entities to include in the evaluation. If None, all entities are evaluated.
- Launch10KGEModels.py trains only the ten best-performing models from our experiments.
- utils.py contains various utility functions.
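To illustrate how ratio_train, ratio_test, ratio_val and random_seed interact, here is a minimal pure-Python sketch of a seeded split. It is an assumption about the general idea, not the code used by the scripts: the function split_triples and the toy triples are hypothetical, and the splitting actually performed through PyKEEN may differ (for example in how it keeps entities covered by the training set).

```python
import random


def split_triples(triples, ratio_train=0.8, ratio_test=0.1, ratio_val=0.1,
                  random_seed=12346789):
    """Shuffle triples with a fixed seed and cut them into train/test/validation lists."""
    if abs(ratio_train + ratio_test + ratio_val - 1.0) > 1e-9:
        raise ValueError("ratio_train + ratio_test + ratio_val must equal 1")
    shuffled = list(triples)
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(random_seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratio_train)
    n_test = round(len(shuffled) * ratio_test)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]  # whatever remains goes to validation
    return train, test, val


# Toy triples with made-up identifiers, for illustration only.
toy = [(f"mab_{i}", "hasTarget", f"target_{i % 7}") for i in range(100)]
train, test, val = split_triples(toy)
print(len(train), len(test), len(val))
```

Calling split_triples twice with the same seed returns identical partitions, which is the property the random_seed parameter is meant to provide.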
The input data is in the MAbKG/ directory, and the mAbOncoKGProcessed directory contains the processed data of the oncology domain version of IMGT/mAb-KG.
- To train and test all the models:
python -u LaunchKGEModels.py MAbKG/IMGT_ONTO_ABOX_MAB_ONCOLOGY_ONLY.ttl ExperimentMLOncoKG/mAbOncoOnlyProcessedbernoulli --preprocess True --early_stopper_freq 100 --embedding_dim 100 \
--negative_sampler bernoulli --num_negs_per_pos 1 --num_epochs 500 --is_only_onco True --use_restriction False --batch_size 64
- To train and test the ten best models:
python -u Launch10KGEModels.py MAbKG/IMGT_ONTO_ABOX_MAB_ONCOLOGY_ONLY.ttl ExperimentMLOncoKG/mAbOncoOnlyProcessedbernoulli --preprocess True --early_stopper_freq 100 \
--embedding_dim 200 --negative_sampler bernoulli --num_negs_per_pos 1 --num_epochs 500 --is_only_onco True --use_restriction False --batch_size 64
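The early stopping options interact with num_epochs. The sketch below simulates the check described for early_stopper_freq, early_stopper_patience and early_stopper_delta, assuming an absolute delta and a metric where higher is better. The function early_stopping_epoch and the plateau curve are hypothetical illustrations, not the logic used by PyKEEN or by the scripts.

```python
def early_stopping_epoch(metric_by_epoch, num_epochs=500, frequency=100,
                         patience=10, delta=0.001):
    """Return the epoch at which training would stop early, or None if it runs to the end.

    metric_by_epoch is a callable mapping an epoch number to a validation
    score where higher is better (e.g. hits@10).
    """
    best = float("-inf")
    checks_without_improvement = 0
    # The metric is only evaluated every `frequency` epochs.
    for epoch in range(frequency, num_epochs + 1, frequency):
        score = metric_by_epoch(epoch)
        if score > best + delta:
            best = score
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return epoch
    return None


def plateau(epoch):
    """Made-up validation curve that stops improving after epoch 100."""
    return min(0.5, epoch / 200)


# Settings from the commands above: a check every 100 epochs, 500 epochs, default patience.
print(early_stopping_epoch(plateau))
# A denser schedule with a shorter patience, for comparison.
print(early_stopping_epoch(plateau, frequency=50, patience=3))
```

Under this reading, --early_stopper_freq 100 with --num_epochs 500 yields only five checks, so a patience of 10 cannot be reached and training runs for all epochs. A smaller frequency or patience would be needed for early stopping to take effect.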
Support
Email: gaoussou.sanou@igh.cnrs.fr / patrice.duroux@igh.cnrs.fr