raphaelsty/mkb: Knowledge Base Embedding By Cooperative Knowledge Distillation

Topics: knowledge-graph, knowledge-graph-embeddings, graph-embedding, graph, machine-learning, pytorch, wn18, knowledge, mkb, embeddings, distillation, triplets

mkb's Introduction

mkb is a library dedicated to knowledge graph embeddings. The purpose of this library is to provide modular tools using PyTorch.


Table of contents

  • 👾 Installation
  • ⚡️ Quickstart
  • 🗂 Datasets
  • 🤖 Models
  • 🎭 Negative sampling
  • 🤖 Train your model
  • 📊 Evaluation
  • 🤩 Get embeddings
  • 🔍 Transformers
  • 🧰 Development
  • 💬 Citations
  • 👍 See also
  • 🗒 License

👾 Installation

You should be able to install and use this library with any Python version above 3.6.

pip install git+https://github.com/raphaelsty/mkb

⚡️ Quickstart:

Load or initialize your dataset as a list of triplets:

train = [
    ('🦆', 'is a', 'bird'),
    ('🦅', 'is a', 'bird'),

    ('🦆', 'lives in', '🌳'),
    ('🦉', 'lives in', '🌳'),
    ('🦅', 'lives in', '🏔'),

    ('🦉', 'ability', 'fly'),
    ('🦅', 'ability', 'fly'),

    ('🐌', 'is a', 'mollusc'),
    ('🐜', 'is a', 'insect'),
    ('🐝', 'is a', 'insect'),

    ('🐌', 'lives in', '🌳'),
    ('🐝', 'lives in', '🌳'),

    ('🐝', 'ability', 'fly'),

    ('🐻', 'is a', 'mammal'),
    ('🐶', 'is a', 'mammal'),
    ('🐨', 'is a', 'mammal'),

    ('🐻', 'lives in', '🏔'),
    ('🐶', 'lives in', '🏠'),
    ('🐱', 'lives in', '🏠'),
    ('🐨', 'lives in', '🌳'),

    ('🐬', 'lives in', '🌊'),
    ('🐳', 'lives in', '🌊'),

    ('🐋', 'is a', 'marine mammal'),
    ('🐳', 'is a', 'marine mammal'),
]

valid = [
    ('🦆', 'ability', 'fly'),
    ('🐱', 'is a', 'mammal'),
    ('🐜', 'lives in', '🌳'),
    ('🐬', 'is a', 'marine mammal'),
    ('🐋', 'lives in', '🌊'),
    ('🦉', 'is a', 'bird'),
]

Train your model to learn coherent embeddings for the entities and relations of your dataset using a pipeline:

from mkb import datasets
from mkb import models
from mkb import losses
from mkb import sampling
from mkb import evaluation
from mkb import compose

import torch

_ = torch.manual_seed(42)

# Set device = 'cuda' if you have a GPU.
device = 'cpu'

dataset = datasets.Dataset(
    train      = train,
    valid      = valid,
    batch_size = 24,
)

model = models.RotatE(
    entities   = dataset.entities,
    relations  = dataset.relations,
    gamma      = 3,
    hidden_dim = 200,
)

model = model.to(device)

optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr = 0.003,
)

negative_sampling = sampling.NegativeSampling(
    size          = 24,
    train_triples = dataset.train,
    entities      = dataset.entities,
    relations     = dataset.relations,
    seed          = 42,
)

validation = evaluation.Evaluation(
    true_triples = dataset.true_triples,
    entities     = dataset.entities,
    relations    = dataset.relations,
    batch_size   = 8,
    device       = device,
)

pipeline = compose.Pipeline(
    epochs                = 100,
    eval_every            = 50,
    early_stopping_rounds = 3,
    device                = device,
)

pipeline = pipeline.learn(
    model      = model,
    dataset    = dataset,
    evaluation = validation,
    sampling   = negative_sampling,
    optimizer  = optimizer,
    loss       = losses.Adversarial(alpha=1)
)

Plot embeddings:

from sklearn import manifold
from sklearn import cluster

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
emojis_tokens = {
    '🦆': 'duck',
    '🦅': 'eagle',
    '🦉': 'owl',
    '🐌': 'snail',
    '🐜': 'ant',
    '🐝': 'bee',
    '🐻': 'bear',
    '🐶': 'dog',
    '🐨': 'koala',
    '🐱': 'cat',
    '🐬': 'dolphin',
    '🐳': 'whale',
    '🐋': 'humpback whale',
}

embeddings = pd.DataFrame(model.embeddings['entities']).T.reset_index()

embeddings = embeddings[embeddings['index'].isin(emojis_tokens)].set_index('index')

tsne = manifold.TSNE(n_components = 2, random_state = 42, n_iter=1500, perplexity=3, early_exaggeration=100)

kmeans = cluster.KMeans(n_clusters = 5, random_state=42)

X = tsne.fit_transform(embeddings)
X = pd.DataFrame(X, columns = ['dim_1', 'dim_2'])
X['cluster'] = kmeans.fit_predict(X)

%config InlineBackend.figure_format = 'retina'

fgrid = sns.lmplot(
    data = X,
    x = 'dim_1',
    y = 'dim_2',
    hue = 'cluster',
    fit_reg = False,
    legend = False,
    legend_out = False,
    height = 7,
    aspect = 1.6,
    scatter_kws={"s": 500}
)

ax = fgrid.axes[0,0]
ax.set_ylabel('')
ax.set_xlabel('')
ax.set(xticklabels=[])
ax.set(yticklabels=[])

for i, label in enumerate(embeddings.index):

    ax.text(
        X['dim_1'][i] + 1,
        X['dim_2'][i],
        emojis_tokens[label],
        horizontalalignment = 'left',
        size = 'medium',
        color = 'black',
        weight = 'semibold',
    )

plt.show()
[Figure: t-SNE projection of the learned entity embeddings, coloured by KMeans cluster]

🗂 Datasets

Datasets available:

  • datasets.CountriesS1
  • datasets.CountriesS2
  • datasets.CountriesS3
  • datasets.Fb13
  • datasets.Fb15k
  • datasets.Fb15k237
  • datasets.InferWiki16k
  • datasets.InferWiki64k
  • datasets.Kinship
  • datasets.Nations
  • datasets.Nell995
  • datasets.Umls
  • datasets.Wn11
  • datasets.Wn18
  • datasets.Wn18rr
  • datasets.Yago310

Load existing dataset:

from mkb import datasets

dataset = datasets.Wn18rr(batch_size=256)

dataset
Wn18rr dataset
    Batch size          256
    Entities            40923
    Relations           11
    Shuffle             True
    Train triples       86834
    Validation triples  3033
    Test triples        3134

Or create your own dataset:

from mkb import datasets

train = [
    ('🦆', 'is a', 'bird'),
    ('🦅', 'is a', 'bird'),
    ('🦉', 'ability', 'fly'),
    ('🦅', 'ability', 'fly')
]

valid = [
    ('🦉', 'is a', 'bird')
]

test = [
    ('🦆', 'ability', 'fly')
]

dataset = datasets.Dataset(
    train = train,
    valid = valid,
    test = test,
    batch_size = 3,
    seed = 42,
)

dataset
Dataset dataset
        Batch size  3
          Entities  5
         Relations  2
           Shuffle  True
     Train triples  4
Validation triples  1
      Test triples  1

🤖 Models

Knowledge graph models build latent representations of the nodes (entities) and relations of the graph. These models implement an optimization process that represents entities and relations in a consistent vector space.
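
For intuition, translation-based models such as TransE score a triplet by how close head + relation lands to tail in the embedding space: the higher the score, the more plausible the triplet. The snippet below is only an illustrative sketch of that idea, not mkb's internal implementation; gamma plays the role of the margin passed when initializing mkb models.

import torch

def transe_score(head, relation, tail, gamma=6.0):
    # Higher score = more plausible triplet; gamma acts as a margin.
    return gamma - torch.norm(head + relation - tail, p=1, dim=-1)

# Illustrative 200-dimensional embeddings.
head, relation, tail = torch.randn(3, 200)
print(transe_score(head, relation, tail))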

Models available:

  • models.TransE
  • models.DistMult
  • models.RotatE
  • models.pRotatE
  • models.ComplEx
  • models.SentenceTransformer
  • models.Transformer

Initialize a model:

from mkb import models

model = models.RotatE(
   entities   = dataset.entities,
   relations  = dataset.relations,
   gamma      = 6,
   hidden_dim = 500
)

model
RotatE model
    Entities embeddings  dim  1000
    Relations embeddings dim  500
    Gamma                     6.0
    Number of entities        40923
    Number of relations       11
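
Note that RotatE represents each entity with a complex-valued vector, so the entity embedding dimension is twice hidden_dim (real and imaginary parts), while relation embeddings keep hidden_dim dimensions.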

Set the learning rate of the model:

import torch

learning_rate = 0.00005

optimizer = torch.optim.Adam(
   filter(lambda p: p.requires_grad, model.parameters()),
   lr = learning_rate,
)

🎭 Negative sampling

Knowledge graph embedding models learn to distinguish existing triplets from generated triplets. The sampling module allows you to generate triplets that do not exist in the dataset.

from mkb import sampling

negative_sampling = sampling.NegativeSampling(
    size          = 256,
    train_triples = dataset.train,
    entities      = dataset.entities,
    relations     = dataset.relations,
    seed          = 42,
)
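
For intuition, corrupting a triplet means replacing its head or its tail with another entity so that the resulting triplet is (most likely) false. The snippet below is a minimal illustrative sketch of that idea; sampling.NegativeSampling does this internally, in batched form and with a fixed seed for reproducibility.

import random

known = {
    ('duck', 'is a', 'bird'),
    ('eagle', 'is a', 'bird'),
}

entities = ['duck', 'eagle', 'bird', 'fly']

def corrupt(triplet, entities, known, mode='tail'):
    # Replace the head or the tail with a random entity,
    # skipping corruptions that are actually known triplets.
    head, relation, tail = triplet
    while True:
        entity = random.choice(entities)
        candidate = (entity, relation, tail) if mode == 'head' else (head, relation, entity)
        if candidate not in known:
            return candidate

print(corrupt(('duck', 'is a', 'bird'), entities, known))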

🤖 Train your model

You can train your model using a pipeline:

from mkb import compose
from mkb import losses
from mkb import evaluation

validation = evaluation.Evaluation(
    true_triples = dataset.true_triples,
    entities   = dataset.entities,
    relations  = dataset.relations,
    batch_size = 8,
    device     = device,
)

pipeline = compose.Pipeline(
    epochs                = 100,
    eval_every            = 50,
    early_stopping_rounds = 3,
    device                = device,
)

pipeline = pipeline.learn(
    model      = model,
    dataset    = dataset,
    evaluation = validation,
    sampling   = negative_sampling,
    optimizer  = optimizer,
    loss       = losses.Adversarial(alpha=1)
)

You can also train your model with a lower level of abstraction:

from mkb import losses
from mkb import evaluation

validation = evaluation.Evaluation(
    true_triples = dataset.true_triples,
    entities   = dataset.entities,
    relations  = dataset.relations,
    batch_size = 8,
    device     = device,
)

loss = losses.Adversarial(alpha=0.5)

for epoch in range(2000):

    for data in dataset:

        # Batch of positive triplets, their weights and the corruption mode.
        sample = data['sample'].to(device)
        weight = data['weight'].to(device)
        mode = data['mode']

        # Generate negative triplets for the batch.
        negative_sample = negative_sampling.generate(sample=sample, mode=mode)

        negative_sample = negative_sample.to(device)

        # Score positive and negative triplets.
        positive_score = model(sample)

        negative_score = model(
            sample=sample,
            negative_sample=negative_sample,
            mode=mode
        )

        error = loss(positive_score, negative_score, weight)

        error.backward()

        _ = optimizer.step()

        optimizer.zero_grad()

    validation_scores = validation.eval(dataset=dataset.valid, model=model)

    print(validation_scores)

📊 Evaluation

You can evaluate the performance of your models with the evaluation module.

from mkb import evaluation

validation = evaluation.Evaluation(
    true_triples = dataset.true_triples,
    entities   = dataset.entities,
    relations  = dataset.relations,
    batch_size = 8,
    device     = device,
)

🎯 Link prediction task:

The link prediction task aims at finding the most likely head or tail for a given tuple. For example, the model should retrieve the entity United States for the triplet ('Barack Obama', 'president_of', ?).

Validate the model on the validation set:

validation.eval(model = model, dataset = dataset.valid)
{'MRR': 0.5833, 'MR': 400.0, 'HITS@1': 20.25, 'HITS@3': 30.0, 'HITS@10': 40.0}

Validate the model on the test set:

validation.eval(model = model, dataset = dataset.test)
{'MRR': 0.5833, 'MR': 600.0, 'HITS@1': 21.35, 'HITS@3': 38.0, 'HITS@10': 41.0}
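
As a reminder, these metrics are all derived from the rank of the true entity among all candidate entities: MR is the mean rank, MRR the mean reciprocal rank, and HITS@k the share of evaluated triplets for which the true entity is ranked within the top k. A minimal sketch of the computation (not mkb's evaluation code):

def ranking_metrics(ranks):
    # ranks: rank of the true head or tail for each evaluated triplet (1 = best).
    n = len(ranks)
    metrics = {
        'MR': sum(ranks) / n,
        'MRR': sum(1 / rank for rank in ranks) / n,
    }
    for k in (1, 3, 10):
        metrics[f'HITS@{k}'] = sum(rank <= k for rank in ranks) / n
    return metrics

print(ranking_metrics([1, 2, 5, 12]))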

🔎 Link prediction detailed evaluation:

You can get a more detailed evaluation of the link prediction task and measure the performance of the model according to the relation category: one-to-one (1_1), one-to-many (1_M), many-to-one (M_1) and many-to-many (M_M). The threshold parameter controls the average number of heads (respectively tails) per entity above which a relation is treated as "many".

validation.detail_eval(model=model, dataset=dataset.test, threshold=1.5)
          head                               tail
          MRR   MR HITS@1 HITS@3 HITS@10     MRR   MR HITS@1 HITS@3 HITS@10
relation
1_1       0.5  2.0    0.0    1.0     1.0  0.3333  3.0    0.0    1.0     1.0
1_M       1.0  1.0    1.0    1.0     1.0  0.5000  2.0    0.0    1.0     1.0
M_1       0.0  0.0    0.0    0.0     0.0  0.0000  0.0    0.0    0.0     0.0
M_M       0.0  0.0    0.0    0.0     0.0  0.0000  0.0    0.0    0.0     0.0

➡️ Relation prediction:

The task of relation prediction is to find the most likely relation for a given tuple (head, tail).

validation.eval_relations(model=model, dataset=dataset.test)
{'MRR_relations': 1.0, 'MR_relations': 1.0, 'HITS@1_relations': 1.0, 'HITS@3_relations': 1.0, 'HITS@10_relations': 1.0}

🦾 Triplet classification

The triplet classification task is designed to predict whether or not a triplet exists. It is available for every dataset in mkb except the Countries datasets.

from mkb import evaluation

evaluation.find_threshold(
    model = model,
    X = dataset.classification_valid['X'],
    y = dataset.classification_valid['y'],
    batch_size = 10,
)

Best threshold found from triplet classification valid set and associated accuracy:

{'threshold': 1.924787, 'accuracy': 0.803803}

evaluation.accuracy(
    model = model,
    X = dataset.classification_test['X'],
    y = dataset.classification_test['y'],
    threshold = 1.924787,
    batch_size = 10,
)

Accuracy of the model on the triplet classification test set:

0.793803
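
In other words, a triplet is predicted to exist when its score reaches the threshold found on the validation set, assuming higher scores mean more plausible triplets. A minimal sketch of that decision rule (illustrative only, not mkb's code):

def classify(scores, threshold):
    # Predict True (the triplet exists) when the score reaches the threshold.
    return [score >= threshold for score in scores]

print(classify([2.3, 0.7, 1.95], threshold=1.924787))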

🤩 Get embeddings

You can extract the embeddings of entities and relations computed by the model with the model.embeddings property.

model.embeddings['entities']
{'hello': tensor([ 0.7645,  0.8300, -0.2343]), 'world': tensor([ 0.9186, -0.2191,  0.2018])}
model.embeddings['relations']
{'lorem': tensor([-0.4869,  0.5873,  0.8815]), 'ipsum': tensor([-0.7336,  0.8692,  0.1872])}
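
As an example of what can be done with these embeddings, here is a small cosine-similarity lookup of the entities closest to a given entity. It is an illustrative sketch built on the dictionary shown above, not an mkb API:

import torch

def nearest(entity, embeddings, k=3):
    # Rank the other entities by cosine similarity to `entity`.
    query = embeddings[entity]
    scores = {
        other: torch.cosine_similarity(query, vector, dim=0).item()
        for other, vector in embeddings.items()
        if other != entity
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(nearest('hello', model.embeddings['entities'], k=1))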

🔍 Transformers

MKB provides an implementation of the paper Inductive Entity Representations from Text via Link Prediction. It allows you to train transformers to build embeddings of the entities of a knowledge graph under the link prediction objective. After fine-tuning the transformer on the link prediction task, we can use it to build an entity search engine, to perform knowledge graph completion tasks, or for any downstream task such as classification.

Using a transformer instead of static embeddings has several advantages, such as producing contextualized latent representations of entities. In addition, the model can encode entities it has never seen using their textual description. Training takes much longer than for a classical TransE model, but the model converges in fewer epochs.

MKB provides two classes dedicated to fine-tuning both Sentence Transformers and vanilla Transformers.

  • models.SentenceTransformer: Dedicated to Sentence Transformer models.
  • models.Transformer: Dedicated to traditional Transformer models.

Under the hood, the Transformer model is trained using entity labels. Therefore, it is important to provide relevant entity labels. We initialize an embedding matrix dedicated to relationships. The negative samples are generated following the in-batch strategy.

Here is how to fine-tune a sentence transformer under the link prediction objective:

from mkb import losses, evaluation, datasets, text, models
from transformers import AutoTokenizer, AutoModel

import json
import torch

_ = torch.manual_seed(42)

train = [
    ("jaguar", "cousin", "cat"),
    ("tiger", "cousin", "cat"),
    ("dog", "cousin", "wolf"),
    ("dog", "angry_against", "cat"),
    ("wolf", "angry_against", "jaguar"),
]

valid = [
    ("cat", "cousin", "jaguar"),
    ("cat", "cousin", "tiger"),
    ("dog", "angry_against", "tiger"),
]

test = [
    ("wolf", "angry_against", "tiger"),
    ("wolf", "angry_against", "cat"),
]

dataset = datasets.Dataset(
    batch_size = 5,
    train = train,
    valid = valid,
    test = test,
    seed = 42,
    shuffle=True,
)

device = "cpu"

model = models.SentenceTransformer(
    model = AutoModel.from_pretrained("sentence-transformers/all-mpnet-base-v2"),
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2"),
    entities = dataset.entities,
    relations = dataset.relations,
    gamma = 9,
    device = device,
)

model = model.to(device)

optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr = 0.000005,
)

# Link prediction evaluation for Transformers
evaluation = evaluation.TransformerEvaluation(
    entities = dataset.entities,
    relations = dataset.relations,
    true_triples = dataset.train + dataset.valid + dataset.test,
    batch_size = 2,
    device = device,
)

model = text.learn(
    model = model,
    dataset = dataset,
    evaluation = evaluation,
    optimizer = optimizer,
    loss = losses.Adversarial(alpha=0.5),
    negative_sampling_size = 5,
    epochs = 1,
    eval_every = 5,
    early_stopping_rounds = 3,
    device = device,
)

# Saving the Sentence Transformer model:
model.model.save_pretrained("model")
model.tokenizer.save_pretrained("model")

# Export the relation embeddings so they can be reloaded alongside the encoder.
relations = {}
for id_relation, label in model.relations.items():
    relations[label] = model.relation_embedding[id_relation].cpu().detach().tolist()

with open("relations.json", "w") as f:
    json.dump(relations, f, indent=4)

After training a Sentence Transformer on the link prediction task using MKB and saving the model, we can load the trained model using the sentence_transformers library.

from sentence_transformers import SentenceTransformer
import json
import numpy as np

# Entity encoder
model = SentenceTransformer("model", device="cpu")

# Relations embeddings
with open("relations.json", "r") as f:
    relations = json.load(f)
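
Once the encoder and the relation embeddings are loaded, one possible way to rank candidate tails for a new, possibly unseen, head is sketched below. The helper top_tails is hypothetical: it assumes a simple TransE-style distance and that the exported relation embeddings live in the same space as the sentence embeddings, which is only an approximation of the scoring used during fine-tuning.

def top_tails(head, relation, candidates, model, relations, k=2):
    # Hypothetical helper: rank candidate tail labels for (head, relation)
    # with a TransE-style distance. Assumes the relation embedding has the
    # same dimension as the sentence embeddings.
    head_embedding = model.encode(head)
    relation_embedding = np.array(relations[relation])
    tail_embeddings = model.encode(candidates)
    distances = np.linalg.norm(head_embedding + relation_embedding - tail_embeddings, axis=1)
    return [candidates[index] for index in np.argsort(distances)[:k]]

print(top_tails('big cat', 'cousin', ['cat', 'dog', 'wolf', 'tiger'], model, relations))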

Here is how to fine-tune a Transformer under the link prediction objective:

from mkb import losses, evaluation, datasets, text, models
from transformers import AutoTokenizer, AutoModel

model = models.Transformer(
    model = AutoModel.from_pretrained("bert-base-uncased"),
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased"),
    entities = dataset.entities,
    relations = dataset.relations,
    gamma = 9,
    device = device,
)

🧰 Development

# Download and navigate to the source code
$ git clone https://github.com/raphaelsty/mkb
$ cd mkb

# Create a virtual environment
$ python3 -m venv env
$ source env/bin/activate

# Install
$ python setup.py install

# Run tests
$ python -m pytest

💬 Citations

Knowledge Base Embedding By Cooperative Knowledge Distillation

@inproceedings{sourty-etal-2020-knowledge,
    title = "Knowledge Base Embedding By Cooperative Knowledge Distillation",
    author = {Sourty, Rapha{\"e}l  and
      Moreno, Jose G.  and
      Servant, Fran{\c{c}}ois-Paul  and
      Tamine-Lechani, Lynda},
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.coling-main.489",
    pages = "5579--5590",
}

👍 See also

There are a multitude of tools and libraries available on GitHub for building knowledge graph embeddings. These libraries are comprehensive and provide powerful implementations of knowledge graph embedding algorithms.

From a user's point of view, I find that most of these libraries suffer from a lack of modularity. That's why I created this tool: mkb addresses this modularity problem and integrates easily into a machine learning pipeline.

  • DGL-KE: High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings.

  • OpenKE: An Open-source Framework for Knowledge Embedding implemented with PyTorch.

  • GraphVite: GraphVite is a general graph embedding engine, dedicated to high-speed and large-scale embedding learning in various applications.

  • LibKGE: LibKGE is a PyTorch-based library for efficient training, evaluation, and hyperparameter optimization of knowledge graph embeddings (KGE).

  • TorchKGE: Knowledge Graph embedding in Python and Pytorch.

  • KnowledgeGraphEmbedding: RotatE, Knowledge Graph Embedding by Relational Rotation in Complex Space

🗒 License

This project is free and open-source software licensed under the MIT license.

mkb's People

Contributors: raphaelsty

mkb's Issues

Changing the knowledge set

I am trying to use this library and I find it very handy and useful, but I have a question that I can't find the answer to. Does mkb have the ability to add and remove knowledge online or does it require retraining the entire model?

Inference on `SentenceTransformer`

From your example, I was able to train a SentenceTransformer. I used this dummy data:

train = [
    ("jaguar", "cousin", "cat"),
    ("tiger", "cousin", "cat"),
    ("dog", "cousin", "wolf"),
    ("dog", "angry_against", "cat"),
    ("wolf", "angry_against", "jaguar"),
]

valid = [
    ("cat", "cousin", "jaguar"),
    ("cat", "cousin", "tiger"),
    ("dog", "angry_against", "tiger"),
]

test = [
    ("wolf", "angry_against", "tiger"),
    ("wolf", "angry_against", "cat"),
]

Given a new head node "big cat", which is unseen, I wanted to know if we can make an inference like:

get_top_tails( 
   k = 2,
   model = model,
   head = "big cat",
   relation = 'cousin'
)

k being the number of closest (top) tails.

Does this make sense, or have I completely misunderstood the SentenceTransformer model?
