
kg-bert's Introduction

KG-BERT: BERT for Knowledge Graph Completion

The repository is modified from pytorch-pretrained-BERT and tested on Python 3.5+.

Installing required packages

pip install -r requirements.txt

Data

(1) The benchmark knowledge graph datasets are in ./data.

(2) entity2text.txt or entity2textlong.txt in each dataset contains entity textual sequences.

(3) relation2text.txt in each dataset contains relation textual sequences.
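
These text files appear to map each entity or relation identifier to a textual description (for FB15k-237, e.g. /m/02zyy4 paired with "Michael Madsen", as mentioned in an issue below). A minimal reading sketch, assuming one tab-separated id/text pair per line; inspect the files in ./data for the exact layout:

```python
# Hypothetical helper (not part of the repository): load "<id>\t<text>" lines into a dict.
def load_id2text(path):
    id2text = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t", 1)
            if len(parts) == 2:
                id2text[parts[0]] = parts[1]
    return id2text

entity2text = load_id2text("./data/FB15k-237/entity2text.txt")
relation2text = load_id2text("./data/FB15k-237/relation2text.txt")
```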

Reproducing results

1. Triple Classification

WN11

python run_bert_triple_classifier.py \
--task_name kg \
--do_train \
--do_eval \
--do_predict \
--data_dir ./data/WN11 \
--bert_model bert-base-uncased \
--max_seq_length 20 \
--train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--output_dir ./output_WN11/ \
--gradient_accumulation_steps 1 \
--eval_batch_size 512

FB13

python run_bert_triple_classifier.py \
--task_name kg \
--do_train \
--do_eval \
--do_predict \
--data_dir ./data/FB13 \
--bert_model bert-base-cased \
--max_seq_length 200 \
--train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--output_dir ./output_FB13/ \
--gradient_accumulation_steps 1 \
--eval_batch_size 512

2. Relation Prediction

FB15K

python3 run_bert_relation_prediction.py \
--task_name kg \
--do_train \
--do_eval \
--do_predict \
--data_dir ./data/FB15K \
--bert_model bert-base-cased \
--max_seq_length 25 \
--train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 20.0 \
--output_dir ./output_FB15K/ \
--gradient_accumulation_steps 1 \
--eval_batch_size 512

3. Link Prediction

WN18RR

python3 run_bert_link_prediction.py \
--task_name kg \
--do_train \
--do_eval \
--do_predict \
--data_dir ./data/WN18RR \
--bert_model bert-base-cased \
--max_seq_length 50 \
--train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 5.0 \
--output_dir ./output_WN18RR/ \
--gradient_accumulation_steps 1 \
--eval_batch_size 5000

UMLS

python3 run_bert_link_prediction.py \
--task_name kg \
--do_train \
--do_eval \
--do_predict \
--data_dir ./data/umls \
--bert_model bert-base-uncased \
--max_seq_length 15 \
--train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 5.0 \
--output_dir ./output_umls/ \
--gradient_accumulation_steps 1 \
--eval_batch_size 135

FB15k-237

python3 run_bert_link_prediction.py \
--task_name kg \
--do_train \
--do_eval \
--do_predict \
--data_dir ./data/FB15k-237 \
--bert_model bert-base-cased \
--max_seq_length 150 \
--train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 5.0 \
--output_dir ./output_FB15k-237/ \
--gradient_accumulation_steps 1 \
--eval_batch_size 1500

kg-bert's People

Contributors

yao8839836


kg-bert's Issues

bert-base-uncased files

If I download these files from Google myself and put them directly under the project directory, can they be used as-is? Do the three .ckpt files need any extra processing?
Thanks

Hardware configuration and running time

Hello @yao8839836,
this is very meaningful work! Could you tell me the running time on the WN18RR and FB15k-237 datasets, or share your hardware configuration so I can reproduce it and check the time myself? Thanks, this would help me a lot.

How to implement pretrained model from the output folder

I have run run_bert_classifier.py. That process outputs a folder named output_FB15K containing *.json, *.bin, and vocab.txt files. How can I use those files to check whether a given triple is correct or not?

For instance, given the triple 453 1347 37, how can I check whether it is correct? Could anyone give an example script to do that?
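
Not an official script, but one way to sanity-check a single triple is to reload the saved model with pytorch_pretrained_bert and run one forward pass. A minimal sketch; the triple's head, relation and tail must first be mapped to their textual descriptions (hypothetical placeholders below), and the label order is an assumption to verify against the processor in run_bert_triple_classifier.py:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

output_dir = "./output_FB15K/"
tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=False)  # match the casing used in training
model = BertForSequenceClassification.from_pretrained(output_dir, num_labels=2)
model.eval()

# Hypothetical textual descriptions of entities 453, 1347 (relation) and 37.
head_text, rel_text, tail_text = "head description", "relation description", "tail description"
tokens = (["[CLS]"] + tokenizer.tokenize(head_text) + ["[SEP]"]
          + tokenizer.tokenize(rel_text) + ["[SEP]"]
          + tokenizer.tokenize(tail_text) + ["[SEP]"])
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
with torch.no_grad():
    logits = model(input_ids)          # segment/attention masks omitted for brevity
pred = logits.argmax(dim=-1).item()    # assumed convention: 1 = plausible, 0 = implausible
print(pred)
```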

WARNING - pytorch_pretrained_bert.optimization - Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.

when I execute

python run_bert_triple_classifier.py \
--task_name kg \
--do_train \
--do_eval \
--do_predict \
--data_dir ./data/FB15K \
--bert_model bert-base-cased \
--max_seq_length 200 \
--train_batch_size 12 \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--output_dir ./output_FB15K/ \
--gradient_accumulation_steps 1 \
--eval_batch_size 12

Here the information of training:
13:52:37 - INFO - main - ***** Running training *****
13:52:37 - INFO - main - Num examples = 966284
13:52:37 - INFO - main - Batch size = 12
13:52:37 - INFO - main - Num steps = 241569
Epoch: 0%| | 0/3 [00:00<?, ?it/s]
Iteration: 17%|██████▎ | 13815/80524 [1:37:39<7:50:50, 2.
etc.

At the final epoch of training I get the following warning and error:

18:08:12 - WARNING - pytorch_pretrained_bert.optimization - Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.

18:08:12 - WARNING - pytorch_pretrained_bert.optimization - Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.

18:08:13 - INFO - pytorch_pretrained_bert.modeling - loading archive file ./output_FB15K/
18:08:13 - INFO - pytorch_pretrained_bert.modeling - Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 28996
}

18:08:15 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file ./output_FB15K/vocab.txt
Traceback (most recent call last):
File "run_bert_triple_classifier.py", line 858, in
main()
File "run_bert_triple_classifier.py", line 708, in main
eval_examples = processor.get_dev_examples(args.data_dir)
File "run_bert_triple_classifier.py", line 135, in get_dev_examples
self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev", data_dir)
File "run_bert_triple_classifier.py", line 207, in _create_examples
triple_label = line[3]
IndexError: list index out of range

Does anyone know the cause of this problem and its solution?
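
Not a definitive diagnosis, but the traceback points at triple_label = line[3], i.e. the classifier expects a fourth tab-separated label column in dev.tsv, which the labeled triple classification datasets (WN11/FB13) have but a plain head/relation/tail file would not. A quick check, assuming tab-separated rows:

```python
# Assumption: run_bert_triple_classifier.py expects "head\trelation\ttail\tlabel"
# rows in dev.tsv (that is what line[3] implies). Print the first row that
# does not have four columns.
with open("./data/FB15K/dev.tsv", encoding="utf-8") as f:
    for i, line in enumerate(f):
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 4:
            print("row %d has %d columns: %r" % (i, len(cols), cols))
            break
```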

epoch not working, pytorch issue?

Hello, I'm having this problem, can you help me?

after I run this command
python run_bert_triple_classifier.py --task_name kg --do_train --do_eval --do_predict --data_dir data/WN11 --bert_model bert-base-uncased --max_seq_length 20 --train_batch_size 32 --learning_rate 5e-5 --num_train_epochs 3.0 --output_dir output_WN11/ --gradient_accumulation_steps 1 --eval_batch_size 512

I get the following error where the epoch doesn't start

[screenshots of the error omitted]

Thank you in advance

Using this with roberta-base

Hi,

If I want to use this with the roberta-base pre-trained model, what are the parts that need modification?
I guess feature creation changes (RoBERTa uses its own special tokens instead of the [CLS]/[SEP] tags) and RobertaForSequenceClassification should be used. Please let me know if further modifications are required.
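
A rough sketch of the direction, assuming the newer transformers library rather than the repository's pytorch-pretrained-BERT (so further changes to the feature-building code would still be needed):

```python
# Hedged sketch: RoBERTa uses <s>/</s> instead of [CLS]/[SEP] and has no
# meaningful token_type_ids, so the head/relation/tail texts are simply joined here.
from transformers import RobertaForSequenceClassification, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

head, relation, tail = "Steve Jobs", "founded", "Apple Inc."
enc = tokenizer(head + " " + relation + " " + tail, return_tensors="pt")
logits = model(**enc).logits  # shape [1, 2]: plausible vs. implausible scores
```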

Why do I get a Unicode error when running locally?

After deploying locally and running the code with FB15k-237, I got the following error:
02/05/2024 14:55:04 - INFO - main - device: cpu n_gpu: 0, distributed training: False, 16-bits training: False
02/05/2024 14:55:05 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at C:\Users\KNLYE.pytorch_pretrained_bert\5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137
Traceback (most recent call last):
File "D:\PythonHome\Graduation Code\kg-bert-master\run_bert_link_prediction.py", line 1060, in
main()
File "D:\PythonHome\Graduation Code\kg-bert-master\run_bert_link_prediction.py", line 564, in main
File "D:\PythonHome\Graduation Code\kg-bert-master\run_bert_link_prediction.py", line 119, in get_train_examples
return self._create_examples(
File "D:\PythonHome\Graduation Code\kg-bert-master\run_bert_link_prediction.py", line 173, in _create_examples
ent_lines = f.readlines()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x90 in position 1518: illegal multibyte sequence
I tried re-saving entity2text in GBK or UTF-8 format, but the problem persists.
The Python version is 3.9.
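
A likely fix (assuming the entity2text file is UTF-8 encoded, which the 0x90 byte suggests): force the encoding wherever the file is opened in _create_examples, instead of relying on the Windows default codec (cp936/GBK):

```python
import os

data_dir = "./data/FB15k-237"  # adjust to your local path
# Passing encoding="utf-8" avoids the platform-dependent default codec.
with open(os.path.join(data_dir, "entity2text.txt"), "r", encoding="utf-8") as f:
    ent_lines = f.readlines()
```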

How to implement run_bert_relation_prediction.py

In your repository https://github.com/yao8839836/kg-bert, we can run run_bert_relation_prediction.py with this command:

python3 run_bert_relation_prediction.py \
--task_name kg \
--do_train \
--do_eval \
--do_predict \
--data_dir ./data/FB15K \
--bert_model bert-base-cased \
--max_seq_length 25 \
--train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 20.0 \
--output_dir ./output_FB15K/ \
--gradient_accumulation_steps 1 \
--eval_batch_size 512

After running that script, we have 5 files:
config.json
eval_results.txt
pytorch_model.bin
test_results.txt
vocab.txt

However, I have a question: how do we predict the correct relation given a head and a tail (/m/027rn, ?, /m/06cx9) once we have the 5 files above?

Here I have an example, the train.tsv file contains triples below:

/m/027rn /location/country/form_of_government /m/06cx9
/m/017dcd /tv/tv_program/regular_cast./tv/regular_tv_appearance/actor /m/06v8s0
...
etc.
...
eof

Given (/m/027rn, ?, /m/06cx9), I want the program to output "/location/country/form_of_government" as the correct relation.

To do that, what should I configure, either in the run_bert_relation_prediction.py script or in the data (train.tsv, dev.tsv, and test.tsv), without re-training from scratch as in the first run mentioned above?

Anyone could help?

Best regards,

moh-yani
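
One possible approach, as a rough sketch rather than an official script: reload the fine-tuned relation classifier from ./output_FB15K/ and score the (head, tail) pair directly. The label-index-to-relation mapping and the exact input packing below are assumptions; verify both against get_labels() and the feature code in run_bert_relation_prediction.py before trusting the output. The entity descriptions are illustrative placeholders.

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

# Assumption: label indices follow the order of relations in relation2text.txt.
with open("./data/FB15K/relation2text.txt", encoding="utf-8") as f:
    relations = [line.split("\t")[0] for line in f]

output_dir = "./output_FB15K/"
tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=False)
model = BertForSequenceClassification.from_pretrained(output_dir, num_labels=len(relations))
model.eval()

# Illustrative textual descriptions of /m/027rn and /m/06cx9.
head_text, tail_text = "Dominican Republic", "Republic"
tokens = (["[CLS]"] + tokenizer.tokenize(head_text) + ["[SEP]"]
          + tokenizer.tokenize(tail_text) + ["[SEP]"])
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
with torch.no_grad():
    logits = model(input_ids)
print(relations[logits.argmax(dim=-1).item()])
```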

estimated prediction time?

Hi,

I really like your work and want to reproduce it.
I'm curious how long it will take to predict the links/relations for the UMLS data.

Questions about using BERT for triples

Hello, I have two questions. First, I have not seen BERT used to process three sentences (or entities) before. Regarding the comment in your code, "for sequence triples":

# tokens: [CLS] Steve Jobs [SEP] founded [SEP] Apple Inc . [SEP]
# type_ids: 0 0 0 0 1 1 0 0 0 0

Is this a built-in capability of the BERT model? Second, one of BERT's pre-training objectives is to judge whether sentence two follows sentence one; a triple clearly does not fit this logic, so isn't fine-tuning BERT this way a bit of a stretch? Also, entity names are usually short, and masking such short entity words does not seem ideal. Looking forward to your answer, thank you!
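
For readers puzzled by the same comment: a minimal sketch (using the pytorch_pretrained_bert tokenizer API, not the repository's exact feature code) of how a (head, relation, tail) triple is packed into one BERT sequence with those alternating segment ids:

```python
# Illustrative only: pack a triple into one input sequence with segment ids
# 0 (head), 1 (relation), 0 (tail), mirroring the comment quoted above.
from pytorch_pretrained_bert import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)

head, relation, tail = "Steve Jobs", "founded", "Apple Inc."
tokens_a = tokenizer.tokenize(head)
tokens_b = tokenizer.tokenize(relation)
tokens_c = tokenizer.tokenize(tail)

tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"] + tokens_c + ["[SEP]"]
segment_ids = ([0] * (len(tokens_a) + 2)
               + [1] * (len(tokens_b) + 1)
               + [0] * (len(tokens_c) + 1))
input_ids = tokenizer.convert_tokens_to_ids(tokens)
```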

GPU usage

The code provides two multi-GPU parallelism options: DistributedDataParallel and DataParallel. Two questions:

  1. When training with these two options, is the difference in running time large?

  2. When I ran the link prediction task on a V100, it took nearly a week and still had no result. Do you have any other way to speed it up?

Thanks.

'NoneType' object has no attribute 'lower'

Hello, first of all, thank you for your contribution to the field; I have learned a lot from your paper. When repeating your triple classification experiment, I ran into the error 'NoneType' object has no attribute 'lower' at line 542 of run_bert_triple_classifier.py and cannot find a solution. I took the liberty of writing to you and look forward to your reply, which would help resolve my doubts. Thanks again.

Time complexity of test triple ranking

Hi there, you mentioned that

Each correct test triple (h, r, t) is corrupted by replacing either its head or tail entity with every entity $e \in E$.

which means, in my opinion, that the time complexity of ranking $N$ test triples is $N \times |E|$.
I wonder whether this is very time-consuming in practice?
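
As a rough back-of-the-envelope figure (assuming the commonly quoted WN18RR sizes of about 40,943 entities and 3,134 test triples; each test triple is corrupted on both the head and the tail side):

```latex
2 \, N \, |E| \;\approx\; 2 \times 3134 \times 40943 \;\approx\; 2.6 \times 10^{8} \text{ candidate triples to score}
```

With one BERT forward pass per candidate, evaluation rather than training dominates the cost, which is consistent with the multi-day evaluation times mentioned in other issues here.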

Using `.add_tokens`

If I understand correctly, you are using the description of an entity (or relation) and tokenizing that description. The entities and relations do not have their own tokens, right?

Did you try to learn an embedding for an entity/relationship or does that not really make any sense?
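
For context, the alternative the question hints at can be sketched with the newer transformers library (an assumption; KG-BERT itself relies on textual descriptions rather than dedicated entity/relation tokens):

```python
# Hedged sketch: give entities/relations their own tokens and learn embeddings
# for them. The token names below are hypothetical placeholders.
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

new_tokens = ["[E_/m/02zyy4]", "[R_/film/actor/film]"]  # hypothetical entity/relation tokens
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized and learned during fine-tuning
```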

clarification dataset

Hi, sorry for the trivial questions.
How can I find the labels for the FB13 and WN11 training sets? I can't tell whether the lines reliably alternate between one positive and one negative example. For WN11, what do the numbers at the end of the entities mean?
And for the triple classification task, by the dev dataset do you mean the validation dataset?
So train -> train ; dev -> val ; test -> test?
Thank you.

Question about the entity text in the FB15k-237 dataset

Hello, I would like to ask how the entity2text.txt file in the FB15k-237 dataset you provide was generated. The original dataset only gives encoded identifiers for entities (e.g. /m/02zyy4) rather than actual text, so how did you obtain each entity's actual textual form (e.g. /m/02zyy4 Michael Madsen)? I hope you can explain, thank you!

commands not working for prediction

Hi
Thanks for posting your code! I love your work and wanted to reproduce the results.

For 3. Link Prediction (UMLS)
I ran the exact same command as you posted but got the error below. Can you help me figure out how to fix it?

python3 run_bert_link_prediction.py \
--task_name kg \
--do_train \
--do_eval \
--do_predict \
--data_dir ./data/umls \
--bert_model bert-base-uncased \
--max_seq_length 15 \
--train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 5.0 \
--output_dir ./output_umls/ \
--gradient_accumulation_steps 1 \
--eval_batch_size 135

[screenshot of the error omitted]

Thank you in advance

Hello

Hello, can this code reach the Hits@ results reported in the paper?

Train run_bert_link_prediction.py is still ongoing for 7 days

I am trying to run run_bert_link_prediction.py on the FB15K data with an 8 GB GPU and batch size 12. After 7 days, it is still ongoing. The iteration process shows:
Testing: 100%
left: ???
Testing: 100%
right: ???
mean rank until now: ???
hit@10 until now: ???

Is this normal? How long does this script take in your experience?

non-ascii characters throw errors in FB13

Hi,

Thank you for your work and publishing the code. I'm trying to run triple_classification example for FB13 as in the readme file, but I'm getting the following error:

Traceback (most recent call last):
  File "run_bert_triple_classifier.py", line 847, in <module>
    main()
  File "run_bert_triple_classifier.py", line 556, in main
    train_examples = processor.get_train_examples(args.data_dir)
  File "run_bert_triple_classifier.py", line 120, in get_train_examples
    self._read_tsv(os.path.join(data_dir, "train.tsv")), "train", data_dir)
  File "run_bert_triple_classifier.py", line 173, in _create_examples
    ent_lines = f.readlines()
  File "/mnt/orange/ubrew/data/opt/python/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 27: ordinal not in range(128)

Is this behavior expected?
I can change the open command to pass an encoding="utf-8" argument, but then it becomes extremely slow. How did you deal with this issue?

hit@10 stays at 0 during the testing stage of the link prediction task

Hi, thank you very much for your work combining knowledge graphs and BERT for link prediction.

  1. Following the run commands you provide, in the FB15k-237 experiment the test stage keeps reporting hit@10 until now: 0.0 while the MR slowly decreases. What could cause this?

  2. Also, what causes this warning: WARNING - pytorch_pretrained_bert.optimization - Training beyond specified 't_total'. Learning rate multiplier set to 0.0. Please set 't_total' of WarmupLinearSchedule correctly.?

Pre-trained model

Hi,
Do you have a pre-trained KG-BERT model available that can be used for further fine-tuning? I am carrying out research on entity typing, and since training on the dataset takes a very long time, having a pre-trained model would be a really big help.

Pytorch_pretrained_model

when I execute

python run_bert_triple_classifier.py \
--task_name kg \
--do_train \
--do_eval \
--do_predict \
--data_dir ./data/WN11 \
--bert_model bert-base-uncased \
--max_seq_length 20 \
--train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--output_dir ./output_WN11/ \
--gradient_accumulation_steps 1 \
--eval_batch_size 512

it turned out that a pretrained PyTorch model file started downloading. I found the download too slow, so I copied
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz
and downloaded it separately.

How to use the model for free text?

Hello! Your work is amazing. I wonder how to create a pipeline to generate a knowledge graph from a full article. Should I exhaustively group words into candidate triples and then feed them into this model for classification, or is there an easier way?

The model figure in the paper

I have not read much of the source code, so this question is purely about the KG-BERT and BERT model figures. In BERT, the tokens of sentence A and sentence B all attend to each other, but in the KG-BERT figure the first token of the head entity does not interact with the tokens of the other parts, and the last token only interacts with [SEP], Token1, and Token1 of the relation. Is the figure simply not drawn completely, or is the model actually designed this way?

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1519: character maps to <undefined>

When I try to use Freebase for any of the three tasks (triple classification, relation prediction, and link prediction), it gives this Unicode decode error. What could be a possible solution? I tried with Python 3.6 and 3.7.

File "C:\Users\waqas\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1519: character maps to <undefined>

GPU usage

Hello, I downloaded your code and tried to reproduce the results, but during training, validation, and testing only GPU memory is occupied while the GPU is not actually used for computation. Do you know what might be going on? Do I need to add anything to the code?

Positive Labels for Negative Example in Test Data (Link Prediction Task)

In run_bert_link_prediction.py, processor._create_examples(tail_corrupt_list, "test", args.data_dir) produces labeled input examples for KG-BERT. But I noticed that in your code shown below, all the examples in the test set, including the negative ones, get the positive label "1". Could you please explain this? Looking forward to your response. Thank you.

def _create_examples(self, lines, set_type, data_dir):
        .....
        .....
        if set_type == "dev" or set_type == "test":

            label = "1"

            guid = "%s-%s" % (set_type, i)
            text_a = head_ent_text
            text_b = relation_text
            text_c = tail_ent_text 
            self.labels.add(label)
         .....

Dataset construction for the triple classification task

As the title says, I noticed that the positive-to-negative ratio in your WN11 test set is 1:1, while in the training set the ratio is random and negatives are built by sometimes replacing the head and sometimes the relation. Is this design your own, or does it follow a particular paper? I would be very grateful for your reply.

Missing Triple Classifier

@yao8839836

Apologies for re-raising this point.

After I tried your advice below for another example in test.tsv:
------------------------------- begin of your advice-------------------------------
The script should be similar to line 777--828 in run_bert_triple_classifier.py.

You need to assign your test triple 453 1347 37 to eval_examples in line 777 (which will have only one example),

then preds in line 828 will be the label which indicates if the triple is correct or not.
------------------------------- end of your advice-------------------------------

I do not fully understand how to put the triple into the "test.tsv" file.
If I want to check whether the triple below is correct or not:
/m/01qscs /award/award_nominee/award_nominations./award/award_nomination/award /m/02x8n1n

How should I put this triple into the "test.tsv" file?

I really hope you can help explain this to me.

Big thanks for your attention.

moh-yani
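
A hedged guess at the expected layout, based only on the train.tsv lines quoted in the relation prediction issue above: tab-separated head, relation, tail, possibly followed by a label column (which the triple classifier's line[3] access in the IndexError issue suggests). Verify against the dataset's own dev.tsv/test.tsv before editing anything:

```python
# Purely illustrative: build one tab-separated test.tsv row for the triple above.
row = "\t".join([
    "/m/01qscs",
    "/award/award_nominee/award_nominations./award/award_nomination/award",
    "/m/02x8n1n",
    "1",  # assumed label convention (1 = plausible); confirm against the dataset
])
print(row)
```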

List index out of range error

Hello,

I ran the following command to do prediction but encountered this error:

python3 run_bert_triple_classifier.py --task_name kg --do_train --do_eval --do_predict --data_dir data/umls --bert_model bert-base-uncased --max_seq_length 20 --train_batch_size 32 --learning_rate 5e-5 --num_train_epochs 3.0 --output_dir output/umls --gradient_accumulation_steps 1 --eval_batch_size 512

[screenshot of the error omitted]

Can you please help? Thank you in advance.

Cannot reproduce WN18RR link prediction results

Hi, thanks for publishing this great contribution!

I ran the link prediction program on WN18RR with exactly the same parameters as recommended except eval batch size 2500 instead of 5000 to fit in my 1080 ti (took 5 days but finished ok).

python3 run_bert_link_prediction.py \
--task_name kg \
--do_train \
--do_eval \
--do_predict \
--data_dir ./data/WN18RR \
--bert_model bert-base-cased \
--max_seq_length 50 \
--train_batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 5.0 \
--output_dir ./output_WN18RR/ \
--gradient_accumulation_steps 1 \
--eval_batch_size 2500

My results:
  MR: 127.38, Hits@10: 35.97

Results in your paper:
  MR: 97, Hits@10: 52.4

Do you have any thoughts on what might explain the difference? Did you observe such variation when running this program with different random seeds?

Thanks for any ideas!
Tim

Can this be applied to a Chinese corpus?

Hello,
I am currently running an experiment that, given two entities, predicts the relation between them. Could your code also be applied to a Chinese corpus? Also, is there Chinese training data similar to FB15K?

Also, which conference or journal was this paper submitted to?
Thanks

Reproducing the results

Hello, I recently tried to reproduce your link prediction results on the UMLS dataset on a server and found that, with the same configuration, there is some gap from the results shown in the paper. Could you share the exact hardware details or possible reasons? Many thanks!
