
bert-as-language-model's People

Contributors

0xflotus, abhishekraok, aijunbai, ammarasmro, bitmindlab, bogdandidenko, cbockman, craigcitro, eric-haibin-lin, georgefeng, jacobdevlin-google, jasonjpu, pengli09, qwfy, rodgzilla, stefan-it, xu-song, zhaoyongke


bert-as-language-model's Issues

Question about serving

Have you tried exporting the checkpoint to a pb format that can be used for serving?

paper to cite

Hello,
Is there a paper for the model that could be cited?

How token probs and sentence perplexity are computed

Hi, thanks for providing code that computes perplexity with BERT.
I currently have two small questions:
1. When computing the prob of a token, do you first mask that token, feed in the masked sentence, and apply softmax to the hidden vector at the masked position to obtain that token's prob, then repeat this over every token in the sentence to get all token probs?
2. Do you compute probs for [CLS] and [SEP]? When computing perplexity, do you include the probs of [CLS] and [SEP], or only the probs of the tokens in the original sentence?
Thanks!
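For reference, the pseudo-perplexity implied by question 1 can be sketched as follows. This is a minimal sketch, assuming the per-token probabilities have already been obtained by masking each token in turn and reading the softmax at the masked position, and assuming [CLS]/[SEP] are excluded from the average:

```python
import math

def pseudo_perplexity(token_probs):
    """Pseudo-perplexity from per-token masked-LM probabilities:
    exp of the negative mean log-probability over the original
    tokens ([CLS]/[SEP] assumed excluded)."""
    avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-avg_log_prob)

# Toy example: two tokens with probabilities 0.5 and 0.25.
ppl = pseudo_perplexity([0.5, 0.25])  # sqrt(1 / (0.5 * 0.25)) = sqrt(8)
```

Note this is a pseudo-likelihood score, not a true probability, since each token is conditioned on all the others.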

Does scoring sentences with BERT as a language model require finetuning?

If BERT is used as a language model to rescore sentences, that is essentially a token classification task, right? Do we need to add a linear layer (e.g. linear[hidden_size, vocab_size]) on top of BertModel and then finetune it?
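For context, the BERT pretraining checkpoint already ships with a masked-LM output head whose projection is tied to the input embedding matrix, so scoring can reuse it without finetuning a new linear layer. A minimal sketch of that tied projection, with hypothetical toy dimensions (vocab of 4, hidden size 3):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical toy weights: 4 vocab entries, hidden size 3.
embedding_matrix = [
    [0.2, 0.1, 0.0],  # token 0
    [0.0, 0.3, 0.1],  # token 1
    [0.5, 0.0, 0.2],  # token 2
    [0.1, 0.1, 0.1],  # token 3
]
output_bias = [0.0, 0.0, 0.0, 0.0]

def mlm_logits(hidden):
    # The pretrained MLM head projects the hidden state back onto the
    # (tied) input embedding matrix -- no new finetuned layer is added.
    return [sum(h * e for h, e in zip(hidden, row)) + b
            for row, b in zip(embedding_matrix, output_bias)]

hidden_at_masked_position = [0.4, 0.2, 0.1]
probs = softmax(mlm_logits(hidden_at_masked_position))
```

The real head also applies a dense transform and layer norm before this projection; the sketch only shows the tied-weights idea.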

Abnormal inference results

Hello, I downloaded the pretrained BERT model from the link in the BERT repo, but my perplexity results for the three sentences below differ considerably from yours.
Moreover, the third sentence gets the lowest perplexity, which seems wrong.
Do you know what might cause this? Thanks a lot!
I am using TensorFlow 2.6.2.
[
  {
    "tokens": [
      {"token": "there", "prob": 0.002376210642978549},
      {"token": "is", "prob": 0.00032396349706687033},
      {"token": "a", "prob": 0.00016864163626451045},
      {"token": "book", "prob": 8.497028466081247e-05},
      {"token": "on", "prob": 0.000501244910992682},
      {"token": "the", "prob": 0.00038025222602300346},
      {"token": "desk", "prob": 6.700590802211082e-06}
    ],
    "ppl": 4931.9851396876575
  },
  {
    "tokens": [
      {"token": "there", "prob": 0.002963493810966611},
      {"token": "is", "prob": 0.0003500459424685687},
      {"token": "a", "prob": 0.00018642270879354328},
      {"token": "plane", "prob": 1.383832932333462e-05},
      {"token": "on", "prob": 0.0005545589956454933},
      {"token": "the", "prob": 0.00038116113864816725},
      {"token": "desk", "prob": 7.67214714869624e-06}
    ],
    "ppl": 5835.439745980134
  },
  {
    "tokens": [
      {"token": "there", "prob": 0.002954021329060197},
      {"token": "is", "prob": 0.00039738742634654045},
      {"token": "a", "prob": 0.00024926112382672727},
      {"token": "book", "prob": 0.00010113466123584658},
      {"token": "in", "prob": 0.00033981725573539734},
      {"token": "the", "prob": 0.00039128249045461416},
      {"token": "desk", "prob": 6.479389867308782e-06}
    ],
    "ppl": 4531.281922045702
  }
]

"Could not find trained model" on every run

INFO:tensorflow:Could not find trained model in model_dir: ./bert_output, running initialization to predict.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running infer on CPU
INFO:tensorflow:Initialize variable bert/embeddings/LayerNorm/beta:0 from checkpoint

I passed output_dir as the export path for the model, but the exported directory is empty. After wrapping prediction in a function, every call to estimator.predict reloads Google's Chinese model from scratch and hits "Could not find trained model in model_dir: ./bert_output, running initialization to predict." again.
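The reload happens because each estimator.predict call rebuilds the graph and re-runs initialization. A common workaround in the TF 1.x Estimator API is to build a long-lived predictor once (e.g. with tf.contrib.predictor.from_estimator) and reuse it across calls. The create-once/reuse pattern itself can be sketched generically; the predictor below is a stand-in, not the real BERT model:

```python
from functools import lru_cache

LOAD_COUNT = 0  # only to demonstrate that the model is built once

@lru_cache(maxsize=1)
def get_predictor(model_dir):
    """Build the (hypothetical) predictor exactly once and cache it.
    In TensorFlow 1.x this is where you would call
    tf.contrib.predictor.from_estimator(estimator, serving_input_fn)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return lambda sentence: {"ppl": float(len(sentence))}  # stand-in

def score(sentence, model_dir="./bert_output"):
    return get_predictor(model_dir)(sentence)

score("hello")
score("world")
# Both calls reuse the cached predictor, so LOAD_COUNT stays at 1.
```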

Words outside of vocab

I see tokenizer convert tokens to ids using the vocabulary file. What if the input sentence contains words not in the vocabulary file? Do we need to use our own vocabulary file?
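WordPiece tokenization handles out-of-vocabulary words by splitting them into subword units that are in the vocabulary, so a custom vocabulary file is usually unnecessary. A rough sketch of the greedy longest-match-first algorithm, with a hypothetical toy vocabulary:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, as in BERT's tokenizer."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk]  # no piece matched: the whole word becomes [UNK]
        pieces.append(cur)
        start = end
    return pieces

# Hypothetical toy vocabulary.
vocab = {"play", "##ing", "##ed", "un", "##play"}
tokens = wordpiece_tokenize("playing", vocab)    # ['play', '##ing']
tokens2 = wordpiece_tokenize("unplayed", vocab)  # ['un', '##play', '##ed']
```

Words that cannot be decomposed into known pieces map to [UNK], but with BERT's full WordPiece vocabulary this is rare for ordinary text.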

Probability of the last word is always too small

Hi, after looking at the results of predicting several Chinese phrases, I found that the probability of the last word is always much smaller than that of the other words in the same phrase. This also happens in all the examples shown in your readme.md. As a result, the perplexity of these phrases also ends up very high.

What do you think about this phenomenon? Thanks for your attention.

Failed to export bert-as-language-model as an online service

Hi, I'm trying to export the unfine-tuned BERT model as an online service.
I followed the official SavedModel instructions and successfully exported a fine-tuned model. But when I try to export the original BERT model, it fails. The error messages are as follows:

Traceback (most recent call last):
File "export_lm_predictor.py", line 136, in
'./exported_model', serving_input_receiver_fn(max_seq_len, 20))
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 734, in export_saved_model
strip_default_attrs=True)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 663, in export_savedmodel
mode=model_fn_lib.ModeKeys.PREDICT)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 789, in _export_saved_model_for_mode
strip_default_attrs=strip_default_attrs)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 878, in _export_all_saved_models
raise ValueError("Couldn't find trained model at %s." % self._model_dir)
ValueError: Couldn't find trained model at ../bert_models/chinese_L-12_H-768_A-12.

My guess is that there are no graph.pbtxt or checkpoint files in the original model dir. Does anyone have any ideas? Thanks!

[edit]
I specified the checkpoint_path parameter in the export_saved_model function. By the way, I
create the estimator using tf.estimator.Estimator. Then I got a new error:
ValueError: Couldn't find 'checkpoint' file or checkpoints in given directory ../bert_models/chinese_L-12_H-768_A-12
So must the original BERT model directory contain a 'checkpoint' file?

TODO:

  • Spell checking, word scoring
  • decode: word guessing (cloze-style prediction)

Use an AR model for probabilities

Following the bert-as-language-model approach, p(a,b) = p(a|b)*p(b|a), which is clearly wrong: what this yields is not a probability at all.
For probabilities, use an AR model instead: p(a,b) = p(a)*p(b|a)
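A toy numerical check of this claim, using a made-up joint distribution over two binary tokens: the pseudo-likelihood p(a|b)*p(b|a) does not recover the joint p(a,b), while the autoregressive chain-rule factorization p(a)*p(b|a) does:

```python
# Toy joint distribution over two binary tokens a, b (purely illustrative).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

p_a0 = joint[(0, 0)] + joint[(0, 1)]  # p(a=0) = 0.5
p_b0 = joint[(0, 0)] + joint[(1, 0)]  # p(b=0) = 0.6
p_b0_given_a0 = joint[(0, 0)] / p_a0  # p(b=0|a=0) = 0.8
p_a0_given_b0 = joint[(0, 0)] / p_b0  # p(a=0|b=0) = 2/3

chain = p_a0 * p_b0_given_a0            # = 0.4, matches joint[(0, 0)]
pseudo = p_a0_given_b0 * p_b0_given_a0  # = 8/15 ~ 0.533, does not match
```

The pseudo-likelihood is still a useful fluency score for rescoring; it just is not a normalized probability.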

Question about online service

This model has not been fine-tuned; how can bert-as-language-model be deployed as an online service?
