
ckiplab / ckip-transformers

Stars: 635 · Watchers: 13 · Forks: 67 · Size: 238 KB

CKIP Transformers

Home Page: https://ckip-transformers.readthedocs.io

License: GNU General Public License v3.0

Languages: Python 96.34%, Makefile 3.66%
Topics: ckip, transformers, language-model, word-segmentation, part-of-speech-tagging, named-entity-recognition

ckip-transformers's People

Contributors

emfomy, pan93412, qqaatw



ckip-transformers's Issues

max_length in README is not correct?

Hi, thanks for your great work.
I found a small error in your example. When executing the code
ws = ws_driver(text, batch_size=256, max_length=512)
it shows this error message:
"AssertionError: Sequence length is longer than the maximum sequence length for this model (512 > 510)."
Setting max_length to 510 or lower fixes this (see the corrected call below).
Other than that, everything works fine. It's an excellent and convenient tool for extracting information from data.
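For reference, a corrected, self-contained call; the driver setup follows the README example (model="bert-base" is the README's choice, an assumption here), and the 512-position model limit must leave room for the [CLS] and [SEP] special tokens, hence at most 510 for the text:

from ckip_transformers.nlp import CkipWordSegmenter

ws_driver = CkipWordSegmenter(model="bert-base")  # driver setup as in the README
text = ["傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。"]
# 512 positions minus the two special tokens ([CLS], [SEP]) leaves 510.
ws = ws_driver(text, batch_size=256, max_length=510)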

How to use a fine-tuned model

Hello,

I fine-tuned a model on my own dataset using the example script run_ner.py that you mention in the documentation:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification

This produced two files, config.json and tf_model.h5.

But when I try to use my fine-tuned model, this line
ws_driver = CkipNerChunker(model="tmp/tf_model.h5")
raises the following error:

Traceback (most recent call last):
File "/root/miniconda3/envs/chatbot/lib/python3.6/site-packages/ckip_transformers/nlp/util.py", line 89, in _get_model_name
model_name = self._model_names[model]
KeyError: './tmp/tf_model.h5'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "Transformers_pretrained.py", line 12, in
ws_driver = CkipWordSegmenter(model="./tmp/tf_model.h5")
File "/root/miniconda3/envs/chatbot/lib/python3.6/site-packages/ckip_transformers/nlp/driver.py", line 52, in init
model_name = kwargs.pop("model_name", self._get_model_name(model))
File "/root/miniconda3/envs/chatbot/lib/python3.6/site-packages/ckip_transformers/nlp/util.py", line 91, in _get_model_name
raise KeyError(f"Invalid model {model}") from exc
KeyError: 'Invalid model ./tmp/tf_model.h5'

How can I correctly use my own fine-tuned model with CkipWordSegmenter, CkipPosTagger, and CkipNerChunker?
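Judging from the traceback above (driver.py pops a model_name keyword before falling back to the built-in name lookup), passing the checkpoint directory via model_name should bypass the KeyError. A minimal sketch, assuming the fine-tuned model is saved in PyTorch format (config.json plus pytorch_model.bin) in ./tmp; a TF-only tf_model.h5 would additionally need converting:

from ckip_transformers.nlp import CkipNerChunker

# Point model_name at the directory, not at the weight file itself;
# this bypasses the built-in model-name lookup that raised the KeyError.
ner_driver = CkipNerChunker(model_name="./tmp")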

Set device = -1 but still using GPU

Hi @emfomy, thank you for your attention 🙏

ckip_transformers version

0.2.7

What happened

Set device = -1, but the model still uses GPU.

script:

from ckip_transformers.nlp import CkipNerChunker
ner_driver = CkipNerChunker(level=3, device=-1)
res = ner_driver(text_list)

Before running the script: (GPU usage screenshot)

After running the script: (GPU usage screenshot showing memory consumed by the process)

What do you think should happen instead

It should not consume GPU resources.

How to reproduce

Run the script in a GPU-enabled environment:

from ckip_transformers.nlp import CkipNerChunker
ner_driver = CkipNerChunker(level=3, device=-1)
res = ner_driver(text_list)

Operating System

Ubuntu 20.04.2 LTS

Development Environment

  • Python 3.8.12
  • PyTorch 1.9.0+cu111
  • Transformers 4.7.0
  • Tensorflow 2.11.0

Anything else

I've checked the source code: self.device is set to "cpu", and both the model and the data tensors go through to(self.device), so it's strange to see this problem.
If the environment has no GPU, the script still runs fine.
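One hedged guess, given that TensorFlow 2.11.0 is installed in this environment: TensorFlow reserves GPU memory when it initializes, and transformers can trigger that on import, independently of where the PyTorch model lives. A sketch that rules this out by hiding the GPU entirely (assuming the consumption really comes from another library grabbing the device, not from ckip-transformers itself):

import os

# Hide the GPU from all CUDA-aware libraries before any of them load.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

from ckip_transformers.nlp import CkipNerChunker

ner_driver = CkipNerChunker(level=3, device=-1)
res = ner_driver(["傅達仁今將執行安樂死。"])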

Speed up tokenization.

HuggingFace's fast tokenizers can also return the original character indices.
We could rewrite the tokenization step to use this feature instead of tokenizing character by character; see the sketch below.
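A minimal sketch of the feature in question: fast tokenizers accept return_offsets_mapping, which maps every produced token back to its character span in the original string:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
encoded = tokenizer("傅達仁今將執行安樂死。", return_offsets_mapping=True)
# offset_mapping holds (start, end) character indices for each token;
# special tokens like [CLS] and [SEP] get the empty span (0, 0).
for token_id, (start, end) in zip(encoded["input_ids"], encoded["offset_mapping"]):
    print(token_id, start, end)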

How to cite?

If I use your bert-base-chinese model, which reference should I cite?

Albert-tiny English support for NLU tasks

Is there a way to get an equivalent albert-tiny English language model to perform downstream tasks like intent and entity classification? I'm afraid there is no albert-tiny English model available, so any lead in this regard, or a guide to creating one from scratch, would be highly appreciated.
Thanks

Some traditional Chinese characters mapped to UNK

Thanks for the great library! Not sure if this is the correct place to ask, but I think I was using your tokenizer through huggingface transformers. I found that some traditional Chinese characters are mapped to UNK; see the screenshot below.

The code I used was

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
input_ids = tokenizer.encode("重刋道藏輯要高上玉皇本行集經天樞上相(臣)張良校正三淸勅門下湛寂常道信擬議之至難恢漠神通豈形容之可盡", return_tensors='pt')
print ('encoded ids: ', input_ids)
print ('map encoded ids back to words: ', tokenizer.decode(input_ids[0]))

(screenshot: the decoded output contains [UNK] tokens in place of several characters)

Thanks in advance!
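For reference, a hedged workaround: characters absent from the bert-base-chinese vocabulary (for illustration, assume 刋 and 淸 from the string above are among them) can be added to the tokenizer, provided the embedding matrix of any model using it is resized accordingly:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

# Register the characters that previously mapped to [UNK]; the new
# embedding rows are randomly initialized, so fine-tuning is needed
# before they carry useful information.
tokenizer.add_tokens(["刋", "淸"])
model.resize_token_embeddings(len(tokenizer))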

Several questions about model fine-tuning

Hello,

  1. How were the various NLP task models trained? Were they fine-tuned from the language models?
  2. Can I fine-tune the ckiplab/albert-base-chinese-ws model on my own dataset (hoping to train a new model for segmenting new data)? If so, does the dataset need to be labeled (tokens) in advance, or is raw data enough?
  3. Can a model trained this way be used through the NLP tool? The package currently seems to select models by number, so a custom fine-tuned model cannot be used.

Simplified Chinese

Does this model work for Simplified Chinese? Are there any experimental results for Simplified Chinese?

Fine-tune model

Hello,
I would like to fine-tune the following ws, pos, and ner models:
ckiplab/bert-base-chinese-ws
ckiplab/bert-base-chinese-pos
ckiplab/bert-base-chinese-ner

Following the example, I run run_ner.py from HuggingFace and replace model_name_or_path with each of the three models above for training.
When fine-tuning these three models, can my training data only be labeled with B and I? Can't I also annotate types, e.g. "B-PRODUCT" and "I-PRODUCT"? And is the O label not allowed either? I ask because an earlier issue said B and I are used.

Thanks

Obtaining the output embeddings

Hello,
when using CkipWordSegmenter, CkipPosTagger, and CkipNerChunker,
is it possible to obtain the embedding of each output token from the results?

For example, for the 45-character sentence from the examples,
傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。
can a 45x768 matrix be obtained from somewhere in the final output? Thanks
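The drivers do not expose hidden states, but the underlying checkpoints can be loaded directly with transformers; a minimal sketch, assuming the encoder of the word-segmentation model is an acceptable source of per-character vectors:

import torch
from transformers import AutoModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("ckiplab/bert-base-chinese-ws")

text = "傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (1, sequence_length, 768); dropping the
# [CLS]/[SEP] positions leaves roughly one 768-dim vector per character
# (digits such as "20" may merge into a single token).
embeddings = outputs.last_hidden_state[0, 1:-1]
print(embeddings.shape)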

Loading model error

I tried to use ckip-transformers to perform a Chinese NER task in PyTorch, but when I loaded the level-3 model, the following error occurred:
Traceback (most recent call last):
File "ner.py", line 3, in
ner_driver = CkipNerChunker(level=3, device=0)
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/ckip_transformers/nlp/driver.py", line 224, in init
super().init(model_name=model_name, **kwargs)
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/ckip_transformers/nlp/util.py", line 64, in init
self.model = AutoModelForTokenClassification.from_pretrained(model_name)
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/transformers/models/auto/auto_factory.py", line 360, in from_pretrained
pretrained_model_name_or_path, *model_args, config=config, **kwargs
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1066, in from_pretrained
f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for 'ckiplab/bert-base-chinese-ner' at '/home/nieyang/.cache/huggingface/transformers/46785b95696d8e6a5004a6a73fcee887d60745a5872af82ca7599b9470554ce3.bdaa5056a5c748eca59fe2c7eef8fa2d034f5092fc84ce6b008c27ddf6f0025c'
If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

So I added the flag from_tf=True to self.model = AutoModelForTokenClassification.from_pretrained(model_name) in ckip_transformers/nlp/util.py, but then it reported that the model name was wrong.

So can you help me with this?
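For what it's worth, this OSError frequently points to a corrupted or truncated file in the local HuggingFace cache rather than to a genuine TF checkpoint (the ckiplab/bert-base-chinese-ner repository ships PyTorch weights), so a hedged first step is to force a fresh download:

from transformers import AutoModelForTokenClassification

# Re-download the weights, bypassing the possibly corrupted local cache.
model = AutoModelForTokenClassification.from_pretrained(
    "ckiplab/bert-base-chinese-ner", force_download=True
)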

Chinese text classification model usage example

Hello,

Could you please share an example of how to use your model to split Chinese text into separate words?

At the moment, this code:

from transformers import (
   BertTokenizerFast,
   AutoModelForMaskedLM,
   AutoModelForCausalLM,
   AutoModelForTokenClassification,
)

# causal language model (GPT-2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese') # or other models above

sample_input = "之后你看看了我的出版请告诉我你认为什么"  # the input text used below
encoded_input = tokenizer.encode(sample_input, return_tensors="pt")
# batch = []
# batch.append(encoded_input)
predictions = model.generate(encoded_input)

tokenizer.batch_decode(predictions)

gives ['[CLS] 之 后 你 看 看 了 我 的 出 版 请 告 诉 我 你 认 为 什 么 [SEP] 我'] for the input 之后你看看了我的出版请告诉我你认为什么.

At the same time, your example in example.py in the repo gives the correct output for my input:

之后你看看了我的出版请告诉我你认为什么
之后(Nd) 你(Nh) 看看(VE) 了(Di) 我(Nh) 的(DE) 出版(Nv) 请(VF) 告诉(VE) 我(Nh) 你(Nh) 认为(VE) 什么(Nep)

(in case you are interested in the context of this issue, here is a Google doc with my R&D information on this task)
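For reference, word segmentation is what the ckip_transformers.nlp drivers are for; the GPT-2 model with generate() continues text rather than segmenting it. A minimal sketch, assuming the driver API shown in this repo's README:

from ckip_transformers.nlp import CkipWordSegmenter

ws_driver = CkipWordSegmenter(model="bert-base")
ws = ws_driver(["之后你看看了我的出版请告诉我你认为什么"])
# ws[0] is a list of words, e.g. ['之后', '你', '看看', ...]
print(ws[0])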

Pinning memory issue

Hi,

I'm currently using ckip-transformers-ws as a preprocessing tool in my project, and I noticed that the DataLoader's pin_memory flag is hard-coded to True in util.py.

Since pinning memory is incompatible with multiprocessing (multiple workers) [1], when users call ckip-transformers inside the collate_fn of a DataLoader with multiple workers, a CUDA error occurs as shown in [1], even when only the CPU is used for inference.

Therefore, I think it would be better to:

  1. Pin memory only when the device is a GPU (see the sketch below).
  2. Add an option to decide whether or not to enable memory pinning.

Regards.

[1] https://discuss.pytorch.org/t/pin-memory-vs-sending-direct-to-gpu-from-dataset/33891/2
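A minimal sketch of the first suggestion, assuming util.py builds a standard torch DataLoader:

import torch
from torch.utils.data import DataLoader

def make_loader(dataset, device: torch.device, batch_size: int) -> DataLoader:
    # Pin host memory only for CUDA devices; pinning buys nothing on CPU
    # and clashes with multiprocessing workers as described above.
    return DataLoader(dataset, batch_size=batch_size, pin_memory=(device.type == "cuda"))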

Import nlp tools package error

Hello,

When I use:
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker
It reports ValueError: source code string cannot contain null bytes

How to fix it?

Thanks a lot!

Question about the BERT-base-chinese pretraining method

I would like to ask:
does your BERT-base-chinese pretraining follow the original BERT procedure exactly,
with only the dataset replaced by Traditional Chinese and the tokenizer changed?

Thanks

Is it possible to provide a demo code for bert-base-chinese-qa?

Hi, I am new to this field. Is it possible to provide demo code for bert-base-chinese-qa?
I tried the following code, following the book "Getting Started with Google BERT":

import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained("ckiplab/bert-base-chinese")
model = BertForQuestionAnswering.from_pretrained("ckiplab/bert-base-chinese-qa")

paragraph = "李同 也 沒有 在意 , 大廈 中 , 几乎 每 天 都 有 人 搬進 搬出 , 原 不足為奇 。 \
             可是 , 當 李同 走進 大廈 時 , 卻 看見 了 那 個 老者 , 那 老者 是 倒退 著 身子 走出來 的 , \
             在 那 老者 的 面前 , 兩 個 搬運 工人 , 正 抬 著 一 只 箱子 。 那 是 一 只 木 箱子 , \
             很 殘舊 了 , 箱子 并 不 大 , 但是 兩 個 搬運 工人 抬 著 , 看來 十分 吃力 。[SEP]".strip(" ")

question = "[CLS]老者怎麼走出來的?[SEP]"

question_tokens = tokenizer.tokenize(question)
paragraph_tokens = tokenizer.tokenize(paragraph)

tokens = question_tokens + paragraph_tokens
input_ids = tokenizer.convert_tokens_to_ids(tokens)

segment_ids = [0] * len(question_tokens)
segment_ids += [1] * len(paragraph_tokens)

input_ids = torch.tensor([input_ids])
segment_ids = torch.tensor([segment_ids])

# Getting the answer

res = model(input_ids, token_type_ids=segment_ids)

start_scores, end_scores = res['start_logits'], res['end_logits']

start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)

print(" ".join(tokens[start_index:end_index+1]))

But I only got [CLS]. Could you provide sample code showing how this Chinese QA model can work properly?
Thank you!
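For comparison, a hedged sketch that lets the tokenizer build the [CLS]/[SEP] markers and segment IDs itself via pair encoding (hand-written special-token strings are easy to get subtly wrong); the shortened paragraph is taken from the example above:

import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained("ckiplab/bert-base-chinese")
model = BertForQuestionAnswering.from_pretrained("ckiplab/bert-base-chinese-qa")

question = "老者怎麼走出來的?"
paragraph = "那老者是倒退著身子走出來的,在那老者的面前,兩個搬運工人,正抬著一只箱子。"

# Pair encoding adds [CLS]/[SEP] and token_type_ids automatically.
inputs = tokenizer(question, paragraph, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits)
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))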

About dependency parsing

Hello!
May I ask whether a dependency parsing tool might be developed in the future?

Thanks for answering.

Unable to load weights from pytorch checkpoint.

When running this:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('ckiplab/albert-tiny-chinese-pos')

I get this error:
f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for 'ckiplab/albert-tiny-chinese-pos' at

My environment:
transformers==4.2.2
ckip-transformers==0.2.1
torch==1.4.0

Originally posted by @WachaIPSOS in #3 (comment)
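A hedged observation on the environment above: torch 1.4.0 predates the zip-based checkpoint format introduced in torch 1.6, and checkpoints saved in the newer format fail on older torch with exactly this OSError, so upgrading torch is a reasonable first thing to try:

import torch

# Checkpoints saved by torch >= 1.6 use a zip-based serialization that
# torch 1.4.0 cannot read; if this prints a version below 1.6, upgrade.
print(torch.__version__)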

how to compare ckiplab/bert-base-chinese with bert-base-chinese?

Thanks so much for this excellent model and having it accessible in huggingface.

I would like to know why ckiplab/bert-base-chinese seems a bit strange to me compared to the usual bert-base-chinese, which I think is mainly trained on Simplified Chinese. For instance, when I masked a word of the phrase 颱風預測。, the usual bert-base-chinese managed to give the masked word back with high probability (0.992); in contrast, ckiplab/bert-base-chinese did not return the masked word in its top 5, and its highest-probability word scored only around 0.3, which I am wondering about.

Is it expected that we have to fine-tune this MLM first? Or perhaps I interpreted it wrongly (I'm very new to this field). Would you mind sharing your thoughts? Thanks very much in advance.
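For anyone wanting to reproduce the comparison, a minimal sketch using the fill-mask pipeline (which character to mask is a choice; here the final one, as an assumption):

from transformers import pipeline

for name in ["bert-base-chinese", "ckiplab/bert-base-chinese"]:
    fill = pipeline("fill-mask", model=name)
    # Mask the last character of the phrase and inspect the top-5 guesses.
    for result in fill("颱風預[MASK]。"):
        print(name, result["token_str"], round(result["score"], 3))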

Implement custom Chinese tokenizer.

We may implement our own tokenizer rather than using BertTokenizerFast.
Our own tokenizer should have the following features:

  • Disable WordPiece. Convert text to token IDs character by character (e.g. tokenizer.convert_tokens_to_ids(list(input_text))); see the sketch below.
  • Reimplement the clean_up_tokenization method. The default implementation targets English only. Ours may remove whitespace and convert half-width punctuation to full-width.
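A minimal sketch of the first point, using the stock bert-base-chinese vocabulary (characters missing from it still map to [UNK]):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

text = "傅達仁今將執行安樂死。"
# One token ID per character: no WordPiece merging, no ## continuation pieces.
ids = tokenizer.convert_tokens_to_ids(list(text))
print(ids)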

NER: how to output the label for each recognized entity

from transformers import (
BertTokenizerFast,
AutoModel,
)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('ckiplab/bert-base-chinese-ner')

...
How should I write the code that follows, so that the output looks like this:
(screenshot of the desired output: each recognized entity paired with its NER label)
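A hedged sketch: loading the checkpoint with AutoModelForTokenClassification instead of AutoModel keeps the classification head, and model.config.id2label maps each argmaxed logit to its tag string:

import torch
from transformers import AutoModelForTokenClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("ckiplab/bert-base-chinese-ner")

text = "傅達仁今將執行安樂死。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map each token's highest-scoring class index to its label string.
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, model.config.id2label[pred.item()])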

function pack_ws_pos_sentece() was not defined

In README.rst, step
4. Show results
contains this line:
print(pack_ws_pos_sentece(sentence_ws, sentence_pos))
It raises an error, since the function pack_ws_pos_sentece() is not defined in that block of code.
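A minimal sketch of such a helper, reconstructed from the word(POS) output format shown in other examples on this page; the actual README definition may differ:

def pack_ws_pos_sentece(sentence_ws, sentence_pos):
    # Pair each word with its POS tag, e.g. 之后(Nd) 你(Nh) ...
    assert len(sentence_ws) == len(sentence_pos)
    return " ".join(f"{word}({pos})" for word, pos in zip(sentence_ws, sentence_pos))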
