jiesutd / latticelstm

Chinese NER using Lattice LSTM. Code for ACL 2018 paper.

Python 99.18% Shell 0.82%

lattice-lstm lstm-crf ner chinese-ner lattice-lstm-crf

latticelstm's Issues

Training takes a long time, but performance on the demo data is very low

Hi Jie, thanks for your work on Chinese NER. I downloaded the code and embedding files and ran the demo. However, training took a long time and the performance is poor: the F-value is about 0.4 after 50 epochs. I don't know what I did wrong.
Environment: python2.7, pytorch0.3.0, gpu1080
Thanks!

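About 10 stretches of characters are missing from the text after decode

When we used decode for sequence labeling, we found that about 10 stretches of characters are missing from the resulting raw.out, with neither characters nor labels. These ten-odd gaps are not contiguous, and they are not whole missing sentences; the number of missing characters differs from gap to gap, and the largest gap is close to 200 characters. The saved_model we used corresponds to an intermediate new score, because training has not finished yet. Questions: 1. Must we use the saved model with the highest score, i.e. the last one saved, to get correct results? 2. With parts of sentences missing like this, are the p, r, f metrics produced by the code still correct? 3. What causes the missing characters?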

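Training LatticeLSTM on a new dataset

Hi @jiesutd, I have recently been using LatticeLSTM to train a tagger on a medical dataset, and I have the following questions:
1) The model's batch size is currently fixed at 1, so does the mask still have any effect?
2) I see that the main file calls the bilstmcrf model, bilstmcrf in turn calls the bilstm and crf models, and bilstm calls latticelstm. So does the whole project implement just one model, LatticeLSTM, or does it implement several models including bilstmcrf and latticelstm? If I run run_main.sh with the default configuration, is the result from the latticelstm model?
3) If I run LatticeLSTM on medical data, is it feasible to use the currently provided gigaword_chn.all.a2b.uni.ite50.vec and ctb.50d.vec?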

I upgraded the code to support python3.5 and pytorch 0.4.1, and it passes the tests.

As the title says, I upgraded the code. If you want to use the code with pytorch 0.4.1, please refer to this.
new_version

Have you tried setting data.HP_batch_size to a value greater than 1?

Have you tried setting data.HP_batch_size to a value greater than 1?
I set data.HP_batch_size to 100, but the training results are not good: on the MSRA dataset, F1 is stable at around 0.74. I also switched to the Adam optimizer, but the result is not as good as with the previous default setting; F1 is about 0.91-0.92.

Where can I get the Weibo and MSRA data?

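Cannot reach the accuracy reported in the paper on the Weibo dataset

I tried to reproduce the paper's overall result on the Weibo dataset, but the F1 on the test set only reached 54, whereas the paper reports 58, so I did not reach the paper's accuracy.
I downloaded the Weibo dataset from https://github.com/hltcoe/golden-horse
and used the data/weiboNER_2nd_conll.* files as the dataset, with the BIO scheme.
I wanted to reproduce the overall result, so I did not modify the data and used all of it directly.
My command is as follows: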

python main.py --status train \
                --train ./Weibo/weiboNER_2nd_conll.train.bio \
                --dev ./Weibo/weiboNER_2nd_conll.dev.bio \
                --test ./Weibo/weiboNER_2nd_conll.test.bio \
                --savemodel ./Weibo/model

This is the relevant log output:

Train file: ./Weibo/weiboNER_2nd_conll.train.bio
Dev file: ./Weibo/weiboNER_2nd_conll.dev.bio
Test file: ./Weibo/weiboNER_2nd_conll.test.bio
Raw file: None
Char emb: data/gigaword_chn.all.a2b.uni.ite50.vec
Bichar emb: None
Gaz file: data/ctb.50d.vec
Model saved to: ./Weibo/model
Load gaz file:  data/ctb.50d.vec  total size: 704368
gaz alphabet size: 10798
gaz alphabet size: 12235
gaz alphabet size: 13671
build word pretrain emb...
Embedding:
     pretrain word:11327, prefect match:3281, case_match:0, oov:75, oov%:0.0223413762288
build biword pretrain emb...
Embedding:
     pretrain word:0, prefect match:0, case_match:0, oov:42646, oov%:0.999976551692
build gaz pretrain emb...
Embedding:
     pretrain word:704368, prefect match:13669, case_match:0, oov:1, oov%:7.31475385853e-05
Training model...
DATA SUMMARY START:
     Tag          scheme: BIO
     MAX SENTENCE LENGTH: 250
     MAX   WORD   LENGTH: -1
     Number   normalized: True
     Use          bigram: False
     Word  alphabet size: 3357
     Biword alphabet size: 42647
     Char  alphabet size: 3357
     Gaz   alphabet size: 13671
     Label alphabet size: 18
     Word embedding size: 50
     Biword embedding size: 50
     Char embedding size: 30
     Gaz embedding size: 50
     Norm     word   emb: True
     Norm     biword emb: True
     Norm     gaz    emb: False
     Norm   gaz  dropout: 0.5
     Train instance number: 1350
     Dev   instance number: 270
     Test  instance number: 270
     Raw   instance number: 0
     Hyperpara  iteration: 100
     Hyperpara  batch size: 1
     Hyperpara          lr: 0.015
     Hyperpara    lr_decay: 0.05
     Hyperpara     HP_clip: 5.0
     Hyperpara    momentum: 0
     Hyperpara  hidden_dim: 200
     Hyperpara     dropout: 0.5
     Hyperpara  lstm_layer: 1
     Hyperpara      bilstm: True
     Hyperpara         GPU: True
     Hyperpara     use_gaz: True
     Hyperpara fix gaz emb: False
     Hyperpara    use_char: False
DATA SUMMARY END.
Data setting saved to file:  ./Weibo/model.dset

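@jiesutd, could you tell me where the problem might be? Many thanks!

Results of running named entity recognition on my own text

Hello, I tried using a trained model to run named entity recognition on some external text.

I found that the loaded text cannot be a plain character sequence; it has to be processed into a format with one "char label" pair per line, and there must be a blank line between sentences. (Is there a limit on sentence length?)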
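
The input format described above (one character and one label per line, a blank line between sentences) can be produced from plain text with a small script. The following is a minimal Python 3 sketch and not part of the repository: the function name text_to_char_lines, the placeholder label "O", and the choice of sentence-ending punctuation are all assumptions made for illustration.

```python
# Hypothetical helper, not part of LatticeLSTM: converts plain text into
# "char label" lines with a blank line after each sentence.
SENTENCE_END = set("。！？")  # assumed sentence delimiters


def text_to_char_lines(text, placeholder="O"):
    lines = []
    for ch in text:
        if ch.isspace():
            continue  # whitespace cannot be a token in a space-separated format
        lines.append(ch + " " + placeholder)
        if ch in SENTENCE_END:
            lines.append("")  # blank line marks the sentence boundary
    if lines and lines[-1] != "":
        lines.append("")  # terminate the final sentence as well
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(text_to_char_lines("铁佛寺始建于宋代。"), end="")
```

The placeholder label is only there so that every line has two columns; whether the decode step actually requires a label column should be checked against the repository's reader.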

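Then I found that the result recognized almost none of the entities in my sentences. The training dataset I used is the ResumeNER dataset; I wonder whether the poor result is related to this training dataset.

Command:

python main.py --status decode --raw ./data/bioes2.txt --savedset ./data/saved_model.dset --loadmodel ./data/saved_models/saved_model.35.model --output ./data/res.out

bioes2.txt is the text on which to run named entity recognition

res.out is the named entity recognition result

The content of bioes2.txt is as follows: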
东 B-LOC
光 E-LOC
铁 E-LOC
佛 E-LOC
寺 E-LOC
位 O
于 O
沧 B-LOC
州 E-LOC
市 E-LOC
东 B-LOC
光 E-LOC
县 E-LOC
县 O
城 O
内 O
， O
是 O
沧 B-LOC
州 E-LOC
最 O
著 O
名 O
的 O
佛 B-LOC
教 E-LOC
寺 E-LOC
院 O
， O
已 O
有 O
千 O
年 O
历 O
史 O
， O
在 O
沧 B-LOC
州 E-LOC
当 O
地 O
自 O
古 O
就 O
有 O
“ O
沧 B-LOC
州 E-LOC
狮 B-LOC
子 E-LOC
景 E-LOC
州 E-LOC
塔 E-LOC
， O
东 B-LOC
光 E-LOC
县 E-LOC
的 O
铁 B-LOC
菩 E-LOC
萨 E-LOC
” O
的 O
说 O
法 O
， O
很 O
多 O
当 O
地 O
人 O
拜 O
佛 O
祈 O
福 O
都 O
会 O
选 O
择 O
这 O
里 O
。 O
铁 B-LOC
佛 E-LOC
寺 E-LOC
始 O
建 O
于 O
宋 O
代 O
， O
后 O
曾 O
经 O
被 O
毁 O
， O
寺 O
内 O
的 O
古 O
迹 O
和 O
古 O
铁 O
佛 O
早 O
已 O
不 O
存 O
， O
如 O
今 O
的 O
铁 B-LOC
佛 E-LOC
寺 E-LOC
是 O
九 O
十 O
年 O
代 O
时 O
重 O
新 O
修 O
建 O
的 O
。 O
但 O
修 O
建 O
后 O
的 O
寺 O
院 O
庄 O
严 O
大 O
气 O
， O
而 O
且 O
修 O
建 O
时 O
也 O
产 O
生 O
了 O
很 O
多 O
神 O
话 O
传 O
说 O
， O
使 O
得 O
如 O
今 O
的 O
铁 B-LOC
佛 E-LOC
寺 E-LOC
依 O
然 O
香 O
火 O
旺 O
盛 O
。 O
在 O
铁 B-LOC
佛 E-LOC
寺 E-LOC
内 O
游 O
玩 O
时 O
， O
可 O
以 O
着 O
重 O
观 O
看 O
寺 O
内 O
的 O
巨 O
大 O
铁 O
佛 O
， O
铁 O
佛 O
高 O
约 O
8 O
米 O
多 O
， O
非 O
常 O
壮 O
观 O
。 O
寺 O
内 O
另 O
有 O
多 O
座 O
佛 B-LOC
殿 E-LOC
， O
都 O
可 O
以 O
一 O
一 O
参 O
观 O
。 O
另 O
外 O
还 O
有 O
京 B-LOC
剧 O
名 O
旦 O
荀 O
慧 O
生 O
的 O
纪 B-LOC
念 E-LOC
馆 E-LOC
， O
可 O
以 O
进 O
入 O
了 O
解 O
一 O
下 O
。 O
在 O
寺 O
内 O
上 O
香 O
、 O
磕 O
头 O
时 O
， O
一 O
般 O
会 O
被 O
要 O
求 O
给 O
一 O
点 O
香 O
火 O
钱 O
， O
每 O
人 O
1 O
0 O

O
2 O
0 O
元 O
左 O
右 O
即 O
可 O
。 O
" B-LOC
沧 E-LOC
州 E-LOC
民 O
谣 O
： O
“ O
一 O
文 O
一 O
武 O
， O
一 O
国 O
宝 O
， O
一 O
人 O
祖 O
。 O
” O
文 O
者 O
， O
是 O
一 O
代 O
文 O
宗 O
纪 O
晓 O
岚 O
， O
武 O
者 O
， O
是 O
沧 B-LOC
州 E-LOC
乃 O
驰 O
名 O
中 B-LOC
外 O
的 O
武 O
术 O
之 O
乡 O
， O
国 O
宝 O
指 B-LOC
沧 E-LOC
州 E-LOC
铁 B-LOC
狮 E-LOC
， O
人 O
祖 O
即 O
盘 O
古 O
， O
盘 B-LOC
古 E-LOC
遗 E-LOC
址 E-LOC
就 O
在 O
今 B-LOC
沧 E-LOC
州 E-LOC
市 E-LOC
所 O
属 O
的 O
青 B-LOC
县 E-LOC
境 O
内 O
。 O
青 B-LOC
县 E-LOC
城 O
南 O
6 O
公 O
里 O
有 O
村 O
曰 O
“ O
大 O
盘 O
古 O
” O
， O
村 O
西 O
有 O
座 O
盘 O
古 O
庙 O
。 O

The content of res.out is as follows:
东 O
光 O
铁 O
佛 O
寺 O
位 O
于 O
沧 O
州 O
市 O
东 O
光 O
县 O
县 O
城 O
内 O
， O

是 O
沧 O
州 O
最 O
著 O
名 O
的 O
佛 B-ORG
教 M-ORG
寺 M-ORG
院 E-ORG
， O

已 O
有 O
千 O
年 O
历 O
史 O
， O

在 O
沧 O
州 O
当 O
地 O
自 O
古 O
就 O
有 O
“ O
沧 O
州 O
狮 O
子 O
景 O
州 O
塔 O
， O

东 O
光 O
县 O
的 O
铁 O
菩 O
萨 O
” O
的 O
说 O
法 O
， O
很 O
多 O
当 O
地 O
人 O
拜 O
佛 O
祈 O
福 O
都 O
会 O
选 O
择 O
这 O
里 O
。 O

铁 O
佛 O
寺 O
始 O
建 O
于 O
宋 O
代 O
， O

后 O
曾 O
经 O
被 O
毁 O
， O

寺 O
内 O
的 O
古 O
迹 O
和 O
古 O
铁 O
佛 O
早 O
已 O
不 O
存 O
， O

如 O
今 O
的 O
铁 O
佛 O
寺 O
是 O
九 O
十 O
年 O
代 O
时 O
重 O
新 O
修 O
建 O
的 O
。 O

但 O
修 O
建 O
后 O
的 O
寺 O
院 O
庄 O
严 O
大 O
气 O
， O

而 O
且 O
修 O
建 O
时 O
也 O
产 O
生 O
了 O
很 O
多 O
神 O
话 O
传 O
说 O
， O

使 O
得 O
如 O
今 O
的 O
铁 O
佛 O
寺 O
依 O
然 O
香 O
火 O
旺 O
盛 O
。 O

在 O
铁 O
佛 O
寺 O
内 O
游 O
玩 O
时 O
， O

可 O
以 O
着 O
重 O
观 O
看 O
寺 O
内 O
的 O
巨 O
大 O
铁 O
佛 O
， O

铁 O
佛 O
高 O
约 O
0 O
米 O
多 O
， O
非 O
常 O
壮 O
观 O
。 O

寺 O
内 O
另 O
有 O
多 O
座 O
佛 O
殿 O
， O
都 O
可 O
以 O
一 O
一 O
参 O
观 O
。 O

另 O
外 O
还 O
有 O
京 O
剧 O
名 O
旦 O
荀 O
慧 O
生 O
的 O
纪 O
念 O
馆 O
， O

可 O
以 O
进 O
入 O
了 O
解 O
一 O
下 O
。 O

在 O
寺 O
内 O
上 O
香 O
、 O
磕 O
头 O
时 O
， O

一 O
般 O
会 O
被 O
要 O
求 O
给 O
一 O
点 O
香 O
火 O
钱 O
， O

每 O
人 O
0 O
0 O

O
0 O
0 O
元 O
左 O
右 O
即 O
可 O
。 O

" O
沧 O
州 O
民 O
谣 O
： O
“ O
一 O
文 O
一 O
武 O
， O
一 O
国 O
宝 O
， O
一 O
人 O
祖 O
。 O
” O
文 O
者 O
， O
是 O
一 O
代 O
文 O
宗 O
纪 O
晓 O
岚 O
， O
武 B-TITLE
者 E-TITLE
， O
是 O
沧 O
州 O
乃 O
驰 O
名 O
中 O
外 O
的 O
武 O
术 O
之 O
乡 O
， O

国 O
宝 O
指 O
沧 O
州 O
铁 O
狮 O
， O

人 O
祖 O
即 O
盘 O
古 O
， O
盘 O
古 O
遗 O
址 O
就 O
在 O
今 O
沧 O
州 O
市 O
所 O
属 O
的 O
青 O
县 O
境 O
内 O
。 O

青 O
县 O
城 O
南 O
0 O
公 O
里 O
有 O
村 O
曰 O
“ O
大 O
盘 O
古 O
” O
， O
村 O
西 O
有 O
座 O
盘 O
古 O
庙 O
。 O

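Label question in the metric.py code

Hi!
When using the ResumeNER data, we found that the label column contains "B-", "M-", "E-" and "O", but in the few lines after about line 76 of LatticeLSTM/utils/metric.py, the labels being checked for do not include M. So our questions are: 1. Is "M-" required for the BMES tagging scheme? Was it left out of the code by mistake? 2. The real label is "O" while the code uses "S-"; does this affect the final result?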
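
For reference, in the BMES scheme a multi-character entity is tagged B-, then zero or more M-, then E-, and a single-character entity is tagged S-. The sketch below shows one way to recover entity spans from such a tag sequence. It is an illustration only, not the code in utils/metric.py, and the function name bmes_to_spans is hypothetical.

```python
def bmes_to_spans(tags):
    """Return (start, end, type) spans, end inclusive, from a BMES tag list."""
    spans = []
    start, etype = None, None  # currently open span, if any
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.append((i, i, label))
            start, etype = None, None
        elif prefix == "B":
            start, etype = i, label
        elif prefix == "M":
            # M- only has to agree with the open span; otherwise drop the span.
            if start is None or label != etype:
                start, etype = None, None
        elif prefix == "E":
            if start is not None and label == etype:
                spans.append((start, i, label))
            start, etype = None, None
        else:  # "O" or anything unrecognized closes any open span
            start, etype = None, None
    return spans


if __name__ == "__main__":
    print(bmes_to_spans(["O", "B-ORG", "M-ORG", "M-ORG", "E-ORG", "O"]))
```

In a scan like this, span boundaries are decided by B-, E- and S-, while M- tags are only checked for consistency with the span that is already open.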

Request the code for preprocessing OntoNotes 4

Hello, I am trying to reproduce your work on OntoNotes 4. Could you please provide some code or scripts for preprocessing that dataset? I mean, to split it into train/ dev/ test set, and to transform the original format in OntoNotes to CoNLL format (BMES).

I have downloaded OntoNotes 4 from LDC using my license, and tried to split that dataset according to the paper Named Entity Recognition with Bilingual Constraints, as mentioned in your ACL18 paper. However, some statistics are not consistent with the results shown in your paper. It will help a lot if you could provide the code for preprocessing. Thanks!

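MSRA dataset

Hello:
The MSRA dataset I downloaded from GitHub uses BIO-format labels, but you seem to use BMES-format labels. Could you share your version via Baidu Cloud or some other way?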
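
Since BIO and BMES encode the same entity spans, BIO labels can be converted mechanically. A minimal sketch of such a conversion follows. The function name bio_to_bmes is hypothetical, and the sketch assumes well-formed BIO input in which every entity starts with B-.

```python
def bio_to_bmes(tags):
    """Convert one sentence's BIO tags to BMES.

    Multi-character entities become B-/M-/E-, single-character ones S-.
    Assumes well-formed BIO input (each entity begins with a B- tag).
    """
    out = []
    n = len(tags)
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < n else "O"
        continues = nxt == "I-" + label  # does the entity go on after this char?
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            out.append(("M-" if continues else "E-") + label)
    return out


if __name__ == "__main__":
    print(bio_to_bmes(["B-LOC", "I-LOC", "I-LOC", "O", "B-PER", "O"]))
```

Whether the converted data matches the split and preprocessing used in the paper is a separate question that this sketch does not address.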

MemoryError with my own dataset; the log is as follows

CuDNN: True
GPU available: False
Status: train
Seg: True
Train file: ./rd_data/train.txt
Dev file: ./rd_data/dev.txt
Test file: ./rd_data/test.txt
Raw file: None
Char emb: data/gigaword_chn.all.a2b.uni.ite50.vec
Bichar emb: None
Gaz file: data/ctb.50d.vec
Model saved to: ./rd_data/demo_test
Load gaz file: data/ctb.50d.vec total size: 704368
gaz alphabet size: 31572
gaz alphabet size: 33642
gaz alphabet size: 35512
build word pretrain emb...
Embedding:
pretrain word:11327, prefect match:2497, case_match:0, oov:29, oov%:0.0114760585675
build biword pretrain emb...
Embedding:
pretrain word:0, prefect match:0, case_match:0, oov:91271, oov%:0.999989043737
build gaz pretrain emb...
Embedding:
pretrain word:704368, prefect match:35510, case_match:0, oov:1, oov%:2.81594953818e-05
Training model...
DATA SUMMARY START:
Tag scheme: BIO
MAX SENTENCE LENGTH: 250
MAX WORD LENGTH: -1
Number normalized: False
Use bigram: False
Word alphabet size: 2527
Biword alphabet size: 91272
Char alphabet size: 2527
Gaz alphabet size: 35512
Label alphabet size: 5
Word embedding size: 50
Biword embedding size: 50
Char embedding size: 30
Gaz embedding size: 50
Norm word emb: True
Norm biword emb: True
Norm gaz emb: False
Norm gaz dropout: 0.5
Train instance number: 28185
Dev instance number: 5885
Test instance number: 5977
Raw instance number: 0
Hyperpara iteration: 100
Hyperpara batch size: 1
Hyperpara lr: 0.015
Hyperpara lr_decay: 0.05
Hyperpara HP_clip: 5.0
Hyperpara momentum: 0
Hyperpara hidden_dim: 200
Hyperpara dropout: 0.5
Hyperpara lstm_layer: 1
Hyperpara bilstm: True
Hyperpara GPU: False
Hyperpara use_gaz: True
Hyperpara fix gaz emb: False
Hyperpara use_char: False
DATA SUMMARY END.
Traceback (most recent call last):
File "main_test.py", line 444, in <module>
train(data, save_model_dir, seg)
File "main_test.py", line 240, in train
save_data_setting(data, save_data_name)
File "main_test.py", line 90, in save_data_setting
new_data = copy.deepcopy(data)
File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/usr/lib/python2.7/copy.py", line 298, in _deepcopy_inst
state = deepcopy(state, memo)
File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/usr/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/usr/lib/python2.7/copy.py", line 230, in _deepcopy_list
y.append(deepcopy(a, memo))
File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/usr/lib/python2.7/copy.py", line 230, in _deepcopy_list
y.append(deepcopy(a, memo))
File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/usr/lib/python2.7/copy.py", line 230, in _deepcopy_list
y.append(deepcopy(a, memo))
File "/usr/lib/python2.7/copy.py", line 192, in deepcopy
memo[d] = y
MemoryError
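
The traceback shows the MemoryError being raised inside copy.deepcopy(data), called from save_data_setting. One possible workaround, sketched here under assumptions, is to take a shallow copy and blank out the large instance lists on the copy, so the training instances are never duplicated in memory. The Data class and the attribute names train_texts and train_Ids below are hypothetical stand-ins; the attributes that actually hold the instances would need to be checked against the project's data class.

```python
import copy
import pickle


class Data:
    """Hypothetical stand-in for the project's data container."""

    def __init__(self):
        self.label_alphabet = {"O": 0, "B-LOC": 1}
        # In practice these lists hold every training instance and are large.
        self.train_texts = [["东", "光"], ["沧", "州"], ["青", "县"]]
        self.train_Ids = [[1, 2], [3, 4], [5, 6]]


# Assumed names of the attributes that hold instances and need not be saved.
LARGE_ATTRS = ("train_texts", "train_Ids")


def slim_copy(data):
    """Shallow-copy `data` and blank out the large lists on the copy only."""
    slim = copy.copy(data)  # new object, attribute values still shared
    for name in LARGE_ATTRS:
        setattr(slim, name, [])  # rebinding here does not touch `data`
    return slim


def save_data_setting(data, path):
    """Pickle the slimmed-down settings without deep-copying the instances."""
    with open(path, "wb") as fp:
        pickle.dump(slim_copy(data), fp)
```

Because the shallow copy shares its remaining attributes with the original object, it should be pickled right away and not modified further.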

Question about word embeddings

Do I need to download all the files in the two embedding folders, or is one file from each enough?
Thanks

About the annotation of the Resume dataset

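Hi. I'm now dealing with some unannotated clinical data, and I wonder how you manually annotated the resume data in your experiment. Did you use some tricks or ML-based annotation? Thanks XD.

About the character baseline results

I reimplemented the paper's character-model baseline in TensorFlow with the same experimental configuration and embeddings, but got an F-value of only about 61 on the OntoNotes 4 test set, while on the Weibo corpus the result is somewhat higher. If I use your code, what changes do I need to make to run the character-model baseline?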

The training time is too long.

I tried to train a model on the MSRA data using an NVIDIA 1080 Ti, and it takes about 120 seconds per 500 sentences. That is acceptable on a small dataset, but if the dataset is larger, for instance 5 times bigger than MSRA, the training time is too long.

Is there any way to speed up training?

What is the raw data?

(1) We can pass three kinds of status to main.py: "train", "test" and "decode". In the "train" step you use the "dev" set to choose the best model and save it. It seems that you use the "test" data to print the model's performance at each iteration. Am I right? When status="test", you also use the "dev" data and "test" data to show the model's performance, but you have already used them during the training stage. Is that OK?
(2) In main.py you mention "raw" data when the status argument is "decode". Where can I get the "raw" data?

Demo data

When I run the demo (run_demo.sh), it seems to require some more data in the "onto4ner.cn" directory. So where can I get demo.train.char, demo.dev.char and demo.test.char?

Hello, after running your code the F-value is only 0.4. Is there anything I should pay attention to when running it?

Please tell me why.

Tensor dimensions do not match in the cat operation of Viterbi decoding

Traceback (most recent call last):
File "/home/rui/workspace/lattice-lstm/LatticeLSTM-master/main.py", line 459, in <module>
train(data, save_model_dir, seg)
File "/home/rui/workspace/lattice-lstm/LatticeLSTM-master/main.py", line 286, in train
batch_charlen, batch_charrecover, batch_label, mask)
File "/home/rui/workspace/lattice-lstm/LatticeLSTM-master/model/bilstmcrf.py", line 32, in neg_log_likelihood_loss
scores, tag_seq = self.crf._viterbi_decode(outs, mask)
File "/home/rui/workspace/lattice-lstm/LatticeLSTM-master/model/crf.py", line 159, in _viterbi_decode
partition_history = torch.cat(partition_history,0).view(seq_len, batch_size,-1).transpose(1,0).contiguous() ## (batch_size, seq_len. tag_size)
RuntimeError: invalid argument 0: Tensors must have same number of dimensions: got 2 and 3 at /pytorch/torch/lib/THC/generic/THCTensorMath.cu:102

When cat is applied to partition_history, the tensors in the input list have inconsistent dimensions.

In partition_history, the first tensor is [batch_size, tag_size, 1]:

partition = inivalues[:, START_TAG, :].clone().view(batch_size, tag_size, 1)  # bat_size * to_target_size     
partition_history.append(partition)

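In the for loop, however, the partition returned by torch.max has shape [batch_size, tag_size], which does not match the dimensions of the first tensor and causes the cat operation to fail: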

cur_values = cur_values + partition.contiguous().view(batch_size, tag_size, 1).expand(batch_size, tag_size, tag_size)
partition, cur_bp = torch.max(cur_values, 1)
partition_history.append(partition)

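How should I fix this?

When computing a character cell, there are generally two states, c and h

Usually the computation of c takes the c state of the previous character into account. Why is that not the case in this part of the paper? Why does the paper consider only the c states of all the words related to the current character?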

Achieves 93.18% F1-value on MSRA dataset, where to find this dataset

Hi,
I see
It achieves 93.18% F1-value on MSRA dataset, which is the state-of-the-art result on Chinese NER task.
But I tried searching on Google and could not find this MSRA dataset. Could you tell me where I can find it?

Question about class WordLSTMCell

Sorry, I am new to PyTorch, but in the class WordLSTMCell I found that

f, i, g = torch.split(wh_b + wi, split_size=self.hidden_size, dim=1)

In the formula in your paper, wh_b and wi are not added together, so did I misunderstand your code?

def forward(self, input_, hx):
    """
    Args:
        input_: A (batch, input_size) tensor containing input
            features.
        hx: A tuple (h_0, c_0), which contains the initial hidden
            and cell state, where the size of both states is
            (batch, hidden_size).
    Returns:
        h_1, c_1: Tensors containing the next hidden and cell state.
    """
    h_0, c_0 = hx
    batch_size = h_0.size(0)
    bias_batch = (self.bias.unsqueeze(0).expand(batch_size, *self.bias.size()))
    wh_b = torch.addmm(bias_batch, h_0, self.weight_hh)
    wi = torch.mm(input_, self.weight_ih)
    f, i, g = torch.split(wh_b + wi, split_size=self.hidden_size, dim=1)
    c_1 = torch.sigmoid(f)*c_0 + torch.sigmoid(i)*torch.tanh(g)
    return c_1
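
One observation that may help with this question: if the paper writes the gates with a single weight matrix applied to the concatenation of the word input and the hidden state (an assumption about the paper's notation), then the sum in the code computes the same pre-activation. Multiplying a concatenated vector by a stacked matrix equals the sum of the two separate products. The symbols below are chosen to mirror weight_ih, weight_hh and bias in the code and are not taken from the paper.

```latex
W^{\top}\begin{bmatrix} x \\ h \end{bmatrix} + b
  = \begin{bmatrix} W_{ih} \\ W_{hh} \end{bmatrix}^{\top}
    \begin{bmatrix} x \\ h \end{bmatrix} + b
  = W_{ih}^{\top} x + W_{hh}^{\top} h + b
```

Here the last two terms correspond to wh_b (which already includes the bias) and the first term to wi, written for a single example as a column vector.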

TypeError: mul() received an invalid combination of arguments - got (list), but expected one of:

Hi, I ran run_main.sh with ResumeNER but got this error. I rarely use PyTorch. Running torch.Tensor([1]*seqlen) on its own succeeds. So I need your help!

Error in the decode step after training on my own data

CuDNN: True
GPU available: False
Status: decode
Seg: True
Train file: data/conll03/train.bmes
Dev file: data/conll03/dev.bmes
Test file: data/conll03/test.bmes
Raw file: ./rd_data/test/test.txt
Char emb: data/gigaword_chn.all.a2b.uni.ite50.vec
Bichar emb: None
Gaz file: data/ctb.50d.vec
Data setting loaded from file: ./rd_data/test/test.dset
DATA SUMMARY START:
Tag scheme: BMES
MAX SENTENCE LENGTH: 250
MAX WORD LENGTH: -1
Number normalized: False
Use bigram: False
Word alphabet size: 2596
Biword alphabet size: 31940
Char alphabet size: 2596
Gaz alphabet size: 13634
Label alphabet size: 18
Word embedding size: 50
Biword embedding size: 50
Char embedding size: 30
Gaz embedding size: 50
Norm word emb: True
Norm biword emb: True
Norm gaz emb: False
Norm gaz dropout: 0.5
Train instance number: 0
Dev instance number: 0
Test instance number: 0
Raw instance number: 0
Hyperpara iteration: 100
Hyperpara batch size: 1
Hyperpara lr: 0.015
Hyperpara lr_decay: 0.05
Hyperpara HP_clip: 5.0
Hyperpara momentum: 0
Hyperpara hidden_dim: 200
Hyperpara dropout: 0.5
Hyperpara lstm_layer: 1
Hyperpara bilstm: True
Hyperpara GPU: False
Hyperpara use_gaz: True
Hyperpara fix gaz emb: False
Hyperpara use_char: False
DATA SUMMARY END.
Load Model from file: ./rd_data/test/demo_test.6.model
build batched lstmcrf...
build batched bilstm...
build LatticeLSTM... forward , Fix emb: False gaz drop: 0.5
load pretrain word emb... (13634, 50)
build LatticeLSTM... backward , Fix emb: False gaz drop: 0.5
load pretrain word emb... (13634, 50)
build batched crf...
Traceback (most recent call last):
File "main_test.py", line 454, in <module>
decode_results = load_model_decode(model_dir, data, 'raw', gpu, seg)
File "main_test.py", line 348, in load_model_decode
model.load_state_dict(torch.load(model_dir))
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 487, in load_state_dict
.format(name, own_state[name].size(), param.size()))
RuntimeError: While copying the parameter named lstm.word_embeddings.weight, whose dimensions in the model are torch.Size([2596, 50]) and whose dimensions in the checkpoint are torch.Size([2527, 50]).

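I only modified the main file and run_demo.sh.

Error at test time with a model trained on my own dataset

I trained on my own annotated corpus and saved the model to disk. In the test phase, I reloaded the model and got an error when running it.
The error says the dimensions do not match.
The error message is as follows: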

build batched crf...
Traceback (most recent call last):
  File "main.py", line 442, in <module>
    load_model_decode(model_dir, data, 'test', gpu, seg)
  File "main.py", line 348, in load_model_decode
    model.load_state_dict(torch.load(model_dir),strict=False)
  File "/root/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 487, in load_state_dict
    .format(name, own_state[name].size(), param.size()))
RuntimeError: While copying the parameter named lstm.hidden2tag.weight, whose dimensions in the model are torch.Size([9, 200]) and whose dimensions in the checkpoint are torch.Size([7, 200]).

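But the network structure was not changed between training and testing, so how can there be a dimension mismatch? Or is there some other cause?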
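
The error above names only the first parameter whose shape differs, but others may differ as well. Listing every mismatch can help trace which alphabet or label set changed between training and testing. The sketch below works on plain dicts that map parameter names to shape tuples. With PyTorch, such dicts could presumably be built with something like {k: tuple(v.size()) for k, v in state_dict.items()}, which is an assumption about how one would wire it up and not code from the repository. The function name diff_shapes and the second parameter name are hypothetical.

```python
def diff_shapes(model_shapes, checkpoint_shapes):
    """Return {name: (model_shape, checkpoint_shape)} for every disagreement.

    A name present on only one side is reported with None for the other.
    """
    mismatches = {}
    for name in sorted(set(model_shapes) | set(checkpoint_shapes)):
        a = model_shapes.get(name)
        b = checkpoint_shapes.get(name)
        if a != b:
            mismatches[name] = (a, b)
    return mismatches


if __name__ == "__main__":
    # Shapes of the first entry come from the error message; the second
    # entry is a made-up parameter that happens to match.
    model = {"lstm.hidden2tag.weight": (9, 200), "other.weight": (50, 50)}
    ckpt = {"lstm.hidden2tag.weight": (7, 200), "other.weight": (50, 50)}
    print(diff_shapes(model, ckpt))
```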

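Weibo and MSRA char baselines do not reach the values in the paper

Based on your code, I used only char embeddings to reproduce the char-based Weibo and MSRA experiments, and found that neither result reaches the value cited in the paper: Weibo test is only 0.475 versus 0.5277 in the paper, and MSRA test is only 85.75 versus 88.81 in the paper. So I would like to ask the author what might be causing this. I have been debugging for a long time without much improvement.

Question about a new training set

Hello, I now want to train on my own corpus. Does the tag set have to be changed to BIOES, or is BIO also fine? Where do I change the tag set?
Thanks

Questions about your paper

Hello, I read your paper and thought it was great. I am a newcomer to the field and would like to ask about a few points.

The paper says that word segmentation is not used, but aren't word embeddings obtained by training on segmented text? If segmentation was not used, how were the word embeddings obtained?
If segmentation was used, then a single segmentation method can produce only one segmentation of "南京市长江大桥" (Nanjing Yangtze River Bridge). Why, then, in the paper's model, are words such as "大桥" (Bridge) and "长江大桥" (Yangtze River Bridge) fed into the cell for "桥"?
I hope to hear back from you. Thank you very much.

Error when running the demo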

Epoch: 0/100
Learning rate is setted as: 0.015
Traceback (most recent call last):
File "main.py", line 436, in <module>
train(data, save_model_dir, seg)
File "main.py", line 281, in train
loss, tag_seq = model.neg_log_likelihood_loss(gaz_list, batch_word, batch_biword, batch_wordlen, batch_char, batch_charlen, batch_charrecover, batch_label, mask)
File "/root/receiveData/LatticeLSTM/model/bilstmcrf.py", line 32, in neg_log_likelihood_loss
scores, tag_seq = self.crf._viterbi_decode(outs, mask)
File "/root/receiveData/LatticeLSTM/model/crf.py", line 159, in _viterbi_decode
partition_history = torch.cat(partition_history,0).view(seq_len, batch_size,-1).transpose(1,0).contiguous() ## (batch_size, seq_len. tag_size)
RuntimeError: invalid argument 0: Tensors must have same number of dimensions: got 2 and 3 at /pytorch/torch/lib/THC/generic/THCTensorMath.cu:102

Can Lattice LSTM be used for transfer learning?

OntoNotes 4 dataset split

Are dev and test simply a split of chtb 0001-0325 and chtb 1001-1078 by odd and even file numbers, with the rest used as the training set?

Pretrained embedding link is not available now

Have you tried a similar method on Chinese word segmentation?

1. This kind of boundary information should also help word segmentation. Have you tried it?
2. Do the words used by the lattice in the paper come from an off-the-shelf segmenter, or from unsupervised segmentation used to build the lexicon?

Thanks.

In the experiment on the MSRA dataset, what did I do wrong?

In my experiment, the char embeddings and word embeddings are gigaword_chn.all.a2b.uni.ite50.vec and ctb.50d.vec respectively, while bichar_emb is set to None. Other parameters take the default values in the code. So far 80 epochs have been run on an NVIDIA 1080 Ti GPU, but the test result on the MSRA test set has not reached the result in the paper; the best result is acc: 0.9891, p: 0.9331, r: 0.9093, f: 0.9210. What did I do wrong?

In addition, if char embeddings trained on Chinese Wikipedia (bigger than gigaword, and the embeddings contain 16115 words, 100 dimensions) are used instead of gigaword_chn.all.a2b.uni.ite50.vec(11327 words, 50 dimensions), the difference of test results between Bi-LSTM+CRF based on char + softword and LatticeLSTM (also using the same char embeddings trained on Chinese Wikipedia) is small. Is the big difference in the paper because of the use of a weaker char embedding?

Which pretrained character embedding is used?

In character-based NER, which one is used as the pretrained embedding: gigaword_chn.all.a2b.uni.ite50.vec or joint4.all.b10c1.2h.iter17.mchar?

Why do digits become "0" in the output?

The sentences are as follows:
Input sentence: 在全国高等医药教材建设研究会和卫生部教材办公室的指导和组织下，在第6版的基础上，经过编委们的精心修改、编撰，完成了本教材的第7版。
After decoding with the trained model file (xxx.model):
Output: 在全国高等医药教材建设研究会和卫生部教材办公室的指导和组织下，在第0版的基础上，经过编委们的精心修改、编撰，完成了本教材的第0版。
I can guarantee that both the embeddings and the dictionary contain vectors for digits.
Question: why do the digits become "0" in the output?
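
The logs quoted in other issues on this page include a "Number normalized" setting, and the res.out sample in an earlier issue shows the same digit-to-0 replacement. A plausible explanation, offered here as an assumption and not a confirmed answer, is that digits are normalized when the input is read and the decoded output is written from the normalized text. The sketch below shows what such a normalization step typically looks like; the function name normalize_digits is hypothetical.

```python
def normalize_digits(text):
    """Replace every digit character with '0', leaving other characters alone."""
    return "".join("0" if ch.isdigit() else ch for ch in text)


if __name__ == "__main__":
    print(normalize_digits("在第6版的基础上"))  # prints: 在第0版的基础上
```

If this is indeed the cause, the original digits would have to be restored from the raw input after decoding, since the normalization itself is not reversible.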

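Why can't the batch_size be changed?

Hello, why must batch_size be fixed to 1? I have not read your code carefully yet; could you explain briefly in advance?

How do I run the model on GPU, and how many places need to change?

I set gpu=True, but the GPU is still not used. What needs to be changed? Is it enough to add cuda() after the model, or are other changes needed as well?

Error reading the gigaword_chn.all.a2b.bi.ite50.vec file

The two vectors below do not have dimension 50; one has 15 and the other 55. Is this a problem with the file itself, or did the transfer go wrong during my download (I don't have a Baidu Netdisk membership, so the download was really slow)?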

森悄 -0.420138 -0.189634 0.346326 -0.235297 -0.389551 -0.588 1.164976 -0.610863 0.073047 0.531165 -3.343037 -0.666090 2.384061 0.129748 -1.972636

系v 0.108717 -0.042028 -2.452340 -0.387857 1.953125 0.230040 2.203831 3.083842 0.400699 -0.449208 1.321026 -2.430978 1.369693 0.100625 -1.246027 -0.846308 -2.649471 0.168484 0.593922 -0.481574 0.546810 -2.844704 -0.956998 -2.017416 1.072134 -1.407300 -0.145390 -0.086188 -0.896394 2.064528 1.660699 0.500353 0.773185 -2.036687 3.072354 0.667415 -0.520374 -1.668948 0.729110 0.385540 -0.868025 0.600913 1.883432 3.111219 -1.039192 1.274076 1.103154 3.524141 -0.77819 -2.084318 -1.281501 -2.526086 -2.124930 -0.793325 -0.496073
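
To find out how many rows are affected, the whole file can be scanned for rows whose dimension differs from the expected one. The following is a minimal sketch assuming the usual text format of one token followed by space-separated values per line. The function name find_bad_vector_lines is hypothetical.

```python
from collections import Counter


def find_bad_vector_lines(lines, expected_dim=None):
    """Return (expected_dim, [(line_no, token, dim), ...]) for malformed rows.

    If expected_dim is None, the most common dimension in the input is used.
    A header row such as "<count> <dim>" would be reported as malformed too.
    """
    rows = []
    for line_no, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        rows.append((line_no, parts[0], len(parts) - 1))
    if not rows:
        return expected_dim, []
    if expected_dim is None:
        expected_dim = Counter(dim for _, _, dim in rows).most_common(1)[0][0]
    bad = [row for row in rows if row[2] != expected_dim]
    return expected_dim, bad
```

A file object can be passed directly, for example with open(path, encoding="utf-8") as f: find_bad_vector_lines(f, 50), where path is whichever .vec file is being checked. Comparing the file size or a checksum against the source, where one is published, would help tell a corrupted download from a problem in the file itself.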

biword embedding?

Hi, can you share the pretrained biword embedding with me?

What is bichar_emb?

In the main Python script, bichar_emb is set to None. What is this embedding?

About the speed of training on MSRA dataset...

Excuse me, I have some trouble training your model on MSRA dataset with a GTX 1080Ti card. I've found the speed of training is quite slow. So, may I know your solution to this problem? (Note: The video memory almost runs out, but there is still much unused computing power left.)

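Some questions about the code

Hello. I came across your paper while going through ACL 2018. Based on the ** in your paper and experiment code, I would like to confirm a few points:
1. Are char features dropped entirely (PS: I did not see character features being extracted by an LSTM in the code)? That is, are char features not combined with the features extracted by the lattice, with only the lattice features being used?
2. bi_word in the code stands for word information, but it does not take part in the computation in LatticeLSTM. Is it unused?
3. With only the lattice network, can long-range dependencies be handled as well as with an LSTM? Is there any evidence for this?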

About regularization

I read in the paper that you set a weight-decay in the optimizer, but I didn't see that term in your initialization of optimizer in main.py here. I wonder if I have skipped something or you really didn't set the regularization in your code? Thanks.

Pretrained character and word embeddings

Hello, are there any tricks for pretraining the embeddings? I found that character embeddings I trained myself with word2vec do not work well on NER.

Why does the same character have different vectors in the character embeddings and the word embeddings? Would it matter if they were the same? Is the relationship between them one of inclusion?

Character and word embeddings

In the README, you mention that the pretrained character and word embeddings are the same as the embeddings in the baseline of RichWordSegmentor, i.e., the character and word embeddings are gigaword_chn.all.a2b.uni.ite50.vec and ctb.50d.vec respectively. This does not seem to be mentioned in the paper. Were the experimental results of LatticeLSTM in the paper obtained using these two embeddings?

In the paper, you mention that the word embeddings are pretrained using word2vec (Mikolov et al., 2013) over automatically segmented Chinese Giga-Word. Is this word embedding used only in the baseline methods?

How can I pretrain my own embeddings?

I want to use this model for my project. Thx

In your experimental results, do the lstm-crf models all use pretrained character embeddings, or are some randomly initialized?

As in the title. Thanks.