
bi-lstm-crf's Introduction

bi-lstm-crf

For a segmentation model based on the Universal Transformer, see: https://github.com/GlassyWing/transformer-word-segmenter

Introduction

Unlike English NLP, Chinese NLP tasks such as semantic analysis, text classification, and textual entailment require word segmentation as a preprocessing step. An intuitive way to segment a Chinese sentence is to label every character in it, indicating whether that character is at the beginning of a word or inside one:

For example, the sentence “成功入侵**党的电脑系统” is labeled as:

"成功 入侵  **党 的 电脑系统"
 B I  B I  B I I  S  B I I I

Here B marks the beginning of a word, I marks a character that is not at the beginning of a word, and S marks a single-character word. With these labels the segmentation is fully determined.
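
As a quick illustration, the following minimal sketch (not code from this repository) turns an already-segmented sentence into one B/I/S tag per character:

def bis_tags(words):
    """Emit one B/I/S tag per character of a segmented sentence."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                               # single-character word
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 1))   # word start, then inside
    return tags

print(bis_tags(["成功", "入侵", "**党", "的", "电脑系统"]))
# ['B', 'I', 'B', 'I', 'B', 'I', 'I', 'S', 'B', 'I', 'I', 'I']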

To label a sequence such as a sentence, a common approach is sequence labeling with a Bi-LSTM (bidirectional LSTM) network, as in the figure below:

The Bi-LSTM produces, for each position, a probability for every possible tag; taking the highest-probability tag at each position yields the full label sequence, e.g. the sequence W0 W1 W2 in the figure above is labeled B I S. However, this can yield ill-formed sequences such as B S S I, so we need to impose some constraints, for example:

  • B can only be followed by I
  • S can only be followed by B or S
  • ...

To enforce this, we add a CRF layer on top of the original model; its role is to learn the constraints between tags (as described above). The architecture becomes Embedding + Bi-LSTM + CRF; for the underlying idea see the paper https://arxiv.org/abs/1508.01991.
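
A minimal sketch of this architecture in Keras with the keras-contrib CRF layer is shown below. This is an illustration only, not the repository's exact build code; the hyper-parameter values follow the configuration listed under "Training results" below.

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_accuracy

vocab_size, num_tags, embed_dim, lstm_units = 6864, 259, 300, 256

model = Sequential()
model.add(Embedding(vocab_size, embed_dim, mask_zero=True))          # character ids -> dense vectors
model.add(Bidirectional(LSTM(lstm_units, return_sequences=True)))    # left and right context per position
model.add(CRF(num_tags, sparse_target=True))                         # learns tag-transition constraints
model.compile(optimizer="adam", loss=crf_loss, metrics=[crf_accuracy])
model.summary()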

Corpus preprocessing

To train the model you first need a corpus; here the People's Daily 2014 corpus (about 800,000 entries) is used as training data. Its format looks like this:

"人民网/nz 1月1日/t 讯/ng 据/p 《/w [纽约/nsf 时报/n]/nz 》/w 报道/v ,/w 美国/nsf 华尔街/nsf 股市/n 在/p 2013年/t 的/ude1 最后/f 一天/mq 继续/v 上涨/vn ,/w 和/cc [全球/n 股市/n]/nz 一样/uyy ,/w 都/d 以/p [最高/a 纪录/n]/nz 或/c 接近/v [最高/a 纪录/n]/nz 结束/v 本/rz 年/qt 的/ude1 交易/vn 。/w "

In the original format, words are separated by spaces and each word is followed by its POS tag. The format the model needs is instead:

          B-N I-N I-N B-NR I-NR I-NR S-W

Run the command:

python tools/data_preprocess.py people-2014/train 2014_processed -c True -s True

to convert the original files into files annotated with BIS tags (B: beginning of a chunk, I: not the beginning of a chunk, S: a single-character word).

The command above reads the files under people-2014/train and produces the text file 2014_processed.
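
The conversion performed by this step is roughly the following (a simplified sketch, not tools/data_preprocess.py itself; bracketed compounds such as [纽约/nsf 时报/n]/nz are ignored here):

def line_to_bis(line):
    """Turn a 'word/POS word/POS ...' line into per-character tags such as B-NZ, I-NZ, S-W."""
    chars, tags = [], []
    for token in line.split():
        word, _, pos = token.partition("/")
        pos = pos.upper() or "X"
        for i, ch in enumerate(word):
            chars.append(ch)
            if len(word) == 1:
                tags.append("S-" + pos)   # single-character word
            elif i == 0:
                tags.append("B-" + pos)   # beginning of a word
            else:
                tags.append("I-" + pos)   # inside a word
    return chars, tags

chars, tags = line_to_bis("人民网/nz 1月1日/t 讯/ng")
print(" ".join(chars))
print(" ".join(tags))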

Generating the dictionaries

Run the command:

python tools/make_dicts.py 2014_processed -s src_dict.json -t tgt_dict.json

This reads the file 2014_processed and generates two dictionary files: src_dict.json and tgt_dict.json.

For usage details, see: python tools/make_dicts.py -h
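
Roughly speaking, src_dict.json maps characters to integer ids and tgt_dict.json maps the B/I/S-POS tags to ids. A hedged sketch of what such a step could look like (the actual tools/make_dicts.py may reserve different ids, use a different line delimiter, and a different file layout):

import json
from collections import Counter

def make_dicts(corpus_path, src_path, tgt_path, delimiter="\t"):
    """Build a character dictionary and a tag dictionary from the processed corpus.
    The delimiter between the character part and the tag part of a line is an assumption."""
    char_counts, tag_set = Counter(), set()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            chars, _, tags = line.rstrip("\n").partition(delimiter)
            char_counts.update(chars.split())
            tag_set.update(tags.split())
    src_dict = {c: i + 2 for i, (c, _) in enumerate(char_counts.most_common())}  # 0 = pad, 1 = unknown
    tgt_dict = {t: i + 1 for i, t in enumerate(sorted(tag_set))}                 # 0 = pad
    with open(src_path, "w", encoding="utf-8") as f:
        json.dump(src_dict, f, ensure_ascii=False)
    with open(tgt_path, "w", encoding="utf-8") as f:
        json.dump(tgt_dict, f, ensure_ascii=False)

make_dicts("2014_processed", "src_dict.json", "tgt_dict.json")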

Converting to HDF5 format

Run the command:

python tools/convert_to_h5.py 2014_processed 2014_processed.h5 -s src_dict.json -t tgt_dict.json

to convert the text file 2014_processed to HDF5 format, which speeds up training.

For usage details, see: python tools/convert_to_h5.py -h
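
The HDF5 file essentially stores two integer matrices. The sketch below is an assumption based on the data-loader tracebacks quoted in the issues, which read dfile['X'] and dfile['Y']; the real tools/convert_to_h5.py may differ:

import h5py
import numpy as np

def save_h5(X, Y, path="2014_processed.h5"):
    """X, Y: (num_sentences, max_len) arrays of character ids and tag ids, already padded."""
    with h5py.File(path, "w") as dfile:
        dfile.create_dataset("X", data=X, compression="gzip")
        dfile.create_dataset("Y", data=Y, compression="gzip")

def load_h5(path="2014_processed.h5"):
    with h5py.File(path, "r") as dfile:
        return dfile["X"][:], dfile["Y"][:]

save_h5(np.zeros((4, 150), dtype="int32"), np.zeros((4, 150), dtype="int32"))
X, Y = load_h5()
print(X.shape, Y.shape)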

Training

For a training example, see:

train_example.py

During training, the model configuration file data/default-config.json is generated by default, and weight files are written to the models folder.
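
For orientation, a training run along the lines of train_example.py might look like the sketch below. It reuses the model sketch shown earlier; DataLoader.load_data, the h5 path, the target shape, and the checkpoint filename pattern are assumptions based on the issue tracebacks and on the weight file name weights.32--0.18.h5 used later in this README.

from keras.callbacks import ModelCheckpoint
from dl_segmenter.data_loader import DataLoader   # module referenced in the issue tracebacks below

X_train, Y_train, X_valid, Y_valid = DataLoader.load_data("data/2014_processed.h5", frac=0.9)

checkpoint = ModelCheckpoint("models/weights.{epoch:02d}-{val_loss:.2f}.h5",
                             save_weights_only=True)
model.fit(X_train, Y_train[..., None],            # keras-contrib CRF with sparse_target expects a trailing dim
          batch_size=32, epochs=32,
          validation_data=(X_valid, Y_valid[..., None]),
          callbacks=[checkpoint])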

Using character (word) embeddings

Pre-trained character (word) vectors can be used during training as the representation of each character. The vector file format is as follows:

 -0.037438 0.143471 0.391358 ...
 -0.045985 -0.065485 0.251576 ...
 -0.085605 0.081578 0.227135 ...
可以 0.012544 0.069829 0.117207 ...
 -0.321195 0.065808 0.089396 ...
 -0.186070 0.189417 0.265060 ...
 0.037873 0.075681 0.239715 ...
 -0.197969 0.018578 0.233496 ...
 -0.115746 -0.025029 -0 ...

Each line contains one character (or word) followed by its feature vector.

Source of Chinese character (word) vectors: character (word) vectors can be obtained from https://github.com/Embedding/Chinese-Word-Vectors. Each line of the vector file holds one character (word) and its corresponding 300-dimensional vector.
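
A minimal sketch of loading such a file into an embedding matrix for the Embedding layer (not the repository's loader; the file path is a placeholder and src_dict is the character-to-id dictionary built earlier):

import numpy as np

def load_embedding_matrix(path, src_dict, embed_dim=300, max_num_words=20000):
    num_rows = min(len(src_dict) + 2, max_num_words)
    matrix = np.random.uniform(-0.05, 0.05, (num_rows, embed_dim))   # random init for chars without a vector
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != embed_dim + 1:
                continue                              # skip header or malformed lines
            token, vec = parts[0], np.asarray(parts[1:], dtype="float32")
            idx = src_dict.get(token)
            if idx is not None and idx < num_rows:
                matrix[idx] = vec                     # use the pre-trained vector for known characters
    return matrix

# embedding_matrix = load_embedding_matrix("char_vectors.txt", src_dict)   # placeholder file name
# then: Embedding(num_rows, embed_dim, weights=[embedding_matrix], mask_zero=True)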

Training results

The model configuration used for training:

config = {
        "vocab_size": 6864,
        "chunk_size": 259,
        "embed_dim": 300,
        "bi_lstm_units": 256,
        "max_num_words": 20000,
        "dropout_rate": 0.1
    }

Other parameters:

Parameter           Value
batch size          32
epochs              32
steps_per_epoch     2000
validation_steps    20

Note: pre-trained word vectors were not used for this training run.

Final result:

After 32 epochs, accuracy on the validation set reaches 98%.

Segmentation / decoding

  1. Programmatic usage:

    import time
    
    from dl_segmenter import get_or_create, DLSegmenter
    
    if __name__ == '__main__':
        segmenter: DLSegmenter = get_or_create("../data/default-config.json",
                                src_dict_path="../data/src_dict.json",
                                tgt_dict_path="../data/tgt_dict.json",
                                weights_path="../models/weights.32--0.18.h5")
    
        for _ in range(1):
            start_time = time.time()
            for sent, tag in segmenter.decode_texts([
                "美国司法部副部长罗森·施泰因(Rod Rosenstein)指,"
                "这些俄罗斯情报人员涉嫌利用电脑病毒或“钓鱼电邮”,"
                "成功入侵**党的电脑系统,偷取**党高层成员之间的电邮,"
                "另外也从美国一个州的电脑系统偷取了50万名美国选民的资料。"]):
                print(sent)
                print(tag)
            print(f"cost {(time.time() - start_time) * 1000}ms")

    get_or_create

    • Parameters:
      • config_path: path to the model configuration file
      • src_dict_path: path to the source dictionary file
      • tgt_dict_path: path to the target dictionary file
      • weights_path: path to the weights file
    • Returns: a segmenter object

    decode_texts

    • Parameters:
      • a sequence of strings (so several texts can be processed at once)
    • Returns:
      • a sequence whose elements are, for each input sentence, its segmentation result and the POS tag of each word.
  2. Command-line usage:

    python examples/predict.py -s <sentence>

    The command-line mode uses the same model configuration file, dictionary files, etc. as the programmatic mode above. Multiple sentences can be separated by spaces; run predict.py -h for detailed usage.

Segmentation examples

  1. Technology

物理仿真引擎的作用,是让虚拟世界中的物体运动符合真实世界的物理定律,经常用于游戏领域,以便让画面看起来更富有真实感。PhysX是由英伟达提出的物理仿真引擎,其物理模拟计算由专门加速芯片GPU来进行处理,在节省CPU负担的同时还能将物理运算效能成倍提升,由此带来更加符合真实世界的物理效果。

['物理', '仿真引擎', '的', '作用', ',', '是', '让', '虚拟世界', '中', '的', '物体运动', '符合', '真实世界', '的', '物理定律', ',', '经常', '用于', '游戏', '领域', ',', '以便', '让', '画面', '看起来', '更', '富有', '真实感', '。', 'PhysX', '是', '由', '英伟达', '提出', '的', '物理', '仿真引擎', ',', '其', '物理模拟计算', '由', '专门', '加速', '芯片', 'GPU', '来', '进行', '处理', ',', '在', '节省', 'CPU', '负担', '的', '同时', '还', '能', '将', '物理运算', '效能', '成', '倍', '提升', ',', '由此', '带来', '更加', '符合', '真实世界', '的', '物理', '效果', '。']
['n', 'n', 'ude1', 'n', 'w', 'vshi', 'v', 'gi', 'f', 'ude1', 'nz', 'v', 'nz', 'ude1', 'nz', 'w', 'd', 'v', 'n', 'n', 'w', 'd', 'v', 'n', 'v', 'd', 'v', 'n', 'w', 'x', 'vshi', 'p', 'nz', 'v', 'ude1', 'n', 'n', 'w', 'rz', 'nz', 'p', 'd', 'vi', 'n', 'x', 'vf', 'vn', 'vn', 'w', 'p', 'v', 'x', 'n', 'ude1', 'c', 'd', 'v', 'd', 'nz', 'n', 'v', 'q', 'v', 'w', 'd', 'v', 'd', 'v', 'nz', 'ude1', 'n', 'n', 'w']
  2. Politics

昨晚,英国首相特里萨•梅(Theresa May)试图挽救其退欧协议的努力,在布鲁塞尔遭遇了严重麻烦。倍感失望的欧盟领导人们指责她没有拿出可行的提案来向充满敌意的英国议会兜售她的退欧计划。

['昨晚', ',', '英国', '首相', '特里萨•梅', '(', 'TheresaMay', ')', '试图', '挽救', '其', '退', '欧', '协议', '的', '努力', ',', '在', '布鲁塞尔', '遭遇', '了', '严重', '麻烦', '。', '倍感', '失望', '的', '欧盟', '领导', '人们', '指责', '她', '没有', '拿出', '可行', '的', '提案', '来', '向', '充满', '敌意', '的', '英国议会', '兜售', '她', '的', '退欧', '计划', '。']
['t', 'w', 'ns', 'nnt', 'nrf', 'w', 'x', 'w', 'v', 'vn', 'rz', 'v', 'b', 'n', 'ude1', 'ad', 'w', 'p', 'nsf', 'v', 'ule', 'a', 'an', 'w', 'v', 'a', 'ude1', 'n', 'n', 'n', 'v', 'rr', 'v', 'v', 'a', 'ude1', 'n', 'vf', 'p', 'v', 'n', 'ude1', 'nt', 'v', 'rr', 'ude1', 'nz', 'n', 'w']
  3. News

印度尼西亚国家抗灾署此前发布消息证实,印尼巽他海峡附近的万丹省当地时间22号晚遭海啸袭击。

['印度尼西亚', '国家', '抗灾署', '此前', '发布', '消息', '证实', ',', '印尼', '巽他海峡', '附近', '的', '万丹省', '当地时间', '22号', '晚', '遭', '海啸', '袭击', '。']
['nsf', 'n', 'nz', 't', 'v', 'n', 'v', 'w', 'ns', 'nz', 'f', 'ude1', 'ns', 'nz', 'mq', 'tg', 'v', 'n', 'vn', 'w']

Segmentation evaluation results

Evaluated on the development set:

result-(epoch:32):
Gold-standard word count: 20744, word correct rate: 0.939404, word error rate: 0.049653
Gold-standard line count: 317, line correct rate: 0.337539, line error rate: 0.662461
Recall: 0.939404
Precision: 0.949798
F MEASURE: 0.944572
ERR RATE: 0.049653

Miscellaneous

How to evaluate

Evaluation is performed by comparing the output against a gold-standard file.

  1. Data preprocessing

    To generate the gold-standard file and a raw file with the segmentation marks removed, run:

    python examples/score_preprocess.py --corups_dir <corpus directory for evaluation> \
    --gold_file_path <path of the gold-standard file to generate> \
    --restore_file_path <path of the unlabeled raw file to generate>
  2. Read the unlabeled raw file, segment it, and write the result to a file:

    python examples/predict.py -f <path of the text file to segment> -o <path of the file to save the segmentation results>
  3. Generate the evaluation results:

    Run score.py to produce the evaluation file. By default it uses the gold-standard file ../data/gold.utf8 and the model-segmented file ../data/gold.utf8, and saves the evaluation results to ../data/prf_tmp.txt. A sketch of how such a precision/recall/F1 comparison works is shown after this list.

    def main():
        F = prf_score('../data/gold.utf8', '../data/gold.utf8', '../data/prf_tmp.txt', 15)
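
For reference, segmentation precision/recall/F1 can be computed by comparing word spans between the gold file and the predicted file, roughly as in the sketch below (an illustration only, not the repository's prf_score; the predicted-file path is a placeholder):

def word_spans(words):
    """Map a list of words to a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def prf(gold_path, pred_path):
    correct = gold_total = pred_total = 0
    with open(gold_path, encoding="utf-8") as g, open(pred_path, encoding="utf-8") as p:
        for gold_line, pred_line in zip(g, p):
            gold = word_spans(gold_line.split())
            pred = word_spans(pred_line.split())
            correct += len(gold & pred)    # a word is correct only if its exact span is predicted
            gold_total += len(gold)
            pred_total += len(pred)
    recall = correct / gold_total
    precision = correct / pred_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(prf("../data/gold.utf8", "../data/pred.utf8"))   # pred.utf8 is a placeholder name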

Appendix

  1. Segmentation corpus: https://pan.baidu.com/s/1EtXdhPR0lGF8c7tT8epn6Q password: yj9j
  2. Trained model weights, configuration, and dictionaries: https://pan.baidu.com/s/1_IK-e8CDrgaCn-jZqozKJA access code: grng

bi-lstm-crf's People

Contributors

glassywing


bi-lstm-crf's Issues

steps_per_epoch is 2000

Hello~ The code uses steps_per_epoch = 2000 rather than data length // batch_size. Is that because batches are sampled randomly instead of iterating over all of the data?

A question: still about default-config.json

Training works fine when I use the two *_dict.json files and the default-config.json you shared at the end of the README, but after adding the path to pre-trained character vectors I get the same error as several people before me:

KeyError: 'max_num_words'
Traceback (most recent call last):
File "train_example_emb.py", line 55, in
save_config(segmenter, config_save_path)
File "/home/ziyao/projects/bi-lstm-crf/dl_segmenter/init.py", line 9, in save_config
json.dump(obj.get_config(), file)
AttributeError: 'NoneType' object has no attribute 'get_config'

Could you please advise?

Corpus preprocessing fails with the latest version

Following the steps described in the README.md, running the corpus preprocessing command fails at File "tools/ner_data_preprocess.py", line 130 with SyntaxError: name 'MAX_LEN_SIZE' is assigned to before global declaration.

AttributeError, can't find a solution, please advise

Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use rate instead of keep_prob. Rate should be set to rate = 1 - keep_prob.
Traceback (most recent call last):
File "/home/wll/transformer-word-segmenter/examples/init.py", line 320, in get_or_create
TFSegmenter.__singleton = TFSegmenter(**config)
File "/home/wll/transformer-word-segmenter/examples/init.py", line 118, in init
self.model, self.parallel_model = self.__build_model()
File "/home/wll/transformer-word-segmenter/examples/init.py", line 134, in __build_model
enc_output = self.__encoder(emb_output, mask)
File "/home/wll/transformer-word-segmenter/examples/init.py", line 177, in __encoder
next_step_input = transformer_enc_layer(next_step_input, padding_mask=mask)
File "/home/wll/transformer-word-segmenter/examples/transformer.py", line 193, in call
output = self.attention_layer(_input, padding_mask=padding_mask)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/base_layer.py", line 409, in call
with K.name_scope(self.name):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 6088, in enter
return self._name_scope.enter()
File "/usr/lib/python3.5/contextlib.py", line 59, in enter
return next(self.gen)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3994, in name_scope
raise ValueError("'%s' is not a valid scope name" % name)
ValueError: '{name}_self_attention' is not a valid scope name
Traceback (most recent call last):
File "train_example.py", line 63, in
save_config(segmenter, config_save_path)
File "/home/wll/transformer-word-segmenter/examples/init.py", line 333, in save_config
json.dump(obj.get_config(), file)
AttributeError: 'NoneType' object has no attribute 'get_config'

Hello, during training the loss becomes negative and crf_accuracy keeps decreasing

Epoch 1/32
2000/2000 [==============================] - 2107s 1s/step - loss: 0.4634 - crf_accuracy: 0.8636 - val_loss: 0.1261 - val_crf_accuracy: 0.9155
Epoch 2/32
2000/2000 [==============================] - 2095s 1s/step - loss: 0.1106 - crf_accuracy: 0.9120 - val_loss: 0.1009 - val_crf_accuracy: 0.8968
Epoch 3/32
2000/2000 [==============================] - 2088s 1s/step - loss: 0.0789 - crf_accuracy: 0.8760 - val_loss: 0.0157 - val_crf_accuracy: 0.8600
Epoch 4/32
2000/2000 [==============================] - 2086s 1s/step - loss: -0.0747 - crf_accuracy: 0.8349 - val_loss: -0.2816 - val_crf_accuracy: 0.8311
Epoch 5/32
2000/2000 [==============================] - 2092s 1s/step - loss: -0.4189 - crf_accuracy: 0.8114 - val_loss: -0.7276 - val_crf_accuracy: 0.8185
Epoch 6/32
2000/2000 [==============================] - 2083s 1s/step - loss: -0.9522 - crf_accuracy: 0.7970 - val_loss: -1.4038 - val_crf_accuracy: 0.8003
Epoch 7/32
2000/2000 [==============================] - 2079s 1s/step - loss: -1.6091 - crf_accuracy: 0.7862 - val_loss: -2.0125 - val_crf_accuracy: 0.7894
Epoch 8/32
2000/2000 [==============================] - 2081s 1s/step - loss: -2.3388 - crf_accuracy: 0.7761 - val_loss: -2.7899 - val_crf_accuracy: 0.7861
Epoch 9/32
2000/2000 [==============================] - 1947s 974ms/step - loss: -3.0759 - crf_accuracy: 0.7687 - val_loss: -0.8432 - val_crf_accuracy: 0.7548
Epoch 10/32
2000/2000 [==============================] - 2082s 1s/step - loss: -4.0530 - crf_accuracy: 0.7636 - val_loss: -4.4467 - val_crf_accuracy: 0.7617
Epoch 11/32
2000/2000 [==============================] - 2010s 1s/step - loss: -5.0285 - crf_accuracy: 0.7501 - val_loss: -5.6542 - val_crf_accuracy: 0.7364
Epoch 12/32
2000/2000 [==============================] - 2082s 1s/step - loss: -6.1917 - crf_accuracy: 0.7440 - val_loss: -7.3522 - val_crf_accuracy: 0.7747
Epoch 13/32
2000/2000 [==============================] - 2091s 1s/step - loss: -7.3947 - crf_accuracy: 0.7500 - val_loss: -8.3709 - val_crf_accuracy: 0.7618
Epoch 14/32
2000/2000 [==============================] - 2088s 1s/step - loss: -8.7067 - crf_accuracy: 0.7481 - val_loss: -9.4891 - val_crf_accuracy: 0.7552
Epoch 15/32
2000/2000 [==============================] - 2081s 1s/step - loss: -10.0457 - crf_accuracy: 0.7477 - val_loss: -10.8923 - val_crf_accuracy: 0.7486
Epoch 16/32
2000/2000 [==============================] - 2075s 1s/step - loss: -11.6075 - crf_accuracy: 0.7496 - val_loss: -12.9277 - val_crf_accuracy: 0.7411
Epoch 17/32
2000/2000 [==============================] - 2084s 1s/step - loss: -13.4381 - crf_accuracy: 0.7502 - val_loss: -14.8373 - val_crf_accuracy: 0.7625
Epoch 18/32
2000/2000 [==============================] - 2088s 1s/step - loss: -14.9084 - crf_accuracy: 0.7379 - val_loss: -15.6858 - val_crf_accuracy: 0.7443
Epoch 19/32
2000/2000 [==============================] - 2091s 1s/step - loss: -16.6394 - crf_accuracy: 0.7373 - val_loss: -18.9170 - val_crf_accuracy: 0.7803
Epoch 20/32
2000/2000 [==============================] - 2095s 1s/step - loss: 8.7573 - crf_accuracy: 0.7414 - val_loss: -4.6712 - val_crf_accuracy: 0.7364
Epoch 21/32
2000/2000 [==============================] - 2053s 1s/step - loss: -0.2120 - crf_accuracy: 0.7432 - val_loss: -16.9051 - val_crf_accuracy: 0.7578
Epoch 22/32
2000/2000 [==============================] - 2091s 1s/step - loss: -17.4731 - crf_accuracy: 0.7500 - val_loss: -22.1556 - val_crf_accuracy: 0.7590
Epoch 23/32
2000/2000 [==============================] - 2090s 1s/step - loss: -22.3906 - crf_accuracy: 0.7471 - val_loss: -25.7236 - val_crf_accuracy: 0.7367
Epoch 24/32
2000/2000 [==============================] - 2087s 1s/step - loss: -24.3706 - crf_accuracy: 0.7457 - val_loss: -26.9764 - val_crf_accuracy: 0.7515
Epoch 25/32
2000/2000 [==============================] - 1965s 982ms/step - loss: -25.0672 - crf_accuracy: 0.7494 - val_loss: -23.7698 - val_crf_accuracy: 0.7411
Epoch 26/32
2000/2000 [==============================] - 2096s 1s/step - loss: -4.4255 - crf_accuracy: 0.7520 - val_loss: -20.3805 - val_crf_accuracy: 0.7790
Epoch 27/32
2000/2000 [==============================] - 2086s 1s/step - loss: -29.5299 - crf_accuracy: 0.7589 - val_loss: -34.7593 - val_crf_accuracy: 0.7548
Epoch 28/32
2000/2000 [==============================] - 2090s 1s/step - loss: -33.6366 - crf_accuracy: 0.7511 - val_loss: -35.7223 - val_crf_accuracy: 0.7674

Does this indicate overfitting? Which part of the code should I adjust? Thanks.

Error when generating the dictionaries

I created a data folder under examples and put the generated BIS file into it. Then I ran python dict_test.py and got an error saying the file could not be found. I assumed this was a path problem, so I changed ../data/2014 in dict_test.py to data; that error went away, but then I got ValueError: not enough values to unpack (expected 2, got 1). The cause is line 13 of dltokenizer/tools.py, chars, tags = line.split(sent_delimiter): the left-hand side expects two values while the right-hand side yields only one. Is this a bug, or something else? Is BI-LSTM-CRF still being maintained? Thanks!

Files cannot be downloaded from overseas

I cannot use Baidu Netdisk. Could you send me the files some other way? I can use WeChat or email. Thank you.

Why do I also get an error? Details:

File "D:/FirstGrade/python/classification/nlp/fenci/train_example.py", line 80, in
X_train, Y_train, X_valid, Y_valid = DataLoader.load_data(h5_dataset_path, frac=0.9)
File "D:\FirstGrade\python\classification\nlp\fenci\dl_segmenter\data_loader.py", line 72, in load_data
X, Y = dfile['X'][:], dfile['Y'][:]
File "h5py_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "E:\python3.6.4\lib\site-packages\h5py_hl\group.py", line 262, in getitem
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'X' doesn't exist)"

Moderator, asking for help: a problem occurs when running train_example.

Using TensorFlow backend.
ERROR:root:No weights found, create a new model.
G:\Software\anaconda\lib\site-packages\keras\callbacks\tensorboard_v2.py:92: UserWarning: The TensorBoard callback batch_size argument (for histogram computation) is deprecated with TensorFlow 2.0. It will be ignored.
warnings.warn('The TensorBoard callback batch_size argument '
Traceback (most recent call last):
File "C:/Users/Administrator/Desktop/kgcar/bi-lstm-crf-master/examples/train_example.py", line 73, in
X_train, Y_train, X_valid, Y_valid = DataLoader.load_data(h5_dataset_path, frac=0.8)
File "G:\Software\anaconda\lib\site-packages\dl_segmenter\data_loader.py", line 71, in load_data
with h5py.File(h5_file_path, 'r') as dfile:
File "C:\Users\Administrator\AppData\Roaming\Python\Python37\site-packages\h5py_hl\files.py", line 408, in init
swmr=swmr)
File "C:\Users\Administrator\AppData\Roaming\Python\Python37\site-packages\h5py_hl\files.py", line 173, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (unable to open file: name = '../data/2014_processed.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
I get the error above; do you know what the problem is?

Running prediction on the same sentence gives a different result every time

Running tools/predict.py
Input text:
我的家乡在东北,东北天气很冷常下雪

The result is different on every run (only two of the results are copied below):
(1)[(['我的家', '乡', '在东北', ',', '东北天', '气', '很冷常', '下', '雪'], ['vx', 'k', 'vx', 'k', 'vx', 'k', 'vx', 'k', 'uguo'])]
(2)[(['我', '的', '家', '乡在', '东北,东北', '天气很冷', '常下雪'], ['nt', 'gp', 'rz', 'rys', 'pbei', 'ng', 'ude3'])]

Accuracy question

I trained on the same dataset for one epoch, but the F1 score is only 57%, and with your downloaded model the F1 is only 45%. Could there be something wrong with my test set?

Dataset

Hello, where did you download the 2014 POS-tagged dataset? Could you share it?
