happylee1991 / nlp-tools

This project forked from haif-liu/nlp-tools

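😋 This project uses TensorFlow to implement Chinese word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) with a BiLSTM+CRF model.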

Jupyter Notebook 49.54% Python 50.46%

nlp-tools's Introduction

NLP-tools

This project aims to implement a character-level sequence labeling model based on BiLSTM+CRF, using TensorFlow.

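Features:

1. Recognition of out-of-vocabulary characters (words)

2. HTTP interface

3. Quick construction of sequence labeling models such as word segmentation, POS tagging, NER, and SRL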
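To make "character-level sequence labeling" concrete for the segmentation case, here is a minimal sketch of how per-character tags can be turned back into words. It assumes a B/M/E/S tag scheme (begin, middle, end, single); the README does not state which scheme the project uses, and `tags_to_words` is a hypothetical helper, not a function from this repository.

```python
def tags_to_words(chars, tags):
    """Group characters into words according to B/M/E/S tags (assumed scheme)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # single-character word
            if current:           # flush an unfinished word first
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":          # start of a multi-character word
            if current:
                words.append(current)
            current = ch
        elif tag == "M":          # middle of a word
            current += ch
        else:                     # "E" (or any unexpected tag) closes the word
            words.append(current + ch)
            current = ""
    if current:                   # keep a trailing unfinished word
        words.append(current)
    return words


print(tags_to_words("人民网讯", ["B", "M", "E", "S"]))
```

The same grouping idea extends to POS tagging or NER by attaching a category to each position tag.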

Feedback and criticism are welcome.

Usage

Environment setup: create a new conda environment

 $ conda env create -f environment.yaml

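Corpus processing

Annotated corpora come in different formats and each needs its own preprocessing. example/DataPreprocessing.ipynb shows the preprocessing of the People's Daily 2014 corpus. The full corpus is not uploaded to GitHub; only a few samples are included under corpus. It can be found online, and if you cannot find it, email me. Corpus format: 人民网/nz 1月4日/t 讯/ng 据/p [法国/nsf 国际/n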
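As a rough illustration of this kind of preprocessing, the sketch below splits a `word/tag` line into pairs and expands them into per-character labels. It is not the code from example/DataPreprocessing.ipynb. Several things are assumptions: the B/M/E/S position scheme, the `position-tag` label format, the treatment of `[` as the opening bracket of a compound whose closing form looks like `]nt`, and the helper names `parse_line` and `to_char_labels`.

```python
import re


def parse_line(line):
    """Split a 'word/tag word/tag ...' line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        token = token.lstrip("[")                # drop a compound's opening bracket
        token = re.sub(r"\][a-z]*$", "", token)  # drop an assumed closing form such as "]nt"
        word, sep, tag = token.rpartition("/")
        if sep and word:                         # skip tokens without a "word/tag" shape
            pairs.append((word, tag))
    return pairs


def to_char_labels(pairs):
    """Expand (word, tag) pairs into (char, 'position-tag') pairs."""
    labeled = []
    for word, tag in pairs:
        if len(word) == 1:
            positions = ["S"]
        else:
            positions = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        labeled.extend((ch, f"{pos}-{tag}") for ch, pos in zip(word, positions))
    return labeled


sample = "人民网/nz 1月4日/t 讯/ng 据/p [法国/nsf 国际/n"
pairs = parse_line(sample)
for ch, label in to_char_labels(pairs):
    print(ch, label)
```

Nested compound annotations are simply flattened here; a real pipeline may want to keep them as entity spans.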

Preprocessing generates the word2id dictionary and the training data, which are stored in data/xx.pkl.
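A minimal sketch of what building and pickling such a dictionary could look like follows. The `<PAD>` and `<UNK>` entries, the frequency-based ordering, the pickled dictionary layout, and the helper names are all assumptions for illustration; the actual layout of the project's .pkl files is defined in its notebook.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def build_word2id(sentences, pad="<PAD>", unk="<UNK>"):
    """Map each character to an integer id, most frequent characters first."""
    counts = Counter(ch for sent in sentences for ch in sent)
    word2id = {pad: 0, unk: 1}                   # reserved ids (assumed convention)
    for ch, _ in counts.most_common():
        word2id[ch] = len(word2id)
    return word2id


def encode(sentence, word2id, unk="<UNK>"):
    """Convert a sentence to ids, mapping unseen characters to the unknown id."""
    return [word2id.get(ch, word2id[unk]) for ch in sentence]


sentences = ["人民网讯", "法国国际"]
word2id = build_word2id(sentences)
ids = encode("人民国家", word2id)                  # "家" is unseen and maps to <UNK>

# Round-trip through a pickle file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.pkl"
    with path.open("wb") as f:
        pickle.dump({"word2id": word2id, "ids": ids}, f)
    with path.open("rb") as f:
        restored = pickle.load(f)

print(restored["ids"])
```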

Model training

 $ python train.py [-h] [--dict_path DICT_PATH] [--train_data TRAIN_DATA]
                   [--ckpt_path CKPT_PATH] [--embed_size EMBED_SIZE]
                   [--hidden_size HIDDEN_SIZE] [--batch_size BATCH_SIZE]
                   [--epoch EPOCH] [--lr LR]
                   [--save_path SAVE_PATH]

Training writes checkpoints to SAVE_PATH. CKPT_PATH points to an existing checkpoint from which the model is fine-tuned.

Default hyperparameters

  • Embedding size: 256

  • Number of BiLSTM layers: 2

  • Hidden units: 512

  • Batch size: 128

  • Initial learning rate: 1e-3 (needs tuning for each task)
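For orientation, the command-line flags and defaults listed above could be declared with `argparse` roughly as follows. This is a sketch, not the project's train.py: the flag names and the four numeric defaults come from this README, while the help strings, the missing defaults for the path options and `--epoch`, and the function name `build_parser` are assumptions. The BiLSTM layer count does not appear among the documented flags, so it is left out.

```python
import argparse


def build_parser():
    """Declare the documented train.py flags with the documented defaults."""
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--dict_path", help="path to the word2id dictionary")
    parser.add_argument("--train_data", help="path to the training data")
    parser.add_argument("--ckpt_path", help="existing checkpoint to fine-tune from")
    parser.add_argument("--embed_size", type=int, default=256)
    parser.add_argument("--hidden_size", type=int, default=512)
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--epoch", type=int, help="number of training epochs")
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--save_path", help="where new checkpoints are written")
    return parser


# Override two values, as one might when tuning for a new task.
args = build_parser().parse_args(["--lr", "5e-4", "--batch_size", "64"])
print(args.embed_size, args.hidden_size, args.batch_size, args.lr)
```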

Model testing

Model testing examples are in Modeltest.ipynb.

HTTP interface

A simple web server

 $ python app.py

Once the server is running, it can be tested on the local machine with the command below (the quoting differs between Linux and Windows):

 $ curl -i -H "Content-Type: application/json" -X POST -d '{"text":"\u5f20\u51cc\u745e\u3002"}' http://localhost:7777/cws
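Because shell quoting differs across platforms, the same request can also be sent from Python using only the standard library. The URL, the `text` field, and the JSON content type are taken from the curl command; the helper names `build_request` and `segment` are hypothetical, and since the response format is not documented here, the sketch returns the raw response body.

```python
import json
import urllib.request


def build_request(text, url="http://localhost:7777/cws"):
    """Build the same POST request as the curl command above."""
    body = json.dumps({"text": text}).encode("utf-8")   # non-ASCII becomes \uXXXX escapes
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def segment(text, url="http://localhost:7777/cws"):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_request(text, url)) as resp:
        return resp.read().decode("utf-8")


req = build_request("张凌瑞。")
print(req.data.decode("ascii"))   # the JSON payload that would be sent
# With the server from app.py running: print(segment("张凌瑞。"))
```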

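Current status

On the People's Daily corpus, word segmentation reaches 97% accuracy and POS tagging reaches 96% accuracy.

The model was also tested after training on hundreds of millions of sentences, with the CWS, POS, and NER labels merged into a single end-to-end joint label. The overall accuracy reaches 96%, and the model recognizes out-of-vocabulary characters (words) well and captures semantic information.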
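One way to picture such a joint label is to concatenate the three per-character tags into one string, so a single tagger predicts all three tasks at once. The sketch below is only an illustration: the `|` separator, the tag values, the helper names, and per-character accuracy as the metric are assumptions, and the project's real label format may differ.

```python
def fuse(cws_tags, pos_tags, ner_tags):
    """Merge per-character CWS, POS, and NER tags into one joint label each."""
    return [f"{c}|{p}|{n}" for c, p, n in zip(cws_tags, pos_tags, ner_tags)]


def split_fused(labels):
    """Recover the three separate tag sequences from joint labels."""
    cws, pos, ner = [], [], []
    for label in labels:
        c, p, n = label.split("|")
        cws.append(c)
        pos.append(p)
        ner.append(n)
    return cws, pos, ner


def accuracy(gold, pred):
    """Fraction of positions whose joint label matches exactly."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


# Hypothetical tags for the four characters of "张凌瑞。"
gold = fuse(["B", "M", "E", "S"], ["nr", "nr", "nr", "w"], ["PER", "PER", "PER", "O"])
pred = fuse(["B", "M", "E", "S"], ["nr", "nr", "nr", "w"], ["PER", "PER", "O", "O"])
print(gold)
print(accuracy(gold, pred))
```

A joint label counts as correct only when all three of its parts are correct, which makes this a stricter measure than scoring each task separately.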

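(Some examples are listed in Modeltest.ipynb.)

I have recently been reading about Google's BERT. A BERT-based sequence labeling training module will be added later so that the model can transfer across domains.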

References

The BiLSTM+CRF model in this project follows the paper: http://www.aclweb.org/anthology/N16-1030
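In a BiLSTM+CRF tagger, the BiLSTM produces a score for every tag at every position and the CRF layer adds tag-to-tag transition scores; prediction then picks the highest-scoring tag sequence with Viterbi decoding. The pure-Python sketch below shows that decoding step on toy numbers. It is not taken from this repository (which uses TensorFlow), it leaves out start and end transitions for brevity, and the function name and score values are made up for illustration.

```python
def viterbi(emissions, transitions):
    """Return the best tag path and its score.

    emissions[t][k]   : score of tag k at position t (e.g. from a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    scores = list(emissions[0])        # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for reaching tag j at this position.
            best_i = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final tag.
    best_last = max(range(num_tags), key=lambda i: scores[i])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[best_last]


# Toy example with two tags and three positions.
emissions = [[2.0, 1.0], [1.0, 1.5], [0.5, 2.0]]
transitions = [[0.5, 1.0], [1.0, -5.0]]   # staying in tag 1 is heavily penalized
path, score = viterbi(emissions, transitions)
print(path, score)
```

The penalty on the 1 to 1 transition is what steers the second position to tag 0, even though tag 1 has the higher emission score there.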

nlp-tools's People

Contributors

ericlingrui

Watchers

happylee
