jackhcc / chinese-tokenization

Chinese word segmentation implemented with traditional methods (N-gram, HMM, etc.), neural network methods (CNN, LSTM, etc.), and pre-trained language model methods (BERT, etc.).

Python 90.73% Perl 9.27%
bert-crf bilstm-crf hmm-viterbi-algorithm ngram nlp tokenization

chinese-tokenization's Introduction

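Chinese Word Segmentation for Natural Language Processing

Method Overview

  • Traditional algorithms: Chinese word segmentation with N-gram, HMM, maximum entropy, CRF, etc.
  • Neural network methods: CNN, Bi-LSTM, Transformer, etc.
  • Pre-trained language model methods: BERT, etc.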
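As an illustration of the simplest traditional method listed above, the following is a minimal sketch of unigram segmentation: a dynamic program picks the split whose words have the highest total unigram log-probability. It is not the repository's implementation. The function name `unigram_segment`, the toy frequency table `toy_freq`, the `max_word_len` limit, and the unknown-character penalty are all assumptions made for this example.

```python
import math


def unigram_segment(text, word_freq, max_word_len=4):
    """Segment text by maximizing the sum of unigram log-probabilities."""
    total = sum(word_freq.values())
    # Hypothetical smoothing: an unseen single character gets a small probability.
    unk_logp = math.log(1.0 / (total * 10))

    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_freq:
                logp = math.log(word_freq[word] / total)
            elif end - start == 1:
                logp = unk_logp  # fall back to a single unknown character
            else:
                continue         # unknown multi-character strings are not words
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy counts, invented for this example only.
toy_freq = {"研究": 5, "研究生": 3, "生命": 4, "命": 1, "生": 1, "的": 20, "起源": 3}
print(unigram_segment("研究生命的起源", toy_freq))
# → ['研究', '生命', '的', '起源']
```

With these toy counts the split "研究 / 生命" scores higher than "研究生 / 命", which is the kind of ambiguity a frequency-based model resolves and a greedy longest-match rule does not.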

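Dataset Overview

  • PKU and MSR are the datasets used in the Chinese word segmentation bakeoff organized by SIGHAN in 2005. They are also the standard datasets used in academia to evaluate segmentation tools.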
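Sequence-labelling models such as HMM, CRF, and Bi-LSTM usually consume this kind of data as per-character tags rather than as words. The sketch below assumes that each training line is a sentence with words separated by whitespace and that the common BMES scheme (Begin, Middle, End, Single) is used; the source does not state which tag set the repository uses. The helper names `words_to_bmes` and `bmes_to_words` are hypothetical.

```python
def words_to_bmes(words):
    """Map a list of words to one BMES tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars, tags):
    """Rebuild words from characters and their BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:                # tolerate a sequence that ends mid-word
        words.append(current)
    return words


# Example sentence written for this sketch, in whitespace-separated form.
line = "共同  创造  美好  的  新  世纪"
words = line.split()
tags = words_to_bmes(words)
print(tags)
# → ['B', 'E', 'B', 'E', 'B', 'E', 'S', 'S', 'B', 'E']
print(bmes_to_words("".join(words), tags) == words)
# → True
```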

Experimental Procedure

Experimental Results

PKU dataset

Model               Precision   Recall   F1 score
Uni-Gram            0.8550      0.9342   0.8928
Uni-Gram + rules    0.9111      0.9496   0.9300
HMM                 0.7936      0.8090   0.8012
CRF                 0.9409      0.9396   0.9400
Bi-LSTM             0.9248      0.9236   0.9240
Bi-LSTM+CRF         0.9366      0.9354   0.9358
BERT                0.9712      0.9635   0.9673
BERT-CRF            0.9705      0.9619   0.9662
jieba               0.8559      0.7896   0.8214
pkuseg              0.9512      0.9224   0.9366
THULAC              0.9287      0.9295   0.9291

MSR dataset

Model               Precision   Recall   F1 score
Uni-Gram            0.9119      0.9633   0.9369
Uni-Gram + rules    0.9129      0.9634   0.9375
HMM                 0.7786      0.8189   0.7983
CRF                 0.9675      0.9676   0.9675
Bi-LSTM             0.9624      0.9625   0.9624
Bi-LSTM+CRF         0.9631      0.9632   0.9632
BERT                0.9841      0.9817   0.9829
BERT-CRF            0.9805      0.9787   0.9796
jieba               0.8204      0.8145   0.8174
pkuseg              0.8701      0.8894   0.8796
THULAC              0.8428      0.8880   0.8648
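
Scores like those in the tables above are conventionally computed over word spans: a predicted word counts as correct only if both of its boundaries match a gold word. The sketch below shows that calculation under this assumption. The repository's own scoring procedure is not described in the source, and the names `to_spans` and `prf` are hypothetical.

```python
def to_spans(words):
    """Convert a word list to a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold_sents, pred_sents):
    """Span-level precision, recall, and F1 over parallel lists of sentences."""
    gold_total = pred_total = correct = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = to_spans(gold), to_spans(pred)
        gold_total += len(g)
        pred_total += len(p)
        correct += len(g & p)  # words whose both boundaries agree
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: two of the four predicted words match the gold segmentation.
gold = [["研究", "生命", "的", "起源"]]
pred = [["研究生", "命", "的", "起源"]]
print(prf(gold, pred))
# → (0.5, 0.5, 0.5)
```

The F1 column is the harmonic mean of the other two columns. For example, the Uni-Gram row on PKU gives 2 × 0.8550 × 0.9342 / (0.8550 + 0.9342) ≈ 0.8928, which agrees with the reported value.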
