Coder Social home page Coder Social logo

lyhiving / wordsegment Goto Github PK

View Code? Open in Web Editor NEW

This project forked from liuhuanyong/wordsegment

0.0 1.0 0.0 14.84 MB

Chinese WordSegment based on algorithms including Maxmatch (forward, backward, bidirectional), HMM,N-gramm(max prob ngram, biward ngam) etc...中文分词算法的实现,包括最大向前匹配、最大向后匹配,最大双向匹配,ngram,HMM,及其性能对比

Python 100.00%

wordsegment's Introduction

WordSegment

Chinese WordSegment based on algorithms including Maxmatch (forward, backward, bidirectional), HMM etc...

项目介绍

1、MaxMatch:
dict.txt: 分词用词典位置
max_forward_cut:正向最大匹配分词
max_backward_cut:逆向最大匹配分词
max_biward_cut:双向最大匹配分词
result:
输入:我们在野生动物园玩
输出:
forward_cutlist: ['我们', '在野', '生动', '物', '园', '玩']
backward_cutlist: ['我们', '在', '野生', '动物园', '玩']
biward_seglit: ['我们', '在', '野生', '动物园', '玩']

2、HMM:
hmm_train.py:基于人民日报语料29W句子,训练初始状态概率,发射概率,转移概率
data:训练语料,放在 ./data/train.txt
model: 保存训练的概率模型,训练完成后可直接调用
trans_path = './model/prob_trans.model'
emit_path = './model/prob_emit.model'
start_path = './model/prob_start.model'

hmm_cut.py:基于训练得到的model,结合viterbi算法进行分词
输入:我们在野生动物园玩
输出:['我们', '在', '野', '生动', '物园', '玩']

3、N-gram
train_ngram.py:基于人民日报语料29W句子,训练词语出现概率,2-gram条件概率
data: 训练语料,放在 ./data/train.txt
model: 保存概率模型,训练完成后可直接调用
word_path = './model/word_dict.model' (词语出现概率)
trans_path = './model/trans_dict.model'(2-gram条件概率)
max_ngram.py: 最大化概率2-gram分词算法
biward_ngram.py: 基于ngram的前向后向最大匹配算法

4、算法比较
1、评测语料:微软评测语料,共3985个句子
2、性能比较

Algorithm Precision Recall F1-score Cost-Time
HMM 0.65 0.75 0.70 4.87
MaxForward 0.76 0.87 0.81 244.14
MaxBackward 0.76 0.87 0.81 280.61
MaxBiWard 0.76 0.87 0.81 443.23
MaxProbNgram 0.76 0.87 0.81 8.99
MaxBiwardNgram 0.74 0.86 0.80 3.96

wordsegment's People

Contributors

liuhuanyong avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.