templeblock / pycorrector

This project forked from shibing624/pycorrector

pycorrector is a toolkit for text error correction. It was developed to facilitate the designing, comparing, and sharing of deep text error correction models.

Home Page: https://shibing624.github.io/pycorrector/

License: Apache License 2.0

Python 100.00%

pycorrector's Introduction

pycorrector

License Apache 2.0

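A Chinese typo correction tool. It corrects phonetically similar and visually similar wrong characters (or variant characters), and can be used to fix errors produced by Chinese pinyin and stroke input methods. Developed in Python 3.

pycorrector locates wrong characters with a language model, then corrects them using pinyin similarity features, stroke and Wubi edit-distance features, and language-model perplexity features.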

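Features

Language model

  • Kenlm (a statistical language model toolkit)
  • RNNLM (stacked bidirectional LSTM language models, implemented in both TensorFlow and PaddlePaddle)

Error detection

  • Character level: using language-model perplexity (ppl), a character whose likelihood is below the average for the sentence is likely to be a typo.
  • Word level: after word segmentation, a word that is not in the dictionary is likely to be a wrong word.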
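The two detection rules can be sketched as follows. This is a minimal illustration, not pycorrector's actual implementation: the function names, the `margin` parameter, the per-character scores, and the toy vocabulary are all assumptions made for the example. In practice the scores would come from a language model such as Kenlm.

```python
from statistics import mean


def detect_suspect_chars(sentence, char_scores, margin=1.0):
    """Character level: flag characters scoring below the sentence average.

    char_scores holds one log-probability-like score per character
    (higher means more likely). `margin` is a hypothetical tolerance.
    """
    if len(sentence) != len(char_scores):
        raise ValueError("need exactly one score per character")
    threshold = mean(char_scores) - margin
    return [(i, ch) for i, (ch, s) in enumerate(zip(sentence, char_scores))
            if s < threshold]


def detect_suspect_words(tokens, vocabulary):
    """Word level: flag segmented words that are missing from the dictionary."""
    suspects = []
    pos = 0
    for tok in tokens:
        if tok not in vocabulary:
            suspects.append((tok, pos, pos + len(tok)))
        pos += len(tok)
    return suspects


sentence = "少先队员因该为老人让坐"
# Made-up scores for illustration only; '该' and '坐' are given low values.
toy_scores = [-2.0, -2.0, -2.0, -2.0, -2.0, -6.0, -2.0, -2.0, -2.0, -2.0, -7.0]
print(detect_suspect_chars(sentence, toy_scores))

# Hypothetical segmentation and dictionary.
toy_tokens = ["少先队员", "因该", "为", "老人", "让", "坐"]
toy_vocab = {"少先队员", "应该", "为", "老人", "让", "坐"}
print(detect_suspect_words(toy_tokens, toy_vocab))
```

The word-level function returns character offsets so that a later correction step can replace the exact span.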

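Error correction

  1. After error detection has located all suspected errors, collect the phonetically similar and visually similar candidates for each suspected error.
  2. Substitute each candidate in turn and rank the resulting sentences with the language model, much as a translation model ranks its hypotheses, to obtain the best correction.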
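The two steps above can be sketched as a substitute-and-rank loop. This is an illustrative sketch under stated assumptions: `rank_candidates`, `toy_score`, and the candidate list are hypothetical, and the toy scorer stands in for a real language model score.

```python
def rank_candidates(sentence, start, end, candidates, score_fn):
    """Try each candidate in place of sentence[start:end]; keep the best rewrite.

    score_fn maps a sentence to a score where higher means more fluent.
    The original sentence is kept unless a candidate scores strictly higher.
    """
    best_sentence, best_score = sentence, score_fn(sentence)
    for cand in candidates:
        rewritten = sentence[:start] + cand + sentence[end:]
        score = score_fn(rewritten)
        if score > best_score:
            best_sentence, best_score = rewritten, score
    return best_sentence, best_score


# Toy stand-in for a language model: rewards a few known-good phrases.
TOY_PHRASE_SCORES = {"应该": 2.0, "让座": 2.0}


def toy_score(sentence):
    return sum(w for phrase, w in TOY_PHRASE_SCORES.items() if phrase in sentence)


# Hypothetical sound-alike candidates for the suspect span [4, 6).
result = rank_candidates("少先队员因该为老人让坐", 4, 6, ["音该", "应该"], toy_score)
print(result)
```

Keeping the original sentence as the baseline means a span is only rewritten when some candidate actually improves the score, which limits over-correction.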

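Thoughts

  1. The current approach achieves decent recall for word-level errors, but correction precision still needs work. Better correction datasets and correction dictionaries would help, though I am hoping more for a breakthrough on the algorithm side.
  2. In addition, text errors today are no longer limited to spelling mistakes at the character and word level. Chinese grammatical error detection (CGED, Chinese Grammar Error Diagnosis) and correction need to be improved; this is on the TODO list for later investigation.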

demo

http://www.borntowin.cn/nlp/corrector.html

Usage

Dependencies

pip3 install -r requirements.txt

pip3 install git+https://www.github.com/keras-team/keras-contrib.git

Installation

  • Automatic installation: pip3 install pycorrector
  • Semi-automatic installation:
git clone https://github.com/shibing624/pycorrector.git
cd pycorrector
python3 setup.py install

Correction

Example:

import pycorrector

corrected_sent, detail = pycorrector.correct('少先队员因该为老人让坐')
print(corrected_sent, detail)

Output:

少先队员应该为老人让座 [[('因该', '应该', 4, 6)], [('坐', '座', 10, 11)]]
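Each entry in the detail list appears to be a (wrong, right, start, end) tuple, with start and end as character offsets into the input sentence. The sketch below re-applies those edits to the original sentence to check that reading against the output shown above. `apply_corrections` is a hypothetical helper written for this example, not part of pycorrector.

```python
def apply_corrections(sentence, detail):
    """Re-apply (wrong, right, start, end) edits to the original sentence."""
    # Flatten the nested list, then edit right to left so earlier offsets stay valid.
    edits = [edit for group in detail for edit in group]
    for wrong, right, start, end in sorted(edits, key=lambda e: e[2], reverse=True):
        if sentence[start:end] != wrong:
            raise ValueError(f"span [{start}:{end}] does not contain {wrong!r}")
        sentence = sentence[:start] + right + sentence[end:]
    return sentence


detail = [[('因该', '应该', 4, 6)], [('坐', '座', 10, 11)]]
print(apply_corrections('少先队员因该为老人让坐', detail))
```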

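Custom language model

The language model is critical to the correction step. The only corpus I have been able to collect so far is the People's Daily data. You can train a better language model on a larger corpus such as Chinese Wikipedia (converted from Traditional to Simplified Chinese; pycorrector.utils provides this conversion), which should give a noticeable improvement in correction quality.

  1. For how to use the kenlm language model training tool, see the blog post: http://blog.csdn.net/mingzai624/article/details/79560063
  2. The training corpus <People's Daily 2014 annotated corpus> is attached. It includes: 1) manually segmented and part-of-speech tagged data, people2014.tar.gz; 2) unsegmented text data, people2014_words.txt; 3) the kenlm character-level language model file and its binary, people2014corpus_chars.arps/klm; 4) the kenlm word-level language model file and its binary, people2014corpus_words.arps/klm.

Netdisk link: https://pan.baidu.com/s/1971a5XLQsIpL0zL0zxuK2A password: uc11. Please respect copyright and credit the source when redistributing.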
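A trained model can be queried from Python through the kenlm module. The sketch below assumes the character-level model was trained on text with characters separated by spaces, and that the binary file is saved locally under the name used in the comment; both are assumptions, so adjust them to match your own training setup. `to_char_tokens` and `load_scorer` are hypothetical helper names.

```python
def to_char_tokens(sentence):
    """Split a sentence into space-separated characters for a character-level model."""
    return " ".join(ch for ch in sentence if not ch.isspace())


def load_scorer(model_path):
    """Return a function mapping a sentence to its log10 probability under the model."""
    import kenlm  # imported lazily so the helper above works without kenlm installed

    model = kenlm.Model(model_path)

    def score(sentence):
        return model.score(to_char_tokens(sentence), bos=True, eos=True)

    return score


print(to_char_tokens("应该 为老人"))

# Example usage (assumes the model file exists at this hypothetical path):
# score = load_scorer("people2014corpus_chars.klm")
# print(score("少先队员应该为老人让座"), score("少先队员因该为老人让坐"))
```

A higher (less negative) score means the model finds the sentence more likely, so a good correction should score above the original.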

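Contributions and optimization points (TODO)

  • Use an RNN language model to improve correction accuracy.
  • Improve the dictionary of visually similar characters to raise correction accuracy for them.
  • Compile a Chinese error correction dataset and use seq2seq to build a deep Chinese correction model.
  • Add Chinese grammatical error detection and correction.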

References

  1. A Chinese error correction system based on a grammar model
  2. Norvig’s spelling corrector
  3. "Chinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape" [Yu, 2013]
  4. "Chinese Spelling Checker Based on Statistical Machine Translation" [Chiu, 2013]
  5. "Chinese Word Spelling Correction Based on Rule Induction" [Yeh, 2014]
  6. "Neural Language Correction with Character-Based Attention" [Ziang Xie, 2016]

pycorrector

Chinese text error correction tool.

pycorrector uses a language model to detect errors, and pinyin and shape features to correct Chinese text errors. It can be used with Chinese pinyin and stroke input methods.

Features

language model

  • Kenlm
  • RNNLM

Usage

install

pip3 install pycorrector

correct

input:

import pycorrector

corrected_sent, detail = pycorrector.correct('少先队员因该为老人让坐')
print(corrected_sent, detail)

output:

少先队员应该为老人让座 [[('因该', '应该', 4, 6)], [('坐', '座', 10, 11)]]

Future work

  1. P(c), the language model. We could create a better language model by collecting more data, and perhaps by using a little English morphology (such as adding "ility" or "able" to the end of a word).

  2. P(w|c), the error model. So far, the error model has been trivial: the smaller the edit distance, the smaller the error. Clearly we could use a better model of the cost of edits. We could get a corpus of spelling errors and count how likely it is to make each insertion, deletion, or alteration, given the surrounding characters.

  3. It turns out that in many cases it is difficult to make a decision based only on a single word. This is most obvious when there is a word that appears in the dictionary, but the test set says it should be corrected to another word anyway: correction('where') => 'where' (123); expected 'were' (452). We can't possibly know that correction('where') should be 'were' in at least one case, but should remain 'where' in other cases. But if the query had been correction('They where going'), then it seems likely that "where" should be corrected to "were".

  4. Finally, we could improve the implementation by making it much faster, without changing the results. We could re-implement in a compiled language rather than an interpreted one. We could cache the results of computations so that we don't have to repeat them multiple times. One word of advice: before attempting any speed optimizations, profile carefully to see where the time is actually going.
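The P(c) and P(w|c) terms above can be combined in a small noisy-channel sketch. Everything below is illustrative: the function names are hypothetical, and the word counts and the (typed, intended) pairs are toy numbers, not measurements from any corpus.

```python
from collections import Counter


def build_language_model(word_counts):
    """P(c): relative frequency of each word in a corpus."""
    total = sum(word_counts.values())
    return lambda c: word_counts[c] / total


def build_error_model(pairs):
    """P(w|c): estimated from (typed, intended) pairs by relative frequency."""
    pair_counts = Counter(pairs)
    intended_counts = Counter(c for _, c in pairs)

    def p_w_given_c(w, c):
        if intended_counts[c] == 0:
            return 0.0
        return pair_counts[(w, c)] / intended_counts[c]

    return p_w_given_c


def best_candidate(word, candidates, p_c, p_w_given_c):
    """Pick the candidate c that maximizes P(c) * P(w|c)."""
    return max(candidates, key=lambda c: p_c(c) * p_w_given_c(word, c))


# Toy data for illustration only.
p_c = build_language_model(Counter({"where": 123, "were": 452}))
p_w_given_c = build_error_model([
    ("where", "were"), ("were", "were"), ("were", "were"),
    ("where", "where"), ("where", "where"), ("where", "where"),
])
print(best_candidate("where", ["where", "were"], p_c, p_w_given_c))
```

With these toy numbers the model always maps "where" to the same answer, whatever the surrounding words are. That is the limitation described in point 3: without context, a single-word model cannot return "were" in one sentence and "where" in another.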

Further Reading

Reference

  1. Norvig’s spelling corrector
  2. Norvig’s spelling corrector(java version)

pycorrector's People

Contributors

shibing624
