LSTM-for-Chinese-Punctuation-Restoration

Chinese sentence segmentation and punctuation restoration, implemented in PyTorch 1.0.

1. Data Preparation

The LSTM takes 200-dimensional Word2Vec word vectors as input, trained on the Chinese Wikipedia corpus.

wiki processor.ipynb

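The training set consists of sentences extracted from Wikipedia, one million words in total. Paragraphs are selected by the following rules: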

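  1. The paragraph contains no Arabic numerals or Latin letters (I did not bother writing regex replacements for them)
  2. The paragraph contains at least one exclamation mark or question mark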
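The two rules above can be sketched as a simple filter. This is a minimal illustration, not the code from wiki processor.ipynb: the function name `keep_paragraph`, the regular expressions, and the sample paragraphs are all assumptions made for the example.

```python
import re

# Rule 1: reject any paragraph containing an ASCII digit or Latin letter.
# Assumption: full-width digits and letters are not checked in this sketch.
_DIGIT_OR_LETTER = re.compile(r"[0-9A-Za-z]")
# Rule 2: require at least one exclamation or question mark.
# Assumption: both ASCII and full-width forms count.
_EXCLAIM_OR_QUESTION = re.compile(r"[!?!?]")


def keep_paragraph(paragraph: str) -> bool:
    """Return True if the paragraph passes both selection rules."""
    if _DIGIT_OR_LETTER.search(paragraph):
        return False
    return _EXCLAIM_OR_QUESTION.search(paragraph) is not None


paragraphs = [
    "今天天气怎么样?",      # kept: has a question mark, no digits or letters
    "他出生于1990年!",      # dropped: contains Arabic numerals
    "这是一个普通的句子。",  # dropped: no exclamation or question mark
]
selected = [p for p in paragraphs if keep_paragraph(p)]
```

Dropping whole paragraphs, rather than rewriting digits and letters, keeps the filter trivial at the cost of discarding usable text.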

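The selected paragraphs are concatenated in random order and then split evenly into ten thousand word-vector sequences, with the corresponding punctuation labels saved alongside. Each word's label is a six-dimensional one-hot array: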

  • 0:'',
  • 1:',',
  • 2:'。',
  • 3:'!',
  • 4:'?',
  • 5:'、'
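The labeling scheme above could be produced along the following lines. This is a hedged sketch, not the repository's code: it assumes the text is already tokenized, that each punctuation mark labels the word immediately before it, and that the marks are spelled exactly as in the list above (swap in full-width forms if your corpus uses them). The helper names `tokens_to_labels`, `one_hot`, and `split_evenly` are hypothetical.

```python
# Class index -> punctuation mark, copied from the list above (0 = no mark).
ID_TO_PUNCT = {0: "", 1: ",", 2: "。", 3: "!", 4: "?", 5: "、"}
PUNCT_TO_ID = {mark: idx for idx, mark in ID_TO_PUNCT.items() if mark}


def tokens_to_labels(tokens):
    """Split a token stream into words and per-word class indices.

    Assumption: a punctuation token labels the word right before it;
    words with no following mark get class 0.
    """
    words, labels = [], []
    for token in tokens:
        if token in PUNCT_TO_ID:
            if labels:  # ignore a mark that has no preceding word
                labels[-1] = PUNCT_TO_ID[token]
        else:
            words.append(token)
            labels.append(0)
    return words, labels


def one_hot(label, num_classes=6):
    """Encode a class index as a six-dimensional one-hot list."""
    row = [0] * num_classes
    row[label] = 1
    return row


def split_evenly(items, num_chunks):
    """Cut items into num_chunks equal-length pieces, dropping any remainder."""
    size = len(items) // num_chunks
    return [items[i * size:(i + 1) * size] for i in range(num_chunks)]


tokens = ["你", "好", ",", "世界", "!", "真的", "吗", "?"]
words, labels = tokens_to_labels(tokens)
targets = [one_hot(label) for label in labels]
```

With one million words split into ten thousand sequences, `split_evenly` would yield sequences of 100 words each; in practice the words would be replaced by their 200-dimensional vectors before training.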

2. Training

cyberpunc_trainer.ipynb & train_tlstm.py

The prepared training data and labels are not provided. Remember to update the paths to the word vectors (.model) and the model weights (.pkl)!

3. Using without Training

cyberpunc.py & cyberpunc_notebook.ipynb

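These can be run directly; the LSTM state_dict is included in the repository. Remember to update the paths to the word vectors (.model) and the model weights (.pkl)!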

(The word vector file is 155.9 MB and cannot be uploaded. Can anyone suggest a solution?)
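At inference time, the per-word predictions have to be turned back into punctuated text. The sketch below shows one way to do that, assuming the model emits six scores per word in the class order from section 1; the function `restore_punctuation` and the hand-written scores are illustrative stand-ins, not the output of cyberpunc.py.

```python
# Class index -> punctuation mark, in the order given in section 1.
ID_TO_PUNCT = {0: "", 1: ",", 2: "。", 3: "!", 4: "?", 5: "、"}


def argmax(scores):
    """Return the index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def restore_punctuation(words, scores_per_word):
    """Append the highest-scoring mark (possibly none) after each word."""
    pieces = []
    for word, scores in zip(words, scores_per_word):
        pieces.append(word + ID_TO_PUNCT[argmax(scores)])
    return "".join(pieces)


# Hypothetical scores standing in for real model output.
words = ["你", "好", "世界"]
scores = [
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0],  # class 0: no punctuation
    [0.2, 0.7, 0.0, 0.1, 0.0, 0.0],  # class 1: comma
    [0.1, 0.0, 0.2, 0.7, 0.0, 0.0],  # class 3: exclamation mark
]
restored = restore_punctuation(words, scores)
```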

4. Examples

我想做我的毕业设计,还有一堆论文要写啊,我要死了!

你们这个是什么?群啊,你们这是害人不浅啊?你们这个群麻烦你们真的太过分了。你们搞这个群干什么?我儿子每一科的成绩,都不过那个平均分啊!他现在初二你叫我儿子怎么办?啊,他还不到高中啊,好不好,你们这是什么群啊,你们害死我,儿子了,谁是群主快点出来你们群主再不出来,我去报警了?啊?我跟你们说你们这一帮人啊,一天到晚啊,搞这些什么游戏啊,动漫啊,会害死你们的啊?你们没有前途我跟你们说你们这四百多个人好好学习不好吗?一天到晚在这上网有什么意思,有什么意思?啊,麻烦你们重视一下你们生活的目标好不好?有一点学习目标,好不好?一天到晚上网是不是人?啊,你们一天到晚上网

没了,心,如何相配?盘铃声清脆帷幕间灯火葳蕤,我和你最天生一对

日日重复同样的事情,遵循着与昨日相同的惯例若能避开猛烈的狂喜,自然也不会有悲痛来袭,胆小鬼连幸福都会害怕摸棉花都会刺伤

我去年进的看守所,看守所,里面的人个个都是人才,说话有好听,诶哟!超喜欢在里面的

TODO

  • Data-parallel training
  • Upgrade the model to a BiLSTM-CRF

lstm-for-chinese-punctuation-restoration's People

Contributors

alvinisonomia

lstm-for-chinese-punctuation-restoration's Issues

Could you provide a download link for the word vectors?

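Thank you for providing this repo. I used my own word vectors and found that the results were poor, so I would like to try the author's word vectors. Would it be possible for you to share them?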
