
geeker9 / chinese-poetry


This project forked from yonggie/chinese-poetry


A dataset of Chinese poetry, still under construction: a classical Chinese poetry dataset with nearly 14,000 poets, 107,891 Tang poems, and 275,581 Song ci.

License: Apache License 2.0

Python 100.00%

chinese-poetry's Introduction

chinese-poetry

Chinese-Poetry: a dataset of classical Chinese poetry and literary collections

This project is forked from: https://github.com/chinese-poetry/chinese-poetry

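Main contributions of this project:

  • Converted all text to Simplified Chinese;
  • Cleaned up stray characters such as 「」;
  • Tidied and unified the directories and revised Chinese characters, for easier understanding and uniform programmatic processing;
  • Unified the key-value pairs, to make it easier to build machine learning datasets.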
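The stray-character cleanup and key unification listed above can be sketched roughly as follows. This is a minimal illustration and not the project's actual script: the helper names, the set of characters to strip, and the source-key mapping (`rhythmic`, `paragraphs`) are all assumptions. Conversion to Simplified Chinese is not shown, since it would need a separate conversion library.

```python
# Hypothetical cleanup helpers; names and mappings are assumptions for illustration.

# Characters to remove. The README names 「」 as examples of stray characters.
STRAY_CHARS = "「」"
_STRIP_TABLE = str.maketrans("", "", STRAY_CHARS)

# Assumed mapping from source-specific keys to the unified keys.
KEY_MAP = {"rhythmic": "title", "paragraphs": "content"}


def strip_stray_chars(text):
    """Remove every character listed in STRAY_CHARS from text."""
    return text.translate(_STRIP_TABLE)


def unify_record(record):
    """Return a copy of record with unified keys and cleaned string values."""
    unified = {}
    for key, value in record.items():
        new_key = KEY_MAP.get(key, key)  # unknown keys are kept unchanged
        if isinstance(value, str):
            value = strip_stray_chars(value)
        elif isinstance(value, list):
            value = [strip_stray_chars(v) if isinstance(v, str) else v for v in value]
        unified[new_key] = value
    return unified
```

Using `str.translate` with a deletion table keeps the cleanup to a single pass per string, and extending `STRAY_CHARS` is enough to cover further characters.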

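The dataset contains 107,891 Tang poems, 275,581 Song poems, and other classical collections. The poets include nearly 14,000 classical poets from the Tang and Song dynasties and about 1,500 ci poets from the two Song periods.

All text has been converted to Simplified Chinese, and the data and documentation have been simplified.

It is suitable for building classical Chinese poetry datasets and for machine learning.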

Usage

Cloning with depth=1 is much faster; otherwise most of the time is spent downloading git's commit pack files, which is slow.

git clone xxx --depth=1

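Dataset format

Records are unified JSON key-value pairs. The text is generally stored under the two keys title and content; some records carry additional information such as author and section.

In the Yuan qu texts, the Chinese quotation marks “” have been uniformly replaced with a different symbol.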
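A minimal sketch of reading one file in this format is shown below. It assumes that each JSON file holds either a list of records or a single record, and that `content` may be a string or a list of lines. The README does not confirm either detail, and the function name `load_poems` is hypothetical.

```python
import json


def load_poems(json_path):
    """Load one JSON file and return a list of (title, content) pairs."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)

    # Assumption: a file holds either a list of records or a single record.
    records = data if isinstance(data, list) else [data]

    poems = []
    for rec in records:
        if not isinstance(rec, dict):
            continue  # skip anything that is not a key-value record
        title = rec.get("title", "")
        content = rec.get("content", "")
        # Assumption: content may be stored as a list of lines.
        if isinstance(content, list):
            content = "\n".join(str(line) for line in content)
        poems.append((title, content))
    return poems
```

Records that lack `title` or `content` still yield a pair, with an empty string in the missing position, so callers can filter them as needed.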

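Scale

Current rough estimate (not deduplicated):

  • Total dataset size: 396,242 items (poems / ci / prose)
  • Tang poems: 107,891
  • Song ci: 275,581
  • Qianjiashi (千家诗): 226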
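Since the figures above are not deduplicated, a rough count with and without duplicates could be obtained along these lines. This is only a sketch: the function name `count_records` is hypothetical, and treating two records as duplicates when their `title` and `content` match exactly is an assumption, not the project's definition.

```python
import json
from pathlib import Path


def count_records(root):
    """Return (total, unique) record counts over all *.json files under root."""
    total = 0
    seen = set()
    for path in sorted(Path(root).rglob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that are not valid UTF-8 JSON
        records = data if isinstance(data, list) else [data]
        for rec in records:
            if not isinstance(rec, dict):
                continue
            total += 1
            content = rec.get("content", "")
            if isinstance(content, list):
                content = "\n".join(str(x) for x in content)
            # Assumed duplicate key: exact match on title and content.
            seen.add((str(rec.get("title", "")), str(content)))
    return total, len(seen)
```

Calling `count_records(".")` from the repository root would return both numbers in one pass over the files.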

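Sponsorship

This project aims to build a Chinese poetry dataset that is convenient to use for machine learning. It builds on other people's projects, standing on the shoulders of giants. More maintainers are welcome, and you can contribute in the following ways:

  • Submit a PR directly, or discuss through an issue.

  • Make a one-time donation through the Alipay or WeChat reward code (leave your email address in the note).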


Machine learning examples that use this poetry data

Data source: https://github.com/chinese-poetry/chinese-poetry

License

Apache

chinese-poetry's People

Contributors

jackeygao, fleetingwang, hongzhiw, o70, akakaras, liuxsdev, snowtraces, rainrambler, zhongwencm, yonggie, kc910521, chienmy, xinglie, oldpotter, zgjie, china-longyin, breakstring, bluesword12350, chinainfant, rustingsword, zhangtemplar, gt-zhangacer, yuxiang-gao, zlvalien, wptoux, ayayagit, bit-fan, jigsawk, cfeibiao, oud5
