
egret-wenda-corpus's Introduction

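Important notice

For training machine learning models, evaluating algorithms, and discussion, a higher-quality corpus is now available: 机器学习保险行业问答开放数据集 (an open insurance-industry question-answering dataset for machine learning).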


Egret Wenda Corpus

Chinese question-answering corpus

QA Corpus, based on egret bbs.

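In machine learning, training a question-answering bot usually requires high-quality data. For English there are many large corpora; for Chinese, very little is publicly available. While studying, I came across the Ubuntu Dialogue Corpus, which inspired me to mine data from technical communities and build a corpus.

The current version of the corpus is compiled from the 10,000+ questions in the Q&A board of the official Egret (白鹭时代) forum, keeping only the records marked with a "best answer".

  • Use a crawler to store the target data in a database
  • Generate raw data from the database
  • Manually review the raw data and give each question one acceptable answer.

The corpus currently contains 2907 question-answer pairs. Although it is small, it may be enough for a single vertical domain.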

DESCRIPTION

In all files the field separator is " +++$+++ "

egret_wenda_lines.txt

- contains the actual text of each utterance
- fields:
	- lineID
	- person id (who uttered this phrase)
	- text of the utterance
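
A minimal sketch of reading this file in Python, assuming the three fields appear on each row in the order listed above (not confirmed against the data). The names `SEP`, `parse_line`, and `load_lines` are illustrative and not part of the corpus.

```python
from typing import Dict, Iterable, Tuple

SEP = " +++$+++ "  # field separator stated in the description


def parse_line(raw: str) -> Tuple[str, str, str]:
    """Split one record into (lineID, person id, text).

    Assumes exactly three fields in the documented order.
    """
    # maxsplit=2 keeps any separator-like text inside the utterance intact
    line_id, person_id, text = raw.rstrip("\n").split(SEP, 2)
    return line_id, person_id, text


def load_lines(rows: Iterable[str]) -> Dict[str, str]:
    """Map lineID -> utterance text, skipping blank rows."""
    table: Dict[str, str] = {}
    for raw in rows:
        if raw.strip():
            line_id, _person_id, text = parse_line(raw)
            table[line_id] = text
    return table
```

Because `load_lines` accepts any iterable of strings, it can be passed an open file handle directly, for example one opened with `open("egret_wenda_lines.txt", encoding="utf-8")` (the UTF-8 encoding is also an assumption).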

egret_wenda_conversations.txt

- the structure of the conversations
- fields:
	- conversationId
	- person id of the first participant in the conversation
	- person id of the second participant in the conversation
	- date of the post
	- source URL of this conversation
	- list of the utterances that make up the conversation, in chronological
		order: ['Question lineID','Answer lineID']
		has to be matched with egret_wenda_lines.txt to reconstruct the actual content
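
A rough sketch of how a conversation record might be joined back to its utterances, assuming each row carries the six fields in the order listed above and that the last field is written as a Python-style list literal, as the `['Question lineID','Answer lineID']` notation suggests. The names `parse_conversation` and `qa_pair` are hypothetical helpers, not part of the corpus.

```python
import ast
from typing import Any, Dict, Tuple

SEP = " +++$+++ "  # field separator stated in the description


def parse_conversation(raw: str) -> Dict[str, Any]:
    """Split one conversation record into named fields.

    Assumes six fields in the documented order.
    """
    fields = raw.rstrip("\n").split(SEP)
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    conv_id, first_id, second_id, date, url, id_list = fields
    return {
        "conversation_id": conv_id,
        "first_person_id": first_id,
        "second_person_id": second_id,
        "date": date,
        "url": url,
        # literal_eval parses the list text without executing arbitrary code
        "line_ids": ast.literal_eval(id_list),
    }


def qa_pair(conv: Dict[str, Any], lines: Dict[str, str]) -> Tuple[str, str]:
    """Look up the question and answer text for one conversation.

    `lines` maps lineID -> utterance text from egret_wenda_lines.txt.
    """
    question_id, answer_id = conv["line_ids"]
    return lines[question_id], lines[answer_id]
```

The `lines` mapping is whatever lookup table was built from egret_wenda_lines.txt; a missing lineID raises `KeyError`, which makes inconsistencies between the two files visible early.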

What's more

The data in raw is the raw data from the BBS.

To make it more suitable for training, I personally reviewed the raw data and modified some utterances, for example by deleting code from them.

processer.js

Generates the raw data from the data collection, which is built from Egret问答专区 (the Egret Q&A board).

Tips

NOTE: If you have results to report on this corpus, please send an email to [email protected] so I can add you to the list of people using this data.

Thanks!

egret-wenda-corpus's People

Contributors

hailiang-wang
