voidf / parallel_corpus_mnbvc

This project forked from liyongsea/parallel_corpus_mnbvc

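Archived backup of the parallel corpus group's legacy code, which is no longer stored locally. This repository is not used for PRs; submit PRs directly to a branch of the source repository.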

License: Apache License 2.0

Python 69.62% Jupyter Notebook 30.38%

parallel_corpus_mnbvc's Introduction

parallel_corpus_mnbvc

Parallel corpus dataset from the MNBVC project.

Install the requirements

pip install -r requirements.txt

Output JSONL format

Each file is represented by a JSON object with the following structure:

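{
    '文件名': '文件.txt',      # file name
    '是否待查文件': False,     # whether the file is flagged for review
    '是否重复文件': False,     # whether the file is a duplicate
    '段落数': 0,               # number of paragraphs
    '去重段落数': 0,           # number of deduplicated paragraphs
    '低质量段落数': 0,         # number of low-quality paragraphs
    '段落': []                 # list of paragraph objects
}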
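As a minimal sketch of how such a record might be serialized, the example below builds a file-level object with the keys listed above and writes it as one JSON line. The helper names `make_file_record` and `to_jsonl_line` are hypothetical and do not come from the repository, and setting '段落数' to the length of the paragraph list is an assumption. The other counters are left at their default of 0 because the README does not describe how they are computed.

```python
import json


def make_file_record(filename, paragraphs=None):
    """Build a file-level record (hypothetical helper, not from the repository)."""
    paragraphs = list(paragraphs or [])
    return {
        '文件名': filename,
        '是否待查文件': False,
        '是否重复文件': False,
        # Assumption: the paragraph count equals the length of the list.
        '段落数': len(paragraphs),
        '去重段落数': 0,
        '低质量段落数': 0,
        '段落': paragraphs,
    }


def to_jsonl_line(record):
    """Serialize one record as a single JSON line, keeping non-ASCII keys readable."""
    return json.dumps(record, ensure_ascii=False)


record = make_file_record('文件.txt')
line = to_jsonl_line(record)
print(line)
```

Note that the structure in the README is written in Python notation (single quotes, `False`). After serialization with `json.dumps`, the line uses double quotes and `false`, as valid JSON requires.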

Each line is treated as one paragraph. A paragraph is a JSON object with the following structure:

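{
    '行号': line_number,             # line number
    '是否重复': False,               # whether the paragraph is a duplicate
    '是否跨文件重复': False,         # whether the paragraph is duplicated across files
    'zh_text_md5': zh_text_md5,      # MD5 of zh_text
    'zh_text': zh_text,              # Chinese
    'en_text': en_text,              # English
    'ar_text': ar_text,              # Arabic
    'nl_text': nl_text,              # Dutch
    'de_text': de_text,              # German
    'eo_text': eo_text,              # Esperanto
    'fr_text': fr_text,              # French
    'he_text': he_text,              # Hebrew
    'it_text': it_text,              # Italian
    'ja_text': ja_text,              # Japanese
    'pt_text': pt_text,              # Portuguese
    'ru_text': ru_text,              # Russian
    'es_text': es_text,              # Spanish
    'sv_text': sv_text,              # Swedish
    'ko_text': ko_text,              # Korean
    'th_text': th_text,              # Thai
    'other1_text': other1_text,      # other language 1
    'other2_text': other2_text,      # other language 2
}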
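The following sketch shows one way a paragraph object with these keys could be assembled. `make_paragraph_record` and `LANG_KEYS` are hypothetical names, not part of the repository. The README does not say how `zh_text_md5` is computed or how missing languages are represented, so the UTF-8 encoding, the hexadecimal digest, and the empty-string default are all assumptions.

```python
import hashlib

# Language fields in the order they appear in the paragraph structure.
LANG_KEYS = [
    'zh_text', 'en_text', 'ar_text', 'nl_text', 'de_text', 'eo_text',
    'fr_text', 'he_text', 'it_text', 'ja_text', 'pt_text', 'ru_text',
    'es_text', 'sv_text', 'ko_text', 'th_text', 'other1_text', 'other2_text',
]


def make_paragraph_record(line_number, texts):
    """Build a paragraph record from a dict of available translations (hypothetical helper)."""
    zh_text = texts.get('zh_text', '')
    record = {
        '行号': line_number,
        '是否重复': False,
        '是否跨文件重复': False,
        # Assumption: MD5 over the UTF-8 bytes of zh_text, as a hex string.
        'zh_text_md5': hashlib.md5(zh_text.encode('utf-8')).hexdigest(),
    }
    # Assumption: languages without a translation get an empty string.
    for key in LANG_KEYS:
        record[key] = texts.get(key, '')
    return record


para = make_paragraph_record(1, {'zh_text': '你好', 'en_text': 'Hello'})
print(para['zh_text_md5'], para['en_text'])
```

A list of such records would go into the '段落' field of the file-level object.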

Contributors

genggui001, hayesyang, liyongsea, voidf, wzixiao
