

NLPCC2016-WordSeg-Weibo

NLPCC 2016 shared task on Weibo (micro-blog) word segmentation

Description of the Task

The word is the fundamental unit in natural language understanding. However, Chinese sentences consist of continuous Chinese characters without natural delimiters. Chinese word segmentation is therefore the first task in Chinese natural language processing: it identifies the sequence of words in a sentence and marks the boundaries between them.
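As an illustration of "marking the boundaries between words", segmentation is often cast as tagging each character with its position in a word. The sketch below uses the common B/M/E/S scheme (begin, middle, end, single); this scheme is an assumption chosen for illustration and is not prescribed by the shared task, and the example sentence is made up.

```python
def words_to_tags(words):
    """Map a list of words to one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty strings
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # B = begin, M = middle, E = end of a multi-character word
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the word sequence from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word boundary follows this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


# Illustrative sentence (not taken from the dataset).
words = ["我", "喜欢", "自然语言", "处理"]
tags = words_to_tags(words)
print("".join(tags))
print(tags_to_words("".join(words), tags))
```

The two functions are inverses on well-formed input, so a tagger's output can be converted back into a word sequence for scoring.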

Unlike the widely used news datasets, we use more informal texts from Sina Weibo. The training and test data consist of micro-blogs on various topics, such as finance, sports, and entertainment.

Each participant may submit three runs: a closed-track run, a semi-open-track run, and an open-track run.

  1. In the closed track, participants may use only information found in the provided training data. Information such as externally obtained word counts, part-of-speech information, or name lists is excluded.
  2. In the semi-open track, participants may use information extracted from the provided background data in addition to the provided training data. Information such as externally obtained word counts, part-of-speech information, or name lists is excluded.
  3. In the open track, participants may use any information that is public and easily obtained. However, obtaining results through manual labeling or crowdsourcing is not allowed.

Data

The data are collected from Sina Weibo. Both the training and test files are UTF-8 encoded. Besides the training data, we also provide the background data, from which the training and test data are drawn. The background data is provided so that participants can derive more sophisticated features in an unsupervised way.
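A minimal sketch of reading such a file follows. The file layout is an assumption, not something stated here: one micro-blog per line, with words separated by whitespace. The path passed to `load_segmented` is hypothetical; check the files you receive before relying on this.

```python
def parse_segmented(lines):
    """Turn an iterable of text lines into one list of words per non-empty line."""
    sentences = []
    for line in lines:
        words = line.split()  # assumed: whitespace separates words
        if words:
            sentences.append(words)
    return sentences


def load_segmented(path):
    """Read a UTF-8 encoded file and parse it with parse_segmented."""
    with open(path, encoding="utf-8") as f:
        return parse_segmented(f)


# In-memory demonstration with made-up lines; a real call would look like
# load_segmented("train.txt"), where "train.txt" is a hypothetical file name.
sample = ["今天 天气 不错\n", "\n", "我 喜欢 微博\n"]
print(parse_segmented(sample))
```

Passing `encoding="utf-8"` explicitly avoids depending on the platform's default encoding.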

Download

The dataset provides a standard training/dev/test split. Researchers interested in the dataset should download and fill out the Agreement Form, then send a scanned copy to Xipeng Qiu ([email protected]; email subject: Fudan Micro-blog Dataset data request).

If you use this dataset in a paper, please send us a usage agreement: sign it, scan it, and send the scan to [email protected]. For requests written in Chinese, the email subject is 复旦微博数据集申请.

Evaluation Metric

Instead of relying only on the standard precision, recall, and F1-score, we provide a new evaluation metric this year. Details can be found in http://aclweb.org/anthology/P/P16/P16-1206.pdf .
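For reference, the standard precision, recall, and F1-score can be computed by comparing word spans, as in the sketch below. This is not the new metric, which is defined in the paper linked above. The function names and the example sentences are illustrative assumptions.

```python
def to_spans(words):
    """Convert a word list to a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        if end > start:  # skip empty strings
            spans.add((start, end))
        start = end
    return spans


def prf(gold_words, pred_words):
    """Word-level precision, recall, and F1 for one sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words with both boundaries right
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the prediction splits one gold word in two.
gold = ["我", "喜欢", "自然语言", "处理"]
pred = ["我", "喜欢", "自然", "语言", "处理"]
p, r, f = prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")  # P=0.600 R=0.750 F1=0.667
```

A predicted word counts as correct only when both its start and end offsets match a gold word, so every gold word is weighted equally regardless of how hard it is to segment.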

Papers

  1. Peng Qian, Xipeng Qiu, Xuanjing Huang, A New Psychometric-inspired Evaluation Metric for Chinese Word Segmentation, In Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL), 2016.
  2. Xipeng Qiu, Peng Qian, Zhan Shi, Overview of the NLPCC-ICCPOL 2016 Shared Task: Chinese Word Segmentation for Micro-blog Texts, In Proceedings of The Fifth Conference on Natural Language Processing and Chinese Computing & The Twenty Fourth International Conference on Computer Processing of Oriental Languages, 2016.

Citation

If you use this dataset in a paper, please cite the following reference.

@InProceedings{qiu2016overview,
  Title                    = {Overview of the {NLPCC-ICCPOL} 2016 Shared Task: Chinese Word Segmentation for Micro-blog Texts},
  Author                   = {Xipeng Qiu and Peng Qian and Zhan Shi},
  Booktitle                = {Proceedings of The Fifth Conference on Natural Language Processing and Chinese Computing \& The Twenty Fourth International Conference on Computer Processing of Oriental Languages},
  Year                     = {2016}
}

Contact Information

For any questions about this shared task, please contact:

Xipeng Qiu
Group of NLP & DL, School of Computer Science, Fudan University
Email: [email protected]
