ml2457 / multi-criteria-cws

This project forked from hankcs/multi-criteria-cws

Simple Solution for Multi-Criteria Chinese Word Segmentation

Home Page: http://www.hankcs.com/nlp/segment/multi-criteria-cws.html

License: GNU General Public License v3.0

Python 89.91% Shell 0.63% Perl 9.46%

multi-criteria-cws

Code and corpora for the paper "Effective Neural Solution for Multi-Criteria Word Segmentation" (accepted & forthcoming at SCI-2018).

Dependency

Quick Start

Run the following command to prepare the corpora and split them into train/dev/test sets:

python3 convert_corpus.py 
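
As a rough illustration of what a train/dev/test split involves, the sketch below shuffles a list of sentences and slices it by ratio. The function name, the ratios, and the seed are assumptions made for this example; convert_corpus.py may split differently, for instance by reusing official test sets.

```python
import random


def split_corpus(sentences, dev_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle sentences and split them into train/dev/test lists.

    The ratios and the seed are illustrative assumptions, not values
    taken from convert_corpus.py.
    """
    shuffled = list(sentences)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_ratio)
    n_test = int(len(shuffled) * test_ratio)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


if __name__ == "__main__":
    corpus = ["sentence %d" % i for i in range(100)]
    train, dev, test = split_corpus(corpus)
    print(len(train), len(dev), len(test))
```

Fixing the seed matters when scores are compared across runs, since a different shuffle yields a different dev set.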

Then convert a corpus $dataset into a pickle file:

./script/make.sh $dataset
  • $dataset can be one of the following corpora: pku, msr, as, cityu, sxu, ctb, zx, cnc, udc and wtb.
  • $dataset can also be a joint corpus like joint-sighan2005 or joint-10in1.
  • If you have access to sighan2008 corpora, you can also make joint-sighan2008 as your $dataset.
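
One way a joint corpus can keep sentences from different segmentation criteria apart is to wrap each sentence in tokens that name its source corpus. The sketch below assumes that scheme; `tag_sentence`, `make_joint`, and the `<pku>` ... `</pku>` token format are hypothetical names chosen for illustration, not the repository's actual code.

```python
def tag_sentence(words, dataset):
    """Wrap a segmented sentence in opening/closing criterion tokens."""
    return ["<%s>" % dataset] + list(words) + ["</%s>" % dataset]


def make_joint(corpora):
    """Merge {dataset_name: [sentence, ...]} into one list of tagged sentences."""
    joint = []
    for dataset, sentences in corpora.items():
        for words in sentences:
            joint.append(tag_sentence(words, dataset))
    return joint


if __name__ == "__main__":
    # The same string segmented under two different (made-up) criteria.
    corpora = {
        "pku": [["北京", "大学"]],
        "msr": [["北京大学"]],
    }
    for sentence in make_joint(corpora):
        print(" ".join(sentence))
```

With such markers, a single model can be trained on all corpora at once and told at test time which criterion to follow.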

Finally, one command performs both training and testing on the fly:

./script/train.sh $dataset

Performance

sighan2005

[Image: sighan2005 scores]

sighan2008

[Image: sighan2008 scores]

10-in-1

Since the SIGHAN bakeoff 2008 datasets are proprietary and difficult to obtain, we decided to conduct additional experiments on freely available datasets, so that the public can test and verify the efficiency of our method. We applied our solution to 6 additional freely available datasets together with the 4 sighan2005 datasets.

[Image: 10-in-1 scores]

Corpora

This section briefly introduces the corpora used in the paper.

10 corpora in this repo

These 10 corpora come from the official sighan2005 website, from open-source projects, or from researchers' homepages. Their licenses are listed in the following table.

[Image: license table]

sighan2008

As the sighan2008 corpora are proprietary, we are unable to distribute them. If you have a legal copy, you can replicate our scores by following these instructions.

First, link the sighan2008 data into the data folder of this project:

ln -s /path/to/your/sighan2008/data data/sighan2008

Then, use HanLP to convert Traditional Chinese to Simplified Chinese, as shown in the following Java code snippet:

        // Read the UTF-16 source file.
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(
            "data/sighan2008/ckip_seg_truth&resource/ckip_truth_utf16.seg"
        ), "UTF-16"));
        String line;
        // Write the converted text to a sibling *_utf8.seg file (IOUtil is HanLP's I/O helper).
        BufferedWriter bw = IOUtil.newBufferedWriter(
            "data/sighan2008/ckip_seg_truth&resource/ckip_truth_utf8.seg");
        while ((line = br.readLine()) != null)
        {
            for (String word : line.split("\\s"))
            {
                // Skip empty tokens produced by consecutive whitespace.
                if (word.length() == 0) continue;
                // Convert each word from Traditional to Simplified Chinese.
                bw.write(HanLP.convertToSimplifiedChinese(word));
                bw.write(" ");
            }
            bw.newLine();
        }
        br.close();
        bw.close();

You need to repeat this for the following 4 files:

  1. ckip_train_utf16.seg
  2. ckip_truth_utf16.seg
  3. cityu_train_utf16.seg
  4. cityu_truth_utf16.seg
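
Repeating the step for all four files can be scripted. The Python sketch below mirrors the structure of the Java snippet (read UTF-16, convert word by word, write UTF-8), but `convert_file`, `utf8_name`, and the `convert_word` hook are hypothetical names introduced here. The default hook is the identity function, so the actual Traditional-to-Simplified conversion still has to be supplied, for example by HanLP. Directories are left to the caller.

```python
import io


def convert_file(src_path, dst_path, convert_word=lambda w: w):
    """Re-encode a UTF-16 segmented file as UTF-8, converting each word.

    convert_word is a placeholder hook; the identity default performs no
    Traditional-to-Simplified conversion.
    """
    with io.open(src_path, encoding="utf-16") as src, \
            io.open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # split() with no argument drops empty tokens between spaces.
            words = [convert_word(w) for w in line.split()]
            dst.write(" ".join(words) + "\n")


def utf8_name(utf16_name):
    """Derive the output name, e.g. ckip_truth_utf16.seg -> ckip_truth_utf8.seg."""
    return utf16_name.replace("_utf16.seg", "_utf8.seg")


# The four files listed above.
FILES = [
    "ckip_train_utf16.seg",
    "ckip_truth_utf16.seg",
    "cityu_train_utf16.seg",
    "cityu_truth_utf16.seg",
]

if __name__ == "__main__":
    for name in FILES:
        print(name, "->", utf8_name(name))
```

Note that Python's `utf-16` codec relies on a byte order mark to detect endianness, while Java's "UTF-16" charset assumes big-endian when none is present, so files without a BOM may need an explicit `utf-16-be` or `utf-16-le`.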

Then, uncomment the following code in convert_corpus.py:

    # For researchers who have access to sighan2008 corpus, use official corpora please.
    print('Converting sighan2008 Simplified Chinese corpus')
    datasets = 'ctb', 'ckip', 'cityu', 'ncc', 'sxu'
    convert_all_sighan2008(datasets)
    print('Combining those 8 sighan corpora to one joint corpus')
    datasets = 'pku', 'msr', 'as', 'ctb', 'ckip', 'cityu', 'ncc', 'sxu'
    make_joint_corpus(datasets, 'joint-sighan2008')
    make_bmes('joint-sighan2008')
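
The name `make_bmes` suggests the common BMES tagging scheme, in which every character is labeled B (begin), M (middle), E (end), or S (single-character word). The sketch below shows that scheme on a list of segmented words; `to_bmes` is a hypothetical helper written for illustration and is not the repository's `make_bmes`.

```python
def to_bmes(words):
    """Return (character, tag) pairs for a list of segmented words."""
    pairs = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            for ch in word[1:-1]:
                pairs.append((ch, "M"))
            pairs.append((word[-1], "E"))
    return pairs


if __name__ == "__main__":
    for ch, tag in to_bmes(["我", "喜欢", "自然语言"]):
        print(ch, tag)
```

Casting segmentation as per-character tagging is what lets a sequence labeling model recover word boundaries from its predicted tags.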

Finally, you are ready to go:

python3 convert_corpus.py
./script/make.sh joint-sighan2008
./script/train.sh joint-sighan2008

Acknowledgments

  • Thanks to the friends who helped us with the experiments.
  • Credit should also be given to the generous researchers who shared their corpora with the public, as listed in the license table. Your datasets were a real help to small groups (like us) without any funding.
  • Model implementation modified from a Dynet-1.x version by rguthrie3.
