Tokyo Metropolitan University Paraphrase Corpus (TMUP)

TMUP is an evaluation corpus for Japanese paraphrase identification. It consists of 655 sentence pairs in total.

  • 363 paraphrase sentence pairs
  • 292 non-paraphrase sentence pairs

Candidate Acquisition Method

To acquire both paraphrase and non-paraphrase instances, we

  • generated sentence pairs with Google's phrase-based (PBMT) and neural (NMT) machine translation systems to acquire paraphrases
  • extracted sentence pairs from Japanese Wikipedia to acquire non-paraphrases

To acquire both trivial and non-trivial instances, we

  • calculated the word overlap rate (Jaccard score) of each sentence pair and sampled candidates uniformly across the range of scores
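
The overlap-based sampling above can be sketched roughly as follows. This is a minimal illustration, not the procedure used to build TMUP: it assumes each sentence has already been split into a list of words, and the function names, the number of bins (`n_bins`), and the per-bin quota (`per_bin`) are all hypothetical choices. Please refer to the paper for the actual settings.

```python
import random
from collections import defaultdict


def jaccard(tokens_a, tokens_b):
    """Word overlap rate: size of the shared vocabulary over the combined vocabulary."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sentences count as no overlap
    return len(set_a & set_b) / len(union)


def sample_uniformly(pairs, per_bin, n_bins=10, seed=0):
    """Draw up to per_bin pairs from each equal-width Jaccard bin (hypothetical scheme)."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for tokens_a, tokens_b in pairs:
        score = jaccard(tokens_a, tokens_b)
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append((tokens_a, tokens_b))
    sampled = []
    for index in sorted(bins):
        candidates = bins[index]
        sampled.extend(rng.sample(candidates, min(per_bin, len(candidates))))
    return sampled
```

Drawing the same number of candidates from every bin keeps high-overlap (trivial) pairs from dominating the sample, which is the stated goal of this step.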

Annotation

Two annotators judged whether each candidate pair is a paraphrase.

*For more details, please refer to the paper.

Data Format

label <TAB> sentence_A_ja <TAB> sentence_B_ja <TAB> source_sentence_en (if applicable)

Labels

  • 1: Paraphrase
  • 0: Non-paraphrase
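
A line in this format could be read along the following lines. This is a sketch under assumptions: the file is taken to be UTF-8 encoded, the fourth column is treated as optional, and the names `Pair`, `parse_line`, and `load_corpus` are illustrative only and not part of the corpus distribution.

```python
from typing import NamedTuple, Optional


class Pair(NamedTuple):
    label: int                 # 1 = paraphrase, 0 = non-paraphrase
    sentence_a: str            # sentence_A_ja
    sentence_b: str            # sentence_B_ja
    source_en: Optional[str]   # source_sentence_en, None when not applicable


def parse_line(line):
    """Parse one tab-separated line into a Pair."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least 3 tab-separated fields")
    label = int(fields[0])
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    # The English source sentence is present only for some pairs.
    source = fields[3] if len(fields) > 3 and fields[3] else None
    return Pair(label, fields[1], fields[2], source)


def load_corpus(path):
    """Read every non-empty line of the file at path (UTF-8 assumed)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_line(line) for line in handle if line.strip()]
```

With the pairs loaded, counting `pair.label == 1` should match the paraphrase total reported above if the file contains the full corpus.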

Citing

If you make use of this corpus, please cite the following publication:

Yui Suzuki, Tomoyuki Kajiwara and Mamoru Komachi. Building a Non-Trivial Paraphrase Corpus using Multiple Machine Translation Systems. In Proceedings of ACL 2017 Student Research Workshop, Vancouver, Canada. July 2017 (to appear).

@inproceedings{,
    author      = {Suzuki, Yui and Kajiwara, Tomoyuki and Komachi, Mamoru},
    title       = {Building a Non-Trivial Paraphrase Corpus
                  using Multiple Machine Translation Systems},
    booktitle   = {Proceedings of ACL 2017 Student Research Workshop},
    month       = {July},
    year        = {2017},
    address     = {Vancouver, Canada},
    publisher   = {Association for Computational Linguistics},
    pages     = {(to appear)},
    url       = {http://www.aclweb.org/anthology/}
}

License

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Copyright (c) 2017 TMU-NLP

Contact

For inquiries and feedback, please contact the authors below:

  • Yui Suzuki <suzuki-yui at ed.tmu.ac.jp>
  • Tomoyuki Kajiwara <kajiwara-tomoyuki at ed.tmu.ac.jp>
  • Mamoru Komachi <komachi at tmu.ac.jp>
