meghdadfar / mwes_m1

Java implementation of an MWE identification method (two-word noun compounds only) that is based on the non-substitutability of MWEs.

License: GNU General Public License v3.0

Makefile 29.94% Java 70.06%

mwes_m1

Introduction:

The unige.cui.meghdad.nlp.mwe1 package implements a model for extracting two-word multiword expressions (MWEs), or collocations, based on the non-substitutability criterion. Non-substitutability means that the components of an MWE cannot be replaced with their near synonyms. For instance, "swimming pool" cannot be rephrased as "swimming pond", although the latter is a semantically and syntactically plausible alternative. Efficient extraction of MWEs can improve the performance of several other NLP tasks, such as information extraction (IE), parsing, topic modeling, and sentiment analysis.
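To make the criterion concrete, here is a minimal sketch of how substitution alternatives for a two-word compound might be generated. It is an illustration only and not the package's code: the class `SubstitutionSketch`, the method `alternatives`, and the hand-made synonym map are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of unige.cui.meghdad.nlp.mwe1.
final class SubstitutionSketch {

    /**
     * Builds the alternatives of a two-word compound obtained by replacing
     * exactly one component with one of its near synonyms.
     * The synonym map is an assumed, hand-made resource.
     */
    static List<String> alternatives(String w1, String w2,
                                     Map<String, List<String>> synonyms) {
        List<String> result = new ArrayList<>();
        // Replace the first component, keep the second.
        for (String s : synonyms.getOrDefault(w1, List.of())) {
            result.add(s + " " + w2);
        }
        // Keep the first component, replace the second.
        for (String s : synonyms.getOrDefault(w2, List.of())) {
            result.add(w1 + " " + s);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = Map.of("pool", List.of("pond", "basin"));
        // Prints: [swimming pond, swimming basin]
        System.out.println(alternatives("swimming", "pool", synonyms));
    }
}
```

A compound counts as non-substitutable when alternatives like these are rare or absent in the corpus compared with the compound itself.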

For more information about non-substitutability see:

  • Manning, Chris, and Hinrich Schütze. "Collocations." Foundations of Statistical Natural Language Processing (1999): 141-77.

  • Pearce, Darren. "Synonymy in collocation extraction." Proceedings of the Workshop on WordNet and Other Lexical Resources, Second Meeting of the NAACL. 2001.

With some modifications, unige.cui.meghdad.nlp.mwe1 implements the model presented in: Farahmand, Meghdad, and Joakim Nivre. "Modeling the Statistical Idiosyncrasy of Multiword Expressions." Proceedings of NAACL-HLT. 2015. (Only the bidirectional model is available in this release.)

Note

Since MWEs are better defined on a spectrum of idiosyncrasy than as a binary phenomenon, the program generates a ranked list of MWEs. The compounds at the top of this list are the least substitutable and consequently the most idiosyncratic or lexically rigid. The compounds at the bottom of the list, on the other hand, are more substitutable and hence less idiosyncratic.
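The idea of ranking rather than classifying can be sketched as follows. This is a toy example under stated assumptions, not the model from the article: the score below is a simple smoothed frequency ratio and is not delta_12, delta_21, or the combined criterion. The class `RankingSketch`, its methods, and the counts in `main` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; the score is a stand-in, not the article's criteria.
final class RankingSketch {

    /** Toy score: how much more frequent the compound is than its most frequent alternative. */
    static double score(String compound, List<String> alternatives,
                        Map<String, Integer> bigramCounts) {
        int own = bigramCounts.getOrDefault(compound, 0);
        int bestAlternative = 0;
        for (String alt : alternatives) {
            bestAlternative = Math.max(bestAlternative, bigramCounts.getOrDefault(alt, 0));
        }
        // Add-one smoothing keeps the ratio finite when no alternative was observed.
        return (own + 1.0) / (bestAlternative + 1.0);
    }

    /** Sorts candidates by descending score and keeps at most maxRank of them. */
    static List<String> rank(Map<String, Double> scores, int maxRank) {
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort(Comparator.comparingDouble((String c) -> scores.get(c)).reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(ranked.subList(0, Math.min(maxRank, ranked.size())));
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        Map<String, Integer> counts = Map.of(
                "swimming pool", 120, "swimming pond", 1,
                "red car", 40, "red automobile", 35);
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("red car", score("red car", List.of("red automobile"), counts));
        scores.put("swimming pool", score("swimming pool", List.of("swimming pond"), counts));
        // Prints: [swimming pool, red car]
        System.out.println(rank(scores, 200));
    }
}
```

With these made-up counts, "swimming pool" lands at the top because its alternative is almost never observed, while "red car" sinks because "red automobile" is nearly as frequent.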

Command Line Quick Start

The program can be used in two ways.

1. To generate a ranked list of MWEs that are directly extracted from corpus.

Here, the path to the POS-tagged corpus must be provided through the "-p2corpus" option. Optional flags:

-maxRank Number of top-ranked MWEs to return. Default = 200.

-rc Ranking criterion: delta_12, delta_21, or combined. Default = delta_21. (For more information about the criteria, see the article.)

Example:

java -cp dist/cui-mf-nlp-mwe-m1.jar unige.cui.meghdad.nlp.mwe1.Collocational_Bidirect_Prob_Corpus -p2corpus "PATH_2_POSTAGGED_CORPUS"

2. To rank a list of MWE candidates that are provided in an input file.

Here, the path to the list of POS-tagged two-word candidates (through -p2POSTaggedCandidates), the path to a list of all bigrams (through -p2bigrams), and the path to a list of all unigrams (through -p2unigrams) extracted from the corpus must be provided. Optional flags:

-rc Ranking criterion: delta_12, delta_21, or combined. Default = delta_21.

Example:

java -cp dist/cui-mf-nlp-mwe-m1.jar unige.cui.meghdad.nlp.mwe1.Collocational_Bidirect_Prob_File -p2POSTaggedCandidates "PATH_2_POSTAGGED_CANDIDATES" -p2bigrams "PATH_2_BIGRAMS" -p2unigrams "PATH_2_UNIGRAMS"
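The bigram and unigram lists have to be extracted from the corpus beforehand. A counting pass along the following lines could serve as a starting point. This is a sketch under assumptions: it treats each whitespace-separated token as an opaque string and does not cross sentence boundaries. The file format expected by -p2bigrams and -p2unigrams is not described here, so writing the counts to disk is deliberately left out. The class `NgramCountSketch` and its method are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of unige.cui.meghdad.nlp.mwe1.
final class NgramCountSketch {

    /** Counts n-grams (n = 1 for unigrams, n = 2 for bigrams) within each sentence. */
    static Map<String, Integer> countNgrams(List<String> sentences, int n) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] tokens = trimmed.split("\\s+");
            // Slide a window of n tokens over the sentence.
            for (int i = 0; i + n <= tokens.length; i++) {
                String ngram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                counts.merge(ngram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sentences = List.of("the swimming pool", "the pool");
        // Prints: {pool=2, swimming=1, the=2}
        System.out.println(countNgrams(sentences, 1));
        // Prints: {swimming pool=1, the pool=1, the swimming=1}
        System.out.println(countNgrams(sentences, 2));
    }
}
```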

Contact:

To report bugs or other issues, or if you have any questions, please contact: [email protected]
