guilhermevarela / srlbr

Propbank Br 1.1 using classical methods

SEMANTIC ROLE LABELING BR

Semantic Role Labeling (SRL) is a natural language processing task that attempts to answer "who did what to whom" for a given sentence. This repository is a partial implementation of a master's dissertation (BELTRAO, 2016). It uses Propbank Br v1.1 as the golden set, the Liblinear implementation of Support Vector Machines for multiclass classification, the official CoNLL 2005 Shared Task scripts for evaluation, and three sets of engineered features.

A Golden Set Example:

    ID  FORM      LEMMA     GPOS   MORPH      DTREE  FUNC  CTREE      PRED      ARG
    1   A         o         art    F|S        2      >N    (FCL(NP*   -         *
    2   série     série     n      F|S        8      SUBJ  *          -         (A1*)
    3   exibida   exibir    v-pcp  F|S        2      N<    (ICL(VP*)  -         *
    4   aqui      aqui      adv    -          3      ADVL  (ADVP*)    -         *
    5   por       por       prp    -          3      PASS  (PP*       -         *
    6   a         o         art    F|S        7      >N    (NP*       -         *
    7   Cultura   cultura   n      F|S        5      P<    *)))       -         *
    8   estreiou  estrear   v-fin  PS|3S|IND  0      STA   (VP*)      estreiar  *
    9   em        em        prp    -          8      ADVL  (PP*       -         (AM-LOC*)
    10  a         o         art    F|S        11     >N    (NP*       -         *
    11  TVI       TVI       prop   F|S        9      P<    *          -         *
    12  de        de        prp    _          11     N<    (PP*       -         *
    13  Portugal  Portugal  prop   M|S        12     P<    (NP*))))   _         *
    14  .         .         pu     -          8      PUC   *)         -         *

The Golden Set Features:

| Num | Name | Description |
|---|---|---|
| 1 | ID | Token counter that restarts at 1 for each new proposition. |
| 2 | FORM | Tokenized word or punctuation sign. |
| 3 | LEMMA | Gold-standard lemma for FORM. |
| 4 | GPOS | Gold-standard part-of-speech tag for FORM. |
| 5 | MORPH | Gold-standard morphological features. |
| 6 | DTREE | Gold-standard dependency tree, given as the ID of the token's head (0 for the root). |
| 7 | FUNC | Syntactic function of the token with respect to its head (regent) in the dependency tree. |
| 8 | CTREE | Gold-standard syntactic (constituency) tree. |
| 9 | PRED | Semantic predicate of the proposition. |
| 10 | ARG | Semantic role label for the head (regent) of the argument in DTREE, according to the PropBank annotations. |
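
The column layout above can be read into a simple record per token. The following is a minimal sketch, assuming each row is whitespace-separated and no field contains spaces (the files under datasets_1.1/ may use a different delimiter); `parse_token` is a hypothetical helper, not a function from this repository.

```python
# Column names taken from the feature table, in order.
COLUMNS = ["ID", "FORM", "LEMMA", "GPOS", "MORPH",
           "DTREE", "FUNC", "CTREE", "PRED", "ARG"]


def parse_token(line):
    """Split one whitespace-separated gold-set row into a dict keyed by column name."""
    values = line.split()
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    token = dict(zip(COLUMNS, values))
    # ID and DTREE are numeric; DTREE holds the ID of the head token (0 = root).
    token["ID"] = int(token["ID"])
    token["DTREE"] = int(token["DTREE"])
    return token


# Row 2 of the golden set example.
token = parse_token("2 série série n F|S 8 SUBJ * - (A1*)")
print(token["FORM"], token["DTREE"], token["ARG"])  # série 8 (A1*)
```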

Semantic Role Labels

| Tag | Description |
|---|---|
| A0 | Usually the agent (actor). |
| A1 | Usually the patient or theme (receiver). |
| A2...A5 | Verb-specific tags. |
| AM-ADV | Adverbial modifier. |
| AM-CAU | Cause. |
| AM-DIR | Direction. |
| AM-DIS | Discourse marker. |
| AM-EXT | Extent. |
| AM-MED | Undocumented tag. |
| AM-LOC | Location. |
| AM-MNR | Manner. |
| AM-NEG | Negation. |
| AM-PNC | Purpose. |
| AM-PRD | Secondary predication. |
| AM-REC | Reciprocal. |
| AM-TMP | Temporal. |

FEATURE ENGINEERING (BELTRAO, 2016)

The dissertation describes 5 groups of features, 4 of which are implemented here:

Golden Standard Features

All golden standard features except ARG.

| Attribute | Description |
|---|---|
| FORM | Tokenized word or punctuation sign. |
| LEMMA | Gold-standard lemma for FORM. |
| GPOS | Gold-standard part-of-speech tag for FORM. |
| MORPH | Gold-standard morphological features. |
| FUNC | Syntactic function of the token with respect to its head (regent) in the dependency tree. |

Window Features

Lead and lag features around the token.

| Attribute | Description |
|---|---|
| LeftForm 1, 2 and 3 | FORM of the 3 tokens to the left. |
| RightForm 1, 2 and 3 | FORM of the 3 tokens to the right. |
| LeftFunc 1, 2 and 3 | FUNC of the 3 tokens to the left. |
| RightFunc 1, 2 and 3 | FUNC of the 3 tokens to the right. |
| LeftLemma 1, 2 and 3 | LEMMA of the 3 tokens to the left. |
| RightLemma 1, 2 and 3 | LEMMA of the 3 tokens to the right. |
| LeftGPOS 1, 2 and 3 | GPOS of the 3 tokens to the left. |
| RightGPOS 1, 2 and 3 | GPOS of the 3 tokens to the right. |
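
As an illustration of the lead and lag idea, the sketch below collects the three neighbours on each side of a token for one column. The function name, the key naming (`LeftForm1`, ...) and the `_` padding used at sentence boundaries are assumptions made for this example; models/feature_factory.py may handle these differently.

```python
def window_features(values, index, name, size=3, pad="_"):
    """Return Left<name>k / Right<name>k for k = 1..size around values[index]."""
    features = {}
    for k in range(1, size + 1):
        left, right = index - k, index + k
        # Positions outside the sentence get the padding symbol.
        features[f"Left{name}{k}"] = values[left] if left >= 0 else pad
        features[f"Right{name}{k}"] = values[right] if right < len(values) else pad
    return features


# FORM column of the first seven tokens of the golden set example.
forms = ["A", "série", "exibida", "aqui", "por", "a", "Cultura"]
features = window_features(forms, 1, "Form")
print(features["LeftForm1"], features["RightForm1"], features["RightForm3"])
# A exibida por
```

The same function can be applied to the FUNC, LEMMA and GPOS columns by changing `values` and `name`.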

Context Features

Lead and lag features around predicate.

| Attribute | Description |
|---|---|
| PredLemma | LEMMA of the target verb. |
| PredLeftLemma | LEMMA of the token to the left of the target verb. |
| PredRightLemma | LEMMA of the token to the right of the target verb. |
| PredGPOS | GPOS of the target verb. |
| PredLeftGPOS | GPOS of the token to the left of the target verb. |
| PredRightGPOS | GPOS of the token to the right of the target verb. |
| PredFunc | FUNC of the target verb. |
| PredLeftFunc | FUNC of the token to the left of the target verb. |
| PredRightFunc | FUNC of the token to the right of the target verb. |
| PredicateDistance | ID of the target verb minus ID of the current token. |
| PredMorph 1..n | Set of 32 MORPH features for the target verb. |
| PassiveVoice | Passive voice indicator. True if the verb has GPOS=v-pcp and is preceded by a token with LEMMA=ser, with or without a token with GPOS=adv between them. |
| PosRelVerb | Whether the token is before or after the verb. |
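
Two of the context features are rule-based and can be sketched directly from their descriptions. This is an illustrative reading, not the repository's code: the function names are hypothetical, and the passive-voice rule assumes at most one adverb may sit between the auxiliary and the participle.

```python
def predicate_distance(pred_id, token_id):
    """ID of the target verb minus ID of the current token."""
    return pred_id - token_id


def passive_voice(tokens, pred_index):
    """True if the verb is a participle (GPOS=v-pcp) preceded by LEMMA=ser,
    optionally with one GPOS=adv token in between."""
    if tokens[pred_index]["GPOS"] != "v-pcp":
        return False
    prev = pred_index - 1
    # Skip a single intervening adverb, if any (assumption: at most one).
    if prev >= 0 and tokens[prev]["GPOS"] == "adv":
        prev -= 1
    return prev >= 0 and tokens[prev]["LEMMA"] == "ser"


# Tokens 2 and 3 of the golden set example: "série exibida".
example = [
    {"LEMMA": "série", "GPOS": "n"},
    {"LEMMA": "exibir", "GPOS": "v-pcp"},
]
# Hypothetical passive construction, for contrast: "foi exibida".
passive = [
    {"LEMMA": "ser", "GPOS": "v-fin"},
    {"LEMMA": "exibir", "GPOS": "v-pcp"},
]
print(passive_voice(example, 1))  # False
print(passive_voice(passive, 1))  # True
print(predicate_distance(8, 2))   # 6
```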

Dependency Tree Features

Token dependencies and path to predicate.

| Attribute | Description |
|---|---|
| DepLemmaParent | LEMMA of the parent of the token. |
| DepLemmaGrandparent | LEMMA of the grandparent of the token. |
| DepLemmaChild 1, 2 and 3 | LEMMA of the first 3 children of the token. |
| DepGPOSParent | GPOS of the parent of the token. |
| DepGPOSGrandParent | GPOS of the grandparent of the token. |
| DepGPOSChild 1, 2 and 3 | GPOS of the first 3 children of the token. |
| DepFuncParent | FUNC of the parent of the token. |
| DepFuncGrandParent | FUNC of the grandparent of the token. |
| DepFuncChild 1, 2 and 3 | FUNC of the first 3 children of the token. |
| DepPathFunc | Path of FUNC tags between the token and the target verb, passing through their lowest common ancestor. |
| DepPathGPOS | Path of GPOS tags between the token and the target verb, passing through their lowest common ancestor. |
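
The path features can be sketched by walking up the DTREE head IDs from both ends until the two chains meet. The helper names and the output format (a plain list of tags) are assumptions for illustration, and the sketch expects a well-formed tree with a single root.

```python
def ancestors(token_id, heads):
    """Chain from token_id up to the root, following head IDs (0 marks the root)."""
    chain = []
    while token_id != 0:
        chain.append(token_id)
        token_id = heads[token_id]
    return chain


def dep_path(token_id, pred_id, heads, tags):
    """Tags along the path token -> lowest common ancestor -> predicate."""
    up = ancestors(token_id, heads)
    down = ancestors(pred_id, heads)
    down_set = set(down)
    # First ancestor of the token that is also an ancestor of the predicate.
    lca = next(node for node in up if node in down_set)
    path = up[:up.index(lca) + 1] + list(reversed(down[:down.index(lca)]))
    return [tags[node] for node in path]


# DTREE and FUNC columns of the golden set example, keyed by token ID.
heads = {1: 2, 2: 8, 3: 2, 4: 3, 5: 3, 6: 7, 7: 5,
         8: 0, 9: 8, 10: 11, 11: 9, 12: 11, 13: 12, 14: 8}
func = {1: ">N", 2: "SUBJ", 3: "N<", 4: "ADVL", 5: "PASS", 6: ">N", 7: "P<",
        8: "STA", 9: "ADVL", 10: ">N", 11: "P<", 12: "N<", 13: "P<", 14: "PUC"}

print(dep_path(7, 8, heads, func))  # ['P<', 'PASS', 'N<', 'SUBJ', 'STA']
print(dep_path(1, 3, heads, func))  # ['>N', 'SUBJ', 'N<']
```

Passing the GPOS column instead of `func` would give the DepPathGPOS variant.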

Project

The following defines the current project structure tree:

- srlbr/
    - datasets_1.1/
        - conll/
        - csvs/
        - props/
    - models/
        - lib/
        - __init__.py
        - evaluator.py
        - feature_factory.py
        - svm.py 
        - utils.py        
    - srlconll-1.1/
    - requirements.txt
    - README.md
    - srl.py        
    - .gitignore

datasets_1.1/conll/

Train, test and validation gold-standard sets, formatted for machine learning models. Originals can be found here.

datasets_1.1/csvs/

Stored feature columns.

datasets_1.1/props/

Gold-standard propositions, which are evaluated by the CoNLL script.

models/lib/

Liblinear executables and the Python modules liblinear.py and liblinearutil.py.

models/__init__.py

Exports the svm_srl function.

models/evaluator.py

Thin wrapper around the CoNLL script: it uses subprocess to run the official script.
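
A wrapper of this kind might look like the sketch below. It is not the code in models/evaluator.py: the function names are hypothetical, and the default script path simply mirrors the srl-eval.pl invocation given in the SETUP section. Running `evaluate` requires Perl and a PERL5LIB that includes srlconll-1.1/lib.

```python
import subprocess


def build_eval_command(gold_props, predicted_props,
                       script="srlconll-1.1/bin/srl-eval.pl"):
    """Command line for the official evaluation script (paths are assumptions)."""
    return ["perl", script, gold_props, predicted_props]


def evaluate(gold_props, predicted_props):
    """Run the official script and return its textual report."""
    result = subprocess.run(
        build_eval_command(gold_props, predicted_props),
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Hypothetical file names, shown only to illustrate the resulting command.
command = build_eval_command("gold.props", "predicted.props")
print(" ".join(command))
```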

models/feature_factory.py

Engineers the features.

models/svm.py

Wraps the calls to the Liblinear library and defines the main function, svm_srl.

models/utils.py

Handles storing.

srlconll-1.1/

Official CoNLL 2005 Shared Task Perl scripts.

requirements.txt

Python project requirements.

srl.py

Command-line script that runs the classifier.

SETUP

Python

    > pip install -r requirements.txt

Perl

The CoNLL 2005 Shared Task scripts use Perl 5, which is probably already installed. To check:

    > perl -v

Set the PERL5LIB environment variable.

    > setenv PERL5LIB $HOME/path/to/srlbr/srlconll-1.1/lib:$PERL5LIB

Usage:

    > perl $HOME/path/to/srlbr/srlconll-1.1/bin/srl-eval.pl <gold_props> <predicted_props>

The full installation instructions can be found in srlconll-1.1/.

Liblinear

Download Liblinear for Python, navigate to its folder and build it:

    > make

Place the compiled *.so files, together with liblinear.py and liblinearutil.py, in models/lib.

USAGE

    > python srl.py -h

Shows the help message.

    > python srl.py

Runs multi-class classification, using only the gold-standard and window sets of features, with the following solvers:

| Solver | Description |
|---|---|
| 0 | L2-regularized logistic regression (primal) |
| 1 | L2-regularized L2-loss support vector classification (dual) |
| 2 | L2-regularized L2-loss support vector classification (primal) |
| 3 | L2-regularized L1-loss support vector classification (dual) |
| 4 | Support vector classification by Crammer and Singer |
| 5 | L1-regularized L2-loss support vector classification |
| 6 | L1-regularized logistic regression |
| 7 | L2-regularized logistic regression (dual) |

    > python srl.py 1 -context -dtree -windows

Runs L2-regularized L2-loss support vector classification (dual) with the context, dtree and windows sets of features.
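
For orientation, the solver number corresponds to Liblinear's `-s` option, and the C value reported in the results is assumed here to be passed as Liblinear's `-c` (cost) option. The sketch below only builds the option string; `liblinear_options` is a hypothetical helper, and how srl.py actually forwards its arguments is not shown in this README.

```python
def liblinear_options(solver, cost):
    """Build the option string understood by Liblinear: -s <solver> -c <cost>."""
    if solver not in range(8):
        raise ValueError("solver must be one of 0..7")
    return f"-s {solver} -c {cost}"


options = liblinear_options(solver=1, cost=0.0625)
print(options)  # -s 1 -c 0.0625

# With liblinear.py and liblinearutil.py in models/lib, training would then
# look roughly like this (y: labels, x: list of {feature_index: value} dicts):
#   from liblinearutil import train, predict
#   model = train(y, x, options)
#   predicted_labels, accuracy, decision_values = predict(y, x, model)
```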

RESULTS

State of the art.

| Reference | C | Precision | Recall | F1 |
|---|---|---|---|---|
| (BELTRAO, 2016) | 0.0625 | 82.17% | 82.88% | 82.52% |

Validation results

| Features | C | S (solver) | Precision | Recall | F1 |
|---|---|---|---|---|---|
| context + dtree + window | 0.0625 | 1 | 76.80% | 75.67% | 76.23% |
| dtree | 0.0625 | 4 | 76.31% | 75.13% | 75.72% |
| context + window | 0.0625 | 4 | 57.13% | 53.96% | 55.50% |
| context | 0.0625 | 4 | 57.37% | 46.42% | 51.32% |
| window | 0.0625 | 2 | 40.61% | 26.79% | 32.28% |
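
The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A short check, using a hypothetical `f1_score` helper and values copied from the validation results, shows the reported figures are consistent with that formula:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in %) copied from the validation results.
rows = {
    "context + dtree + window": (76.80, 75.67),
    "dtree": (76.31, 75.13),
    "window": (40.61, 26.79),
}
for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}%")
# context + dtree + window: F1 = 76.23%
# dtree: F1 = 75.72%
# window: F1 = 32.28%
```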

BIBLIOGRAPHY (PT)

(ALVA-MANCHEGO, 2013)
Anotação automática semissupervisionada de papéis semânticos para o português do Brasil

http://www.teses.usp.br/teses/disponiveis/55/55134/tde-14032013-150816/publico/dissrevfernandoAlva.pdf

(BELTRAO, 2016)
Anotador de Papeis Semânticos para Português

https://www.maxwell.vrac.puc-rio.br/30371/30371.PDF
