afet's Introduction

AFET: Automatic Fine-Grained Entity Typing by Hierarchical Partial-Label Embedding

Source code and data for the EMNLP'16 paper AFET: Automatic Fine-Grained Entity Typing by Hierarchical Partial-Label Embedding.

Given a text corpus with entity mentions detected and heuristically labeled by distant supervision, this code trains a rank-based loss over the distant supervision and predicts fine-grained entity types for each test entity mention. For example, check out AFET's output on WSJ news articles.

An end-to-end tool (corpus to typed entities) is under development. Please keep track of our updates.

Performance

Performance of fine-grained entity type classification on the Wiki (Ling & Weld, 2012) dataset.

Method                                      Accuracy  Macro-F1  Micro-F1
HYENA (Yosef et al., 2012)                  0.288     0.528     0.506
FIGER (Ling & Weld, 2012)                   0.474     0.692     0.655
FIGER + All Filter (Gillick et al., 2014)   0.453     0.648     0.582
HNM (Dong et al., 2015)                     0.237     0.409     0.417
WSABIE (Yogatama et al., 2015)              0.480     0.679     0.657
AFET (Ren et al., 2016)                     0.533     0.693     0.664

System Output

The output on the BBN dataset can be found here. Each line is a sentence in the BBN test data, with entity mentions and their fine-grained entity types identified.

Dependencies

  • python 2.7, g++
  • Python library dependencies
$ pip install pexpect unidecode six requests protobuf
$ cd DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip
$ rm stanford-corenlp-full-2016-10-31.zip
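
To check that the Python dependencies installed correctly, here is a quick sanity check (illustrative only; note that protobuf is imported as google.protobuf):

$ python -c "import pexpect, unidecode, six, requests, google.protobuf; print('dependencies OK')"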

Data

We pre-processed three public datasets (train/test sets) into our JSON format. We ran Stanford NER on each training set to detect entity mentions, and performed distant supervision using DBpedia Spotlight to assign type labels (a minimal loading sketch follows the list below):

  • Wiki (Ling & Weld, 2012): 1.5M sentences sampled from 780k Wikipedia articles. 434 news sentences are manually annotated for evaluation. 113 entity types are organized into a 2-level hierarchy (download JSON)
  • OntoNotes (Weischedel et al., 2011): 13k news articles, 77 of which are manually labeled for evaluation. 89 entity types are organized into a 3-level hierarchy. (download JSON)
  • BBN (Weischedel et al., 2005): 2,311 WSJ articles that are manually annotated using 93 types in a 2-level hierarchy. (download JSON)
  • Type hierarchies for each dataset are included.
  • Please put the data files in the corresponding subdirectories under AFET/Data/.
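
Each data file is in JSON-lines format, one sentence per line; the per-line fields (tokens, senid, fileid, and mentions with start/end/labels) follow the BBN train.json example quoted in the Issues section below. A minimal loading sketch (an illustration only, not part of the AFET pipeline; the file path is an example):

import json

def read_mentions(path):
    # Iterate over entity mentions in a pre-processed JSON-lines file.
    with open(path) as f:
        for line in f:
            sent = json.loads(line)
            tokens = sent["tokens"]
            for m in sent.get("mentions", []):
                surface = " ".join(tokens[m["start"]:m["end"]])
                yield sent["fileid"], sent["senid"], surface, m["labels"]

# Example: print the first mention of the BBN training set.
for fileid, senid, mention, labels in read_mentions("Data/BBN/train.json"):
    print("%s sentence %d: '%s' -> %s" % (fileid, senid, mention, labels))
    break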

Makefile

$ cd AFET/Model; make

Default Run

Run AFET for fine-grained entity typing on the BBN dataset:

$ java -mx4g -cp "DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
$ ./run.sh  
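
To confirm the CoreNLP server is up before launching run.sh, here is a minimal check using the requests library installed above (this assumes the server listens on its default port, 9000; it is an illustration, not part of run.sh):

import json
import requests

# Ask the CoreNLP server to tokenize a short sentence and print the tokens.
props = {"annotators": "tokenize,ssplit", "outputFormat": "json"}
r = requests.post("http://localhost:9000/",
                  params={"properties": json.dumps(props)},
                  data="AFET predicts fine-grained entity types.")
r.raise_for_status()
print([tok["word"] for tok in r.json()["sentences"][0]["tokens"]])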

Parameters - run.sh

The dataset to run on:

Data="BBN"
  • Concrete parameters for running each dataset can be found in the README in the corresponding data folder under AFET/Data/.

Evaluation

Evaluate prediction results (by the classifier trained on de-noised data) over the test data:

python Evaluation/emb_prediction.py $Data pl_warp bipartite maximum cosine 0.25
python Evaluation/evaluation.py $Data pl_warp bipartite
  • python Evaluation/evaluation.py -DATA(BBN/ontonotes/FIGER) -METHOD(hple/...) -EMB_MODE(hete_feature)
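
For reference, the Accuracy / Macro-F1 / Micro-F1 numbers in the Performance section follow the strict, macro, and micro metrics of Ling & Weld (2012). A minimal sketch of these definitions (my own illustration, not the project's evaluation script):

def strict_accuracy(gold, pred):
    # Fraction of mentions whose predicted type set exactly matches the gold set.
    return sum(1.0 for g, q in zip(gold, pred) if set(g) == set(q)) / len(gold)

def macro_f1(gold, pred):
    # Precision/recall averaged per mention, then combined into F1.
    p = sum(len(set(g) & set(q)) / float(len(q)) for g, q in zip(gold, pred) if q) / len(gold)
    r = sum(len(set(g) & set(q)) / float(len(g)) for g, q in zip(gold, pred) if g) / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_f1(gold, pred):
    # Precision/recall computed from counts aggregated over all mentions.
    tp = float(sum(len(set(g) & set(q)) for g, q in zip(gold, pred)))
    p = tp / sum(len(q) for q in pred)
    r = tp / sum(len(g) for g in gold)
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy example with two mentions.
gold = [["/ORGANIZATION", "/ORGANIZATION/CORPORATION"], ["/PERSON"]]
pred = [["/ORGANIZATION"], ["/PERSON"]]
print("strict=%.3f macro=%.3f micro=%.3f"
      % (strict_accuracy(gold, pred), macro_f1(gold, pred), micro_f1(gold, pred)))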

Publication

Please cite the following paper if you find the code and datasets helpful:

@inproceedings{Ren2016AFETAF,
  title={AFET: Automatic Fine-Grained Entity Typing by Hierarchical Partial-Label Embedding},
  author={Xiang Ren and Wenqi He and Meng Qu and Lifu Huang and Heng Ji and Jiawei Han},
  booktitle={EMNLP},
  year={2016}
}

afet's People

Contributors

ellenmellon, little8hwq, shanzhenren

afet's Issues

Some problems with StanfordNLP

After doing everything mentioned in the README, I tried $ ./run.sh and got the following:

Step 1 Generate Features
Start nlp parsing
Traceback (most recent call last):
  File "DataProcessor/feature_generation.py", line 29, in <module>
    parse(raw_train_json, train_json)
  File "AFET/DataProcessor/nlp_parse.py", line 38, in parse
    parser = NLPParser('DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20')
  File "AFET/DataProcessor/nlp_parse.py", line 16, in __init__
    self.parser = StanfordCoreNLP(corenlp_dir)
  File "/usr/local/lib/python2.7/dist-packages/stanford_corenlp_python-3.3.10-py2.7.egg/corenlp/corenlp.py", line 348, in __init__
    self._spawn_corenlp()
  File "/usr/local/lib/python2.7/dist-packages/stanford_corenlp_python-3.3.10-py2.7.egg/corenlp/corenlp.py", line 337, in _spawn_corenlp
    self.corenlp.expect("\nNLP> ")
  File "/usr/local/lib/python2.7/dist-packages/pexpect/spawnbase.py", line 321, in expect
    timeout, searchwindowsize, async)
  File "/usr/local/lib/python2.7/dist-packages/pexpect/spawnbase.py", line 345, in expect_list
    return exp.expect_loop(timeout)
  File "/usr/local/lib/python2.7/dist-packages/pexpect/expect.py", line 105, in expect_loop
    return self.eof(e)
  File "/usr/local/lib/python2.7/dist-packages/pexpect/expect.py", line 50, in eof
    raise EOF(msg)
pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
<pexpect.pty_spawn.spawn object at 0x7f0a158ab0d0>
command: /usr/bin/java
args: ['/usr/bin/java', '-Xmx3g', '-cp', 'DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/joda-time.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/javax.json-api-1.0-sources.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/javax.json.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/stanford-corenlp-3.5.2-sources.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/jollyday.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/protobuf.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/stanford-corenlp-3.5.2-models.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/ejml-0.23.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/jollyday-0.4.7-sources.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/xom.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/xom-1.2.10-src.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/joda-time-2.1-sources.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/stanford-corenlp-3.5.2.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/stanford-corenlp-3.5.2-javadoc.jar', 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-props', '/usr/local/lib/python2.7/dist-packages/stanford_corenlp_python-3.3.10-py2.7.egg/corenlp/default.properties']
buffer (last 100 chars): ''
before (last 100 chars): 's(ClassLoader.java:358)\r\n\tat sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)\r\n'
after: <class 'pexpect.exceptions.EOF'>
match: None
match_index: None
exitstatus: None
flag_eof: True
pid: 9081
child_fd: 7
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 8192
ignorecase: False
searchwindowsize: 80
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_re:
    0: re.compile("
NLP> ")

This seems to be the same problem as dasmith/stanford-corenlp-python#34, i.e., the problem is not AFET's but StanfordNLP's. Still, I am wondering whether you ran into the same problem and, if so, how you solved it. Thanks!

Make error

Hello, I'm trying to run the model, but when I run make in the Model directory, I get this error:
g++ -lm -pthread -O2 -march=native -Wall -funroll-loops -Wno-unused-result -lgsl -lm -lgslcblas -o pl_warp pl_warp.cpp utils.o hierarchy.o
/usr/bin/ld: cannot find -lgsl
collect2: error: ld returned 1 exit status
make: *** [makefile:5: pl_warp] Error 1
In case it is helpful: my PC is running Manjaro Linux Ornara 21.0.5.

invalid option argument ‘-Ofast’

When I run make in the Model directory, I get:

g++ -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result -lgsl -lm -lgslcblas -o pl_warp pl_warp.cpp utils.o hierarchy.o
cc1plus: error: invalid option argument ‘-Ofast’
cc1plus: warning: unrecognized command line option "-Wno-unused-result"
make: *** [pl_warp] Error 1

Wrong labels in BBN/train.json

Hi, I found some mentions whose labels seem wrong in "BBN/train.json", which I downloaded from the provided link. Here is an example.

{"tokens": ["The", "almanac", "will", "be", "making", "new", "friends", "and", "enemies", "on", "Oct.", "27", ",", "when", "an", "updated", "version", "will", "be", "released", "."], "senid": 4, "mentions": [{"start": 0, "labels": ["/ORGANIZATION/CORPORATION", "/WORK_OF_ART", "/ORGANIZATION", "/WORK_OF_ART/BOOK"], "end": 1}], "fileid": "WSJ1826"}

It labels the token "The" as ["/ORGANIZATION/CORPORATION", "/WORK_OF_ART/BOOK"], which doesn't make sense. There are many mentions with similar mistakes in BBN/train.json. BBN is manually annotated, so all of the mentions should have been labeled correctly.

Could you please clarify where this error comes from? Is it from the original dataset or from your data preprocessing procedure?
