afet's Introduction

AFET: Automatic Fine-Grained Entity Typing by Hierarchical Partial-Label Embedding

Source code and data for the EMNLP'16 paper AFET: Automatic Fine-Grained Entity Typing by Hierarchical Partial-Label Embedding.

Given a text corpus with entity mentions detected and heuristically labeled by distant supervision, this code trains a rank-based loss over the distant supervision and predicts fine-grained entity types for each test entity mention. For example, check out AFET's output on WSJ news articles.

An end-to-end tool (corpus to typed entities) is under development. Please keep track of our updates.

Performance

Performance of fine-grained entity type classification on the Wiki (Ling & Weld, 2012) dataset.

Method                                      Accuracy  Macro-F1  Micro-F1
HYENA (Yosef et al., 2012)                  0.288     0.528     0.506
FIGER (Ling & Weld, 2012)                   0.474     0.692     0.655
FIGER + All Filter (Gillick et al., 2014)   0.453     0.648     0.582
HNM (Dong et al., 2015)                     0.237     0.409     0.417
WSABIE (Yogatama et al., 2015)              0.480     0.679     0.657
AFET (Ren et al., 2016)                     0.533     0.693     0.664

System Output

The output on the BBN dataset can be found here. Each line is a sentence in the BBN test data, with entity mentions and their fine-grained entity types identified.

Dependencies

  • python 2.7, g++
  • Python library dependencies
$ pip install pexpect unidecode six requests protobuf
$ cd DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip
$ rm stanford-corenlp-full-2016-10-31.zip
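
To check that the Python dependencies installed correctly, here is a quick sanity check (illustrative only; note that protobuf is imported as google.protobuf):

$ python -c "import pexpect, unidecode, six, requests, google.protobuf; print('dependencies OK')"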

Data

We pre-processed three public datasets (train/test sets) into our JSON format. We ran Stanford NER on each training set to detect entity mentions, and performed distant supervision using DBpedia Spotlight to assign type labels (a minimal loading sketch follows the list below):

  • Wiki (Ling & Weld, 2012): 1.5M sentences sampled from 780k Wikipedia articles. 434 news sentences are manually annotated for evaluation. 113 entity types are organized into a 2-level hierarchy (download JSON)
  • OntoNotes (Weischedel et al., 2011): 13k news articles, 77 of which are manually labeled for evaluation. 89 entity types are organized into a 3-level hierarchy. (download JSON)
  • BBN (Weischedel et al., 2005): 2,311 WSJ articles that are manually annotated using 93 types in a 2-level hierarchy. (download JSON)
  • Type hierarchies for each dataset are included.
  • Please put the data files in the corresponding subdirectories under AFET/Data/.
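
Each data file is in JSON-lines format, one sentence per line; the per-line fields (tokens, senid, fileid, and mentions with start/end/labels) follow the BBN train.json example quoted in the Issues section below. A minimal loading sketch (an illustration only, not part of the AFET pipeline; the file path is an example):

import json

def read_mentions(path):
    # Iterate over entity mentions in a pre-processed JSON-lines file.
    with open(path) as f:
        for line in f:
            sent = json.loads(line)
            tokens = sent["tokens"]
            for m in sent.get("mentions", []):
                surface = " ".join(tokens[m["start"]:m["end"]])
                yield sent["fileid"], sent["senid"], surface, m["labels"]

# Example: print the first mention of the BBN training set.
for fileid, senid, mention, labels in read_mentions("Data/BBN/train.json"):
    print("%s sentence %d: '%s' -> %s" % (fileid, senid, mention, labels))
    break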

Makefile

$ cd AFET/Model; make

Default Run

Run AFET for fine-grained entity typing on the BBN dataset:

$ java -mx4g -cp "DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
$ ./run.sh  
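
To confirm the CoreNLP server is up before launching run.sh, here is a minimal check using the requests library installed above (this assumes the server listens on its default port, 9000; it is an illustration, not part of run.sh):

import json
import requests

# Ask the CoreNLP server to tokenize a short sentence and print the tokens.
props = {"annotators": "tokenize,ssplit", "outputFormat": "json"}
r = requests.post("http://localhost:9000/",
                  params={"properties": json.dumps(props)},
                  data="AFET predicts fine-grained entity types.")
r.raise_for_status()
print([tok["word"] for tok in r.json()["sentences"][0]["tokens"]])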

Parameters - run.sh

The dataset to run on:

Data="BBN"
  • Concrete parameters for running each dataset can be found in the README in the corresponding data folder under AFET/Data/.

Evaluation

Evaluate prediction results (by the classifier trained on de-noised data) over the test data:

python Evaluation/emb_prediction.py $Data pl_warp bipartite maximum cosine 0.25
python Evaluation/evaluation.py $Data pl_warp bipartite
  • python Evaluation/evaluation.py -DATA(BBN/ontonotes/FIGER) -METHOD(hple/...) -EMB_MODE(hete_feature)
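
For reference, the Accuracy / Macro-F1 / Micro-F1 numbers in the Performance section follow the strict, macro, and micro metrics of Ling & Weld (2012). A minimal sketch of these definitions (my own illustration, not the project's evaluation script):

def strict_accuracy(gold, pred):
    # Fraction of mentions whose predicted type set exactly matches the gold set.
    return sum(1.0 for g, q in zip(gold, pred) if set(g) == set(q)) / len(gold)

def macro_f1(gold, pred):
    # Precision/recall averaged per mention, then combined into F1.
    p = sum(len(set(g) & set(q)) / float(len(q)) for g, q in zip(gold, pred) if q) / len(gold)
    r = sum(len(set(g) & set(q)) / float(len(g)) for g, q in zip(gold, pred) if g) / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_f1(gold, pred):
    # Precision/recall computed from counts aggregated over all mentions.
    tp = float(sum(len(set(g) & set(q)) for g, q in zip(gold, pred)))
    p = tp / sum(len(q) for q in pred)
    r = tp / sum(len(g) for g in gold)
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy example with two mentions.
gold = [["/ORGANIZATION", "/ORGANIZATION/CORPORATION"], ["/PERSON"]]
pred = [["/ORGANIZATION"], ["/PERSON"]]
print("strict=%.3f macro=%.3f micro=%.3f"
      % (strict_accuracy(gold, pred), macro_f1(gold, pred), micro_f1(gold, pred)))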

Publication

Please cite the following paper if you find the code and datasets helpful:

@inproceedings{Ren2016AFETAF,
  title={AFET: Automatic Fine-Grained Entity Typing by Hierarchical Partial-Label Embedding},
  author={Xiang Ren and Wenqi He and Meng Qu and Lifu Huang and Heng Ji and Jiawei Han},
  booktitle={EMNLP},
  year={2016}
}

afet's People

Contributors

ellenmellon, little8hwq, shanzhenren

afet's Issues

Some problems with StanfordNLP

After doing everything mentioned in the README, I tried $ ./run.sh and got the following:

Step 1 Generate Features
Start nlp parsing
Traceback (most recent call last):
  File "DataProcessor/feature_generation.py", line 29, in <module>
    parse(raw_train_json, train_json)
  File "AFET/DataProcessor/nlp_parse.py", line 38, in parse
    parser = NLPParser('DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20')
  File "AFET/DataProcessor/nlp_parse.py", line 16, in __init__
    self.parser = StanfordCoreNLP(corenlp_dir)
  File "/usr/local/lib/python2.7/dist-packages/stanford_corenlp_python-3.3.10-py2.7.egg/corenlp/corenlp.py", line 348, in __init__
    self._spawn_corenlp()
  File "/usr/local/lib/python2.7/dist-packages/stanford_corenlp_python-3.3.10-py2.7.egg/corenlp/corenlp.py", line 337, in _spawn_corenlp
    self.corenlp.expect("\nNLP> ")
  File "/usr/local/lib/python2.7/dist-packages/pexpect/spawnbase.py", line 321, in expect
    timeout, searchwindowsize, async)
  File "/usr/local/lib/python2.7/dist-packages/pexpect/spawnbase.py", line 345, in expect_list
    return exp.expect_loop(timeout)
  File "/usr/local/lib/python2.7/dist-packages/pexpect/expect.py", line 105, in expect_loop
    return self.eof(e)
  File "/usr/local/lib/python2.7/dist-packages/pexpect/expect.py", line 50, in eof
    raise EOF(msg)
pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
<pexpect.pty_spawn.spawn object at 0x7f0a158ab0d0>
command: /usr/bin/java
args: ['/usr/bin/java', '-Xmx3g', '-cp', 'DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/joda-time.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/javax.json-api-1.0-sources.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/javax.json.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/stanford-corenlp-3.5.2-sources.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/jollyday.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/protobuf.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/stanford-corenlp-3.5.2-models.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/ejml-0.23.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/jollyday-0.4.7-sources.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/xom.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/xom-1.2.10-src.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/joda-time-2.1-sources.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/stanford-corenlp-3.5.2.jar:DataProcessor/stanford-corenlp-python/corenlp/stanford-corenlp-full-2015-04-20/stanford-corenlp-3.5.2-javadoc.jar', 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-props', '/usr/local/lib/python2.7/dist-packages/stanford_corenlp_python-3.3.10-py2.7.egg/corenlp/default.properties']
buffer (last 100 chars): ''
before (last 100 chars): 's(ClassLoader.java:358)\r\n\tat sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)\r\n'
after: <class 'pexpect.exceptions.EOF'>
match: None
match_index: None
exitstatus: None
flag_eof: True
pid: 9081
child_fd: 7
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 8192
ignorecase: False
searchwindowsize: 80
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_re:
    0: re.compile("
NLP> ")

This seems to be the same problem as dasmith/stanford-corenlp-python#34, i.e., the problem is not AFET's but StanfordNLP's. Still, I am wondering whether you ran into the same problem and, if so, how you solved it. Thanks!

Make error

Hello, I'm trying to run the model, but when I run make in the Model directory, I get this error:
g++ -lm -pthread -O2 -march=native -Wall -funroll-loops -Wno-unused-result -lgsl -lm -lgslcblas -o pl_warp pl_warp.cpp utils.o hierarchy.o
/usr/bin/ld: cannot find -lgsl
collect2: error: ld returned 1 exit status
make: *** [makefile:5: pl_warp] Error 1
In case it is helpful: my PC is running Manjaro Linux Ornara 21.0.5.

invalid option argument ‘-Ofast’

When I run make in the Model directory, I get:

g++ -lm -pthread -Ofast -march=native -Wall -funroll-loops -Wno-unused-result -lgsl -lm -lgslcblas -o pl_warp pl_warp.cpp utils.o hierarchy.o
cc1plus: error: invalid option argument ‘-Ofast’
cc1plus: warning: unrecognized command line option "-Wno-unused-result"
make: *** [pl_warp] Error 1

Wrong labels in BBN/train.json

Hi, I found some mentions whose labels seem wrong in "BBN/train.json", which I downloaded from the provided link. Here is an example.

{"tokens": ["The", "almanac", "will", "be", "making", "new", "friends", "and", "enemies", "on", "Oct.", "27", ",", "when", "an", "updated", "version", "will", "be", "released", "."], "senid": 4, "mentions": [{"start": 0, "labels": ["/ORGANIZATION/CORPORATION", "/WORK_OF_ART", "/ORGANIZATION", "/WORK_OF_ART/BOOK"], "end": 1}], "fileid": "WSJ1826"}

It labels the token "The" as ["/ORGANIZATION/CORPORATION", "/WORK_OF_ART/BOOK"], which doesn't make sense. There are many mentions with similar mistakes in BBN/train.json. BBN is manually annotated, so all of the mentions should have been labeled correctly.

Could you please clarify where this error comes from? Is it from the original dataset or from your data preprocessing procedure?
