Label Noise Reduction in Entity Typing (KDD'16)

License: GNU General Public License v3.0


Heterogeneous Partial-Label Embedding

Source code and data for SIGKDD'16 paper Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding.

Given a text corpus with entity mentions detected and heuristically labeled by distant supervision, this code performs (1) label noise reduction over distant supervision, and (2) learning type classifiers over de-noised training data. For example, check out PLE's output on Tech news.

An end-to-end tool (corpus to typed entities) is under development. Please keep track of our updates.

Performance

Performance of fine-grained entity type classification over the Wiki (Ling & Weld, 2012) dataset. We applied PLE to clean the training data and ran FIGER (Ling & Weld, 2012) over the de-noised labeled data to train type classifiers (hence the name FIGER + PLE for our final system).

Method                                     Accuracy   Macro-F1   Micro-F1
HYENA (Yosef et al., 2012)                 0.288      0.528      0.506
WSABIE (Yogatama et al., 2015)             0.480      0.679      0.657
FIGER (Ling & Weld, 2012)                  0.474      0.692      0.655
FIGER + All Filter (Gillick et al., 2014)  0.453      0.648      0.582
FIGER + PLE (Ren et al., 2016)             0.599      0.763      0.749

System Output

The output on the BBN dataset can be found here. Each line is a sentence from the BBN test data, with entity mentions and their fine-grained entity types identified.

Dependencies

  • Python 2.7, g++
  • Python library dependencies
$ pip install pexpect unidecode six requests protobuf
$ cd DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip
$ rm stanford-corenlp-full-2016-10-31.zip

Data

We processed three public datasets (using our data pipeline) into our JSON format. We ran Stanford NER on the training sets to detect entity mentions, and performed distant supervision using DBpedia Spotlight to assign type labels:

  • Wiki (Ling & Weld, 2012): 1.5M sentences sampled from 780k Wikipedia articles. 434 news sentences are manually annotated for evaluation. 113 entity types are organized into a 2-level hierarchy (download JSON)
  • OntoNotes (Weischedel et al., 2011): 13k news articles, 77 of which are manually labeled for evaluation. 89 entity types are organized into a 3-level hierarchy. (download JSON)
  • BBN (Weischedel et al., 2005): 2,311 WSJ articles that are manually annotated using 93 types in a 2-level hierarchy. (download JSON)
  • Type hierarchies for each dataset are included.
  • Please put the data files in the corresponding subdirectories under PLE/Data/.
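Each line of a processed dataset file is one sentence as a JSON object. The sketch below shows one way to read mentions out of this format; the schema (tokens, senid, fileid, and mentions with token-level start/end offsets and labels) is assumed from the BBN example shown in the issues, and the helper name is ours, not part of the pipeline.

```python
import json

def iter_mentions(line):
    """Yield (surface string, type labels) for each mention in one
    JSON-encoded sentence. Offsets are token-level [start, end)."""
    sent = json.loads(line)
    for m in sent["mentions"]:
        surface = " ".join(sent["tokens"][m["start"]:m["end"]])
        yield surface, m["labels"]

# A minimal fabricated example in the same schema:
example = ('{"tokens": ["Obama", "visited", "Paris", "."], "senid": 0, '
           '"fileid": "doc1", "mentions": '
           '[{"start": 2, "end": 3, "labels": ["/GPE/CITY", "/GPE"]}]}')
for surface, labels in iter_mentions(example):
    print(surface, labels)  # Paris ['/GPE/CITY', '/GPE']
```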

Makefile

We have included compiled binaries. If you need to re-compile hple.cpp under your own g++ environment:

$ cd PLE/Model/ple/; make

Default Run

Run PLE for the label noise reduction task on the BBN dataset:

$ java -mx4g -cp "DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
$ ./run.sh  
  • run.sh contains parameters for running on all three datasets.

Parameters - run.sh

Dataset to run on.

Data="BBN"

Evaluation

Evaluate prediction results (from the classifier trained on de-noised data) over the test data:

python Evaluation/evaluation.py BBN hple hete_feature
  • python Evaluation/evaluation.py -DATA(BBN/ontonotes/FIGER) -METHOD(hple/...) -EMB_MODE(hete_feature)
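The reported numbers follow the strict accuracy, macro-F1, and micro-F1 measures of Ling & Weld (2012) over per-mention label sets. The sketch below is a simplified illustration of those definitions, not the repository's evaluation.py.

```python
def evaluate(pred, gold):
    """Strict accuracy, macro-F1, micro-F1 over per-mention label sets.
    pred/gold: parallel lists of sets of type labels (one set per mention).
    Simplified sketch of the Ling & Weld (2012) measures."""
    n = len(gold)
    # Strict accuracy: the full predicted label set must match exactly.
    strict = sum(p == g for p, g in zip(pred, gold)) / n
    # Macro: average per-mention precision/recall, then combine into F1.
    mp = sum(len(p & g) / len(p) for p, g in zip(pred, gold) if p) / n
    mr = sum(len(p & g) / len(g) for p, g in zip(pred, gold)) / n
    macro_f1 = 2 * mp * mr / (mp + mr) if mp + mr else 0.0
    # Micro: pool correct-label counts over all mentions.
    tp = sum(len(p & g) for p, g in zip(pred, gold))
    total = sum(len(p) for p in pred) + sum(len(g) for g in gold)
    micro_f1 = 2 * tp / total if total else 0.0
    return strict, macro_f1, micro_f1

pred = [{"/GPE", "/GPE/CITY"}, {"/PERSON"}]
gold = [{"/GPE", "/GPE/CITY"}, {"/ORGANIZATION"}]
print(evaluate(pred, gold))
```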

Reference

Please cite the following paper if you find the code or datasets useful:

@inproceedings{ren2016label,
  title={Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding},
  author={Ren, Xiang and He, Wenqi and Qu, Meng and Voss, Clare R and Ji, Heng and Han, Jiawei},
  booktitle={Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  pages={1825--1834},
  year={2016},
  organization={ACM}
}

ple's People

Contributors: ellenmellon, little8hwq, shanzhenren

ple's Issues

Reproducibility

Hi, nice project! I have some questions regarding reproducibility.

I ran the pipeline using ./run.sh (I changed the hyperparameters according to the instructions in this script.).
Here is what I got:

For OntoNotes

Evaluate on test data...
Predicted labels (embedding):
prediction:9514, ground:9514
accuracy: 0.46766869876
macro_precision, macro_recall, macro_f1: 0.71447340761 0.590795143998 0.646774820366
micro_precision, micro_recall, micro_f1: 0.684940833959 0.515314774253 0.588141602293

For BBN

Evaluate on test data...
Predicted labels (embedding):
prediction:13276, ground:13276
accuracy: 0.641398011449
macro_precision, macro_recall, macro_f1: 0.813840398621 0.73504585342 0.77243891306
micro_precision, micro_recall, micro_f1: 0.695182970784 0.724010775481 0.709304081359

Looks like the experimental results in Table 9 in the paper are not reproducible, especially on the OntoNotes dataset. Do you have any insight about what's going on here?

For Wiki

Running the pipeline on Wiki seems to have a memory issue. I'm using the pre-compiled hple binary; at the Heterogeneous Partial-Label Embedding stage, it quickly consumes all the memory (16 GB RAM) on my machine. Here is the log:

Heterogeneous Partial-Label Embedding...
Mode: b
Reading nodes from file: Intermediate/Wiki/mention.txt, DONE!
Node size: 2669531
Node dims: 50
Reading nodes from file: Intermediate/Wiki/feature.txt, DONE!
Node size: 1913511
Node dims: 50
Reading nodes from file: Intermediate/Wiki/type.txt, DONE!
Node size: 128
Node dims: 50
Reading edges from file: Intermediate/Wiki/mention_feature.txt, DONE!
Edge size: 78770560
./run.sh: line 20: 19369 Killed                  Model/ple/hple -data $Data -mode bcd -size 50 -negatives 10 -iters 50 -threads 30 -lr 0.25 -alpha 0.0001

Look forward to your reply. Thanks in advance!

Reproducibility after updating run.sh

Hi, thanks for sharing! I have seen issue #3, and the run.sh I am using now has been updated with the correct parameters from your newest run.sh. But I still encountered the same problem mentioned in that issue.

I still cannot get the experimental results in Table 9 in the paper.

And here is what I got:

For BBN dataset:

Evaluate on test data...
Predicted labels (embedding):
prediction:13276, ground:13276
accuracy: 0.655302802049
macro_precision, macro_recall, macro_f1: 0.80969198831 0.738642569298 0.772537128728
micro_precision, micro_recall, micro_f1: 0.691044043291 0.730320393923 0.710139554854

For OntoNotes dataset:

Evaluate on test data... 
Predicted labels (embedding): 
prediction:9514, ground:9514 
accuracy: 0.507778011352
macro_precision, macro_recall, macro_f1: 0.747326308598 0.616188643052 0.675451310714
micro_precision, micro_recall, micro_f1: 0.719988123515 0.535425351516 0.614140119916

I have already run the program several times, but it does not seem to get better results. Do you have any insight about what's going on here?
Look forward to your reply. Thanks!

bbn entity mention has negative index

Why does the BBN test.json have negative indices for entity mentions? The following is an example.
{"tokens": ["In", "1973", ",", "Wells", "Fargo", "&", "amp", ";", "Co.", "of", "San", "Francisco", "launched", "the", "Gold", "Account", ",", "which", "included", "free", "checking", ",", "a", "credit", "card", ",", "safe-deposit", "box", "and", "travelers", "checks", "for", "a", "$", "3", "monthly", "fee", "."], "senid": 24, "mentions": [{"start": -1, "labels": ["/ORGANIZATION/CORPORATION", "/ORGANIZATION"], "end": -1}, {"start": 10, "labels": ["/GPE/CITY", "/GPE"], "end": 12}], "fileid": "WSJ0085"}
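If such records get in the way, one defensive option is to drop mentions whose offsets fall outside the sentence's token range before further processing. This is a workaround sketch written for this issue, not an official fix, and the helper name is ours.

```python
import json

def drop_invalid_mentions(line):
    """Parse one JSON-encoded sentence and remove mentions whose
    token-level [start, end) offsets fall outside the sentence.
    Defensive workaround for records like the one above."""
    sent = json.loads(line)
    n = len(sent["tokens"])
    sent["mentions"] = [m for m in sent["mentions"]
                        if 0 <= m["start"] < m["end"] <= n]
    return sent

# Fabricated example: one invalid mention (negative offsets), one valid.
line = ('{"tokens": ["a", "b", "c"], "senid": 0, "fileid": "x", "mentions": '
        '[{"start": -1, "end": -1, "labels": ["/X"]}, '
        '{"start": 1, "end": 3, "labels": ["/Y"]}]}')
print(drop_invalid_mentions(line)["mentions"])  # only the valid mention remains
```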
