sdm-tib / falcon2.0


Falcon 2.0 is a joint entity and relation linking tool over Wikidata.

Home Page: https://labs.tib.eu/falcon/falcon2/

License: MIT License

Python 100.00%
entity-linking relation-extraction entity-extraction wikidata dbpedia knowledge-graph natural-language-processing nlp

falcon2.0's Introduction

FALCON 2.0

Falcon 2.0 is an entity and relation linking tool over Wikidata (accepted at CIKM 2020). The full CIKM paper is available at https://doi.org/10.1145/3340531.3412777.

It leverages fundamental principles of English morphology (e.g., N-Gram tiling and N-Gram splitting) to accurately map entities and relations in short texts to resources in Wikidata. Falcon 2.0 is available as a Web API and can be queried using curl:

curl --header "Content-Type: application/json" \
  --request POST \
  --data '{"text":"Who painted The Storm on the Sea of Galilee?"}' \
  https://labs.tib.eu/falcon/falcon2/api?mode=long

The Web API is the first resource of this repository. The second, the background-knowledge dump, is described in the Elastic Search section below.
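The same request can be issued from Python using only the standard library. This is a sketch: the endpoint, mode parameter, and payload come from the curl example above, but the response schema is not documented here, so inspect the returned JSON yourself.

```python
import json
import urllib.request

def falcon_link(text, mode="long"):
    """POST a short text to the Falcon 2.0 Web API and return the parsed
    JSON response (same request as the curl example above)."""
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        f"https://labs.tib.eu/falcon/falcon2/api?mode={mode}",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

# e.g. falcon_link("Who painted The Storm on the Sea of Galilee?")
```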

Implementation

To begin with, install the libraries listed in the requirements.txt file:

pip install -r requirements.txt

The FALCON 2.0 code has three main parts: Elasticsearch setup, the linking algorithm, and evaluation.

Elastic Search and Background Knowledge

Before working with the Wikidata dump, we first need to connect to an Elasticsearch endpoint and a Wikidata SPARQL endpoint. The Elasticsearch endpoint is used to interact with our cluster through the Elasticsearch API. The Elasticsearch dump (also known as R2: Background Knowledge) for Falcon 2.0 can be downloaded from https://doi.org/10.6084/m9.figshare.11362883

To import the Elasticsearch dump, use elasticdump and execute the following commands:

elasticdump  --output=http://localhost:9200/wikidataentityindex/  --input=wikidataentity.json  --type=data

elasticdump  --output=http://localhost:9200/wikidatapropertyindex/  --input=wikidatapropertyindex.json  --type=data

To change your Elasticsearch endpoint, make the change in Elastic/searchIndex.py and Elastic/addIndex.py:

es = Elasticsearch(['http://localhost:9200'])

The Wikidata SPARQL endpoint lets us quickly search and analyze the large volume of data stored in the knowledge graph (here, Wikidata). To change the Wikidata endpoint, edit main.py:

wikidataSPARQL = " "
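A SPARQL lookup against such an endpoint can be sketched with the standard library alone. Note the endpoint below is the public Wikidata query service, used purely for illustration; main.py expects your own endpoint in the wikidataSPARQL variable.

```python
import json
import urllib.parse
import urllib.request

# Illustration only: the public endpoint stands in for whatever
# you configure in main.py's wikidataSPARQL variable.
wikidataSPARQL = "https://query.wikidata.org/sparql"

def sparql_select(query, endpoint=wikidataSPARQL):
    """Run a SPARQL SELECT query and return the JSON result bindings."""
    url = endpoint + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    req = urllib.request.Request(
        url, headers={"User-Agent": "falcon2-example/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["results"]["bindings"]

# e.g. sparql_select('SELECT ?p WHERE { wd:Q2 ?p ?o } LIMIT 5')
```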

We then create indices for property search and entity search over Wikidata. Refer to the following two functions in Elastic/addIndex.py for the code:

def propertyIndexAdd(): ...
def entitiesIndexAdd(): ...

Next, we execute a search query and retrieve the hits that match it. The search query is used to decide whether a mention is an entity or a property in Wikidata. Note that Elasticsearch uses JSON as the serialization format for documents. The query used to retrieve candidates from Elasticsearch is:

{
  "query": {
    "match" : { "label" : "operating income" }
  }
}
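The match query above maps directly onto Elasticsearch's REST _search endpoint. The following sketch runs it with the standard library; the index name is assumed from the elasticdump commands above, and a local cluster on port 9200 is assumed to be running.

```python
import json
import urllib.request

def match_query(mention):
    """Build the match query shown above for a given mention."""
    return {"query": {"match": {"label": mention}}}

def entity_candidates(mention, index="wikidataentityindex",
                      host="http://localhost:9200"):
    """Retrieve candidate matches for a mention from a local Elasticsearch
    cluster via its REST API (index name assumed from the elasticdump
    commands above)."""
    req = urllib.request.Request(
        f"{host}/{index}/_search",
        data=json.dumps(match_query(mention)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return [hit["_source"] for hit in json.load(resp)["hits"]["hits"]]

# e.g. entity_candidates("operating income") -> candidate Wikidata entities
```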

Search queries over Wikidata are implemented in Elastic/searchIndex.py. Refer to the following two functions in the same file for entity search and property search in Wikidata:

def entitySearch(query): ...
def propertySearch(query): ...

Algorithm

main.py contains the code for automatic entity and relation linking to resources in Wikidata using rule-based learning. Falcon 2.0 uses the same approach for the Wikidata knowledge graph as Falcon does for DBpedia (https://labs.tib.eu/falcon/). The rules that represent English morphology are maintained in a catalog; a forward-chaining inference process is performed on top of the catalog during extraction and linking. Falcon 2.0 also comprises several modules that identify and link entities and relations to the Wikidata knowledge graph. These modules implement POS Tagging, Tokenization & Compounding, N-Gram Tiling, Candidate List Generation, Matching & Ranking, Query Classifier, and N-Gram Splitting. The modules are reused from the implementation of Falcon.
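N-gram splitting in this context means breaking a mention into its contiguous token sub-sequences so each can be looked up independently. The following is an illustrative sketch of the idea, not the tool's actual implementation:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined back into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The Storm on the Sea of Galilee".split()
print(ngrams(tokens, 2))  # ['The Storm', 'Storm on', 'on the', ...]
```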

Evaluation

Usage

To run Falcon 2.0, call the function process_text_E_R(question), where question is the short text to be processed by Falcon 2.0.

For evaluating Falcon 2.0, we relied on three different question answering datasets, namely SimpleQuestion dataset for Wikidata, WebQSP-WD, and LC-QuAD 2.0.

For reproducing the results, "evaluateFalconAPI.py" and "evaluateFalconAPI_entities.py" can be used.

"evaluateFalconAPI_entities.py" evaluates entity linking.

"evaluateFalconAPI.py" evaluates entity and relation linking.

Experimental Results for Entity Linking

SimpleQuestions dataset

The SimpleQuestions dataset contains 5622 test questions that are answerable using Wikidata as the underlying knowledge graph. Falcon 2.0 reports a precision of 0.56, recall of 0.64, and F-score of 0.60 on this dataset.

LC-QuAD 2.0 dataset

LC-QuAD 2.0 contains 6046 test questions, most of which are complex (more than one entity and relation). On this dataset, Falcon 2.0 reports a precision of 0.50, recall of 0.56, and F-score of 0.53.

WebQSP-WD dataset

WebQSP-WD contains 1639 test questions with a single entity and relation per question. Falcon 2.0 outperforms all other baselines on this dataset, with the highest F-score (0.82), precision (0.80), and recall (0.84).

Experimental Results for Relation Linking

SimpleQuestions dataset

Falcon 2.0 reports a precision of 0.35, recall of 0.44, and F-score of 0.39 on the SimpleQuestions dataset for the relation linking task.

LC-QuAD 2.0

Falcon 2.0 reports a precision of 0.44, recall of 0.37, and F-score of 0.40 on the LC-QuAD 2.0 dataset.

Cite our work

@inproceedings{10.1145/3340531.3412777,
author = {Sakor, Ahmad and Singh, Kuldeep and Patel, Anery and Vidal, Maria-Esther},
title = {Falcon 2.0: An Entity and Relation Linking Tool over Wikidata},
year = {2020},
isbn = {9781450368599},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3340531.3412777},
doi = {10.1145/3340531.3412777},
booktitle = {Proceedings of the 29th ACM International Conference on Information & Knowledge Management},
pages = {3141--3148},
numpages = {8},
keywords = {wikidata, dbpedia, relation linking, nlp, english morphology, entity linking, background knowledge},
location = {Virtual Event, Ireland},
series = {CIKM '20}
}

falcon2.0's People

Contributors: ahmadsakor, anerypatel, kulsingh

falcon2.0's Issues

Handling `'s` in entity indexing

Had difficulty parsing the following:

print( process_text_E_R("Hong Kong's",rules) )

The resulting error is:

ValueError: 'Kong' is not in list

This seems to come from a mismatch: ["Hong", "Kong's"].index("Kong") fails because the token list keeps the possessive suffix.

I've tried a fix by adding a new rule in the various entity cleaning portions. Hoping to hear whether this would make sense with the rules and parsing. Thank you 😸

            for ent in entities:
                ent = ent.replace("?", "")
                ent = ent.replace(".", "")
                ent = ent.replace("!", "")
                ent = ent.replace("\\", "")
                ent = ent.replace("#", "")
                ent = ent.replace("'s", "")  # added new rule at line 439
                if token.text in ent:
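The mismatch is easy to reproduce in isolation (the token list shape below is assumed from the error message above):

```python
tokens = ["Hong", "Kong's"]

# The possessive suffix keeps "Kong" from being found in the token list
assert "Kong" not in tokens

# Stripping "'s", as the proposed rule does for entity strings,
# makes the lookup succeed
cleaned = [t.replace("'s", "") for t in tokens]
assert cleaned.index("Kong") == 1
```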

File not found

Hi, thanks for your effort in developing this useful tool!

I follow the instruction to create index by

    propertyIndexAdd()
    entitiesIndexAdd()

but got error
FileNotFoundError: [Errno 2] No such file or directory: '../data/dbpredicateindex.json'

I want to use falcon2 as a relation linking tool, what should I do?

Besides, I find the import speed is very slow when importing wikidataentity.json into Elasticsearch. Do you have any idea why?

Thanks.

Elasticdump for wikidata dump takes a long time

Hi, I've followed the instructions to use elasticdump to place the wikidata into elasticsearch. However, elasticdump has been running for a long time.

  • Is there an estimate of how long it will take for the 9 GB of data for just the entities?
  • Is there a smaller dataset that I can try this on?

Thanks.

Named entity recognition

Hi, does this project include named entity recognition? I'm new to this area. If so, could you tell me the names of the scripts that implement it?

Small query on the output format

Hi

Would like to raise two points/questions:

(1) Should the doc type be doc or _doc in the Elastic submodule?
The source code reads doc by default, but elasticdump seems to add _doc by default.
It's a small point, but thought it should be raised in case it affects adding new docs.

(2)
How to interpret the result?
Trying Falcon on random questions produces the following results. How do we interpret the integers that come after the list of links? Thank you.

>>>    process_text_E_R('Who is Michelle Obama?',rules)
>>>    process_text_E_R('Where is Gracht?',rules)
0
['Who is Michelle Obama?', [], [['<http://www.wikidata.org/entity/Q13133>', 'Michelle obama']], 0, 0, 0, 0]
1
['Where is Gracht?', [], [['<http://www.wikidata.org/entity/Q896611>', 'Gracht']], 0, 0, 0, 0]
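Judging only by the sample output above (the roles of the two inner lists and the trailing integers are inferred from the examples, not from documentation), the result is a plain Python list: the input text, a list for relations, a list of [URI, surface form] entity pairs, and four undocumented integers. A hedged unpacking sketch:

```python
# Shape taken from the sample output above; the meaning of the
# trailing integers is not documented.
result = ['Who is Michelle Obama?', [],
          [['<http://www.wikidata.org/entity/Q13133>', 'Michelle obama']],
          0, 0, 0, 0]

text, relations, entities = result[0], result[1], result[2]
for uri, surface_form in entities:
    print(uri.strip("<>"), "->", surface_form)
```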

FileNotFound Error

Hi @AhmadSakor,
When I set up this code, I got FileNotFoundError: [Errno 2] No such file or directory: 'datasets/results/test_api/falcon_lcquad2.csv' in the evaluateFalconAPI.py file. The same error occurred when running the evaluateFalconAPI_entities.py file (falcon_simple_test.csv not found). Please provide these CSV files or suggest a solution for these errors.

Some kind of entity sorting error

Hi,

I've encountered errors when querying entities of single digits e.g. Earth is Q2.

The error is logged below.

    for entity in sorted(raw , key=lambda x: (-x[3],-x[2],int(x[1][x[1].rfind("/")+2:-1])))[:k]:
ValueError: invalid literal for int() with base 10: ''

I've managed to fix this with the following indexing where the -1 in the slicing is removed.

    for entity in sorted(raw , key=lambda x: (-x[3],-x[2],int(x[1][x[1].rfind("/")+2:])))[:k]:

I believe this -1 unintentionally truncates the sorted ID by one digit at the end. For example:

1. 'Q2' -> ''
2. 'Q123' -> '12'

Hoping to hear whether this is a correct change or whether it could affect the overall package. Thanks 😄
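The truncation is easy to reproduce in isolation (assuming the stored URI has no trailing `>` character, which is what makes single-digit IDs collapse to an empty string):

```python
uri = "http://www.wikidata.org/entity/Q2"

# Original slice: the -1 drops the final character, leaving nothing
# for single-digit IDs like Q2
assert uri[uri.rfind("/") + 2:-1] == ""

# Fixed slice: keeps the full numeric part of the ID
assert uri[uri.rfind("/") + 2:] == "2"
```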
