bayrakmustafa / wikipedia-to-elastic

This project forked from aloneirew/wikipedia-to-elastic


Analyze and Export Wikipedia XML dump to ElasticSearch for use as knowledge resource (multilingual support)

License: Apache License 2.0

Java 97.79% Dockerfile 0.49% Shell 1.72%

wikipedia-to-elastic's Introduction


Wikipedia to ElasticSearch

This project generates a knowledge resource based on Wikipedia.
It also includes a multilingual parsing mechanism that enables parsing Wikipedia, Wikinews, Wikidata and other Wikimedia .bz2 dumps into an ElasticSearch index.

Supported languages: {English, French, Spanish, German, Chinese}
*Note: Relation integrity has been tested only for English; other languages might require some adjustments.

Table Of Contents


Introduction

Exploited Wiki Resources

Three different types of Wikipedia pages are used: {Redirect/Disambiguation/Title}, in order to extract six different semantic features for tasks such as identifying semantic relations, entity linking, cross-document co-reference, knowledge graphs, summarization, and others.

Extracted Relations Types

A list of Wikidata properties can extend the Wikipedia relations above (by running the Wikidata post-process described below).

Links for further details on those properties:


Prerequisites


Configuration

  • conf.json - Main project configuration
    "indexName": "enwiki_v3" (Set your desired Elasticsearch index name)
    "docType": "wikipage" (Set your desired Elasticsearch document type)
    "extractRelationFields": true (Whether to extract relation fields while processing the data; supported only for English Wikipedia)
    "insertBulkSize": 100 (Number of pages to bulk-insert into Elasticsearch on each iteration; this number was found to give the best performance)
    "mapping": "mapping.json" (Elastic mapping file; should point to src/main/resources/mapping.json)
    "setting": "en_map_settings.json" (Elastic settings file; currently supports {en, fr, es, de, zh})
    "host": "localhost" (Elastic host, where the Elastic instance is installed and running)
    "port": 9200 (Elastic port on the host; the Elastic default is 9200)
    "wikipediaDump": "dumps/enwiki-latest-pages-articles.xml.bz2" (Location of the downloaded Wikipedia .bz2 dump file)
    "scheme": "http" (Elastic host scheme; should probably stay unchanged)
    "shards": 1 (Number of Elastic shards to use)
    "replicas": 0 (Number of Elastic replicas to use)
    "lang": "en" (currently supports {en, fr, es, de, zh})
    "includeRawText": true (Include the Wikipedia page text, parsed and cleaned as much as possible)
    "relationTypes": ["Category", "Infobox", "Parenthesis", "PartName"] (Which relations to extract; full list at /src/main/java/wiki/data/relations/RelationType.java)
  • src/main/resources/mapping.json - Elastic wiki index mapping (should probably stay unchanged)
  • src/main/resources/{en,es,fr,de,zh}_map_settings.json - Elastic index settings (should probably stay unchanged)
  • src/main/resources/lang/{en,es,fr,de,zh}.json - language-specific configuration files
  • src/main/resources/stop_words/{en,es,fr,de,zh}.txt - language-specific stop-words lists
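Putting the documented keys together, a complete conf.json might look like this (a sketch using the default values listed above; adjust host, dump path and language to your setup):

```json
{
  "indexName": "enwiki_v3",
  "docType": "wikipage",
  "extractRelationFields": true,
  "insertBulkSize": 100,
  "mapping": "mapping.json",
  "setting": "en_map_settings.json",
  "host": "localhost",
  "port": 9200,
  "wikipediaDump": "dumps/enwiki-latest-pages-articles.xml.bz2",
  "scheme": "http",
  "shards": 1,
  "replicas": 0,
  "lang": "en",
  "includeRawText": true,
  "relationTypes": ["Category", "Infobox", "Parenthesis", "PartName"]
}
```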

Build, Run and Test

  • Make sure the Elastic process is running and active on your host (if running Elastic locally, the URL is http://localhost:9200/)

  • Checkout/Clone the repository

  • Put the wiki .xml.bz2 dump file (no need to extract the bz2 file!) in the dumps folder under the repository root.
    Recommendation: start with a small wiki dump and make sure you like what you get (or modify the configuration to meet your needs) before moving to a full-blown 15 GB dump export.

  • Make sure the Elastic settings in conf.json are set as expected (default localhost:9200)

  • From command line navigate to project root directory and run:
    ./gradlew clean build -x test
    You should get a message saying: BUILD SUCCESSFUL in 7s

  • Extract the build zip file created at build/distributions/WikipediaToElastic-1.0.zip

  • Run the process from command line:
    java -Xmx6000m -DentityExpansionLimit=2147480000 -DtotalEntitySizeLimit=2147480000 -Djdk.xml.totalEntitySizeLimit=2147480000 -jar build/distributions/WikipediaToElastic-1.0/WikipediaToElastic-1.0.jar

  • To test/query, you can run from terminal:
    curl -XGET 'http://localhost:9200/enwiki_v3/_search?pretty=true' -H 'Content-Type: application/json' -d '{"size": 5, "query": {"match_phrase": { "title.near_match": "Alan Turing"}}}'

  • This should return a Wikipedia page on Alan Turing
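The same test query can also be issued from Python's standard library; a minimal sketch, assuming Elastic runs at the default http://localhost:9200 and the index name is enwiki_v3 as configured above:

```python
import json
import urllib.request

def build_search_request(host="localhost", port=9200, scheme="http",
                         index="enwiki_v3", phrase="Alan Turing"):
    """Build the URL and JSON body for the title.near_match test query."""
    url = f"{scheme}://{host}:{port}/{index}/_search?pretty=true"
    body = {"size": 5, "query": {"match_phrase": {"title.near_match": phrase}}}
    return url, json.dumps(body).encode("utf-8")

def run_query():
    """Send the query; requires a running Elastic instance."""
    url, body = build_search_request()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Calling run_query() should return the same JSON response as the curl command above.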


Integrating Wikidata Attributes

Running this process requires a Wikipedia index (generated by the above process).

Wikidata Configuration Files

  • wikidata_conf.json - basic process configuration
    "indexName" : "enwiki_v3" (Set the Elasticsearch index to be modified)
    "docType" : "wikipage" (Set your desired Elasticsearch document type)
    "insertBulkSize": 100 (Number of pages to bulk-insert into Elasticsearch on each iteration; this number was found to give the best performance)
    "host" : "localhost" (Elastic host, where the Elastic instance is installed and running)
    "port" : 9200 (Elastic port on the host; the Elastic default is 9200)
    "wikidataDump" : "dumps/enwiki-latest-pages-articles.xml.bz2" (Location of the downloaded Wikidata .bz2 dump file)
    "scheme" : "http" (Elastic host scheme; should probably stay unchanged)
    "lang": "en" (should correspond to the Wikipedia index)

Wikidata Running and Testing

  • Make sure the Elastic process is running and active on your host (if running Elastic locally, the URL is http://localhost:9200/)

  • Make sure the wikidata_conf.json configuration is set as expected

  • Run the process from command line:
    java -cp WikipediaToElastic-1.0.jar wiki.wikidata.WikiDataFeatToFile
    The process will read the full Wikidata dump, parse it, extract the relations, and merge them into the corresponding Wikipedia data in the search index. The process might take a while to finish.

  • To test/query, you can run from terminal:
    curl -XGET 'http://localhost:9200/enwiki_v3/_search?pretty=true' -H 'Content-Type: application/json' -d '{"size": 5, "query": {"match_phrase": { "title.near_match": "Alan Turing"}}}'

This should return a Wikipedia page on Alan Turing, including the new Wikidata relations.


Usage

Elastic Page Query

Once the process is complete, two main query options are available (for more details and title query options, see mapping.json):

  • title.plain - fuzzy search (sorted)
  • title.keyword - exact match
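For example, the two options correspond to query bodies along these lines (a sketch assuming the default enwiki_v3 index; the exact analyzer behavior is defined in mapping.json). A fuzzy search on the analyzed title field:

```json
{ "size": 5, "query": { "match": { "title.plain": "alan turing" } } }
```

and an exact match on the keyword field:

```json
{ "size": 5, "query": { "term": { "title.keyword": "Alan Turing" } } }
```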

Generated Elastic Page Example

Pages are created with the following structure (see also "Fields & Attributes" below for more details):

Page Example (Extracted from Wikipedia disambiguation page):

{
  "_index": "enwiki_v3",
  "_type": "wikipage",
  "_id": "40573",
  "_version": 1,
  "_score": 20.925367,
  "_source": {
    "title": "NLP",
    "text": "{{wiktionary|NLP}}\n\n'''NLP''' may refer to:\n\n; .....",
    "relations": {
      "isPartName": false,
      "isDisambiguation": true,
      "disambiguationLinks": [
        "Natural language programming",
        "New Labour",
        "National Library of the Philippines",
        "Neuro linguistic programming",
        "Natural language processing",
        "National Liberal Party",
        "Natural Law Party",
        "National Labour Party",
        "Normal link pulses",
        "New Labour Party"
      ],
      "categories": [
        "disambiguation"
      ],
      "infobox": "",
      "titleParenthesis": [],
      "partOf": [],
      "aliases": [
        "LmxM36.1060"
      ],
      "hasPart": [],
      "hasEffect": [],
      "hasCause": [],
      "hasImmediateCause": []
    }
  }
}

Page Example (Extracted from Wikipedia redirect page):

{
  "_index": "enwiki_v3",
  "_type": "wikipage",
  "_id": "2577248",
  "_version": 1,
  "_score": 20.925367,
  "_source": {
    "title": "Nlp",
    "text": "#REDIRECT",
    "redirectTitle": "NLP",
    "relations": {
      "isPartName": false,
      "isDisambiguation": false
    }
  }
}

Fields & Attributes

JSON field                            Value            Comment
_id                                   Text             Wikipedia page id
_source.title                         Text             Wikipedia page title
_source.text                          Text             Wikipedia page text
_source.redirectTitle                 Text (optional)  Wikipedia page redirect title
_source.relations.infobox             Text (optional)  The article's infobox element
_source.relations.categories          List (optional)  Categories relation list
_source.relations.isDisambiguation    Bool (optional)  Whether this is a Wikipedia disambiguation page
_source.relations.isPartName          Bool (optional)  Whether the Wikipedia page title is a name description
_source.relations.titleParenthesis    List (optional)  List of disambiguation secondary links
_source.relations.aliases             List (optional)  Wikidata relation
_source.relations.partOf              List (optional)  Wikidata relation
_source.relations.hasPart             List (optional)  Wikidata relation
_source.relations.hasEffect           List (optional)  Wikidata relation
_source.relations.hasCause            List (optional)  Wikidata relation
_source.relations.hasImmediateCause   List (optional)  Wikidata relation
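As an illustration of how these fields might be consumed downstream, here is a hypothetical Python helper that collects the optional relation fields from a search hit; the sample is an abbreviated version of the NLP disambiguation page shown above:

```python
def extract_relations(hit):
    """Collect the optional relation fields from an Elastic search hit.

    Most relation attributes are optional, so missing fields simply
    come back as empty defaults.
    """
    relations = hit.get("_source", {}).get("relations", {})
    return {
        "is_disambiguation": relations.get("isDisambiguation", False),
        "is_part_name": relations.get("isPartName", False),
        "categories": relations.get("categories", []),
        "disambiguation_links": relations.get("disambiguationLinks", []),
        "aliases": relations.get("aliases", []),
    }

# Abbreviated hit, taken from the disambiguation page example above.
sample_hit = {
    "_id": "40573",
    "_source": {
        "title": "NLP",
        "relations": {
            "isPartName": False,
            "isDisambiguation": True,
            "disambiguationLinks": ["Natural language processing", "New Labour"],
            "categories": ["disambiguation"],
            "aliases": ["LmxM36.1060"],
        },
    },
}
```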
