word2vec-wiki

Generate word/phrase embeddings from Wikipedia articles.

These notes (kept for my own reference) document the steps.

1 - Download the latest Wiki dump

The dumps are listed at https://dumps.wikimedia.org/enwiki/. The file needed is the one whose name ends in pages-articles.xml.bz2.
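As a rough sketch, the download can be scripted in Python 3. The URL pattern used here (the "latest" alias and a file named enwiki-latest-pages-articles.xml.bz2) is an assumption about how the dump site is usually laid out, and dump_url and download are hypothetical helper names; check the directory listing before relying on it.

```python
import shutil
import urllib.request

BASE_URL = "https://dumps.wikimedia.org"


def dump_url(wiki="enwiki", date="latest"):
    """Build the assumed URL of the pages-articles dump for a wiki and date."""
    name = f"{wiki}-{date}-pages-articles.xml.bz2"
    return f"{BASE_URL}/{wiki}/{date}/{name}"


def download(url, dest):
    """Stream the response to disk so the whole file is never held in memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


# Example (a multi-gigabyte download, so it is left commented out):
# download(dump_url(), "pages-articles.xml.bz2")
```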

2 - Extract plaintext from Wikitext

Use WikiExtractor.py to convert the wikitext in the dump into plain text.

$ python WikiExtractor.py -l -ns 0 --no-templates -o [output_folder] --processes 16 pages-articles.xml.bz2

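In the command above, -l preserves links; -ns 0 only accepts Wikipedia pages in namespace 0, which are main articles rather than categories or other page types; --no-templates skips template expansion; -o sets the output folder; and --processes sets the number of worker processes.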
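The extracted files can then be read back one article at a time. The sketch below is in Python 3 and assumes the usual WikiExtractor layout, in which each article is wrapped in a <doc id="..." url="..." title="..."> ... </doc> block; iter_docs is a hypothetical helper, not part of WikiExtractor, so verify the format against the output of your version.

```python
import re

# Assumed shape of the opening line that WikiExtractor writes for each article.
DOC_OPEN_RE = re.compile(r'<doc id="([^"]*)" url="([^"]*)" title="([^"]*)">')


def iter_docs(lines):
    """Yield (title, body_lines) for every <doc ...> ... </doc> block."""
    title, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        opened = DOC_OPEN_RE.match(line)
        if opened:
            # Start of a new article: remember its title, reset the body.
            title, body = opened.group(3), []
        elif line.startswith("</doc>"):
            if title is not None:
                yield title, body
            title, body = None, []
        elif title is not None and line.strip():
            # Keep non-empty lines that belong to the current article.
            body.append(line)
```

Any iterable of lines works as input, so an open file object for one of the extracted files can be passed directly.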

3 - Prepare training corpus for word2vec tools

Extract sentences from Wikipedia pages into the following format: one sentence = one line; words already preprocessed and separated by whitespace.
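To make the target format concrete, here is a minimal Python 3 sketch that turns one sentence into one whitespace-separated line. It is not the tokenizer used by wiki2vec_corpus.py; the regex, the to_corpus_line name, and the lower / no_punct parameters are illustrative assumptions that loosely mirror the --lower and --no_punct options.

```python
import re

# A word is a run of word characters; any other non-space character
# becomes its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def to_corpus_line(sentence, lower=False, no_punct=False):
    """Return the sentence as a single line of space-separated tokens."""
    tokens = TOKEN_RE.findall(sentence)
    if no_punct:
        # Drop tokens that contain no letter or digit.
        tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
    if lower:
        tokens = [t.lower() for t in tokens]
    return " ".join(tokens)
```

Writing one such line per sentence gives a file with one sentence per line and whitespace-separated words.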

$ python2.7 wiki2vec_corpus.py -h
usage: wiki2vec_corpus.py [-h] -folder FOLDER -output_folder OUTPUT_FOLDER
                          [-output_prefix OUTPUT_PREFIX] [-nproc NPROC]
                          [--add_wiki_title] [--keep_anchor] [--no_punct]
                          [--lower] [--debug]

optional arguments:
  -h, --help            show this help message and exit
  -folder FOLDER        path to Wikipedia extracted by WikiExtractor
  -output_folder OUTPUT_FOLDER
                        folder to save outpus
  -output_prefix OUTPUT_PREFIX
                        output prefix
  -nproc NPROC          # processes
  --add_wiki_title      whether to export Wiki title in the sentence
  --keep_anchor         if export wiki title, whether to keep anchor text
  --no_punct            whether to remove punctuations
  --lower               lower case
  --debug

Note: when --add_wiki_title is set, the Wikipedia title is preserved in addition to the anchor text.

Every link to another Wikipedia article is replaced by WIKI/{link}, where {link} is the title of the linked article with spaces replaced by underscores.

For example:
    [[ Barack Obama | B.O ]] is the president of [[USA]]
is transformed into:
    WIKI/Barack_Obama B.O is the president of WIKI/USA USA
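The transformation in this example can be sketched in Python 3 as follows. This only illustrates the mapping on the [[target | anchor]] notation shown above and is not the code in wiki2vec_corpus.py; the regex, the replace_links name, and the keep_anchor parameter are assumptions made for the sketch.

```python
import re

# Matches [[target]] or [[target | anchor]], tolerating spaces around each part.
LINK_RE = re.compile(r"\[\[\s*([^\[\]|]+?)\s*(?:\|\s*([^\[\]]*?)\s*)?\]\]")


def replace_links(sentence, keep_anchor=True):
    """Replace each wiki link with a WIKI/Title token, optionally keeping the anchor."""

    def _sub(match):
        title = match.group(1)
        # A link without an explicit anchor displays its own title.
        anchor = match.group(2) or title
        token = "WIKI/" + "_".join(title.split())
        return f"{token} {anchor}" if keep_anchor else token

    return LINK_RE.sub(_sub, sentence)


print(replace_links("[[ Barack Obama | B.O ]] is the president of [[USA]]"))
# WIKI/Barack_Obama B.O is the president of WIKI/USA USA
```

Passing keep_anchor=False drops the anchor text and leaves only the WIKI/ tokens in the sentence.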
