This is Nutrimatic (http://nutrimatic.org/usage.html).

To build the source, run "./build.py".  You will need the following installed:
   * Python
   * g++
   * libxml2 (ubuntu: apt-get install libxml2-dev; osx: pip install lxml)
   * libtre (ubuntu: apt-get install libtre-dev; osx: brew install tre)
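Before running the build, it can help to confirm that the command-line tools
are on your PATH. The following is a minimal sketch, not part of Nutrimatic:
check_tools is a hypothetical helper, and it only checks commands, not the
libxml2 and libtre libraries.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Nutrimatic).
# Prints "missing: NAME" for every command that is not on PATH and
# returns non-zero if at least one command is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example (assumes the interpreter is installed as "python"):
#   check_tools python g++ && ./build.py
```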

To do anything useful, you will need to build an index from Wikipedia.

1. Download the latest Wikipedia database dump (this is a ~13GB file!):

     wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

2. Extract the text from the articles using Wikipedia Extractor
   (this generates ~12GB of text, and can take several hours!):

     # See http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
     wget https://raw.githubusercontent.com/attardi/wikiextractor/master/WikiExtractor.py
     python WikiExtractor.py enwiki-latest-pages-articles.xml.bz2

   This will write many files named text/??/wiki_??.

3. Index the text (this generates ~50GB of data, and can also take hours!):

     find text -type f | xargs cat | bin/make-index wikipedia

   This will write many files named wikipedia.?????.index.
   (You can break this up; run make-index several times with different
   sets of input files, replacing "wikipedia" with unique names each time.)
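   One way to break the indexing up is to run make-index once per
   subdirectory of text/. This is only a sketch: index_in_batches, the
   MAKE_INDEX override and the "wikipedia-<dir>" names are illustrative
   assumptions, and any unique names should do.

```shell
#!/bin/sh
# Hypothetical helper: index each subdirectory of a text directory as its
# own batch. MAKE_INDEX defaults to bin/make-index; the "wikipedia-<dir>"
# batch names are an illustrative choice, not something the tools require.
index_in_batches() {
    text_dir=$1
    for dir in "$text_dir"/*/; do
        # Skip the literal pattern if the directory has no subdirectories.
        [ -d "$dir" ] || continue
        name="wikipedia-$(basename "$dir")"
        find "$dir" -type f | xargs cat | "${MAKE_INDEX:-bin/make-index}" "$name"
    done
}

# Example:
#   index_in_batches text
```

   With names like these, the file patterns in the merge step would need to
   match the new prefix, for example wikipedia-*.????$x.index.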

4. Merge the indexes; I normally do this in two stages:

     for x in 0 1 2 3 4 5 6 7 8 9
     do bin/merge-indexes 2 wikipedia.????$x.index wiki-merged.$x.index
     done

     bin/merge-indexes 5 wiki-merged.*.index wiki-merged.index

   There's nothing magical about this approach with 10 batches; you can merge
   the files any way you like. The numbers 2 and 5 are minimum phrase
   frequency cutoffs (how many times a string must occur to be included).
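   If you have only a few index files, some of the ten batches may match
   nothing, and the shell then passes the unexpanded pattern through as a
   file name. Here is a sketch of the first merge stage that skips such
   batches; merge_batches and the MERGE_INDEXES override are hypothetical
   names, and the cutoff of 2 is taken from the loop above.

```shell
#!/bin/sh
# Hypothetical helper: first-stage merge that skips empty batches.
# Arguments: input prefix (e.g. "wikipedia"), output prefix (e.g. "wiki-merged").
# MERGE_INDEXES defaults to bin/merge-indexes.
merge_batches() {
    in_prefix=$1
    out_prefix=$2
    for x in 0 1 2 3 4 5 6 7 8 9; do
        # Expand the pattern into the positional parameters.
        set -- "$in_prefix".????"$x".index
        # An unmatched pattern stays literal, so that "file" does not exist.
        [ -e "$1" ] || continue
        "${MERGE_INDEXES:-bin/merge-indexes}" 2 "$@" "$out_prefix.$x.index"
    done
}

# Example:
#   merge_batches wikipedia wiki-merged
```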

5. Enjoy your new index:

     bin/find-expr wiki-merged.index '<aciimnrttu>'

If you want to set up the web interface, write a short shell wrapper that runs
cgi-search.py with arguments pointing it at your binaries and data files, e.g.:

     #!/bin/sh

     export NUTRIMATIC_FIND_EXPR=/path/to/nutrimatic/bin/find-expr
     export NUTRIMATIC_INDEX=/path/to/nutrimatic/data/wiki-merged.index
     exec /path/to/nutrimatic/cgi-search.py

Then arrange for your web server to invoke that shell wrapper as a CGI script.

Have fun,

-- [email protected]
