meta-toolkit / meta

A Modern C++ Data Sciences Toolkit

Home Page: https://meta-toolkit.org

License: MIT License

CMake 1.68% Python 0.16% Ruby 0.01% Shell 0.14% C 0.28% C++ 97.74%
nlp nlp-parsing search-engine inverted-index pos-tag text-analysis text-analytics text-classification language-modeling graph-algorithms

meta's Introduction

MeTA: ModErn Text Analysis

Please visit our web page (https://meta-toolkit.org) for information and tutorials about MeTA!

Intro

MeTA is a modern C++ data sciences toolkit featuring

  • text tokenization, including deep semantic features like parse trees
  • inverted and forward indexes with compression and various caching strategies
  • a collection of ranking functions for searching the indexes
  • topic models
  • classification algorithms
  • graph algorithms
  • language models
  • CRF implementation (POS-tagging, shallow parsing)
  • wrappers for liblinear and libsvm (including libsvm dataset parsers)
  • UTF8 support for analysis on various languages
  • multithreaded algorithms

Documentation

Doxygen documentation can be found here.

Tutorials

We have walkthroughs for a few different parts of MeTA on the MeTA homepage.

Citing

If you used MeTA in your research, we would greatly appreciate a citation for our ACL demo paper:

@InProceedings{meta-toolkit,
  author    = {Massung, Sean and Geigle, Chase and Zhai, Cheng{X}iang},
  title     = {{MeTA: A Unified Toolkit for Text Retrieval and Analysis}},
  booktitle = {Proceedings of ACL-2016 System Demonstrations},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {91--96},
  url       = {http://anthology.aclweb.org/P16-4016}
}

Project setup

Mac OS X Build Guide

Mac OS X 10.6 or higher is required. You may have success with 10.5, but this is not tested.

You will need to have homebrew installed, as well as the Command Line Tools for Xcode (homebrew requires these as well, and it will prompt for them during install, or you can install them with xcode-select --install on recent versions of OS X).

Once you have homebrew installed, run the following commands to get the dependencies for MeTA:

brew update
brew install cmake jemalloc lzlib icu4c

To get started, run the following commands:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
CXX=clang++ cmake ../ -DCMAKE_BUILD_TYPE=Release -DICU_ROOT=/usr/local/opt/icu4c
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Ubuntu Build Guide

The directions here depend greatly on your installed version of Ubuntu. To check what version you are on, run the following command:

cat /etc/issue

Based on what you see, proceed with one of the guides below (Ubuntu 12.04 LTS, Ubuntu 14.04 LTS, or Ubuntu 15.10).

If your version is less than 12.04 LTS, your operating system is not supported (even by your vendor!) and you should upgrade to at least 12.04 LTS (or 14.04 LTS, if possible).

Ubuntu 12.04 LTS Build Guide

Building on Ubuntu 12.04 LTS requires more work than its more up-to-date 14.04 sister, but it can be done relatively easily. You will, however, need to install a newer C++ compiler from a ppa, and switch to it in order to build meta. We will also need to install a newer CMake version than is natively available.

Start by running the following commands to get the dependencies that we will need for building MeTA.

# this might take a while
sudo apt-get update
sudo apt-get install python-software-properties

# add the ppa that contains an updated g++
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update

# this will probably take a while
sudo apt-get install g++ g++-4.8 git make wget libjemalloc-dev zlib1g-dev

wget http://www.cmake.org/files/v3.2/cmake-3.2.0-Linux-x86_64.sh
sudo sh cmake-3.2.0-Linux-x86_64.sh --prefix=/usr/local

During CMake installation, you should agree to the license and then say "n" to including the subdirectory. You should be able to run the following commands and see the following output:

g++-4.8 --version

should print

g++-4.8 (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

and

/usr/local/bin/cmake --version

should print

cmake version 3.2.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

Once the dependencies are all installed, you should be ready to build. Run the following commands to get started:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
CXX=g++-4.8 /usr/local/bin/cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Ubuntu 14.04 LTS Build Guide

Ubuntu 14.04 has a recent enough GCC for building MeTA, but we'll need to add a ppa for a more recent version of CMake.

Start by running the following commands to install the dependencies for MeTA.

# this might take a while
sudo apt-get update
sudo apt-get install software-properties-common

# add the ppa for cmake
sudo add-apt-repository ppa:george-edison55/cmake-3.x
sudo apt-get update

# install dependencies
sudo apt-get install g++ cmake libicu-dev git libjemalloc-dev zlib1g-dev

Once the dependencies are all installed, you should double check your versions by running the following commands.

g++ --version

should output

g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

and

cmake --version

should output

cmake version 3.2.2

CMake suite maintained and supported by Kitware (kitware.com/cmake).

Once the dependencies are all installed, you should be ready to build. Run the following commands to get started:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Ubuntu 15.10 Build Guide

Ubuntu's non-LTS desktop offering in 15.10 has enough modern software in its repositories to build MeTA without much trouble. To install the dependencies, run the following commands.

sudo apt update
sudo apt install g++ git cmake make libjemalloc-dev zlib1g-dev

Once the dependencies are all installed, you should be ready to build. Run the following commands to get started:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Arch Linux Build Guide

Arch Linux consistently has the most up to date packages due to its rolling release setup, so it's often the easiest platform to get set up on.

To install the dependencies, run the following commands.

sudo pacman -Sy
sudo pacman -S clang cmake git icu libc++ make jemalloc zlib

Once the dependencies are all installed, you should be ready to build. Run the following commands to get started:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
CXX=clang++ cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Fedora Build Guide

This has been tested with Fedora 22+ (the oldest currently supported Fedora as of the time of writing). You may have success with earlier versions, but this is not tested. (If you're on an older version of Fedora, use yum instead of dnf for the commands given below.)

To get started, install some dependencies:

# These may be already installed
sudo dnf install make git wget gcc-c++ jemalloc-devel cmake zlib-devel

You should be able to run the following commands and see the following output:

g++ --version

should print

g++ (GCC) 5.3.1 20151207 (Red Hat 5.3.1-2)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

and

cmake --version

should print

cmake version 3.3.2

CMake suite maintained and supported by Kitware (kitware.com/cmake).

Once the dependencies are all installed, you should be ready to build. Run the following commands to get started:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system with the following command:

./unit-test --reporter=spec

CentOS Build Guide

MeTA can be built in CentOS 7 and above. CentOS 7 comes with a recent enough compiler (GCC 4.8.5), but too old a version of CMake. We'll thus install the compiler and related libraries from the package manager and install our own more recent cmake ourselves.

# install build dependencies (this will probably take a while)
sudo yum install gcc gcc-c++ git make wget zlib-devel epel-release
sudo yum install jemalloc-devel

wget http://www.cmake.org/files/v3.2/cmake-3.2.0-Linux-x86_64.sh
sudo sh cmake-3.2.0-Linux-x86_64.sh --prefix=/usr/local --exclude-subdir

You should be able to run the following commands and see the following output:

g++ --version

should print

g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

and

/usr/local/bin/cmake --version

should print

cmake version 3.2.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

Once the dependencies are all installed, you should be ready to build. Run the following commands to get started:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
/usr/local/bin/cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

EWS/EngrIT Build Guide

Note: Please don't do this if you are able to get MeTA working in any other possible way, as the EWS filesystem has a habit of being unbearably slow and increasing compile times by several orders of magnitude. For example, comparing the cmake, make, and unit-test steps on my desktop vs. EWS gives the following:

system       cmake time   make time    unit-test time
my desktop   0m7.523s     2m30.715s    0m36.631s
EWS          1m28s        11m28.473s   1m25.326s

If you are on a machine managed by Engineering IT at UIUC, you should follow this guide. These systems have software that is much too old for building MeTA, but EngrIT has been kind enough to package updated versions of research software as modules. The modules provided for GCC and CMake are recent enough to build MeTA, so it is actually mostly straightforward.

To set up your dependencies (you will need to do this every time you log back in to the system), run the following commands:

module load gcc
module load cmake/3.5.0

Once you have done this, double check your versions by running the following commands.

g++ --version

should output

g++ (GCC) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

and

cmake --version

should output

cmake version 3.5.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

If your versions are correct, you should be ready to build. To get started, run the following commands:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
CXX=`which g++` CC=`which gcc` cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Windows Build Guide

MeTA can be built on Windows using the MinGW-w64 toolchain with gcc. We strongly recommend using MSYS2 as this makes fetching the compiler and related libraries significantly easier than it would be otherwise, and it tends to have very up-to-date packages relative to other similar MinGW distributions.

Note: If you find yourself confused or lost by the instructions below, please refer to our visual setup guide for Windows which includes screenshots for every step, including updating MSYS2 and the MinGW-w64 toolchain.

To start, download the installer for MSYS2 from the linked website and follow the instructions on that page. Once you've got it installed, you should use the MinGW shell to start a new terminal, in which you should run the following commands to download dependencies and related software needed for building:

pacman -Syu git make patch mingw-w64-x86_64-{gcc,cmake,icu,jemalloc,zlib} --force

(the --force is needed to work around a bug with the latest MSYS2 installer as of the time of writing.)

Then, exit the shell and launch the "MinGW-w64 Win64" shell. You can obtain the toolkit and get started with:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
cmake .. -G "MSYS Makefiles" -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Generic Setup Notes

  • There are rules for clean, tidy, and doc. After you run the cmake command once, you can just run make as usual while developing; it will detect when the CMakeLists.txt file has changed and regenerate the Makefiles if it needs to.

  • To compile in debug mode, just replace Release with Debug in the appropriate cmake command for your OS above and rebuild using make after.

  • Don't hesitate to reach out on the forum if you encounter problems getting set up. We routinely build with a wide variety of compilers and operating systems through our continuous integration setups (travis-ci for Linux and OS X and Appveyor for Windows), so we can be fairly certain that things should build on nearly all major platforms.

meta's People

Contributors

31z4, canoefzh, domarps, hazimehh, siddshuk, skystrife, smassung, yxkemiya, zero323


meta's Issues

Check for malformed index on load?

Right now, if the index structure was created (I think all it takes is for the folder to exist?) but not finished, you get an exception thrown about not being able to open the postings file for the index.

This is programmatically fine, but maybe we should be doing something on that exception in the example applications (such as just forcing a re-indexing)?
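One way an example application could react to that exception is sketched below. This is only an illustration: load and rebuild are hypothetical callables standing in for whatever the application uses to open or create its index, and none of the names are MeTA's API.

```cpp
#include <cassert>
#include <exception>
#include <iostream>

// Tries `load` first. If it throws (for example because the postings file of
// a half-built index cannot be opened), reports the problem and falls back to
// `rebuild`. Both callables are hypothetical placeholders.
template <class Load, class Rebuild>
auto load_or_rebuild(Load load, Rebuild rebuild) -> decltype(rebuild())
{
    try
    {
        return load();
    }
    catch (const std::exception& ex)
    {
        std::cerr << "index load failed (" << ex.what() << "); re-indexing\n";
        return rebuild();
    }
}
```

A real version would probably also need to delete the partial index directory before rebuilding, which is left out here.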

Add ability to load classifiers from model files

Currently some of the classifiers save model files (like sgd), but they don't have a way of initializing from a pre-trained model. For sgd specifically, the constructor calls reset() immediately.

We should add another constructor that allows for loading a pre-trained classifier from its model file. We'll need to change the way that the model file is stored most likely, but this should be a relatively easy change (for sgd anyway).
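A rough sketch of what such a constructor could look like follows. The weight_model type and its text format (a count followed by one weight per line) are assumptions made for illustration; they are not the format sgd currently writes.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <limits>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical model type holding only a weight vector.
struct weight_model
{
    std::vector<double> weights;

    weight_model() = default;

    // Initializes from a pre-trained model instead of starting from scratch.
    explicit weight_model(std::istream& in)
    {
        std::size_t count = 0;
        if (!(in >> count))
            throw std::runtime_error{"model file: missing weight count"};
        weights.resize(count);
        for (auto& w : weights)
            if (!(in >> w))
                throw std::runtime_error{"model file: truncated weights"};
    }

    // Writes the count and then one weight per line, with enough digits for
    // doubles to round-trip exactly.
    void save(std::ostream& out) const
    {
        out.precision(std::numeric_limits<double>::max_digits10);
        out << weights.size() << '\n';
        for (double w : weights)
            out << w << '\n';
    }
};
```

Anything written by save() can be passed straight back to the stream constructor, so the default constructor remains the only path that starts untrained.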

Partition data more explicitly during indexing?

Based on our observations for the performance of indexing speed with the reddit dataset, I wonder whether or not we would get better performance if we pre-split the data before passing the documents off to the analyzer in each thread as opposed to the current situation where the threads all potentially compete for the mutex that surrounds the shared queue.

Basically, what I'm imagining is lazily-loading the document's content instead of loading it when it's read from the corpus, and then just creating a huge vector of all of the documents, partitioning it into num_threads parts, and then having each thread tokenize just that segment. Perhaps that can eliminate the contention for the mutex? This is likely to be a bigger concern when documents are super small.
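To make the idea concrete, here is a minimal sketch of the partitioning, assuming documents can be addressed by index once the big vector exists. partition_ranges and for_each_partitioned are hypothetical helper names, and the tokenize callback stands in for the analyzer.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Splits the index range [0, n) into `parts` contiguous half-open ranges
// whose sizes differ by at most one.
inline std::vector<std::pair<std::size_t, std::size_t>>
partition_ranges(std::size_t n, std::size_t parts)
{
    assert(parts > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t base = n / parts;
    const std::size_t extra = n % parts; // the first `extra` ranges get one more
    std::size_t begin = 0;
    for (std::size_t p = 0; p < parts; ++p)
    {
        const std::size_t end = begin + base + (p < extra ? 1 : 0);
        ranges.emplace_back(begin, end);
        begin = end;
    }
    return ranges;
}

// Calls tokenize(doc_index, thread_index) for every document, with each
// thread walking only its own range. No shared queue or mutex is involved
// because the ranges never overlap.
inline void
for_each_partitioned(std::size_t num_docs, std::size_t num_threads,
                     const std::function<void(std::size_t, std::size_t)>& tokenize)
{
    const auto ranges = partition_ranges(num_docs, num_threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < ranges.size(); ++t)
    {
        const auto range = ranges[t];
        workers.emplace_back([&tokenize, range, t] {
            for (std::size_t i = range.first; i < range.second; ++i)
                tokenize(i, t);
        });
    }
    for (auto& worker : workers)
        worker.join();
}
```

A static split like this trades the mutex contention for possible load imbalance when document lengths vary a lot, so it would need measuring against the current queue.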

Reduce POSIX/Linux assumptions

We make some strong assumptions currently about the underlying system's capabilities. We should try to relax and/or encapsulate these assumptions so that we can build on multiple platforms (e.g., Windows and BSD).

Things I can remember that are assuming POSIX or Linux:

  • file descriptors
  • mmap()
  • system() in unit-test for deleting a directory recursively
  • fork()/waitpid() in unit-test (Windows is probably going to be a pain here...)
  • endian-ness assumptions in the index files (e.g., disk_vector)
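For the last item, one common way to remove the endianness assumption is to fix the on-disk byte order and convert explicitly. The sketch below assumes little-endian storage for 64-bit values; the function names are hypothetical and disk_vector does not currently work this way.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Encodes a 64-bit value as little-endian bytes, whatever the host byte order.
inline std::array<unsigned char, 8> to_little_endian(std::uint64_t value)
{
    std::array<unsigned char, 8> bytes{};
    for (std::size_t i = 0; i < bytes.size(); ++i)
        bytes[i] = static_cast<unsigned char>((value >> (8 * i)) & 0xFF);
    return bytes;
}

// Decodes bytes written by to_little_endian back into a 64-bit value.
inline std::uint64_t from_little_endian(const std::array<unsigned char, 8>& bytes)
{
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i)
        value |= static_cast<std::uint64_t>(bytes[i]) << (8 * i);
    return value;
}
```

Because the shifts operate on values rather than memory, the same code gives identical files on big-endian and little-endian hosts.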

ngram_pos_analyzer using CRF

Change ngram_pos_analyzer to use MeTA's CRF for POS tagging. Trained model can be specified in the config file.

Might also want to look into a general analyzer function extract_sentences since diff_analyzer (and the future tree_analyzer) requires this method. extract_sentences could return either a sequence::sequence or lm::sentence. It would also make sense to convert between the two.

Add unit-tests to travis-ci integration.

Right now, travis-ci builds with gcc 4.8 and clang 3.3. This is a good start, but it would be great to extend it to also run the unit tests.

There are a few problems that need to be tackled to do this. First, we need some way of getting the dataset(s) used in the unit-tests onto the travis-ci machine. A public webserver that hosts these would probably be a good idea (can we put them on timan103 or another department server?), since then we can just wget them.

Second, the current unit-test framework reports the success/failure of the test only on the command line via text. It ignores the return code of the child processes that are running each individual test. Ideally, we'd like the unit-test executable to also return 0 on success and 1 on failure.
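The return-code part could look roughly like the sketch below. named_test and run_all are hypothetical names, and the tests run in-process here rather than in forked children as the current framework does.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical test record: a name plus a body that returns true on success.
struct named_test
{
    std::string name;
    std::function<bool()> body;
};

// Runs every test, prints a line per test, and returns 0 if all passed and 1
// otherwise, so the value can be returned directly from main().
inline int run_all(const std::vector<named_test>& tests)
{
    std::size_t failures = 0;
    for (const auto& test : tests)
    {
        const bool passed = test.body();
        std::cout << test.name << (passed ? " [ OK ]" : " [ FAIL ]") << '\n';
        if (!passed)
            ++failures;
    }
    return failures == 0 ? 0 : 1;
}
```

With main() returning run_all(tests), travis-ci would mark the build as failed whenever any test fails.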

reformulate language model rankers

It may be possible to simplify the formula for the language model retrieval methods. Additionally, we should look into a fast approximate log implementation.
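As a starting point for the approximate log, here is a sketch of the classic trick of reading the exponent and mantissa bits of a float directly. It assumes 32-bit IEEE-754 floats and positive, normal inputs, and it underestimates log2 by at most about 0.086; whether that error is acceptable for ranking would need to be tested.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit IEEE-754 float");

// Approximates log2(x) for positive, normal x by treating the raw bit
// pattern as a fixed-point number: the exponent field supplies the integer
// part and the mantissa field approximates the fractional part linearly.
inline float fast_log2(float x)
{
    assert(x > 0.0f);
    std::uint32_t bits = 0;
    std::memcpy(&bits, &x, sizeof bits); // well-defined type punning
    // Divide by 2^23 (mantissa width) and remove the exponent bias of 127.
    return static_cast<float>(bits) * (1.0f / 8388608.0f) - 127.0f;
}

// Natural log via log(x) = log2(x) * ln(2).
inline float fast_log(float x)
{
    return fast_log2(x) * 0.69314718f;
}
```

The result is exact at powers of two and least accurate roughly midway between them; a small polynomial correction on the mantissa would tighten it at some extra cost.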

Doxygen all the files

Make sure everything is commented correctly.

Sean:

  • index
  • analyzers/filters
  • util (self-authored)
  • model
  • corpus
  • io
  • test

Chase:

  • classifiers
  • topics
  • logging
  • utf
  • util (self-authored)
  • parallel
  • caching
  • cluster

Adjust unit-test timeouts for Debug mode

We could get fancy and look at #ifdef NDEBUG, or just increase these across the board.

These are the unit tests that fail for me in debug mode because of timeouts:

 winnow-cv-file                                   [ FAIL ] Time limit exceeded
 winnow-split-file                                [ FAIL ] Time limit exceeded

 winnow-cv-line                                   [ FAIL ] Time limit exceeded
 winnow-split-line                                [ FAIL ] Time limit exceeded

 ranker-dirichlet-prior                           [ FAIL ] Time limit exceeded
 ranker-jelinek-mercer                            [ FAIL ] Time limit exceeded
 ranker-okapi-bm25                                [ FAIL ] Time limit exceeded
 ranker-pivoted-length                            [ FAIL ] Time limit exceeded
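The #ifdef NDEBUG option could be as small as the sketch below. The multiplier of 5 is an arbitrary assumption, and scaled_timeout is a hypothetical helper rather than something in the current unit-test framework.

```cpp
#include <cassert>
#include <chrono>

// Release builds define NDEBUG, so they keep the original limits; debug
// builds get more headroom. The factor 5 is a guess, not a measured value.
#ifdef NDEBUG
constexpr int timeout_multiplier = 1;
#else
constexpr int timeout_multiplier = 5;
#endif

// Scales a time limit that was tuned for release mode.
inline std::chrono::seconds scaled_timeout(std::chrono::seconds release_limit)
{
    return release_limit * timeout_multiplier;
}
```

Each test would then pass its existing limit through scaled_timeout() instead of hard-coding a debug-specific number.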

Separate learning algorithms from models

As a general refactoring, I think it would be good to separate the learning (or inference) algorithms from the models themselves.

For example, sgd could be separated out into a linear classifier model + the sgd learning algorithm. Once the model is learned, we don't really care about the algorithm that was used to learn it. This also will help separate out the logic for learning from the logic for doing things with a learned model.

This can likely also apply to the topic models, where fundamentally we have lda as a model, and then four different inference algorithms for it.
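A minimal sketch of the split for the linear case follows. All names are hypothetical, the features are dense for brevity, and the learner is a plain hinge-loss SGD chosen only to illustrate the separation, not a copy of the existing sgd classifier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Training example with a label of +1 or -1.
struct labeled_example
{
    std::vector<double> features;
    int label;
};

// The model: knows how to score and predict, nothing about how it was learned.
struct linear_model
{
    std::vector<double> weights;
    double bias = 0.0;

    double score(const std::vector<double>& x) const
    {
        assert(x.size() == weights.size());
        double sum = bias;
        for (std::size_t i = 0; i < x.size(); ++i)
            sum += weights[i] * x[i];
        return sum;
    }

    int predict(const std::vector<double>& x) const
    {
        return score(x) >= 0.0 ? 1 : -1;
    }
};

// The learner: a free function that produces a model. Another algorithm
// could return the same linear_model type.
inline linear_model train_hinge_sgd(const std::vector<labeled_example>& data,
                                    std::size_t dims, double learning_rate,
                                    std::size_t epochs)
{
    linear_model model;
    model.weights.assign(dims, 0.0);
    for (std::size_t epoch = 0; epoch < epochs; ++epoch)
    {
        for (const auto& ex : data)
        {
            // Update only when the example is inside the margin.
            if (ex.label * model.score(ex.features) < 1.0)
            {
                for (std::size_t i = 0; i < dims; ++i)
                    model.weights[i] += learning_rate * ex.label * ex.features[i];
                model.bias += learning_rate * ex.label;
            }
        }
    }
    return model;
}
```

Code that only classifies would depend on linear_model alone, which is the point of the refactoring.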

Option to evenly split labels in classifier

Add a config option under [classifier], something like even-split = true (false by default). When enabled, it finds the label with the fewest documents and randomly truncates every other label's documents to that count.

It should be split during classifier runtime (still index the whole corpus).
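The truncation itself could be sketched as follows, assuming the documents are already grouped by label. even_split and doc_id are hypothetical names used for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <string>
#include <vector>

using doc_id = std::size_t; // stand-in for the real document id type

// Shuffles each label's documents and keeps only as many as the smallest
// label has, so every label ends up with the same number of documents.
inline std::map<std::string, std::vector<doc_id>>
even_split(std::map<std::string, std::vector<doc_id>> by_label, std::mt19937& rng)
{
    if (by_label.empty())
        return by_label;

    std::size_t smallest = by_label.begin()->second.size();
    for (const auto& entry : by_label)
        smallest = std::min(smallest, entry.second.size());

    for (auto& entry : by_label)
    {
        std::shuffle(entry.second.begin(), entry.second.end(), rng);
        entry.second.resize(smallest);
    }
    return by_label;
}
```

Since this works on document ids at classifier runtime, the index itself still covers the whole corpus.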

Use CTest for unit testing

This may address some of our concerns with making our current unit-test framework cross-platform.

CTest is part of CMake and is configured within CMakeLists.txt. Here's an example:

enable_testing()
add_test(name-of-test executable argument1 argument2 argument3)
set_tests_properties(name-of-test PROPERTIES TIMEOUT 5) # 5 second time limit

Then, to run, you can issue either make test or ctest on the command line. For MSVC, it will add a target for building the tests.

My proposal, then, is basically to eliminate the fork() calls in unit_test.cpp (make debug mode default) and convert our existing run_test calls to add_test() in the CMakeLists.txt. We'd need to enhance the harness to allow for a single test to run, but I think this can be done elegantly.

This does not deprecate the unit testing framework as a whole as we still need something to do the actual unit testing for us, but it would deprecate the timeout/signal handling parts.

Thoughts? You can test this right now by doing something like:

enable_testing()
add_test(classifier-tests unit-test classifiers)

in CMakeLists.txt, regenerating, and running ctest or make test.

Remove zip/tarball downloads from website

Since GitHub's zip/tarball downloads currently don't include submodules, they're worse than useless and seem to be causing confusion.

Unfortunately, there will still be the zip download option on the repository main page (and I don't see an immediate way of disabling that), so we will still have to make it clear that you need to check out from git to compile from source.

Add progress reporting to filesystem::copy_file()

Alternatively, we could have another function, filesystem::copy_file_with_progress() or something.

Basically, there's what seems like a long "hang" when making a forward_index from several gigabyte files of libsvm data because it has to copy over the file first. It would be nice if we could give progress output while this is happening.

This can be done with filesystem::file_size() to get the number of total bytes, reading the file in with ifstream::read(), using ifstream::gcount() to get the actual number of bytes read in each chunk-read, and then using printing::progress::operator() appropriately to signal the current number of bytes that have been processed (or, perhaps better, the number of "chunks").
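Put together, the chunked copy might look like the sketch below. It uses only the standard library: the total size comes from seeking to the end of the input rather than from filesystem::file_size(), and a plain callback takes the place of printing::progress, so every name here is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

using progress_fn = std::function<void(std::uint64_t copied, std::uint64_t total)>;

// Copies `source` to `dest` in chunks, calling on_progress after each chunk.
// Returns false (and removes a partial destination) if anything goes wrong.
inline bool copy_file_with_progress(const std::string& source,
                                    const std::string& dest,
                                    const progress_fn& on_progress,
                                    std::size_t chunk_size = 1 << 20)
{
    assert(chunk_size > 0);
    // Open at the end to learn the total size, then rewind.
    std::ifstream in{source, std::ios::binary | std::ios::ate};
    if (!in)
        return false;
    const auto total = static_cast<std::uint64_t>(in.tellg());
    in.seekg(0, std::ios::beg);

    std::ofstream out{dest, std::ios::binary | std::ios::trunc};
    if (!out)
        return false;

    std::vector<char> buffer(chunk_size);
    std::uint64_t copied = 0;
    // read() fails on the final short chunk, so gcount() is checked as well.
    while (in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()))
           || in.gcount() > 0)
    {
        const std::streamsize got = in.gcount();
        out.write(buffer.data(), got);
        copied += static_cast<std::uint64_t>(got);
        on_progress(copied, total);
    }

    out.flush();
    if (!out || copied != total)
    {
        out.close();
        std::remove(dest.c_str());
        return false;
    }
    return true;
}
```

Reporting bytes rather than chunks keeps the callback independent of the chunk size, at the cost of larger numbers in the progress output.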

Phrase Analyzer

Use the output of a chunker to create features based on strings of words. This will be particularly useful when combined with the topic modeling algorithms to create phrase-based models.

feature scaling for classifiers

Option to scale feature vectors to a predefined range (e.g. [0,1]). This should help SGD converge in fewer iterations and keep features with large raw values from dominating distance calculations.
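For reference, a minimal min-max scaling sketch is shown below. It assumes dense feature vectors of equal length for simplicity (the real feature vectors are sparse), and scale_features is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Rescales every feature (column) linearly so that its minimum maps to `lo`
// and its maximum maps to `hi`. Constant features are set to `lo`.
inline void scale_features(std::vector<std::vector<double>>& rows,
                           double lo = 0.0, double hi = 1.0)
{
    if (rows.empty())
        return;
    const std::size_t dims = rows.front().size();
    for (std::size_t j = 0; j < dims; ++j)
    {
        double min_value = rows.front()[j];
        double max_value = rows.front()[j];
        for (const auto& row : rows)
        {
            assert(row.size() == dims);
            min_value = std::min(min_value, row[j]);
            max_value = std::max(max_value, row[j]);
        }
        const double range = max_value - min_value;
        for (auto& row : rows)
            row[j] = range > 0.0 ? lo + (hi - lo) * (row[j] - min_value) / range
                                 : lo;
    }
}
```

The minimum and maximum would have to be computed on the training set and reused for test documents, which this sketch does not cover.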

Winnow broken under gcc

 winnow-cv-file                                   [ FAIL ] Time limit exceeded
 winnow-split-file                                acc: 0.119048
[ FAIL ] Assertion failed: mtx.accuracy() > min_accuracy (/home/chase/projects/meta/test/classifier_test.h:46)

and

 winnow-cv-line                                   [ FAIL ] Time limit exceeded
 winnow-split-line                                acc: 0.119048
[ FAIL ] Assertion failed: mtx.accuracy() > min_accuracy (/home/chase/projects/meta/test/classifier_test.h:46)

(I added a printout of the accuracy of the classifier to the unit tests, which shows it's clearly confused about something.)

Is clang/libc++ giving us some behavior that's not guaranteed by the standard that we're depending on?

This is with everything compiled with g++ in release mode.

Streamline installation instructions

The top of the README/front page of the website should be modified to

  1. Contain concrete minimum system requirements for Ubuntu as well as OSX
  2. Contain instructions on installing the ICU dependency on the latest Ubuntu as well as OSX.
  3. Contain instructions on getting CMake >= 3.0 on Ubuntu as well as OSX
  4. Make it more immediately clear how to check out from git with all submodules (e.g. git clone https://github.com/meta-toolkit/meta.git --recursive)
  5. Make it more immediately clear how to choose between g++ and clang++ as a compiler (basically being Linux == g++, OSX == clang++ unless you really know what you're doing)

"Favoritism" shown for Ubuntu/OSX because those seem to be the most common compatible platforms we're getting installation questions about. People not on those distros are probably more likely to be able to figure out how to install the dependencies based on the instructions given for Ubuntu (but we might eventually want to include a section for other distros...)

Allow line-corpus format corpora to be segmented

For example, our yelp dataset has two segments: score (which is everything) and sentiment (which is just the extreme reviews). These subsets can be independently indexed if using file_corpus, but not when using line_corpus.

Not sure if this is worth it or not, but it could potentially be useful for the same reasons it was useful for file_corpus.

Reduce amount of full-scans required for forward_index creation from libsvm data

The current implementation of forward_index scans through the libsvm formatted file at least three times:

  1. To get the number of documents in the index
  2. To set the document byte positions for the index
  3. To initialize all document-level metadata for the index

When the data file is huge, this is really overkill. I think we ought to be able to combine these all into one pass through the file, and then just initialize all of the disk_vectors after we have created their files for them. Since we're single-threaded and just going through the data file in sequential order, we should be able to just write the numbers/doubles in binary to the appropriate file and then read them back in with disk_vector, I think. This would be a pretty big improvement for, say, the mnist8m dataset.
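A single-pass scan could collect all three pieces of information at once, roughly as sketched below. The libsvm_scan struct and scan_libsvm function are hypothetical, the per-document metadata is reduced to a label and a feature count, and '\n' line endings are assumed when computing byte positions.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Everything the three separate scans currently collect, gathered at once.
struct libsvm_scan
{
    std::uint64_t num_docs = 0;
    std::vector<std::uint64_t> byte_offsets;   // start of each document's line
    std::vector<std::string> labels;           // first token of each line
    std::vector<std::uint64_t> feature_counts; // number of id:value pairs
};

// Reads the libsvm-formatted stream once, sequentially.
inline libsvm_scan scan_libsvm(std::istream& in)
{
    libsvm_scan result;
    std::string line;
    std::uint64_t offset = 0;
    while (std::getline(in, line))
    {
        const std::uint64_t line_start = offset;
        offset += line.size() + 1; // +1 for the '\n' consumed by getline

        std::istringstream fields{line};
        std::string label;
        if (!(fields >> label))
            continue; // blank line: not a document

        std::uint64_t count = 0;
        std::string pair;
        while (fields >> pair)
            ++count;

        result.byte_offsets.push_back(line_start);
        result.labels.push_back(label);
        result.feature_counts.push_back(count);
    }
    result.num_docs = result.labels.size();
    return result;
}
```

The vectors here live in memory; for something like mnist8m the same loop could append each value in binary to the backing files and open them with disk_vector afterwards, as suggested above.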
