

V N G O L

Meta repository. Several sub-projects are bundled into the ungol package.

Abstract

In retrieval applications, binary hashes are known to offer significant improvements in both memory and speed. We investigate the compression of sentence embeddings using a neural encoder-decoder architecture, which is trained by minimizing reconstruction error. Instead of employing the original real-valued embeddings, we use the latent representations in Hamming space produced by the encoder for similarity calculations. In quantitative experiments on several benchmarks for semantic similarity tasks, we show that our compressed Hamming embeddings yield a performance comparable to the uncompressed embeddings (Sent2Vec, InferSent, GloVe-BoW) at compression ratios of up to 256:1. We further demonstrate that our model strongly decorrelates input features, and that the compressor generalizes well when pre-trained on Wikipedia sentences. We publish the source code and all experimental results on GitHub.

Setup

Check out this repository. You should now have this folder structure:

└── ungol
    ├── eval
    ├── common
    ├── models
    └── similarity

To install all dependencies and the module itself, run make setup. To create a wheel, run make dist.

Full Run

The following explains a full run through the whole system, from training binary embeddings up to an evaluation on an information retrieval data set.

Training Binary Embeddings

The ungol.models module produces binary embedding codes via embcompr.py. It offers the training infrastructure, an implementation of the neural compressor model, and the routines for persisting codes and models.
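To give an idea of what such a compressor does, here is a minimal conceptual sketch of an encoder-decoder with a binarized latent code trained on reconstruction error. It is purely illustrative: all names, layer sizes and training details are hypothetical and not taken from the actual ungol.models implementation.

# Illustrative sketch only -- not the ungol.models implementation.
# An encoder maps real-valued embeddings to binary codes (Hamming space),
# a decoder reconstructs the input, and the model is trained on MSE.
import torch
import torch.nn as nn


class BinarySTE(torch.autograd.Function):
    """Binarize to {0, 1} in the forward pass, pass gradients straight through."""

    @staticmethod
    def forward(ctx, x):
        return (x > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


class ToyCompressor(nn.Module):
    def __init__(self, dim: int = 300, bits: int = 256):
        super().__init__()
        self.encoder = nn.Linear(dim, bits)
        self.decoder = nn.Linear(bits, dim)

    def forward(self, x):
        code = BinarySTE.apply(self.encoder(x))  # latent code in Hamming space
        return code, self.decoder(code)          # reconstruction of the input


model = ToyCompressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randn(64, 300)  # stand-in for a batch of real embeddings

for _ in range(10):
    optimizer.zero_grad()
    code, reconstruction = model(batch)
    loss = nn.functional.mse_loss(reconstruction, batch)
    loss.backward()
    optimizer.step()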

First enter the project, create the output directory and inspect the example configuration and possible options:

cd ../models
mkdir opt
less conf/embcompr.example.conf
./ungol/models/embcompr.py --help

You need to provide some embeddings. The embedding providers are defined in ungol.common.embed; you can implement your own if none exists for your specific format or service (a short illustrative loader follows at the end of this section). Now create your own configuration or use an existing one and train a model (from now on, a binary model with 256-bit codes is used as the running example):

./ungol/models/embcompr.py conf/embcompr.binary.conf

Various files and statistics are produced and persisted transparently by this program. Simply explore the directory given by the folder directive of the statistics configuration section (in this case opt/experiments/current/256x2/). Much more detailed information can be found in the README.md of ungol.models.
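The embedding providers mentioned above live in ungol.common.embed; their exact interface is defined there. Independent of that interface, the raw data a provider has to supply is essentially a vocabulary plus an embedding matrix. The following purely illustrative loader for embeddings in the common word2vec text format shows that shape of data; it is not the ungol.common.embed provider interface, and the file name is a placeholder.

# Illustrative only: load "<token> <v1> <v2> ..." lines into a vocabulary
# dict and a numpy matrix. Not the ungol.common.embed provider interface.
import numpy as np


def load_text_embeddings(path):
    vocab, vectors = {}, []
    with open(path, encoding="utf-8") as fd:
        for line in fd:
            parts = line.rstrip().split(" ")
            if len(parts) < 3:  # skip a possible "<count> <dim>" header line
                continue
            vocab[parts[0]] = len(vectors)
            vectors.append(np.asarray(parts[1:], dtype=np.float32))
    return vocab, np.vstack(vectors)


# vocab, mat = load_text_embeddings("wiki.de.vec")  # placeholder file name
# mat.shape -> (vocabulary size, embedding dimension)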

Producing a binary code file

A trained model can now be used to produce a code file for later use by the ungol index. The embcodr.py program in ungol.models handles these code files. Assume the model produced by conf/embcompr.binary.conf has been persisted to opt/experiments/current/256x2/compressor/model-2000.torch:

./ungol/models/embcodr.py --help
exp=opt/experiments/current/256x2
./ungol/models/embcodr.py --binary $exp/compressor/model-2000.torch 256x2

Now this writes three files:

codemap.model-2000.bin
codemap.model-2000.h5
codemap.model-2000.txt

The .bin file is the main binary exchange format with minimal memory footprint. The .h5 file saves the data in the 1-byte-array format used internally (and thus consumes much more memory). The .txt file is a human-readable version of the data.
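To see what a codemap .h5 file contains, it can be inspected with h5py. The sketch below only assumes a regular HDF5 layout; it lists whatever datasets are present rather than hard-coding any names.

# List all datasets of a codemap .h5 file with their shapes and dtypes.
# Only assumes a regular HDF5 layout; no dataset names are hard-coded.
import h5py


def show(name, obj):
    # visititems callback: print every dataset with its shape and dtype
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")


with h5py.File("codemap.model-2000.h5", "r") as fd:
    fd.visititems(show)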

Create k-NN files

Optionally, k-NN files can be created for different distance measures such as Euclidean, cosine or Hamming distance. These are used for analysis in the notebooks or, optionally, for index creation. The following shows how to produce such a file for the Hamming distance (this example uses the h5py embedding provider as defined in the configuration):

./ungol/models/analyze.py --help
exp=opt/experiments/current/256x2
ungol/models/analyze.py nn-hamming $exp/codemap.model-2000.h5 $exp/hamming.h5

To create such distance files for other distance measures in continuous space, simply provide the necessary --embed-* command line arguments (see --help) and choose from nn-manhattan, nn-euclidean and nn-cosine.
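For reference, the Hamming distance between two binary codes is simply the number of differing bits. Below is a minimal numpy sketch, assuming codes packed into uint8 arrays (as in the .bin exchange format); it does not reflect the actual analyze.py implementation.

# Hamming distance between binary codes stored as packed uint8 arrays:
# XOR the codes and count the set bits. Not the analyze.py implementation.
import numpy as np


def hamming(codes, query):
    # codes: (n, n_bytes) uint8, query: (n_bytes,) uint8 -> (n,) distances
    xor = np.bitwise_xor(codes, query)
    return np.unpackbits(xor, axis=1).sum(axis=1)


codes = np.random.randint(0, 256, size=(1000, 32), dtype=np.uint8)  # 1000 codes of 256 bit
dists = hamming(codes, codes[0])
knn = np.argsort(dists)[:10]  # indices of the 10 nearest neighbours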

Build Indexes

Now everything is prepared for building an index. The module ungol.index.setup offers the infrastructure to create instances of ungol.index.index.Index, which are used for calculating different document distance metrics.

SDA/FRA/SPIEGEL

The CLEF 2003 ad-hoc monolingual task works with a document corpus of around 300k German news articles. These articles must now be added to a new index. The eval project and its ungol.eval.retrieval module are used to do so. Reading in the corpus data is handled by dumpr, which parses the XML files. If you do not wish to use this approach, have a look at ungol.eval.retrieval.setup.do_ungol_setup_articles - it is quite straightforward to implement your own reader. You need a pickled dictionary of the word embedding indexes and a codemap as described in the previous section.

./ungol/retrieval/setup.py ungol-setup-articles \
    --vocabulary path/to/vocab.pickle \
    --ungol-codemap path/to/codemap.h5 \
    --files path/to/*.xml \
    --processes 7 \
    --out opt/indexes/fasttext.de-256x2.index.pickle
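The --vocabulary argument above expects a pickled dictionary of word embedding indexes. Its exact structure is defined in ungol.eval.retrieval; the sketch below only illustrates the general shape of such a file, assuming a mapping from token to embedding row index (the mapping direction is an assumption).

# Illustrative only: pickle a token -> embedding row index mapping.
# The structure actually expected by setup.py is defined in ungol.eval.retrieval.
import pickle

tokens = ["haus", "katze", "hund"]  # toy vocabulary
vocab = {token: idx for idx, token in enumerate(tokens)}

with open("vocab.pickle", "wb") as fd:
    pickle.dump(vocab, fd)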

For the evaluation, the query documents also need to be added to the index:

./ungol/retrieval/setup.py ungol-setup-topics \
    --files data/CLEF/truth/CLEF2003_ah-mono-de_topics.xml \
    --out opt/indexes/fasttext.de-256x2.index.pickle

Run an Evaluation

Most of the information is passed via command line arguments (a config file as in ungol.models may be implemented later, but there are more important things to do at the moment). I usually aggregate common command line arguments in variables (see eval/dot_me_honey.fish).

To run an evaluation on a randomly sampled but fixed dataset (created by eval/notes/dataset.ipynb) run:

# arguments: ground truth file, directory to write the results to,
# evaluate ungol with the given index and scoring function (rhwmd),
# retrieve a selection of 250 documents from the pre-selected dataset
./ungol/retrieval/evaluate.py \
    data/CLEF/truth/CLEF2003_ah-mono-de.txt \
    opt/current/ \
    --ungol \
    --ungol-index opt/indexes/fasttext.de-256.index.pickle \
    --ungol-fn rhwmd \
    -k 250 \
    --dataset opt/datasets/180921.10000.1.pickle

Additional Information

Logging

The whole library uses ungol.common.logger, which in turn uses the standard Python logging facility. It checks for two possible logging configurations:

  1. If (based on the cwd of execution) a file conf/logging.conf exists, it is used.
  2. If the environment variable UNGOL_LOG is set to a *.conf file, then this file is used. It takes precedence over conf/logging.conf if both exist.

