hyukkyukang / baleen

This project forked from stanford-futuredata/baleen

Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval (NeurIPS'21)

License: MIT License

Baleen

Baleen is a state-of-the-art model for multi-hop reasoning, enabling scalable multi-hop search over massive collections for knowledge-intensive tasks like QA and claim verification.

Figure 1: Baleen's condensed retrieval architecture for multi-hop search.

Installation

The implementation of Baleen lives as part of the parent ColBERT repository (under its new_api branch).

After cloning, make sure you obtain the code for the submodule too:

git submodule update --init --recursive

Please follow the installation instructions from the submodule. Baleen has the same requirements as the parent ColBERT repository.

Download

We release preprocessed data and models for the HoVer benchmark. The scripts below will use the decompressed files, which you should save under a single datadir directory.

# Preprocessed HoVer data (3 MB compressed)
wget https://downloads.cs.stanford.edu/nlp/data/colbert/baleen/hover.tar.gz
tar -xvzf hover.tar.gz

# Preprocessed Wikipedia Abstracts 2017 collection (1 GB compressed)
wget https://downloads.cs.stanford.edu/nlp/data/colbert/baleen/wiki.abstracts.2017.tar.gz
tar -xvzf wiki.abstracts.2017.tar.gz

# Checkpoints for Baleen retrieval and condensing (8 GB compressed)
wget https://downloads.cs.stanford.edu/nlp/data/colbert/baleen/hover.checkpoints-v1.0.tar.gz
tar -xvzf hover.checkpoints-v1.0.tar.gz

Indexing

The script below indexes Wikipedia Abstracts 2017, which is the collection used for HoVer (and HotPotQA). It uses the Baleen model trained on HoVer.

Indexing uses the compression mechanism described in ColBERTv2 to reduce the number of bits per dimension from 16 (as in the paper) to 2. On HoVer, this causes only a marginal loss in retrieval quality, preserves sentence-level EM, and reduces the storage footprint by about 5x.

python -m hover_indexing --root /path/to/save/experiments/ --datadir /path/to/downloads/ --index wiki17.hover.2bit --nbits 2
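
To give a feel for what storing 2 bits per dimension means, here is a minimal toy sketch in plain Python. It is not the ColBERTv2 implementation: the bucket cutoffs, the reconstruction values, and the 128-dimension vector size are all assumptions chosen for illustration. Each value is mapped to one of four buckets (a 2-bit code), and four codes are packed into a single byte.

```python
from typing import List

# Hypothetical bucket boundaries and reconstruction values (assumed for
# illustration; a real system would derive these from the data).
CUTOFFS = [-0.5, 0.0, 0.5]            # 3 cutoffs -> 4 buckets -> 2 bits per value
CENTERS = [-0.75, -0.25, 0.25, 0.75]  # value used to reconstruct each bucket


def quantize(values: List[float]) -> bytes:
    """Map each value to a 2-bit bucket code and pack four codes per byte."""
    codes = [sum(v > c for c in CUTOFFS) for v in values]
    codes += [0] * (-len(codes) % 4)  # pad to a multiple of four codes
    packed = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        packed.append(a << 6 | b << 4 | c << 2 | d)
    return bytes(packed)


def dequantize(packed: bytes, n: int) -> List[float]:
    """Unpack 2-bit codes and return the first n reconstructed values."""
    codes: List[int] = []
    for byte in packed:
        codes.extend((byte >> shift) & 0b11 for shift in (6, 4, 2, 0))
    return [CENTERS[c] for c in codes[:n]]


packed = quantize([-0.9, -0.2, 0.1, 0.8])
restored = dequantize(packed, 4)

# An assumed 128-dimensional vector: 256 bytes at 16 bits, 32 bytes at 2 bits.
bytes_16bit = 128 * 16 // 8
bytes_2bit = len(quantize([0.0] * 128))
```

In this toy setting the per-value codes alone shrink 8x. The overall reduction reported above is smaller (about 5x), presumably because an index stores more than these codes.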

Multi-Hop Retrieval

The script below applies 4-hop inference using the queries (claims) in the HoVer dev set.

python -m hover_inference --root /path/to/save/experiments/ --datadir /path/to/downloads/ --index wiki17.hover.2bit

As the short script illustrates, the API is straightforward to use once the Baleen modules are loaded. Given a text query, multi-hop search can be conducted using baleen.search(query, num_hops=4).
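
The general shape of a multi-hop loop with condensing can be sketched as follows. This is a toy stand-in, not Baleen's code: ToyMultiHopSearcher, the word-overlap retriever, the first-sentence condenser, and the four-passage corpus are all hypothetical, and only the search(query, num_hops=4) call shape is taken from the text above. At each hop the sketch retrieves new passages, condenses them into short facts, and appends those facts to the query for the next hop.

```python
import re
from typing import Callable, List, Set

Retriever = Callable[[str, int], List[str]]
Condenser = Callable[[str, List[str]], List[str]]


def tokens(text: str) -> Set[str]:
    """Lowercased word set used by the toy retriever."""
    return set(re.findall(r"\w+", text.lower()))


class ToyMultiHopSearcher:
    """Hypothetical stand-in that mirrors only the search(query, num_hops) call shape."""

    def __init__(self, retrieve: Retriever, condense: Condenser, k: int = 1):
        self.retrieve, self.condense, self.k = retrieve, condense, k

    def search(self, query: str, num_hops: int = 4) -> List[str]:
        facts: List[str] = []
        seen: Set[str] = set()
        for _ in range(num_hops):
            # The next hop's query is the original query plus condensed facts.
            hop_query = " ".join([query, *facts])
            ranked = self.retrieve(hop_query, self.k + len(seen))
            # Skip passages from earlier hops (a choice made for this sketch).
            passages = [p for p in ranked if p not in seen][: self.k]
            seen.update(passages)
            for fact in self.condense(hop_query, passages):
                if fact not in facts:
                    facts.append(fact)
        return facts


CORPUS = [
    "Alice wrote the novel Redwood. It sold well.",
    "Redwood was adapted into a film by Bob. Critics were divided.",
    "Bob was born in Lyon.",
    "Carol plays chess.",
]


def overlap_retrieve(query: str, k: int) -> List[str]:
    """Rank passages by the number of words shared with the query."""
    q = tokens(query)
    scored = [(len(q & tokens(p)), p) for p in CORPUS]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [p for _, p in ranked[:k]]


def first_sentence_condense(query: str, passages: List[str]) -> List[str]:
    """Keep only the first sentence of each passage as its 'fact'."""
    return [p.split(". ")[0].rstrip(".") + "." for p in passages]


toy = ToyMultiHopSearcher(overlap_retrieve, first_sentence_condense)
facts = toy.search("Alice novel", num_hops=4)
```

With this toy corpus, the first hop finds the passage about the novel, and the facts it contributes let later hops reach the passages about the film and then about Bob, neither of which shares a word with the original query.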

If you face any issues, please open a new issue and we'll help you promptly!
