ColBERT: Contextualized Late Interaction over BERT

This is the reference implementation of the paper ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, to appear at SIGIR'20 (preprint).

UPDATE: See the new version 0.2 on this branch.

Dependencies

ColBERT requires Python 3 and PyTorch 1, and uses the HuggingFace Transformers library. You can create a conda environment with the required dependencies from the conda_requirements.txt file.

conda create --name <env> --file conda_requirements.txt

Data

This repository works directly with the data format of the MS MARCO Passage Ranking dataset. You will need the training triples (triples.train.small.tar.gz), the official top-1000 ranked lists for the dev set queries (top1000.dev), and the dev set relevant passages (qrels.dev.small.tsv). For indexing the full collection, you will also need the list of passages (collection.tar.gz).

To avoid passing the download directory on every command, you can set DEFAULT_DATA_DIR in src/parameters.py to your data directory.


Training

Training requires a list of <query, positive passage, negative passage> tab-separated triples. Out of the box, this works with MS MARCO Passage Ranking's triples.train.small.tsv (see above for Data).
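To make the triples format above concrete, here is a minimal sketch of a reader for such a file. It is not the code in src/train.py; the names Triple and read_triples are hypothetical, and the sample line is invented for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Triple(NamedTuple):
    query: str
    positive: str
    negative: str


def read_triples(handle: TextIO) -> Iterator[Triple]:
    """Yield one Triple per non-empty line of a tab-separated triples file."""
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(fields)}")
        yield Triple(*fields)


# Tiny in-memory sample standing in for triples.train.small.tsv (invented text).
sample = io.StringIO(
    "what is colbert\tColBERT is a ranking model.\tBert is a given name.\n"
)
triples = list(read_triples(sample))
print(triples[0].query)
```

With a real file, the same function can be fed open("triples.train.small.tsv", encoding="utf-8") in place of the in-memory sample.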

Example command:

python -m src.train --triples triples.train.small.tsv [--data_dir <path>] [--dim 128] [--maxsteps 400000] [--bsize 32] [--accum 2] [...]

Refer to src/train.py for the complete list of arguments and their defaults.

Pretrained model

To be released soon.


Evaluation

Before indexing into ColBERT, you can evaluate the model on re-ranking a pre-defined top-k set per query. This evaluation uses ColBERT on the fly: it computes document representations during query evaluation. For offline indexing and efficient ranking, see Indexing below.

This script requires the top-k list per query, provided as a tab-separated file in which every line contains a quadruple <query ID, passage ID, query text, passage text>. This is the format of MS MARCO's top1000.dev and top1000.eval. You can optionally supply the relevance judgements (qrels) for evaluation. This is a tab-separated file in which every line has a quadruple <query ID, 0, passage ID, 1>, like qrels.dev.small.tsv.
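As an illustration of the two file formats just described, the sketch below loads a top-k file and a qrels file into dictionaries. The function names load_topk and load_qrels and the sample IDs are hypothetical; src/test.py may organize this differently.

```python
import io
from collections import defaultdict
from typing import Dict, List, Set, TextIO, Tuple


def load_topk(handle: TextIO) -> Tuple[Dict[str, str], Dict[str, List[Tuple[str, str]]]]:
    """Group <query ID, passage ID, query text, passage text> lines by query ID."""
    queries: Dict[str, str] = {}
    candidates: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        qid, pid, query, passage = line.split("\t")
        queries[qid] = query
        candidates[qid].append((pid, passage))
    return queries, dict(candidates)


def load_qrels(handle: TextIO) -> Dict[str, Set[str]]:
    """Map each query ID to the passage IDs listed as relevant for it."""
    relevant: Dict[str, Set[str]] = defaultdict(set)
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Lines look like <query ID, 0, passage ID, 1>; only the IDs are kept.
        qid, _zero, pid, _label = line.split("\t")
        relevant[qid].add(pid)
    return dict(relevant)


# Invented sample rows standing in for top1000.dev and qrels.dev.small.tsv.
topk_file = io.StringIO(
    "q1\tp10\twhat is colbert\tColBERT is a ranking model.\n"
    "q1\tp11\twhat is colbert\tBert is a given name.\n"
)
qrels_file = io.StringIO("q1\t0\tp10\t1\n")

queries, candidates = load_topk(topk_file)
qrels = load_qrels(qrels_file)
print(queries["q1"], len(candidates["q1"]), qrels["q1"])
```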

Example command:

python -m src.test --checkpoint colbert.dnn --topk top1000.dev [--qrels qrels.dev.small.tsv] [--output_dir <path>] [...]

Refer to src/test.py for the complete list of arguments and their defaults.
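The README does not say which metrics the script reports when qrels are supplied. As an assumption, the sketch below computes MRR@10, the metric used by the MS MARCO Passage Ranking leaderboard, from ranked passage IDs and a qrels mapping. The name mrr_at_k and the sample data are hypothetical.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k.

    Averages over every query that has judgements; a query with no relevant
    passage in its top k (or no ranking at all) contributes 0.
    """
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        ranked = rankings.get(qid, [])[:k]
        for rank, pid in enumerate(ranked, start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)


# Invented example: q1 finds its relevant passage at rank 2, q2 finds none.
rankings = {"q1": ["p3", "p7", "p1"], "q2": ["p9", "p2"]}
qrels = {"q1": {"p7"}, "q2": {"p5"}}
score = mrr_at_k(rankings, qrels)
print(score)  # 0.25
```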


Indexing

For efficient retrieval and much faster re-ranking, you can precompute the document representations with ColBERT. This step requires a tab-separated file, whose every line contains a passage ID alongside the passage's content. Out of the box, this works with MS MARCO Passage Ranking's collection.tsv.
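The collection format described above can be read and grouped into batches along these lines. This is a sketch under assumptions, not the logic of src/index.py: read_collection and batches are hypothetical names, and the batch size here only loosely mirrors what --bsize might control.

```python
import io
from itertools import islice
from typing import Iterable, Iterator, List, TextIO, Tuple


def read_collection(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (passage ID, passage text) from a tab-separated collection file."""
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the first tab only, so tabs inside the passage are preserved.
        pid, sep, passage = line.partition("\t")
        if not sep:
            raise ValueError(f"line {line_no}: missing tab separator")
        yield pid, passage


def batches(items: Iterable[Tuple[str, str]], size: int) -> Iterator[List[Tuple[str, str]]]:
    """Group an iterable into lists of at most `size` items, preserving order."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Invented passages standing in for collection.tsv.
collection = io.StringIO("0\tfirst passage\n1\tsecond passage\n2\tthird passage\n")
grouped = list(batches(read_collection(collection), size=2))
print([len(batch) for batch in grouped])  # [2, 1]
```

Reading lazily and batching keeps memory use bounded, which matters for a collection the size of MS MARCO's.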

Example command:

python -m src.index --index <index_name> --collection collection.tsv --checkpoint colbert.dnn [--bsize <n>] [--output_dir <path>] [...]

Indexing uses all GPUs visible to the process. To limit those to, say, GPUs #0 and #2, you can prepend CUDA_VISIBLE_DEVICES="0,2" to the command.
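If you launch indexing from a Python script rather than a shell, the same restriction can be applied by setting the variable in the child process's environment. The sketch below only builds the command and environment; index_launch is a hypothetical helper and "my_index" is a placeholder index name.

```python
import os
from typing import Dict, List, Sequence, Tuple


def index_launch(
    gpu_ids: Sequence[int], index_name: str, collection: str, checkpoint: str
) -> Tuple[List[str], Dict[str, str]]:
    """Return (command, environment) for an indexing run limited to gpu_ids."""
    env = dict(os.environ)  # copy, so the current process is left untouched
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in gpu_ids)
    command = [
        "python", "-m", "src.index",
        "--index", index_name,
        "--collection", collection,
        "--checkpoint", checkpoint,
    ]
    return command, env


command, env = index_launch([0, 2], "my_index", "collection.tsv", "colbert.dnn")
print(env["CUDA_VISIBLE_DEVICES"])  # 0,2
# To actually start the run: subprocess.run(command, env=env, check=True)
```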

Using the index for efficient re-ranking

Example command:

python -m src.rerank --index <index_name> --checkpoint colbert.dnn --topk top1000.dev [--qrels qrels.dev.small.tsv]
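For intuition about what re-ranking with precomputed representations involves, here is a pure-Python sketch of late-interaction scoring as described in the ColBERT paper: each query token vector is matched to its most similar passage token vector, and those maxima are summed. It assumes the vectors are already normalized so that a dot product acts as the similarity, and it uses a plain dictionary as a stand-in index. The actual implementation uses batched tensor operations and its own index format; all names below are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs: Sequence[Vector], passage_vecs: Sequence[Vector]) -> float:
    """Sum over query tokens of the best dot product with any passage token."""
    return sum(max(dot(q, d) for d in passage_vecs) for q in query_vecs)


def rerank(
    query_vecs: Sequence[Vector],
    index: Dict[str, List[Vector]],
    candidate_ids: Sequence[str],
) -> List[Tuple[str, float]]:
    """Score each candidate passage and sort by score, highest first."""
    scored = [(pid, late_interaction_score(query_vecs, index[pid])) for pid in candidate_ids]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 2-dimensional token vectors, invented for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
toy_index = {
    "p1": [[1.0, 0.0], [0.0, 0.5]],
    "p2": [[0.2, 0.2]],
}
ranked = rerank(query, toy_index, ["p2", "p1"])
print([pid for pid, _ in ranked])  # ['p1', 'p2']
```

Because passage vectors do not depend on the query, they can be computed once at indexing time, leaving only the query encoding and these cheap comparisons for query time.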

Indexing for end-to-end retrieval from the full collection

To be released soon. This step uses faiss for fast vector-similarity search.

Using the index for end-to-end retrieval

To be released soon.

python -m src.retrieve --index <index_name> --checkpoint colbert.dnn [--qrels qrels.dev.small.tsv]
