Coder Social home page Coder Social logo

rocketqa's Introduction

RocketQA

In recent years, the dense retrievers based on pre-trained language models have achieved remarkable progress. To facilitate more developers using cutting edge technologies, this repository provides an easy-to-use toolkit for running and fine-tuning the state-of-the-art dense retrievers, namely RocketQA. This toolkit has the following advantages:

  • State-of-the-art: It provides well-trained RocketQA models, which achieve SOTA performance on many dense retrieval datasets. And it will continue to update the latest models.
  • First-Chinese-model: It provides the first open source Chinese dense retrieval model, which is trained on millions of manual annotation data from DuReader.
  • Easy-to-use: By integrating this toolkit with JINA, developers can build an end-to-end question answering system with several lines of code.

Installation

We provide two installation methods: Python Installation Package and Docker Environment

Install with Python Package

First, install PaddlePaddle.

# GPU version:
$ pip install paddlepaddle-gpu

# CPU version:
$ pip install paddlepaddle

Second, install rocketqa package:

$ pip install rocketqa

NOTE: this toolkit MUST be running on Python3.6+ with PaddlePaddle 2.0+.

Install with Docker

docker pull rocketqa/rocketqa

docker run -it docker.io/rocketqa/rocketqa bash

Getting Started

Refer to the examples below, you can build your own Search Engine with several lines of code.

Running with JINA

JINA is a cloud-native neural search framework to build SOTA and scalable deep learning search applications in minutes. Here is a simple example to build a Search Engine based on JINA and RocketQA.

cd examples/jina_example
pip3 install -r requirements.txt

# Index: Encodes and indexes text, then starts a searching service
python3 app.py index toy_data/test.tsv

# Query: Encodes query and searches for answer, returns candidates ranked by relevance score
python3 app.py query_cli

Please view JINA example to know more.

Running with FAISS

We also provide a simple example built on Faiss.

cd examples/faiss_example/
pip3 install -r requirements.txt

# Index: Encodes and indexes text
python3 index.py en ../marco.tp.1k marco_index

# Start service
python3 rocketqa_service.py en ../marco.tp.1k marco_index

# Request: Encodes query and searches for answer, returns candidates ranked by relevance score
python3 query.py

API

RocketQA provide two types of models, ERNIE-based dual encoder for answer retrieval and ERNIE-based cross encoder for answer re-ranking. For running RocketQA models and your own checkpoints, you can use the following functions.

Load model

Returns the names of the available RocketQA models. To know more about the available models, please see the code comment.

Returns the model specified by the input parameter. It can initialize both dual encoder and cross encoder. By setting input parameter, you can load either RocketQA models returned by "available_models()" or your own checkpoints.

Dual encoder

Dual-encoder returned by "load_model()" supports the following functions:

Given a list of queries, returns their representation vectors encoded by model.

Given a list of paragraphs and their corresponding titles (optional), returns their representations vectors encoded by model.

Given a list of queries and paragraphs (and titles), returns their matching scores (dot product between two representation vectors).

Cross encoder

Cross-encoder returned by "load_model()" supports the following function:

Given a list of queries and paragraphs (and titles), returns their matching scores (probability that the paragraph is the query's right answer).

Examples

Following the examples below, you can run RocketQA models and your own checkpoints.

Run RocketQA Model

To run RocketQA models, you should set the parameter model in 'load_model()' with RocketQA model name return by 'available_models()'.

import rocketqa

query_list = ["trigeminal definition"]
para_list = [
    "Definition of TRIGEMINAL. : of or relating to the trigeminal nerve.ADVERTISEMENT. of or relating to the trigeminal nerve. ADVERTISEMENT."]

# init dual encoder
dual_encoder = rocketqa.load_model(model="v1_marco_de", use_cuda=True, device_id=0, batch_size=16)

# encode query & para
q_embs = dual_encoder.encode_query(query=query_list)
p_embs = dual_encoder.encode_para(para=para_list)
# compute dot product of query representation and para representation
dot_products = dual_encoder.matching(query=query_list, para=para_list)

Run Self-development Model

To run your own checkpoints, you should write a config file, and set the parameter model in 'load_model()' with the path of the config file.

import rocketqa

query_list = ["交叉验证的作用"]
title_list = ["交叉验证的介绍"]
para_list = ["交叉验证(Cross-validation)主要用于建模应用中,例如PCR 、PLS回归建模中。在给定的建模样本中,拿出大部分样本进行建模型,留小部分样本用刚建立的模型进行预报,并求这小部分样本的预报误差,记录它们的平方加和。"]

# conf
ce_conf = {
    "model": ${YOUR_CONFIG},     # path of config file
    "use_cuda": True,
    "device_id": 0,
    "batch_size": 16
}

# init cross encoder
cross_encoder = rocketqa.load_model(**ce_conf)

# compute matching score of query and para
ranking_score = cross_encoder.matching(query=query_list, para=para_list, title=title_list)

${YOUR_CONFIG} is a JSON format file.

{
    "model_type": "cross_encoder",
    "max_seq_len": 160,
    "model_conf_path": "en_large_config.json",  # path relative to config file
    "model_vocab_path": "en_vocab.txt",         # path relative to config file
    "model_checkpoint_path": "marco_cross_encoder_large", # path relative to config file
    "joint_training": 0
}

News

  • August 26, 2021: RocketQA v2 was accepted by EMNLP 2021.
  • May 5, 2021: PAIR was accepted by ACL 2021
  • March 11, 2021: RocketQA v1 was accepted by NAACL 2021.

Citations

If you find RocketQA v1 models helpful, feel free to cite our publication RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering

@inproceedings{rocketqa_v1,
    title="RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering",
    author="Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu and Haifeng Wang",
    year="2021",
    booktitle = "In Proceedings of NAACL"
}

If you find PAIR models helpful, feel free to cite our publication PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval

@inproceedings{rocketqa_pair,
    title="PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval",
    author="Ruiyang Ren, Shangwen Lv, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang and Ji-Rong Wen",
    year="2021",
    booktitle = "In Proceedings of ACL Findings"
}

If you find RocketQA v2 models helpful, feel free to cite our publication RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking

@inproceedings{rocketqa_v2,
    title="RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking",
    author="Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang and Ji-Rong Wen",
    year="2021",
    booktitle = "In Proceedings of EMNLP"
}

License

This repository is provided under the Apache-2.0 license.

Contact Information

For help or issues using RocketQA, please submit a Github issue.

For other communication or cooperation, please contact Jing Liu ([email protected]) or scan the following QR Code.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.