
splade's Introduction

SPLADE

Links: paper · blog · Hugging Face · model weights


This repository contains the code to perform training, indexing and retrieval for SPLADE models. It also includes everything needed to launch evaluation on the BEIR benchmark.

TL;DR: SPLADE is a neural retrieval model that learns query/document sparse expansion via the BERT MLM head and sparse regularization. Sparse representations have several advantages over dense approaches: efficient use of inverted indexes, explicit lexical match, and interpretability. They also seem to generalize better on out-of-domain data (BEIR benchmark).
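Concretely, a simplified view of the representation (our notation, roughly following the v2 paper): for an input sequence t with MLM logits w_{ij} (input token i, vocabulary term j), the weight of vocabulary term j is

$w_j = \max_{i \in t} \log\bigl(1 + \mathrm{ReLU}(w_{ij})\bigr)$

and a sparsity regularizer (e.g. FLOPS) pushes most of these weights to zero.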

Benefiting from recent advances in training neural retrievers, our v2 models rely on hard-negative mining, distillation and better Pre-trained Language Model initialization to further increase their effectiveness, on both in-domain (MS MARCO) and out-of-domain (BEIR benchmark) evaluation.

Finally, by introducing several modifications (query specific regularization, disjoint encoders etc.), we are able to improve efficiency, achieving latency on par with BM25 under the same computing constraints.

Weights for models trained under various settings can be found on the Naver Labs Europe website, as well as on Hugging Face. Please bear in mind that SPLADE is more a class of models than a single model: depending on the regularization magnitude, we can obtain different models (from very sparse to models doing intense query/doc expansion) with different properties and performance.

splade: a spork that is sharp along one edge or both edges, enabling it to be used as a knife, a fork and a spoon.


Getting started 🚀

Requirements

We recommend starting from a fresh environment and installing the packages from conda_splade_env.yml.

conda create -n splade_env python=3.9
conda activate splade_env
conda env create -f conda_splade_env.yml

Usage

Playing with the model

inference_splade.ipynb allows you to load and perform inference with a trained model, in order to inspect the predicted "bag-of-expanded-words". We provide weights for six main models:

model MRR@10 (MS MARCO dev)
naver/splade_v2_max (v2 HF) 34.0
naver/splade_v2_distil (v2 HF) 36.8
naver/splade-cocondenser-selfdistil (SPLADE++, HF) 37.6
naver/splade-cocondenser-ensembledistil (SPLADE++, HF) 38.3
naver/efficient-splade-V-large-doc (HF) + naver/efficient-splade-V-large-query (HF) (efficient SPLADE) 38.8
naver/efficient-splade-VI-BT-large-doc (HF) + naver/efficient-splade-VI-BT-large-query (HF) (efficient SPLADE) 38.0

We also uploaded various models here. Feel free to try them out!
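If you just want a quick look without the notebook, here is a minimal sketch using plain Hugging Face transformers (rather than the repo's Splade class); the log-saturation and max pooling follow the papers, but treat this as an illustration, not the reference implementation:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

doc = "SPLADE learns sparse query and document expansions for first-stage retrieval."
inputs = tokenizer(doc, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# log-saturation of the MLM logits, then max pooling over input tokens
weights = torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1)
sparse_rep = weights.max(dim=1).values.squeeze(0)  # (vocab_size,)

# inspect the "bag of expanded words"
id2tok = {v: k for k, v in tokenizer.get_vocab().items()}
nonzero = sparse_rep.nonzero().squeeze(1).tolist()
bow = sorted(((id2tok[j], round(sparse_rep[j].item(), 2)) for j in nonzero),
             key=lambda x: x[1], reverse=True)
print(bow[:20])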

High level overview of the code structure

  • This repository lets you train (train.py), index (index.py), and retrieve (retrieve.py) with SPLADE models, or perform every step at once with all.py.
  • To manage experiments, we rely on hydra. Please refer to conf/README.md for a complete guide on how we configured experiments.

Data

  • To train models, we rely on MS MARCO data.
  • We also further rely on distillation and hard negative mining, from available datasets (Margin MSE Distillation , Sentence Transformers Hard Negatives) or datasets we built ourselves (e.g. negatives mined from SPLADE).
  • Most of the data formats are pretty standard; for validation, we rely on an approximate validation set, following a setting similar to TAS-B.

To simplify setup, we made all our data folders available; they can be downloaded here. This link includes queries, documents and hard-negative data, allowing training under the EnsembleDistil setting (see the v2bis paper). For the other settings (Simple, DistilMSE, SelfDistil), you also have to download the corresponding additional data.

After downloading, just untar the archive in the root directory; the data will be placed in the right folder.

tar -xzvf file.tar.gz

Quick start

In order to perform all steps (here on toy data, i.e. config_default.yaml), go to the root directory and run:

conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_default.yaml"
python3 -m splade.all \
  config.checkpoint_dir=experiments/debug/checkpoint \
  config.index_dir=experiments/debug/index \
  config.out_dir=experiments/debug/out

Additional examples

We provide additional examples that can be plugged into the above code. See conf/README.md for details on how to change experiment settings.

  • you can similarly run training python3 -m splade.train (same for indexing or retrieval)
  • to create Anserini readable files (after training), run SPLADE_CONFIG_FULLPATH=/path/to/checkpoint/dir/config.yaml python3 -m splade.create_anserini +quantization_factor_document=100 +quantization_factor_query=100
  • config files for various settings (distillation etc.) are available in /conf. For instance, to run the SelfDistil setting:
    • change to SPLADE_CONFIG_NAME=config_splade++_selfdistil.yaml
    • to further change parameters (e.g. lambdas) outside the config, run: python3 -m splade.all config.regularizer.FLOPS.lambda_q=0.06 config.regularizer.FLOPS.lambda_d=0.02

We provide several base configurations which correspond to the experiments in the v2bis and "efficiency" papers. Please note that these are suited to our hardware setup, i.e. 4 Tesla V100 GPUs with 32GB memory. In order to train models with e.g. one GPU, you need to decrease the batch size for training and evaluation. Also note that, since the range of the loss might change with a different batch size, the corresponding regularization lambdas might need to be adapted. However, we provide a mono-GPU configuration config_splade++_cocondenser_ensembledistil_monogpu.yaml, for which we obtain 37.2 MRR@10, trained on a single 16GB GPU.

Evaluating a pre-trained model

Indexing (and retrieval) can be done either with our (numba-based) implementation of an inverted index, or with Anserini. Let's perform these steps using an available model (naver/splade-cocondenser-ensembledistil).

conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_splade++_cocondenser_ensembledistil"
python3 -m splade.index \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index
python3 -m splade.retrieve \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  config.out_dir=experiments/pre-trained/out
# pretrained_no_yamlconfig indicates that we solely rely on a HF-valid model path
  • To change the data, simply override the hydra retrieve_evaluate package, e.g. add retrieve_evaluate=msmarco as an argument of splade.retrieve.

You can similarly build the files that will be ingested by Anserini:

python3 -m splade.create_anserini \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  +quantization_factor_document=100 \
  +quantization_factor_query=100

This will create the JSON collection (docs_anserini.jsonl) as well as the queries (queries_anserini.tsv) needed by Anserini. You then just need to follow the regression for SPLADE here in order to index and retrieve.
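For reference, the quantization performed here is essentially a scale-and-round of the term weights into integer impacts. The sketch below illustrates the idea only; the helper name and the JSON field names are assumptions, not the script's actual output format:

import json

def to_anserini_doc(doc_id, term_weights, reverse_voc, quantization_factor=100):
    # term_weights: {vocab_id: float weight} for the non-zero SPLADE dimensions
    vector = {reverse_voc[j]: int(round(w * quantization_factor))
              for j, w in term_weights.items() if w > 0}
    return json.dumps({"id": doc_id, "content": "", "vector": vector})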

BEIR eval

You can also run evaluation on BEIR, for instance:

conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_FULLPATH="/path/to/checkpoint/dir/config.yaml"
for dataset in arguana fiqa nfcorpus quora scidocs scifact trec-covid webis-touche2020 climate-fever dbpedia-entity fever hotpotqa nq
do
    python3 -m splade.beir_eval \
      +beir.dataset=$dataset \
      +beir.dataset_path=data/beir \
      config.index_retrieve_batch_size=100
done

PISA evaluation

We provide in efficient_splade_pisa/README.md the steps to evaluate efficient SPLADE models with PISA.


Cite 📜

Please cite our work as:

  • (v1) SIGIR21 short paper
@inbook{10.1145/3404835.3463098,
author = {Formal, Thibault and Piwowarski, Benjamin and Clinchant, St\'{e}phane},
title = {SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking},
year = {2021},
isbn = {9781450380379},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3404835.3463098},
booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2288–2292},
numpages = {5}
}
  • (v2) arxiv
@misc{https://doi.org/10.48550/arxiv.2109.10086,
  doi = {10.48550/ARXIV.2109.10086},
  url = {https://arxiv.org/abs/2109.10086},
  author = {Formal, Thibault and Lassance, Carlos and Piwowarski, Benjamin and Clinchant, Stéphane},
  keywords = {Information Retrieval (cs.IR), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval},
  publisher = {arXiv},
  year = {2021},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}
  • (v2bis) SPLADE++, SIGIR22 short paper
@inproceedings{10.1145/3477495.3531857,
author = {Formal, Thibault and Lassance, Carlos and Piwowarski, Benjamin and Clinchant, St\'{e}phane},
title = {From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
year = {2022},
isbn = {9781450387323},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531857},
doi = {10.1145/3477495.3531857},
abstract = {Neural retrievers based on dense representations combined with Approximate Nearest Neighbors search have recently received a lot of attention, owing their success to distillation and/or better sampling of examples for training -- while still relying on the same backbone architecture. In the meantime, sparse representation learning fueled by traditional inverted indexing techniques has seen a growing interest, inheriting from desirable IR priors such as explicit lexical matching. While some architectural variants have been proposed, a lesser effort has been put in the training of such models. In this work, we build on SPLADE -- a sparse expansion-based retriever -- and show to which extent it is able to benefit from the same training improvements as dense models, by studying the effect of distillation, hard-negative mining as well as the Pre-trained Language Model initialization. We furthermore study the link between effectiveness and efficiency, on in-domain and zero-shot settings, leading to state-of-the-art results in both scenarios for sufficiently expressive models.},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2353–2359},
numpages = {7},
keywords = {neural networks, indexing, sparse representations, regularization},
location = {Madrid, Spain},
series = {SIGIR '22}
}
  • efficient SPLADE, SIGIR22 short paper
@inproceedings{10.1145/3477495.3531833,
author = {Lassance, Carlos and Clinchant, St\'{e}phane},
title = {An Efficiency Study for SPLADE Models},
year = {2022},
isbn = {9781450387323},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531833},
doi = {10.1145/3477495.3531833},
abstract = {Latency and efficiency issues are often overlooked when evaluating IR models based on Pretrained Language Models (PLMs) in reason of multiple hardware and software testing scenarios. Nevertheless, efficiency is an important part of such systems and should not be overlooked. In this paper, we focus on improving the efficiency of the SPLADE model since it has achieved state-of-the-art zero-shot performance and competitive results on TREC collections. SPLADE efficiency can be controlled via a regularization factor, but solely controlling this regularization has been shown to not be efficient enough. In order to reduce the latency gap between SPLADE and traditional retrieval systems, we propose several techniques including L1 regularization for queries, a separation of document/query encoders, a FLOPS-regularized middle-training, and the use of faster query encoders. Our benchmark demonstrates that we can drastically improve the efficiency of these models while increasing the performance metrics on in-domain data. To our knowledge, we propose the first neural models that, under the same computing constraints, achieve similar latency (less than 4ms difference) as traditional BM25, while having similar performance (less than 10% MRR@10 reduction) as the state-of-the-art single-stage neural rankers on in-domain data.},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2220–2226},
numpages = {7},
keywords = {splade, sparse representations, latency, information retrieval},
location = {Madrid, Spain},
series = {SIGIR '22}
}

Contact 📭

Feel free to contact us via Twitter or by email at [email protected]!

License

SPLADE Copyright (c) 2021-present NAVER Corp.

SPLADE is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. (see license)

You should have received a copy of the license along with this work. If not, see http://creativecommons.org/licenses/by-nc-sa/4.0/ .

splade's People

Contributors

alexnodex, beanmilk, cadurosar, cmacdonald, drrv, jmmackenzie, leobavila, sclincha, thibault-formal


splade's Issues

Equation (1) and (4)

In your paper, you said equation (1) is equivalent to the MLM prediction and E_j in equation (1) denotes the BERT input embedding for token j. If you use the default implementation of HuggingFace Transformers, E_j is not from the input layer but another embeddings matrix, which is called "decoder" in the "BertLMPredictionHead" (if you use BERT). Did you manually set the "decoder" weights to the input embedding weights?

My other question concerns equation (4). It computes the summation of the weights of the document/query terms. In the "forward" function of the Splade class (models.py), however, you use the "torch.max" function. Can you explain this discrepancy?

Indexing a document corpus with Efficient SPLADE

What is the process for indexing MS MARCO using Efficient SPLADE?

I see a Dropbox link to download a pre-built index for MS MARCO, and a command to use PISA's query evaluation to retrieve from that index. However, I'd like to reproduce the indexing stage for this and other IR datasets.

Benchmark Performance After Re-ranking?

I'm curious if you've run your model with a "second-stage" reranker, on the BEIR benchmarks.
Would you expect much benefit from this?

Thank you, and excellent work!

Use interactively without indexing?

Hi there!

I'm looking to use SPLADE to evaluate just a handful of examples in a programmatic way (without reading/writing to disk). The inference_splade.ipynb script was super useful, but I'm looking to evaluate a query over a handful of documents rather than only look at the document representations.

Is there an easy way to do this, or will I have to index and write to disk my small number of examples and then retrieve from that?

Thanks!
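(A minimal in-memory sketch of one way to do this, for illustration: rep(text) below is a hypothetical helper returning the (vocab_size,) sparse SPLADE vector for a string, built e.g. as in inference_splade.ipynb; ranking is then just a dot product, no index or disk needed.)

import torch

def rank(query, documents, rep):
    # score each document against the query with a sparse dot product
    q_rep = rep(query)
    scored = [(doc, torch.dot(q_rep, rep(doc)).item()) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)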

Normalizing SPLADE embeddings - a bad idea?

Hi!

I'm using SPLADE together with sentence-transformers/multi-qa-mpnet-base-cos-v1 SentenceTransformer to create hybrid embeddings for use in Pinecone's sparse-dense indexes.

The sparse-dense indexes can only use dotproduct similarity, which is why I chose a dense model trained with cosine similarity. This means I get back dense embeddings with L2 norm of 1 and dot product similarity in range [-1, 1] which I can easily rescale to the unit interval. Based on my somewhat limited understanding, this seems like a relatively sound approach to getting scores which our users can understand as % similarity (assuming in distribution).

After transitioning to sparse-dense vectors, I noticed that SPLADE does not produce normalized embeddings, which means this approach no longer works. I thought about normalizing the SPLADE embeddings, but I'm not sure how this would affect performance.

On a separate note, I'm using Pinecone's convex combination

# alpha in range [0, 1]
embedding.sparse.values = [
    value * (1 - alpha) for value in embedding.sparse.values
]
embedding.dense = [value * alpha for value in embedding.dense]

I am struggling to reason about how all of this interacts and what effect it has on ranking. See here for info on how pinecone's score is calculated and here for more details about their convex combination logic.

Any help understanding this stuff would be hugely appreciated 🙌

Cheers!

configuration for splade++ results

Hi-- thanks for the nice work.

I'm trying to index+retrieve using the naver/splade-cocondenser-ensembledistil model. Following the readme, I've done:

export SPLADE_CONFIG_FULLPATH="config_default.yaml"
python3 -m src.index \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \ # <--- (from readme, using the new model)
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  index=msmarco  # <--- added

export SPLADE_CONFIG_FULLPATH="config_default.yaml"
python3 -m src.retrieve \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \ # <--- (from readme, using the new model)
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  config.out_dir=experiments/pre-trained/out-dl19 \
  index=msmarco \  # <--- added
  retrieve_evaluate=msmarco # <--- added

Everything runs just fine, but I'm getting rather poor results in the end:

MRR@10: 0.18084248646927734
recall ==> {'recall_5': 0.2665353390639923, 'recall_10': 0.3298710601719197, 'recall_15': 0.3694364851957974, 'recall_20': 0.3951050620821394, 'recall_30': 0.4270654250238777, 'recall_100': 0.5166069723018146, 'recall_200': 0.5560768863419291, 'recall_500': 0.606984240687679, 'recall_1000': 0.6402578796561604}

I suspect it's a configuration problem on my end, but since the indexing process takes a bit of time, I thought I'd just ask before diving too far into the weeds: Is there a configuration file to use for the splade++ results, and how do I use it?

Thanks!

Zero-dimension query embedding

In the notebook I made some modifications and I get back a zero-dimensional embedding. Specifically I wanted to see the bow representation of a quoted search query using the efficient-splade models. Is it expected for the model to sometimes return zero-dimensional embeddings? Without the quotes it generates an expected representation.

model_type_or_dir = "naver/efficient-splade-V-large-query"
q_model_type_or_dir = "naver/efficient-splade-V-large-doc"

# loading model and tokenizer

model = Splade(model_type_or_dir, q_model_type_or_dir, agg="max")
model.eval()
tokenizer = AutoTokenizer.from_pretrained(q_model_type_or_dir)
reverse_voc = {v: k for k, v in tokenizer.vocab.items()}

# example query (note the quotes); adapted from the doc example in the notebook

query = '"a big fat potato"'

# now compute the query representation
with torch.no_grad():
    inputs = tokenizer(query, return_tensors="pt")
    print(inputs)
    query_rep = model(q_kwargs=inputs)["q_rep"].squeeze()  # (sparse) query rep in voc space, shape (30522,)

# get the number of non-zero dimensions in the rep:
col = torch.nonzero(query_rep).squeeze().cpu().tolist()
print("number of actual dimensions: ", len(col))

# now let's inspect the bow representation:
weights = query_rep[col].cpu().tolist()
d = {k: v for k, v in zip(col, weights)}
sorted_d = {k: v for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)}
bow_rep = []
for k, v in sorted_d.items():
    bow_rep.append((reverse_voc[k], round(v, 2)))
print("SPLADE BOW rep:\n", bow_rep)

Dockerized environment to run splade

Can you please provide a Dockerfile to run splade? The conda env has dependencies on binaries built against nvidia-cuda for linux platform. I cannot build it for non-cuda linux and osx. I tried replacing those dependencies with their replacement built for usage with cpu, however, still did not manage to make it work.

A Dockerfile should allow users to use splade more easily across platforms.

How to install the ENV correctly?

Hi,

In your readme file:

conda create -n splade_env python=3.9
conda activate splade_env
conda env create -f conda_splade_env.yml

makes me confused, as conda create -n splade_env python=3.9 creates an env named splade_env without installing any packages, while conda env create -f conda_splade_env.yml directly creates an env named splade with python=3.8 and all the required packages. The following instructions use splade_env, but splade_env is just a python=3.9 env without any packages installed.

BTW, I installed the splade env via conda env create -f conda_splade_env.yml but got the error TypeError: main() got an unexpected keyword argument 'version_base' when I run the config config_splade++_cocondenser_ensembledistil.

How to solve this ENV issue? Thanks.

Zhiyuan.

Tutorial to export a SPLADE model to ONNX

Hello,

I trained a SPLADE model on my own recently. To reduce the inference time, I tried to export my model to ONNX with torch.onnx.export() but I encountered a few errors.

Is there a tutorial somewhere for this conversion?
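(Not an official recipe, but a minimal sketch of the usual approach: wrap the encoder so the exported graph outputs the pooled sparse vector, then call torch.onnx.export. The wrapper, axis names and opset below are assumptions to adapt to your setup.)

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

class SpladeOnnxWrapper(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        logits = self.model(input_ids=input_ids, attention_mask=attention_mask).logits
        weights = torch.log1p(torch.relu(logits)) * attention_mask.unsqueeze(-1)
        return weights.max(dim=1).values  # (batch, vocab_size) sparse rep

model_id = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_id)
wrapper = SpladeOnnxWrapper(AutoModelForMaskedLM.from_pretrained(model_id)).eval()

dummy = tokenizer("a dummy input", return_tensors="pt")
torch.onnx.export(
    wrapper,
    (dummy["input_ids"], dummy["attention_mask"]),
    "splade.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["sparse_rep"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=14,
)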

Multilingual version of SPLADE

I'm very impressed by SPLADE, particularly the newest efficient versions. However, it is only trained on English texts.

There's an mMARCO dataset with 14 languages, which is already in use by SBERT and other projects. Importantly, there's a doc2query mT5 model that uses this dataset. It seems to me that anyone working with non-English (or multiple) languages would have no choice but to use that. A SPLADE version would be fantastic, especially if compared to the mT5 version of doc2query on BEIR zero-shot data!

Even better would be if you could somehow use the FLORES-200 dataset, which is used by the cutting edge NLLB-200 translation model!

Would you consider implementing a multilingual version in a future iteration of SPLADE? I think this would provide immense value to the global community!

Also, it's not clear to me that the SPLADE++ methods were used as part of your efficient version. So, it would be great if you could use and compare it with the other methods.

Inquiry about Configuration Details for "ecir23-scratch-tydi-japanese-splade" Model

Hello, I am currently developing a Japanese model and have been referencing the "ecir23-scratch-tydi-japanese-splade" model on Hugging Face for guidance. I would greatly appreciate it if you could share the specific settings, including the models and datasets used, to create this model. This information will be incredibly helpful for my project. Thank you in advance for your assistance.

url:https://huggingface.co/naver/ecir23-scratch-tydi-japanese-splade

Clustering

Maybe a stupid question, but you can't use SPLADE for clustering, right?

Instructions on Using Pisa for Splade

Firstly, thanks for your series of amazing papers and well-organized code implementations.

The two papers "Wacky Weights in Learned Sparse Representations and the Revenge of Score-at-a-Time Query Evaluation" and "From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective" show that using PISA can make query retrieval much faster compared to using Anserini or the code from this repo for SPLADE.

The folder efficient_splade_pisa/ in the repo contains instructions on using PISA for SPLADE, but they only cover already-processed queries and indexes. If I only have a well-trained SPLADE model, how can I process its outputs (sparse vectors, or their quantized version for Anserini) to make them suitable for PISA? Can you provide more specific instructions on this?

Best wishes

Evaluation on MSMARCO?

Hi, thanks for your very interesting work.

Could you share how you evaluated to get the results here?
Did you use inverted indexing or this code?
I am trying the latter approach, but it is very slow on MS MARCO.
Thank you

FLOPs calculation

I recently read your SPLADE paper and I think it's quite interesting. I have a question concerning FLOPs calculation in the paper.

I think computing FLOPs for an inverted index involves the length of the activated posting lists (the terms overlapping between query and document). For example, for a query a b c and a document c a e, since we must inspect the posting lists of the overlapping terms a and c, the FLOPs should be at least

posting_length(a) + posting_length(c)

because we perform a summation for each entry in the posting list. However, in the paper you compute FLOPs from the probability that a, b, c are activated in the query and c, a, e are activated in the document. I think this may underestimate the FLOPs of SPLADE, because the less sparse the document, the longer the posting lists in the inverted index.
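(For reference, a rough statement of the metric used in the paper, as we read it: the expected cost of scoring a query-document pair is estimated as

$\mathrm{FLOPS} = \sum_{j \in V} p_j^{(q)} \, p_j^{(d)}$

where $p_j^{(q)}$ and $p_j^{(d)}$ are the empirical probabilities that vocabulary term $j$ receives a non-zero weight in a query and in a document, estimated over a held-out set. As the question points out, this is a proxy that does not account for actual posting-list lengths at retrieval time.)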

Proposed Dockerfile

Hello maintainers and community,

I've noticed that the project doesn't currently have a Dockerfile, so I've taken the initiative to create one. Dockerizing the project provides a consistent environment for both development and deployment, making it easier for contributors to get started and maintain the quality of the project.

What I've Done:

Created a Dockerfile to build and run the project
Tested it locally to ensure that it works as expected

How to Test the Docker Setup:

Build the Docker image: docker build -t [image-name] .
Run the Docker container: docker run [options] [image-name]
Run the splade.all: /opt/conda/envs/splade/bin/python -m splade.all config.checkpoint_dir=experiments/debug/checkpoint config.index_dir=experiments/debug/index config.out_dir=experiments/debug/out

Dockerfile

FROM continuumio/anaconda3:2022.05

RUN git clone https://github.com/naver/splade.git && cd splade
WORKDIR /splade

RUN conda create -n splade_env python=3.9
RUN conda env create -f conda_splade_env.yml

Placement of the Dockerfile:

I've placed the Dockerfile in the project root directory for now, but I'm open to suggestions if there's a more appropriate directory for it.

I would appreciate your feedback on this addition. If it aligns with the project's goals and you find it beneficial, I would be happy to submit a Pull Request.

Thank you for considering my proposal.

Great job!

It's great to see sparse IR models, especially SPLADE and SPLADE++, achieve such good performance.

TypeError: main() got an unexpected keyword argument 'version_base'

Howdy,

Sorry to bother, but I'm just trying to get the basic toy data training task to work on a fresh git clone and I'm running into the following error running:

python3 -m splade.train

Traceback (most recent call last):
  File "/home/vagrant/.conda/envs/splade_source/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/vagrant/.conda/envs/splade_source/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/srv/repos/splade_source/splade/index.py", line 13, in <module>
    @hydra.main(config_path=CONFIG_PATH, config_name=CONFIG_NAME, version_base="1.2")
TypeError: main() got an unexpected keyword argument 'version_base'

is there something basic I'm missing?

Training SPLADE with a smaller dataset?

Hello,

Thank you for researching SPLADE and setting up the GitHub repo in such an easy-to-access way. I am trying to experiment with SPLADE and was wondering if it is possible to use the MS MARCO small dataset for training instead of the larger one. I ask because it was taking ~30 hours/epoch to train with the data from the distli_from_ensemble data config file.

This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

This repo fails to clone, whether by manual git clone or by installing with pipenv:

pipenv install git+https://github.com/naver/splade.git#egg=splade

Error

Cloning into '/REDACTED/splade'...
remote: Enumerating objects: 524, done.        
remote: Counting objects: 100% (57/57), done.        
remote: Compressing objects: 100% (34/34), done.        
remote: Total 524 (delta 32), reused 24 (delta 23), pack-reused 467        
Receiving objects: 100% (524/524), 3.09 MiB | 18.16 MiB/s, done.
Resolving deltas: 100% (274/274), done.
Downloading weights/distilsplade_max/pytorch_model.bin (268 MB)
Error downloading object: weights/distilsplade_max/pytorch_model.bin (33a5b0a): Smudge error: Error downloading weights/distilsplade_max/pytorch_model.bin (33a5b0a696d7b540065aedf6a86a056df3ac5f074d5be43923f0315f8b8bf7c4): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Errors logged to '/REDACTED/splade/.git/lfs/logs/20230318T195557.182673.log'.
Use `git lfs logs last` to view the log.
error: external filter 'git-lfs filter-process' failed
fatal: weights/distilsplade_max/pytorch_model.bin: smudge filter lfs failed
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'


Would you like to retry cloning ?

Please help resolve.

Is there an alternative way to install?

Running SPLADE in production (Render python server)

Hi team, I'm working with my development team on using SPLADE for sparse embeddings (alongside dense embeddings from OpenAI Ada), with the end goal of having a hybrid search setup.

However, we keep running into memory issues creating embeddings for chunks of text.

I was wondering if you have any tips for running this in production, or better still, if there is an API available that you're aware of that can take text as input and output sparse embeddings?

Any help would be hugely appreciated.

Quick Start Problem: an unexpected keyword argument 'version_base'

Hello. I've got a TypeError problem when running the quick start example and when working with hf.train.py.

Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/usr/splade/splade/all.py", line 6, in
from .index import index
File "/home/usr/s/splade/splade/index.py", line 13, in
@hydra.main(config_path=CONFIG_PATH, config_name=CONFIG_NAME, version_base="1.2")
TypeError: main() got an unexpected keyword argument 'version_base'

As far as I understand, it isn't a crucial argument. However, removing it from @hydra.main generates SystemError: 2.

[Bug] Get PyTorch version

Hi, I believe there is a bug in the function that checks whether the PyTorch version is >= 1.6.

https://github.com/naver/splade/blob/main/splade/tasks/amp.py

import torch

# inspired from Colbert repo: https://github.com/stanford-futuredata/ColBERT

PyTorch_over_1_6 = float((torch.__version__.split('.')[1])) >= 6 and float((torch.__version__.split('.')[0])) >= 1

It returns False for PyTorch version '2.0.1+cu117' (Google Colab). Could you check it, please?
I have replaced it with another one:

PyTorch_over_1_6 = float(".".join([torch.__version__.split('.')[0], torch.__version__.split('.')[1]])) >= 1.6


This error makes the code break when this PyTorch version is combined with fp16 = True.

Error message:
"Cannot use AMP for PyTorch version < 1.6"


When do you drop a term?

I understand that the log-saturation function and the regularization loss suppress the weights of frequent terms. But when do you drop a term (set its weight to zero)? Is it when the logit is less than or equal to zero, so that the log(1 + ReLU(.)) function outputs zero?

Hybrid search & normalization

Hello! I see many articles (like Pinecone's) that use the following approach to combine hybrid search results from a dense vector and SPLADE.

However, I'm a bit confused about how it would work if the dense vectors are normalized to 1 but SPLADE's output is not. Any thoughts? What is the best way to conduct hybrid search with both vectors?

I understand the ANN search is done with the dot product, so we would just use the highest score and not try to normalize?

def hybrid_scale(dense, sparse, alpha: float):
    # check alpha value is in range
    if alpha < 0 or alpha > 1:
        raise ValueError("Alpha must be between 0 and 1")
    # scale sparse and dense vectors to create hybrid search vecs
    hsparse = {
        'indices': sparse['indices'],
        'values':  [v * (1 - alpha) for v in sparse['values']]
    }
    hdense = [v * alpha for v in dense]
    return hdense, hsparse

I see this prior issue: #34, but it seemed inconclusive.

PyTorch version checking

This line

PyTorch_over_1_6 = float(".".join([torch.__version__.split('.')[0], torch.__version__.split('.')[1]])) >= 1.6

doesn't work properly, because there are versions at or above 1.6 that are numerically smaller when read as a float, for example versions >= 1.10.
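(A sketch of a more robust check, assuming the packaging library is available: compare parsed versions rather than floats.)

import torch
from packaging import version

# handles "1.10.x" as well as local builds like "2.0.1+cu117"
PyTorch_over_1_6 = version.parse(torch.__version__) >= version.parse("1.6")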

Seeking Assistance with SPLADE Model for Chinese Text

Hello,

I am currently developing a SPLADE model focusing on Chinese text, and during the training process, I have encountered several issues that I hope you might be able to help me with:

I attempted to pre-train the model using the method outlined in the paper available at https://arxiv.org/pdf/2301.10444.pdf, but I am unsure if the problem lies in the implementation details. I observed that the sparse representation became entirely zeros during fine-tuning. For the FLOPS input, I am using log(1 + ReLU(y_logits)) and have also tried adding an MLM Loss specifically targeting log(1 + ReLU(y_logits)).

I found that the original MLM Loss + FLOPS (log(1 + ReLU(y_logits - 1))) yielded better training results than MLM + FLOPS (log(1 + ReLU(y_logits))).

LexMAE has shown satisfactory results on English text datasets, and I am curious to know if you have conducted any experiments on top of LexMAE's foundation.

I would greatly appreciate your response and any advice you can provide.

Thank you very much for your time.

Best regards,

Installation error - splade with tokenisers v0.12.1 – Compatibility issue with Python 3.11.1 and Rust (v. 1.72, 1.76, 1.69, 1.62)

Splade has tokenizers v0.12.1 as a dependency which seems to have a known conflict with multiple versions of Rust. Can we please update the dependency to a version >0.14.1?

warning: variable does not need to be mutable
         --> tokenizers-lib\src\models\unigram\model.rs:265:21
          |
      265 |                 let mut target_node = &mut best_path_ends_at[key_pos];
          |                     ----^^^^^^^^^^^
          |                     |
          |                     help: remove this `mut`
          |
          = note: `#[warn(unused_mut)]` on by default
     
      warning: variable does not need to be mutable
         --> tokenizers-lib\src\models\unigram\model.rs:282:21
          |
      282 |                 let mut target_node = &mut best_path_ends_at[starts_at + mblen];
          |                     ----^^^^^^^^^^^
          |                     |
          |                     help: remove this `mut`
     
      warning: variable does not need to be mutable
         --> tokenizers-lib\src\pre_tokenizers\byte_level.rs:200:59
          |
      200 |     encoding.process_tokens_with_offsets_mut(|(i, (token, mut offsets))| {
          |                                                           ----^^^^^^^
          |                                                           |
          |                                                           help: remove this `mut`
     
      error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell`
         --> tokenizers-lib\src\models\bpe\trainer.rs:526:47
          |
      522 |                     let w = &words[*i] as *const _ as *mut _;
          |                             -------------------------------- casting happend here
      ...
      526 |                         let word: &mut Word = &mut (*w);
          |                                               ^^^^^^^^^
          |
          = note: for more information, visit <https://doc.rust-lang.org/book/ch15-05-interior-mutability.html>
          = note: `#[deny(invalid_reference_casting)]` on by default
  warning: `tokenizers` (lib) generated 3 warnings
  error: could not compile `tokenizers` (lib) due to 1 previous error; 3 warnings emitted

Alternatively, if anyone knows how to install Splade without this issue, please advise.

YAML Installation doesn't work from macOS with mini conda

OS:

macOS Ventura 13.3.1 
conda 23.1.0

Command:
conda env create -f conda_splade_env.yml

Output:

Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:
  - libgfortran4==7.5.0=ha8ba4b0_17
  - protobuf==3.19.1=py38h295c915_0
  - gmp==6.2.1=h2531618_2
  - gcc_impl_linux-64==9.3.0=h70c0ae5_19
  - libwebp==1.2.2=h55f646e_0
  - freetype==2.11.0=h70c0345_0
  - libffi==3.3=he6710b0_2
  ....

Is it possible to get a commercial license?

We are a startup that is attempting to sell a new search API built on some of the latest open source text embedding and cross-encoder models. We ourselves are BSL-licensed, so I totally understand not wanting someone to commercialize your work and give nothing back.

SPLADE is obviously a very good and well-tested sparse-vector encoder and alternative to BM25, and we would like to include it in our commercial product if possible.

Is there a way to get a commercial license? We are still a tiny company without much revenue for an outright purchase, but maybe we could set up some kind of channel partnership? Not sure, but it would be amazing to figure something out if at all possible. Thanks!

Change default to splade-v3

Hey,

should we change the default configuration from splade++ to splade-v3? I could make a PR for the readme if that makes sense.

SPLADE representations on BEIR dataset

Hi,
thank you for sharing and maintaining this repo! I would like to generate SPLADE representations for both documents and queries for all the datasets in BEIR, similar to what is possible with the create_anserini script for the MS MARCO dataset. I would like to do it both for splade-cocondenser-ensembledistil and efficient-splade-V-large.

I tried to run the following script,

export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_splade++_cocondenser_ensembledistil"

for dataset in arguana fiqa nfcorpus quora scidocs scifact trec-covid webis-touche2020 climate-fever dbpedia-entity fever hotpotqa nq
do
    python3 -m splade.beir_eval \
        config.pretrained_no_yamlconfig=true \
        +beir.dataset=$dataset \
        +beir.dataset_path=data/beir \
        config.index_retrieve_batch_size=100
done

but I get NDCG=0.001 on the arguana dataset (then I stopped the script, because I guess something is wrong). What am I doing wrong? Also, does this script save the embeddings of each dataset? If not, how can I force it to save them?

Can SPLADE adapt to Chinese language ?

Hi, I am interested in your great work. I tried to train a SPLADE model based on RoBERTa from Hugging Face (https://huggingface.co/hfl/chinese-roberta-wwm-ext) for my retrieval task over a Chinese corpus.
But the results are not satisfying. In the inference stage, my code is as follows:

texts = ['王者荣耀好玩吗', '带你上王者', '如何下载王者荣耀', '鲁班怎么利用普通攻击']
embeds = batch_embed_doc(texts=texts, encoder=encoder, tokenizer=tokenizer, max_len=max_doc_len)
for i in range(len(texts)):
    print(texts[i])
    print(tokenizer.decode(embeds[i].topk(k=40).indices))

Then, I got the result:

王者荣耀好玩吗
700 喺帐 卷喉鲱 st fgo 蠹44 改判 短淇 貂 混华 oil賽 陇 谁 00 邇 呐 ssd 踝 ⒈ 2014 洞 天ᅦ 诰 or 西 乌 京 艷對 鬼 nt
带你上王者
700 呐 nt st 爸淇ᅦ 踝 git 艷鲱 dyson 貂淮44 ( 输 卷 购 53 才 葦 誣鼹 is 揶項θ 佈 cdma 贡 i3 { 马 fgo 邇 搜 以 乌帐
如何下载王者荣耀
700帐 喺喉 fgo 卷判 貂 短44鲱 st 蠹华 改 谁 00 oil淇 陇賽 混 ssd 踝 2014 or ⒈ 天ᅦ 邇 艷 射 璉 京浣 战 載對 跚 呐
鲁班怎么利用普通攻击
職 诰 据 尖閏哄my 20尔x 漏 表 才 剃 32g5s gohappymic 灞首缆 塊 互 山 种 怡 购椎 麒 奈級曇 膏 洛污 唔 find 躁

Here are two questions that come to mind:
1. Can SPLADE adapt to the Chinese language?
2. What should I do to extend SPLADE to a Chinese corpus?

Training by dot product and evaluation via inverted index?

Hey,
I recently read your SPLADEv2 paper. That's so insightful! But I still have a few questions about it.

  1. Is the model trained with the dot product as the similarity function in the contrastive loss?
  2. Is evaluation on MS MARCO performed via an inverted index backed by Anserini?
  3. Is evaluation on BEIR implemented with SentenceTransformers, hence also via dot product?
  4. How much can you guarantee the sparsity of the learned representations, since they are only softly regularized by the L1 and FLOPS losses? Did you use a tuned threshold to zero out near-zero values?

FLOPS calculation

Hello!

I find that when I run the FLOPS computation, it always returns NaN.

I see your last commit fixed "force new" and changed line 25 in transformer_evaluator.py to force_new=True,
but in inverted_index.py line 23, it seems that self.n will be 0 if force_new is True.

The FLOPS no longer return NaN after I remove force_new=True.

Am I doing something wrong here? How should I get the correct FLOPS?

Thank you!
Allen

Inference Experiments

Hey all,

I'm looking at the Efficiency Study paper and I'd like to replicate the query encoding numbers - could you please provide a pipeline or any other pointers so I can ensure my measurement is correct?

Thanks a lot!
