Unsupervised multi-lingual/cross-lingual information retrieval.
In this track, we examine how embeddings behave in the context of (i) multilingual search and (ii) cross-lingual search. We investigate this with two scenarios (benchmarks): (1) multilingual passage re-ranking (MIRACL) and (2) cross-lingual passage re-ranking (NeuCLIR).
We are going to create re-ranking models. Then, for each query, we re-rank the candidate passages retrieved by the first-stage BM25 run.
In this scenario, the query and passage are in the same language (i.e., monolingual).
We subsample MIRACL into Persian, Russian, and Chinese (`fa`, `ru`, `zh`) to align with the target languages used in NeuCLIR.
Dataset | lang | retrieval | metric | score |
---|---|---|---|---|
miracl-dev | en | bm25 | nDCG@10 | 0.3504 |
miracl-dev | fa | bm25 | nDCG@10 | 0.3332 |
miracl-dev | ru | bm25 | nDCG@10 | 0.3342 |
miracl-dev | zh | bm25 | nDCG@10 | 0.1801 |
miracl-dev | en | bm25 | Recall@10 | 0.4515 |
miracl-dev | fa | bm25 | Recall@10 | 0.4368 |
miracl-dev | ru | bm25 | Recall@10 | 0.3991 |
miracl-dev | zh | bm25 | Recall@10 | 0.2504 |
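For reference, nDCG@10 (the main metric above) can be computed per query as below. This is a minimal, self-contained sketch of the metric using toy data, not the actual qrels; the official scores come from trec_eval-style tooling.

```python
import math

def ndcg_at_k(run_scores, qrel, k=10):
    """Compute nDCG@k for one query.

    run_scores: {docid: retrieval score}; qrel: {docid: graded relevance}.
    """
    # Rank docs by retrieval score, keep top k.
    ranked = sorted(run_scores, key=run_scores.get, reverse=True)[:k]
    dcg = sum(qrel.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked))
    # Ideal DCG: relevance labels sorted from best to worst.
    ideal = sorted(qrel.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: two relevant docs, one ranked first and one third.
qrel = {"d1": 1, "d2": 1}
run = {"d1": 9.0, "d3": 5.0, "d2": 1.0}
print(round(ndcg_at_k(run, qrel, k=10), 4))  # → 0.9197
```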
In this scenario, the query is in English (the source language) and the passages are in another language (e.g., Chinese).
Dataset | lang | retrieval | metric | score |
---|---|---|---|---|
Note that we use Google-translated queries (i.e., English to the target language) for the BM25 search. However, you can also try dense retrieval, which bypasses the translation step and directly retrieves passages in the target languages.
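As a sketch of the dense-retrieval alternative: with a single multilingual encoder, queries and passages from different languages share one vector space, so retrieval becomes a nearest-neighbor search over embeddings. The snippet below shows only the ranking step; random vectors stand in for real encoder output (the encoder itself, e.g. some multilingual MiniLM-style model, is an assumption and not shown).

```python
import numpy as np

def dense_topk(query_vec, passage_vecs, k=10):
    """Rank passages by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q
    topk = np.argsort(-scores)[:k]  # indices of the k most similar passages
    return topk, scores[topk]

rng = np.random.default_rng(0)
passages = rng.normal(size=(100, 384))              # stand-in for 384-dim embeddings
query = passages[42] + 0.1 * rng.normal(size=384)   # near-duplicate of passage 42
idx, scores = dense_topk(query, passages)
print(idx[0])  # → 42
```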
We have preprocessed the data and put it on Hugging Face. You can `git clone` the entire dataset repo, but make sure you have `git-lfs` installed properly (e.g., `conda install git-lfs`):
git clone https://huggingface.co/datasets/DylanJHJ/essir-xlir
The experimental datasets are from MIRACL and NeuCLIR'23.
Note that we subsample MIRACL into three languages: Chinese (`zho`), Russian (`rus`), and Persian (`fas`), as they are the target languages used in NeuCLIR.
To download the raw data from scratch, please refer to `data`. You can find them in `data/miracl` and `data/neuclir`.
You can find them in `runs`. Naming format: `run.{miracl/neuclir}.{dev/translate}.bm25.{lang}.txt`. The results are shown in the table above.
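The run files follow the standard six-column TREC run format (`qid Q0 docid rank score tag`); here is a small sketch of loading one into memory (the sample lines are made up for illustration):

```python
from collections import defaultdict

def load_run(lines):
    """Parse TREC run lines into {qid: [(docid, score), ...]}, best first."""
    run = defaultdict(list)
    for line in lines:
        qid, _q0, docid, _rank, score, _tag = line.split()
        run[qid].append((docid, float(score)))
    for qid in run:
        run[qid].sort(key=lambda pair: -pair[1])  # sort by score, descending
    return dict(run)

sample = [
    "q1 Q0 doc9 1 12.5 bm25",
    "q1 Q0 doc3 2 10.1 bm25",
]
run = load_run(sample)
print(run["q1"][0][0])  # → doc9
```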
As we have provided the runs (results), you won't need to do the preprocessing and indexing again. But if you want to reproduce them yourself, please refer to `retrieval`.
The re-ranking models include: (1) multilingual MiniLM and (2) multilingual monoT5.
Dataset | lang | retrieval | metric | score |
---|---|---|---|---|
miracl-dev | en | bm25 | nDCG@10 | 0.3504 |
miracl-dev | fa | bm25 | nDCG@10 | 0.3332 |
miracl-dev | ru | bm25 | nDCG@10 | 0.3342 |
miracl-dev | zh | bm25 | nDCG@10 | 0.1801 |
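A minimal sketch of the re-ranking step itself: score each (query, passage) pair with a cross-encoder and sort the BM25 candidates by that score. The model named in the comment is an assumption (any multilingual cross-encoder would do); the demo uses a toy term-overlap scorer so the snippet stays self-contained.

```python
# With sentence-transformers installed, the scorer could be, e.g.:
#   scorer = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1").predict
# (that model name is an assumption, not something fixed by this repo).
def rerank(query, candidates, scorer, k=10):
    """candidates: [(docid, text)]; scorer: list of (query, text) pairs -> scores."""
    scores = scorer([(query, text) for _, text in candidates])
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])
    return [(candidates[i][0], float(scores[i])) for i in order[:k]]

# Toy scorer: counts query-term overlap, standing in for a real model.
toy_scorer = lambda pairs: [sum(w in t for w in q.split()) for q, t in pairs]
cands = [("d1", "irrelevant text"), ("d2", "moscow is the capital of russia")]
print(rerank("capital of russia", cands, toy_scorer)[0][0])  # → d2
```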