pyterrier_doc2query's People

Contributors

cmacdonald, mitgosp, seanmacavaney, tonellotto


pyterrier_doc2query's Issues

Issues with Fetching Queries/Scores from Store

I am just trying to fetch the pre-computed queries and scores.

When I try to run the following:

import pyterrier as pt; pt.init()
from pyterrier_doc2query import Doc2QueryStore

store = Doc2QueryStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage')
print(store.lookup('100'))

I get the following error:

PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/soyuj/improving-learned-index/src/doc2query--/utils.py", line 9
    print(store.lookup('100'))
          ^^^^^^^^^^^^^^^^^^^
  File "/home/soyuj/.conda/envs/sjb/lib/python3.11/site-packages/pyterrier_doc2query/stores.py", line 60, in lookup
    queries, q_offsets, docnos_lookup = self.payload()
                                        ^^^^^^^^^^^^^^
  File "/home/soyuj/.conda/envs/sjb/lib/python3.11/site-packages/pyterrier_doc2query/stores.py", line 28, in payload
    self._queries_offsets = np.memmap(self.path/'queries.offsets.u8', mode='r', dtype=np.uint64)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/soyuj/.conda/envs/sjb/lib/python3.11/site-packages/numpy/core/memmap.py", line 240, in __new__
    raise ValueError("Size of available data is not a "
ValueError: Size of available data is not a multiple of the data-type size.

When I looked into the details, I found that the cause is Git LFS: the repo being cloned contains an LFS-tracked file, and the clone downloaded the small pointer file instead of the actual data:
[screenshot omitted]
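For anyone hitting the same error, a quick way to confirm this diagnosis is to check whether the file on disk is an LFS pointer rather than the real payload: pointer files are tiny text files that start with a fixed spec header. A minimal sketch (the path is whatever `queries.offsets.u8` resolved to on your machine):

```python
# Detect the Git LFS failure mode: a pointer file begins with the
# LFS spec header instead of the expected binary data.
def is_lfs_pointer(path):
    with open(path, 'rb') as f:
        head = f.read(64)
    return head.startswith(b'version https://git-lfs.github.com/spec/v1')
```

If this returns True, installing git-lfs (`git lfs install`) and running `git lfs pull` inside the cloned repo should fetch the real files.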

Some feedback and suggestions about your paper

I just came across your fantastic paper and have some feedback and suggestions for future work.

First, some nitpicking: the reductions in query execution time and index size are miscalculated:

  • 0.95/1.41 - 1 = -32% (vs. the quoted 1.41/0.95 - 1 = 48%)
  • 23/30 - 1 = -23% (vs. the quoted 30/23 - 1 = 30%)

Still very impressive, but smaller. The accuracy gain was calculated correctly (0.323/0.279 - 1 = 16%).

Suggestions for future work:

Further cleaning the data

  • What would be the effect of lemmatizing? You could first pass the full original document into doc2query so the model has maximum context; then, once you filter and concatenate the generated queries, lemmatize the result to standardize all terms on their root forms. At query time you would lemmatize the search terms (which is very fast) before running the BM25 query.
  • You could probably also remove stop words, either before or after lemmatizing. They shouldn't affect the score/relevance much, but removing them would further shrink the index.
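To make these two suggestions concrete, here is a toy sketch of the proposed post-processing. A real pipeline would use a proper lemmatizer (spaCy or NLTK) and a standard stop-word list; the suffix rules and stop words below are illustrative stand-ins only:

```python
# Toy normalization pass for filtered doc2query expansions:
# drop stop words, then reduce the remaining tokens to a root form.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and"}

def toy_lemmatize(token):
    # Crude suffix stripping; a real system would use spaCy or NLTK.
    for suffix in ("ing", "ies", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-3] + "y" if suffix == "ies" else token[:-len(suffix)]
    return token

def normalize(text):
    tokens = text.lower().split()
    return " ".join(toy_lemmatize(t) for t in tokens if t not in STOPWORDS)
```

The same `normalize()` would be applied to incoming search queries before BM25, so documents and queries meet in the same reduced vocabulary.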

More comprehensive benchmarking

I'd suggest using the BEIR benchmark, particularly for out-of-domain datasets. It seems to be the most comprehensive way to evaluate all of this, and it is what the SPLADE team uses to show that their method outperforms docT5query. SPLADE got a lot of attention when Pinecone published an article about using it. Relevant papers about SPLADE:
https://arxiv.org/pdf/2107.05720.pdf
https://arxiv.org/pdf/2109.10086.pdf
https://arxiv.org/pdf/2110.11540.pdf
https://arxiv.org/pdf/2205.04733.pdf
https://arxiv.org/pdf/2207.03834v1.pdf

Multilingual

  • There's an excessive focus on English in this area, so I'd love to see all of this tested using the doc2query mT5 model trained on the 14-language mMARCO dataset.
  • You could also use the multilingual SBERT cross-encoder trained on the same dataset for the filtering step.

And, more generally, I think there would be a lot of value in exploring a data-centric approach to all of this, which advocates exactly the sort of data cleaning you're doing rather than chasing minor improvements from ever more complex models.

I hope this helps! I really think this approach has enormous potential for providing great IR results at low cost. I'd be happy to chat further about any of it!

How are the GPU hours recorded

Hi, I'm really impressed by your interesting work. I have a question: how were the GPU hours in Table 2 recorded? I gave it a try with your code myself (https://colab.research.google.com/drive/1Y-Q93GplU7dpnBSQZ3ed9drjRPFbk8wP?usp=sharing), but it seems to need 1–2K hours on Colab's Tesla T4 GPU to process the MS MARCO collection (batch size = 32, nsamples = 40). This is very different from the reported figure of 214. Thanks in advance.

P.S.: I also ran the same code on a V100 server and it still needs ~400 hours.
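For what it's worth, the gap can be framed as a throughput question: MS MARCO's passage collection has roughly 8.8M documents, so total GPU hours are just the number of batches times the measured seconds per batch. The 2.8 s/batch figure below is purely illustrative (it happens to reproduce the reported 214 hours), not a measured number:

```python
# Back-of-envelope extrapolation: time a few batches, then scale to
# the whole collection to estimate total GPU hours.
def estimated_gpu_hours(n_docs, batch_size, secs_per_batch):
    n_batches = -(-n_docs // batch_size)  # ceiling division
    return n_batches * secs_per_batch / 3600

# ~8.8M MS MARCO passages, batch size 32, hypothetical 2.8 s/batch:
print(round(estimated_gpu_hours(8_800_000, 32, 2.8)))  # -> 214
```

Timing a handful of batches on your own GPU and plugging the result in should show how far your hardware is from the paper's effective throughput.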
