pyterrier_doc2query's People

Contributors

cmacdonald, mitgosp, seanmacavaney, tonellotto


pyterrier_doc2query's Issues

Issues with Fetching Queries/Scores from Store

I am just trying to fetch the pre-computed queries and scores.

When I try to run the following:

import pyterrier as pt; pt.init()
from pyterrier_doc2query import Doc2QueryStore

store = Doc2QueryStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage')
print(store.lookup('100'))

I get the following error:

PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/soyuj/improving-learned-index/src/doc2query--/utils.py", line 9
    print(store.lookup('100'))
          ^^^^^^^^^^^^^^^^^^^
  File "/home/soyuj/.conda/envs/sjb/lib/python3.11/site-packages/pyterrier_doc2query/stores.py", line 60, in lookup
    queries, q_offsets, docnos_lookup = self.payload()
                                        ^^^^^^^^^^^^^^
  File "/home/soyuj/.conda/envs/sjb/lib/python3.11/site-packages/pyterrier_doc2query/stores.py", line 28, in payload
    self._queries_offsets = np.memmap(self.path/'queries.offsets.u8', mode='r', dtype=np.uint64)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/soyuj/.conda/envs/sjb/lib/python3.11/site-packages/numpy/core/memmap.py", line 240, in __new__
    raise ValueError("Size of available data is not a "
ValueError: Size of available data is not a multiple of the data-type size.

When I looked into the details, I found that the cause is Git LFS: the repo being cloned contains an LFS-tracked file, and the clone downloaded the small pointer file instead of the actual data:
[screenshot omitted]
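For anyone hitting the same error, a quick way to confirm this diagnosis is to check whether the file on disk is an LFS pointer rather than the real payload: pointer files are tiny text files that start with a fixed spec header. A minimal sketch (the path is whatever `queries.offsets.u8` resolved to on your machine):

```python
# Detect the Git LFS failure mode: a pointer file begins with the
# LFS spec header instead of the expected binary data.
def is_lfs_pointer(path):
    with open(path, 'rb') as f:
        head = f.read(64)
    return head.startswith(b'version https://git-lfs.github.com/spec/v1')
```

If this returns True, installing git-lfs (`git lfs install`) and running `git lfs pull` inside the cloned repo should fetch the real files.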

Some feedback and suggestions about your paper

I just came across your fantastic paper and have some feedback and suggestions for future work.

First, some nitpicking: the reductions in query execution time and index size are miscalculated:

  • 0.95/1.41 - 1 = -32% (vs. the quoted 1.41/0.95 - 1 = 48%)
  • 23/30 - 1 = -23% (vs. the quoted 30/23 - 1 = 30%)

Still very impressive, but smaller. The accuracy gain was calculated correctly (0.323/0.279 - 1 = 16%).

Suggestions for future work:

Further cleaning the data

  • What would be the effect of lemmatizing? You could first pass the full original document into doc2query so the model has maximum context; then, once you filter and concatenate the generated queries, lemmatize the result to standardize all terms on their root forms. At query time you would lemmatize the search terms (which is very fast) before running the BM25 query.
  • You could probably also remove stop words, either before or after lemmatizing. They shouldn't affect the score/relevance much, but removing them would further shrink the index.
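To make these two suggestions concrete, here is a toy sketch of the proposed post-processing. A real pipeline would use a proper lemmatizer (spaCy or NLTK) and a standard stop-word list; the suffix rules and stop words below are illustrative stand-ins only:

```python
# Toy normalization pass for filtered doc2query expansions:
# drop stop words, then reduce the remaining tokens to a root form.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and"}

def toy_lemmatize(token):
    # Crude suffix stripping; a real system would use spaCy or NLTK.
    for suffix in ("ing", "ies", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-3] + "y" if suffix == "ies" else token[:-len(suffix)]
    return token

def normalize(text):
    tokens = text.lower().split()
    return " ".join(toy_lemmatize(t) for t in tokens if t not in STOPWORDS)
```

The same `normalize()` would be applied to incoming search queries before BM25, so documents and queries meet in the same reduced vocabulary.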

More comprehensive benchmarking

I'd suggest using the BEIR benchmark, particularly for out-of-domain datasets. It seems to be the most comprehensive way to evaluate all of this, and it is what the SPLADE team uses to show that their method outperforms docT5query. SPLADE got a lot of attention when Pinecone published an article about using it. Relevant papers about SPLADE:
https://arxiv.org/pdf/2107.05720.pdf
https://arxiv.org/pdf/2109.10086.pdf
https://arxiv.org/pdf/2110.11540.pdf
https://arxiv.org/pdf/2205.04733.pdf
https://arxiv.org/pdf/2207.03834v1.pdf

Multilingual

  • There's an excessive focus on English in this area, so I'd love to see all of this tested using the doc2query mT5 model trained on the 14-language mMARCO dataset.
  • You could also use the multilingual SBERT cross-encoder trained on the same dataset for the filtering step.

And, more generally, I think there would be a lot of value in exploring a data-centric approach to all of this, which advocates exactly the sort of data cleaning you're doing rather than chasing minor improvements from ever more complex models.

I hope this helps! I really think this approach has enormous potential for providing great IR results at low cost. I'd be happy to chat further about any of it!

How are the GPU hours recorded

Hi, I'm really impressed by your interesting work. I have a question: how were the GPU hours in Table 2 recorded? I gave it a try with your code myself (https://colab.research.google.com/drive/1Y-Q93GplU7dpnBSQZ3ed9drjRPFbk8wP?usp=sharing), but it seems to need 1–2K hours on Colab's Tesla T4 GPU to process the MS MARCO collection (batch size = 32, nsamples = 40). This is very different from the reported figure of 214. Thanks in advance.

P.S.: I also ran the same code on a V100 server and it still needs ~400 hours.
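For what it's worth, the gap can be framed as a throughput question: MS MARCO's passage collection has roughly 8.8M documents, so total GPU hours are just the number of batches times the measured seconds per batch. The 2.8 s/batch figure below is purely illustrative (it happens to reproduce the reported 214 hours), not a measured number:

```python
# Back-of-envelope extrapolation: time a few batches, then scale to
# the whole collection to estimate total GPU hours.
def estimated_gpu_hours(n_docs, batch_size, secs_per_batch):
    n_batches = -(-n_docs // batch_size)  # ceiling division
    return n_batches * secs_per_batch / 3600

# ~8.8M MS MARCO passages, batch size 32, hypothetical 2.8 s/batch:
print(round(estimated_gpu_hours(8_800_000, 32, 2.8)))  # -> 214
```

Timing a handful of batches on your own GPU and plugging the result in should show how far your hardware is from the paper's effective throughput.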
