pyterrier_doc2query's Issues
Issues with Fetching Queries/Scores from Store
I am just trying to fetch the pre-computed queries and scores.
When I try to run the following:
import pyterrier as pt; pt.init()
from pyterrier_doc2query import Doc2QueryStore
store = Doc2QueryStore.from_repo('https://huggingface.co/datasets/macavaney/d2q-msmarco-passage')
print(store.lookup('100'))
I get the following error:
PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7
No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/soyuj/improving-learned-index/src/doc2query--/utils.py", line 9
print(store.lookup('100'))
^^^^^^^^^^^^^^^^^^^
File "/home/soyuj/.conda/envs/sjb/lib/python3.11/site-packages/pyterrier_doc2query/stores.py", line 60, in lookup
queries, q_offsets, docnos_lookup = self.payload()
^^^^^^^^^^^^^^
File "/home/soyuj/.conda/envs/sjb/lib/python3.11/site-packages/pyterrier_doc2query/stores.py", line 28, in payload
self._queries_offsets = np.memmap(self.path/'queries.offsets.u8', mode='r', dtype=np.uint64)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/soyuj/.conda/envs/sjb/lib/python3.11/site-packages/numpy/core/memmap.py", line 240, in __new__
raise ValueError("Size of available data is not a "
ValueError: Size of available data is not a multiple of the data-type size.
When I looked into the details, I found that it is because the repo stores this file with Git LFS, and the LFS pointer file is being downloaded instead of the actual data.
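As a quick sanity check (a sketch; the helper name is mine, not part of pyterrier_doc2query), you can test whether a downloaded file is a Git LFS pointer rather than the real binary payload. LFS pointers are tiny text files whose first line names the LFS spec; the fix is typically to install git-lfs and run `git lfs pull` in the cloned repo.

```python
from pathlib import Path

def looks_like_lfs_pointer(path):
    # A Git LFS pointer is a small text file whose first line names the LFS spec,
    # whereas the real queries.offsets.u8 file is raw binary uint64 data.
    head = Path(path).read_bytes()[:64]
    return head.startswith(b"version https://git-lfs.github.com/spec/v1")
```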
Don't extend ApplyGenericTransformer
Some feedback and suggestions about your paper
I just came across your fantastic paper and have some feedback and suggestions for future work.
First, some nitpicking: you miscalculated the reduction in query execution time and index size:
- 0.95/1.41 - 1 = -32% (vs. the quoted 1.41/0.95 - 1 = 48%)
- 23/30 - 1 = -23% (vs. the quoted 30/23 - 1 = 30%)
Still very impressive, but smaller. The accuracy gain was calculated correctly, though (.323/.279 - 1 = 16%).
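The correction above amounts to computing a relative change as new/old - 1 rather than old/new - 1, so that a negative value means a reduction. A minimal sketch using the figures quoted above:

```python
def relative_change(old, new):
    # (new - old) / old: negative values indicate a reduction
    return new / old - 1

# Query execution time: 1.41 -> 0.95 (units as in the paper)
time_delta = relative_change(1.41, 0.95)   # ~ -0.326, i.e. roughly a 32-33% reduction
# Index size: 30 -> 23
size_delta = relative_change(30, 23)       # ~ -0.233, i.e. roughly a 23% reduction
```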
Suggestions for future work:
Further cleaning the data
- What would be the effect of lemmatizing? I think you could first pass the full original document into doc2query so it has maximum context. Once you filter and concatenate the generated queries, you could then lemmatize the entire result to standardize all terms on their root words. At search time you would just lemmatize the query terms (very fast) prior to running the BM25 query.
- You could probably even remove stop words, either before or after lemmatizing. They shouldn't affect the score/relevance much, but would further reduce the index size.
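The lemmatize-then-filter idea could be sketched as follows. This is a toy illustration: the lemma table stands in for a real lemmatizer (e.g. spaCy or NLTK's WordNetLemmatizer) and the stop-word list is deliberately tiny; the key point is that the same normalization runs over the expanded documents at index time and over the query terms at search time.

```python
# Hypothetical toy lemma table standing in for a real lemmatizer
LEMMAS = {"running": "run", "ran": "run", "queries": "query", "documents": "document"}
STOPWORDS = {"the", "a", "an", "of", "to", "is"}

def normalize(text):
    tokens = text.lower().split()
    tokens = [LEMMAS.get(t, t) for t in tokens]       # map terms to their root words
    return [t for t in tokens if t not in STOPWORDS]  # optionally drop stop words

# Apply identically to expanded documents (index time) and query terms (search time)
print(normalize("The running of queries"))  # ['run', 'query']
```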
More comprehensive benchmarking
I'd suggest using the BEIR benchmark, particularly for out-of-sample/out-of-domain datasets. It seems to be the most comprehensive way to evaluate all of this and is what the SPLADE team used to show how their method improves on docT5query. SPLADE got a lot of attention recently when Pinecone published an article about using it. Relevant papers about SPLADE:
https://arxiv.org/pdf/2107.05720.pdf
https://arxiv.org/pdf/2109.10086.pdf
https://arxiv.org/pdf/2110.11540.pdf
https://arxiv.org/pdf/2205.04733.pdf
https://arxiv.org/pdf/2207.03834v1.pdf
Multilingual
- There's an excessive focus on English in this area, so I'd love to see all of this tested using the doc2query mT5 model based on the 14-language mMARCO dataset.
- You could also use this multilingual SBERT cross-encoder that was trained on the same dataset for the filtering.
And, more generally, I think there would be a lot of value in exploring a data-centric approach to all of this, which advocates for the sort of data cleaning you're doing rather than chasing minor improvements from ever more complex models.
- I found a lot of useful info about all of this when I started looking into the Argilla platform https://www.argilla.io/
- This video was fantastic as well.
- And this related competition https://https-deeplearning-ai.github.io/data-centric-comp/
I hope this helps! I really think this approach has enormous potential for providing great IR results at low cost. I'd be happy to chat further about any of it!
How are the GPU hours recorded
Hi, really impressed by your interesting work. I have a question: how are the GPU hours in Table 2 recorded? I gave it a try with your code myself: https://colab.research.google.com/drive/1Y-Q93GplU7dpnBSQZ3ed9drjRPFbk8wP?usp=sharing but it seems to need 1-2K hours on Colab's Tesla T4 GPU to process the MS MARCO collection (batch size = 32, nsamples = 40). This is very different from the reported figure, i.e. 214. Thanks in advance.
P.S.: I also ran the same code on a V100 server and it still needs ~400 hours.
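Since the total hours scale linearly with measured throughput, a back-of-the-envelope extrapolation makes the discrepancy concrete. The collection size below is the real MS MARCO passage count; the throughput figure is hypothetical, chosen only to illustrate the calculation:

```python
# Rough GPU-hour estimate from measured generation throughput
num_passages = 8_841_823      # MS MARCO passage collection size
passages_per_second = 6.0     # hypothetical measured throughput on one GPU
gpu_hours = num_passages / passages_per_second / 3600
print(round(gpu_hours))       # ~409 hours at this throughput
```

Measuring passages/second on a small sample and plugging it in gives a quick sanity check against any reported total.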