Comments (8)
Hi,
I believe you mean creating your own index for an arbitrary text corpus. The code is there but lacks documentation/refactoring. Working on it, please stay tuned!
from denspi.
Hi @seominjoon, thanks for the response. Yes, that is exactly what I am looking for. Could you point me to where exactly I should look? That would be very helpful. Thanks in advance :)
Hi @Arjunsankarlal, code for indexing starts here https://github.com/uwnlp/denspi/blob/11ff5f8d31390384c8346e82f764c3b3c4e5b819/run_piqa.py#L655
Thanks!
Hi, is there any update on this?
I was trying to generate the sparse index for my own corpus. I assumed open/dump_tfidf.py is the script needed to do this. I am also assuming that we need to pass --sparse to open/run_pred.py to use the sparse index. But I am not sure which argument to use to pass in the generated hdf5 file to this script?
Also, what confused me is that open/run_pred.py still seems to require the Wikipedia tfidf dump from DrQA (as --ranker_path). What is this used for? The doc ids here may not correspond to my corpus anymore, so will that create a problem? E.g. here: https://github.com/uwnlp/denspi/blob/master/open/mips_sparse.py#L181
I would greatly appreciate some guidance on how to run the dense + sparse index for a custom corpus.
Thank you,
Bhuwan
Hi Bhuwan,
Sorry for the inconvenience. Running open/dump_tfidf.py outputs paragraph-level tfidf for your corpus, which should be located under the args.dump_dir/tfidf folder. Note that this script uses [PAR] to split a document into paragraphs.
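To make the [PAR] convention concrete, here is a minimal sketch of splitting a document into paragraphs and weighting terms per paragraph. It is an illustration only, not the code in open/dump_tfidf.py: the function names, the whitespace tokenizer, and the smoothed idf formula are all assumptions.

```python
import math
from collections import Counter

PAR_SEP = "[PAR]"  # paragraph separator mentioned above


def split_paragraphs(document):
    """Split a document on the [PAR] marker, dropping empty pieces."""
    return [p.strip() for p in document.split(PAR_SEP) if p.strip()]


def paragraph_tfidf(paragraphs):
    """Return one {term: weight} dict per paragraph.

    Assumption: whitespace tokenization and a smoothed idf,
    log((1 + N) / (1 + df)) + 1. The real script may differ.
    """
    tokenized = [p.lower().split() for p in paragraphs]
    n_pars = len(tokenized)
    # Document frequency: in how many paragraphs each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_pars) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


doc = "Paris is the capital of France. [PAR] Berlin is the capital of Germany. [PAR] "
pars = split_paragraphs(doc)
vecs = paragraph_tfidf(pars)
```

Terms that occur in only one paragraph (such as "paris") end up with a higher weight than terms shared by every paragraph (such as "capital").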
Also, we need DrQA to compute document-level tfidf, because it provides the inverted index for all Wikipedia documents. If you want to run DenSPI on a subset of Wikipedia, you have to modify the code to map your documents to their original indices in the DrQA Wikipedia corpus. And yes, using a custom (non-Wikipedia) corpus will create a problem in this version. You can simply remove the document-level tfidf, but that noticeably decreases performance, especially for QA pairs where document selection matters (e.g., SQuAD-Open). For custom document-level tfidf generation, see https://github.com/facebookresearch/DrQA/blob/master/scripts/retriever/build_tfidf.py.
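The subset-to-original mapping could look roughly like the sketch below. It uses plain Python lists as stand-ins: `original_ids` (the document ids in the order of the original tfidf rows) and `subset_ids` are hypothetical names, and nothing here calls the DrQA or DenSPI APIs.

```python
def map_to_original_rows(subset_ids, original_ids):
    """Map each subset document id to its row in the original ordering.

    Raises KeyError if a subset document is missing from the original
    corpus, since its document-level score could not be looked up.
    """
    row_of = {doc_id: row for row, doc_id in enumerate(original_ids)}
    missing = [doc_id for doc_id in subset_ids if doc_id not in row_of]
    if missing:
        raise KeyError(f"documents not in original corpus: {missing}")
    return [row_of[doc_id] for doc_id in subset_ids]


# Hypothetical ids: the original corpus has four documents, the subset two.
original_ids = ["Anarchism", "Berlin", "Paris", "Tokyo"]
rows = map_to_original_rows(["Paris", "Anarchism"], original_ids)  # [2, 0]
```

Failing loudly on a missing id is a deliberate choice here: a silently wrong row would attach another document's score to your phrases.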
We are refactoring the code and will provide a cleaner version for custom corpora. It will take a few more weeks. Thanks.
Jinhyuk
Thanks for the quick response Jinhyuk!
So to confirm if my understanding is correct, the order of documents in self.ranker.doc_mat here should match the order in the predict file used for generating the phrase vectors passed to run_piqa.py? (Since the doc_idx seems to be inferred using an enumerate on the input docs here?)
Yes, you are correct. See here where 'doc_idx' is used for the key of hdf5 files, and here where 'doc_idx' is used to get document scores calculated from self.ranker.doc_mat.
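A quick way to check this alignment before indexing is sketched below. It assumes each input document is a dict with a "title" field and that you can list the ranker's document titles in row order; both are assumptions for illustration, and the function names are hypothetical rather than part of DenSPI.

```python
def phrase_dump_keys(predict_docs):
    """Keys as derived by enumerate: the position in the predict file."""
    return [str(doc_idx) for doc_idx, _ in enumerate(predict_docs)]


def find_misaligned(predict_docs, ranker_titles):
    """Return every doc_idx where the predict file and ranker order disagree."""
    mismatches = []
    for doc_idx, doc in enumerate(predict_docs):
        if doc_idx >= len(ranker_titles) or ranker_titles[doc_idx] != doc["title"]:
            mismatches.append(doc_idx)
    return mismatches


predict_docs = [{"title": "Paris"}, {"title": "Berlin"}, {"title": "Tokyo"}]
aligned = find_misaligned(predict_docs, ["Paris", "Berlin", "Tokyo"])  # []
swapped = find_misaligned(predict_docs, ["Paris", "Tokyo", "Berlin"])  # [1, 2]
```

An empty result means row doc_idx of the document matrix describes the same document as hdf5 key doc_idx; any other result means document scores would be attached to the wrong phrases.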
Hi @Arjunsankarlal and @bdhingra,
I just updated the code and readme so that they now support running a demo with a custom phrase index.
Please try https://github.com/uwnlp/denspi#train and https://github.com/uwnlp/denspi#create-a-custom-phrase-index
You will be able to train with your own SQuAD-like data and host a demo with your custom document files as well.
Scaling up is detailed in https://github.com/uwnlp/denspi#create-a-large-phrase-index
It's still missing some details, which will be added soon. Thanks!