How to generate dense vector and sparse vector for own data (denspi issue, closed)

Arjunsankarlal commented on July 20, 2024

How to generate dense vector and sparse vector for own data

Comments (8)

seominjoon commented on July 20, 2024

Hi,
I believe you mean creating your own index for an arbitrary text corpus. The code is there but lacks documentation/refactoring. Working on it, please stay tuned!

Arjunsankarlal commented on July 20, 2024

Hi @seominjoon, thanks for the response. Yes, that is exactly what I am looking for. Could you point me to where exactly I should look? That would be very helpful. Thanks in advance :)

jhyuklee commented on July 20, 2024

Hi @Arjunsankarlal, the code for indexing starts here: https://github.com/uwnlp/denspi/blob/11ff5f8d31390384c8346e82f764c3b3c4e5b819/run_piqa.py#L655
Thanks!

bdhingra commented on July 20, 2024

Hi, is there any update on this?

I was trying to generate the sparse index for my own corpus. I assumed open/dump_tfidf.py is the script needed to do this. I am also assuming that we need to pass --sparse to open/run_pred.py to use the sparse index, but I am not sure which argument passes the generated hdf5 file to this script.

Also, what confused me is that open/run_pred.py still seems to require the Wikipedia tfidf dump from DrQA (as --ranker_path). What is this used for? The doc ids there may no longer correspond to my corpus, so will that create a problem? E.g. here: https://github.com/uwnlp/denspi/blob/master/open/mips_sparse.py#L181

I would greatly appreciate some guidance on how to run the dense + sparse index for a custom corpus.

Thank you,
Bhuwan

jhyuklee commented on July 20, 2024

Hi Bhuwan,

Sorry for the inconvenience. Running open/dump_tfidf.py outputs paragraph-level tfidf for your corpus, which should be located under the args.dump_dir/tfidf folder. Note that this script uses [PAR] to split a document into paragraphs.
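
As a rough illustration of the [PAR] convention mentioned above, the sketch below splits one document string into paragraphs. The function name `split_paragraphs` is hypothetical, and stripping whitespace and dropping empty paragraphs are assumptions made here, not confirmed behavior of open/dump_tfidf.py.

```python
from typing import List

PAR_TOKEN = "[PAR]"  # paragraph separator named in the comment above


def split_paragraphs(document: str) -> List[str]:
    """Split a document into paragraphs on the [PAR] marker.

    Assumption: surrounding whitespace is stripped and empty
    paragraphs are dropped. The actual script may behave differently.
    """
    return [p.strip() for p in document.split(PAR_TOKEN) if p.strip()]


doc = "First paragraph. [PAR] Second paragraph.[PAR]  [PAR]Third."
paragraphs = split_paragraphs(doc)
for i, par in enumerate(paragraphs):
    print(i, par)
```

When preparing a custom corpus, it is probably worth checking that each document contains the marker wherever a paragraph boundary is intended, since a document without it would be treated as a single paragraph under this scheme.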

Also, the reason we need DrQA is to compute document-level tfidf, since it provides the inverted index over all Wikipedia documents. If you want to run DenSPI on a subset of Wikipedia, you have to modify the code to map your documents to their original indices in the DrQA Wikipedia corpus. And yes, using a custom (non-Wikipedia) corpus will create a problem in this version. You can simply remove the document-level tfidf, but this will noticeably decrease performance (especially for QA pairs where document selection matters, e.g., SQuAD-Open). For custom document-level tfidf generation, see here: https://github.com/facebookresearch/DrQA/blob/master/scripts/retriever/build_tfidf.py.
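
To make the idea of document-level tfidf concrete, here is a minimal pure-Python sketch that scores every document in a small custom corpus against a query. It is a simplified stand-in and not DrQA's implementation: the whitespace tokenizer, the smoothed idf formula, and the function names `build_doc_tfidf` and `score_docs` are all assumptions chosen for illustration. For a real index, the DrQA script linked above is the reference.

```python
import math
from collections import Counter


def build_doc_tfidf(docs):
    """Build per-document tf-idf vectors (as dicts) for a list of strings.

    Assumptions: lowercase whitespace tokenization and a smoothed idf,
    log((1 + N) / (1 + df)) + 1. Row i of the result corresponds to docs[i].
    """
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    doc_vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        doc_vectors.append({t: count * idf[t] for t, count in tf.items()})
    return idf, doc_vectors


def score_docs(query, idf, doc_vectors):
    """Return one dot-product score per document, in corpus order."""
    q_tf = Counter(query.lower().split())
    # Query terms never seen in the corpus get zero weight.
    q_vec = {t: count * idf.get(t, 0.0) for t, count in q_tf.items()}
    return [
        sum(weight * vec.get(t, 0.0) for t, weight in q_vec.items())
        for vec in doc_vectors
    ]


docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "the stock market fell",
]
idf, doc_vectors = build_doc_tfidf(docs)
scores = score_docs("stock market", idf, doc_vectors)
best_doc_idx = scores.index(max(scores))
print(best_doc_idx, scores)
```

The point of the sketch is that the score list is indexed by document position, so whatever consumes these scores has to use the same document ordering as the corpus they were built from.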

We are refactoring the code and will provide a cleaner version for custom corpora. It will take a few more weeks. Thanks.

Jinhyuk

bdhingra commented on July 20, 2024

Thanks for the quick response Jinhyuk!

So, to confirm my understanding: the order of documents in self.ranker.doc_mat here should match the order in the predict file used for generating the phrase vectors passed to run_piqa.py? (I ask because the doc_idx seems to be inferred using an enumerate on the input docs here.)

jhyuklee commented on July 20, 2024

Yes, you are correct. See here where 'doc_idx' is used as the key of the hdf5 files, and here where 'doc_idx' is used to get the document scores calculated from self.ranker.doc_mat.
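
The alignment requirement confirmed above can be checked with a small sanity test before building an index. The sketch below is hypothetical: `check_alignment` and the title lists are illustrative names, and a plain dict stands in for the hdf5 file keyed by 'doc_idx'. It only assumes, as discussed in the thread, that doc_idx comes from enumerating the input documents.

```python
def check_alignment(predict_titles, ranker_titles):
    """Return the positions where the two document orderings disagree.

    predict_titles: document titles in the order of the predict file.
    ranker_titles: document titles in the order of the ranker's rows.
    An empty list means position i refers to the same document in both.
    """
    mismatches = [
        i
        for i, (a, b) in enumerate(zip(predict_titles, ranker_titles))
        if a != b
    ]
    if len(predict_titles) != len(ranker_titles):
        # Flag the first position that exists in only one of the lists.
        mismatches.append(min(len(predict_titles), len(ranker_titles)))
    return mismatches


predict_titles = ["Doc A", "Doc B", "Doc C"]
ranker_titles = ["Doc A", "Doc C", "Doc B"]  # deliberately misordered

# Stand-in for the phrase dump: keys are stringified enumerate indices.
phrase_dump = {str(doc_idx): title for doc_idx, title in enumerate(predict_titles)}

bad_positions = check_alignment(predict_titles, ranker_titles)
print(bad_positions)
```

If the check reports any positions, a doc_idx read from the phrase dump would select the score of a different document from the ranker, which is the mismatch raised earlier in the thread.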

seominjoon commented on July 20, 2024

Hi @Arjunsankarlal and @bdhingra,
I just updated the code and readme so that they now support running a demo for a custom phrase index.
Please try https://github.com/uwnlp/denspi#train and https://github.com/uwnlp/denspi#create-a-custom-phrase-index
You will be able to train with your own SQuAD-like data and host a demo with your custom document files as well.

Scaling up is detailed in https://github.com/uwnlp/denspi#create-a-large-phrase-index

It's still missing some details, which will be added soon. Thanks!
