
Comments (8)

seominjoon commented on July 20, 2024

Hi,
I believe you mean creating your own index for an arbitrary text corpus. The code is there but lacks documentation/refactoring. Working on it, please stay tuned!

from denspi.

Arjunsankarlal commented on July 20, 2024

Hi @seominjoon, thanks for the response. Yes, that is exactly what I am looking for. Could you point me to where exactly I should look? That would be very helpful. Thanks in advance :)


jhyuklee commented on July 20, 2024

Hi @Arjunsankarlal, code for indexing starts here https://github.com/uwnlp/denspi/blob/11ff5f8d31390384c8346e82f764c3b3c4e5b819/run_piqa.py#L655
Thanks!


bdhingra commented on July 20, 2024

Hi, is there any update on this?

I was trying to generate the sparse index for my own corpus. I assumed open/dump_tfidf.py is the script needed to do this. I am also assuming that we need to pass --sparse to open/run_pred.py to use the sparse index. But I am not sure which argument to use to pass the generated hdf5 file to this script.

Also, what confused me is that open/run_pred.py still seems to require the Wikipedia tfidf dump from DrQA (as --ranker_path). What is this used for? The doc ids here may not correspond to my corpus anymore, so will that create a problem? E.g. here: https://github.com/uwnlp/denspi/blob/master/open/mips_sparse.py#L181

I would greatly appreciate some guidance on how to run the dense + sparse index for a custom corpus.

Thank you,
Bhuwan


jhyuklee commented on July 20, 2024

Hi Bhuwan,

Sorry for the inconvenience. Running open/dump_tfidf.py outputs paragraph-level tfidf for your corpus, which should be located under the args.dump_dir/tfidf folder. Note that this script uses [PAR] to split a document into paragraphs.
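
For readers preparing their own corpus, here is a minimal sketch of what splitting on the [PAR] marker could look like. The function name split_paragraphs and the whitespace handling are assumptions made for illustration; this is not the code in open/dump_tfidf.py.

```python
from typing import List

# Separator mentioned in the thread; everything else here is hypothetical.
PAR_SEP = "[PAR]"


def split_paragraphs(document: str, sep: str = PAR_SEP) -> List[str]:
    """Split a document on the separator, trimming whitespace and dropping empty pieces."""
    return [piece.strip() for piece in document.split(sep) if piece.strip()]


doc = "First paragraph. [PAR] Second paragraph.[PAR][PAR] Third."
paragraphs = split_paragraphs(doc)
for par_idx, paragraph in enumerate(paragraphs):
    print(par_idx, paragraph)
```

Dropping empty pieces is a choice made in this sketch; whether the real script keeps them would have to be checked in the repository.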

Also, the reason we need DrQA is to compute document-level tfidf, as it has the inverted index of the whole Wikipedia corpus. If you want to use a subset of Wikipedia for running DenSPI, you have to modify the code to map your documents to the original index in the DrQA Wikipedia corpus. And yes, it will create a problem if you use a custom corpus (not Wikipedia) in this version. You can simply remove the document-level tfidf, but this will noticeably decrease performance (especially for QA pairs where document selection matters, e.g., SQuAD-Open). For custom document-level tfidf generation, see here: https://github.com/facebookresearch/DrQA/blob/master/scripts/retriever/build_tfidf.py.
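
To illustrate what a document-level tfidf score contributes, here is a small standard-library sketch. It is a toy, not DrQA's build_tfidf.py and not the DenSPI scoring code: the whitespace tokenizer, the smoothed idf formula, and the function names build_doc_tfidf and score_documents are all assumptions chosen for clarity.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Toy tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_doc_tfidf(docs: List[str]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Return one sparse tf-idf vector per document plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    # Smoothed idf (an assumption of this sketch).
    idf = {t: math.log((n_docs + 1) / (df + 1)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf


def score_documents(query: str, vectors: List[Dict[str, float]], idf: Dict[str, float]) -> List[float]:
    """Dot product between the query's tf-idf vector and each document vector."""
    q_vec = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    return [sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items()) for d_vec in vectors]


docs = [
    "paris is the capital of france",
    "berlin is the capital of germany",
]
vectors, idf = build_doc_tfidf(docs)
scores = score_documents("capital of france", vectors, idf)
best_doc_idx = max(range(len(docs)), key=lambda i: scores[i])
print(best_doc_idx, scores)
```

The position of each score in the returned list is the document's index in the input list, which is why the document order has to stay consistent with whatever index the phrase dump uses.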

We are working on refactoring the code and providing a cleaner version for custom corpora. It will take a few more weeks. Thanks.

Jinhyuk


bdhingra commented on July 20, 2024

Thanks for the quick response Jinhyuk!

So, to confirm my understanding: the order of documents in self.ranker.doc_mat here should match the order in the predict file used for generating the phrase vectors passed to run_piqa.py? (The doc_idx seems to be inferred using an enumerate over the input docs here.)


jhyuklee commented on July 20, 2024

Yes, you are correct. See here where 'doc_idx' is used for the key of hdf5 files, and here where 'doc_idx' is used to get document scores calculated from self.ranker.doc_mat.
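
The ordering requirement confirmed above can be sanity-checked before indexing. The sketch below is an assumption-laden illustration: a plain dict stands in for the hdf5 phrase dump, a list of titles stands in for the row order of the ranker's document matrix, and find_misaligned is a hypothetical helper, not part of DenSPI or DrQA.

```python
from typing import Dict, List

# Documents in the order they appear in the predict file (hypothetical data).
docs = [
    {"title": "Alpha", "text": "..."},
    {"title": "Beta", "text": "..."},
    {"title": "Gamma", "text": "..."},
]

# Stand-in for the phrase dump: keyed by str(doc_idx), with doc_idx taken from enumerate.
phrase_dump: Dict[str, Dict[str, str]] = {}
for doc_idx, doc in enumerate(docs):
    phrase_dump[str(doc_idx)] = {"title": doc["title"]}


def find_misaligned(dump: Dict[str, Dict[str, str]], ranker_titles: List[str]) -> List[int]:
    """Return every doc_idx whose title differs from the ranker's row at the same position."""
    bad = []
    for key, entry in dump.items():
        idx = int(key)
        if idx >= len(ranker_titles) or ranker_titles[idx] != entry["title"]:
            bad.append(idx)
    return sorted(bad)


# Ranker built from the same document order: every row lines up.
aligned = find_misaligned(phrase_dump, [d["title"] for d in docs])

# Ranker built from a different order: rows 0 and 2 now point at the wrong documents.
shuffled = find_misaligned(phrase_dump, ["Gamma", "Beta", "Alpha"])
print(aligned, shuffled)
```

If the second list is non-empty for a real corpus, the document scores looked up by doc_idx would belong to different documents than the phrases they are combined with.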


seominjoon commented on July 20, 2024

Hi @Arjunsankarlal and @bdhingra,
I just updated the code and README so that they now support running a demo for a custom phrase index.
Please try https://github.com/uwnlp/denspi#train and https://github.com/uwnlp/denspi#create-a-custom-phrase-index
You will be able to train with your own SQuAD-like data and host a demo with your custom document files as well.

Scaling up is detailed in https://github.com/uwnlp/denspi#create-a-large-phrase-index

It's still missing some details, which will be added soon. Thanks!

