seominjoon / denspi
Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index (DenSPI)
Home Page: https://nlp.cs.washington.edu/denspi
License: Apache License 2.0
Hello,
I am facing an issue with sparse-first search and hybrid search.
Dense-first search works fine, but when I select either of the other options I get the following error:
KeyError: "Unable to open object (object '3580546' doesn't exist)"
I used the pretrained model and then created a custom phrase index for "dev-v1.1".
ERROR:flask.app:Exception on /api [GET]
Traceback (most recent call last):
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask/app.py", line 2292, in wsgi_app
response = self.full_dispatch_request()
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask/app.py", line 1815, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask_cors/extension.py", line 161, in wrapped_function
return cors_after_request(app.make_response(f(*args, **kwargs)))
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask/app.py", line 1718, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
raise value
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
rv = self.dispatch_request()
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "open/run_demo.py", line 128, in api
doc_top_k=5)
File "open/run_demo.py", line 94, in search
search_strategy=search_strategy, doc_top_k=5)
File "/root/denspi/open/mips_sparse.py", line 291, in search
doc_top_k=5)
File "/root/denspi/open/mips_sparse.py", line 218, in search_start
(doc_idxs, start_idxs), start_scores = self.search_sparse(query_start, doc_scores, doc_top_k)
File "/root/denspi/open/mips_sparse.py", line 168, in search_sparse
doc_group = self.get_doc_group(doc_idx)
File "/root/denspi/open/mips.py", line 121, in get_doc_group
if len(self.phrase_dumps) == 1:
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/root/anaconda3/envs/despi/lib/python3.6/site-packages/h5py/_hl/group.py", line 262, in __getitem__
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object '3580546' doesn't exist)"
ERROR:tornado.access:500 GET /api?strat=sparse_first&query=pharmacy%20department%20and%20specialised%20areas%20 (127.0.0.1) 419.57ms
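The KeyError above usually indicates that the sparse retriever returned a document id (here '3580546') that has no group in the HDF5 phrase dump: the TF-IDF index covers a large corpus, while a dump built only from dev-v1.1 contains just that subset. A minimal sketch of the mismatch, with a plain dict standing in for the open h5py file (names here are illustrative, not the repo's actual variables):

```python
# doc_groups stands in for an h5py.File of phrase dumps keyed by doc id.
doc_groups = {"100": "phrases-100", "200": "phrases-200"}

def get_doc_group(doc_idx):
    """Look up a document's phrase group, failing loudly if the sparse
    retriever returned a doc id that was never dumped."""
    key = str(doc_idx)
    if key not in doc_groups:
        raise KeyError(
            f"doc '{key}' is missing from the phrase dump; the dump "
            "likely covers a smaller corpus than the sparse (TF-IDF) "
            "index used for retrieval"
        )
    return doc_groups[key]
```

If this is the cause, the fix is to make sure the phrase dump and the sparse index were built over the same document collection.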
How could I reproduce the results for SQuAD 1.1 (as shown in Table 1 of the paper) from scratch? I want to train a DenSPI model (with scalar) from scratch, but I do not know how to achieve this. The README file does not explain it.
Enable a one-command routine for indexing, prediction, and evaluation.
This will go into `open/run_index_pred_eval.py`.
The entire evaluation process will then consist of roughly three stages: indexing, prediction, and evaluation.
Hello there,
I am facing an issue setting up this code. Here is what I did:
I downloaded the pretrained model by running: "gsutil cp -r gs://denspi/v1-0/model ."
Then I created the custom phrase index for "dev-v1.1" by running the command below:
python run_piqa.py --do_dump --filter_threshold -2 --save_dir SAVE3_DIR/ --load_dir ROOT_DIR/model --metadata_dir ROOT_DIR/bert --data_dir ROOT_DIR/data/dev-v1.1 --predict_file 0:2 --output_dir ROOT_DIR/your_dump/phrase --dump_file 0-1.hdf5
After that, I serve the API and run the demo with the following commands:
python run_piqa.py --do_serve --load_dir ROOT_DIR/model --metadata_dir ROOT_DIR/bert --do_load --parallel --port 8000
python open/run_demo.py ROOT_DIR/dump ROOT_DIR/wikipedia --api_port 8000 --port 3000 --index_name 64_flat_SQ8 --sparse_type p
But the demo is not working properly. I tested it with questions from the SQuAD 1.1 dataset, but it does not give proper answers; instead it appears to return random answers.
I cannot understand why it is not providing accurate answers. Is there something I have missed or am doing wrong?
Is it compulsory to train the model ourselves, or will the pretrained model provided at "gs://denspi/v1-0/model" work instead of training our own?
After all the installations (faiss, drqa, and the two requirements.txt files from this repo), run_index_pred_eval.py gives an error like the one below:
$ python open/run_index_pred_eval.py /home/jinhyuk/github/kernel-sparse/dense /data_nfs/camist002/data/dev-3.json --para --no_od
sampling from:
/home/jinhyuk/github/kernel-sparse/dense/phrase.hdf5
WARNING clustering 788 points to 256 centroids: please provide at least 9984 training points
Clustering 788 points in 481D to 256 clusters, redo 1 times, 10 iterations
Preprocessing in 0.00 s
INTEL MKL ERROR: /home/jinhyuk/miniconda3/envs/kesper/lib/python3.6/site-packages/faiss/../../../libmkl_avx2.so: undefined symbol: mkl_sparse_optimize_bsr_trsm_i8.
Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.
Following the recommendation from here, running `conda install nomkl numpy scipy scikit-learn numexpr` shows that there are version conflicts between packages:
$ conda install nomkl numpy scipy scikit-learn numexpr
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: \
Found conflicts! Looking for incompatible packages.
This can take several minutes. Press CTRL-C to abort.
failed
UnsatisfiableError: The following specifications were found to be incompatible with each other:
Package libopenblas conflicts for:
scikit-learn -> numpy[version='>=1.11.3,<2.0a0'] -> libopenblas[version='>=0.3.2,<0.3.3.0a0']
Package blas conflicts for:
mkl_fft -> numpy-base[version='>=1.0.6,<2.0a0'] -> blas[version='|1.0',build=openblas]
blas
scikit-learn -> blas[version='||1.0',build='mkl|openblas|mkl|openblas']
nomkl -> blas=[build=openblas]
mkl_fft -> blas[version='|1.0',build=mkl]
numexpr -> blas[version='||1.0',build='mkl|openblas|mkl|openblas']
numpy -> blas[version='||1.0',build='mkl|openblas|mkl|openblas']
mkl_random -> blas[version='|1.0',build=mkl]
faiss-cpu=1.5.2 -> numpy[version='>=1.11'] -> numpy-base==1.16.0=py36hde5b4d6_1 -> blas[version='|1.0',build=openblas]
scipy -> blas[version='||1.0',build='mkl|openblas|mkl|openblas']
mkl_random -> numpy-base[version='>=1.0.2,<2.0a0'] -> blas[version='|1.0',build=openblas]
numpy-base -> blas[version='|*|1.0',build='mkl|openblas|mkl|openblas']
faiss-cpu=1.5.2 -> blas=[build=mkl]
faiss-cpu=1.5.2 -> numpy[version='>=1.11'] -> blas==1.0=mkl
Any idea how to resolve this?
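The `Cannot load libmkl_avx2.so or libmkl_def.so` / `mkl_sparse_optimize_bsr_trsm_i8` failure is a known MKL library-loading clash (often triggered by faiss from conda) rather than a DenSPI bug. When the `nomkl` route hits solver conflicts as above, one commonly reported workaround is to preload MKL's core libraries before launching Python. The paths below assume a standard conda layout and are not from this repo; verify the `.so` files exist in your environment first:

```shell
# Assumed conda layout; check that these files exist in $CONDA_PREFIX/lib.
export LD_PRELOAD="$CONDA_PREFIX/lib/libmkl_core.so:$CONDA_PREFIX/lib/libmkl_sequential.so"
```

With the variable exported, re-run the failing command (e.g. `python open/run_index_pred_eval.py ...`) in the same shell.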
Currently the neg training routine (`--train_neg` in run_piqa.py) is different from what is described in the paper.
In the paper, we use the 'no answer' logit to train on negative examples, so there is no separate neg training routine. In the code, the neg training routine instead attaches a neg example (one whose question embedding is similar) to each positive example after normal training.
In the code, several noise injections are also used.
In practice, the strategy in the current code is better than that in the paper (no answer logit). The paper will be updated soon and this issue will be resolved.
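The "attach a similar negative" step described above can be sketched as follows. This is a minimal illustration using cosine similarity over question embeddings; the function and variable names are mine, not those in run_piqa.py:

```python
import numpy as np

def pick_negatives(pos_q, neg_q):
    """For each positive example's question embedding, return the index
    of the most similar negative example (by cosine similarity)."""
    pos = pos_q / np.linalg.norm(pos_q, axis=1, keepdims=True)
    neg = neg_q / np.linalg.norm(neg_q, axis=1, keepdims=True)
    sims = pos @ neg.T  # (n_pos, n_neg) cosine-similarity matrix
    return np.argmax(sims, axis=1)

pos_q = np.array([[1.0, 0.0], [0.0, 1.0]])
neg_q = np.array([[0.9, 0.1], [0.2, 0.8]])
print(pick_negatives(pos_q, neg_q))  # → [0 1]
```

Each positive example is then trained alongside its selected negative, which is the behavior the current code prefers over the paper's 'no answer' logit.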
How is each float32 value converted to int8? Where is the code for this?
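If you are using the SQ8 faiss index, the float32-to-8-bit conversion happens inside faiss's scalar quantizer when the index is trained and vectors are added, so there is no explicit per-value cast to look for. The underlying idea is plain min-max scalar quantization; a numpy sketch (function names are mine, not faiss's):

```python
import numpy as np

def sq8_encode(x, vmin, vmax):
    """Map float32 values in [vmin, vmax] to 8-bit codes in [0, 255]."""
    scale = (vmax - vmin) / 255.0
    return np.clip(np.round((x - vmin) / scale), 0, 255).astype(np.uint8)

def sq8_decode(codes, vmin, vmax):
    """Recover approximate float32 values from the 8-bit codes."""
    scale = (vmax - vmin) / 255.0
    return vmin + codes.astype(np.float32) * scale

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
codes = sq8_encode(x, -1.0, 1.0)
recon = sq8_decode(codes, -1.0, 1.0)
# reconstruction error is bounded by half a quantization step (~0.004 here)
```

The per-dimension `vmin`/`vmax` bounds are what the quantizer learns during training; decoding multiplies the code back by the step size and adds the offset.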
Hi,
Thanks for your good work. I would like to reproduce the results for SQuAD 1.1 (as shown in Table 1 of the paper), but I am having some trouble.
First, I downloaded the pretrained model from "gs://denspi/v1-0/model" and then tried to evaluate on dev-v1.1 using: "python run_piqa.py --do_predict --output_dir tmp --do_load --load_dir model --predict_file dev-v1.1.json --do_eval --gt_file dev-v1.1.json --metadata_dir bert"
The predicted answers seem to be random spans, resulting in metrics like {"exact_match": 0.47303689687795647, "f1": 4.43806570152543}. 0.47% EM means something is totally wrong.
I wonder whether I did it correctly.
If I want to train a model to reproduce the results myself, since I cannot get the pretrained model to work, is it enough to just run the first step in the training section (i.e. "python run_piqa.py --train_batch_size 12 --do_train --freeze_word_emb --save_dir $SAVE1_DIR")?
Thanks, and I hope to get your advice.
Ques:
Setting:
All the results are obtained using the commands mentioned in README.
Hi,
The demo link mentioned in the README, http://allgood.cs.washington.edu:15001/, is not working. I tried Firefox and Chrome, but it does not open in either.
What's the logic behind the versions of torch in the requirements?
Hi, thanks for open-sourcing the project. Great work!
I have questions about the choice of faiss index; I'd really appreciate it if you could find time to clarify:
Could you please share the detailed procedure of how you index Wikipedia? Is IVF1048576_HNSW32_SQ8, searched with nprobe=64, a precise summary of your choice?
I see that in open/build_index.py there is a function named merge_indexes. Did you build multiple sub-indexes and then merge them, or not? I feel this choice may have some effect on performance.
To make Q1 more specific: the index-building process seems quite complicated in your code. By default, it goes through
https://github.com/uwnlp/denspi/blob/f540b6a547f012823fc6c2bb10077df6bccc13a6/open/run_index.py#L121
https://github.com/uwnlp/denspi/blob/f540b6a547f012823fc6c2bb10077df6bccc13a6/open/run_index.py#L126-L131
https://github.com/uwnlp/denspi/blob/f540b6a547f012823fc6c2bb10077df6bccc13a6/open/run_index.py#L134-L137
then
https://github.com/uwnlp/denspi/blob/f540b6a547f012823fc6c2bb10077df6bccc13a6/open/run_index.py#L148
https://github.com/uwnlp/denspi/blob/f540b6a547f012823fc6c2bb10077df6bccc13a6/open/run_index.py#L164
Can the following two lines encode the same idea?
index = faiss.index_factory(d, "IVF1048576_HNSW32,SQ8")
index.train(data)
thanks!
It's in the installer here, but it's commented out in the Dockerfile.
I haven't seen any other reference to it.