liberai / NSpM
🤖 Neural SPARQL Machines for Knowledge Graph Question Answering.
Home Page: http://aksw.org/Projects/NeuralSPARQLMachines
License: MIT License
While executing the demo training step from the README, I see these two statements every epoch, so I suppose there may be some unneeded code in train.py:
Table trying to initialize from file ../data/monument_300_model/vocab.en is already initialized.
Table trying to initialize from file ../data/monument_300_model/vocab.en is already initialized.
Trying to solve it.
FailedPreconditionError (see above for traceback): HashTable has different value for same key. Key dbr_Terreiro_da_Luta has 127 and trying to add value 285
How can I solve this problem?
We should add a requirements.txt to specify the packages and the versions required for the project, e.g.:
enum34
numpy
tensorflow==1.2.0
Environment
tensorflow==1.14.0
Log
$ python build_vocab.py data/monument_300/data_300.en > data/monument_300/vocab.en
WARNING:tensorflow:From build_vocab.py:44: VocabularyProcessor.__init__ (from tensorflow.contrib.learn.python.learn.preprocessing.text) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.
WARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow/contrib/learn/python/learn/preprocessing/text.py:154: CategoricalVocabulary.__init__ (from tensorflow.contrib.learn.python.learn.preprocessing.categorical_vocabulary) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.
WARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow/contrib/learn/python/learn/preprocessing/text.py:170: tokenizer (from tensorflow.contrib.learn.python.learn.preprocessing.text) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.
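Since VocabularyProcessor is deprecated, the vocabulary extraction can also be done without tf.contrib at all. A minimal sketch of the idea (the tokenization and ordering are assumptions based on the warnings above, not the project's exact code):

```python
# Minimal vocabulary builder: one unique token per output line, in order
# of first appearance. This only sketches what build_vocab.py produces;
# the real script may tokenize or order differently.
def build_vocab(lines):
    vocab = {}
    for line in lines:
        for token in line.strip().split(" "):
            if token and token not in vocab:
                vocab[token] = len(vocab)
    return vocab

sentences = ["where is the monument located", "where was the monument built"]
for word in build_vocab(sentences):
    print(word)
```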
Since support for Python 2 is being dropped, it would be great if the README.md could indicate that the code was written in Python 2, so that future developers can set up the appropriate development environment.
Thanks for your excellent work on this interesting problem.
When I'm training with your given monument_300 data, I see output like this:
step 4100 lr 1 step-time 2.35s wps 2.36K ppl 64.10 gN 3.08 bleu 2.74, Sat Dec 8 14:28:50 2018
Can you tell me what ppl and gN mean? And why is the BLEU score so small?
Thank you very much.
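For context: in the TensorFlow NMT training log, ppl is the perplexity (the exponential of the average cross-entropy loss per predicted token) and gN is the global gradient norm used for gradient clipping; a low BLEU score this early in training is common. A small sketch of the perplexity calculation (the numbers are made up for illustration):

```python
import math

def perplexity(total_loss, total_predict_count):
    # exp of the average per-token cross-entropy loss
    return math.exp(total_loss / total_predict_count)

# e.g. a summed loss of 8.32 over 2 predicted tokens
print(perplexity(8.32, 2))
```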
Wrong output of the TensorFlow version check in NSpM/nmt/nmt/utils/misc_utils.py:
EnvironmentError: Tensorflow version must >= 1.2.1
Changes to be made: from
if tf.__version__ < "1.2.1":
    raise EnvironmentError("Tensorflow version must >= 1.2.1")
to
if tf.__version__ < "1.02.1":
    raise EnvironmentError("Tensorflow version must >= 1.02.1")
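The root cause is that tf.__version__ is compared as a string, so "1.12.0" sorts before "1.2.1". A sketch of a numeric comparison that avoids the problem entirely (the helper names here are my own, not the project's):

```python
def version_tuple(v):
    # "1.12.0" -> (1, 12, 0); non-digit characters in a component are dropped
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if digits:
            parts.append(int(digits))
    return tuple(parts)

def check_tensorflow_version(installed, minimum="1.2.1"):
    # Tuples compare element-wise, so 1.12.0 correctly exceeds 1.2.1.
    if version_tuple(installed) < version_tuple(minimum):
        raise EnvironmentError("Tensorflow version must be >= %s" % minimum)

check_tensorflow_version("1.12.0")  # passes, unlike the string comparison
```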
https://github.com/AKSW/NSpM/blob/f33f60dd2b1f423cde079a249328cae2115fdb5f/generator.py#L136
What does this mean? It seems to be an invalid comparison between a list and an integer; it will always return true!
Hey, I don't know if I am just misunderstanding the instructions, but the project kept giving me ValueError: "Can't load save_path when it is None." when I tried to follow the README instructions precisely.
I believe it would work correctly with your readme instructions if you would change this part of ask.sh:
python -m nmt.nmt --vocab_prefix=../$1/vocab --model_dir=../$1_model --inference_input_file=./to_ask.txt --inference_output_file=./output.txt --out_dir=../$1_model --src=en --tgt=sparql | tail -n4
to this:
python -m nmt.nmt --vocab_prefix=../$1/vocab --model_dir=../$1 --inference_input_file=./to_ask.txt --inference_output_file=./output.txt --out_dir=../$1 --src=en --tgt=sparql | tail -n4
(that $1_model suffix makes the word repeat oddly in the folder names)
Please, feel free to correct me if I am wrong, and thank you for your awesome paper.
When trying to install the dependencies with
pip install -r requirements.txt
several errors concerning conflicting dependencies are raised.
Hello, would it perhaps be possible to provide pre-trained models? I wanted to train the network with the dataset https://figshare.com/articles/Question-NSpM_SPARQL_dataset_EN_/6118505, but that simply takes too long on my machine.
I tried running the pipelines, but because of some Python-version-related issues I was getting errors in pipeline 1.
The solution that worked was using
from urllib.request import urlopen
and then urlopen(<url>)
instead of import urllib
and then urllib.request.urlopen(<url>).
Make sure to use Python 3.7
to run the pipelines, as @panchbhai1969's code uses it.
It works because
the urllib and urllib2 modules from Python 2.x have been combined into the urllib module in Python 3,
as mentioned here.
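For anyone hitting the same error, here is a self-contained sketch of the Python 3 usage (a data: URL is used only so the example runs offline; the pipeline would pass a real http(s) URL):

```python
from urllib.request import urlopen  # Python 3: urllib + urllib2 merged here

with urlopen("data:,hello") as response:
    body = response.read()
print(body)  # b'hello'
```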
Also, while setting up the project I realised it would be better to have a requirements.txt
file.
I would like to do that too, as my initial contribution.
While running
python build_vocab.py data/monument_300/data_300.en > data/monument_300/vocab.en
the Python interpreter gives the following warning:
WARNING:tensorflow:From build_vocab.py:43: __init__ (from tensorflow.contrib.learn.python.learn.preprocessing.text) is deprecated and will be removed in a future version.
We could use tensorflow/transform or tf.data instead, to keep up with the recent updates (as the warning suggests).
While splitting the data file into train, dev and test sets by running the following commands given in the README.md
cd data/monument_300/
python ../../split_in_train_dev_test.py --lines $NUMLINES --dataset data.sparql
I run into the following error
Traceback (most recent call last):
File "../../split_in_train_dev_test.py", line 42, in
with open(sparql_file) as original_sparql, open(en_file) as original_en:
IOError: [Errno 2] No such file or directory: 'data.sparql'
which can be solved by renaming the files in monument_300 (data_300.sparql and data_300.en to data.sparql and data.en)
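The rename described above can be scripted; this sketch uses a scratch directory with empty stand-in files, since in the repo you would run the two mv commands inside data/monument_300/ instead:

```shell
# Demonstration in a scratch directory; adapt the path for the real repo.
mkdir -p /tmp/monument_300_demo
cd /tmp/monument_300_demo
touch data_300.sparql data_300.en
mv data_300.sparql data.sparql
mv data_300.en data.en
ls
```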
It may seem trivial, but README.md files are the first point of information that a person refers to when trying to understand a project. The repository's README.md is good, but I have found some errors that need attention.
NUMLINES= $(echo awk '{ print $1}' | cat data/monument_300/data_300.sparql | wc -l)
There shouldn't be any space after =; the line should be:
NUMLINES=$(echo awk '{ print $1}' | cat data/monument_300/data_300.sparql | wc -l)
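Beyond removing the space, the pipeline itself can be simplified: the `echo awk '{ print $1}'` part just feeds a literal string to cat, which ignores its stdin anyway. A sketch, demonstrated on a generated sample file so it runs anywhere:

```shell
# wc -l alone counts the lines; no echo/awk/cat pipeline is needed.
printf 'q1\nq2\nq3\n' > /tmp/sample_data.sparql
NUMLINES=$(wc -l < /tmp/sample_data.sparql)
echo "$NUMLINES"
```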
Other README.md related issues and suggestions brought up by users are #12 #14 #17
On running the ./ask.sh script I'm getting the following error:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/NSpM/nmt/nmt/nmt.py", line 707, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/content/NSpM/nmt/nmt/nmt.py", line 700, in main
run_main(FLAGS, default_hparams, train_fn, inference_fn)
File "/content/NSpM/nmt/nmt/nmt.py", line 658, in run_main
save_hparams=(jobid == 0))
File "/content/NSpM/nmt/nmt/nmt.py", line 607, in create_or_load_hparams
hparams = extend_hparams(hparams)
File "/content/NSpM/nmt/nmt/nmt.py", line 493, in extend_hparams
unk=vocab_utils.UNK)
File "/content/NSpM/nmt/nmt/utils/vocab_utils.py", line 137, in check_vocab
raise ValueError("vocab_file '%s' does not exist." % vocab_file)
ValueError: vocab_file '../data/monument_300/vocab.en' does not exist.
# Job id 0
# Devices visible to TensorFlow: [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 2340982298104704118), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 7897109989793672363)]
# Creating output directory ../data/monument_300_model ...
ANSWER IN SPARQL SEQUENCE:
cat: output.txt: No such file or directory
Can someone please help me with this?
File "nmt/model_helper.py", line 444, in compute_perplexity
perplexity = utils.safe_exp(total_loss / total_predict_count)
ZeroDivisionError: integer division or modulo by zero
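The traceback shows that total_predict_count is 0, which usually means the dev/test files are empty or were not found. A defensive sketch of the computation (the real code lives in nmt/model_helper.py; the guard is my addition, not the project's):

```python
import math

def compute_perplexity_safe(total_loss, total_predict_count):
    # An empty evaluation set would otherwise divide by zero here.
    if total_predict_count == 0:
        raise ValueError("total_predict_count is 0; check that the "
                         "dev/test files exist and are non-empty")
    return math.exp(total_loss / total_predict_count)
```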
The LC-QuAD dataset has 5000 pairs, but when I generated it from the LC-QuAD CSV file in the data path, the result exceeded hundreds of thousands of LC-QuAD sentence pairs. Please, can you help me generate an accurate LC-QuAD dataset?
One of the last TensorFlow updates broke the ./ask.sh script. It probably has to be rewritten from scratch.
Hi, a very small thing: when running the example presented in the README.md, in the "Interpreter Module" section, the argument "--output" is not currently supported. Here is the fixed line of code.
Current:
python nspm/interpreter.py --input data/art_30 --output data/art_30 --query "yuncken freeman has architected in how many cities?"
New:
python nspm/interpreter.py --input data/art_30 --query "yuncken freeman has architected in how many cities?"
The version check in nmt/utils/misc_utils.py has a snippet of code for checking the version of TF installed on the local machine:
def check_tensorflow_version():
    if tf.__version__ < "1.2.1":
        raise EnvironmentError("Tensorflow version must >= 1.2.1")
I have tested the code with both the current version of TF and tf-nightly, as the authors suggest in their README.md. The current version is 1.12, and it is obvious the check fails because it compares the versions as strings rather than numerically.
I have created an issue on the main project of nmt as well, but since this would affect the working of this project as well, I am also opening an issue here.
unzip art_30.zip
Archive: art_30.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of art_30.zip or
art_30.zip.zip, and cannot find art_30.zip.ZIP, period.
Replace the ./nmt/ submodule with an internal library based on the NMT with attention tutorial, compatible with TensorFlow 2.2.0rc4.
Hello everyone,
First of all, thank you for sharing the code of this project.
I was able to train the model and make some predictions, but now I want to find the shortcomings of the model, so I want to analyze on which questions/queries the model performs well.
I found the "analyse.sh" script and the "filter_dataset.py".
Now I would like to ask what the purpose of these files is and how to use them.
Thank you for your time
Kind regards
Nicolas
Space: the final frontier. These are the voyages of the starship Enterprise. Its five-year mission: to explore strange new worlds, to seek out new life and new civilizations, to boldly go where no man has gone before.
Hello!
Firstly thank you very much for your repository and research. This is a very interesting field. I am currently using your monument dataset as the training data in my master thesis.
I noticed you uploaded a new dataset called movies_300.zip several days ago. I intended to try it in my experiments as well, but I found that it has many duplicate lines in the training file (e.g. "how long is the longest movie" appears 227 times in 'train.en').
Could you explain what is the reason for that? Is it appropriate to use this dataset for training or this dataset is just made for other tasks?
Thank you and best regards
Xiaoyu
FailedPreconditionError (see above for traceback): HashTable has different value for same key. Key en has 3 and trying to add value 715
There are some issues in the build_vocab file inside the monument_300 zip file in the data folder; some fixes are needed, commented below.
import sys

import numpy as np
from tensorflow.contrib import learn

# The reload(sys) / sys.setdefaultencoding("utf-8") hack was only needed
# on Python 2, so those lines can be removed.

x_text = list()

# The input file is passed as the first command-line argument
# (sys.argv[0] is the script name itself).
with open(sys.argv[1]) as f:
    for line in f:
        # unicode() is removed: all strings are unicode in Python 3.
        x_text.append(line[:-1])

# x_text = ['This is a cat', 'This must be boy', 'This is a a dog']
max_document_length = max([len(x.split(" ")) for x in x_text])

## Create the VocabularyProcessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

## Transform the documents using the vocabulary.
x = np.array(list(vocab_processor.fit_transform(x_text)))

## Extract the word:id mapping from the object.
vocab_dict = vocab_processor.vocabulary_._mapping

## Sort the vocabulary dictionary on the basis of values (ids).
# sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key=lambda x: x[1])

## Treat the ids as indices into a list and create a list of words in
## ascending order of id: the word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])

# print(vocabulary)
# print(x)
for v in vocabulary:
    print(v)
Hi,
I recently utilized the technique discussed in this project for transforming a natural language sentence into a SPARQL query. Based on this, I created an end-to-end question answering system as part of my final year project. The system works well for known resource names; however, for questions that contain out-of-vocabulary words (resource names/words not part of the training data), the system does not predict an accurate query.
In the Neural Machine Translation for Query Construction paper, it says that External pre-trained word embeddings help deal with vocabulary mismatch. I am not sure how this would be implemented, could you provide any insight? I am already finished with the project but I would still like to learn about this.
The project I created is available on GitHub and can be found here if you would like to see. There's also a deployed version of the system and can be found here.
Thanks for the help in advance.
Hello,
I really like your work on using seq2seq to create SPARQL queries. Just one question: was there a specific reason not to include attention while training? As far as I understood the TensorFlow NMT guide, you would have to add something like --attention=scaled_luong
to the options in your train.sh. Did you evaluate whether it works better with or without attention?
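For reference, a hypothetical train.sh variant with attention enabled; the flag names come from the TensorFlow NMT tutorial, and whether the vendored nmt submodule accepts them unchanged is an assumption:

```shell
# Hypothetical invocation; paths mirror the monument_300 layout used
# elsewhere in this project and may need adjusting.
python -m nmt.nmt \
  --attention=scaled_luong \
  --src=en --tgt=sparql \
  --vocab_prefix=../data/monument_300/vocab \
  --train_prefix=../data/monument_300/train \
  --dev_prefix=../data/monument_300/dev \
  --test_prefix=../data/monument_300/test \
  --out_dir=../data/monument_300_model
```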
Greetings!