kelvinguu / neural-editor
Repository for "Generating Sentences by Editing Prototypes"
Hey @kelvinguu, big thanks for uploading this code. After reviewing the raw code, I'm confused as to why you set the kill_edit flag to True in the Google corpus config but False in the Yelp config.
With kill_edit, we set all edit vectors to zero, which essentially prevents any edit vector from being used during training.
Did you set this to True for the Google corpus because its phrases have many more edits than those in the Yelp corpus?
Also, do you know when the data dir will be uploaded so we can test Neural Editor? Thanks!
I was following the instructions in the new README, and tried to do wget for the dataset using the url
https://worksheets.codalab.org/rest/bundles/0x99d0557925b34dae851372841f206b8a/contents/blob/
It failed. When I tried to go to the url in Chrome, I was told that my account "does not have sufficient permissions on bundle" to read the file. Does anyone know if this could be fixed, or if there is a workaround?
Hi all,
I read through the generation process and have a question about the attention part.
In attention_decoder.py, line 53, we can see that during decoding we attend over the following three parts:
insert_noisy_exact
delete_noisy_exact
But in the _train function of the EditTrainingRun class there is an EditNoiser object, which first adds noise to the input examples. In the config on your CodaLab worksheet I see three lines that control this EditNoiser:
edit_dropout = true
ident_pr = 0.1
attend_pr = 0.0
In the EditNoiser code:
def _noise(self, ex):
    """Return a noisy EditExample.

    Note: this strategy is only appropriate for diff-style EditExamples.

    Args:
        ex (EditExample)

    Returns:
        EditExample: a new example. Does not modify the original example.
    """
    ident_map = np.random.binomial(1, self.ident_pr)
    if ident_map:
        return EditExample(ex.source_words, [], [], [], [], ex.source_words)
    else:
        insert_exact, insert_approx = self.dropout_split(ex.insert_exact_words)
        delete_exact, delete_approx = self.dropout_split(ex.delete_exact_words)
        return EditExample(ex.source_words, insert_approx, insert_exact, delete_approx, delete_exact, ex.target_words)
If attend_pr is zero, insert_exact will be empty and insert_approx will be the original ex.insert_exact_words, so in the new EditExample insert_exact and delete_exact are always empty. The attention mentioned above over insert_noisy_exact and delete_noisy_exact would then have nothing to attend to.
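To make the concern concrete, here is a minimal sketch of what a split like this might do. The function below is a hypothetical stand-in, not the repository's dropout_split: it assumes each word is independently kept as "exact" with probability attend_pr and moved to "approx" otherwise, which is only a guess at the real behaviour.

```python
import random


def dropout_split(words, attend_pr, rng=random):
    """Hypothetical stand-in for EditNoiser.dropout_split.

    Assumption: each word is independently kept as 'exact' with
    probability attend_pr and moved to 'approx' otherwise.
    """
    exact, approx = [], []
    for word in words:
        # rng.random() is in [0, 1), so attend_pr == 0.0 never selects 'exact'.
        if rng.random() < attend_pr:
            exact.append(word)
        else:
            approx.append(word)
    return exact, approx


insert_words = ["really", "tasty"]
exact, approx = dropout_split(insert_words, attend_pr=0.0)
print(exact)   # []
print(approx)  # ['really', 'tasty']
```

Under that assumption, attend_pr = 0.0 always leaves the exact lists empty, which matches the behaviour described in the question.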
Hope for reply
Thanks
Gordon
Provide instructions for training and running the model:
Hi,
I was trying to build my own training dataset based on sentence Jaccard distance.
It looks like comparing all sentence pairs with the naive algorithm will take too much time :)
Could you provide the text preprocessing code?
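For reference, the naive approach mentioned above can be sketched as follows. This is not the authors' preprocessing code; it assumes whitespace tokenization and lowercasing, and the helper names and the max_distance threshold are made up for illustration. The all-pairs loop is quadratic in the number of sentences, which is why it becomes slow on large corpora.

```python
from itertools import combinations


def jaccard_distance(sentence_a, sentence_b):
    """Jaccard distance between the word sets of two sentences."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # two empty sentences are treated as identical
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def close_pairs(sentences, max_distance):
    """Naive O(n^2) search for index pairs within max_distance."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(sentences), 2):
        if jaccard_distance(a, b) <= max_distance:
            pairs.append((i, j))
    return pairs


corpus = [
    "the food was great",
    "the food was really great",
    "terrible service",
]
print(close_pairs(corpus, max_distance=0.25))  # [(0, 1)]
```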
Hey @kelvinguu and @thashim thanks for your help so far. I've been doing some more research on deciding how to measure the "closest" sentence to create your training dataset (edit pairs).
First off, I think BFS is absolutely necessary, especially for the bigger datasets. But I feel the way similarity between sentences is measured could be improved:
1. Doc2Vec Cosine Distance. In this approach you train a doc2vec model on the sentences, which yields a vector for each sentence. You then find the sentences closest to your prototype by cosine distance. I've tried this method already and, unfortunately, it doesn't seem to find similarly worded sentences.
2. Edit Distance. In this approach, you use edit distance instead of Jaccard, which offers the benefit of GPU usage (see tensorflow's edit_distance). In this method, deletions have a higher weight.
3. Word2Vec Averaging + Cosine Distance. In this approach you take the word vectors of each sentence and average them to get a single vector for the sentence. You then compare cosine distances. I feel that this approach would be the best because it accounts for similarity between words. For instance, "feeling" and "feelings" are treated similarly. With Jaccard and edit distance, this does not happen.
4. Jaccard Distance Without Stop Words. In this approach, you remove all stop words from a sentence and then measure the Jaccard distance. This would hopefully give you a better semantic match. It could be used in addition to the original Jaccard distance.
Obviously, you guys worked really hard, and these were all approaches you probably thought of already. If you could comment on why you chose vanilla Jaccard over these, it would be much appreciated!
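As an illustration of the word-vector averaging idea, here is a minimal sketch. The three-dimensional vectors are invented toy values, not real word2vec or GloVe embeddings, and the function names are hypothetical; a real experiment would load pretrained vectors instead.

```python
import math

# Toy vectors invented for illustration only (assumption, not real embeddings).
TOY_VECTORS = {
    "feeling": [0.9, 0.1, 0.0],
    "feelings": [0.8, 0.2, 0.0],
    "good": [0.1, 0.9, 0.1],
    "pizza": [0.0, 0.1, 0.9],
}


def sentence_vector(sentence, vectors):
    """Average the vectors of all known words; None if no word is known."""
    rows = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not rows:
        return None
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


base = sentence_vector("good feeling", TOY_VECTORS)
near = sentence_vector("good feelings", TOY_VECTORS)
far = sentence_vector("good pizza", TOY_VECTORS)
print(cosine_similarity(base, near) > cosine_similarity(base, far))  # True
```

With these toy values, "good feeling" is closer to "good feelings" than to "good pizza", whereas word-set Jaccard distance rates both pairs identically because each pair shares exactly one of three distinct words.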
I've been trying to run this code on a multi-GPU machine by providing multiple GPU indices in the CUDA_VISIBLE_DEVICES variable, but my machine always ends up running it on only one. Has anyone had success with this?
I followed the README and everything is set up. Running
python textmorph/edit_model/main.py configs/edit_model/edit_onebil.txt
gives the error
No such file or directory: '/data/word_vectors/glove.6B.300d_onebil.txt'
After unzipping onebillion_split.zip, there is no such file.
First of all, thank you for your work!
The paper refers to this worksheet, but I do not know how to run it.
Thank you.
I spotted something strange in edit_model/edit_encoder.py, in the seq_batch_noise function at line 62:
new_values[:, 0, :] = phint*m_expand+ prand*(1-m_expand)
This returns a noisy version of only one vector (the first one), and all other vectors are set to 0, instead of noising every vector as specified in the docstring. The result is then propagated to the input of the attention decoder, so the attention layers over the insert and delete embeddings only use the information from the first insert or delete token.
Is there a reason for this, or is it just a mistake?
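The effect described above can be reproduced in isolation. The sketch below uses plain nested lists in place of tensors and a deterministic stand-in for the noise (adding 1.0), so none of these names come from the repository; it assumes, as the report says, that the output buffer starts out as zeros.

```python
def add_noise(vec):
    # Deterministic stand-in for the real random noise, so output is reproducible.
    return [x + 1.0 for x in vec]


def noise_first_only(values):
    """Mirrors writing only index 0 of the sequence axis into a zero buffer."""
    out = [[[0.0] * len(vec) for vec in seq] for seq in values]
    for b, seq in enumerate(values):
        out[b][0] = add_noise(seq[0])
    return out


def noise_all(values):
    """Applies the noise to every position, as the docstring is said to promise."""
    return [[add_noise(vec) for vec in seq] for seq in values]


batch = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]  # shape (batch=1, seq=3, dim=2)
print(noise_first_only(batch))  # [[[2.0, 3.0], [0.0, 0.0], [0.0, 0.0]]]
print(noise_all(batch))         # [[[2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]]
```

Only the first position survives in the first variant, which is the behaviour the question is asking about.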
I had two questions.
Firstly, I have trained my model on the Yelp dataset and now want to test it. How do I start testing? I looked at the code, but it isn't obvious how to test the model.
Thanks in advance
Hi,
My training is always killed without any error message, as shown below.
uncomitted changes being stored as patches
New TrainingRun created at: /data/edit_runs/7
Optimized batches: reduced cost from 45709568 (naive) to 20758016 (0.545871533942% reduction).
Optimal (batch_size=1) would be 20741962.
Passed batching test
Streaming training examples: 6%|5 | 399/7032 [48:47<12:31:31, 6.80s/it]Killed
Hello,
Thank you for releasing such amazing work. I tried to run the code with the following steps:
mkdir -p $DATA_DIR
and uncompressed the GloVe vectors glove.6B.300d_onebil.txt and glove.6B.300d_yelp.txt into $DATA_DIR/word_vectors
from https://worksheets.codalab.org/bundles/0x89bc0497bbb14ee489d33e032fa43a2e/$DATA_DIR/onebillion_split
However, I received an error while executing python textmorph/edit_model/main.py configs/edit_model/edit_onebil.txt --gpu 0 within Docker:
individually:
Variable containing:
 205.0143
[torch.cuda.FloatTensor of size 1 (GPU 0)]
batched:
Variable containing:
 205.0143
[torch.cuda.FloatTensor of size 1 (GPU 0)]
Traceback (most recent call last):
  File "textmorph/edit_model/main.py", line 40, in <module>
    exp.train()
  File "/code/textmorph/edit_model/training_run.py", line 265, in train
    self._train(self.config, self._train_state, self._examples, self.workspace, self.metadata, self.tb_logger)
  File "/code/textmorph/edit_model/training_run.py", line 399, in _train
    editor.test_batch(noiser(train_batches[0]))
  File "/code/textmorph/edit_model/editor.py", line 128, in test_batch
    raise Exception('batching error - examples do not produce identical results under batching')
Exception: batching error - examples do not produce identical results under batching
After some checks, I found that the individually value is 205.01431 while the batched value is 205.01433. What should I do about this?
I appreciate any help! Thank you.
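The two values reported above differ only around the eighth significant digit, which is on the order of float32 rounding error. A minimal sketch of a tolerance-based comparison, using only the standard library, is shown below. The function name and the rel_tol value are assumptions for illustration; whether the repository's batching test should be relaxed this way is a judgment call for the maintainers.

```python
import math


def results_match(individually, batched, rel_tol=1e-6):
    """Compare two loss values up to a relative tolerance (assumed value)."""
    return math.isclose(individually, batched, rel_tol=rel_tol)


individually = 205.01431
batched = 205.01433

print(individually == batched)               # False
print(results_match(individually, batched))  # True
```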
My oversimplified interpretation of how the model seems to work is this: the encoder hidden states have some random normal noise added to them. These hidden states are then combined with the input word vector for the decoder, and the model is trained like a regular language model.
The only thing I wasn't able to find was the part of the code that does the context combining. All the context combiner classes and their parent classes appear to be empty.
Is there a way to load the model and reuse it? I couldn't find where to feed in my own test sentences in the code.
If I only have source sentences without target pairs, can I generate new sentences using "INSERT" words that I specify myself?
Hi,
When is the code going to be pushed?
Regards
What is the third-party library gtd for? Is there any documentation for it?
I followed the instructions in https://github.com/kelvinguu/neural-editor/tree/readme. But I get the following error when executing:
python textmorph/edit_model/main.py configs/edit_model/edit_onebil.txt
The glove.6B.zip does not have this file. How do I obtain it?
No checkpoint to reload. Initializing fresh.
Traceback (most recent call last):
  File "textmorph/edit_model/main.py", line 32, in <module>
    exp = experiments.new(config) # new experiment from config
  File "/code/gtd/ml/training_run.py", line 144, in new
    run = self._run_factory(config, save_dir)
  File "/code/textmorph/edit_model/training_run.py", line 255, in __init__
    self._train_state = self._initialize_train_state(config)
  File "/code/textmorph/edit_model/training_run.py", line 322, in _initialize_train_state
    editor = cls._build_editor(config.editor)
  File "/code/textmorph/edit_model/training_run.py", line 297, in _build_editor
    word_embeddings = SimpleEmbeddings.from_file(file_path, config.word_dim, vocab_size=config.vocab_size)
  File "/code/gtd/ml/vocab.py", line 184, in from_file
    with codecs.open(file_path, 'r', encoding='utf-8') as f:
  File "/opt/conda/envs/pytorch-py27/lib/python2.7/codecs.py", line 896, in open
    file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 2] No such file or directory: '/data/word_vectors/glove.6B.300d_onebil.txt'
Hi @kelvinguu,
First of all, really interesting paper! What do you think about creating a public project on FloydHub to make the experiments reproducible and to simplify the way other researchers and AI enthusiasts can interact and play with it?
Hi Kelvin
Thanks for the great work! I tried the default editor on a personal dataset and it works well. However, during training I found that memory usage keeps increasing, which leads to the experiment being killed after some training steps. Is it because I am running the model the wrong way? (I just prepared the dataset and used main.py in the edit_model folder.) I would like to know how to fix this problem.
Thanks