kelvinguu / neural-editor

Repository for "Generating Sentences by Editing Prototypes"

Shell 0.43% Python 97.36% Dockerfile 2.22%

neural-editor's Issues

Why Is kill_edit=True for Google Corpus Config, but not For Yelp?

Hey @kelvinguu, big thanks for uploading this code. After reviewing the code, I'm confused as to why the kill_edit flag is set to True in the Google Corpus config but False in the Yelp config.

When kill_edit is set, the entire set of edit vectors is zeroed out, which means no edit vector can be used during training.

Did you set this to True for the Google corpus because its phrases have many more edits than those in the Yelp corpus?

Also, do you know when the data dir will be uploaded so we can test Neural Editor? Thanks!
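
For readers unfamiliar with the flag, the behaviour described in this issue (zeroing every edit vector) can be sketched as follows. This is only an illustration of the idea, not the repository's code: the helper `apply_kill_edit` and the list-of-lists representation of a batch are assumptions made for the example.

```python
def apply_kill_edit(edit_vectors, kill_edit):
    """Return the edit vectors, replaced by zeros when kill_edit is True.

    Hypothetical helper: edit_vectors is a batch given as a list of
    equal-length lists of floats.
    """
    if not kill_edit:
        # Copy so the caller's batch is never modified in place.
        return [list(vector) for vector in edit_vectors]
    return [[0.0] * len(vector) for vector in edit_vectors]


batch = [[0.3, -1.2, 0.7], [1.1, 0.0, -0.4]]
killed = apply_kill_edit(batch, kill_edit=True)
kept = apply_kill_edit(batch, kill_edit=False)
```

With the flag on, the decoder would receive the same all-zero edit vector for every example, so any difference between prototype and revision would have to be modelled without edit information.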

Can't download dataset from Codalab

I was following the instructions in the new README and tried to wget the dataset from this URL:

https://worksheets.codalab.org/rest/bundles/0x99d0557925b34dae851372841f206b8a/contents/blob/

It failed. When I opened the URL in Chrome, I was told that my account "does not have sufficient permissions on bundle" to read the file. Does anyone know if this can be fixed, or if there is a workaround?

Some questions about attention over insert and delete words

Hi all,
I read through the generation process and have some questions about the attention part.
Around line 53 of attention_decoder.py, the decoder attends over the following three things:

the encoder hidden output at every time step
attention over insert_noisy_exact
attention over delete_noisy_exact

However, the _train function in the EditTrainingRun class creates an EditNoiser object, which first adds noise to each input example. In the config on your CodaLab worksheet, three lines control this EditNoiser:

  edit_dropout = true
  ident_pr = 0.1
  attend_pr = 0.0

Here is the EditNoiser code:

    def _noise(self, ex):
        """Return a noisy EditExample.

        Note: this strategy is only appropriate for diff-style EditExamples.

        Args:
            ex (EditExample)

        Returns:
            EditExample: a new example. Does not modify the original example.
        """
        ident_map = np.random.binomial(1, self.ident_pr)
        if ident_map:
            return EditExample(ex.source_words, [], [], [], [], ex.source_words)
        else:
            insert_exact, insert_approx = self.dropout_split(ex.insert_exact_words)
            delete_exact, delete_approx = self.dropout_split(ex.delete_exact_words)
            return EditExample(ex.source_words, insert_approx, insert_exact, delete_approx, delete_exact, ex.target_words)

If attend_pr is zero, then insert_exact will be empty and insert_approx will be the original ex.insert_exact_words, so in the new EditExample insert_exact and delete_exact are always empty. The attention mentioned above,

attention over insert_noisy_exact
attention over delete_noisy_exact

is then never used. I want to confirm this. If it is unused, why not just remove it, since it makes the decoder's input vector at each step much larger?

Hoping for a reply.
Thanks
Gordon
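
The argument in this issue depends on what dropout_split does, and its body is not quoted here. The sketch below assumes the behaviour the question implies: each word stays in the "exact" list with probability attend_pr and moves to the "approx" list otherwise. The standalone function signature and the `rng` argument are assumptions for illustration, not the repository's method.

```python
import random


def dropout_split(words, attend_pr, rng):
    """Split words into (exact, approx) lists.

    Assumed behaviour: each word is kept in `exact` with probability
    attend_pr and is moved to `approx` otherwise.
    """
    exact, approx = [], []
    for word in words:
        # rng.random() is in [0.0, 1.0), so attend_pr = 0.0 never keeps a word.
        if rng.random() < attend_pr:
            exact.append(word)
        else:
            approx.append(word)
    return exact, approx


rng = random.Random(0)
insert_words = ["really", "great", "food"]
exact_zero, approx_zero = dropout_split(insert_words, 0.0, rng)
exact_one, approx_one = dropout_split(insert_words, 1.0, rng)
```

Under this assumption, attend_pr = 0.0 always yields an empty exact list, which is the situation the question describes: the attention over insert_noisy_exact and delete_noisy_exact would have nothing to attend to.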

Provide instructions for training and running the model

Provide instructions for training and running the model:

how to specify model configs
setting environment variables
setting up the data directory
how to run the Docker image

Text preprocessing code.

Hi,
I was trying to build my own training dataset based on the Jaccard distance between sentences.
It looks like comparing all sentence pairs with the naive algorithm will take too much time :)
Could you provide the text preprocessing code?
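
Until the authors' preprocessing code is available, here is a minimal sketch of the idea. It computes token-level Jaccard distance and uses an inverted index so that only sentences sharing at least one token are compared (pairs with no shared token have distance 1.0 anyway). The function names, whitespace tokenization, and lowercasing are assumptions, and this is not the preprocessing used in the paper.

```python
from collections import defaultdict


def tokenize(sentence):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return set(sentence.lower().split())


def jaccard_distance(sentence_a, sentence_b):
    """1 - |A & B| / |A | B| over the token sets of the two sentences."""
    tokens_a, tokens_b = tokenize(sentence_a), tokenize(sentence_b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def candidate_pairs(sentences):
    """Return index pairs (i, j), i < j, of sentences sharing a token."""
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for token in tokenize(sentence):
            index[token].add(i)
    pairs = set()
    for ids in index.values():
        ordered = sorted(ids)
        for x in range(len(ordered)):
            for y in range(x + 1, len(ordered)):
                pairs.add((ordered[x], ordered[y]))
    return pairs


sentences = ["the food was great", "the food was good", "parking is hard"]
pairs = candidate_pairs(sentences)
distance = jaccard_distance("a b c", "b c d")
```

Very frequent tokens still produce large candidate sets, so on a real corpus you would probably also need to skip or cap the most common tokens.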

Improving Edit Pairs with Different Distance Metrics

Hey @kelvinguu and @thashim, thanks for your help so far. I've been doing some more research on how to measure the "closest" sentence when creating the training dataset (edit pairs).

First off, I think BFS is absolutely necessary, especially for the bigger datasets. But I feel the way distance between sentences is measured can be improved:

Doc2Vec Cosine Distance. In this approach you train a doc2vec model on the sentences, which yields a final vector for each sentence. You then find the closest sentences to your prototype based on cosine distance. I've tried this method already and, unfortunately, it doesn't seem to find similarly worded sentences.
Edit Distance. In this approach, you use edit distance instead of Jaccard, which offers the benefit of GPU usage (see tensorflow's edit_distance). In this method, deletions have a higher weight.
Word2Vec Averaging + Cosine Distance. In this approach you take the word vectors of each sentence and average them to get a final vector for the sentence. You then compare cosine distances. I feel that this approach would be the best because it accounts for similarity between words. For instance, "feeling" and "feelings" are treated similarly. With Jaccard and Edit Distance, this does not happen.
Jaccard Distance Without Stop Words. In this approach, you remove all stop words from a sentence and then measure the Jaccard distance. This would hopefully give you a better semantic match. It could be used in addition to the original Jaccard distance.

Obviously, you guys worked really hard, and these were all approaches you probably thought of already. If you could comment on why you chose vanilla Jaccard over these, it would be much appreciated!
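
As a concrete reference for the "Word2Vec Averaging + Cosine Distance" option, here is a minimal sketch. The two-dimensional vectors are made-up toy values, not real word2vec embeddings, and the function names are assumptions for illustration.

```python
import math


def average_vector(tokens, word_vectors):
    """Average the vectors of the tokens found in word_vectors."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors (hypothetical values chosen so related words point the same way).
toy_vectors = {
    "feeling": [0.9, 0.1],
    "feelings": [0.85, 0.15],
    "good": [0.2, 0.8],
    "parking": [-0.7, 0.3],
}

a = average_vector(["good", "feeling"], toy_vectors)
b = average_vector(["good", "feelings"], toy_vectors)
c = average_vector(["parking"], toy_vectors)
d_ab = cosine_distance(a, b)
d_ac = cosine_distance(a, c)
```

Here "good feeling" and "good feelings" end up close even though Jaccard would count "feeling" and "feelings" as different tokens. One known trade-off is that averaging discards word order, so sentences with the same words in a different order get distance 0.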

How can we run this on multiple GPUs?

I've been trying to run this code on a multi-GPU machine by providing multiple GPU indices in the CUDA_VISIBLE_DEVICES variable, but it always ends up running on only one GPU. Has anyone had success with this?

Need data preprocessing?

I followed the README and everything is set up. Running
python textmorph/edit_model/main.py configs/edit_model/edit_onebil.txt
gives the error
No such file or directory: '/data/word_vectors/glove.6B.300d_onebil.txt'

After unzipping onebillion_split.zip, there is no such file.

Run the Worksheet

First of all, thank you for your work!
The paper refers to this worksheet, but I do not know how to run it.
Thank you.

Weird output of the edit encoder

I spotted something strange in edit_model/edit_encoder.py, in the seq_batch_noise function at line 62:

new_values[:, 0, :] = phint*m_expand+ prand*(1-m_expand)

This basically returns a noisy version of only one vector (the first one), and all other vectors are set to 0, instead of adding noise to every vector as specified in the docstring. This then propagates to the input of the attention decoder, so the attention layers over the insert and delete embeddings only use information from the first insert or delete token.

Is there a reason for this, or is it just a mistake?
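
To make the indexing concern concrete, the sketch below mirrors `new_values[:, 0, :] = ...` using nested lists shaped batch x time x dim. It assumes, as the issue states, that new_values starts as zeros. The `noisy` values are stand-ins for `phint*m_expand + prand*(1-m_expand)`, and `write_every_step` is only one possible reading of "every vector", not the repository's intended behaviour.

```python
def zeros(batch, seq_len, dim):
    """Build a batch x seq_len x dim nested list filled with 0.0."""
    return [[[0.0] * dim for _ in range(seq_len)] for _ in range(batch)]


def write_first_step_only(noisy, seq_len):
    """Analogue of `new_values[:, 0, :] = noisy` on a zero tensor."""
    batch, dim = len(noisy), len(noisy[0])
    new_values = zeros(batch, seq_len, dim)
    for b in range(batch):
        new_values[b][0] = list(noisy[b])
    return new_values


def write_every_step(noisy, seq_len):
    """Hypothetical alternative: write the noisy vector at every time step."""
    batch, dim = len(noisy), len(noisy[0])
    new_values = zeros(batch, seq_len, dim)
    for b in range(batch):
        for t in range(seq_len):
            new_values[b][t] = list(noisy[b])
    return new_values


def nonzero_steps(values):
    """Count, per batch element, the time steps holding a non-zero vector."""
    return [sum(1 for step in seq if any(step)) for seq in values]


noisy = [[1.0, 2.0], [3.0, 4.0]]  # stand-in values, one vector per batch element
first_only = write_first_step_only(noisy, seq_len=3)
every_step = write_every_step(noisy, seq_len=3)
```

With index 0 on the time axis, only one position per sequence carries a value, which matches the observation that the attention sees information at the first position only.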

Release datasets

the dataset of (prototype, revision) sentence pairs
the sentence analogy evaluation

Testing the trained model

I had two questions.
Firstly, I have trained my model on the Yelp dataset and now want to test it. How do I start testing? I looked at the code, but it isn't clear how to begin testing the model.

Thanks in advance

why training is always killed without any error information

Hi,

My training is always killed without any error information, as shown below.

uncomitted changes being stored as patches
New TrainingRun created at: /data/edit_runs/7
Optimized batches: reduced cost from 45709568 (naive) to 20758016 (0.545871533942% reduction).
Optimal (batch_size=1) would be 20741962.
Passed batching test
Streaming training examples:   6%|5         | 399/7032 [48:47<12:31:31,  6.80s/it]Killed

Exception: batching error - examples do not produce identical results under batching

Hello,
Thank you for releasing such amazing work. I tried to run the code with the following steps.

Clone this repo
mkdir -p $DATA_DIR and uncompress the GloVe vectors into $DATA_DIR/word_vectors
Download glove.6B.300d_onebil.txt and glove.6B.300d_yelp.txt from https://worksheets.codalab.org/bundles/0x89bc0497bbb14ee489d33e032fa43a2e/
Download onebillion_split dataset from https://worksheets.codalab.org/bundles/0x017b7af92956458abc7f4169830a6537/ and put them in $DATA_DIR/onebillion_split

However, I received an error while executing python textmorph/edit_model/main.py configs/edit_model/edit_onebil.txt --gpu 0 inside Docker:

individually:
Variable containing:
205.0143
[torch.cuda.FloatTensor of size 1 (GPU 0)]

batched:
Variable containing:
205.0143
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Traceback (most recent call last):
File "textmorph/edit_model/main.py", line 40, in
exp.train()
File "/code/textmorph/edit_model/training_run.py", line 265, in train
self._train(self.config, self._train_state, self._examples, self.workspace, self.metadata, self.tb_logger)
File "/code/textmorph/edit_model/training_run.py", line 399, in _train
editor.test_batch(noiser(train_batches[0]))
File "/code/textmorph/edit_model/editor.py", line 128, in test_batch
raise Exception('batching error - examples do not produce identical results under batching')
Exception: batching error - examples do not produce identical results under batching

After some checks, I found that the individual result is 205.01431 while the batched result is 205.01433. What should I do about this?

Appreciate any help! Thank you 😄
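
For context on the two numbers: single-precision floats in the range [128, 256) are spaced 2^-16 (about 1.5e-5) apart, so 205.01431 and 205.01433 differ by roughly one representable step. The sketch below shows how a tolerance-based comparison treats them. How test_batch actually compares the two losses is not shown in this thread, so `results_match` and its `rel_tol` value are assumptions, not the repository's check.

```python
import math

# Spacing between adjacent float32 values in [128, 256): 2 ** (7 - 23).
FLOAT32_SPACING_NEAR_205 = 2.0 ** -16


def results_match(individual, batched, rel_tol=1e-6):
    """Hypothetical check: equal up to a relative tolerance, not bit-exact."""
    return math.isclose(individual, batched, rel_tol=rel_tol)


individual = 205.01431
batched = 205.01433

exactly_equal = individual == batched
close_enough = results_match(individual, batched)
difference = abs(individual - batched)
```

A difference this small is consistent with floating-point rounding, since batched and per-example computations can sum values in a different order, although that alone does not prove the batching logic is correct.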

Whereabouts is the context stuff combined?

My oversimplified interpretation of how the model works is this: the encoder hidden states have some random normal noise added to them. These hidden states are then combined with the input word vector for the decoder, and the model is trained like a regular language model.

The only thing I wasn't able to find was the part of the code that does the context combining. All the context combiner classes and their parent classes appear to be empty.

load the model

Is there a way to load the model and reuse it? I couldn't find how to feed my own test sentences into the code.
If I only have source sentences without target pairs, can I generate new sentences using "INSERT" words that I specify myself?

When is the code going to be pushed?

Hi,

When is the code going to be pushed?

Regards

third-party of gtd

What is the third-party gtd library for? Is there any documentation for it?

glove.6B.300d_onebil.txt file not found error

I followed the instructions in https://github.com/kelvinguu/neural-editor/tree/readme. But I get the following error when executing:

python textmorph/edit_model/main.py configs/edit_model/edit_onebil.txt

The glove.6B.zip does not have this file. How do I obtain it?

No checkpoint to reload. Initializing fresh.
Traceback (most recent call last):
File "textmorph/edit_model/main.py", line 32, in
exp = experiments.new(config) # new experiment from config
File "/code/gtd/ml/training_run.py", line 144, in new
run = self._run_factory(config, save_dir)
File "/code/textmorph/edit_model/training_run.py", line 255, in __init__
self._train_state = self._initialize_train_state(config)
File "/code/textmorph/edit_model/training_run.py", line 322, in _initialize_train_state
editor = cls._build_editor(config.editor)
File "/code/textmorph/edit_model/training_run.py", line 297, in _build_editor
word_embeddings = SimpleEmbeddings.from_file(file_path, config.word_dim, vocab_size=config.vocab_size)
File "/code/gtd/ml/vocab.py", line 184, in from_file
with codecs.open(file_path, 'r', encoding='utf-8') as f:
File "/opt/conda/envs/pytorch-py27/lib/python2.7/codecs.py", line 896, in open
file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 2] No such file or directory: '/data/word_vectors/glove.6B.300d_onebil.txt'
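
A small pre-flight check can surface this kind of missing file before a training run is created. The sketch below is an illustration only: the `missing_files` helper is hypothetical, the single required path is taken from the traceback above, and the "/data" default is the directory shown in that traceback.

```python
import os


def missing_files(data_dir, relative_paths):
    """Return the relative paths that do not exist as files under data_dir."""
    return [
        path for path in relative_paths
        if not os.path.isfile(os.path.join(data_dir, path))
    ]


# Path taken from the IOError above; extend the list for other configs.
REQUIRED = ["word_vectors/glove.6B.300d_onebil.txt"]

if __name__ == "__main__":
    data_dir = os.environ.get("DATA_DIR", "/data")
    for path in missing_files(data_dir, REQUIRED):
        print("missing: " + os.path.join(data_dir, path))
```

The check only reports what is absent. Per the other issue on this page, the corpus-specific GloVe files are downloaded separately from the CodaLab bundle and are not part of glove.6B.zip.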

how to run it?

Reproducible experiments on FloydHub

Hi @kelvinguu,

First of all, really interesting paper!! What do you think about creating a public project on FloydHub to provide reproducible experiments and make it easier for other researchers and AI enthusiasts to interact and play with it?

Memory Usage Problem

Hi Kelvin

Thanks for the great work! I tried the default editor on a personal dataset and it works well. However, during training I found that memory usage keeps going up, which leads to the experiment being killed after some training steps. Is it because I am running the model the wrong way? (I just prepared the dataset and used main.py in the edit_model folder.) I would like to know how to fix this problem.
Thanks
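
One way to confirm that memory really grows with the number of steps is to log the process's peak resident set size periodically. The sketch below uses the standard-library `resource` module, which is Unix-only. The `peak_rss_mb` and `log_memory` helpers and the logging interval are assumptions, and where to call them inside the training loop is left open.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process, in megabytes.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024.0 * 1024.0 if sys.platform == "darwin" else 1024.0
    return peak / divisor


def log_memory(step, every=100):
    """Print peak memory every `every` steps (hypothetical hook)."""
    if step % every == 0:
        print("step %d: peak RSS %.1f MB" % (step, peak_rss_mb()))


for step in range(0, 300, 100):
    log_memory(step)
```

Because this reports the peak rather than current usage, a steadily rising value indicates growth, while a flat value after warm-up suggests usage has stabilized.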