neural-editor's Issues

Why is kill_edit=True in the Google corpus config, but not in Yelp's?

Hey @kelvinguu, big thanks for uploading this code. After reviewing the raw code, I'm confused as to why you set the kill_edit flag to True in the Google corpus config but to False in Yelp's.

With kill_edit, the entire set of edit vectors is set to zero, which essentially prevents any edit vector from being used during training.
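
In other words, something like this (my reading of the flag as a sketch, not the repo's exact code):

    import torch

    # My reading of what kill_edit does (a sketch, not the repo's code):
    # when the flag is set, the edit vector is zeroed out, so the
    # decoder is conditioned on the prototype sentence alone.
    def maybe_kill_edit(edit_vector, kill_edit):
        return torch.zeros_like(edit_vector) if kill_edit else edit_vector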

Did you set this to True for the Google corpus because its phrases have many more edits compared to the Yelp corpus?

Also, do you know when the data dir will be uploaded so we can test Neural Editor? Thanks!

Some questions about attention on insert and delete words

Hi all,
I read through the generation process and have some questions about the attention part. In attention_decoder.py, around line 53, we can see that during decoding we attend over the following three things:

  1. the encoder hidden output at every timestep
  2. attention over insert_noisy_exact
  3. attention over delete_noisy_exact

However, in the EditTrainingRun class, the _train function uses an EditNoiser object, which first adds noise to each input example. In your CodaLab config I see three lines controlling this EditNoiser:

  edit_dropout = true
  ident_pr = 0.1
  attend_pr = 0.0

In the EditNoiser code:

    def _noise(self, ex):
        """Return a noisy EditExample.

        Note: this strategy is only appropriate for diff-style EditExamples.

        Args:
            ex (EditExample)

        Returns:
            EditExample: a new example. Does not modify the original example.
        """
        # with probability ident_pr, turn the example into an identity map
        # (source -> source) and drop all insert/delete words
        ident_map = np.random.binomial(1, self.ident_pr)
        if ident_map:
            return EditExample(ex.source_words, [], [], [], [], ex.source_words)
        else:
            # otherwise randomly split the insert/delete words into
            # "exact" and "approx" subsets
            insert_exact, insert_approx = self.dropout_split(ex.insert_exact_words)
            delete_exact, delete_approx = self.dropout_split(ex.delete_exact_words)
            return EditExample(ex.source_words, insert_approx, insert_exact, delete_approx, delete_exact, ex.target_words)

If attend_pr is zero, then insert_exact will be empty and insert_approx will be the original ex.insert_exact_words, so in the new EditExample insert_exact and delete_exact are always empty. That means the two attention components mentioned above,

  1. attention over insert_noisy_exact
  2. attention over delete_noisy_exact

are never used. I want to confirm this: if they are unused, why not just remove them? They make the decoder's input vector much larger.
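
For concreteness, this is how I read dropout_split (a hypothetical reconstruction; the real body may differ):

    import numpy as np

    # Hypothetical reconstruction of dropout_split (the real
    # implementation may differ): each word is kept "exact" with
    # probability attend_pr and demoted to "approx" otherwise. With
    # attend_pr = 0.0, the exact list is therefore always empty,
    # matching the behavior described above.
    def dropout_split(words, attend_pr):
        exact, approx = [], []
        for w in words:
            (exact if np.random.rand() < attend_pr else approx).append(w)
        return exact, approx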

Looking forward to your reply.
Thanks,
Gordon

Text preprocessing code.

Hi,
I was trying to build my own training dataset based on sentence Jaccard distance.
It looks like comparing all sentence pairs with the naive algorithm will take far too much time :)
Could you provide the text preprocessing code?
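
For reference, this is the naive per-pair computation I mean (a minimal sketch); with n sentences it is O(n²) comparisons, which is what makes it so slow. MinHash/LSH is the standard way to avoid the all-pairs comparison for Jaccard similarity.

    # Naive sentence Jaccard distance over token sets:
    # 1 - |intersection| / |union|. (Assumes non-empty sentences.)
    def jaccard_distance(sent1, sent2):
        a, b = set(sent1.split()), set(sent2.split())
        return 1.0 - float(len(a & b)) / len(a | b)

    print(jaccard_distance("the food was great", "the food was terrible"))  # 0.4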

Improving Edit Pairs with Different Distance Metrics

Hey @kelvinguu and @thashim, thanks for your help so far. I've been doing some more research on how to measure the "closest" sentence when creating the training dataset of edit pairs.

First off, I think BFS is absolutely necessary, especially for the bigger datasets. But I feel the way similarity between sentences is measured can be improved:

  1. Doc2Vec Cosine Distance. In this approach you train doc2vec over the sentences, which yields a final vector for each sentence, and then find the closest sentences to your prototype by cosine distance. I've tried this method already and unfortunately it doesn't seem to find similarly worded sentences.

  2. Edit Distance. In this approach you use edit distance instead of Jaccard, which offers the benefit of GPU usage (see TensorFlow's edit_distance). In this method, deletions carry a higher weight.

  3. Word2Vec Averaging + Cosine Distance. In this approach you take the word vectors of each sentence, average them to get a final vector for the sentence, and then compare cosine distances (see the sketch after this list). I feel this approach would be the best because it accounts for similarity between words: for instance, "feeling" and "feelings" are treated similarly, which does not happen with Jaccard or edit distance.

  4. Jaccard Distance Without Stop Words. In this approach you remove all stop words from a sentence and then measure the Jaccard distance, which would hopefully give a better semantic match. This could be used in addition to the original Jaccard distance.
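
Here is a minimal sketch of approach 3; word_vecs is assumed to be a word-to-vector dict (e.g. loaded from glove.6B.300d.txt):

    import numpy as np

    # Approach 3 sketched: average a sentence's word vectors, then
    # compare sentences by cosine distance. `word_vecs` is assumed to
    # map each word to a NumPy vector.
    def sentence_vec(sentence, word_vecs, dim=300):
        vecs = [word_vecs[w] for w in sentence.split() if w in word_vecs]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def cosine_distance(u, v):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)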

Obviously, you guys worked really hard, and these are probably all approaches you thought of already. If you could comment on why you chose vanilla Jaccard over these, it would be much appreciated!

How can we run this on multiple GPUs?

I've been trying to run this code on a multi-GPU machine by listing several GPU indices in the CUDA_VISIBLE_DEVICES variable, but it always ends up running on only one GPU. Has anyone had success with this?
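
For what it's worth, the released code doesn't appear to split work across devices on its own, so CUDA_VISIBLE_DEVICES alone won't parallelize anything. A generic PyTorch pattern, untested against this repo, is nn.DataParallel:

    import torch
    import torch.nn as nn

    # Generic PyTorch data parallelism (not something this repo appears
    # to wire up itself): DataParallel replicates a module on every
    # visible GPU and splits each input batch along dim 0.
    model = nn.Linear(300, 300)  # stand-in for the editor model
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    model = model.cuda()
    out = model(torch.randn(64, 300).cuda())  # batch is split across GPUs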

Need data preprocessing?

Following the README, everything is set up. Running
python textmorph/edit_model/main.py configs/edit_model/edit_onebil.txt,
I get the error:
No such file or directory: '/data/word_vectors/glove.6B.300d_onebil.txt'

After unzipping onebillion_split.zip, there is no such file.

Run the Worksheet

Thank you for your work!
The paper refers to this worksheet, but I don't know how to run it.
Thank you.

Weird output of the edit encoder

I spotted something strange in edit_model/edit_encoder.py, in the seq_batch_noise function at line 62:

new_values[:, 0, :] = phint*m_expand+ prand*(1-m_expand)

This returns a noisy version of only one vector (the first one) and sets all the others to zero, instead of noising every vector as specified in the docstring. This then propagates to the input of the attention decoder, so the attention layers over the insert and delete embeddings only ever use the first insert or delete token's information.

Is there a reason for this, or is it just a mistake?
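
If the docstring is right, I would expect something like this instead (a sketch, assuming phint, prand and m_expand are full batch-by-time-by-dim tensors; I haven't verified the shapes):

    # Sketch of the behavior I would expect from seq_batch_noise,
    # assuming phint, prand and m_expand all have shape
    # (batch, time, dim): noise every timestep, not just the first.
    def noise_all_timesteps(phint, prand, m_expand):
        return phint * m_expand + prand * (1 - m_expand)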

Release datasets

  • the dataset of (prototype, revision) sentence pairs
  • the sentence analogy evaluation

Testing the trained model

I have two questions.
First, I have trained the model on the Yelp dataset and now want to test it; how do I start testing? I looked at the code, but it isn't intuitive how to kick off testing of the model.

Thanks in Advance

Why is training always killed without any error information?

Hi,

My training keeps getting killed without any error message, as shown below.

uncomitted changes being stored as patches
New TrainingRun created at: /data/edit_runs/7
Optimized batches: reduced cost from 45709568 (naive) to 20758016 (0.545871533942% reduction).
Optimal (batch_size=1) would be 20741962.
Passed batching test
Streaming training examples:   6%|5         | 399/7032 [48:47<12:31:31,  6.80s/it]Killed

Exception: batching error - examples do not produce identical results under batching

Hello,
Thank you for releasing such amazing work. I tried to run the process with the following steps:

  1. Clone this repo
  2. mkdir -p $DATA_DIR and uncompress the GloVe vectors into $DATA_DIR/word_vectors
  3. Download glove.6B.300d_onebil.txt and glove.6B.300d_yelp.txt from https://worksheets.codalab.org/bundles/0x89bc0497bbb14ee489d33e032fa43a2e/
  4. Download the onebillion_split dataset from https://worksheets.codalab.org/bundles/0x017b7af92956458abc7f4169830a6537/ and put it in $DATA_DIR/onebillion_split

However, I received an error while executing python textmorph/edit_model/main.py configs/edit_model/edit_onebil.txt --gpu 0 inside Docker:

individually:
Variable containing:
205.0143
[torch.cuda.FloatTensor of size 1 (GPU 0)]

batched:
Variable containing:
205.0143
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Traceback (most recent call last):
File "textmorph/edit_model/main.py", line 40, in
exp.train()
File "/code/textmorph/edit_model/training_run.py", line 265, in train
self._train(self.config, self._train_state, self._examples, self.workspace, self.metadata, self.tb_logger)
File "/code/textmorph/edit_model/training_run.py", line 399, in _train
editor.test_batch(noiser(train_batches[0]))
File "/code/textmorph/edit_model/editor.py", line 128, in test_batch
raise Exception('batching error - examples do not produce identical results under batching')
Exception: batching error - examples do not produce identical results under batching

After some checking, I found that the individual result is 205.01431 while the batched one is 205.01433. What should I do about this?
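
(A difference of 2e-5 at a magnitude of ~205 is within ordinary float32 rounding, so this looks like GPU non-determinism rather than a real batching bug.) One thing I considered trying is comparing with a relative tolerance instead of exact equality; a sketch, with variable names that are my own guesses rather than the ones in editor.py:

    # Compare the individual and batched losses up to a relative
    # tolerance instead of exact equality. `individual` and `batched`
    # stand for the two loss values; editor.py may use other names.
    def losses_match(individual, batched, rel_tol=1e-4):
        denom = max(abs(individual), abs(batched), 1e-8)
        return abs(individual - batched) / denom <= rel_tol

    assert losses_match(205.01431, 205.01433)  # passes under tolerance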

Appreciate any help! Thank you 😄

Whereabouts is the context combining done?

My oversimplified interpretation of how the model works is: the encoder hidden states have some random normal noise added to them; these hidden states are combined with the input word vector for the decoder; and then it's trained like a regular language model.

The only thing I wasn't able to find is the part of the code that does the context combining. All the context combiner classes and their parent classes appear to be empty.

load the model

Is there a way to load the model and reuse it? I didn't find where to feed my test sentences into the code.
If I only have source sentences without target pairs, can I generate new sentences using "INSERT" words that I specify myself?

third-party of gtd

What is the third-party gtd library for? Is there any documentation for it?

glove.6B.300d_onebil.txt file not found error

I followed the instructions in https://github.com/kelvinguu/neural-editor/tree/readme, but I get the following error when executing:

python textmorph/edit_model/main.py configs/edit_model/edit_onebil.txt

The glove.6B.zip does not have this file. How do I obtain it?

No checkpoint to reload. Initializing fresh.
Traceback (most recent call last):
File "textmorph/edit_model/main.py", line 32, in
exp = experiments.new(config) # new experiment from config
File "/code/gtd/ml/training_run.py", line 144, in new
run = self._run_factory(config, save_dir)
File "/code/textmorph/edit_model/training_run.py", line 255, in init
self._train_state = self._initialize_train_state(config)
File "/code/textmorph/edit_model/training_run.py", line 322, in _initialize_train_state
editor = cls._build_editor(config.editor)
File "/code/textmorph/edit_model/training_run.py", line 297, in _build_editor
word_embeddings = SimpleEmbeddings.from_file(file_path, config.word_dim, vocab_size=config.vocab_size)
File "/code/gtd/ml/vocab.py", line 184, in from_file
with codecs.open(file_path, 'r', encoding='utf-8') as f:
File "/opt/conda/envs/pytorch-py27/lib/python2.7/codecs.py", line 896, in open
file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 2] No such file or directory: '/data/word_vectors/glove.6B.300d_onebil.txt'
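
In the meantime, my guess is that the missing file is just the stock glove.6B.300d.txt filtered down to the corpus vocabulary, so I tried producing it myself. A sketch (the corpus path and the filtering rule are assumptions, not confirmed by the authors):

    import codecs

    # Guess: glove.6B.300d_onebil.txt is glove.6B.300d.txt restricted
    # to words that appear in the One Billion Word training split.
    vocab = set()
    corpus = '/data/onebillion_split/train.txt'  # path is a guess
    with codecs.open(corpus, 'r', encoding='utf-8') as f:
        for line in f:
            vocab.update(line.strip().split())

    src = '/data/word_vectors/glove.6B.300d.txt'
    dst = '/data/word_vectors/glove.6B.300d_onebil.txt'
    with codecs.open(src, 'r', encoding='utf-8') as fin, \
         codecs.open(dst, 'w', encoding='utf-8') as fout:
        for line in fin:
            if line.split(' ', 1)[0] in vocab:
                fout.write(line)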

Reproducible experiments on FloydHub

Hi @kelvinguu,

First of all, really interesting paper! What do you think about creating a public project on FloydHub to make the experiments reproducible and simplify the way other researchers and AI enthusiasts can interact and play with them?

Memory Usage Problem

Hi Kelvin

Thanks for the great work! I tried the default editor on a personal dataset and it works well. However, during training the memory usage keeps increasing, which leads to the experiment being killed after some training steps. Am I running the model the wrong way? (I just prepared the dataset and used main.py in the edit_model folder.) I'd like to know how to fix this.
Thanks
