dropout's People

Contributors

chenglongchen, mdenil

dropout's Issues

Momentum bug

Reported by Ian Goodfellow:

Hi Misha,
I think I found a bug in the momentum for your dropout demo. This came
up when someone suggested adding some code that was partially copied
from your demo to pylearn2.

The bug is with these lines:

for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom + (1. - mom) * gparam

# ... and take a step along that direction
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param - (1.-mom) * learning_rate * gparam_mom

There are two things I think are wrong here:

  1. When you update stepped_param, you want to use updates[gparam_mom]
    and not gparam_mom. gparam_mom is one time step too old. Only
    updates[gparam_mom] has been updated with the current gradient.
    gparam_mom won't contain the updated value until after the theano
    function finishes executing. (At first I thought you were doing
    Nesterov momentum, but that would need the gradient from t+1, not t-1)
  2. If you expand the recurrence for stepped_param, it doesn't match
    the formula in appendix A.1 of Geoff's paper. It ends up multiplying by
    (1-mom)^2 instead of (1-mom). This probably makes you need a bigger
    learning rate to compensate, since 1-mom will be a small number.

Hope that helps,
Ian
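
A minimal sketch of an update rule that would address both points, reusing the demo's own names (classifier.params, gparams, gparams_mom, mom, learning_rate) and matching the rule quoted in the "Momentum again" issue below; this is only an illustration, not the repository's exact fix:

from collections import OrderedDict

updates = OrderedDict()

# accumulate the new momentum value for every parameter
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom + (1. - mom) * gparam

# step using the freshly computed momentum (updates[gparam_mom]) rather than
# the stale shared value, and apply the (1 - mom) factor only once so the
# expanded recurrence agrees with appendix A.1
for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param - learning_rate * updates[gparam_mom]
    # (a max-norm constraint on the weights, discussed in a later issue, would also go here)
    updates[param] = stepped_param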

Random dropout at each mini-batch?

In Hinton's paper, it is said that "On each presentation of each training case, each hidden unit is randomly omitted from the network with a probability of 0.5".

To my understanding, in function _dropout_from_layer,
srng = theano.tensor.shared_randomstreams.RandomStreams(
    rng.randint(999999))

srng seems fixed, and thus the randomly dropped hidden units appear to be determined before seeing the samples and to be the same for every epoch and mini-batch. In that case, it would be far from real dropout. But I might be wrong, since I am not that familiar with theano. Can someone clarify this for me?

Regards,
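
A small self-contained way to check this with Theano's RandomStreams (a sketch; seeding the stream fixes the sequence of draws, not a single mask, so a function built from srng.binomial should return a new mask on every call, i.e. on every mini-batch):

import numpy
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams

rng = numpy.random.RandomState(1234)
srng = RandomStreams(rng.randint(999999))  # same construction as in _dropout_from_layer

x = T.matrix('x')
# a fresh Bernoulli mask is drawn from the stream each time the function runs
mask = srng.binomial(n=1, p=0.5, size=x.shape)
f = theano.function([x], mask)

a = numpy.ones((2, 5), dtype=theano.config.floatX)
print(f(a))  # one mask
print(f(a))  # generally a different mask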

Dropout rate should be set to 0 if not using dropout

Hi,
Thanks for this code.
From looking at it, I think the dropout rate should be set to 0 when dropout is not being used, since the code appears to rescale the weights W based on that rate regardless of whether dropout is used or not.
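
A tiny illustration of the point, using hypothetical names (use_dropout, dropout_rate, W): with the rate forced to 0 when dropout is off, the (1 - rate) rescaling of the weights becomes a no-op.

p = dropout_rate if use_dropout else 0.0   # hypothetical flags, for illustration only
W_scaled = (1.0 - p) * W                   # with p = 0 the weights are left untouched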

About the Resample Issue

Hi, I am just a little confused about the setup. In Hinton's paper you need to re-sample the dropout units each time, but this code seems to fix the dropout units at the beginning and never resample them again. Am I missing something here? Thanks in advance.

-Lin

dropout training doesn't work with more than 3 hidden layers

Regardless of the width of the hidden layers, it seems that if I have more than 3 hidden layers, dropout training does not work. I wonder if some bug is causing this.

4-layer with backprop only (no dropout) works

$ python mlp.py backprop
... building the model: hidden layers [600, 200, 100, 100], dropout: False [0.0, 0.0, 0.0, 0.0, 0.0]
... training
epoch 1, test error 0.375 (300), learning_rate=1.0 (patience: 7448 iter 930)  **
epoch 2, test error 0.375 (300), learning_rate=0.998 (patience: 7448 iter 1861)
epoch 3, test error 0.375 (300), learning_rate=0.996004 (patience: 7448 iter 2792)
epoch 4, test error 0.375 (300), learning_rate=0.994011992 (patience: 7448 iter 3723)
epoch 5, test error 0.375 (300), learning_rate=0.992023968016 (patience: 7448 iter 4654)
epoch 6, test error 0.3625 (290), learning_rate=0.99003992008 (patience: 7448 iter 5585)  **
epoch 7, test error 0.33875 (271), learning_rate=0.98805984024 (patience: 22340.0 iter 6516)  **
epoch 8, test error 0.3175 (254), learning_rate=0.986083720559 (patience: 26064.0 iter 7447)  **
epoch 9, test error 0.32375 (259), learning_rate=0.984111553118 (patience: 29788.0 iter 8378)
epoch 10, test error 0.325 (260), learning_rate=0.982143330012 (patience: 29788.0 iter 9309)

4-layer with dropout does not work

$ python mlp.py dropout
... building the model: hidden layers [600, 200, 100, 100], dropout: [0.0, 0.5, 0.5, 0.5, 0.5]
... training
epoch 1, test error 0.375 (300), learning_rate=1.0 (patience: 27930 / 930)  **
epoch 2, test error 0.375 (300), learning_rate=0.998 (patience: 27930 / 1861)
epoch 3, test error 0.375 (300), learning_rate=0.996004 (patience: 27930 / 2792)
epoch 4, test error 0.375 (300), learning_rate=0.994011992 (patience: 27930 / 3723)
epoch 5, test error 0.375 (300), learning_rate=0.992023968016 (patience: 27930 / 4654)
...
epoch 29, test error 0.375 (300), learning_rate=0.945486116479 (patience: 27930 / 26998)
epoch 30, test error 0.375 (300), learning_rate=0.943595144246 (patience: 27930 / 27929)
epoch 31, test error 0.375 (300), learning_rate=0.941707953958 (patience: 27930 / 28860)

3-layer with dropout works

$ python mlp.py dropout
... building the model: hidden layers [600, 200, 100], dropout: [0.0, 0.5, 0.5, 0.5]
... training
epoch 1, test error 0.375 (300), learning_rate=1.0 (patience: 27930 / 930)  **
epoch 2, test error 0.375 (300), learning_rate=0.998 (patience: 27930 / 1861)
epoch 3, test error 0.375 (300), learning_rate=0.996004 (patience: 27930 / 2792)
epoch 4, test error 0.375 (300), learning_rate=0.994011992 (patience: 27930 / 3723)
epoch 5, test error 0.375 (300), learning_rate=0.992023968016 (patience: 27930 / 4654)
epoch 6, test error 0.365 (292), learning_rate=0.99003992008 (patience: 27930 / 5585)  **
epoch 7, test error 0.3625 (290), learning_rate=0.98805984024 (patience: 27930 / 6516)  **
epoch 8, test error 0.3375 (270), learning_rate=0.986083720559 (patience: 27930 / 7447)  **
epoch 9, test error 0.3275 (262), learning_rate=0.984111553118 (patience: 29788 / 8378)  **
epoch 10, test error 0.33375 (267), learning_rate=0.982143330012 (patience: 33512 / 9309)
epoch 11, test error 0.32875 (263), learning_rate=0.980179043352 (patience: 33512 / 10240)
epoch 12, test error 0.315 (252), learning_rate=0.978218685265 (patience: 33512 / 11171)  **

no bias in mlp.py

Great initiative to implement the dropout technique in theano! But is there any reason why the mlp.py code doesn't use biases for its neurons?
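
For reference, a hidden layer with a bias term would look roughly like this in Theano (a sketch with illustrative names; n_out, input, W and activation are assumed to come from the surrounding layer setup):

import numpy
import theano
import theano.tensor as T

b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
b = theano.shared(value=b_values, name='b', borrow=True)

# affine transform plus bias, followed by the nonlinearity
output = activation(T.dot(input, W) + b)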

dropping output units rather than connections

Hi,

I think you've got a bug in your implementation: you're applying the dropout mask to output units rather than elements of your weight matrices, which is what the original version of dropout is intended to do. This means that you're dropping out bias units randomly, which might disrupt the model averaging interpretation of dropout.

I'm not sure if you intended to do this, but if not, you should consider having _dropout_from_layer apply a mask directly to the Ws and then computing the layer output (see eqns. 2.3--2.6 of Nitish's thesis).

Incorrect weight scaling on inputs

The input is dropped out with probability 0.2, which means the weight matrix should be multiplied by 0.8, but the code multiplies it by 0.5.
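
A short sketch of the intended scaling (illustrative names; dropout_rates[i] is assumed to hold the dropout probability applied to layer i's inputs, e.g. [0.2, 0.5, 0.5]): each weight matrix should be scaled by the retention probability of its own inputs, so 0.2 input dropout gives a factor of 0.8 rather than 0.5.

# scale each layer's weights by the probability that its inputs were kept
for layer, p_drop_in in zip(layers, dropout_rates):
    layer.W_for_prediction = (1.0 - p_drop_in) * layer.W_trained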

Momentum again

I used the momentum update rule and the gradient rule to train my network. However, the error starts to decrease for the first few epochs and then flattens out.
I checked those update rules and compared them with Hinton's dropout paper (and also the ImageNet paper), and they seem to be different.

The current update rules implemented are as follows:

updates = OrderedDict()
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom + (1. - mom) * gparam

for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param - learning_rate * updates[gparam_mom]

According to my understanding of Appendix A.1 in the dropout paper, learning_rate should NOT be multiplied into mom * gparam_mom, but the above code does so. According to the formula therein, we should have:

updates = OrderedDict()
for gparam_mom, gparam in zip(gparams_mom, gparams):
    updates[gparam_mom] = mom * gparam_mom - (1. - mom) * learning_rate * gparam

for param, gparam_mom in zip(classifier.params, gparams_mom):
    stepped_param = param + updates[gparam_mom]

Am I right? Or is the currently implemented update rule actually used somewhere that I am not aware of? In that case, I'd love some pointers.

As another note, since learning_rate is multiplied by (1. - mom), a large learning_rate is expected to give good results (no wonder Hinton uses 10 now...).

However, in their ImageNet paper, they use a slightly different rule for updating the momentum, which includes weight decay. Also, learning_rate is no longer multiplied by (1. - mom). In that case, we should expect a small learning_rate. In code, the ImageNet update rule is:

updates = OrderedDict()
for gparam_mom, gparam, param in zip(gparams_mom, gparams, params):
    updates[gparam_mom] = mom * gparam_mom - learning_rate * (weight_decay*param + gparam)

Regards,

Constrain weight matrix columns instead of rows

This block of code constrains the norms of the rows of the weight matrix:

https://github.com/mdenil/dropout/blob/master/mlp.py#L245-L254

It should constrain the norms of the columns as described in the original paper:

Instead of penalizing the squared length (L2 norm) of the whole weight vector, we set an upper bound on the L2 norm of the incoming weight vector for each individual hidden unit. If a weight-update violates this constraint, we renormalize the weights of the hidden unit by division.

The matrix orientation in the code means that each column corresponds to a hidden unit (see: https://github.com/mdenil/dropout/blob/master/mlp.py#L38-L41), so it is the columns and not the rows that should be constrained.
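
A sketch of a column-wise max-norm constraint under that orientation (W has shape (n_in, n_out), so column j holds the incoming weights of hidden unit j); stepped_W, max_col_norm and the small epsilon are illustrative names, not the repository's:

import theano.tensor as T

# L2 norm of each column, i.e. of the incoming weight vector of each hidden unit
col_norms = T.sqrt(T.sum(T.sqr(stepped_W), axis=0))
# leave columns within the bound alone; shrink the others back onto the constraint
desired_norms = T.clip(col_norms, 0., max_col_norm)
constrained_W = stepped_W * (desired_norms / (1e-7 + col_norms))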

License

Would you please add a LICENSE file?
