cthorey / cs231
My corrections for the Stanford class assignments CS231n - Convolutional Neural Networks for Visual Recognition
To predict the labels, simply use:
y_predict[i] = np.argmax(np.bincount(closest_y))
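A minimal self-contained sketch of this majority vote (the `closest_y` values here are made up for illustration): `np.bincount` counts how often each label occurs among the k nearest neighbors, and `np.argmax` picks the most frequent one.

```python
import numpy as np

# closest_y holds the labels of the k nearest training points
# for one test example (illustrative values)
closest_y = np.array([2, 1, 2, 2, 0])

counts = np.bincount(closest_y)   # occurrences of each label: [1, 1, 3]
y_pred = np.argmax(counts)        # most frequent label wins the vote
```

Note that `np.argmax` breaks ties by returning the smallest label index, which matches the assignment's convention of breaking ties in favor of the smaller label.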
I can't view the svm.ipynb file from assignment one on GitHub. When opening it in Jupyter Notebook, it gives this error:
Unreadable Notebook: C:\Users\Elixir\Documents\Github\CS231\assignment1\svm.ipynb TypeError("argument of type 'NoneType' is not iterable")
Hey @cthorey ,
I recently went through your batch normalization tutorial here: What does gradient flowing through ... . First off, thank you so much for such an amazing post about batch normalization. I was implementing batch normalization in a FC-DNN but could find only a few resources that give code as well as derivations the way your blog does. Even though my implementation was successful, my derivations for the affine transformations were slightly off, and your post cleared up a few bugs I had.
I do have one question about the derivatives of beta and gamma here: CS231/assignment2/cs231n/layers.py. I was wondering whether the
dbeta values should be normalized by the training batch size, like so:
dbeta = np.sum(dout, axis=0) / batch_size
and similarly for dgamma:
dgamma = np.sum(va2 * dva3, axis=0) / batch_size
In my implementation I was using the full training set (a very naive implementation), and once I found the derivatives of gamma and beta, I always divided them by the number of rows in the training set. The results I got were nearly identical to the same architecture built in Keras.
I looked at the CS231 notes and several other implementations of batch norm online, and none of them divide the gradients of gamma and beta by the batch_size. Could you please give your thoughts on why that should be the case?
I feel they should be divided in order to normalize the gradients. I also tried not dividing the gradients of my beta and gamma, and as expected they exploded and diverged from the optimum values (my distributions for beta and gamma and Keras' were way off). I understand that if I use the entire training set, then I almost always have to divide by the training-set size, but I feel the same should hold when using mini-batches as well. Curious to know your thoughts :)
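One way to settle this numerically (a toy check, not the assignment code; the loss and shapes here are made up for illustration): when the loss is already a *mean* over the batch, the upstream gradient `dout` carries the 1/N factor itself, so `dbeta = np.sum(dout, axis=0)` matches the numerical gradient with no extra division.

```python
import numpy as np

np.random.seed(0)
N, D = 8, 3
x = np.random.randn(N, D)
gamma = np.random.randn(D)
beta = np.random.randn(D)

def loss(beta_):
    # batchnorm forward followed by a mean-over-batch loss
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    xhat = (x - mu) / np.sqrt(var + 1e-5)
    out = gamma * xhat + beta_
    return out.mean()  # mean over all N*D entries

# Because the loss averages over N*D entries, every entry of dout is 1/(N*D);
# dbeta is then just the column sum of dout -- no further division by N.
dout = np.full((N, D), 1.0 / (N * D))
dbeta = np.sum(dout, axis=0)

# centered-difference numerical gradient of the loss w.r.t. beta
h = 1e-6
dbeta_num = np.zeros(D)
for j in range(D):
    bp = beta.copy(); bp[j] += h
    bm = beta.copy(); bm[j] -= h
    dbeta_num[j] = (loss(bp) - loss(bm)) / (2 * h)
```

If the 1/N division were applied again inside the backward pass, the analytic gradient would be N times too small relative to the numerical one.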
Thanks for your time again ! :D
The functions
compute_distances_two_loops
compute_distances_one_loop
compute_distances_no_loops
are all required to compute the L2 distance, so the dists matrix should be wrapped in np.sqrt
in all three cases.
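A tiny self-contained illustration of the point (toy data, not the assignment arrays): the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y yields *squared* distances, so the final matrix must pass through np.sqrt to be a true L2 distance.

```python
import numpy as np

X = np.array([[0.0, 0.0], [3.0, 4.0]])   # two "test" points
X_train = np.array([[0.0, 0.0]])         # one "train" point

# squared L2 distances via the standard expansion
sq = (np.sum(X**2, axis=1, keepdims=True)
      + np.sum(X_train**2, axis=1)
      - 2 * X.dot(X_train.T))

dists = np.sqrt(sq)   # the missing np.sqrt the issue is pointing at
# dists[1, 0] is 5.0 (the 3-4-5 right triangle)
```

Skipping the sqrt does not change which neighbor is nearest (sqrt is monotonic), but the dists matrix the assignment checks against is defined as the actual Euclidean distance.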
If you change it to key you will see significantly slower convergence. It looks like this typo comes from the original class, though.
CS231/assignment2/cs231n/classifiers/cnn.py
Line 143 in 11f0521
Should be:
bn_param['mode'] = mode
Please correct me if I am wrong:
Features to h1 becomes one single affine transformation without a non-linearity in between. There should be a non-linearity on h0 before it is passed to the RNN cell.
Your lstm_step_backward line# 343 is written as
# Backprop into step 5
dnext_c += o * (1 - np.tanh(next_c)**2) * dnext_h
which means,
dnext_c = dnext_c + o * (1 - np.tanh(next_c)**2) * dnext_h
But, from your lstm_step_forward line# 304:
next_h = o * np.tanh(next_c)
I think dnext_c from line# 304 is just
dnext_c = o * (1 - np.tanh(next_c)**2) * dnext_h
If this is intentional, what am I missing?
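For what it's worth, the `+=` can be checked numerically with a toy example (illustrative code, not the assignment's): in an LSTM step, next_c feeds two downstream paths, (1) directly into the next time step's cell state and (2) through tanh into next_h, so its total gradient is the sum of both contributions.

```python
import numpy as np

np.random.seed(0)
next_c = np.random.randn(4)
o = np.random.randn(4)
a = np.random.randn(4)  # weights of a toy scalar loss on next_h
b = np.random.randn(4)  # weights of a toy scalar loss on the direct cell path

def toy_loss(c):
    next_h = o * np.tanh(c)                     # path (2): through tanh
    return np.sum(a * next_h) + np.sum(b * c)   # path (1): c used directly too

# analytic gradient: upstream dnext_c from path (1), then accumulate path (2)
dnext_h = a
dnext_c = b.copy()                                   # arrives via path (1)
dnext_c += o * (1 - np.tanh(next_c) ** 2) * dnext_h  # the `+=` in question

# centered-difference numerical gradient agrees with the accumulated one
h = 1e-6
dnum = np.zeros(4)
for i in range(4):
    cp = next_c.copy(); cp[i] += h
    cm = next_c.copy(); cm[i] -= h
    dnum[i] = (toy_loss(cp) - toy_loss(cm)) / (2 * h)
```

If `dnext_c` were overwritten with `=` instead of accumulated with `+=`, the gradient arriving from path (1) would be silently dropped and the numerical check would fail.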
When a fully-connected net has more than 3 layers, the backprop gradient and the numerical gradient show a significant difference. This issue can be reproduced in Dropout.ipynb (in the cell "Fully-connected nets with Dropout"):
N, D, H1, H2, C = 2, 15, 20, 30, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=(N,))
s = np.random.randint(1)  # note: randint(1) always returns 0
for dropout in [0, 0.25, 1.0]:
    print('Running check with dropout = ', dropout)
    model = FullyConnectedNet([H1, 10, 10, 10, 10, 10, H2], input_dim=D, num_classes=C,
                              weight_scale=5e-2, dtype=np.float64,
                              dropout=dropout, seed=s)
    loss, grads = model.loss(X, y)
    print('Initial loss: ', loss)
    for name in sorted(grads):
        f = lambda _: model.loss(X, y)[0]
        grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)
        print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))
    print()
The output of this would be:
Running check with dropout = 0
Initial loss: 2.30258505897
W1 relative error: 2.41e-03
W2 relative error: 1.21e-03
W3 relative error: 1.60e-03
W4 relative error: 2.15e-03
W5 relative error: 1.75e-03
W6 relative error: 2.10e-03
W7 relative error: 1.89e-03
W8 relative error: 1.37e-03
b1 relative error: 1.76e-03
b2 relative error: 1.69e-02
b3 relative error: 6.03e-01
b4 relative error: 1.00e+00
b5 relative error: 1.00e+00
b6 relative error: 1.00e+00
b7 relative error: 1.00e+00
b8 relative error: 7.83e-11
Running check with dropout = 0.25
We use dropout with p =0.250000
Initial loss: 2.30258509299
W1 relative error: 0.00e+00
W2 relative error: 0.00e+00
W3 relative error: 0.00e+00
W4 relative error: 0.00e+00
W5 relative error: 0.00e+00
W6 relative error: 0.00e+00
W7 relative error: 0.00e+00
W8 relative error: 0.00e+00
b1 relative error: 0.00e+00
b2 relative error: 0.00e+00
b3 relative error: 0.00e+00
b4 relative error: 0.00e+00
b5 relative error: 1.00e+00
b6 relative error: 1.00e+00
b7 relative error: 1.00e+00
b8 relative error: 6.99e-11
Running check with dropout = 1.0
We use dropout with p =1.000000
Initial loss: 2.30258510213
W1 relative error: 3.55e-03
W2 relative error: 2.40e-03
W3 relative error: 2.44e-03
W4 relative error: 1.94e-03
W5 relative error: 1.98e-03
W6 relative error: 1.89e-03
W7 relative error: 2.13e-03
W8 relative error: 2.68e-03
b1 relative error: 2.36e-03
b2 relative error: 6.30e-04
b3 relative error: 7.33e-02
b4 relative error: 2.98e-01
b5 relative error: 1.00e+00
b6 relative error: 1.00e+00
b7 relative error: 1.00e+00
b8 relative error: 1.44e-10
I have tried several random seeds, and the bias gradient on the last layer is always correct, while the bias errors become extremely large from the last hidden layers onward. However, the errors on the W gradients seem correct all the time. I first noticed this odd behavior in my own implementation, and it seems the same thing occurs in yours. Any ideas?
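For anyone debugging this, it can help to see what the checker actually computes. A minimal sketch of a centered-difference gradient check in the spirit of eval_numerical_gradient (the names here are illustrative, not the assignment's exact code):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference numerical gradient of scalar function f at x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h          # evaluate f(x + h) at this coordinate
        fxph = f(x)
        x[ix] = old - h          # evaluate f(x - h)
        fxmh = f(x)
        x[ix] = old              # restore the original value
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

def rel_error(a, b):
    """Max relative error, guarded against division by zero."""
    return np.max(np.abs(a - b) / np.maximum(1e-8, np.abs(a) + np.abs(b)))
```

One caveat worth checking in cases like the one above: if the gradient of a parameter is analytically zero (e.g. a bias whose ReLU units are all dead for the tiny N=2 batch), both analytic and numerical values sit near floating-point noise, and the relative error can read as 1.00e+00 even when nothing is wrong.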
In the function compute_distances_no_loops(self, X):
T = np.sum(X**2, axis=1)
F = np.sum(self.X_train**2, axis=1).T
F = np.tile(F, (500, 5000))
FT = X.dot(self.X_train.T)
print(T.shape, F.shape, FT.shape, X.shape, self.X_train.shape)
dists = T + F - 2*FT
the line
F = np.tile(F, (500, 5000))
creates a matrix containing 500 * 5000 * 5000 elements.
Perhaps the code should be modified like this:
T = np.reshape(np.sum(X**2, axis=1), (num_test, 1))
F = np.sum(self.X_train**2, axis=1).T
F = np.tile(F, (num_test, 1))
FT = X.dot(self.X_train.T)
print(T.shape, F.shape, FT.shape, X.shape, self.X_train.shape)
dists = T + F - 2*FT
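In fact, with T reshaped to a column vector, NumPy broadcasting makes the np.tile call unnecessary altogether. A self-contained sketch of the fully vectorized version (function name is illustrative):

```python
import numpy as np

def l2_dists(X, X_train):
    """All-pairs Euclidean distances via ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y."""
    test_sq = np.sum(X ** 2, axis=1, keepdims=True)   # (num_test, 1)
    train_sq = np.sum(X_train ** 2, axis=1)           # (num_train,)
    cross = X.dot(X_train.T)                          # (num_test, num_train)
    sq = test_sq + train_sq - 2 * cross               # broadcasts, no tile needed
    return np.sqrt(np.maximum(sq, 0))                 # clip tiny negative round-off
```

The (num_test, 1) and (num_train,) shapes broadcast together into (num_test, num_train), so memory use stays at the size of the output matrix instead of the enormous tiled intermediate.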