pytorch-cifar's Introduction

Train CIFAR10 with PyTorch

I'm playing with PyTorch on the CIFAR10 dataset.

Prerequisites

  • Python 3.6+
  • PyTorch 1.0+

Training

# Start training with: 
python main.py

# You can manually resume the training with: 
python main.py --resume --lr=0.01

Accuracy

Model | Acc.
VGG16 | 92.64%
ResNet18 | 93.02%
ResNet50 | 93.62%
ResNet101 | 93.75%
RegNetX_200MF | 94.24%
RegNetY_400MF | 94.29%
MobileNetV2 | 94.43%
ResNeXt29(32x4d) | 94.73%
ResNeXt29(2x64d) | 94.82%
SimpleDLA | 94.89%
DenseNet121 | 95.04%
PreActResNet18 | 95.11%
DPN92 | 95.16%
DLA | 95.47%

pytorch-cifar's People

Contributors

bearpaw, fducau, kuangliu, ypwhs

pytorch-cifar's Issues

Issue about shortcut.

In your ResNet model, you apply a ReLU activation after bn(conv2), before the shortcut is added. I have seen other people add the shortcut directly to the output of bn, without the ReLU. Which one is better? Did you try them separately?

Still, your work is great for PyTorch newcomers like me ^_^. I hope to see Wide ResNet and DeepLab in the future.

Best wishes.

Performance regarding inference and backward time

Hi Kuangliu,
Nice job on achieving good results on CIFAR10. Have you evaluated the speed of all the models in the experiments? It seems MobileNetV2 cannot achieve a significant speed improvement, since the architecture was originally designed for the ImageNet classification task.

Regards

Standard deviation for transforms.Normalize

Hey,

How did you calculate the standard deviation values for transforms.Normalize? I am getting the same means, but different standard deviations:

import numpy as np
import torch

from torchvision import datasets
from torchvision import transforms

transform_train = transforms.Compose([
#     transforms.RandomCrop(32, padding=4),
#     transforms.RandomHorizontalFlip(),
    transforms.ToTensor()
])

trainset = datasets.CIFAR10(root='data', train=True, download=True, transform=transform_train)
train_loader = torch.utils.data.DataLoader(trainset, batch_size=50_000, shuffle=True)

# Load the whole training set as a single batch of shape (N, C, H, W)
train = next(iter(train_loader))[0]

print('Mean: {}'.format(np.mean(train.numpy(), axis=(0, 2, 3))))
# Mean: [ 0.49139765  0.48215759  0.44653141]
print('STD: {}'.format(np.std(train.numpy(), axis=(0, 2, 3))))
# STD: [ 0.24703199  0.24348481  0.26158789]
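
One possible source of a mismatch like this, offered as an assumption and not as a confirmed account of how the repository's values were derived: a standard deviation computed per image and then averaged is never larger, and usually smaller, than one computed over all pixels at once. A minimal sketch on synthetic data (the array `images` is a made-up stand-in for the CIFAR10 training tensor):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the training set: 1000 RGB images of 32x32 in [0, 1].
# Each image gets its own brightness offset so that image means differ.
offsets = rng.uniform(0.2, 0.8, size=(1000, 1, 1, 1))
images = np.clip(offsets + rng.normal(0.0, 0.1, size=(1000, 3, 32, 32)), 0.0, 1.0)

# Global statistic: one std per channel over every pixel of every image.
global_std = images.std(axis=(0, 2, 3))

# Per-image statistic: std of each image's channel, then averaged over images.
per_image_std = images.std(axis=(2, 3)).mean(axis=0)

print('global std:    ', global_std)
print('per-image std: ', per_image_std)
```

The gap between the two comes from the variation of the image means, which only the global statistic includes.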

py3 error

Please add a better error message to indicate that Python 3 is not supported.

Slower than TF?

I'm testing the speed-up of ResNet on TF and PyTorch.

In TF, it typically converges within 80k steps (that is, 80k batches); with batch-size=128, that corresponds to about 205 epochs in PyTorch.
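
The step-to-epoch conversion can be checked with a few lines. This is only the arithmetic, assuming the 50,000-image CIFAR10 training set; `steps_to_epochs` is a hypothetical helper name:

```python
def steps_to_epochs(steps, batch_size, dataset_size):
    """Convert a number of optimizer steps into (fractional) epochs."""
    return steps * batch_size / dataset_size

# 80k steps at batch size 128 over 50,000 training images.
epochs = steps_to_epochs(steps=80_000, batch_size=128, dataset_size=50_000)
print(epochs)  # 204.8
```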

One interesting thing is, in TF I can finish 80k steps in about 6 hours. But in PyTorch, running 200 epochs took me around 13 hours. And this will expand to around 20 hours if I want to test 300 epochs.

I thought PyTorch should be much faster than TF. Does anyone know a solution to this? BTW, I'm using an EC2 g2.2xlarge.

Here is the ResNet20 TF implementation

The problem of 'shufflenet.py'

Hello, when I run 'shufflenet.py', I get this error:

Traceback (most recent call last):
  File "/home/w/Documents/code/cifar10/practice/models/shufflenet.py", line 218, in <module>
    test()
  File "/home/w/Documents/code/cifar10/practice/models/shufflenet.py", line 214, in test
    net = ShuffleNetG2()
  File "/home/w/Documents/code/cifar10/practice/models/shufflenet.py", line 202, in ShuffleNetG2
    return ShuffleNet(cfg)
  File "/home/w/Documents/code/cifar10/practice/models/shufflenet.py", line 171, in __init__
    self.layer1 = self._make_layer(out_planes[0], num_blocks[0], groups)
  File "/home/w/Documents/code/cifar10/practice/models/shufflenet.py", line 181, in _make_layer
    layers.append(Bottleneck(self.in_planes, out_planes-cat_planes, stride=stride, groups=groups))
  File "/home/w/Documents/code/cifar10/practice/models/shufflenet.py", line 139, in __init__
    self.conv1 = nn.Conv2d(in_planes, mid_planes, kernel_size=1, groups=g, bias=False)
  File "/home/w/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 297, in __init__
    False, _pair(0), groups, bias)
  File "/home/w/.local/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 33, in __init__
    out_channels, in_channels // groups, *kernel_size))
TypeError: new() received an invalid combination of arguments - got (float, int, int, int), but expected one of:

(torch.device device)
(tuple of ints size, torch.device device)
(torch.Storage storage)
(Tensor other)
(object data, torch.device device)

Have you run into the same problem? Could you help me solve this bug? Thanks a lot!
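
The `got (float, int, int, int)` part of the message suggests that a channel count reached `nn.Conv2d` as a float. A likely cause, stated here as an assumption since the file itself is not shown, is a `/` division that returned an int under Python 2 but returns a float under Python 3. A minimal sketch with a hypothetical `out_planes` value:

```python
out_planes = 200  # hypothetical channel count, for illustration only

mid_true_div = out_planes / 4    # Python 3: true division, always a float
mid_floor_div = out_planes // 4  # floor division keeps the result an int

print(type(mid_true_div).__name__, mid_true_div)    # float 50.0
print(type(mid_floor_div).__name__, mid_floor_div)  # int 50
```

If that is what happens here, switching the division to `//` (or wrapping it in `int(...)`) would pass an integer channel count to the layer.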

VGG16 Accuracy

Can you tell me the training settings for VGG16?
I think you use SGD with momentum=0.9, weight_decay=5e-4, and the following lr schedule:
0.1 for epoch [0,150)
0.01 for epoch [150,250)
0.001 for epoch [250,350)
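
Written out as code, the schedule guessed at above looks like the sketch below. Whether these are the settings actually used is exactly the open question of this issue; `lr_for_epoch` is a hypothetical helper name:

```python
def lr_for_epoch(epoch):
    """Piecewise-constant learning rate for the schedule listed above."""
    if epoch < 150:
        return 0.1
    if epoch < 250:
        return 0.01
    return 0.001

for epoch in (0, 149, 150, 249, 250, 349):
    print(epoch, lr_for_epoch(epoch))
```

In PyTorch, the same decay can be expressed with `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 250], gamma=0.1)`, starting from lr=0.1.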

can't achieve the reported accuracy on VGG19

Has anyone trained a VGG19 network using this code? I used exactly the same code as in the repository, except that I used an lr_scheduler to change the learning rate automatically, but I can only reach an accuracy of around 88%. Can anyone tell me what's wrong?

How to extract relu layers from the forward function

Hi,

How can I extract the two ReLU layers used in the forward function of BasicBlock?

When I look at model.modules(),
I don't see relu layers, I only see:

<bound method ResNet.modules of ResNet(
  (conv1): Conv2d (3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d (16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True)
      (conv2): Conv2d (16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True)
      (shortcut): Sequential(
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d (16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True)
      (conv2): Conv2d (16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True)
      (shortcut): Sequential(
      )
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d (16, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (bn1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True)
      (conv2): Conv2d (32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True)
      (shortcut): Sequential(
        (0): Conv2d (16, 32, kernel_size=(1, 1), stride=(2, 2))
        (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d (32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True)
      (conv2): Conv2d (32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True)
      (shortcut): Sequential(
      )
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d (32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
      (conv2): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
      (shortcut): Sequential(
        (0): Conv2d (32, 64, kernel_size=(1, 1), stride=(2, 2))
        (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
      (conv2): Conv2d (64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
      (shortcut): Sequential(
      )
    )
  )
  (linear): Linear(in_features=64, out_features=10)
)>
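
`F.relu` is a function call inside `forward`, not a registered submodule, which is why it does not show up in a printout like this. One way to make the activations visible, sketched here with a hypothetical `TinyBlock` and not the repository's `BasicBlock`, is to use `nn.ReLU` modules and attach forward hooks to them:

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Hypothetical block; uses nn.ReLU modules instead of F.relu calls."""
    def __init__(self, planes=16):
        super().__init__()
        self.conv1 = nn.Conv2d(planes, planes, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.relu2 = nn.ReLU()

    def forward(self, x):
        out = self.relu1(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu2(out + x)  # identity shortcut

block = TinyBlock()
activations = {}

def save_output(name):
    # Returns a hook that stores the module's output under the given name.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in block.named_modules():
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(save_output(name))

block(torch.randn(2, 16, 8, 8))
print(sorted(activations))  # ['relu1', 'relu2']
```

Each ReLU needs its own module instance for this to work; a single shared `nn.ReLU` would register only one hook and overwrite the stored output.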

Why track the loss and accuracy on test mode?

What is the point of tracking the loss and accuracy iterating over the test set?

If your model is already trained, why don't you simply compute the accuracy over the entire test set and that's it?

small mistake(?) at the inception class code

Hello @kuangliu, I would like to thank you for your work 👍. I think there is a small mistake in the GoogLeNet model: in the Inception class, you added one more conv2d in the 5x5 branch. Shouldn't it be just one 1x1 conv followed by one 5x5 conv, and not two?
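
One possible reading, which is an assumption and not something the author has confirmed here: two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution while using fewer weights, a factorization used in later Inception variants. The arithmetic, with hypothetical helper names:

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field

def weights_per_channel_pair(kernel_sizes):
    """Weights per (input channel, output channel) pair, ignoring biases."""
    return sum(k * k for k in kernel_sizes)

print(stacked_receptive_field([5]), weights_per_channel_pair([5]))        # 5 25
print(stacked_receptive_field([3, 3]), weights_per_channel_pair([3, 3]))  # 5 18
```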

How long to train pretrained resnet?

Hi! thanks for the repository :)

How long did you train your pretrained resnet 18?

I tried 150, 100, 100 epochs with learning rates decaying as you specified on pre-resnet-50, but I only got ~94.6% accuracy.

Module Not Found Error: No module named 'vgg'

I am having a problem and I am not able to run it

  File "/Desktop/Another Cifar/pytorch-cifar-master/main.py", line 16, in <module>
    from models import *

  File "/Desktop/Another Cifar/pytorch-cifar-master/models/__init__.py", line 1, in <module>
    from vgg import *

ModuleNotFoundError: No module named 'vgg'

Is it because I am using python 3.6?
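
A plausible cause, going by the `from vgg import *` line in the traceback: Python 3 no longer resolves such implicit relative imports, so a sibling module inside a package has to be imported as `from .vgg import *`. The sketch below builds a throwaway package (`demo_models` and its `VGG` class are hypothetical) to show the explicit form working:

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    pkg_dir = os.path.join(root, 'demo_models')  # hypothetical package name
    os.makedirs(pkg_dir)

    # A sibling module inside the package.
    with open(os.path.join(pkg_dir, 'vgg.py'), 'w') as f:
        f.write('class VGG:\n    pass\n')

    # Explicit relative import: the leading dot means "from this package".
    with open(os.path.join(pkg_dir, '__init__.py'), 'w') as f:
        f.write('from .vgg import *\n')

    sys.path.insert(0, root)
    try:
        demo_models = importlib.import_module('demo_models')
    finally:
        sys.path.remove(root)

print(hasattr(demo_models, 'VGG'))  # True
```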

error showed when I tried to run --resume

Hi, I was trying to resume from a checkpoint and ran into an error.

==> Resuming from checkpoint..
Traceback (most recent call last):
  File "cifar10.py", line 59, in <module>
    checkpoint = torch.load('./checkpoint/ckpt.t7')
  File "/home/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 231, in load
    return _load(f, map_location, pickle_module)
  File "/home/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 379, in _load
    result = unpickler.load()
AttributeError: Can't get attribute 'Net' on <module '__main__' from 'cifar10.py'>

Can anyone help me?
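
The error suggests the checkpoint was pickled with a whole model object whose class (`Net`) is not defined in the script doing the loading. A sketch of one way around this, assuming you control how the checkpoint is written: save the `state_dict` instead, and rebuild the model before loading. The `nn.Linear` model and the in-memory buffer are stand-ins for illustration:

```python
import io

import torch
import torch.nn as nn

net = nn.Linear(4, 2)  # stand-in for the real network

# Save only tensors and plain Python values, not the model object itself.
buffer = io.BytesIO()  # stand-in for a checkpoint file on disk
torch.save({'net': net.state_dict(), 'acc': 0.0, 'epoch': 0}, buffer)

# Later: build the same architecture, then load the weights into it.
buffer.seek(0)
checkpoint = torch.load(buffer)
restored = nn.Linear(4, 2)
restored.load_state_dict(checkpoint['net'])

print(torch.equal(net.weight, restored.weight))  # True
```

For an existing checkpoint that already contains a pickled model, the class definition has to be importable under the same module and name at load time.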

Why is conv1's stride changed from 2 to 1 for the CIFAR10 dataset in the MobileNetV2 architecture?

Can you tell me why guyz?

When I changed conv1's stride from 1 to 2, an error occurred.
In the models/mobilenetv2.py file, I changed

self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, stride=1, padding=0, bias=False)
to
self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, stride=2, padding=0, bias=False)

and
cfg = [(1,  16, 1, 1),
       (6,  24, 2, 1),  # NOTE: change stride 2 -> 1 for CIFAR10
       (6,  32, 3, 2),
       (6,  64, 4, 2),
       (6,  96, 3, 1),
       (6, 160, 3, 2),
       (6, 320, 1, 1)]
to
cfg = [(1,  16, 1, 1),
       (6,  24, 2, 2),  # NOTE: change stride 2 -> 1 for CIFAR10
       (6,  32, 3, 2),
       (6,  64, 4, 2),
       (6,  96, 3, 1),
       (6, 160, 3, 2),
       (6, 320, 1, 1)]

and ran python main.py, which produced this error:
RuntimeError: The size of tensor a (8) must match the size of tensor b (16) at non-singleton dimension 3

So is it just for matching the tensor sizes?
Only for that?

If so, isn't it different from the structure in the MobileNetV2 paper?

Help me guyz :)
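
A mismatch like 8 versus 16 is what the convolution output-size formula predicts when one branch of a block is strided and the branch it is added to is not: a stride-2 1x1 convolution halves the feature map, while a shortcut that passes its input through unchanged keeps the original size. Which layer actually produced the error is not visible from the message, so the sketch below only shows the arithmetic, with hypothetical helper names. It also shows why fewer stride-2 layers are a common choice for 32x32 inputs: five halvings take a 224x224 image to 7x7 but a 32x32 image to 1x1.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor division, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

# A 1x1 convolution on a 16x16 map: stride 1 keeps the size, stride 2 halves it.
print(conv_out_size(16, kernel=1, stride=1, padding=0))  # 16
print(conv_out_size(16, kernel=1, stride=2, padding=0))  # 8

def size_after_halvings(size, num_stride2_layers):
    """Apply several 3x3, stride-2, padding-1 convolutions in sequence."""
    for _ in range(num_stride2_layers):
        size = conv_out_size(size, kernel=3, stride=2, padding=1)
    return size

print(size_after_halvings(224, 5))  # 7  (ImageNet-sized input, five halvings)
print(size_after_halvings(32, 5))   # 1  (CIFAR10 input, same five halvings)
print(size_after_halvings(32, 3))   # 4  (CIFAR10 input, three halvings)
```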

ShuffleNet difference with paper at first convolution

Hi @kuangliu,
Thank you for these great models!
For ShuffleNet, I noticed a difference between your implementation and the original paper. Before the ShuffleNet stages, you do a 1x1 convolution, followed by batch normalization and ReLU: out = F.relu(self.bn1(self.conv1(x))), but in the original paper they perform a 3x3 convolution with stride 2 followed by a 3x3 maxpool with stride 2.
Was this done intentionally? Thank you in advance
TheFlux7

Why didn't you use ReLU in ResNet?

Great work, thanks for sharing your code. I've noticed that your ResNet implementation does not use an activation function. Could you please tell me why?
The official PyTorch ResNet code and the original paper both use the ReLU activation function.

This architecture is not following the paper for CIFAR-10

Hi,

In the ResNet publication they propose a different architecture when it comes to CIFAR-10, right?

The plain/residual architectures follow the form in Fig. 3 (middle/right). The network inputs are 32×32 images, with the per-pixel mean subtracted. The first layer is 3×3 convolutions. Then we use a stack of 6n layers with 3×3 convolutions on the feature maps of sizes {32, 16, 8} respectively, with 2n layers for each feature map size. The numbers of filters are {16, 32, 64} respectively. The subsampling is performed by convolutions with a stride of 2. The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. There are totally 6n+2 stacked weighted layers. The following table summarizes the architecture:

There are only 3 groups of layers, and the numbers of filters are [16, 32, 64], not [64, 128, 256, 512] as for ImageNet.
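
The 6n+2 count from the quoted passage can be tabulated directly: one first convolution, 6n stacked convolutions (2n per feature-map size), and one fully-connected layer. A small sketch, with `cifar_resnet_depth` as a hypothetical helper name:

```python
def cifar_resnet_depth(n):
    """Total weighted layers: 1 first conv + 6n stacked convs + 1 FC layer."""
    return 6 * n + 2

for n in (3, 5, 7, 9, 18):
    print(n, cifar_resnet_depth(n))
# 3 -> 20, 5 -> 32, 7 -> 44, 9 -> 56, 18 -> 110
```

These depths do not include 18, 50, or 101, which are depths of the ImageNet-style variants.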

Accuracy of Resnet50 is much higher than reported!

EDIT: Originally I titled this issue "What epoch are reported results from?", but after further results came to light I've renamed it to: "Accuracy of Resnet50 is much higher than reported!".

I've been reproducing some of these experiments and my output numbers don't exactly line up. At the moment I'm assuming it's due to different random number seeds for the Kaiming Normal initialization.

Is the accuracy you report always the accuracy of the test set on the 350th epoch? Or are you reporting the accuracy of the best epoch?

In my reproduction of the DPN92 experiment, my measured accuracy at the last epoch is 94.92%, but the highest overall accuracy was 95.10% at epoch 275.

Surprisingly, when I ran the Resnet50 example I got an accuracy of 95.72%! (but this is likely some issue on my end) (Edit: Actually it doesn't seem to be; see below)

Question regarding epoch vs. accuracy

Hi,

I have a simple question regarding the epoch vs. accuracy.

Like these,
VGG16 | 92.64%
ResNet18 | 93.02%
ResNet50 | 93.62%

Suppose someone builds their own network and trains it for 100,000 epochs, and it reaches a maximum accuracy of 95% at epoch 100,000, after which the accuracy declines.
And suppose ResNet18 reaches its maximum of 93.02% at epoch 200 and its accuracy declines after that.
Can we say the custom network is better than ResNet18?

Do we have to compare accuracy between two networks at the same number of epochs, or only by maximum accuracy (regardless of the number of epochs)?

Let's say that at epoch 200 ResNet18 has 93.02% and the custom network has around 80%. Is it possible to say ResNet18 is better?

Thanks,

preactivation resnet

Hi --

I was looking at the implementation of the preactivation ResNet and had a question -- in the paper, they show the preactivation block below (on the right):
[screenshot: preactivation block figure from the paper]

which AFAICT should look like:

    def forward(self, x):
        shortcut = self.shortcut(x)
        out = self.bn1(x)
        out = F.relu(out)
        out = self.conv1(out)
        
        out = self.bn2(out)
        out = F.relu(out)
        out = self.conv2(out)
        
        out += shortcut
        return out

which is slightly different from your implementation:

    def forward(self, x):
        out = self.bn1(x)
        out = F.relu(out)
        shortcut = self.shortcut(out)
        out = self.conv1(out)
        
        out = self.bn2(out)
        out = F.relu(out)
        out = self.conv2(out)
        
        out += shortcut
        return out

In the picture, I think your implementation would look like shifting the first bn and relu into the grey band, then splitting into two branches.

Any thoughts? Am I misinterpreting, or is there a reason for the difference?

Why don't we have validation data during CIFAR training?

Please correct me if I'm wrong, but shouldn't we have a validation dataset while we train?
I am looking at the CIFAR examples and I noticed that we don't have a validation dataset; we just have train and test.
Should we skip validation? Why?
Can you please explain it to me?

Thanks

Some questions on resnet performance

Hi,
I downloaded your code and tested variations of ResNet using the same settings,
but I can't seem to get anywhere near the accuracy mentioned (regardless of the model, all of my tests seem to plateau at around 88~89%). Did you apply any additional measures, such as weight decay or lowering the learning rate?
Neat implementation, by the way.

License?

Could you please add a LICENSE file to this repo? What license are you making the code available under?

Many thanks!

Uses test data to select models during training

Based on my reading of this code, it appears like there is currently a split into only train and test datasets, and test set accuracy is used to select the best model during training. This appears to result in a quite overoptimistic estimate of the test performance of the model (especially due to the 3 stages of learning with different learning rates), because test accuracies were already used for training time model selection (the model is "cheating" by learning on the test data). In particular, for the VGG16 model I noticed the test accuracy jumps by around +10% between the end of the first training stage and the beginning of the second training stage (i.e. between lr=0.1 and lr=0.01).

In theory, if you do want to select models without this "cheating" effect then a way to address this is to split into 3 subsets of the data: train, test, and validation, use validation to pick the best model, and only ever evaluate your model on the test set, and do not choose between models based on test set accuracy. See this StackExchange post for details:

https://stats.stackexchange.com/questions/152907/how-do-you-use-test-data-set-after-cross-validation
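
The three-way split described above could be sketched as follows. The 45,000/5,000 sizes and the helper name `split_indices` are assumptions for illustration; the repository itself only uses train and test:

```python
import random

def split_indices(num_samples, num_val, seed=0):
    """Shuffle indices once with a fixed seed and hold out the last num_val."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    return indices[:-num_val], indices[-num_val:]

# Carve a validation set out of the 50,000 CIFAR10 training images.
train_idx, val_idx = split_indices(num_samples=50_000, num_val=5_000)
print(len(train_idx), len(val_idx))  # 45000 5000
```

The two index lists can then be wrapped with `torch.utils.data.Subset` so that model selection uses the validation part and the test set is evaluated only once at the end.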

However, in practice it seems that for CIFAR10 the models simply need more regularization by using both batch normalization and dropout, and it suffices to simply always report the test accuracy at the latest epoch (thus in practice, there is no need for the 3 way test/train/validate split). See e.g. the below repo or paper:

https://github.com/chengyangfu/pytorch-vgg-cifar10
http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7486599

LeNet performance request

Firstly, thank you for a comprehensive evaluation of all these models on CIFAR using PyTorch.

I noticed you have a LeNet network defined in your models, which is more-or-less identical to the one used in the PyTorch CIFAR example. Do you have the performance of this model on CIFAR-10?

It is a simple baseline, and I want to make sure my numbers are matching (getting only 65% test accuracy, which seems a bit suspicious).

Progress_bar value unpack error

    from utils import progress_bar
  File "C:\Users\Debadri\pytorch-cifar\utils.py", line 45, in <module>
    _, term_width = os.popen('stty size', 'r').read().split()
ValueError: not enough values to unpack (expected 2, got 0)
I'm getting this error.
I'm writing this command:
"python main.py"

Is there any argument I need to pass?
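
The path in the traceback points to Windows, where the `stty` command does not exist, so `os.popen('stty size')` returns an empty string and the unpacking fails. A portable alternative from the standard library, shown as a sketch and not as the repository's fix:

```python
import shutil

# Falls back to 80 columns when the size cannot be determined;
# unlike `stty`, this also works on Windows.
term_width = shutil.get_terminal_size(fallback=(80, 24)).columns
print(term_width)
```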

What does the accuracy refer to?

What does the accuracy refer to?
Train score or test score?

I have tested with ResNet34. The test accuracy isn't anywhere close to the reported one.

Can someone please clarify ?

VGG16 Model and Performance on CIFAR-10

I ran the VGG16 model on CIFAR-10, but the accuracy on the test images is only 89%.
I set epoch=350 and changed the learning rate automatically as you described.
Also, why do you set classifier=nn.Linear(512, 10)? How is the number 512 determined?
Thanks!
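
The 512 can be traced by following shapes through the network. The sketch below assumes the usual VGG16 layout (numbers are the output channels of 3x3, padding-1 convolutions; 'M' is a 2x2 max-pool with stride 2), which may differ in detail from the repository's model file:

```python
# Assumed VGG16 configuration.
VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def flattened_features(cfg, in_channels=3, size=32):
    """Track (channels, spatial size) through the layers and flatten."""
    channels = in_channels
    for layer in cfg:
        if layer == 'M':
            size //= 2        # 2x2 max-pool with stride 2 halves height and width
        else:
            channels = layer  # 3x3 conv with padding 1 keeps the spatial size
    return channels * size * size

print(flattened_features(VGG16_CFG))  # 512
```

Five poolings reduce 32x32 to 1x1, leaving 512 channels x 1 x 1 = 512 features. With a 224x224 input the same function gives 512 x 7 x 7 = 25088.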

Why do you crop the data?

In the code that downloads and transforms the dataset, I saw:
transforms.RandomCrop(32, padding=4)
But the original image is already 32x32, so why do we need to crop?
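
With `padding=4`, the image is first padded to 40x40 and a 32x32 window is then cut out at a random position, so the result is a randomly shifted copy of the original, which acts as translation augmentation. A NumPy sketch of the idea (zero padding assumed; `random_crop_with_padding` is a hypothetical helper, not torchvision's implementation):

```python
import numpy as np

def random_crop_with_padding(image, size=32, padding=4, rng=None):
    """Zero-pad an HxWxC image on all sides, then crop a random size x size window."""
    rng = rng or np.random.default_rng()
    padded = np.pad(image, ((padding, padding), (padding, padding), (0, 0)))
    top = rng.integers(0, padded.shape[0] - size + 1)
    left = rng.integers(0, padded.shape[1] - size + 1)
    return padded[top:top + size, left:left + size]

image = np.ones((32, 32, 3), dtype=np.float32)
crop = random_crop_with_padding(image, rng=np.random.default_rng(0))
print(crop.shape)  # (32, 32, 3)
```

There are 9 x 9 = 81 possible window positions, corresponding to shifts of up to 4 pixels in each direction.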
