Coder Social home page Coder Social logo

pytorch-goodies's Introduction

Model Statistics

Number of Parameters

num_params = sum(param.numel() for param in model.parameters())

Number of FLOPS

[...]

Weight Initialization

PyTorch layers are initialized by default in their respective reset_parameters() method. For example:

  • nn.Linear
    • weight and bias: uniform distribution [-limit, +limit] where limit is 1. / sqrt(fan_in) and fan_in is the number of input units in the weight tensor.
  • nn.Conv2D
    • weight and bias: uniform distribution [-limit, +limit] where limit is 1. / sqrt(fan_in) and fan_in is the number of input units in the weight tensor.

With this implementation, the variance of the layer outputs is equal to Var(W) = 1 / 3 * sqrt(fan_in) which isn't the best initialization strategy out there.

Note that PyTorch provides convenience functions for some of the initializations. The input and output shapes are computed using the method _calculate_fan_in_and_fan_out() and a gain() method scales the standard deviation to suit a particular activation.

Xavier Initialization

This initialization is general-purpose and meant to "work" pretty well for any activation in practice.

# default xavier init
for m in model.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform(m.weight)

You can tailor this initialization to your specific activation by using the nn.init.calculate_gain(act) argument.

# default xavier init
for m in model.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform(m.weight(), gain=nn.init.calculate_gain('relu'))

He et. al Initialization

This is a similarly derived initialization tailored specifically for ReLU activations since they do not exhibit zero mean.

# he initialization
for m in model.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal(m.weight, mode='fan_in')

For mode=fan_in, the variance of the distribution is ensured in the forward pass, while for mode=fan_out, it is ensured in the backwards pass.

SELU Initialization

Again, this initialization is specifically derived for the SELU activation function. The authors use the fan_in strategy. They mention that there is no significant difference between sampling from a Gaussian, a truncated Gaussian or a Uniform distribution.

# selu init
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        fan_in = m.kernel_size[0] * m.kernel_size[1] * m.in_channels
        nn.init.normal(m.weight, 0, sqrt(1. / fan_in))
    elif isinstance(m, nn.Linear):
        fan_in = m.in_features
        nn.init.normal(m.weight, 0, sqrt(1. / fan_in))

Orthogonal Initialization

Orthogonality is a desirable quality in NN weights in part because it is norm preserving, i.e. it rotates the input matrix, but cannot change its norm (scale/shear). This property is valuable in deep or recurrent networks, where repeated matrix multiplication can result in signals vanishing or exploding.

for m in model.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.orthogonal(m.weight)

Batch Norm Initialization

for m in model:
    if isinstance(m, nn.BatchNorm2d):
        nn.init.constant(m.weight, 1)
        nn.init.constant(m.bias, 0)

Weight Regularization

L2 Regularization

Heavily penalizes peaky weight vectors and encourages diffuse weight vectors. Has the appealing property of encouraging the network to use all of its inputs a little rather that some of its inputs a lot.

reg = 1e-6
l2_loss = Variable(torch.FloatTensor(1), requires_grad=True)
for name, param in model.named_parameters():
    if 'bias' not in name:
        l2_loss = l2_loss + (0.5 * reg * torch.sum(torch.pow(W, 2)))

L1 Regularization

Encourages sparsity, meaning we encourage the network to select the most useful inputs/features rather than use all.

reg = 1e-6
l1_loss = Variable(torch.FloatTensor(1), requires_grad=True)
for name, param in model.named_parameters():
    if 'bias' not in name:
        l1_loss = l1_loss + (reg * torch.sum(torch.abs(W)))

Orthogonal Regularization

Improves gradient flow by keeping the matrix norm close to unitary.

reg = 1e-6
orth_loss = Variable(torch.FloatTensor(1), requires_grad=True)
for name, param in model.named_parameters():
    if 'bias' not in name:
        param_flat = param.view(param.shape[0], -1)
        sym = torch.mm(param_flat, torch.t(param_flat))
        sym -= Variable(torch.eye(param_flat.shape[0]))
        orth_loss = orth_loss + (reg * sym.sum())

Max Norm Constraint

If a hidden unit's weight vector's L2 norm L ever gets bigger than a certain max value c, multiply the weight vector by c/L. Enforce it immediately after each weight vector update or after every X gradient update.

This constraint is another form of regularization. While L2 penalizes high weights using the loss function, "max norm" acts directly on the weights. L2 exerts a constant pressure to move the weights near zero which could throw away useful information when the loss function doesn't provide incentive for the weights to remain far from zero. On the other hand, "max norm" never drives the weights to near zero. As long as the norm is less than the constraint value, the constraint has no effect.

def max_norm(model, max_val=3, eps=1e-8):
    for name, param in model.named_parameters():
        if 'bias' not in name:
            norm = param.norm(2, dim=0, keepdim=True)
            desired = torch.clamp(norm, 0, max_val)
            param = param * (desired / (eps + norm))

Batch Normalization

[...]

Dropout

[...]

Optimization Misc.

Correct Validation Strategies

[...]

References

  • Thanks to Zijun Deng for inspiring the code for the segmentation metrics.

pytorch-goodies's People

Contributors

kevinzakka avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.