Compressing seq2seq (distiller, closed, 9 comments)

intellabs commented on August 26, 2024
Compressing seq2seq

Comments (9)

nzmora commented on August 26, 2024

Hi Vinu,
To answer you properly I would need to write up something longer explaining the design, but until then, here's my answer.

Function threshold_model is indeed not used, and perhaps I should remove it.
https://github.com/NervanaSystems/distiller/blob/9374732d4993ef23b2e28433422fafd4b1cdab12/distiller/pruning/pruner.py#L32

The pruning flow is this:

    For each epoch:
        compression_scheduler.on_epoch_begin(epoch)
        train()
        validate()
        save_checkpoint()
        compression_scheduler.on_epoch_end(epoch)

    train():
        For each training step:
            compression_scheduler.on_minibatch_begin(epoch)
            output = model(input_var)
            loss = criterion(output, target_var)
            compression_scheduler.before_backward_pass(epoch)
            loss.backward()
            optimizer.step()
            compression_scheduler.on_minibatch_end(epoch)

compression_scheduler.on_epoch_begin(epoch)
==> pruning_policy.on_epoch_begin()
====> pruner.set_param_mask(...)
This tells the pruner associated with this instance of PruningPolicy to compute and set the pruning mask. There is a binary (0/1) mask for each parameter tensor. When the mask is applied, each tensor coefficient is either zeroed out (its mask value is 0) or kept unchanged (its mask value is 1).
So the Pruner computes the mask at the beginning of each epoch (if the compression_scheduler determines that this PruningPolicy should be scheduled in this epoch).
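
For concreteness, preparing a magnitude-based mask boils down to something like this (a simplified sketch; Distiller's threshold_mask helper does essentially this):

    import torch

    def magnitude_mask(param, threshold):
        # 1.0 where the coefficient survives, 0.0 where it will be pruned
        return torch.gt(torch.abs(param), threshold).type(param.type())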

Note that so far, we only prepared the mask - we have not applied it yet.

Now we start training the epoch and we invoke:
compression_scheduler.on_minibatch_begin(epoch)
==> pruning_policy.on_minibatch_begin()
====> zeros_mask_dict[param_name].apply_mask(param)

This last line is responsible for applying the binary mask (we do this by multiplying the tensor with its mask).
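
In other words, apply_mask amounts to an in-place multiply (a simplified sketch, not the exact ParameterMasker code):

    # inside ParameterMasker.apply_mask (simplified):
    param.data.mul_(self.mask)  # pruned coefficients become 0; the rest are unchanged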

To summarize: at the beginning of each epoch, the CompressionScheduler checks if we are supposed to schedule (i.e. invoke the Pruner). If yes, then we call into the Pruner and ask it to prepare the mask.
Then, at the beginning of each training step (mini-batch), we apply the mask to the weight tensors.
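
For completeness, wiring a pruner into the scheduler looks roughly like this (a sketch; I'm assuming the PruningPolicy and add_policy signatures at this commit, and the model, thresholds and epoch range are illustrative - in practice the schedule is usually built from a YAML file):

    import torch.nn as nn
    import distiller
    from distiller.policy import PruningPolicy
    from distiller.pruning.magnitude_pruner import MagnitudeParameterPruner

    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))
    pruner = MagnitudeParameterPruner('my_pruner', {'*': 0.2})

    scheduler = distiller.CompressionScheduler(model)
    policy = PruningPolicy(pruner, pruner_args=None)
    # masks are recomputed at each epoch start in [0, 10) and applied at each minibatch start
    scheduler.add_policy(policy, starting_epoch=0, ending_epoch=10, frequency=1)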

I hope this helps,
Neta

nzmora commented on August 26, 2024

Hi Vinu,

[1] Correct
[2] True, except for a big exception when removing channels/filters (this needs to be fixed). When pruning channels, in the current implementation: I set the corresponding channels to zero (i.e. prune).
When "thinning" channels: I remove those (output) channels, but I also look if there are convolutions feeding the pruned convolutions. If there are, then I remove their corresponding filters (and biases). And if there are BN layers between the convolutions, I adjust their parameters, too. The same is true when we prune/thin filters, with the details being somewhat different. I'll write-up a tutorial about this sometime later. If you are doing fine pruning (element-wise pruning), then there is no such thing as thinning - only pruning.
[3] Again, if you want to study channel/filter pruning, I think the best way is to do thinning. There might be a positive correlation between the behavior of a channel-pruned and a channel-thinned network, by which I mean that perhaps the best performing channel-pruning configuration is also the best performing channel-thinning configuration. This is a hunch, and I haven't studied it.

[3b] Correct, the code blocks you marked with

    # Remove filters --> THIS IS NOT MANDATORY FOR PRUNING STUDIES ?

are indeed not necessary if not thinning.
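
To make the thinning bookkeeping in [2] concrete, here is a rough sketch for a conv1 -> BN -> conv2 chain. This is illustrative only (thin_conv_pair and keep are made-up names; Distiller's actual remove_filters does this bookkeeping across the whole model graph):

    import torch.nn as nn

    def thin_conv_pair(conv1, bn, conv2, keep):
        # keep: indices of conv1's output filters to retain
        conv1.weight = nn.Parameter(conv1.weight.data[keep].clone())
        if conv1.bias is not None:
            conv1.bias = nn.Parameter(conv1.bias.data[keep].clone())
        conv1.out_channels = len(keep)
        # the BN layer between the convs keeps only the matching parameters/statistics
        bn.weight = nn.Parameter(bn.weight.data[keep].clone())
        bn.bias = nn.Parameter(bn.bias.data[keep].clone())
        bn.running_mean = bn.running_mean[keep].clone()
        bn.running_var = bn.running_var[keep].clone()
        bn.num_features = len(keep)
        # conv2 consumed those channels as input: drop the matching input slices
        conv2.weight = nn.Parameter(conv2.weight.data[:, keep].clone())
        conv2.in_channels = len(keep)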

Cheers,
Neta

vinutah commented on August 26, 2024

Hi Neta, sure, I will make an attempt at adding this example.

nzmora commented on August 26, 2024

Cool - please tell me if you need anything.

vinutah commented on August 26, 2024

Hi Neta,

I had a basic question about the pruning flow; in particular, the point at which the mask set by _ParameterPruner.set_param_mask() is applied to the model parameters.

Here was my question.

In pruner.py, I notice this method:
https://github.com/NervanaSystems/distiller/blob/9374732d4993ef23b2e28433422fafd4b1cdab12/distiller/pruning/pruner.py#L32
It seems to be responsible for actually applying the mask that could be set by set_param_mask of the child classes of _ParameterPruner, but from a static search I did not find a single usage of this method anywhere.

So if this https://github.com/NervanaSystems/distiller/blob/9374732d4993ef23b2e28433422fafd4b1cdab12/distiller/pruning/pruner.py#L32 method is not used, I wanted to find out where the masks are applied to the model parameters.

The best answer I could come up with is https://github.com/NervanaSystems/distiller/blob/9374732d4993ef23b2e28433422fafd4b1cdab12/distiller/scheduler.py#L36 .

I am a little confused as to why you decided to have this ParameterMasker class https://github.com/NervanaSystems/distiller/blob/9374732d4993ef23b2e28433422fafd4b1cdab12/distiller/scheduler.py#L29 with the apply_mask method.

vinutah commented on August 26, 2024

Thank you for the explanation.

vinutah commented on August 26, 2024

Hi Neta,

I have a few follow-up queries.

[1]
Unless one really wants to test the pruning functions, as in https://github.com/NervanaSystems/distiller/blob/9374732d4993ef23b2e28433422fafd4b1cdab12/tests/test_pruning.py#L153 ,
or really remove/modify the network architecture, as in https://github.com/NervanaSystems/distiller/blob/9374732d4993ef23b2e28433422fafd4b1cdab12/examples/automated_deep_compression/ADC.py#L306 , one need not worry about the thinning functions.

[2] Pruning just sets zeros in the 4D weight tensors without modifying the network architecture in any manner. In some sense, it is a simulation of the thinning process, whereas thinning is the real effect of pruning a network.

[3] If one wants to study the effect of pruning on a DNN, need we use the thinning utilities at all? Is just masking the weights a good estimate for sparsity studies?

To be very specific: if we take https://github.com/NervanaSystems/distiller/blob/9374732d4993ef23b2e28433422fafd4b1cdab12/tests/test_pruning.py#L110 and look at the snippets commented as # Remove filters and # Test thinning,
these lines are not necessary unless we actually want to change the network architecture.

    for pair in config.module_pairs:
        # Test that we can access the weights tensor of the first convolution in layer 1
        conv1_p = distiller.model_find_param(model, pair[0] + ".weight")
        assert conv1_p is not None
        num_filters = conv1_p.size(0)

        # Test that there are no zero-filters
        assert distiller.sparsity_3D(conv1_p) == 0.0

        # Create a filter-ranking pruner
        reg_regims = {pair[0] + ".weight": [ratio_to_prune, "3D"]}
        pruner = distiller.pruning.L1RankedStructureParameterPruner("filter_pruner", reg_regims)
        pruner.set_param_mask(conv1_p, pair[0] + ".weight", zeros_mask_dict, meta=None)

        conv1 = common.find_module_by_name(model, pair[0])
        assert conv1 is not None
        # Test that the mask has the correct fraction of filters pruned.
        # We asked for 10%, but there are only 16 filters, so we have to settle for 1/16 filters
        expected_cnt_removed_filters = int(ratio_to_prune * conv1.out_channels)
        expected_pruning = expected_cnt_removed_filters / conv1.out_channels
        masker = zeros_mask_dict[pair[0] + ".weight"]
        assert masker is not None
        assert distiller.sparsity_3D(masker.mask) == expected_pruning

        # Use the mask to prune
        assert distiller.sparsity_3D(conv1_p) == 0
        masker.apply_mask(conv1_p)
        assert distiller.sparsity_3D(conv1_p) == expected_pruning

        # Remove filters --> THIS IS NOT MANDATORY FOR PRUNING STUDIES ?
        conv2 = common.find_module_by_name(model, pair[1])
        assert conv2 is not None
        assert conv1.out_channels == num_filters
        assert conv2.in_channels == num_filters

    # Test thinning --> THIS IS NOT MANDATORY FOR PRUNING STUDIES ?
    distiller.remove_filters(model, zeros_mask_dict, config.arch, config.dataset, optimizer=None)
    assert conv1.out_channels == num_filters - expected_cnt_removed_filters
    assert conv2.in_channels == num_filters - expected_cnt_removed_filters
    return model, zeros_mask_dict

vinutah commented on August 26, 2024

Thanks Neta!

Another query, related to the most basic magnitude pruner:
https://github.com/NervanaSystems/distiller/blob/9374732d4993ef23b2e28433422fafd4b1cdab12/distiller/pruning/magnitude_pruner.py#L20

Could you please help me understand the usage of this pruner with an example?
In particular: how to set the threshold values (either one for the entire network, or one per layer), and how to pass them using **kwargs. Basically, I am having trouble
[1] understanding the usage of self.thresholds['*']
[2] understanding zeros_mask_dict[param_name].mask - should we call this (set_param_mask) for all param_names in a loop?

    # excerpt from distiller/pruning/magnitude_pruner.py
    from .pruner import _ParameterPruner
    import distiller

    class MagnitudeParameterPruner(_ParameterPruner):
        def __init__(self, name, thresholds, **kwargs):
            super(MagnitudeParameterPruner, self).__init__(name)
            # maps parameter names to thresholds; the '*' key is the default
            self.thresholds = thresholds

        def set_param_mask(self, param, param_name, zeros_mask_dict, meta):
            # per-parameter threshold if one was given, else the '*' default
            threshold = self.thresholds.get(param_name, self.thresholds['*'])
            zeros_mask_dict[param_name].mask = distiller.threshold_mask(param.data, threshold)

A test case for this pruner using a small VGG or ResNet would be ideal.

nzmora commented on August 26, 2024

Hi Vinu,

[0] **kwargs is not currently used - it's there to "absorb" extra keywords that the compression-schedule parser might pass in. But that's kind of a silly reason to have it there, and we might remove it from the function signature sometime.

[1] Regarding self.thresholds['*']: I think the test code and new documentation I committed should explain it best now.

[2] Yes, you can call this in a loop if you want to prune all of the parameters of a model. But be careful: the default self.thresholds['*'] threshold will be used on any parameter that doesn't have an explicit threshold. You can use a pruner directly, no issue with that, but remember that in general, as in the compress_classifier.py example, the CompressionScheduler will invoke a PruningPolicy, which will invoke the Pruner.
MagnitudeParameterPruner is not currently used by any example code besides the new test code, but you can look at SparsityLevelParameterPruner, which is used in this example, because the two pruners are very similar.
I found magnitude pruning pretty hard to use directly, because you really need to know the threshold value for each tensor you want to prune. This value changes per tensor and is hard to get right (it requires a lot of trial and error and is sensitive to the training duration). Therefore, I prefer using a level-pruner.
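
A minimal usage sketch (the toy model and the choice to prune only weight tensors are my assumptions, not Distiller example code):

    import torch.nn as nn
    from distiller.pruning.magnitude_pruner import MagnitudeParameterPruner
    from distiller.scheduler import ParameterMasker

    model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))

    # '*' is the mandatory default threshold; explicit parameter names override it
    thresholds = {'*': 0.2, '0.weight': 0.1}
    pruner = MagnitudeParameterPruner('my_magnitude_pruner', thresholds)

    # one ParameterMasker per parameter we intend to prune
    zeros_mask_dict = {name: ParameterMasker(name)
                       for name, _ in model.named_parameters()}

    for name, param in model.named_parameters():
        if not name.endswith('weight'):
            continue  # skip biases, or the '*' default would prune them too
        pruner.set_param_mask(param, name, zeros_mask_dict, meta=None)  # prepare the mask
        zeros_mask_dict[name].apply_mask(param)                         # zero the weights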
