about fp16 (synchronized-batchnorm-pytorch, open)

vacancy commented on August 18, 2024
about fp16

from synchronized-batchnorm-pytorch.

Comments (5)

vacancy commented on August 18, 2024

I haven't tried fp16 in PyTorch. Do you think it's due to a type mismatch (fp32 vs. fp16)? It would be great if you could help by adding a try-catch in the forward method of the batch norm class. We should first check whether any exceptions are being thrown there.


666zz666 commented on August 18, 2024

Thanks for your help. First, I am using two GPUs. Second, I added a try-catch in the forward method of the _SynchronizedBatchNorm class (batchnorm.py). Then I located the error step by step.

1. batchnorm.py:

    if self._parallel_id == 0:
        mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))

2. comm.py:

    results = self._master_callback(intermediates)

The error is 'An error occured.'

My try-catch looks like this:

    try:
        ...  # original body of forward()
    except IOError:
        print('An error occured trying to read the file.')
    except ValueError:
        print('Non-numeric data found in the file.')
    except ImportError:
        print('NO module found')
    except EOFError:
        print('Why did you do an EOF on me?')
    except KeyboardInterrupt:
        print('You cancelled the operation.')
    except:
        # Catch-all: this is the branch that fired.
        print('An error occured.')


vacancy commented on August 18, 2024

Can you give detailed information about the "error"?

For example, you can wrap the whole function body of forward() in a try/except statement:

try:
    # original code
except:
    import traceback
    traceback.print_exc()
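
The same idea can be packaged as a decorator, which avoids re-indenting the body of forward(). This is only a minimal sketch: `print_traceback` and `DummyNorm` are hypothetical names used for illustration and are not part of synchronized-batchnorm-pytorch. Unlike the bare `except:` above, it catches `Exception` only and re-raises, so the error is not swallowed after being printed.

```python
import functools
import traceback


def print_traceback(fn):
    """Print the full traceback of any exception raised by fn, then re-raise it."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            traceback.print_exc()  # full stack trace goes to stderr
            raise                  # keep the original failure visible to the caller
    return wrapper


class DummyNorm:
    """Hypothetical stand-in for a module whose forward() may fail."""

    @print_traceback
    def forward(self, total, size):
        return total / size
```

Calling `DummyNorm().forward(1.0, 0)` prints the traceback and still raises `ZeroDivisionError`.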


666zz666 commented on August 18, 2024

The detailed information:

    Traceback (most recent call last):
      File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/batchnorm.py", line 68, in forward
        mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))
      File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/comm.py", line 125, in run_master
        results = self._master_callback(intermediates)
      File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/batchnorm.py", line 108, in data_parallel_master
        mean, inv_std = self.compute_mean_std(sum, ssum, sum_size)
      File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/batchnorm.py", line 122, in compute_mean_std
        mean = sum / size
    RuntimeError: value cannot be converted to type at::Half without overflow: 528392


vacancy commented on August 18, 2024

It seems that some values in the tensors exceed the maximum value of fp16... I guess it's the size? Can you double-check?
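
One way to check this guess without PyTorch is to try packing the number from the RuntimeError into an IEEE 754 half-precision float. The largest finite fp16 value is 65504, so 528392 does not fit. The sketch below uses only the standard-library `struct` module; `fits_in_fp16` is a hypothetical helper name.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(value):
    """Return True if value can be stored as a finite half-precision float."""
    try:
        struct.pack("<e", float(value))  # "e" is the format code for binary16
        return True
    except OverflowError:
        return False


size = 528392  # the number reported in the RuntimeError

print(fits_in_fp16(FP16_MAX))  # True
print(fits_in_fp16(size))      # False
```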

I am not an expert on this: is there any solution to this? I think this should be a general problem for fp16 training.
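
One commonly used approach, offered here as an assumption and not as something the thread confirms for this library, is to compute the batch statistics in a wider floating-point type and round only the results (mean and inverse standard deviation) back to fp16. The sketch below emulates this in plain Python, where `float` is fp64 and `struct` does the fp16 rounding. `to_fp16` and `mean_and_inv_std` are hypothetical helpers, not the library's `compute_mean_std`. In PyTorch the analogous step would be to upcast the tensors with `.float()` before dividing and cast back with `.half()` afterwards.

```python
import math
import struct


def to_fp16(x):
    """Round x to the nearest half-precision value (OverflowError if out of range)."""
    return struct.unpack("<e", struct.pack("<e", float(x)))[0]


def mean_and_inv_std(total, total_sq, count, eps=1e-5):
    """Compute statistics in fp64 and round only the final results to fp16.

    total    -- sum of all elements
    total_sq -- sum of squared elements
    count    -- number of elements (may exceed the fp16 range)
    """
    mean = total / count
    var = max(total_sq / count - mean * mean, 0.0)  # biased variance, clamped at 0
    inv_std = 1.0 / math.sqrt(var + eps)
    return to_fp16(mean), to_fp16(inv_std)


# Illustrative numbers: 528392 elements, half of them 0.0 and half 1.0.
count = 528392
mean, inv_std = mean_and_inv_std(264196.0, 264196.0, count)
print(mean, inv_std)
```

The element count and the running sums never pass through fp16 here. Only the mean and the inverse standard deviation do, and both are small enough to be representable.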

