Comments (5)
I haven’t tried fp16 in PyTorch. Do you think it’s due to a type mismatch (fp32 vs. fp16)? It would be great if you could add a try-except in the forward method of the batch norm class. We should first check whether an exception is being thrown there.
from synchronized-batchnorm-pytorch.
Thanks for your help. First, I am using two GPUs. Second, I added a try-except in the forward method of the _SynchronizedBatchNorm class (batchnorm.py) and then traced the error step by step:
1. batchnorm.py:
    if self._parallel_id == 0:
        mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))
2. comm.py:
    results = self._master_callback(intermediates)
The message printed is 'An error occurred.', which comes from the final bare except.
My try-except looks like this:
    try:
        # original code
    except IOError:
        print('An error occurred trying to read the file.')
    except ValueError:
        print('Non-numeric data found in the file.')
    except ImportError:
        print('No module found')
    except EOFError:
        print('Why did you do an EOF on me?')
    except KeyboardInterrupt:
        print('You cancelled the operation.')
    except:
        print('An error occurred.')
Can you give more detailed information about the "error"?
For example, you can wrap the whole function body of forward() in a try-except statement that prints the full traceback:
    try:
        # original code
    except:
        import traceback
        traceback.print_exc()
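The same idea can be written as a small decorator, so the body of forward() does not need to be re-indented. This is only a sketch: `print_traceback` and `forward_stub` are hypothetical names that are not part of the repository, and the stub simply raises an error to stand in for the real forward().

```python
import functools
import traceback


def print_traceback(fn):
    """Print the full traceback of any exception raised by fn, then re-raise it."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            traceback.print_exc()  # full stack trace goes to stderr
            raise                  # keep the original failure visible to the caller
    return wrapper


@print_traceback
def forward_stub(x):
    # Hypothetical stand-in for forward(); the division fails on purpose.
    return x / 0
```

Re-raising after printing keeps the training loop from silently continuing with a broken forward pass.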
Here is the detailed information:
Traceback (most recent call last):
  File "/mnt/data-2/data/cnn_multi_/cnn_multi/sync_batchnorm/batchnorm.py", line 68, in forward
    mean, inv_std = self._sync_master.run_master(_ChildMessage(input_sum, input_ssum, sum_size))
  File "/mnt/data-2/data/cnn_multi/cnn_multi/sync_batchnorm/comm.py", line 125, in run_master
    results = self._master_callback(intermediates)
  File "/mnt/data-2/data/cnn_multi/cnn_multi/sync_batchnorm/batchnorm.py", line 108, in data_parallel_master
    mean, inv_std = self.compute_mean_std(sum, ssum, sum_size)
  File "/mnt/data-2/data/cnn_multi/cnn_multi/sync_batchnorm/batchnorm.py", line 122, in compute_mean_std
    mean = sum / size
RuntimeError: value cannot be converted to type at::Half without overflow: 528392
It seems that some value involved exceeds the maximum that fp16 can represent (65504). I guess it's the size (528392 in your traceback)? Can you double-check?
I am not an expert on this: is there any solution? I think this should be a general problem for fp16 training.
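To check the guess about the size without a GPU, here is a minimal sketch using only the Python standard library. The `struct` format code `"e"` is IEEE 754 half precision, and packing a value that is too large raises OverflowError. The helper names and the value of `total` are made up for illustration; only 528392 comes from the traceback. It also shows one possible workaround, assuming the statistics can be kept in higher precision: do the division outside fp16 and cast only the result down. In PyTorch that would presumably mean calling `.float()` on the sums before dividing, which I have not tested in this repository.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(value):
    """Return True if value converts to half precision without overflow."""
    try:
        struct.pack("<e", float(value))
    except OverflowError:
        return False
    return True


def to_fp16(value):
    """Round value to the nearest half-precision float (OverflowError if too large)."""
    return struct.unpack("<e", struct.pack("<e", float(value)))[0]


size = 528392      # the value reported in the RuntimeError
total = 1234567.0  # hypothetical per-channel sum, kept in higher precision

print(fits_in_fp16(size))     # the divisor itself cannot be stored in fp16
mean = to_fp16(total / size)  # divide in higher precision, then cast down
print(mean)
```

The mean itself is small enough for fp16; it is only the element count (and possibly the raw sums) that goes out of range.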