Nan When training about u-mamba HOT 11 CLOSED

siny1998 commented on August 17, 2024

Nan When training

from u-mamba.

Comments (11)

JunMa11 commented on August 17, 2024

Hi @siny1998 ,

Thanks for your interest.

Does nnUNetTrainerUMambaBot have this issue?

siny1998 commented on August 17, 2024

I haven't used the nnUNetTrainerUMambaBot yet. I'll give it a try.

JiarunLiu commented on August 17, 2024

Hi @JunMa11,

I have a similar issue when training on the Dataset704_Endovis17 dataset with nnUNetTrainerUMambaEnc, but nnUNetTrainerUMambaBot does not have this issue. These are my training logs from the command nnUNetv2_train 704 2d all -tr nnUNetTrainerUMambaEnc:

This is the configuration used by this training:
Configuration name: 2d
 {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 13, 'patch_size': [384, 640], 'median_image_size_in_voxels': [1080.0, 1920.0], 'spacing': [1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization'], 'use_mask_for_norm': [False, False, False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True} 

These are the global plan.json settings:
 {'dataset_name': 'Dataset704_Endovis17', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [999.0, 1.0, 1.0], 'original_median_shape_after_transp': [1, 1080, 1920], 'image_reader_writer': 'NaturalImage2DIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 255.0, 'mean': 100.00444773514101, 'median': 92.0, 'min': 1.0, 'percentile_00_5': 24.0, 'percentile_99_5': 238.0, 'std': 51.584682233895585}, '1': {'max': 249.0, 'mean': 86.51525510193234, 'median': 72.0, 'min': 0.0, 'percentile_00_5': 21.0, 'percentile_99_5': 233.0, 'std': 52.24999179949625}, '2': {'max': 255.0, 'mean': 93.29896387602795, 'median': 79.0, 'min': 0.0, 'percentile_00_5': 21.0, 'percentile_99_5': 244.0, 'std': 56.38243456877845}}} 

2024-01-20 11:56:11.330918: unpacking dataset...
2024-01-20 11:56:11.683550: unpacking done...
2024-01-20 11:56:11.684135: do_dummy_2d_data_aug: False
/mnt/disk2/jiarunliu/Documents/mamba/U-Mamba/umamba/nnunetv2/nets/UMambaEnc.py:41: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert C == self.dim
2024-01-20 11:56:27.301467: Unable to plot network architecture:
2024-01-20 11:56:27.302820: module 'torch.onnx' has no attribute '_optimize_trace'
2024-01-20 11:56:27.350771: 
2024-01-20 11:56:27.350951: Epoch 0
2024-01-20 11:56:27.351222: Current learning rate: 0.01
using pin_memory on device 0
using pin_memory on device 0
2024-01-20 12:01:32.749162: train_loss nan
2024-01-20 12:01:32.749484: val_loss nan
2024-01-20 12:01:32.749587: Pseudo dice [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2024-01-20 12:01:32.749661: Epoch time: 305.4 s
2024-01-20 12:01:32.749717: Yayy! New best EMA pseudo Dice: 0.0

So far, this problem only occurs with Dataset704_Endovis17 + nnUNetTrainerUMambaEnc. I'll try other settings/datasets later.
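
For anyone comparing runs, the log format above can be scanned automatically for the first epoch that reports a NaN loss. The following is a minimal sketch, assuming the log lines look like the excerpt above (timestamp, then "Epoch N", "train_loss X" or "val_loss X"); the function name first_nan_epoch is made up for illustration and is not part of nnU-Net or U-Mamba.

```python
import re

# Matches lines such as "2024-01-20 12:01:32.749162: train_loss nan".
LOSS_LINE = re.compile(r"^\S+ \S+: (train_loss|val_loss) (\S+)\s*$")
# Matches lines such as "2024-01-20 11:56:27.350951: Epoch 0"
# (but not "Epoch time: 305.4 s").
EPOCH_LINE = re.compile(r"^\S+ \S+: Epoch (\d+)\s*$")


def first_nan_epoch(log_text):
    """Return the first epoch whose train or val loss is NaN, or None.

    Hypothetical helper: it only relies on the line layout seen in the
    log excerpt, not on any nnU-Net API.
    """
    epoch = None
    for line in log_text.splitlines():
        m = EPOCH_LINE.match(line)
        if m:
            epoch = int(m.group(1))
            continue
        m = LOSS_LINE.match(line)
        if m and m.group(2).lower() == "nan":
            return epoch
    return None


sample = """\
2024-01-20 11:56:27.350951: Epoch 0
2024-01-20 11:56:27.351222: Current learning rate: 0.01
2024-01-20 12:01:32.749162: train_loss nan
2024-01-20 12:01:32.749484: val_loss nan
2024-01-20 12:01:32.749661: Epoch time: 305.4 s
"""
print(first_nan_epoch(sample))
```

Running it over each trainer's log makes it easy to tell whether the loss is NaN from epoch 0 or collapses later.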

eclipse0922 commented on August 17, 2024

Other cases
Dataset701_AbdomenCT with nnUNetTrainerUMambaEnc:
I ran the command nnUNetv2_train 701 3d_fullres all -tr nnUNetTrainerUMambaEnc
and got the same NaN problem as the others.

I also tried nnUNetv2_train 701 3d_fullres all -tr nnUNetTrainerUMambaBot.
It works without NaN.

Other datasets
702 (3D): Bot and Enc both seem OK
703 (2D): Bot causes negative loss from epoch 1; Enc seems OK
704 (2D): Bot and Enc both seem OK

I didn't train all the way to convergence, so although there was no problem at the beginning, this could change after more epochs.

Also, the default optimizer for the nnUNetTrainerUMambaEnc and nnUNetTrainerUMambaBot classes is SGD.
I switched to nnUNetTrainerAdam instead, and it worked without NaN at the beginning, but NaN appeared again after several epochs.

eclipse0922 commented on August 17, 2024

As with any training model, the combination of optimiser, trainer, and learning rate seems to matter.
I tried SGD with a learning rate of 1e-3, which is much more stable than before (the default is 1e-2).
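
As a generic illustration of why a tenfold smaller step size can turn a diverging run into a stable one, here is a toy sketch of plain gradient descent on a one-dimensional quadratic. It is not the U-Mamba loss and ignores momentum and mini-batch noise; the curvature value 250 is an arbitrary assumption, chosen so that 1e-2 lies above the stability limit 2/curvature = 8e-3 while 1e-3 lies below it.

```python
def gradient_descent(lr, curvature=250.0, x0=1.0, steps=50):
    """Minimise f(x) = 0.5 * curvature * x**2 with plain gradient descent.

    Each step multiplies x by (1 - lr * curvature), so the iterates
    shrink only if |1 - lr * curvature| < 1.
    """
    x = x0
    for _ in range(steps):
        grad = curvature * x  # derivative of 0.5 * curvature * x**2
        x = x - lr * grad
    return x


for lr in (1e-2, 1e-3):
    # lr = 1e-2 blows up (factor -1.5 per step); lr = 1e-3 decays (factor 0.75).
    print(lr, abs(gradient_descent(lr)))
```

In a real network the effective curvature changes during training, which is one possible reason a run can look fine at first and diverge many epochs later.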

JunMa11 commented on August 17, 2024

Hi all,

Thanks for your valuable feedback!
We are testing the model on more datasets. You are welcome to subscribe to our updates here.

innocence0206 commented on August 17, 2024

I got the NaN loss too with nnUNetv2_train 701 3d_fullres all -tr nnUNetTrainerUMambaEnc. I tried rerunning nnUNetv2_plan_and_preprocess -d 701 --verify_dataset_integrity, but it didn't help.

gumayusi3 commented on August 17, 2024

I also have this problem when training on an OCTA dataset.

wyjzll commented on August 17, 2024

Excellent work! I also faced the same problem on several datasets. Dice, train_loss, and val_loss collapsed at random points, after 40+ or 200+ epochs. UMambaBot worked but Enc didn't.

Missyfirst commented on August 17, 2024

@wyjzll I have the same problem as you. UMambaBot worked but Enc didn't.

JunMa11 commented on August 17, 2024

Hi all,

Thanks for your valuable feedback. After diving into the implementation, we found that AMP (automatic mixed precision) leads to NaN.

We have provided new Enc trainers without AMP:

https://github.com/bowang-lab/U-Mamba/blob/main/umamba/nnunetv2/training/nnUNetTrainer/nnUNetTrainerUMambaEncNoAMP.py
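
For readers wondering how reduced precision can produce NaN at all, the sketch below emulates half-precision (float16) rounding with the Python standard library. It is a generic illustration under the assumption that AMP runs some operations in float16; the thread does not say which operation in the Enc model is affected, and the helper names to_fp16 and fp16_exp are made up for this example.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x):
    """Round a Python float to half precision, overflowing to +/-inf."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        # struct's "e" format is IEEE 754 binary16.
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values that do not fit; hardware would give inf.
        return math.copysign(math.inf, x)


def fp16_exp(x):
    """exp() whose result is stored in half precision."""
    try:
        return to_fp16(math.exp(x))
    except OverflowError:
        return math.inf


a = fp16_exp(12.0)  # exp(12) is roughly 1.6e5, above FP16_MAX
b = fp16_exp(11.5)  # exp(11.5) is roughly 9.9e4, also above FP16_MAX
print(a, b, a - b)  # both overflow to inf, and inf - inf is NaN
print(to_fp16(math.exp(11.0)))  # roughly 6.0e4, still finite in fp16
```

In float32 the same values are nowhere near the overflow limit, which is consistent with the observation that disabling AMP removes the NaN.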
