NaN when training (u-mamba, CLOSED)

siny1998 commented on August 17, 2024
NaN when training

from u-mamba.

Comments (11)

JunMa11 commented on August 17, 2024

Hi @siny1998 ,

Thanks for your interest.

Does nnUNetTrainerUMambaBot have this issue?

siny1998 commented on August 17, 2024

I haven't used the nnUNetTrainerUMambaBot yet. I'll give it a try.

JiarunLiu commented on August 17, 2024

Hi @JunMa11,

I have a similar issue when training on the Dataset704_Endovis17 dataset with nnUNetTrainerUMambaEnc, but nnUNetTrainerUMambaBot does not have this issue. These are my training logs for the command nnUNetv2_train 704 2d all -tr nnUNetTrainerUMambaEnc:

This is the configuration used by this training:
Configuration name: 2d
 {'data_identifier': 'nnUNetPlans_2d', 'preprocessor_name': 'DefaultPreprocessor', 'batch_size': 13, 'patch_size': [384, 640], 'median_image_size_in_voxels': [1080.0, 1920.0], 'spacing': [1.0, 1.0], 'normalization_schemes': ['ZScoreNormalization', 'ZScoreNormalization', 'ZScoreNormalization'], 'use_mask_for_norm': [False, False, False], 'UNet_class_name': 'PlainConvUNet', 'UNet_base_num_features': 32, 'n_conv_per_stage_encoder': [2, 2, 2, 2, 2, 2, 2], 'n_conv_per_stage_decoder': [2, 2, 2, 2, 2, 2], 'num_pool_per_axis': [6, 6], 'pool_op_kernel_sizes': [[1, 1], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2], [2, 2]], 'conv_kernel_sizes': [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3], [3, 3]], 'unet_max_num_features': 512, 'resampling_fn_data': 'resample_data_or_seg_to_shape', 'resampling_fn_seg': 'resample_data_or_seg_to_shape', 'resampling_fn_data_kwargs': {'is_seg': False, 'order': 3, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_seg_kwargs': {'is_seg': True, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'resampling_fn_probabilities': 'resample_data_or_seg_to_shape', 'resampling_fn_probabilities_kwargs': {'is_seg': False, 'order': 1, 'order_z': 0, 'force_separate_z': None}, 'batch_dice': True} 

These are the global plan.json settings:
 {'dataset_name': 'Dataset704_Endovis17', 'plans_name': 'nnUNetPlans', 'original_median_spacing_after_transp': [999.0, 1.0, 1.0], 'original_median_shape_after_transp': [1, 1080, 1920], 'image_reader_writer': 'NaturalImage2DIO', 'transpose_forward': [0, 1, 2], 'transpose_backward': [0, 1, 2], 'experiment_planner_used': 'ExperimentPlanner', 'label_manager': 'LabelManager', 'foreground_intensity_properties_per_channel': {'0': {'max': 255.0, 'mean': 100.00444773514101, 'median': 92.0, 'min': 1.0, 'percentile_00_5': 24.0, 'percentile_99_5': 238.0, 'std': 51.584682233895585}, '1': {'max': 249.0, 'mean': 86.51525510193234, 'median': 72.0, 'min': 0.0, 'percentile_00_5': 21.0, 'percentile_99_5': 233.0, 'std': 52.24999179949625}, '2': {'max': 255.0, 'mean': 93.29896387602795, 'median': 79.0, 'min': 0.0, 'percentile_00_5': 21.0, 'percentile_99_5': 244.0, 'std': 56.38243456877845}}} 

2024-01-20 11:56:11.330918: unpacking dataset...
2024-01-20 11:56:11.683550: unpacking done...
2024-01-20 11:56:11.684135: do_dummy_2d_data_aug: False
/mnt/disk2/jiarunliu/Documents/mamba/U-Mamba/umamba/nnunetv2/nets/UMambaEnc.py:41: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert C == self.dim
2024-01-20 11:56:27.301467: Unable to plot network architecture:
2024-01-20 11:56:27.302820: module 'torch.onnx' has no attribute '_optimize_trace'
2024-01-20 11:56:27.350771: 
2024-01-20 11:56:27.350951: Epoch 0
2024-01-20 11:56:27.351222: Current learning rate: 0.01
using pin_memory on device 0
using pin_memory on device 0
2024-01-20 12:01:32.749162: train_loss nan
2024-01-20 12:01:32.749484: val_loss nan
2024-01-20 12:01:32.749587: Pseudo dice [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2024-01-20 12:01:32.749661: Epoch time: 305.4 s
2024-01-20 12:01:32.749717: Yayy! New best EMA pseudo Dice: 0.0

So far, this problem only occurs with Dataset704_Endovis17 + nnUNetTrainerUMambaEnc. I'll try other settings/datasets later.
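
For anyone triaging their own runs, a small log scanner can report the first epoch at which a loss became non-finite. This is only a sketch: it assumes the log line shapes shown above ("Epoch N", "train_loss X", "val_loss X"), and first_nan_epoch is a hypothetical helper, not part of nnU-Net or U-Mamba.

```python
import math
import re

# Matches lines ending in e.g. "Epoch 0" (but not "Epoch time: 305.4 s").
EPOCH_RE = re.compile(r"Epoch (\d+)\s*$")
# Matches lines ending in e.g. "train_loss nan" or "val_loss -0.53".
LOSS_RE = re.compile(r"(train_loss|val_loss)\s+(\S+)\s*$")


def first_nan_epoch(lines):
    """Return (epoch, loss_name) for the first non-finite loss, or None."""
    epoch = None
    for line in lines:
        epoch_match = EPOCH_RE.search(line)
        if epoch_match:
            epoch = int(epoch_match.group(1))
            continue
        loss_match = LOSS_RE.search(line)
        if not loss_match:
            continue
        try:
            value = float(loss_match.group(2))  # float("nan") parses fine
        except ValueError:
            continue  # not a number, ignore the line
        if not math.isfinite(value):
            return epoch, loss_match.group(1)
    return None


if __name__ == "__main__":
    sample = [
        "2024-01-20 11:56:27.350951: Epoch 0",
        "2024-01-20 11:56:27.351222: Current learning rate: 0.01",
        "2024-01-20 12:01:32.749162: train_loss nan",
        "2024-01-20 12:01:32.749484: val_loss nan",
        "2024-01-20 12:01:32.749661: Epoch time: 305.4 s",
    ]
    print(first_nan_epoch(sample))
```

To check a real run, pass the lines of the training log file instead of the sample list.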

eclipse0922 commented on August 17, 2024

Other cases:
Dataset701_AbdomenCT with nnUNetTrainerUMambaEnc.
I ran nnUNetv2_train 701 3d_fullres all -tr nnUNetTrainerUMambaEnc
and got the same problem as the others.

I also tried nnUNetv2_train 701 3d_fullres all -tr nnUNetTrainerUMambaBot.
It works without NaN.

Other datasets:
702 (3D): Bot and Enc both seem OK
703 (2D): Bot produces a negative loss from epoch 1; Enc seems OK
704 (2D): Bot and Enc both seem OK

I didn't train all the way to convergence, so although there was no problem at the beginning, that could change after more epochs.

Also, the default optimizer for the nnUNetTrainerUMambaEnc and nnUNetTrainerUMambaBot classes is SGD.
I switched to nnUNetTrainerAdam instead. It ran without NaN at the beginning, but NaN appeared again after several epochs.

eclipse0922 commented on August 17, 2024

As with training any model, the combination of optimizer, trainer, and learning rate seems to matter.
I tried SGD with a learning rate of 1e-3, and it is much more stable than before (the default is 1e-2).
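
A toy illustration of why the step size alone can decide between stable training and NaN. It runs plain gradient descent on a one-dimensional quadratic; the curvature value 250 is an arbitrary assumption chosen so that 1e-2 is too large and 1e-3 is not, and run_gd is a hypothetical helper. It says nothing about the actual U-Mamba loss surface.

```python
import math


def run_gd(lr: float, curvature: float = 250.0, steps: int = 3000, w0: float = 1.0) -> float:
    """Minimise f(w) = 0.5 * curvature * w**2 with fixed-step gradient descent."""
    w = w0
    for _ in range(steps):
        grad = curvature * w   # df/dw
        w = w - lr * grad      # each step multiplies w by (1 - lr * curvature)
    return w


if __name__ == "__main__":
    for lr in (1e-2, 1e-3):
        w = run_gd(lr)
        status = "finite" if math.isfinite(w) else "non-finite"
        print(f"lr={lr:g}: final w is {status}")
```

With lr=1e-2 the per-step factor is 1 - 2.5 = -1.5, so |w| grows until it overflows to inf, and the next update computes inf - inf, which is NaN. With lr=1e-3 the factor is 0.75 and w shrinks toward 0.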

JunMa11 commented on August 17, 2024

Hi all,

Thanks for your valuable feedback!
We are testing the model on more datasets. You are welcome to subscribe to our updates here.

innocence0206 commented on August 17, 2024

I got the NaN loss too with nnUNetv2_train 701 3d_fullres all -tr nnUNetTrainerUMambaEnc. I tried re-running nnUNetv2_plan_and_preprocess -d 701 --verify_dataset_integrity, but it didn't help.

gumayusi3 commented on August 17, 2024

I also have this problem when training on an OCTA dataset.

wyjzll commented on August 17, 2024

Excellent work! I also faced the same problem on several datasets. Dice, train_loss, and val_loss collapsed at unpredictable points, after 40+ or 200+ epochs. UMambaBot worked but Enc didn't.

Missyfirst commented on August 17, 2024

@wyjzll I have the same problem as you. UMambaBot worked but Enc didn't.

JunMa11 commented on August 17, 2024

Hi all,

Thanks for your valuable feedback. After diving into the implementation, we found that AMP (automatic mixed precision) leads to NaN.

We have provided new Enc trainers without AMP:

https://github.com/bowang-lab/U-Mamba/blob/main/umamba/nnunetv2/training/nnUNetTrainer/nnUNetTrainerUMambaEncNoAMP.py
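
The thread does not say which operation inside the Enc blocks goes wrong under AMP. As a generic illustration only, the sketch below shows one common way reduced precision produces NaN, assuming AMP runs part of the forward pass in float16: an intermediate value exceeds the float16 range, becomes inf, and a later inf / inf yields NaN. It emulates float16 with Python's struct module; to_fp16 and ratio are hypothetical helpers, not code from U-Mamba or nnU-Net.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest half-precision value.

    struct raises OverflowError for magnitudes that do not fit in float16;
    float16 arithmetic on hardware would give +/-inf instead, so emulate that.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def ratio(logit: float, half: bool) -> float:
    """Compute exp(logit) / (exp(logit) + 1), optionally rounding each step to float16."""
    cast = to_fp16 if half else (lambda v: v)
    e = cast(math.exp(logit))      # exp(12) is far above FP16_MAX
    denom = cast(e + 1.0)
    return cast(e / denom)         # inf / inf is NaN in the float16 path


if __name__ == "__main__":
    print("full precision:", ratio(12.0, half=False))
    print("emulated fp16: ", ratio(12.0, half=True))
```

The full-precision path returns a value just below 1, while the emulated float16 path returns NaN for the same input, which is consistent with a trainer that disables AMP avoiding the problem.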
