csguoh / mambair

[ECCV2024] An official PyTorch implementation of the paper "MambaIR: A simple baseline for image restoration with state-space model".

License: Apache License 2.0

Python 99.89% MATLAB 0.11%


mambair's Issues

After training for several epochs, the loss became NaN. I tried reducing the learning rate, but that only helped for a few more epochs before the loss became NaN again. Do you have any experience addressing this problem?

About the “Train on Real Denoising” section

Hello, when I run "python setup.py develop --no_cuda_extgf", I receive "no option --no_cuda_extgf".
How can I address it?

A Humble Request for Assistance

Dear esteemed author,

First and foremost, I would like to express my heartfelt gratitude for your remarkable work. After nearly 5 days of diligent training, the model has successfully completed its learning process. For testing purposes, I selected the final model, MambaIR-main/experiments/MambaIR_SR_x2/models/net_g_latest.pth. If it wouldn't be too much trouble, could you kindly take a moment to review my approach and confirm whether it is correct?

The command I employed for testing is as follows:

python basicsr/test.py -opt options/test/test_MambaIR_SR_x2.yml

I am incredibly appreciative of your thoughtful reminder. As per your guidance, I have made sure to include the path in the test script to avoid any potential errors.


2024-04-11 20:25:22,989 INFO: Dataset [PairedImageDataset] - Set5 is built.
2024-04-11 20:25:22,990 INFO: Number of test images in Set5: 5
2024-04-11 20:25:23,033 INFO: Dataset [PairedImageDataset] - Set14 is built.
2024-04-11 20:25:23,034 INFO: Number of test images in Set14: 14
2024-04-11 20:25:23,111 INFO: Dataset [PairedImageDataset] - B100 is built.
2024-04-11 20:25:23,112 INFO: Number of test images in B100: 100
2024-04-11 20:25:23,187 INFO: Dataset [PairedImageDataset] - Urban100 is built.
2024-04-11 20:25:23,187 INFO: Number of test images in Urban100: 100
2024-04-11 20:25:23,243 INFO: Dataset [PairedImageDataset] - Manga109 is built.
2024-04-11 20:25:23,244 INFO: Number of test images in Manga109: 109
2024-04-11 20:25:23,802 INFO: Network [MambaIR] is created.
2024-04-11 20:25:27,104 INFO: Loading MambaIR model from /aiarena/gpfs/MambaIR-main/experiments/MambaIR_SR_x2/models/net_g_latest.pth, with param key: [params].
2024-04-11 20:25:27,368 INFO: Model [MambaIRModel] is created.
2024-04-11 20:25:27,368 INFO: Testing Set5...
2024-04-11 20:25:33,674 INFO: Validation Set5
         # psnr: 38.3964        Best: 38.3964 @ test_MambaIR_SR_x2 iter
         # ssim: 0.9619 Best: 0.9619 @ test_MambaIR_SR_x2 iter

2024-04-11 20:25:33,674 INFO: Testing Set14...
2024-04-11 20:25:58,600 INFO: Validation Set14
         # psnr: 34.4674        Best: 34.4674 @ test_MambaIR_SR_x2 iter
         # ssim: 0.9245 Best: 0.9245 @ test_MambaIR_SR_x2 iter

2024-04-11 20:25:58,601 INFO: Testing B100...
2024-04-11 20:27:38,379 INFO: Validation B100
         # psnr: 32.4877        Best: 32.4877 @ test_MambaIR_SR_x2 iter
         # ssim: 0.9036 Best: 0.9036 @ test_MambaIR_SR_x2 iter

2024-04-11 20:27:38,379 INFO: Testing Urban100...
2024-04-11 20:37:53,034 INFO: Validation Urban100
         # psnr: 33.7748        Best: 33.7748 @ test_MambaIR_SR_x2 iter
         # ssim: 0.9415 Best: 0.9415 @ test_MambaIR_SR_x2 iter

2024-04-11 20:37:53,036 INFO: Testing Manga109...
2024-04-11 20:50:51,653 INFO: Validation Manga109
         # psnr: 39.7395        Best: 39.7395 @ test_MambaIR_SR_x2 iter
         # ssim: 0.9794 Best: 0.9794 @ test_MambaIR_SR_x2 iter

Is this all right?

Your expertise and assistance in this matter would be immensely valued.

With utmost respect and gratitude,
xaswq

About the Effective Receptive Field (ERF) visualization

Hi, thanks for your work! Could you share the code for the Effective Receptive Field (ERF) visualization in your paper? We are interested in it.

About inference speed

Could you please provide an actual inference speed comparison?

About the readme photo

How did you make the photo like this? So cool.

When the control block is moved, the photo changes from blurry to clear. Interesting 😀

About training speed

Hello, your work is excellent! When I use MambaIR in my own work, I find that training is very slow. For 129500 iterations, the estimated time is 6 days, which is much longer than with a Transformer-based model. Is this normal? What is your opinion? I have already tried reducing the number of feature embedding channels and the number of RSSB blocks, and my image size is 128. Looking forward to your reply!
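When comparing against another model, it can help to convert the estimate into a per-iteration cost. The following is a minimal sketch of that arithmetic using only the figures quoted in the issue; the function name is hypothetical.

```python
def seconds_per_iteration(eta_days: float, total_iters: int) -> float:
    """Convert an estimated total training time into seconds per iteration."""
    return eta_days * 24 * 3600 / total_iters


# Figures quoted in the issue: 129500 iterations, estimated at about 6 days.
print(f"{seconds_per_iteration(6, 129500):.2f} s/iter")
```

With the quoted numbers this comes to roughly 4 seconds per iteration, which can be compared directly with the per-iteration time logged for the Transformer baseline.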

Issue with Parallel Training in MambaIR Project

Dear Author,

Firstly, I would like to express my appreciation for sharing the open-source code of your research paper's project. It has been immensely helpful for me.

I have encountered an issue while attempting to parallelize training with your code. I am using the following command to train in parallel with 4 processes on one node:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=1234 basicsr/train.py -opt options/train/train_MambaIR_SR_x2.yml --launcher pytorch

However, I receive the following errors:

/usr/local/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
usage: train.py [-h] [-opt OPT] [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local-rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
train.py: error: unrecognized arguments: --local_rank=1
usage: train.py [-h] [-opt OPT] [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local-rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
usage: train.py [-h] [-opt OPT] [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local-rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
usage: train.py [-h] [-opt OPT] [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local-rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
train.py: error: unrecognized arguments: --local_rank=2
train.py: error: unrecognized arguments: --local_rank=3
train.py: error: unrecognized arguments: --local_rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 40842) of binary: /usr/local/bin/python

It appears that the script is not recognizing --local_rank arguments. When I use the command:

torchrun --nproc_per_node=4 --master_port=1234 basicsr/train.py -opt options/train/train_MambaIR_SR_x2.yml

It shows FileExistsError: [Errno 17] File exists: ...

Furthermore, when I change the command to use a single process:

torchrun --nproc_per_node=1 --master_port=1234 basicsr/train.py -opt options/train/train_MambaIR_SR_x2.yml

The training proceeds without errors. It leads me to believe that the issue might be related to file naming during the parallelization process since the error suggests a problem with file renaming when a path already exists.
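One possible explanation, which the thread does not confirm, is that the torchrun command above omits `--launcher pytorch`. If the training script only enables distributed mode when a launcher is given, every process would behave as the main process and race to create or rename the same experiment directory. The sketch below shows the general pattern of letting only rank 0 touch the directory. It relies on the `RANK` environment variable that torchrun exports; the helper names are hypothetical and are not part of the repository.

```python
import os
import tempfile


def is_main_process() -> bool:
    """True for rank 0, or when no launcher has set RANK (single process)."""
    return int(os.environ.get("RANK", "0")) == 0


def prepare_experiment_dir(path: str) -> bool:
    """Create the directory from the main process only and tolerate reruns."""
    if not is_main_process():
        return False
    # exist_ok=True avoids FileExistsError if the directory is already there.
    os.makedirs(path, exist_ok=True)
    return True


# Demo in a throwaway location; the directory name is illustrative.
demo_dir = os.path.join(tempfile.mkdtemp(), "experiments", "demo_run")
print(prepare_experiment_dir(demo_dir), os.path.isdir(demo_dir))
```

Other ranks would normally wait at a barrier until rank 0 has finished before writing into the directory.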

Could you please provide some guidance on how to resolve this issue? Is there a specific configuration or version requirement for using torch.distributed.launch with your project? I have also adjusted the GPU settings in the train_MambaIR_SR_x2.yml configuration file accordingly.

Thank you for your attention to this matter. Looking forward to your advice.

Best regards
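The usage message in the log shows that train.py registers `--local-rank`, while the launcher passes `--local_rank=N`; which spelling the launcher uses appears to depend on the PyTorch release. A minimal sketch of a parser that accepts both spellings and falls back to the `LOCAL_RANK` environment variable set by torchrun is shown below. It is an illustration, not the repository's own parser, and `build_parser` is a hypothetical name.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Register both spellings; the value is stored as `args.local_rank`.
    parser.add_argument(
        "--local-rank", "--local_rank",
        dest="local_rank",
        type=int,
        # torchrun exports LOCAL_RANK instead of passing a flag.
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    return parser


args = build_parser().parse_args(["--local_rank=1"])
print(args.local_rank)
```

Reading the environment variable as the default means the same script works under both `torch.distributed.launch` and `torchrun`.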

Expected u.is_cuda() to be true, but got false.

Thank you for your nice work. I intend to use your VSSBlock in my model, but I encountered the following error. Could you please help me resolve it? Thank you

Params and Flops of MambaIRUNet.

I use thop to analyze the MambaIRUNet and the number of params is 25.92M. However, when I count the params using the following function, the params number is 31.50M:

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

Could you please report the parameter count and FLOPs of MambaIRUNet?

Question about 3090 4-card training speed

Dear author,

I noticed the estimated training time for a 4-card 3090 setup is around 5 days for 5,000 iterations. Is this expected? I also found that using 1 card vs 4 cards both result in a 5 day estimate.

python -m torch.distributed.launch --nproc_per_node=4 --master_port=1234 basicsr/train.py -opt options/train/train_MambaIR_SR_x2.yml --launcher pytorch

Could you please advise if this training speed is normal? Any insights would be greatly appreciated.
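One thing worth checking, assuming the batch size in the config is specified per GPU (as with BasicSR's `batch_size_per_gpu` option): adding GPUs then increases the number of samples processed per iteration rather than reducing the number of iterations, so the estimated time for a fixed iteration count may stay about the same. The sketch below illustrates the arithmetic; all numbers in it are illustrative and are not taken from the repository's config.

```python
def samples_seen(total_iters: int, batch_size_per_gpu: int, num_gpus: int) -> int:
    """Total training samples consumed when each GPU processes its own batch."""
    return total_iters * batch_size_per_gpu * num_gpus


# Illustrative values only: same iteration count, different GPU counts.
for gpus in (1, 4):
    print(gpus, "GPU(s):", samples_seen(1000, 8, gpus), "samples")
```

Under this assumption, a fair comparison between the 1-card and 4-card runs would hold the total number of samples constant instead of the iteration count.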

When will the pre-trained weights be released specifically?

Hi, guys! Congrats on your meaningful work! When exactly will the pre-trained weights be released? Can't wait to give it a try.

import selective_scan_cuda ImportError

Excuse me, I got the following error when importing this package:
/home/dhw/anaconda3/bin/conda run -n zIR2 --no-capture-output python /home/dhw/zjb_workspace/IR/realDenoising/test_real_denoising_sidd.py
Traceback (most recent call last):
  File "/home/dhw/zjb_workspace/IR/realDenoising/test_real_denoising_sidd.py", line 8, in <module>
    from basicsr.models.archs.mambairunet_arch import MambaIRUNet
  File "/home/dhw/zjb_workspace/IR/basicsr/__init__.py", line 1, in <module>
    from .archs import *
  File "/home/dhw/zjb_workspace/IR/basicsr/archs/__init__.py", line 16, in <module>
    _arch_modules = [importlib.import_module(f'basicsr.archs.{file_name}') for file_name in arch_filenames]
  File "/home/dhw/zjb_workspace/IR/basicsr/archs/__init__.py", line 16, in <listcomp>
    _arch_modules = [importlib.import_module(f'basicsr.archs.{file_name}') for file_name in arch_filenames]
  File "/home/dhw/anaconda3/envs/zIR2/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/dhw/zjb_workspace/IR/basicsr/archs/mambair_arch.py", line 11, in <module>
    from mamba_ssm.ops.selective_scan_interface import selective_scan_fn, selective_scan_ref
  File "/home/dhw/anaconda3/envs/zIR2/lib/python3.9/site-packages/mamba_ssm/__init__.py", line 3, in <module>
    from mamba_ssm.ops.selective_scan_interface import selective_scan_fn, mamba_inner_fn
  File "/home/dhw/anaconda3/envs/zIR2/lib/python3.9/site-packages/mamba_ssm/ops/selective_scan_interface.py", line 11, in <module>
    import selective_scan_cuda
ImportError: /home/dhw/anaconda3/envs/zIR2/lib/python3.9/site-packages/selective_scan_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE
ERROR conda.cli.main_run:execute(124): conda run python /home/dhw/zjb_workspace/IR/realDenoising/test_real_denoising_sidd.py failed. (See above for error)

Process finished with exit code 1
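The undefined symbol demangles to `c10::cuda::CUDACachingAllocator::allocator`, which belongs to PyTorch. An error like this often means the compiled extension was built against a different PyTorch build than the one installed, although the thread does not confirm that here. As a first diagnostic step, the sketch below lists the installed versions using only the standard library. The distribution names are assumptions about how the packages were installed, and `installed_versions` is a hypothetical helper.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names are assumed; adjust them to match the environment.
for name, version in installed_versions(["torch", "mamba-ssm", "causal-conv1d"]).items():
    print(f"{name}: {version or 'not installed'}")
```

If the versions turn out to be mismatched, reinstalling the extension so that it is built against the installed PyTorch is a common remedy.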

Several minor problems you may encounter when training for the real denoising task.

1/ It should be '--no_cuda_ext' rather than '--no_cuda_extgf' in readme file for installing basicsr.
2/ You may change '--local-rank' into '--local_rank' if you hit an error about local rank. (This is just a basicsr version issue.)

MambaIR/realDenoising/basicsr/train.py

Line 36 in 0eba085

parser.add_argument('--local-rank', type=int, default=0)

3/ As is stated in the paper, the Charbonnier loss is used instead of L1, but the yaml file still sets it as L1.
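For readers unfamiliar with the difference in item 3, the Charbonnier loss is a smooth variant of L1, commonly written as the mean of sqrt(diff^2 + eps^2). The following framework-free sketch compares the two on a toy example. The `eps` default of 1e-3 is an assumption, and implementations differ in how they apply it, so this is not necessarily the exact form used in the paper or the repository.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error over paired values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt(diff^2 + eps^2); eps=1e-3 is an assumed default."""
    total = sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target))
    return total / len(pred)


# Toy values for illustration only.
pred, target = [0.2, 0.5, 0.9], [0.2, 0.4, 1.0]
print(l1_loss(pred, target), charbonnier_loss(pred, target))
```

The two losses are nearly identical for large errors and differ mainly near zero, where Charbonnier stays differentiable.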

Inquiry Regarding Training Requirements for light-sr Model

Hello,

I've noticed that the author has mentioned across various platforms that training the MambaIR model generally requires about 8 NVIDIA V100 GPUs for a duration of approximately 7 days. I am curious to know if this specification also applies to the training of the light-sr model. Could you please provide some insights on whether the same hardware and timeframe are recommended for light-sr, or if there are different requirements for optimal training results?

Thank you for your time and assistance.

Best regards.

Recommended hyperparams for high-resolution input

Hi! Thanks for open-sourcing such good work!

I am trying to apply MambaIR to my artifact removal task. Because the artifacts are globally distributed, I did not crop the images to a smaller size, in order to preserve global information about the artifacts. As a result, training failed to launch because it ran out of GPU memory.

The original image size is 512. I am wondering if MambaIR can be applied with such full-resolution images directly? If so, would you be so kind to share knowledge on the hyperparams you recommend (patch_size, embed_dim, etc)?

Or, when running inference on a high-resolution image, is it conventional to first crop the input into smaller patches, such as 64x64, and run inference patch by patch?
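The patch-by-patch option can be sketched as follows. This is only an illustration of how overlapping tile coordinates might be generated for a full-resolution image; it does not run a model, and the function name, tile size, and overlap are hypothetical choices, not settings recommended by the authors.

```python
def tile_boxes(height, width, tile=64, overlap=8):
    """Return (top, left, bottom, right) boxes that cover the whole image."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    stride = tile - overlap

    def starts(size):
        # A single tile is enough when the image is no larger than the tile.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        positions.append(size - tile)  # last tile sits flush with the border
        return positions

    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in starts(height)
        for left in starts(width)
    ]


boxes = tile_boxes(512, 512, tile=64, overlap=8)
print(len(boxes), boxes[0], boxes[-1])
```

Each box would be cropped, passed through the network, and written back; where tiles overlap, the outputs are typically averaged to hide seams.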

Thanks :)

Some questions about 'Test on Real Image Denoising'

Thank you for your contribution to this field. When I ran the code in the ‘Test on Real Image Denoising’ section, I hit two errors that I could not solve. The details are as follows:
python test_real_denoising_dnd.py
First,

I found that some parameters are not present in the model, such as heads: [1, 2, 4, 8], window_size: [8, 8, 8, 8], and interval: [32, 16, 8, 4], so I deleted these three parameters.

But then I encountered the second error.

I downloaded 'Real image Denoising', but it reported an error.

So may I ask whether I did something wrong? What should I do?
Thank you

About the DND dataset results

Hi, thanks for your excellent work.
I was wondering if your DND dataset results were submitted to the website for testing? May I ask how long it will take to get the result?
Can you please resolve this issue?
Thanks in advance.

Sensitivity of SS2D Block to Input Image Feature Size

Description:

I have trained an EDSR-style network with 20 residual blocks (RB). The baseline network has 20 RBs, while the experimental variant inserts one MySS2D block in the middle, resulting in the structure: 10RBs + 1MySS2D + 10RBs.

Here are the details of my observations:

Training Details:
- Patch Size: 128x128

Testing Observations:
- Using 128x128 Blocks: When I divide the input image feature into 128x128 blocks before passing through the SS2D block, the PSNR is normal.
- Using Original 512x512 Image: When I directly input the original image of 512x512 size, the PSNR is significantly lower, with around a 5dB drop compared to the baseline network.

This leads me to suspect that the SS2D block might be sensitive to the input image block size and possibly overfitted to the patch size used during training.

Code Implementation:

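import torch
import torch.nn as nn

from mambair_arch import SS2D


class MySS2D(nn.Module):
    """Residual wrapper around SS2D with a learnable scale initialised to zero."""

    def __init__(self, C):
        super().__init__()
        self.body = SS2D(C)
        self.s = nn.Parameter(torch.tensor([0.0]))

    def forward(self, x):
        # Convert (B, C, H, W) to channels-last for SS2D, then back again.
        y = self.body(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return x + self.s * y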

Questions:

Is the SS2D block sensitive to the input image feature size?
Is there a potential mistake in my implementation that could be causing this issue?
Any insights or instructions on how to address this issue would be greatly appreciated.

Thank you very much for your assistance!

Does Mamba have to run on Ubuntu?

Hi, does Mamba have to run in an Ubuntu environment?

Issue with real denoising

Hi, thanks for your excellent work.

I think there is an inconsistency between the pre-trained model and the established model for real denoising, as mentioned in the next issue.

Can you please resolve this issue?

Thanks in advance.

Regarding real denoising with arbitrary resolution input

As a beginner, I'd like to ask: do images need to be preprocessed (for example, cropped) before testing denoising on inputs of arbitrary resolution, or is it fine to denoise them as read, without any processing?
I am very much looking forward to your response. Thank you for your open-source work.

About RealDN

Hello! This is really good work and I'm very interested in it. However, I couldn't find the real denoising network ARTUNet that is named in the configuration file ./realDenoising/options/train_MambaIR_RealDN.yml. Could you please point out where it is defined? Thank you.

About the calculation of params count and MACs for lightweight Image Super-Resolution

Thank you very much for your work! I'd like to ask, what toolkit did you use to calculate the params count and MACs? It seems that the toolkit I have here cannot compute the parameter count and MACs.

Is the pre-trained model file for the RealSR task the wrong one?

I downloaded it from this link: https://drive.google.com/file/d/1ZOFzcex2g9_B6Xtf8-qnMx08OnAjKD0M/view?usp=sharing. After downloading, I found that the keys in the checkpoint are:
['conv0.weight', 'conv0.bias', 'conv1.weight_orig', 'conv1.weight_u', 'conv1.weight_v', 'conv2.weight_orig', 'conv2.weight_u', 'conv2.weight_v', 'conv3.weight_orig', 'conv3.weight_u', 'conv3.weight_v', 'conv4.weight_orig', 'conv4.weight_u', 'conv4.weight_v', 'conv5.weight_orig', 'conv5.weight_u', 'conv5.weight_v', 'conv6.weight_orig', 'conv6.weight_u', 'conv6.weight_v', 'conv7.weight_orig', 'conv7.weight_u', 'conv7.weight_v', 'conv8.weight_orig', 'conv8.weight_u', 'conv8.weight_v', 'conv9.weight', 'conv9.bias'])
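These have nothing to do with the MambaIR model. Was the wrong file uploaded? Using the log file, I have already aligned the model structure, and the structure printed by my code now matches the one in the log, but loading the pre-trained weights fails because the keys do not match.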
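A quick way to confirm a mismatch like this is to compare the key sets before calling `load_state_dict`. The sketch below works on plain lists of key names, so it runs without loading any weights. Keys ending in `weight_orig`, `weight_u`, and `weight_v` are the names registered by `torch.nn.utils.spectral_norm`, which is more typical of a GAN discriminator than of a restoration network, though the thread does not confirm what the file actually contains. The helper names and the example model keys are hypothetical.

```python
def diff_state_dict_keys(checkpoint_keys, model_keys):
    """Report keys present on only one side of a checkpoint/model pair."""
    ckpt, model = set(checkpoint_keys), set(model_keys)
    return {
        "missing_in_checkpoint": sorted(model - ckpt),
        "unexpected_in_checkpoint": sorted(ckpt - model),
    }


def has_spectral_norm_keys(keys):
    """True if any key carries a suffix written by spectral normalization."""
    return any(k.endswith(("weight_orig", "weight_u", "weight_v")) for k in keys)


# A few of the keys from the issue, plus hypothetical generator keys.
ckpt_keys = ["conv0.weight", "conv0.bias", "conv1.weight_orig", "conv1.weight_u", "conv1.weight_v"]
model_keys = ["conv_first.weight", "conv_first.bias"]
report = diff_state_dict_keys(ckpt_keys, model_keys)
print(report["unexpected_in_checkpoint"], has_spectral_norm_keys(ckpt_keys))
```

In practice the two lists would come from `checkpoint.keys()` and `model.state_dict().keys()`.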

A Question about ERF

Hello, I'm encountering difficulty in visualizing the Effective Receptive Field (ERF) of models for my task in MRI super-resolution. The ERF of MambaIR appears to be typical and is displayed below:

However, when visualizing the ERF of the SwinIR model, it displays an abnormal result as shown below:

I'm uncertain about the reason behind this discrepancy. Could the author provide any suggestions or insights?

Question about visualization

Hi, great works, thanks for your contribution.
I would like to visualize the channel activation values of the model. How can I produce a visualization like the one below?

Thanks
BR

When will the pre-trained weights be released?

You've done a great job. When will the pre-trained weights be released?

The parameter count of MambaIR_lightSR is more than 1M, which does not match Table 5 of the paper.

I ran flops_param.py after changing the parameters in the buildMambaIR function to match MambaIR_lightSR, but the result is 1.138M, which differs from Table 5 of the paper. Why does this happen? (The picture shows the MambaIR_lightSR parameters I set based on train_MambaIR_lightSR_x2.yml.)

I am wondering which one we should refer to. Looking forward to your reply.

About method comparison

Hi, thanks for sharing your pioneering work. Have you compared MambaIR with NAFNet? Both are simple, linear methods. Looking forward to your reply! ^_^
