csguoh / mambair
[ECCV2024] An official PyTorch implementation of the paper "MambaIR: A simple baseline for image restoration with state-space model".
License: Apache License 2.0
I encountered NaN losses after training for several epochs. Reducing the learning rate helped only for a few more epochs before the loss became NaN again. Do you have any experience in addressing this problem?
Hello, when I run "python setup.py develop --no_cuda_extgf", I receive "no option --no_cuda_extgf".
How can I address it?
Dear esteemed author,
First and foremost, I would like to express my heartfelt gratitude for your remarkable work. After nearly 5 days of diligent training, the model has successfully completed its learning process. For testing purposes, I selected the final model, MambaIR-main/experiments/MambaIR_SR_x2/models/net_g_latest.pth. If it wouldn't be too much trouble, could you kindly take a moment to review my approach and confirm whether it is correct?
The command I employed for testing is as follows:
python basicsr/test.py -opt options/test/test_MambaIR_SR_x2.yml
I am incredibly appreciative of your thoughtful reminder. As per your guidance, I have made sure to include the path in the test script to avoid any potential errors.
2024-04-11 20:25:22,989 INFO: Dataset [PairedImageDataset] - Set5 is built.
2024-04-11 20:25:22,990 INFO: Number of test images in Set5: 5
2024-04-11 20:25:23,033 INFO: Dataset [PairedImageDataset] - Set14 is built.
2024-04-11 20:25:23,034 INFO: Number of test images in Set14: 14
2024-04-11 20:25:23,111 INFO: Dataset [PairedImageDataset] - B100 is built.
2024-04-11 20:25:23,112 INFO: Number of test images in B100: 100
2024-04-11 20:25:23,187 INFO: Dataset [PairedImageDataset] - Urban100 is built.
2024-04-11 20:25:23,187 INFO: Number of test images in Urban100: 100
2024-04-11 20:25:23,243 INFO: Dataset [PairedImageDataset] - Manga109 is built.
2024-04-11 20:25:23,244 INFO: Number of test images in Manga109: 109
2024-04-11 20:25:23,802 INFO: Network [MambaIR] is created.
2024-04-11 20:25:27,104 INFO: Loading MambaIR model from /aiarena/gpfs/MambaIR-main/experiments/MambaIR_SR_x2/models/net_g_latest.pth, with param key: [params].
2024-04-11 20:25:27,368 INFO: Model [MambaIRModel] is created.
2024-04-11 20:25:27,368 INFO: Testing Set5...
2024-04-11 20:25:33,674 INFO: Validation Set5
# psnr: 38.3964 Best: 38.3964 @ test_MambaIR_SR_x2 iter
# ssim: 0.9619 Best: 0.9619 @ test_MambaIR_SR_x2 iter
2024-04-11 20:25:33,674 INFO: Testing Set14...
2024-04-11 20:25:58,600 INFO: Validation Set14
# psnr: 34.4674 Best: 34.4674 @ test_MambaIR_SR_x2 iter
# ssim: 0.9245 Best: 0.9245 @ test_MambaIR_SR_x2 iter
2024-04-11 20:25:58,601 INFO: Testing B100...
2024-04-11 20:27:38,379 INFO: Validation B100
# psnr: 32.4877 Best: 32.4877 @ test_MambaIR_SR_x2 iter
# ssim: 0.9036 Best: 0.9036 @ test_MambaIR_SR_x2 iter
2024-04-11 20:27:38,379 INFO: Testing Urban100...
2024-04-11 20:37:53,034 INFO: Validation Urban100
# psnr: 33.7748 Best: 33.7748 @ test_MambaIR_SR_x2 iter
# ssim: 0.9415 Best: 0.9415 @ test_MambaIR_SR_x2 iter
2024-04-11 20:37:53,036 INFO: Testing Manga109...
2024-04-11 20:50:51,653 INFO: Validation Manga109
# psnr: 39.7395 Best: 39.7395 @ test_MambaIR_SR_x2 iter
# ssim: 0.9794 Best: 0.9794 @ test_MambaIR_SR_x2 iter
Is this all right?
Your expertise and assistance in this matter would be immensely valued.
With utmost respect and gratitude,
xaswq
Could you please provide an actual inference speed comparison?
Hello, your work is excellent! When I use MambaIR in my own work, I find that training is very slow: 129,500 iterations are expected to take 6 days, which is much longer than with a transformer-based model. Is this normal? What is your opinion? I have already tried reducing the number of embedding channels and the number of RSSB blocks, and my image size is 128. Looking forward to your reply!
Dear Author,
Firstly, I would like to express my appreciation for sharing the open-source code of your research paper's project. It has been immensely helpful for me.
I have encountered an issue while attempting to parallelize training with your code. I am using the following command for parallel training on 4 nodes:
python -m torch.distributed.launch --nproc_per_node=4 --master_port=1234 basicsr/train.py -opt options/train/train_MambaIR_SR_x2.yml --launcher pytorch
However, I receive the following errors:
/usr/local/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
usage: train.py [-h] [-opt OPT] [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local-rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
train.py: error: unrecognized arguments: --local_rank=1
usage: train.py [-h] [-opt OPT] [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local-rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
usage: train.py [-h] [-opt OPT] [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local-rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
usage: train.py [-h] [-opt OPT] [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local-rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
train.py: error: unrecognized arguments: --local_rank=2
train.py: error: unrecognized arguments: --local_rank=3
train.py: error: unrecognized arguments: --local_rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 40842) of binary: /usr/local/bin/python
It appears that the script is not recognizing the --local_rank argument. When I use the command:
torchrun --nproc_per_node=4 --master_port=1234 basicsr/train.py -opt options/train/train_MambaIR_SR_x2.yml
It shows FileExistsError: [Errno 17] File exists: ...
Furthermore, when I change the command to single-node training:
torchrun --nproc_per_node=1 --master_port=1234 basicsr/train.py -opt options/train/train_MambaIR_SR_x2.yml
The training proceeds without errors. It leads me to believe that the issue might be related to file naming during the parallelization process since the error suggests a problem with file renaming when a path already exists.
Could you please provide some guidance on how to resolve this issue? Is there a specific configuration or version requirement for using torch.distributed.launch with your project? I have also adjusted the GPU settings in the train_MambaIR_SR_x2.yml configuration file accordingly.
Thank you for your attention to this matter. Looking forward to your advice.
Best regards
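A note on the FileExistsError above: the error is consistent with several processes racing to create or rename the same experiment directory. One plausible cause, not confirmed in this thread, is that the torchrun command omits `--launcher pytorch`, so every process may behave as if it were the only one. The sketch below shows the general idea of a race-free setup, where only rank 0 archives an old directory and creation is idempotent for everyone. `get_rank_from_env` and `prepare_experiment_dir` are hypothetical helpers written for illustration and are not part of BasicSR.

```python
import os
import time


def get_rank_from_env() -> int:
    """Global rank as exported by torchrun; 0 when the script is started directly."""
    return int(os.environ.get("RANK", "0"))


def prepare_experiment_dir(path: str, rank: int) -> str:
    """Create `path`, archiving an older copy first, without racing other ranks.

    Hypothetical helper for illustration only; not the BasicSR implementation.
    """
    if rank == 0 and os.path.isdir(path):
        # Only one process moves the old directory aside, so two processes
        # never try to rename the same path at the same time.
        archived = f"{path}_archived_{time.strftime('%Y%m%d_%H%M%S')}"
        os.rename(path, archived)
    # exist_ok=True makes creation safe to repeat on every rank.
    os.makedirs(path, exist_ok=True)
    return path
```

In a real distributed run the other ranks would normally wait at a barrier until rank 0 has finished, so that nothing is written into a directory that is about to be archived.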
I used thop to analyze MambaIRUNet and it reports 25.92M parameters. However, when I count the parameters with the following function, the result is 31.50M:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
Could you please share the parameter count and FLOPs of MambaIRUNet?
Dear author,
I noticed the estimated training time for a 4-card 3090 setup is around 5 days for 5,000 iterations. Is this expected? I also found that using 1 card vs 4 cards both result in a 5 day estimate.
python -m torch.distributed.launch --nproc_per_node=4 --master_port=1234 basicsr/train.py -opt options/train/train_MambaIR_SR_x2.yml --launcher pytorch
Could you please advise if this training speed is normal? Any insights would be greatly appreciated.
Hi, guys! Congrats on your meaningful work! When exactly will the pre-trained weights be released? I can't wait to give it a try.
Excuse me, I got the following error when importing this package:
/home/dhw/anaconda3/bin/conda run -n zIR2 --no-capture-output python /home/dhw/zjb_workspace/IR/realDenoising/test_real_denoising_sidd.py
Traceback (most recent call last):
  File "/home/dhw/zjb_workspace/IR/realDenoising/test_real_denoising_sidd.py", line 8, in <module>
    from basicsr.models.archs.mambairunet_arch import MambaIRUNet
  File "/home/dhw/zjb_workspace/IR/basicsr/__init__.py", line 1, in <module>
    from .archs import *
  File "/home/dhw/zjb_workspace/IR/basicsr/archs/__init__.py", line 16, in <module>
    _arch_modules = [importlib.import_module(f'basicsr.archs.{file_name}') for file_name in arch_filenames]
  File "/home/dhw/zjb_workspace/IR/basicsr/archs/__init__.py", line 16, in <listcomp>
    _arch_modules = [importlib.import_module(f'basicsr.archs.{file_name}') for file_name in arch_filenames]
  File "/home/dhw/anaconda3/envs/zIR2/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/dhw/zjb_workspace/IR/basicsr/archs/mambair_arch.py", line 11, in <module>
    from mamba_ssm.ops.selective_scan_interface import selective_scan_fn, selective_scan_ref
  File "/home/dhw/anaconda3/envs/zIR2/lib/python3.9/site-packages/mamba_ssm/__init__.py", line 3, in <module>
    from mamba_ssm.ops.selective_scan_interface import selective_scan_fn, mamba_inner_fn
  File "/home/dhw/anaconda3/envs/zIR2/lib/python3.9/site-packages/mamba_ssm/ops/selective_scan_interface.py", line 11, in <module>
    import selective_scan_cuda
ImportError: /home/dhw/anaconda3/envs/zIR2/lib/python3.9/site-packages/selective_scan_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE
ERROR conda.cli.main_run:execute(124): conda run python /home/dhw/zjb_workspace/IR/realDenoising/test_real_denoising_sidd.py failed. (See above for error)
Process finished with exit code 1
1/ The README should say '--no_cuda_ext' rather than '--no_cuda_extgf' when installing basicsr.
2/ If you hit an error about the local rank, you may change '--local-rank' into '--local_rank'. (This is just a basicsr version issue.)
MambaIR/realDenoising/basicsr/train.py
Line 36 in 0eba085
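As an alternative to editing the spelling by hand, an argument parser can accept both forms and fall back to the `LOCAL_RANK` environment variable that torchrun sets. The following is a minimal sketch of that idea using only the standard library. `build_parser` and `resolve_local_rank` are illustrative names, and the option list is reduced to the flags visible in the usage message quoted earlier; it is not the actual train.py code.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Parser that tolerates both spellings of the local-rank flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-opt", type=str, required=True, help="Path to the option YAML file.")
    parser.add_argument("--launcher", choices=["none", "pytorch", "slurm"], default="none")
    # torch.distributed.launch passes --local_rank=N on older PyTorch releases,
    # while newer ones use --local-rank; registering both avoids the
    # "unrecognized arguments" failure in either case.
    parser.add_argument("--local_rank", "--local-rank", dest="local_rank", type=int, default=None)
    return parser


def resolve_local_rank(args: argparse.Namespace) -> int:
    """Prefer the command-line value, then the LOCAL_RANK variable set by torchrun."""
    if args.local_rank is not None:
        return args.local_rank
    return int(os.environ.get("LOCAL_RANK", "0"))
```

With this parser, `--local_rank=1` and `--local-rank 1` both populate `args.local_rank`, and a torchrun launch that passes neither still resolves the rank from the environment.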
Hello,
I've noticed that the author has mentioned across various platforms that training the MambaIR model generally requires about 8 NVIDIA V100 GPUs for a duration of approximately 7 days. I am curious to know if this specification also applies to the training of the light-sr model. Could you please provide some insights on whether the same hardware and timeframe are recommended for light-sr, or if there are different requirements for optimal training results?
Thank you for your time and assistance.
Best regards.
Hi! Thanks for open-sourcing such good work!
I am trying to apply MambaIR to my artifact removal task. Due to the artifacts being globally distributed, I didn't crop the image to a smaller size, trying to maintain the global information about the artifacts. And I failed to launch training, running out of the GPU memory.
The original image size is 512. I am wondering if MambaIR can be applied with such full-resolution images directly? If so, would you be so kind to share knowledge on the hyperparams you recommend (patch_size, embed_dim, etc)?
Or, when inferring on a high-resolution image, is it a convention to first crop the input image to a smaller size like 64x64, and do multiple inference patch-by-patch?
Thanks :)
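For readers weighing the patch-by-patch option raised in this question, the bookkeeping for overlapping tiles can be sketched as below. This is a generic illustration and not code from the MambaIR repository: `tile_starts` and `tile_boxes` are hypothetical helpers, and the tile size and overlap are arbitrary example values.

```python
from typing import List, Tuple


def tile_starts(size: int, tile: int, overlap: int) -> List[int]:
    """Start offsets of tiles of length `tile` that cover [0, size).

    Neighbouring tiles share at least `overlap` pixels, and the last tile is
    shifted back so that it ends exactly at the image border.
    """
    if tile >= size:
        return [0]
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)
    return starts


def tile_boxes(height: int, width: int, tile: int, overlap: int) -> List[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) boxes covering the whole image."""
    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in tile_starts(height, tile, overlap)
        for left in tile_starts(width, tile, overlap)
    ]
```

Each box would be cropped, passed through the network, and written back into an output canvas, with overlapping regions typically averaged to hide seams. For super-resolution the output coordinates would additionally be multiplied by the scale factor. Whether tiling preserves enough global context for a given artifact pattern has to be checked empirically.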
Thank you for your contribution to this field. When I ran the "Test on Real Image Denoising" part of the code, I hit two errors that I couldn't solve. The details are as follows:
python test_real_denoising_dnd.py
First, I found that some parameters are not present in the model, such as heads: [1, 2, 4, 8], window_size: [8, 8, 8, 8], and interval: [32, 16, 8, 4], so I deleted these three parameters.
Then I encountered the second error:
I downloaded "Real Image Denoising", but it reported an error.
Is there anything I am doing wrong? What should I do?
Thank you
Hi, thanks for your excellent work.
I was wondering if your DND dataset results were submitted to the website for testing? May I ask how long it will take to get the result?
Can you please resolve this issue?
Thanks in advance.
Description:
I have trained an EDSR-style network with 20 residual blocks (RB). The baseline network has 20 RBs, while the experimental variant inserts one MySS2D block in the middle, resulting in the structure: 10RBs + 1MySS2D + 10RBs.
Here are the details of my observations:
Training Details:
Testing Observations:
This leads me to suspect that the SS2D block might be sensitive to the input image block size and possibly overfitted to the patch size used during training.
Code Implementation:
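import torch
import torch.nn as nn

from mambair_arch import SS2D


class MySS2D(nn.Module):
    """Residual wrapper around SS2D with a learnable scale."""

    def __init__(self, C):
        super().__init__()
        self.body = SS2D(C)
        # The scale starts at zero, so the block is an identity mapping at initialization.
        self.s = nn.Parameter(torch.tensor([0.0]))

    def forward(self, x):
        # Convert (B, C, H, W) to channels-last (B, H, W, C) for SS2D, then back.
        return x + self.s * self.body(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)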
Questions:
Thank you very much for your assistance!
Hi, does Mamba have to run in an Ubuntu environment?
Hi, thanks for your excellent work.
I think there is an inconsistency between the pre-trained model and the established model for real denoising, as mentioned in the next issue.
Can you please resolve this issue?
Thanks in advance.
As a beginner, I'd like to ask this question: Do you need to preprocess images before testing denoising on images with arbitrary resolutions? Such as cropping, or is it okay to denoise as long as the images are read without any processing?
I am very much looking forward to your response. Thank you for your open-source work. Thank you
Hello! It's a really good job and I'm very interested in it. But I couldn't find the real denoising network ARTUNet in the configuration file ./realDenoising/options/train_MambaIR_RealDN.yml, Could you please help to point out the specific location? Thank you.
Thank you very much for your work! I'd like to ask, what toolkit did you use to calculate the params count and MACs? It seems that the toolkit I have here cannot compute the parameter count and MACs.
I downloaded the checkpoint from this link: https://drive.google.com/file/d/1ZOFzcex2g9_B6Xtf8-qnMx08OnAjKD0M/view?usp=sharing. After downloading it, I found that the keys in the checkpoint are:
['conv0.weight', 'conv0.bias', 'conv1.weight_orig', 'conv1.weight_u', 'conv1.weight_v', 'conv2.weight_orig', 'conv2.weight_u', 'conv2.weight_v', 'conv3.weight_orig', 'conv3.weight_u', 'conv3.weight_v', 'conv4.weight_orig', 'conv4.weight_u', 'conv4.weight_v', 'conv5.weight_orig', 'conv5.weight_u', 'conv5.weight_v', 'conv6.weight_orig', 'conv6.weight_u', 'conv6.weight_v', 'conv7.weight_orig', 'conv7.weight_u', 'conv7.weight_v', 'conv8.weight_orig', 'conv8.weight_u', 'conv8.weight_v', 'conv9.weight', 'conv9.bias']
These have nothing to do with the MambaIR model. Was the wrong file uploaded? Using the log file, I have aligned the model structure, and the structure printed by the code now matches the one in the log, but loading the pre-trained parameters fails because the keys do not match.
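A quick way to confirm this kind of mismatch before calling the loader is to diff the two key sets. The helper below is a small standard-library sketch. `compare_state_dict_keys` is a hypothetical name, and in practice the two inputs would come from `model.state_dict().keys()` and the loaded checkpoint dictionary. As a side observation, suffixes such as `weight_orig`, `weight_u` and `weight_v` are what `torch.nn.utils.spectral_norm` typically registers, which suggests the file holds a different network.

```python
from typing import Dict, Iterable, List


def compare_state_dict_keys(
    model_keys: Iterable[str], checkpoint_keys: Iterable[str]
) -> Dict[str, List[str]]:
    """Report which parameter names differ between a model and a checkpoint."""
    model_set, ckpt_set = set(model_keys), set(checkpoint_keys)
    return {
        # Names the model expects but the file does not provide.
        "missing_in_checkpoint": sorted(model_set - ckpt_set),
        # Names stored in the file that the model does not define.
        "unexpected_in_checkpoint": sorted(ckpt_set - model_set),
        # Names present on both sides.
        "shared": sorted(model_set & ckpt_set),
    }
```

If `shared` comes back empty, as it would for the key list quoted above, no amount of key renaming is likely to help and the checkpoint file itself should be re-checked.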
Hello, I'm encountering difficulty in visualizing the Effective Receptive Field (ERF) of models for my task in MRI super-resolution. The ERF of MambaIR appears to be typical and is displayed below:
However, when visualizing the ERF of the SwinIR model, it displays an abnormal result as shown below:
I'm uncertain about the reason behind this discrepancy. Could the author provide any suggestions or insights?
You've done a great job. When will the pre-trained weights be released?
I ran flops_param.py after changing the parameters in the buildMambaIR function to match MambaIR_lightSR, but the result is 1.138M, which differs from Table 5 of the paper. Why does this happen? (The picture shows the MambaIR_lightSR parameters I set based on train_MambaIR_lightSR_x2.yml.)
Hello, after setting up the environment I got this error during testing: KeyError: "No object named 'MambaIRModel' found in 'model' registry!". Why does this happen?
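For context on how such a message arises, the sketch below reproduces the mechanism with a minimal name-to-class registry. It is a simplified stand-in written for illustration, not the BasicSR source. A lookup fails with exactly this kind of KeyError whenever the module that defines and registers the class was never imported, for example because a different installed copy of the package is picked up instead of the repository's own.

```python
class Registry:
    """Minimal name-to-class registry (illustrative stand-in, not the BasicSR code)."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._obj_map = {}

    def register(self, obj):
        """Decorator that stores a class under its own name."""
        if obj.__name__ in self._obj_map:
            raise KeyError(f"'{obj.__name__}' is already registered in '{self._name}' registry!")
        self._obj_map[obj.__name__] = obj
        return obj

    def get(self, name: str):
        """Look a class up by name; fails if its defining module never ran."""
        if name not in self._obj_map:
            raise KeyError(f"No object named '{name}' found in '{self._name}' registry!")
        return self._obj_map[name]


MODEL_REGISTRY = Registry("model")
```

Registration only happens when the decorated class definition is executed, so a useful first check is to print the `__file__` attribute of the imported package and confirm it points at the repository that actually contains the model file.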
Thank you very much for your work! I am curious about the mlp_ratio in the SR x2 settings.
In options/train/train_MambaIR_SR_x2.yml, the mlp_ratio is 2.0, while in the log provided here it is 4.0.
Which one should we follow? Looking forward to your reply.
Hi, thanks for sharing your pioneering work. Have you compared MambaIR with NAFNet? Both are simple, linear methods. Looking forward to your reply! ^_^