microsoft / climax
Foundation model for weather & climate
Home Page: https://microsoft.github.io/ClimaX/
License: MIT License
I apologize for this request, but the raw data is extremely large and inconvenient to download and organize. Would it be possible to share the downscaled dataset after processing?
Thank you for the open-source code and detailed documentation of the experiments. I had a few questions about pre-training, which I can't seem to find in the paper or appendix. Could you please help?
Thanks.
I was running the global training on 4 GPUs and 48 CPUs.
Training works when the number of workers (--data.num_workers) is set to 1.
However, it fails when I increase the number of workers to speed up data loading.
I tried 48, 24, 12, and 8; all of them failed with the following error:
Traceback (most recent call last):
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
return function(*args, **kwargs)
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
results = self._run_stage()
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
self._run_train()
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1150, in _run_train
self._run_sanity_check()
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1224, in _run_sanity_check
self._call_callback_hooks("on_sanity_check_end")
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1340, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/rich_progress.py", line 358, in on_sanity_check_end
assert self.val_sanity_progress_bar_id is not None
AssertionError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/g/data/wb00/admin/staging/ClimaX/src/climax/global_forecast/train.py", line 41, in <module>
main()
File "/g/data/wb00/admin/staging/ClimaX/src/climax/global_forecast/train.py", line 34, in main
cli.trainer.fit(cli.model, datamodule=cli.datamodule)
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
call._call_and_handle_interrupt(
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 59, in _call_and_handle_interrupt
trainer.strategy.reconciliate_processes(traceback.format_exc())
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 461, in reconciliate_processes
raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 0
Traceback (most recent call last):
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
return function(*args, **kwargs)
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
results = self._run_stage()
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
self._run_train()
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1150, in _run_train
self._run_sanity_check()
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1224, in _run_sanity_check
self._call_callback_hooks("on_sanity_check_end")
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1340, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/rich_progress.py", line 358, in on_sanity_check_end
assert self.val_sanity_progress_bar_id is not None
AssertionError
At the moment, training works only with a single worker; any other value produces the error above.
Any suggestions on how to solve this?
In your paper, you mention that the model is pre-trained on 80 V100s. How long was the model trained for? I'm working on a review paper and would like an estimate of training time.
Hi. Thanks for the great work!
I can't quickly make it out at a glance.
Does the model support fine-tuning regional forecasts at a higher resolution than the one it was pretrained on?
I am looking forward to the pre-train dataset and script release.
Is there any update?
Hi,
From my understanding, the hrs_each_step variable in the global_forecast_climax.yaml file tracks how many hours lie between input samples. ERA5, for example, is hourly, so the default is to fine-tune on all of the available data and hrs_each_step is 1. However, if you change hrs_each_step to any other value, you still get the same number of input examples as with 1, and the results get worse the larger hrs_each_step becomes.
I have fine-tuned models for different lead times (for example, 48 hours) using hrs_each_step = 1. If I want to generate predictions on the test set, my understanding is that hrs_each_step = 1 should produce hourly examples, while hrs_each_step = 24 should produce daily predictions at a fixed time of day. This is not how it currently works.
Can you explain exactly what the hrs_each_step variable is supposed to do, if it is not the number of hours between input samples?
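For reference, the reading of hrs_each_step described in this question (a stride between consecutive input samples) can be sketched as below. This only illustrates the behaviour the poster expects; it is not the repository's implementation, and `subsample_start_indices` is a hypothetical helper.

```python
import numpy as np

def subsample_start_indices(num_hours, hrs_each_step):
    """Indices of input samples if hrs_each_step acted as a stride (hypothetical helper)."""
    return np.arange(0, num_hours, hrs_each_step)

# Two days of hourly data.
hourly = subsample_start_indices(num_hours=48, hrs_each_step=1)
daily = subsample_start_indices(num_hours=48, hrs_each_step=24)

# Under this reading, a larger hrs_each_step yields fewer input examples.
print(len(hourly), len(daily))
```

Under this interpretation the number of examples would shrink as hrs_each_step grows, which is the opposite of what the poster reports observing.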
hi,
Thanks for the repo. Is there a way to train the model without pre-training on ERA5? When I run the command
python -u src/climax/global_forecast/train.py --config configs/global_forecast_climax.yaml --trainer.strategy=ddp --trainer.devices=1 --trainer.max_epochs=50 --data.root_dir=process_data --data.predict_range=24 --data.out_variables=['z_500','t_850','t2m','u10'] --data.batch_size=16 --model.pretrained_path="" --model.lr=5e-7 --model.beta_1="0.9" --model.beta_2="0.99" --model.weight_decay=1e-5
It automatically starts with pre-training. Is there a flag somewhere to disable the pre-training and train on ERA5?
Regards,
Yogesh
Hello, thank you for this package.
However, I have trouble installing it.
Creating the conda environment with
conda env create --file docker/environment.yml
prints 'Looking for incompatible packages' and, after some time, aborts and reports multiple conflicts.
Hi there,
I would like to ask whether the pretrained weights at https://climaxrelease.blob.core.windows.net/checkpoints/ClimaX-5.625deg.ckpt have been opened for public access. If not, could you kindly let me know how to get them?
Many thanks.
Hi guys,
I'm trying to download your pre-trained 5.625deg datasets from dataserv.ub.tum.de. I tried for a whole day, but the website responds very slowly even though my network connection is good, and I can't complete the download.
Could you host the dataset on another service such as OneDrive or Google Drive?
Or could you give more instructions on how to pre-process the original data into your format?
Thank you so much for any help.
Hi guys,
Thanks for the code and examples, much appreciated.
I know that you are actively working on more examples.
Can you add some examples of Multi-Node, Multi-GPU training?
That would be very helpful.
Best,
Maruf
I changed the relevant save path inside the config file, and I did get the corresponding log file, but I can't open it with tensorboard, and it reports an error: log file could not be found in this directory. Is there a problem with the file itself? How can I fix it?
Hello
Thank you for the great work. My query is about how the region boundaries are calculated. In the regional forecasting task, each region is defined by a latitude/longitude range. For example, North America:
'NorthAmerica': { # 8x14
    'lat_range': (15, 65),
    'lon_range': (220, 300)
},
and the global one is obvious:
'Global': { # 32, 64
    'lat_range': (-90, 90),
    'lon_range': (0, 360)
}
Compared with the rectangular domains in https://cordex.org/wp-content/uploads/2012/11/CORDEX-domain-description_231015.pdf, the ranges in the code do not seem to match. Could you please break down how you calculated the ranges, for example for North America, so that the same steps can be applied to other regions?
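As a rough illustration of how a lat/lon box like the ones quoted above can be mapped onto grid cells, here is a minimal sketch. It assumes a regular 5.625-degree grid with cell centres and longitudes in the 0 to 360 convention; the grid coordinates, `to_0_360`, and `cells_in_region` are assumptions made for illustration and are not taken from the repository, so the resulting counts may differ from the comments in the code.

```python
import numpy as np

def to_0_360(lon_deg):
    """Convert a longitude in [-180, 180] to the [0, 360) convention."""
    return lon_deg % 360

def cells_in_region(lat, lon, lat_range, lon_range):
    """Return (row_indices, col_indices) of grid cells inside the box (inclusive)."""
    rows = np.nonzero((lat >= lat_range[0]) & (lat <= lat_range[1]))[0]
    cols = np.nonzero((lon >= lon_range[0]) & (lon <= lon_range[1]))[0]
    return rows, cols

# Assumed 5.625-degree grid of cell centres; the repository's actual
# coordinates may differ, which would change the counts.
lat = np.linspace(-87.1875, 87.1875, 32)
lon = np.arange(64) * 5.625

# 140W..60W expressed in the 0..360 convention gives (220, 300).
lon_range = (to_0_360(-140), to_0_360(-60))
rows, cols = cells_in_region(lat, lon, lat_range=(15, 65), lon_range=lon_range)
print(len(rows), "x", len(cols))
```

Domain descriptions that give longitudes in the -180 to 180 convention need the `% 360` conversion before they can be compared with ranges such as (220, 300).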
Thanks @rejuvyesh @tung-nd
Hello,
I've gone through the ClimaX documentation and understand the data preparation and training process. However, I'm unsure how to use the trained model for nowcasting or short-term weather prediction. Is this possible? If so, could you guide me on how to do this?
Thanks.
Hi there,
I am trying to run the pre-training code. However, because of the following issue, pre-training cannot be run on multiple nodes:
a) In line 134 of the pre-train config file ('https://github.com/microsoft/ClimaX/blob/main/configs/pretrain_climax.yaml'), only one dataset key is defined, which is 'mpi-esm'.
b) On the other hand, line 177 of datamodule.py ('https://github.com/microsoft/ClimaX/blob/main/src/climax/pretrain/datamodule.py') asserts that the number of data dictionary keys has to match the number of nodes (assert num_nodes == len(self.dict_data_train.keys())).
c) Putting a) and b) together: since only one dataset key is defined, pre-training can run on a single node only.
So, do you have a separate configuration with multiple datasets?
Or did you use only a single node with eight GPUs for training (for example, a single A100 DGX node)?
Could you please clarify?
I look forward to hearing from you.
Best,
Maruf
Are there any commands to install the environment that you can refer to?
Hi there,
Thanks for sharing the great work!
I'm following the https://microsoft.github.io/ClimaX/usage/ to create a pre-training database, and I'm having some problems.
After I run
`python src/climax/pretrain/train.py --config configs/pretrain_climax.yaml \
--trainer.strategy=ddp --trainer.devices=8 \
--trainer.max_epochs=100 \
--data.batch_size=16 \
--model.lr=5e-4 --model.beta_1="0.9" --model.beta_2="0.95" \
--model.weight_decay=1e-5
`
I encountered two problems:
3.1: The network reported an image size mismatch. Based on this error I changed img_size: [32, 64] to img_size: [128, 256] in the YAML file, and the code now runs, but I cannot find any setting in the earlier steps that determines the image size. How can I set the image size?
3.2: OOM. Even with the batch size set to 1, my server still runs out of memory (4*40G, V100). I guess some variables are kept in memory during pre-training, but I am not sure how to deal with that.
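The two img_size values mentioned above match the grid shapes of regular global grids at 5.625 and 1.40625 degrees, which suggests the image size follows from the resolution of the data being loaded. The sketch below shows that arithmetic and how the patch count grows with resolution. The `patch_size` value and both helper names are assumptions for illustration, not values read from the repository's config.

```python
def grid_shape(resolution_deg):
    """Number of (lat, lon) cells on a regular global grid of the given resolution."""
    return int(round(180 / resolution_deg)), int(round(360 / resolution_deg))

def num_patches(img_size, patch_size):
    """Patches per input variable for a ViT-style patch embedding."""
    h, w = img_size
    return (h // patch_size) * (w // patch_size)

coarse = grid_shape(5.625)      # matches img_size: [32, 64]
fine = grid_shape(1.40625)      # matches img_size: [128, 256]
patch_size = 2                  # assumed value; check the YAML config

print(coarse, num_patches(coarse, patch_size))
print(fine, num_patches(fine, patch_size))
```

With the same patch size, the finer grid has 16 times as many patches per variable, so memory use can rise sharply even at batch size 1. Whether that fully explains the reported OOM is not established here.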
Thanks for your time!
If I use Docker to build the image as described in the instructions, does the image name have to follow DNS rules, and therefore be lowercase?
After I changed the image name to upper case, the build could not complete and failed with environment errors.
Hi, thank you for releasing the code for your paper; it's been a great help for our own research. When fine-tuning the pretrained models using your code, we noticed that the lr scheduler was behaving unexpectedly, and after some searching we believe the reason lies in this code snippet:
ClimaX/src/climax/global_forecast/module.py
Lines 223 to 230 in efd6de4
It seems to me that the interval should be set as "epoch" and not "step" in line 230, as the milestones for the scheduler are also given in epochs.
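To see why the interval matters, the sketch below uses a hypothetical warmup-plus-cosine schedule whose milestones are expressed in epochs and compares ticking it once per epoch with ticking it once per batch. PyTorch Lightning's lr scheduler config accepts "epoch" or "step" for the interval key; the schedule function and all numbers here are illustrative assumptions, not the repository's code.

```python
import math

def warmup_cosine_lr(t, warmup, total, base_lr):
    """Linear warmup for `warmup` ticks, then cosine decay until `total` ticks."""
    if t < warmup:
        return base_lr * (t + 1) / warmup
    progress = min(1.0, (t - warmup) / max(1, total - warmup))
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

warmup_epochs, max_epochs, steps_per_epoch = 5, 50, 100  # illustrative values

# interval="epoch": the scheduler ticks once per epoch, so after one epoch t == 1
# and the schedule is still in warmup.
lr_epoch_interval = warmup_cosine_lr(1, warmup_epochs, max_epochs, base_lr=1.0)

# interval="step": the scheduler ticks once per batch, so after one epoch t == 100
# and the whole 50-tick schedule is already exhausted.
lr_step_interval = warmup_cosine_lr(steps_per_epoch, warmup_epochs, max_epochs, base_lr=1.0)

print(lr_epoch_interval, lr_step_interval)
```

In this toy setup, a per-step interval with epoch-based milestones collapses the learning rate within the first epoch, which is consistent with the unexpected behaviour described in the report.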
Hi guys,
I noticed you did a few commits today and yesterday.
You added a new folder and modified several other files in commit no. 3 (#3).
I downloaded the code a week ago and was doing some work. Now, I do not know the impact of the changed code and whether I need to switch to the latest commit.
Could you please add release notes listing the reasons for code changes, such as bugs fixed or new features added? Such information would be very helpful for debugging.
Best,
Maruf
In attempting the Downscaling task, following the publicly available code on GitHub did not yield the reported performance in the paper. Specifically, the Root Mean Squared Error (RMSE) for T2m was 6.08, whereas the paper reports 2.79. I am uncertain if there are key points I should be mindful of to address this discrepancy.
I noticed some discrepancies between the descriptions in the paper and the provided code, such as the setting of the learning rate. Despite trying various combinations, I have been unable to obtain the correct results. I would appreciate your advice and guidance on this matter.
I would also like to ask about the choice of pre-trained model: should I select the 1.40625-degree model? I have run into some confusion during my attempts and would appreciate your opinion on this.
Hello, I have a question regarding the use of the ClimaX code.
I have collected data separately, and it has already been cropped for a specific region. I want to use this data for training with the GlobalForecast code.
I understand that the RegionForecast mentioned in the paper was trained by specifying regions for the entire dataset. Are there any specific considerations to keep in mind when using the GlobalForecast code without using the RegionForecast code?
For example, should I adjust the 'lat' input when calculating the loss, and so on?
Thank you.
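On the question of the 'lat' input to the loss: a latitude-weighted metric typically weights each grid row by the cosine of its latitude, normalised over the latitudes actually present. The sketch below shows that idea for a cropped region. It is a generic illustration under that assumption, with hypothetical function names and toy data, and does not confirm how the repository handles cropped inputs.

```python
import numpy as np

def latitude_weights(lat_deg):
    """cos(latitude) weights normalised to mean 1 over the given latitudes."""
    w = np.cos(np.deg2rad(lat_deg))
    return w / w.mean()

def lat_weighted_rmse(pred, target, lat_deg):
    """RMSE over a (lat, lon) field with each row weighted by its latitude."""
    w = latitude_weights(lat_deg)[:, None]   # shape (lat, 1), broadcasts over lon
    return float(np.sqrt(np.mean(w * (pred - target) ** 2)))

# Hypothetical cropped region: 4 latitudes x 3 longitudes.
region_lat = np.array([30.0, 35.0, 40.0, 45.0])
pred = np.ones((4, 3))
target = np.zeros((4, 3))

print(latitude_weights(region_lat))
print(lat_weighted_rmse(pred, target, region_lat))
```

The point of the sketch is that the weights are computed from the region's own latitude array, so a dataset that is already cropped would need its own latitudes passed in rather than the global ones.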
Thank you for contributing to the open source.
I am using ClimaX to fine-tune on different domain data.
I have observed that the training is terminating prematurely.
However, I am unable to identify which part of the code is causing this.
Could you help me with this?
Thank you.
When using ocean variable data for model training, there are many NaN values that need to be handled. What value should be used to fill them: 0 or the mean?
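The two options raised in this question can be compared with a small sketch. This is a generic illustration, not a recommendation from the ClimaX authors; `fill_nans` and the toy field are hypothetical.

```python
import numpy as np

def fill_nans(field, strategy="mean"):
    """Return (filled_field, valid_mask); NaNs replaced by 0 or by the mean of valid cells."""
    valid = ~np.isnan(field)
    fill_value = 0.0 if strategy == "zero" else float(field[valid].mean())
    return np.where(valid, field, fill_value), valid

# Toy ocean field with one missing (for example, land) cell.
sst = np.array([[290.0, np.nan], [288.0, 286.0]])

filled_zero, mask = fill_nans(sst, "zero")
filled_mean, _ = fill_nans(sst, "mean")
print(filled_zero)
print(filled_mean)
```

If the data are standardised afterwards with the same mean, filling with the mean before normalisation gives the same result as filling with 0 after it. Keeping the returned mask also makes it possible to exclude the filled cells from the loss.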
Hi guys
I learned from the paper that ClimaX can be used for downscaling. How do I run a downscaling model? I cannot find anything about it in the source code or on the ClimaX website. Can anyone give me some examples?
Hey, I have a query regarding the predict_range and hrs_each_step variables.
If I read it correctly, predict_range is implemented as a number of indices, assuming that the data are spaced 1 hour apart. Am I correct? (What I mean is that, given a predict_range of 2, if the index of the input is 3, then the index of the output is 5.)
My query is based on the fact that I am fine-tuning on 3-hourly data, so if I need to make 72-hour predictions, I need to set predict_range to 24 and hrs_each_step to 3.
Another question: I am unable to figure out why this is being done in the class Forecast in dataset.py. Why is it required to reduce the input by predict_range?
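The index arithmetic described in this question can be sketched as follows, assuming predict_range counts index steps rather than hours (which is exactly what the poster is asking to have confirmed). `make_pairs` is a hypothetical helper, not the repository's Forecast class.

```python
import numpy as np

def make_pairs(data, predict_range):
    """Pair each input index i with the target at index i + predict_range."""
    inputs = data[:-predict_range]   # the last predict_range samples have no target
    targets = data[predict_range:]
    return inputs, targets

hrs_each_step = 3                                   # data sampled every 3 hours
lead_time_hours = 72
predict_range = lead_time_hours // hrs_each_step    # 24 index steps

timestamps = np.arange(0, 240, hrs_each_step)       # hours since start, 80 samples
inputs, targets = make_pairs(timestamps, predict_range)
print(len(inputs), targets[0] - inputs[0])
```

Under this assumption, trimming the input by predict_range simply drops the trailing samples whose targets would fall beyond the end of the data, so inputs and targets have the same length.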