microsoft / climax

Foundation model for weather & climate

Home Page: https://microsoft.github.io/ClimaX/

License: MIT License

Dockerfile 0.70% Python 99.30%

climax's Issues

Would it be possible to kindly share the downscaling data?

I apologize for this request, but the raw data is extremely large and inconvenient to download and organize. Would it be possible to share the downscaled dataset after processing?

Questions regarding pre-training

Thank you for the open-source code and detailed documentation of the experiments. I had a few questions about pre-training, which I can't seem to find in the paper or appendix. Could you please help?

How long did pre-training take on the 80 V100 GPUs?

The learning rate schedule in the appendix suggests that pre-training has a total of 200,000 steps and fine-tuning has 100,000 steps, but I'm not sure (for one, I don't think the number of fine-tuning steps would be the same for all downstream tasks)

What is the temporal resolution of the initial states of ERA5 used in training and evaluation?
Are there any results on ERA5 performance without pre-training (at all)?

Thanks.

Training fails for any `--data.num_workers` value greater than 1

I was running the global training on 4 GPUs and 48 CPUs.
The number-of-workers parameter (--data.num_workers) works with the value 1.
However, training failed when I increased the number of workers for more efficient data loading.
I tried the values 48, 24, 12, and 8, but all failed with the following error:

Traceback (most recent call last):
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
    results = self._run_stage()
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
    self._run_train()
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1150, in _run_train
    self._run_sanity_check()
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1224, in _run_sanity_check
    self._call_callback_hooks("on_sanity_check_end")
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1340, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/rich_progress.py", line 358, in on_sanity_check_end
    assert self.val_sanity_progress_bar_id is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/g/data/wb00/admin/staging/ClimaX/src/climax/global_forecast/train.py", line 41, in <module>
    main()
  File "/g/data/wb00/admin/staging/ClimaX/src/climax/global_forecast/train.py", line 34, in main
    cli.trainer.fit(cli.model, datamodule=cli.datamodule)
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
    call._call_and_handle_interrupt(
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 59, in _call_and_handle_interrupt
    trainer.strategy.reconciliate_processes(traceback.format_exc())
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 461, in reconciliate_processes
    raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 0 
 Traceback (most recent call last):
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
    return function(*args, **kwargs)
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
    results = self._run_stage()
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
    self._run_train()
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1150, in _run_train
    self._run_sanity_check()
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1224, in _run_sanity_check
    self._call_callback_hooks("on_sanity_check_end")
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1340, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "/dir/proj/terry/env/climax/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/rich_progress.py", line 358, in on_sanity_check_end
    assert self.val_sanity_progress_bar_id is not None
AssertionError

At the moment, training works only with a single worker; any larger value produces the error above.
Any suggestions on how to solve this?

Required training time

In your paper, you mention that the model is pre-trained on 80 V100s. How long was the model trained for? I'm working on a review paper and would like an estimate of training time.

Fine tuning regional forecasts at a higher resolution

Hi. Thanks for the great work!

I can't quickly make it out at a glance.
Does the model support fine-tuning regional forecasts at a higher resolution than the one it was pretrained on?

How to download the IFS data?

Hi,
I am trying to download the IFS data from the TIGGE archive, as mentioned in the paper. However, I am not able to connect to the servers listed in Bougeault et al., 2010, such as CMA, ECMWF and TIGGE. Could you please share how you downloaded the data?

Thanks a lot! :)

Pretrain dataset release

I am looking forward to the pre-train dataset and script release.
Is there any update?

What is the point of the hrs_each_step variable?

Hi,

From my understanding, the hrs_each_step variable in the global_forecast_climax.yaml file tracks how many hours lie between input samples. ERA5, for example, is hourly, so the default is to fine-tune on all of the available data, with hrs_each_step set to 1. However, if you change hrs_each_step to any other value, you still get the same number of input examples as with 1, and the results get worse the larger hrs_each_step becomes.

I have fine-tuned models for different lead times (for example, 48 hours) using hrs_each_step = 1. When I generate predictions on the test set, my understanding is that with hrs_each_step = 1 there should be hourly examples, whereas with hrs_each_step = 24 I would be making daily predictions at a fixed time of day. This is not how it currently works.

Can you explain exactly what the hrs_each_step variable is supposed to do, if not be the number of hours in between each input sample?

Training ClimaX without pre-training

Hi,

Thanks for the repo. Is there a way to train the model without pre-training on ERA5? When I try the command

python -u src/climax/global_forecast/train.py --config configs/global_forecast_climax.yaml --trainer.strategy=ddp --trainer.devices=1 --trainer.max_epochs=50 --data.root_dir=process_data --data.predict_range=24 --data.out_variables=['z_500','t_850','t2m','u10'] --data.batch_size=16 --model.pretrained_path="" --model.lr=5e-7 --model.beta_1="0.9" --model.beta_2="0.99" --model.weight_decay=1e-5

It automatically starts with pre-training. Is there a flag somewhere to disable the pre-training and train on ERA5?

Regards,
Yogesh

Conflicts on conda installation

Hello, thank you for this package.
However, I am having trouble installing it.
Creating the conda environment with
conda env create --file docker/environment.yml
prints 'Looking for incompatible packages' and, after some time, aborts and reports multiple conflicts.

Cannot access pretrained weight

Hi there,

I would like to ask whether the pretrained weight at https://climaxrelease.blob.core.windows.net/checkpoints/ClimaX-5.625deg.ckpt has been opened for public access. If not, could you kindly let me know how to get it?

Many thanks.

Preprocessed Datasets failed to download

Hi guys,

I'm trying to download your preprocessed 5.625deg datasets from dataserv.ub.tum.de. I tried for a whole day, but the website responds very slowly even though my network connection is good, and the download never completes.

Could you host the dataset somewhere else, such as OneDrive or Google Drive?

Or give more instructions on how to pre-process the original data into your format?

Thank you so much if you can help.

Multi-Node GPU train Example.

Hi guys,

Thanks for the code and examples, much appreciated.

I know that you are actively working on more examples.
Can you add some examples of multi-node, multi-GPU training?
That would be very helpful.

Best,
Maruf

How to view log files in tensorboard format?

I changed the relevant save path inside the config file, and I did get the corresponding log file, but I can't open it with tensorboard, and it reports an error: log file could not be found in this directory. Is there a problem with the file itself? How can I fix it?

Regional forecasting lat long boundaries

Hello,
Thank you for the great work. My query is about how the region boundaries are calculated. In the regional forecasting task, boundaries are given as lat/lon ranges. For example, for North America:

'NorthAmerica': { # 8x14
    'lat_range': (15, 65),
    'lon_range': (220, 300)
},

and the global case is obvious:

'Global': { # 32, 64
    'lat_range': (-90, 90),
    'lon_range': (0, 360)
}

Given the file https://cordex.org/wp-content/uploads/2012/11/CORDEX-domain-description_231015.pdf
with the rectangular legend, the boundaries listed there do not match what is in the code. Could you please break down how you calculated the ranges, for example for North America, so the steps can be mapped to other regions?
Thanks @rejuvyesh @tung-nd

How to use trained ClimaX model for predictions?

Hello,
I've gone through the ClimaX documentation and understand the data preparation and training process. However, I'm unsure how to use the trained model for nowcasting or short-term weather prediction. Is this possible? If so, could you guide me on how to do this?
Thanks.

The pre-train code has a bug with the number of nodes

Hi there,

I am trying to run the pre-training code. However, because of the following bug, pre-training cannot be run on multiple nodes:

a) In line 134 of the pre-train config file ('https://github.com/microsoft/ClimaX/blob/main/configs/pretrain_climax.yaml'), only one dataset key is defined, which is 'mpi-esm'.

b) On the other hand, line 177 of the datamodule.py ('https://github.com/microsoft/ClimaX/blob/main/src/climax/pretrain/datamodule.py') asserts that the number of data dictionary key has to match the number of nodes (assert num_nodes == len(self.dict_data_train.keys()))

c) Putting a) and b) together: since only one dataset key is defined, pre-training can run on a single node only.
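
The constraint described in a) to c) can be reproduced in isolation. The sketch below is a simplified stand-in, not the repository's datamodule: `check_node_count` and the placeholder dictionary value are hypothetical, and only the assertion itself is taken from the line quoted in b).

```python
def check_node_count(num_nodes: int, dict_data_train: dict) -> None:
    """Mirror the quoted assertion: one dataset key is expected per node."""
    assert num_nodes == len(dict_data_train.keys()), (
        f"{num_nodes} node(s) but {len(dict_data_train)} dataset key(s)"
    )


# Placeholder for the single 'mpi-esm' entry in the pre-train config.
single_source = {"mpi-esm": "<train data for mpi-esm>"}

check_node_count(1, single_source)  # passes: one node, one dataset key

try:
    check_node_count(2, single_source)  # two nodes, still one dataset key
except AssertionError as err:
    print("multi-node run rejected:", err)
```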

So, do you have a separate configuration with multiple datasets?
Or did you use only a single node with eight GPUs for training (for example, a single A100 DGX node)?
Could you please clarify?

I look forward to hearing from you.

Best,
Maruf

How to install the running environment

Are there any commands to install the environment that you can refer to?

Pretrain dataset prepare and Out-of-memory problem

Hi there,

Thanks for sharing the great work!

I'm following the https://microsoft.github.io/ClimaX/usage/ to create a pre-training database, and I'm having some problems.

1. Using CMIP6 as an example, how much space do I need to store all the files, e.g. "2m_temperature", "10m_u_component_of_wind", "10m_v_component_of_wind", etc.?
2. Is there a way I can reuse the method ClimateLearn provides for downloading the dataset (https://climatelearn.readthedocs.io/en/latest/user-guide/tasks_and_datasets.html#weatherbench-cmip6-download)?
3. I ran the snakemake file and only got 2m_temperature (because of network connection issues and limited disk capacity), then I changed dict_in_variables in the config file pretrain_climax.yaml to something like this:

[dict_in_variables snippet unreadable in the source]

After I ran

python src/climax/pretrain/train.py --config configs/pretrain_climax.yaml \
    --trainer.strategy=ddp --trainer.devices=8 \
    --trainer.max_epochs=100 \
    --data.batch_size=16 \
    --model.lr=5e-4 --model.beta_1="0.9" --model.beta_2="0.95" \
    --model.weight_decay=1e-5

I encountered two problems:
3.1: The network input raised an image size mismatch error. Based on this error I changed img_size: [32, 64] to img_size: [128, 256] in the yaml file and the code now runs, but I cannot find any setting in the previous steps that determines the image size. How can I set the image size?

3.2: OOM. Even though I have set the batch size to 1, my server still reports OOM (4*40G, V100). I guess some variables are kept in memory during pre-training, but I am not sure how to deal with that.
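
On the image-size question in 3.1: on a regular latitude-longitude grid the array shape follows from the grid spacing, which would explain why `[32, 64]` and `[128, 256]` both appear. The sketch below is plain arithmetic, assuming a global grid with no duplicated pole row or wrap-around column; `grid_shape` is a hypothetical helper, not part of ClimaX.

```python
def grid_shape(resolution_deg: float) -> tuple:
    """Return (n_lat, n_lon) for a global regular grid with the given spacing."""
    n_lat = round(180 / resolution_deg)  # latitude spans 180 degrees
    n_lon = round(360 / resolution_deg)  # longitude spans 360 degrees
    return n_lat, n_lon


coarse = grid_shape(5.625)    # (32, 64)
fine = grid_shape(1.40625)    # (128, 256)

# Grid cells per variable grow by this factor when moving to the finer grid.
cell_ratio = (fine[0] * fine[1]) / (coarse[0] * coarse[1])
print(coarse, fine, cell_ratio)
```

If the downloaded files are on the finer grid, each sample has 16 times as many grid cells as at 5.625 degrees, which is one plausible contributor to the out-of-memory error in 3.2.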

Thanks for your time!

If I use Docker to build the image as described in the instructions, must the image name be lowercase to obey DNS rules?

If I use Docker to build the image as described in the instructions, the name should obey DNS rules, so must the image name be lowercase?
After I changed the image name to uppercase, the build could not complete and failed with environment errors.

Possible bug in lr scheduler

Hi, thank you for releasing the code for your paper, it's been a great help for our own research. When finetuning the pretrained models using your code we noticed that the lr scheduler was behaving unexpectedly, and after some search we believe the reason for that lies in this code snippet:

ClimaX/src/climax/global_forecast/module.py

Lines 223 to 230 in efd6de4

    
    lr_scheduler = LinearWarmupCosineAnnealingLR(
        optimizer,
        self.hparams.warmup_epochs,
        self.hparams.max_epochs,
        self.hparams.warmup_start_lr,
        self.hparams.eta_min,
    )
    scheduler = {"scheduler": lr_scheduler, "interval": "step", "frequency": 1}

It seems to me that the interval should be set as "epoch" and not "step" in line 230, as the milestones for the scheduler are also given in epochs.
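
The effect of the interval can be seen with a closed-form schedule. The sketch below is a stand-in for `LinearWarmupCosineAnnealingLR` (linear warmup followed by cosine decay), not the library's implementation; `warmup_cosine_lr` and all numeric values are hypothetical.

```python
import math


def warmup_cosine_lr(t, warmup, total, base_lr, start_lr=0.0, eta_min=0.0):
    """Learning rate after `t` scheduler ticks (a tick is a step or an epoch)."""
    if t < warmup:
        # Linear ramp from start_lr up to base_lr over `warmup` ticks.
        return start_lr + (base_lr - start_lr) * t / warmup
    # Cosine decay from base_lr to eta_min over the remaining ticks.
    progress = min(1.0, (t - warmup) / max(1, total - warmup))
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


warmup_epochs, max_epochs, base_lr = 5, 50, 5e-4  # illustrative values
steps_per_epoch = 1000                              # illustrative value

# interval="epoch": after the first epoch the scheduler has ticked once.
lr_epoch_interval = warmup_cosine_lr(1, warmup_epochs, max_epochs, base_lr)

# interval="step": after the first epoch it has ticked steps_per_epoch times,
# far past max_epochs, so this sketch has already decayed to eta_min.
lr_step_interval = warmup_cosine_lr(steps_per_epoch, warmup_epochs, max_epochs, base_lr)

print(lr_epoch_interval, lr_step_interval)
```

The real scheduler's behaviour past `max_epochs` ticks may differ from the clamping used here; the point is only that warmup and decay are consumed in optimizer steps rather than epochs when the milestones are given in epochs.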

Add update release notes.

Hi guys,

I noticed you did a few commits today and yesterday.
You added a new folder and modified several other files in commit no. 3 (#3).
I downloaded the code a week ago and was doing some work. Now, I do not know the impact of the changed code and whether I need to switch to the latest commit.

Could you please add release notes listing the reasons for code changes, such as bugs fixed or new features added? Such information would be very helpful for debugging.

Best,
Maruf

The replication issues with the downscaling task.

In attempting the Downscaling task, following the publicly available code on GitHub did not yield the reported performance in the paper. Specifically, the Root Mean Squared Error (RMSE) for T2m was 6.08, whereas the paper reports 2.79. I am uncertain if there are key points I should be mindful of to address this discrepancy.
I noticed some discrepancies between the descriptions in the paper and the provided code, such as the setting of the learning rate. Despite trying various combinations, I have been unable to obtain the correct results. I would appreciate your advice and guidance on this matter.
I would like to inquire about the choice of the pre-training model—should I select the 1.40625-degree model? I have encountered some confusion during my attempts, and I am seeking your professional opinion on this matter.

Question about Using GlobalForecast Code with Pre-cropped Data

Hello, I have a question regarding the use of the ClimaX code.

I have collected data separately, and it has already been cropped for a specific region. I want to use this data for training with the GlobalForecast code.

I understand that the RegionForecast mentioned in the paper was trained by specifying regions for the entire dataset. Are there any specific considerations to keep in mind when using the GlobalForecast code without using the RegionForecast code?

For example, should I adjust the 'lat' input when calculating the loss, and so on?
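
For context on the `lat` question: latitude-weighted metrics in this area commonly weight each grid row by the cosine of its latitude, normalized by the mean over the rows in use. The sketch below assumes that convention and a hypothetical cropped region; `latitude_weights` is not a ClimaX function, and whether the repository's loss matches this exactly should be checked against its code.

```python
import math


def latitude_weights(lats_deg):
    """cos(latitude) weights, normalized to average 1 over the given rows."""
    cos_lat = [math.cos(math.radians(lat)) for lat in lats_deg]
    mean_cos = sum(cos_lat) / len(cos_lat)
    return [c / mean_cos for c in cos_lat]


# Hypothetical latitudes of a pre-cropped region (degrees north).
region_lats = [30.0, 35.0, 40.0, 45.0]
weights = latitude_weights(region_lats)

# Rows closer to the equator cover more area and receive larger weights.
print([round(w, 3) for w in weights])
```

Under this convention the weights depend on which latitudes are passed in, so cropped data would need its own latitude array rather than the global one.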

Thank you.

How can I check for early stopping conditions in this code?

Thank you for contributing to the open source.
I am using ClimaX to fine-tune on different domain data.
I have observed that the training is terminating prematurely.
However, I am unable to identify which part of the code is causing this.
Could you help me with this?

Thank you.
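
One common source of early termination in PyTorch Lightning training is an `EarlyStopping` callback, which halts once a monitored metric stops improving for a set number of validation checks. Whether the config in use enables one is an assumption to verify in its callbacks section. The sketch below only illustrates patience-based stopping; `first_stop_epoch` and the metric values are hypothetical.

```python
def first_stop_epoch(val_metrics, patience, min_delta=0.0):
    """Return the epoch index at which patience runs out, or None if it never does."""
    best = float("inf")
    bad_checks = 0
    for epoch, value in enumerate(val_metrics):
        if value < best - min_delta:
            best = value      # improvement: reset the counter
            bad_checks = 0
        else:
            bad_checks += 1   # no improvement at this validation check
            if bad_checks >= patience:
                return epoch
    return None


# Hypothetical validation errors: improvement stalls after epoch 2.
history = [1.00, 0.80, 0.70, 0.71, 0.72, 0.70, 0.69]
print(first_stop_epoch(history, patience=3))
```

With a larger patience the same history would run to completion, so comparing the configured patience against the validation curve is a quick first check.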

How to handle NaN values in training data?

When using ocean variable data for model training, there are many NaN values that need to be processed. What value should be used to fill them: 0 or the mean?
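
Two common options are a constant fill and a per-variable mean fill; which one suits ocean variables is not settled by anything in this thread. The sketch below shows both on a flat list of values, assuming the data is later centered on the mean of the valid points; `fill_nan` is a hypothetical helper. With NumPy arrays, `np.nanmean` and `np.nan_to_num` cover the same ground.

```python
import math


def fill_nan(values, strategy="mean"):
    """Replace NaN entries with the mean of the valid entries, or with 0.0."""
    valid = [v for v in values if not math.isnan(v)]
    fill = sum(valid) / len(valid) if strategy == "mean" else 0.0
    return [fill if math.isnan(v) else v for v in values]


# Hypothetical sea-surface values with land points stored as NaN.
sst = [290.0, float("nan"), 294.0, float("nan"), 292.0]

mean_filled = fill_nan(sst, "mean")   # NaN -> 292.0
zero_filled = fill_nan(sst, "zero")   # NaN -> 0.0

# After centering on the valid-point mean, mean-filled cells become 0,
# so "fill with the mean" and "fill with 0 after normalization" coincide.
mu = 292.0
print([v - mu for v in mean_filled])
```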

How to train a downscaling model?

Hi guys
I learned from the paper that ClimaX is able to be used for downscaling. How to run a downscaling model? I cannot see anything from the source code or the ClimaX website. Can anyone give me some examples?

Predict Range and hrs_each_step

Hey, I had a query regarding the predict range and hrs_each_step variable.

If I understand correctly, predict_range is implemented as an index offset, which assumes the data is hourly. Am I correct? (What I mean is that, given a predict range of 2, if the input index is 3, then the output index is 5.)

My query arises because I am fine-tuning on 3-hourly data, so to make 72-hour predictions I need to set predict_range to 24 and hrs_each_step to 3.

Another question: I am unable to figure out why this is done in the class Forecast in dataset.py. Why is it required to reduce the input by the predict range?
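
The index arithmetic in this question can be written out explicitly. The sketch below assumes, as the question does, that `predict_range` counts samples rather than hours; `steps_for_lead_time` and `make_pairs` are hypothetical helpers and are not taken from `dataset.py`.

```python
def steps_for_lead_time(lead_time_hours: int, hours_per_sample: int) -> int:
    """Number of samples between an input and its target."""
    if lead_time_hours % hours_per_sample != 0:
        raise ValueError("lead time must be a multiple of the sampling interval")
    return lead_time_hours // hours_per_sample


def make_pairs(n_samples: int, predict_range: int):
    """Pair each input index with the index `predict_range` samples later."""
    # The last `predict_range` samples have no target inside the array,
    # so they cannot serve as inputs.
    return [(i, i + predict_range) for i in range(n_samples - predict_range)]


print(steps_for_lead_time(72, 3))        # 3-hourly data, 72-hour lead time
print(make_pairs(8, predict_range=2))    # input 3 pairs with output 5
```

Under that assumption, one plausible reading of the truncation is that dropping the final `predict_range` inputs keeps every input paired with an existing target.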

