Comments (5)
Hi Tung,
Thanks for your help. I have completed the ERA5 preprocessing. However, I have a question about the CMIP data: did you use the WeatherBench CMIP dataset for pre-training, or a different one?
Also, you have snakemake files for data download and processing (https://github.com/tung-nd/climax_all/tree/main/snakemake_configs), which is good. However, the README (https://github.com/tung-nd/climax_all#readme) has no instructions for running them. Could you please add the steps for data preprocessing?
I look forward to your reply.
from climax.
Hi,
Thank you for your interest in ClimaX. The current code requires the number of pretraining datasets to match the number of nodes for efficiency reasons (reading many data files belonging to multiple datasets on the same node creates a bottleneck). To get the results in the paper, we did pretrain on multiple datasets using multiple nodes. The config in this repo serves only as an example, and users are free to create their own config files for multi-dataset pretraining. For reference, you can look at this config file: https://github.com/tung-nd/climax_all/blob/main/configs/train_tokenized_vit_multi_cmip6_continuous.yaml
from climax.
Thanks very much for the reply.
We are really interested in your model. We have our ERA5 and CMIP6 replicas on disk and have already processed the ERA5 data for pre-training. We are hoping to preprocess CMIP6 as well. However, the multi-node training issue is not solved yet.
If two nodes are used, their ranks would be 0 and 1, respectively.
Now, lines 179-182 of datamodule.py (https://github.com/microsoft/ClimaX/blob/main/src/climax/pretrain/datamodule.py) load only the dataset whose key index matches the node rank:
    for idx, k in enumerate(self.dict_data_train.keys()):
        if idx == node_rank:
            data_train = self.dict_data_train[k]
            break
So, with a single dataset, the entire dataset is loaded on the first node and the second node gets no data. In effect, it is a "one dataset, one node" rule.
Even if one uses multiple datasets, each dataset uses only one node, and any remaining nodes are idle.
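To make the behaviour concrete, here is a minimal standalone sketch of the selection loop quoted above. The function name `pick_dataset_for_node` and the placeholder dictionary are hypothetical and not part of ClimaX. The sketch returns None for a rank with no matching dataset, which is an assumption for illustration only; the quoted lines do not show what the real datamodule does in that case.

```python
def pick_dataset_for_node(dict_data_train, node_rank):
    """Mimic the quoted loop: return the key whose position in the dict
    equals node_rank, or None if no dataset sits at that position.
    (Hypothetical helper, not part of ClimaX.)"""
    for idx, key in enumerate(dict_data_train.keys()):
        if idx == node_rank:
            return key
    return None


# One dataset, two nodes (ranks 0 and 1). The value is a placeholder.
datasets = {"mpi-esm": "placeholder for the training data"}
assignment = {rank: pick_dataset_for_node(datasets, rank) for rank in range(2)}
print(assignment)  # {0: 'mpi-esm', 1: None}
```

Rank 0 receives the only dataset and rank 1 receives nothing, which is the "one dataset, one node" situation described above.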
If we want to distribute each dataset across the rest of the cluster, how can we solve this?
from climax.
Yes, right now it is a "one dataset, one node" rule. A hack we used to get # nodes > # datasets was to treat each dataset as multiple sub-datasets.
For example, if you want # nodes = 2 x # datasets, you can divide each dataset into 2 non-overlapping subsets. This is done in the datamodule, so the data does not have to be preprocessed again manually.
You can see how we do this in our old codebase at https://github.com/tung-nd/climax_all. Change the config file https://github.com/tung-nd/climax_all/blob/main/configs/train_tokenized_vit_multi_cmip6_continuous.yaml to something like:
    dict_start_idx: {
        'mpi-esm-1': 0,
        'mpi-esm-2': 0.5,
    }
    dict_end_idx: {
        'mpi-esm-1': 0.5,
        'mpi-esm-2': 1.0,
    }
How the datamodule interprets these values can be seen here: https://github.com/tung-nd/climax_all/blob/main/src/datamodules/pretrain_multi_source_module.py
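As a rough illustration of the idea, the sketch below builds fractional start/end ranges for a dataset and uses them to slice a list of data files into non-overlapping subsets. It assumes the fractions are applied to the length of the file list; the linked datamodule is the authoritative reference. The helper names `make_split_ranges` and `select_files`, and the shard file names, are hypothetical.

```python
def make_split_ranges(name, n_splits):
    """Build start/end fraction dicts that cut one dataset into n_splits
    equal, non-overlapping sub-datasets (hypothetical helper)."""
    start, end = {}, {}
    for i in range(n_splits):
        key = f"{name}-{i + 1}"
        start[key] = i / n_splits
        end[key] = (i + 1) / n_splits
    return start, end


def select_files(file_list, start_frac, end_frac):
    """Keep the files whose positions fall in [start_frac, end_frac) of the
    list. Assumption: fractions are applied to the file-list length."""
    n = len(file_list)
    return file_list[int(start_frac * n):int(end_frac * n)]


files = [f"shard_{i}.npz" for i in range(10)]  # hypothetical file names
dict_start_idx, dict_end_idx = make_split_ranges("mpi-esm", 2)
subsets = {
    key: select_files(files, dict_start_idx[key], dict_end_idx[key])
    for key in dict_start_idx
}
for key, subset in subsets.items():
    print(key, len(subset), subset[0], subset[-1])
```

With two splits, each sub-dataset gets half of the files, so two nodes can each read a different half of the same dataset.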
from climax.
@mahm1846 did it resolve your issues?
from climax.