Hi there, I am trying to run the pre-training code. However, for the

The pre-train code has a bug with the number of nodes (ClimaX, closed)

mahm1846 commented on July 18, 2024

Comments (5)

mahm1846 commented on July 18, 2024

Hi Tung,

Thanks for your help. I have completed the ERA5 preprocessing. However, I have a question about the CMIP data: did you use the WeatherBench CMIP dataset for pre-training, or a different one?

Also, you have snakemake files for data download and processing (https://github.com/tung-nd/climax_all/tree/main/snakemake_configs), which is good. However, there are no instructions for running the code (https://github.com/tung-nd/climax_all#readme). Could you please add the steps for data preprocessing?

I look forward to your reply.

tung-nd commented on July 18, 2024

Hi,

Thank you for your interest in ClimaX. The current code requires that the number of pretraining datasets match the number of nodes, for efficiency reasons (reading too many data files belonging to multiple datasets on the same node creates a bottleneck). To get the results in the paper we did pretrain on multiple datasets using multiple nodes. The config in this repo only serves as an example, and users are free to create their own config files for multi-dataset pretraining. As a reference, you can look at this config file: https://github.com/tung-nd/climax_all/blob/main/configs/train_tokenized_vit_multi_cmip6_continuous.yaml

mahm1846 commented on July 18, 2024

Thanks very much for the reply.
We are really interested in your model. We have our ERA5 and CMIP6 replicas on disk and have already processed the ERA5 data for pre-training. We hope to preprocess CMIP6 as well. However, the multi-node training issue is not yet solved.

If two nodes are used, their ranks would be 0 and 1, respectively.
Now, lines 179-182 of datamodule.py (https://github.com/microsoft/ClimaX/blob/main/src/climax/pretrain/datamodule.py) load only the dataset whose key index matches the node rank:

            for idx, k in enumerate(self.dict_data_train.keys()):
                if idx == node_rank:
                    data_train = self.dict_data_train[k]
                    break

So, with a single dataset, the entire dataset is loaded on the first node and the second node gets no data. In other words, it is a "one dataset, one node" rule.
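To make the behaviour concrete, here is a minimal standalone sketch of the selection loop quoted above. It is not the ClimaX code itself: the function name `select_dataset_for_node` and the placeholder dataset value are hypothetical, and returning `None` when no key index matches is an assumption made for illustration.

```python
def select_dataset_for_node(dict_data_train, node_rank):
    """Mimic the quoted loop: pick the dataset whose key index equals node_rank.

    Returns None when no dataset index matches (an assumption for this
    sketch; the quoted loop simply never assigns data_train in that case).
    """
    for idx, k in enumerate(dict_data_train.keys()):
        if idx == node_rank:
            return dict_data_train[k]
    return None


# One dataset (placeholder value), two nodes with ranks 0 and 1.
datasets = {"era5": "ERA5 training data"}
assignment = {rank: select_dataset_for_node(datasets, rank) for rank in range(2)}

for rank, data in assignment.items():
    print(f"node {rank}: {data}")
```

With one dataset and two nodes, only rank 0 is matched, which is the situation described in this comment.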

Even if one uses multiple datasets, each dataset will use only one node, and the rest of the nodes will be idle.
If we want to distribute each dataset across the rest of the cluster, how can we solve this?

tung-nd commented on July 18, 2024

Yes, right now it is a "one dataset, one node" rule. A hack we used to get # nodes > # datasets was to treat each dataset as multiple sub-datasets.

For example, if you want # nodes = 2 x # datasets, you can divide each dataset into 2 non-overlapping subsets. This is done in the datamodule, so the data does not have to be preprocessed again manually.

You can see how we do this in our old codebase at https://github.com/tung-nd/climax_all. You change the config file https://github.com/tung-nd/climax_all/blob/main/configs/train_tokenized_vit_multi_cmip6_continuous.yaml to something like

dict_start_idx: {
    'mpi-esm-1': 0,
    'mpi-esm-2': 0.5,
}
dict_end_idx: {
    'mpi-esm-1': 0.5,
    'mpi-esm-2': 1.0,
}

How the datamodule interprets these values can be seen here: https://github.com/tung-nd/climax_all/blob/main/src/datamodules/pretrain_multi_source_module.py
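As a rough illustration of the idea, the sketch below treats the start and end values as fractions of a dataset's file list and slices accordingly. This is an assumption about how such fractions could be applied, not a copy of the linked datamodule; `slice_by_fraction` and the shard file names are hypothetical, and it assumes both sub-dataset keys point to the same underlying files.

```python
def slice_by_fraction(file_list, start_frac, end_frac):
    """Return the contiguous part of file_list between two fractions in [0, 1]."""
    n = len(file_list)
    return file_list[int(start_frac * n):int(end_frac * n)]


# Hypothetical shard names, for illustration only.
files = [f"shard_{i}.npz" for i in range(10)]

# Same values as in the config snippet above.
dict_start_idx = {"mpi-esm-1": 0, "mpi-esm-2": 0.5}
dict_end_idx = {"mpi-esm-1": 0.5, "mpi-esm-2": 1.0}

# Each sub-dataset gets its own non-overlapping slice of the same files.
subsets = {
    name: slice_by_fraction(files, dict_start_idx[name], dict_end_idx[name])
    for name in dict_start_idx
}

for name, subset in subsets.items():
    print(name, subset)
```

Because the two slices do not overlap and together cover the whole list, two nodes can each take one sub-dataset under the "one dataset, one node" rule without reading the same files.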

tung-nd commented on July 18, 2024

@mahm1846 did it resolve your issues?
