
Comments (5)

mahm1846 commented on July 18, 2024

Hi Tung,

Thanks for your help; I have completed the ERA5 preprocessing. However, I have a question about the CMIP data: did you use the WeatherBench CMIP dataset for pre-training, or a different one?

You also have Snakemake files for data download and processing (https://github.com/tung-nd/climax_all/tree/main/snakemake_configs), which is good. However, the README (https://github.com/tung-nd/climax_all#readme) has no instructions for running them. Could you please add the steps for data preprocessing?

I look forward to your reply.


tung-nd commented on July 18, 2024

Hi,

Thank you for your interest in ClimaX. The current code requires the number of pretraining datasets to match the number of nodes, for efficiency reasons: reading many data files from multiple datasets on the same node creates a bottleneck. To get the results in the paper, we did pretrain on multiple datasets using multiple nodes. The config in this repo only serves as an example, and users are free to create their own config files for multi-dataset pretraining. As a reference, you can look at this config file: https://github.com/tung-nd/climax_all/blob/main/configs/train_tokenized_vit_multi_cmip6_continuous.yaml


mahm1846 commented on July 18, 2024

Thanks very much for the reply.
We are really interested in your model. We have ERA5 and CMIP6 replicas on disk and have already processed the ERA5 data for pre-training. We are hoping to preprocess CMIP6 as well. However, the multi-node training issue is not solved yet.

If two nodes are used, their ranks would be 0 and 1, respectively.
Now, lines 179-182 of datamodule.py (https://github.com/microsoft/ClimaX/blob/main/src/climax/pretrain/datamodule.py) load only the dataset whose key index matches the node rank:

            for idx, k in enumerate(self.dict_data_train.keys()):
                if idx == node_rank:
                    data_train = self.dict_data_train[k]
                    break

So, with a single dataset, the entire dataset is loaded on the first node and the second node gets no data. Thus, it is a "one dataset, one node" rule.
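
The selection rule above can be reproduced in isolation with a minimal sketch. The function name `select_dataset_for_node` and the toy dictionary are hypothetical and only mirror the quoted loop; they are not taken from the ClimaX code:

```python
def select_dataset_for_node(dict_data_train, node_rank):
    """Return the (name, data) pair whose key position equals node_rank.

    Mirrors the quoted loop: a node whose rank has no matching key
    position gets nothing (None).
    """
    for idx, (name, data) in enumerate(dict_data_train.items()):
        if idx == node_rank:
            return name, data
    return None


# Hypothetical setup: one dataset, two nodes (ranks 0 and 1).
datasets = {"mpi-esm": ["shard_0", "shard_1"]}
assignments = {rank: select_dataset_for_node(datasets, rank) for rank in range(2)}

for rank, assigned in assignments.items():
    print(rank, assigned)
```

In this toy setup, rank 0 receives the whole dataset and rank 1 receives `None`, which is the idle-node situation described here.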

Even if one uses multiple datasets, each dataset will use only one node, and any remaining nodes will be idle.
If we want to distribute each dataset across the rest of the cluster, how can we do that?


tung-nd commented on July 18, 2024

Yes, right now it is a "one dataset, one node" rule. A hack we used to get # nodes > # datasets was to treat each dataset as multiple sub-datasets.

For example, if you want # nodes = 2 x # datasets, you can divide each dataset into 2 non-overlapping subsets. This is done in the datamodule, so the data does not have to be preprocessed again manually.
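
As a rough illustration of this idea, the fractional boundaries for each sub-dataset could be generated rather than written by hand. This is only a sketch: the helper `split_into_sub_datasets` and the `name-1`, `name-2` key naming are assumptions for illustration, not part of the codebase:

```python
def split_into_sub_datasets(names, parts):
    """Build start/end fractions that cut each dataset into `parts`
    equal, non-overlapping sub-datasets (hypothetical helper)."""
    dict_start_idx, dict_end_idx = {}, {}
    for name in names:
        for i in range(parts):
            key = f"{name}-{i + 1}"  # assumed naming, e.g. 'mpi-esm-1'
            dict_start_idx[key] = i / parts
            dict_end_idx[key] = (i + 1) / parts
    return dict_start_idx, dict_end_idx


# One dataset split in two gives two entries, i.e. enough for two nodes.
starts, ends = split_into_sub_datasets(["mpi-esm"], 2)
print(starts)
print(ends)
```

Each sub-dataset then counts as its own entry, so the number of entries can be made to match the number of nodes.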

You can see how we do this in our old codebase at https://github.com/tung-nd/climax_all. Change the config file https://github.com/tung-nd/climax_all/blob/main/configs/train_tokenized_vit_multi_cmip6_continuous.yaml to something like:

dict_start_idx: {
    'mpi-esm-1': 0,
    'mpi-esm-2': 0.5,
}
dict_end_idx: {
    'mpi-esm-1': 0.5,
    'mpi-esm-2': 1.0,
}

How the datamodule interprets these values can be seen here: https://github.com/tung-nd/climax_all/blob/main/src/datamodules/pretrain_multi_source_module.py
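
For intuition, here is a minimal sketch of how fractional start and end values can select a slice of a dataset's files. It assumes the fractions are applied to the list of data files; the function `slice_files` and the file names are hypothetical, and the linked module is the authoritative reference:

```python
def slice_files(file_list, start_frac, end_frac):
    """Keep the files between two fractional positions of the list.

    Assumption: boundaries are truncated to integer indices, so adjacent
    slices share no file and together cover the whole list.
    """
    start = int(start_frac * len(file_list))
    end = int(end_frac * len(file_list))
    return file_list[start:end]


# Hypothetical shard names for one dataset.
files = [f"shard_{i}.npz" for i in range(5)]
first_half = slice_files(files, 0.0, 0.5)   # would go to 'mpi-esm-1'
second_half = slice_files(files, 0.5, 1.0)  # would go to 'mpi-esm-2'
print(first_half)
print(second_half)
```

With an odd number of files the two halves differ in size by one file, but they remain non-overlapping.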


tung-nd commented on July 18, 2024

@mahm1846 did it resolve your issues?
