
Pace

Pace is an implementation of the FV3GFS / SHiELD atmospheric model developed by NOAA/GFDL using the NDSL middleware in Python, itself based on GT4Py and DaCe. The model can be run on a laptop using a Python-based backend or on thousands of heterogeneous compute nodes of a large supercomputer.

🚧 WARNING This repo is under active development - supported features and procedures can change rapidly and without notice. 🚧

The repository model code is split between pyFV3 for the dynamical core and pySHiELD for the physics parameterizations. The full dependency graph looks like the following:

flowchart TD
GT4Py.cartesian --> |Stencil DSL|NDSL
DaCe  --> |Full program opt|NDSL
NDSL --> pyFV3
NDSL --> pySHiELD
pyFV3 --> |Dynamics|Pace
pySHiELD --> |Physics|Pace


Quickstart - bare metal

Build

Pace requires:

  • GCC > 9.2
  • MPI
  • Python 3.8

For GPU backends CUDA and/or ROCm is required depending on the targeted hardware.

For the GT stencil backends, you will also need the headers of the Boost libraries, pointed to by $BOOST_ROOT. This can be set up like this:

cd BOOST/ROOT
wget https://boostorg.jfrog.io/artifactory/main/release/1.79.0/source/boost_1_79_0.tar.gz
tar -xzf boost_1_79_0.tar.gz
mkdir -p boost_1_79_0/include
mv boost_1_79_0/boost boost_1_79_0/include/
export BOOST_ROOT=BOOST/ROOT/boost_1_79_0

When cloning Pace you will need to update the repository's submodules as well:

git clone --recursive https://github.com/NOAA-GFDL/pace.git

or if you have already cloned the repository:

git submodule update --init --recursive

We recommend creating a python venv or conda environment specifically for Pace.

python3 -m venv venv_name
source venv_name/bin/activate

Inside of your pace venv or conda environment pip install the Python requirements, GT4Py, and Pace:

pip3 install -r requirements_dev.txt -c constraints.txt

Shell scripts to install Pace on specific machines such as Gaea can be found in examples/build_scripts/.

Run

With the environment activated, you can run an example baroclinic test case with the following command:

mpirun -n 6 python3 -m pace.run examples/configs/baroclinic_c12.yaml

# or with oversubscribe if you do not have at least 6 cores
mpirun -n 6 --oversubscribe python3 -m pace.run examples/configs/baroclinic_c12.yaml

After the run completes, you will see an output directory output.zarr. An example of how to visualize the output is provided in examples/plot_output.py. See the driver example section for more details.

Environment variable configuration

  • PACE_CONSTANTS: Pace is bundled with several sets of constants:
    • GFDL NOAA's FV3 dynamical core constants (original port)
    • GFS constants as defined in NOAA's GFS
    • GEOS constants as defined in GEOS v13
  • PACE_FLOAT_PRECISION: default precision of the fields & scalars in the numerics. Defaults to 64.
  • PACE_LOGLEVEL: logging level to display (DEBUG, INFO, WARNING, ERROR, CRITICAL). Defaults to INFO.
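As a hedged illustration of how such variables are typically consumed at startup (the function names here are invented for this sketch, not Pace's actual API):

```python
import logging
import os

def read_precision() -> int:
    # PACE_FLOAT_PRECISION defaults to 64, per the list above
    return int(os.getenv("PACE_FLOAT_PRECISION", "64"))

def read_loglevel() -> int:
    # PACE_LOGLEVEL defaults to INFO; the stdlib logging module maps
    # level names (DEBUG, INFO, ...) to their numeric values
    name = os.getenv("PACE_LOGLEVEL", "INFO").upper()
    return getattr(logging, name)
```

Set the variables in the shell before `mpirun` (e.g. `export PACE_FLOAT_PRECISION=32`) so every rank sees the same configuration.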

Quickstart - Docker

Build

While it is possible to install and build pace bare-metal, we can ensure all system libraries are installed with the correct versions by using a Docker container to test and develop pace.

First, you will need to update the git submodules so that any dependencies are cloned and at the correct version:

git submodule update --init --recursive

Then build the pace docker image at the top level.

make build

Run

make dev
mpirun --mca btl_vader_single_copy_mechanism none -n 6 python -m pace.run /examples/configs/baroclinic_c12.yaml

History

This repository was first developed at AI2, and the institute keeps an archived copy of its latest state before NOAA took over development.

Running pace in containers

Docker images exist in the Github Container Registry associated with the NOAA-GFDL organization. These images are publicly accessible and can be used to run a Docker container to work with pace. The following are directions for setting up the pace conda environment interactively in a container.

The latest images can be pulled with Docker as shown below, or with any other container management tool:

docker pull ghcr.io/noaa-gfdl/pace_mpich:3.8

for an MPICH installation of MPI, and

docker pull ghcr.io/noaa-gfdl/pace_openmpi:3.8

for an OpenMPI installation of MPI.

If permission issues arise during the pull, a Github personal token may be required. The steps to create a personal token are found here.

Once the token has been generated, the image can be pulled, for example, with:

docker login --username GITHUB_USERNAME --password TOKEN
docker pull ghcr.io/noaa-gfdl/pace_mpich:3.8

Any container management tools compatible with Docker images can be used to run the container interactively from the pulled image. With Docker, the following command runs the container interactively.

docker run -it pace_mpich:3.8

In the container, the default base conda environment is already activated. The pace conda environment can be created by following the steps below:

git clone --recursive -b develop https://github.com/NOAA-GFDL/pace.git pace
cd pace
cp /home/scripts/setup_env.sh . && chmod +x setup_env.sh
source ./setup_env.sh


Issues

RF_fast / Tau inconsistency

The rf_fast = true configuration setting hasn't been implemented, but we also don't support rf_fast = false with a nonzero tau. For consistency's (and completeness') sake we should implement one (ideally both) of these. Support for nonzero tau with rf_fast = False is higher priority.

ak/bk stability check - Xi Chen tool wrapper.

One concern (not related to this commit) is that there is no foolproof way to automatically generate a good set of ak/bk levels, and when creating an entirely novel set of levels some degree of hand-tuning is necessary to avoid stability problems. Xi Chen had developed a tool which can help, and interpolating from predefined level sets is usually pretty good, but some user discretion is needed when making new level sets.

There is no a priori way to ensure stability, but typically we can examine the differences in delp and delz in a variety of situations, which is what Xi's tool does. We could run tests to ensure that the differences do not exceed some tolerance. This might be a good way forward; furthermore, smoothing can be applied to improve the stability of the selected coordinate.

( Per discussion in #36 )

rename 'ks' variable in Pace

The ks variable in Pace and FV3 is supposed to designate the topmost layer in which the bk of the eta pair becomes zero. That layer and everything above it is known as the sponge layer. It makes sense to rename the variable in Pace to something more descriptive, as it does not carry the same meaning as the index variables is and js.
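For illustration only (this helper is not Pace code), the quantity ks encodes can be computed from the bk coefficients as the number of leading pure-pressure levels:

```python
import numpy as np

def sponge_layer_top(bk: np.ndarray) -> int:
    """Index of the first layer (from the top) where bk becomes nonzero.

    Layers above this index have bk == 0 and form the sponge layer;
    this is the quantity the ks variable encodes. Illustrative sketch.
    """
    nonzero = np.flatnonzero(bk)
    # If bk is zero everywhere, every layer is pure-pressure
    return int(nonzero[0]) if nonzero.size else len(bk)
```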

Update Code Review Checklist

Pace PRs still imply you should check your code against the old DSL team's checklist, which is not accessible to anyone outside AI2. If we want to keep using this as a guide for Pace code, it should live somewhere accessible, maybe in the repo itself. We could (and should) also edit it so it is in line with our current standards (and omits the OOP refactor section), or we should remove that bullet point from the PR text completely if we don't want to use it.

Probable NaN in geometry calculation on c12

Following this hint:

/home/runner/work/pace/pace/util/pace/util/grid/geometry.py:516: RuntimeWarning: invalid value encountered in divide
    del6_v = sina_u * dy / dxc

The /0 seems to be located in the halos, meaning a zero-diff on the actual numerics.
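This warning class is easy to reproduce in isolation; zeros standing in for unset halo points yield 0/0 = NaN there while interior values are unaffected (a minimal sketch, not the actual grid code):

```python
import numpy as np

# Unset halo points (here indices 0 and 4) are zero in both numerator
# and denominator, so the division yields NaN only in the halo.
sina_u = np.array([0.0, 1.0, 1.0, 1.0, 0.0])
dy = np.ones(5)
dxc = np.array([0.0, 1.0, 2.0, 4.0, 0.0])

with np.errstate(invalid="ignore"):
    del6_v = sina_u * dy / dxc  # source of the RuntimeWarning, silenced here
```

Because the NaNs land only in the halo, masking or overwriting those points would be zero-diff on the interior numerics, consistent with the observation above.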

Update test_eta.py

Is your feature request related to a problem? Please describe.
The current unit tests in main/grid/test_eta.py do not check that the tests failed for the expected reasons.

Describe the solution you'd like
The tests should be improved upon to check the produced error messages in order to verify that the program failed as expected.

Describe alternatives you've considered
Leave as is.

Additional context

Functions with similar names/methods should be consolidated in definition and location

Current functions that fit this category:

compute_eta (defined and called in multiple files)
initialize_delp/_initialize_delp (similar functionality)
initialize_edge_pressure/_initialize_edge_pressure (similar functionality)

I plan to track more of these down and either leave one definition in place or, for something like compute_eta, place it in a central module that can be imported.

Refactor local import in Quantity

Floating point precision work led to introducing a rather ugly local from pace.dsl.typing import Float in Quantity.__init__ to break a circular import.

An import this deep in pace.util can lead to circular imports. Since Quantity construction should not be on the critical path, this is "fine" for now but needs to be refactored out.
Strategies that seem obvious (but aren't):

  • Move Float to a separate file, e.g. pace.dsl.typing_float. Problem: the circular dependency is on the import of anything under pace.dsl here
  • Remove the boundary import (the culprit of the cycle). Problem: it's a legitimate need of Communicator, which is itself legitimately needed
    upstream
  • Make Float a parameter to Quantity.__init__. Probably the actual way to fix this, but the many changes required might have
    heavy side effects and be bug-prone.

Apologies to whoever works on this. If it's future me... I deserved it.

More broadly, there are imports of pace.dsl in pace.util, which creates a dependency cycle in the repository

Remove fv_core.res.nc file in util/tests

After the merging of PR #36, most tests will read in the ak and bk coefficients from netcdf files stored off-site for the CI. The only exception will be test_restart_fortran where the pressure coefficients are read in from /util/tests/data/c12_restart/fv_core.res.nc. For consistency, test_restart_fortran should be modified to also read in the ak/bk values from a netcdf file stored off-site. With this change, the fv_core.res.nc file will no longer be necessary and can be removed from pace, which would make pace only more perfect.

[orch:dace:X] bad reinterpret cast leads to NaN

In AcousticsDynamics.__call__ a couple of dt values are computed. Namely:

  • dt_acoustic_substep: from the timestep and n_split
  • dt2: half of the above

When using Float=np.float32 (or PACE_FLOAT_PRECISION=32) those still remain 64-bit floats, even when type-hinted to Float. This is a DaCe bug. To work around it we use two callbacks (dt_acoustic_substep and dt2 on self), which force the type on return.

DaCe to fix - Pace to remove the workaround
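The spirit of the workaround can be illustrated outside DaCe (a sketch, not the Pace code): plain arithmetic promotes to 64-bit regardless of type hints, so the value is routed through a callback whose return is explicitly cast:

```python
import numpy as np

Float = np.float32  # stands in for PACE_FLOAT_PRECISION=32

def dt2_callback(dt_acoustic_substep):
    # Forcing the cast at the return boundary keeps the value 32-bit,
    # which is what the callback workaround achieves in the model.
    return Float(dt_acoustic_substep / 2)

dt_acoustic_substep = Float(225.0)
dt2 = dt2_callback(dt_acoustic_substep)
```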

Backends multi-node distributed compilation

An enduring issue with the model right now is the inability to build efficiently at scale. Every stencil takes a significant amount of time to build due to the well-known underperformance of nvcc. Coupled with the fact that the cube sphere gives us up to 9 different code paths (depending on the placement of a rank on any given tile), this leads to build times of 3+ hours.

A solution is to use distributed compilation across multiple nodes. Using the new code-path identification technique, which guarantees relocatability, we should be able to compile with 54 ranks and scale up to any layout.

Here's an outline of a solution:

  • Rank 0 spins up a file socket server, acting as a scheduler for everybody else
  • When hitting FrozenStencil, the rank queries the server for the stencil state
    • Build: stencil is not built - build it
    • Stub: stencil is being built - stub for now, come back when execution is needed
    • Load: stencil is ready - load it
  • When a stencil needs to be executed, the rank queries the server until given the "Load" answer
  • Why not multithread? Because Python + GIL = sad developer
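The state machine above can be sketched in-process (the real design would sit behind rank 0's socket server; class and method names here are invented):

```python
from enum import Enum, auto

class StencilState(Enum):
    BUILD = auto()  # not built yet: the querying rank should build it
    STUB = auto()   # being built elsewhere: stub now, poll again later
    LOAD = auto()   # artifact ready: load it

class Scheduler:
    """Toy single-process stand-in for the rank-0 file socket server."""

    def __init__(self):
        self._state = {}

    def query(self, stencil: str) -> StencilState:
        state = self._state.get(stencil)
        if state is None:
            # First rank to ask claims the build; later ranks see STUB.
            self._state[stencil] = StencilState.STUB
            return StencilState.BUILD
        return state

    def mark_built(self, stencil: str) -> None:
        self._state[stencil] = StencilState.LOAD
```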

Grid Type should be an enum

The grid_type variable indicates the model's grid geometry and is currently an int, in keeping with the Fortran code. The main distinction is that a grid_type of 0, 1, or 2 specifies a cubed-sphere geometry (only 0, the gnomonic equidistant cubed sphere, is currently supported), while a grid_type of 3 or greater runs the model with an orthogonal doubly-periodic grid (typically set with grid_type=4). Pace doesn't need grid_type to be an integer, and for clarity we should switch to an enum governing the grid type, starting with gnomonic cubed-sphere and orthogonal members.
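A minimal sketch of such an enum (member names are assumptions, not settled API); IntEnum preserves compatibility with the Fortran-style integer values:

```python
from enum import IntEnum

class GridType(IntEnum):
    GNOMONIC_CUBED_SPHERE = 0  # only cubed-sphere variant supported today
    DOUBLY_PERIODIC = 4        # typical orthogonal doubly-periodic value

    @property
    def is_cubed_sphere(self) -> bool:
        # Mirrors the Fortran convention: values 0-2 are cubed-sphere
        return self.value <= 2
```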

Fail to init model on not implemented configurations

The Pace repository does not carry all the capabilities of the FV3 model. Therefore we need to fail hard when the requested namelist/configuration will not work, or will work poorly.

A first implementation is to raise NotImplementedError in the __init__ function of the modules. DO NOT RAISE AT RUNTIME; keep the guard at initialization or fear my wrath.

Current non-exhaustive list:

  • do_sat_adjust
  • kord is 9 or 10
  • nwat is 6
  • non-hydrostatic

List to be double checked and completed
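The guard pattern could look like this sketch (the class and flag wiring are illustrative, not Pace's actual modules):

```python
class SatAdjustStub:
    """Illustrative init-time guard for an unimplemented option."""

    def __init__(self, do_sat_adjust: bool):
        # Fail at construction, never at runtime, per the rule above.
        if do_sat_adjust:
            raise NotImplementedError(
                "do_sat_adjust is not implemented in Pace"
            )
        self.do_sat_adjust = do_sat_adjust
```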

Unit test: test_restart_serial.py OSError regarding RESTART directory not empty

Upon running the test_restart_serial.py unit test in the tests/main/driver directory, on both the main repo and my fork for the fv3core reorganization, the test fails with an OSError:

FAILED tests/main/driver/test_restart_serial.py::test_restart_save_to_disk - OSError: [Errno 39] Directory not empty: 'RESTART'

Although the RESTART directory does appear empty on investigation.

Tests should live in the correct repos

One of the last parts of the dsl/framework split is making sure the tests live in the same repositories as the code they test. Currently the fv3core and grid tests in pace should be in pyFV3 and NDSL respectively.

variables contained within data structures

In the native Fortran code there are many variables stored in the various sub-structures within the ATM datatype. There are separate structures for nesting, inline physics, grid-related quantities, and control variables/flags. It makes sense to review the many variables within these datatypes to ensure we are keeping the bare minimum, so that memory pressure on the various processors we target stays minimal. For example, how often do we use area vs. inverse area, and does it make sense to keep both?

Updates to metadata in pace setup.py

Describe the solution you'd like
The metadata of the pace package contained in its setup.py should be updated to include updated ownership information, while still retaining AI2 contributions.

Describe alternatives you've considered
Already have made @oelbert the point-of-contact

buildenv submodule needs to be updated

The buildenv submodule is pointing to the ai2cm organization. Need to create a detached fork of buildenv within NOAA-GFDL and update the submodule link to utilize the new location.

pace git clone fails in a container

Cloning pace via git clone --recursive https://github.com/NOAA-GFDL/pace.git fails in a container due to publickey errors.

The .gitmodules file should be specified with https uris instead of ssh urls.

BFB comparison tests for pace

Is your feature request related to a problem? Please describe.
A bit-for-bit (BFB) comparison test is needed to ensure incoming changes do not affect reproducibility of output during runs.

Describe the solution you'd like
A regression test using cprnc for BFB comparison of output from runs before and after the changes to be merged.

GT_CACHE_DIR_NAME is not honored

Due to the nature of the restrictive compilation for both orchestrated and non-orchestrated backends (e.g. compiling on N ranks but running on M), generating the cache path from GT_CACHE_DIR_NAME leads to issues.
Mainly, GT4Py, DaCe and pieces of Pace all need to agree, while there is more than one entry point to the code base.

Pending a fix, GT_CACHE_ROOT works and the directory is always called .gt_cache_XXXXXX.

The GT_CACHE_DIR_NAME issue cannot be solved easily without going back to GT4Py.

OOP-ify Physics Configuration

Once we have a good handle on the requirements of the physics schemes we'll integrate into Pace, we should refactor the physics scheme selection code in physics.py and physics_state.py (added in PR 44) so that it uses an OO structure and API instead of tons of if-conditions.

Licensing information

At some point we need to decide on a license for the code as well as insert license (and disclaimer??) information in each file.

Refactor of tracers

Right now the tracers are organized as a hardcoded list of strings, e.g.

tracer_variables = [
    "qvapor",
    "qliquid",
    "qrain",
    "qice",
    "qsnow",
    "qgraupel",
    "qo3mr",
    "qsgs_tke",
    "qcld",
]

The number of species to work on is known via constants.NQ, which is then used throughout the code (including in loop boundaries).
Later in the code, some configuration is done via kord to reflect the remapping strategy for each tracer, e.g.

kord_tracer = [kord] * self._nq
[...]
MapSingle(
    stencil_factory,
    quantity_factory,
    kord_tracer[i],
    0,
    dims=[X_DIM, Y_DIM, Z_DIM],
)

The list continues, but the point remains: tracer semantics are hardcoded and dispersed across many variables.

We need a semantically strong Tracers object that knows how to:

  • load from configuration
  • name tracers
  • report how many tracers are advected and/or iterate over them
  • index tracers in state (or even hold the memory, TBD)
  • carry the configuration related to the tracers (kord, etc.)

Basically, break the organization in two:

  • out of the critical path: make a global object
  • critical path: keep the Field-based SOA-like system for performance.
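A hedged sketch of such a global Tracers object (the names and config shape are assumptions, not existing Pace API):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Tracers:
    """Central tracer registry replacing constants.NQ and the
    scattered hardcoded lists; illustrative only."""

    names: List[str]
    kord: Dict[str, int] = field(default_factory=dict)

    @classmethod
    def from_config(cls, names: List[str], default_kord: int) -> "Tracers":
        # Each tracer carries its own remapping strategy (kord)
        return cls(names=list(names), kord={n: default_kord for n in names})

    def __len__(self) -> int:
        # Number of advected species, previously constants.NQ
        return len(self.names)

    def __iter__(self):
        return iter(self.names)
```

The critical-path Field-based SOA layout would index into this object rather than into parallel hardcoded lists.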
