earth-system-radiation / rte-rrtmgp
RTE+RRTMGP is a set of codes for computing radiative fluxes in planetary atmospheres.
License: BSD 3-Clause "New" or "Revised" License
An issue received via email:
From: Thomas Jahns <[email protected]>
Sent: Monday, March 7, 2022 9:57 AM
To: [email protected]
Cc: Claudia Frauen <[email protected]>; Jörg Behrens <[email protected]>
Subject: Memory layout in rrtmgp/kernels/mo_gas_optics_kernels.F90
Hi,
I was roped in by some colleagues to look at some issues in the mo_gas_optics_kernels module and noticed two other, independent problems for compiler optimization of the module:
1. All routines are made public and bind(c), which prevents inlining for compilers adhering to strict semantics for symbols that can be interposed. Would it be possible to change the declaration of interpolate3D_byflav to private and remove the bind(c) to permit Fortran assumed-shape arguments?
2. The above would allow changing the fields from, e.g.,
integer, dimension(2, ncol,nlay,nflav), intent(out) :: jeta
real(wp), dimension(2, ncol,nlay,nflav), intent(out) :: col_mix
real(wp), dimension(2,2,2,ncol,nlay,nflav), intent(out) :: fmajor
real(wp), dimension(2,2, ncol,nlay,nflav), intent(out) :: fminor
to
integer, dimension(ncol,2, nlay,nflav), intent(out) :: jeta
real(wp), dimension(ncol,2, nlay,nflav), intent(out) :: col_mix
real(wp), dimension(ncol,2,2,2,nlay,nflav), intent(out) :: fmajor
real(wp), dimension(ncol,2,2, nlay,nflav), intent(out) :: fminor
and enable the compiler to process the data with multiple packed SIMD streams instead of emitting vector gather/scatter instructions for data that is essentially in AoS (array-of-structures) layout.
I think I could submit a corresponding code change if the idea seems possible in general.
Kind regards, Thomas
Hi,
I get errors that I am unable to fix when compiling the latest release (v1.3). I am using CMake for the build process.
OS: Ubuntu 20.04
CMakeLists.txt
error output: output.txt
Any idea what that is?
At present, using GPU code and running on GPUs produces correct answers for the RFMIP SW standalone test case on the OLCF host Summit but incorrect answers on other hosts. We are seeking a solution.
There are data files which do contain the generation date in their name. The older highres once have an ISO conforming datestring include, while the newer once with lower resolution refer to the time Augustus was leading the Roman Empire. Could that be changed to a proper date string?
Hi,
this is probably a very silly question:
What is the best way of calling RTE+RRTMGP from a GCM?
Say I have dry air density and temperature in all the grid boxes, and I want to get back the spectrally integrated flux density convergences in W/m^3.
Do I have to write out the atmospheric state as a netcdf file, then call RTE+RRTMGP and then get back a netcdf file with the radiation fluxes?
Should I compile the source code of RTE+RRTMGP into the GCM executable and then write an interface? Or should I compile RTE+RRTMGP separately and then use the files librrtmgp.a and librte.a?
I could not find information on this.
Thanks
Max
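For illustration only: there is no need for a netCDF round trip; the library is designed to be called directly through its Fortran classes, either by compiling the sources into the model or by linking librte.a and librrtmgp.a. A minimal longwave call sequence along the lines of the examples/ programs might look like the sketch below (stop_on_err and load_and_init are helpers from examples/, the k-distribution filename is illustrative, and the exact argument lists may differ between versions, so treat this as a guide rather than a verified interface):

use mo_rte_kind,           only: wp
use mo_gas_optics_rrtmgp,  only: ty_gas_optics_rrtmgp
use mo_gas_concentrations, only: ty_gas_concs
use mo_optical_props,      only: ty_optical_props_1scl
use mo_source_functions,   only: ty_source_func_lw
use mo_fluxes,             only: ty_fluxes_broadband
use mo_rte_lw,             only: rte_lw
use mo_load_coefficients,  only: load_and_init   ! helper from examples/

type(ty_gas_optics_rrtmgp)  :: k_dist
type(ty_gas_concs)          :: gas_concs
type(ty_optical_props_1scl) :: optics
type(ty_source_func_lw)     :: sources
type(ty_fluxes_broadband)   :: fluxes
real(wp), allocatable, target :: flux_up(:,:), flux_dn(:,:)
! p_lay(ncol,nlay), p_lev(ncol,nlay+1), t_lay, t_sfc, sfc_emis, vmrs, ...
! come straight from the host model's state; no files are involved.

call stop_on_err(gas_concs%init(gas_names))
call stop_on_err(gas_concs%set_vmr('h2o', h2o_vmr))     ! once per gas
call load_and_init(k_dist, 'rrtmgp-gas-lw-g256.nc', gas_concs)

call stop_on_err(optics%alloc_1scl(ncol, nlay, k_dist))
call stop_on_err(sources%alloc    (ncol, nlay, k_dist))
allocate(flux_up(ncol,nlay+1), flux_dn(ncol,nlay+1))
fluxes%flux_up => flux_up
fluxes%flux_dn => flux_dn

call stop_on_err(k_dist%gas_optics(p_lay, p_lev, t_lay, t_sfc, &
                                   gas_concs, optics, sources))
call stop_on_err(rte_lw(optics, top_at_1, sources, sfc_emis, fluxes))
! Flux convergence in W/m^3: difference of the net flux across each
! layer divided by the layer's geometric thickness.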
Paging @SamuelTrahanNOAA @dustinswales: My understanding is that the implementation of RTE+RRTMGP has been consolidated in the UFS and that we are free to delete the dtc/ccpp branch. Any comments?
In examples/mo_load_coefficients, the coefficients files are opened with mode NF90_WRITE. Since these files are not intended to be written to, they should probably be opened with mode NF90_NOWRITE.
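For reference, a read-only open with the Fortran netCDF API looks like this (a minimal sketch; the file name is illustrative):

use netcdf
integer :: ncid, status
! NF90_NOWRITE requests read-only access; NF90_WRITE allows modification.
status = nf90_open('coefficients_lw.nc', NF90_NOWRITE, ncid)
if (status /= NF90_NOERR) stop 'nf90_open failed'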
When I run the master branch:
[imn@login5:/gpfs/alpine/scratch/imn/stf006/rte-rrtmgp] 8-) git branch
* master
[imn@login5:/gpfs/alpine/scratch/imn/stf006/rte-rrtmgp] 8-) git diff
[imn@login5:/gpfs/alpine/scratch/imn/stf006/rte-rrtmgp] 8-)
with PGI on Summit using -O1, I get the following diffs for a CPU run:
Variable rld differs (max abs difference: 3.814697e-06; max frac. difference: 1.178709e-05%)
Variable rsu differs (max abs difference: 4.685364e-01; max frac. difference: 1.464250e-01%)
Variable rsd differs (max abs difference: 1.940063e+00; max frac. difference: 1.464253e-01%)
I'm thinking maybe the baseline files are stale compared to current master.
Problem:
Intel compiler issues when building ICON with RTE-RRTMGP on "Levante", the new HPC system at DKRZ, which consists of AMD EPYC Milan processors. The build using the Intel compiler succeeds only if almost all optimizations are turned off. In particular, the compiler options -m64 -march=core-avx2 do not work at all.
The error occurs in the subroutine interpolation in file rte-rrtmgp/rrtmgp/kernels/mo_gas_optics_kernels.F90, in the following nested loop:
eta = merge(col_gas(icol,ilay,igases(1)) / col_mix(itemp,icol,ilay,iflav), 0.5_wp, &
col_mix(itemp,icol,ilay,iflav) > 2._wp * tiny(col_mix))
loceta = eta * float(neta-1)
jeta(itemp,icol,ilay,iflav) = min(int(loceta)+1, neta-1)
The result of the calculation for jeta is sometimes jeta=-2147483647, which leads to a segmentation fault.
Workarounds tried so far: disable optimization of mo_gas_optics_kernels.F90 by placing !DIR$ NOOPTIMIZE at the beginning of the module. It works, but the performance degradation is unacceptable.
Next steps:
Hi,
we found a bug in line 137/138 of rrtmgp/kernels/mo_gas_optics_kernels.F90. The code is:
col_mix(itemp,icol,ilay,iflav) = col_gas(icol,ilay,igases(1)) + ratio_eta_half * col_gas(icol,ilay,igases(2))
eta = merge(col_gas(icol,ilay,igases(1)) / col_mix(itemp,icol,ilay,iflav), 0.5_wp, &
col_mix(itemp,icol,ilay,iflav) > 2._wp * tiny(col_mix))
According to the Fortran standard, merge is an ordinary intrinsic function rather than a conditional construct like if: a compiler is allowed to evaluate both candidate results in parallel with the evaluation of the logical expression. This leads to a potential divide-by-zero in the second line given above. To fix this, the two lines need to be reformulated so that this cannot happen.
We could do this, but first want to know whether you have a better solution in mind for how to fix this.
Cheerio,
Luis
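One possible reformulation, sketched with the variable names from the excerpt above: replace the merge with an explicit branch, so the division can only be evaluated when the denominator is known to be safe.

! The division now sits inside the guarded branch, so a compiler cannot
! legally evaluate it for col_mix values near zero.
if (col_mix(itemp,icol,ilay,iflav) > 2._wp * tiny(col_mix)) then
  eta = col_gas(icol,ilay,igases(1)) / col_mix(itemp,icol,ilay,iflav)
else
  eta = 0.5_wp
end if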
I was unable to run the RTE-RRTMGP GPU code on an NVIDIA A100 GPU with either the nvhpc/24.1 or the nvhpc/24.3 compiler.
The error message looks like:
./check_equivalence test_atmospheres.nc /glade/derecho/scratch/sunjian/rte-rrtmgp/rrtmgp-data/rrtmgp-gas-sw-g224.nc
gas optics is for the shortwave
pressure limits (Pa): 1.005183574463000 109663.3158428461
temperature limits (K): 160.0000000000000 355.0000000000000
Intialized atmosphere twice
Default calculation
Vertical orientation invariance
Changing TSI fails
TSI invariance
halving/doubling fails
Incrementing with 1scl fails
Incrementing with 2str fails
Incrementing with nstr fails
Incrementing
Warning: ieee_invalid is signaling
Warning: ieee_inexact is signaling
ERROR STOP 1
I am using the same compiler flags suggested by https://github.com/earth-system-radiation/rte-rrtmgp/blob/main/.github/workflows/containerized-ci.yml#L40-L42.
The combine_and_reorder_nstr routine appears to incorrectly index the input p array, which is assumed to have dimensions p(ncol,nlay,ngpt,nmom) but is accessed within combine_and_reorder_nstr as p(imom,icol,ilay,igpt).
It also appears that p(1:2,...) are written to, then p(1:nmom,...) are overwritten in a loop. In the event that nmom < 2, explicitly writing to p(2,...) will cause a segfault. This also causes some complications/dangers with the OpenACC port, I think.
Based on RRTMGP.jl's coverage, the lowest coverage comes from
It might be nice to add examples that touch more of these kernels.
I ran into this problem while trying to develop an RCE compset for E3SM. The valid temperature range for RRTMGP goes down to 180 K, but for whatever reason, when running in aqua-planet mode the upper levels routinely see temperatures down to 150 K.
For the time being I've enabled temperature clipping, but I'm not sure what the effect will be. It seems like a good idea to expand the look-up tables to support these conditions.
It looks like there is an indexing issue on this line, and maybe these lines too:
rte-rrtmgp/rte/kernels-openacc/mo_rte_solver_kernels.F90, lines 1018 to 1022 in 3b5d157
Shouldn't ngpt be changed to igpt?
Hello,
When I tested the current code on an AMD machine I found that sw_dif_and_source is not vectorizing with either the ifort 2021.6.0 or GNU Fortran 11.3 compilers.
Intel's optimization report says the following:
LOOP BEGIN at ../rte/kernels/mo_rte_solver_kernels.F90(1389,7)
remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
remark #15346: vector dependence: assumed ANTI dependence between DIR_FLUX_INC(i) (1465:11) and DIR_FLUX_TRANS(i) (1471:11)
LOOP END
(Ignore the line numbers, I've added things - but this is for the version of sw_dif_and_source in the main branch.)
Timings confirm that sw_dif_and_source is very slow because of this:
Called Recurse Wall max min %_of_clea
clear_sky_total (SW) 1 - 1.946 1.946 1.946 100.00
gas_optics (SW) 100 - 0.773 8.83e-03 7.65e-03 39.70
rte_sw 100 - 1.173 0.017 0.012 60.26
sw_2stream 100 - 1.170 0.017 0.012 60.12
sw_dif_and_source 11200 - 1.109 2.79e-03 9.20e-05 56.97
sw_adding 11200 - 0.043 8.00e-06 3.00e-06 2.19
This can be fixed by adding !$OMP SIMD before the ncol loop:
LOOP BEGIN at ../rte/kernels/mo_rte_solver_kernels.F90(1389,7)
remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
remark #15442: entire loop may be executed in remainder
remark #15448: unmasked aligned unit stride loads: 1
remark #15449: unmasked aligned unit stride stores: 1
remark #15450: unmasked unaligned unit stride loads: 6
remark #15451: unmasked unaligned unit stride stores: 4
remark #15456: masked unaligned unit stride loads: 3
remark #15457: masked unaligned unit stride stores: 3
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 470
remark #15477: vector cost: 249.250
remark #15478: estimated potential speedup: 1.860
remark #15482: vectorized math library calls: 2
remark #15486: divides: 3
remark #15488: --- end vector cost summary ---
LOOP END
Called Recurse Wall max min %_of_clea
clear_sky_total (SW) 1 - 1.362 1.362 1.362 100.00
gas_optics (SW) 100 - 0.781 0.013 7.64e-03 57.34
rte_sw 100 - 0.580 6.01e-03 5.76e-03 42.60
sw_2stream 100 - 0.577 5.98e-03 5.71e-03 42.38
sw_dif_and_source 11200 - 0.470 6.30e-05 4.00e-05 34.53
sw_adding 11200 - 0.043 1.00e-05 3.00e-06 3.19
This is with Intel. With gfortran there is another serious issue: the mu0 conditional prevents vectorization. Removing it halved the total runtime of the RFMIP SW program (2 s -> 1 s, now faster than ifort!). Perhaps the code could be changed so that mo_rte_sw checks that mu0 is positive in all the input columns and otherwise calls a slower version with the mu0 conditional inside the kernels? Or the functionality could even be written into the same (but admittedly more bloated) SW kernel with an argument check_for_positive_mu0 (some host models may already have removed nighttime columns before calling RTE-SW, so they could set it to false).
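For concreteness, the directive goes immediately before the column loop; a schematic sketch of the pattern (the loop body is illustrative, named after the variables in the optimization report, not the actual kernel code):

!$omp simd
do icol = 1, ncol
  ! The directive asserts the iterations are independent, so the compiler
  ! can vectorize despite the anti-dependence it assumed in the report.
  dir_flux_trans(icol) = dir_flux_inc(icol) * t_noscat(icol)
  dir_flux_inc(icol)   = dir_flux_inc(icol) * t_dir(icol)
end do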
Hi,
!scaling = scaling * col_gas(icol,ilay,idx_minor_scaling(imnr)) * vmr_fact * dry_fact
! When using single precision, the scaling variable previously became infinity at this
! stage for some minor gases, due to (very big number) * (very big number).
! Fixed by computing the last terms first (vmr_fact is very small).
scaling = scaling * (col_gas(icol,ilay,idx_minor_scaling(imnr)) * vmr_fact * dry_fact)
find_gas = findloc(this%gas_name, gas, dim=1), and GAS_NOT_IN_LIST needs to be changed to 0.
As per the suggestion by @vectorflux in #188: export RTE_KERNELS="openacc" for the OpenMP-acc mode is unaesthetic; "accelerator" would be better.
For timing purposes it would be great to have a flexible benchmarking program. Timing estimates could build on @alexeedm's timing with system_clock in the all-sky example.
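A timing skeleton along those lines (a minimal sketch; the kernel call in the middle is a placeholder):

use iso_fortran_env, only: int64
integer(int64) :: t_start, t_finish, rate
call system_clock(t_start, rate)
! ... repeated calls to gas_optics / rte_lw / rte_sw here ...
call system_clock(t_finish)
print '(a,f10.4)', 'elapsed seconds: ', real(t_finish - t_start) / real(rate)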
We could use fixed concentrations for some greenhouse gases and analytic formulae for temperature, pressure, ozone, and water vapor, for example following the RCEMIP protocol. Then we might want to vary:
I'm in a situation, trying to create an RCE compset for E3SM, where I need to calculate the tsi_scaling variable to be sent in to rte_sw. This simply requires summing k_dist_sw%solar_src(:), but this is private data that needs a "get" function to access. The alternative would require modifying the external files, which I'd like to avoid. The simple fix would be to add a "get_solar_src" method to ty_gas_optics_rrtmgp.
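A sketch of such an accessor (the component name solar_src comes from the issue; the exact declaration inside ty_gas_optics_rrtmgp is an assumption):

! Type-bound function returning the summed solar source, so hosts can
! compute tsi_scaling without reading the coefficient files themselves.
function get_solar_src_total(this) result(tsi)
  class(ty_gas_optics_rrtmgp), intent(in) :: this
  real(wp)                                :: tsi
  tsi = sum(this%solar_src)
end function get_solar_src_total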
In branch feature-reduce-in-place, which will be merged into develop once OpenACC/MP issues are resolved, the kernels compute the sum over all g-points "in place" if the fluxes argument to the high-level Fortran driver is of type ty_fluxes_broadband. This is implemented in the kernels using what are essentially C-compatible optional arguments: logical variables and kernel arguments which may not be referenced within the kernel depending on the value of the logical. So, for example, in the call to the LW kernel the arrays broadband_up and _dn (with dimensions ncol, nlay+1) are assigned values if the logical do_broadband is .true., while the arrays flux_up and _dn (with dimensions ncol, nlay+1, ngpt) are assigned if do_broadband == .false.
For the CPU versions of the kernels this has the potential advantage that the flux_ arrays need not be allocated at their full sizes in the driver if do_broadband == .true.; we could rather pass a small decoy array to the kernel. This would reduce the considerable memory footprint of the calculation and was one motivation for the change.
On the other hand, at least one compiler (NAG, using debugging flags) attempts to fill every intent(out) array with NaN if its value is not explicitly computed, so not providing fully-sized arrays causes a runtime error. And, on the GPU side, fluxes for each g-point are still computed in parallel, so working memory needs to be allocated even if only the sum of the spectrally-resolved fluxes is used.
@alexeedm @Chiil I'm curious about your opinion - should we always completely allocate all the arrays passed to the kernels even if they won't be referenced? Are there alternatives?
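For reference, the pattern under discussion looks roughly like this (a schematic sketch using the dimensions from the description above, not the actual kernel interface):

subroutine lw_solver(ncol, nlay, ngpt, do_broadband, &
                     broadband_up, broadband_dn, flux_up, flux_dn)
  integer,  intent(in)  :: ncol, nlay, ngpt
  logical,  intent(in)  :: do_broadband
  ! Referenced only when do_broadband is .true.
  real(wp), intent(out) :: broadband_up(ncol,nlay+1), broadband_dn(ncol,nlay+1)
  ! Referenced only when do_broadband is .false.; the driver would like to
  ! pass small decoy arrays here when only broadband sums are requested.
  real(wp), intent(out) :: flux_up(ncol,nlay+1,ngpt), flux_dn(ncol,nlay+1,ngpt)
  if (do_broadband) then
    broadband_up = 0._wp; broadband_dn = 0._wp  ! accumulate g-point sums in place
  else
    flux_up = 0._wp; flux_dn = 0._wp            ! store spectrally resolved fluxes
  end if
end subroutine lw_solver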
As discussed in this PR #117 (comment), this style of OpenMP offload code may not be correct:
https://github.com/earth-system-radiation/rte-rrtmgp/blob/main/rte/kernels-openacc/mo_rte_solver_kernels.F90#L203-L213
Even in the case of do_Jacobians=False, the OpenMP offload compiler may map variables which are not available on the CPU, leading to spurious seg faults or other undefined behaviour. Note that most GPUs only have stack memory, so the variables are going to get allocated on the GPU even if the code is not executed there. (I think, but am not 100% sure.)
There are times when seg faults with the Cray OpenMP offload are encountered in the presence of optional arguments. It would be much better to examine an alternate code structure that avoids these kernels, as is proposed in the aforementioned PR.
Various parts of the CI (the tests running automatically for each commit) rely on files from RFMIP hosted on the Earth System Grid. The ESGF nodes are flaky, and the CI fails not infrequently because it can't retrieve the files.
One example is the LBLRTM results used in tests/validation-plots.py. As these tests are run in a container, we could consider adding the reference files to the container, trying the various ESGF nodes, and falling back to the files inside the container if all ESGF nodes are broken.
It looks like radn_dn should be intent(inout) in lw_solver_noscat, because it is assumed to have defined values within the routine (line 110 in mo_rte_solver_kernels.F90 in master). This doesn't throw an error with the PGI compiler, but the intent should probably be fixed for safety/clarity.
Hi,
I was running RRTMGP (tag: v1.5) in CAM on Cheyenne (a supercomputer at NCAR, CPU nodes only) and I observed different results when I used different RRTMGP code paths for compilation (with the same Intel compiler, compset, grid, etc. for the other settings).
To be specific, case 1 used the following code paths from this repo:
rrtmgp/kernels
rte/kernels
while case 2 used the following code paths from this repo:
rrtmgp/kernels-openacc
rte/kernels-openacc
I wonder whether the difference in the CPU results for these two cases is expected or not. Or is there any compiler flag that I should include in my CAM build?
Currently, certain subroutines are exposed via C bindings under very generic names, e.g. interpolation. That might trigger a name collision in a user application.
A possible solution is to come up with some common prefix and use it for all public functions of the library, e.g. rrtm_interpolation.
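A sketch of the prefixing: the Fortran name can stay the same while the exported C symbol is changed through the name= specifier of bind(c) (the prefix rrtmgp_ and the argument list here are illustrative):

! C callers link against rrtmgp_interpolation; Fortran callers still
! reference the routine as interpolation.
subroutine interpolation(ncol, nlay) bind(C, name='rrtmgp_interpolation')
  use iso_c_binding, only: c_int
  integer(c_int), intent(in) :: ncol, nlay
  ! ... kernel body unchanged ...
end subroutine interpolation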
The mo_rng_mklvsl module attempts to include the source file mkl_vsl.f90, which is not included in the rte-rrtmgp source, so any attempt to build the mo_rng_mklvsl module will fail if this file is not sourced ahead of time. Unfortunately, this causes problems with our build environment in E3SM, which scans entire directories for source code files and attempts to build ALL of the files it finds. Thus, the mere presence of mo_rng_mklvsl.F90 in this directory will cause a build to fail for us. My solution until now has been to rename mo_rng_mklvsl.F90 to mo_rng_mklvsl.F90~ to trick the build scripts, but I now want to make the rte-rrtmgp repository a submodule in our repo to allow me to more easily pull in updates to rte-rrtmgp while preserving its git history in E3SM as well. I would suggest removing this file from the rte-rrtmgp repo, since it won't build anyway, or at least moving it to a different directory (rng_mklvsl, perhaps?).
The dimension indicating string length is named string_len in the longwave gas optics coefficients file, but string32 in the shortwave file. This results in some complications in a generic reader. In earlier versions both were named string_len. It would be great if that could be restored.
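In the meantime, a generic reader can probe for both names (a small sketch with the Fortran netCDF API; error handling abbreviated):

integer :: ncid, dimid, status
! Try the longwave name first, then fall back to the shortwave one.
status = nf90_inq_dimid(ncid, 'string_len', dimid)
if (status /= NF90_NOERR) status = nf90_inq_dimid(ncid, 'string32', dimid)
if (status /= NF90_NOERR) stop 'string-length dimension not found'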
There's a GNU bug in examples/all-sky/mo_garand_atmos_io.F90 wherein you cannot pass arrays of string literals that have different lengths. The workaround is to create a temporary, character(len=32) :: strtmp(2), assign a dim name to each index, and then use the temporary.
Also, there are array temporaries being created in the following places:
At line 234 of file ../rrtmgp/kernels/mo_gas_optics_kernels.F90
Fortran runtime warning: An array temporary was created for argument 'gpt_flv' of procedure 'gas_optical_depths_minor'
At line 254 of file ../rrtmgp/kernels/mo_gas_optics_kernels.F90
Fortran runtime warning: An array temporary was created for argument 'gpt_flv' of procedure 'gas_optical_depths_minor'
At line 187 of file ../rte/kernels/mo_rte_solver_kernels.F90
Fortran runtime warning: An array temporary was created for argument 'inc_flux' of procedure 'apply_bc_gpt'
+ ./rrtmgp_clouds rrtmgp-clouds.nc /home/temp/rte-rrtmgp/rrtmgp/data/rrtmgp-data-sw-g224-2018-12-04.nc /home/temp/rte-rrtmgp/extensions/cloud_optics/rrtmgp-cloud-optics-coeffs-sw.nc 3000
At line 234 of file ../rrtmgp/kernels/mo_gas_optics_kernels.F90
Fortran runtime warning: An array temporary was created for argument 'gpt_flv' of procedure 'gas_optical_depths_minor'
At line 254 of file ../rrtmgp/kernels/mo_gas_optics_kernels.F90
Fortran runtime warning: An array temporary was created for argument 'gpt_flv' of procedure 'gas_optical_depths_minor'
These can be worked around by using an array you create yourself, copying the data into it, and passing the new array.
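A sketch of that workaround for the gpt_flv case (the shape and call site are assumptions; the point is that an explicitly contiguous copy removes the need for a compiler-generated temporary):

integer, allocatable :: gpt_flv_c(:,:)
! Copy once, outside the hot loop, into contiguous storage ...
allocate(gpt_flv_c(size(gpt_flv,1), size(gpt_flv,2)))
gpt_flv_c = gpt_flv
! ... then pass gpt_flv_c to gas_optical_depths_minor in place of the
! original actual argument, leaving all other arguments unchanged.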
Hi,
What are the units for the cloud ice and liquid water paths - is it g/kg? I would like to do all-sky computations using CAMS data. I noticed that the liquid water path I computed (from mixing ratios in kg/kg) had a maximum value of ~0.2, while the all-sky program fills the array with values of 0 or 10. The units should be documented in mo_cloud_optics.F90.
Gas optics routines will currently fail when passed any profiles that do not span both the upper and lower atmosphere, as defined by the value of press_ref_trop in the coefficients files. This appears to be due to a bug in the interpolation routine in mo_gas_optics_kernels, where, if the threshold between the lower and upper atmosphere is not crossed, the arrays itropo_lower and itropo_upper are left unset (in my case these were left with values of zero, but this is probably compiler-dependent, since they do not appear to be explicitly initialized anywhere). The code in question seems to be the following:
! itropo = 1 lower atmosphere; itropo = 2 upper atmosphere
itropo = merge(1,2,tropo(icol,ilay))
if (ilay .eq. 1) then
itropo_last(icol) = itropo
endif
if (itropo_last(icol) .ne. itropo) then
if (itropo .eq. 1) then ! layers go from TOA to surface
itropo_lower(icol,1) = ilay
itropo_lower(icol,2) = nlay
itropo_upper(icol,1) = 1
itropo_upper(icol,2) = ilay - 1
itropo_last(icol) = itropo
else ! layers go from surface to TOA
itropo_lower(icol,1) = 1
itropo_lower(icol,2) = ilay - 1
itropo_upper(icol,1) = ilay
itropo_upper(icol,2) = nlay
itropo_last(icol) = itropo
endif
endif
This causes problems in gas_optical_depths_minor, where the values of itropo are used to set the loop indices over the vertical dimension of the column gas amounts. If either itropo_lower or itropo_upper contains zeros, then this code will create a segmentation fault when trying to access the col_gas array out of bounds.
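One possible fix, sketched with the names from the excerpt (and assuming the ranges are consumed as do ilay = itropo_x(icol,1), itropo_x(icol,2)): before the detection loop, initialize both ranges from the regime of the first layer, leaving the other regime as an empty range.

! Defaults for profiles that never cross press_ref_trop: the whole column
! belongs to the regime of layer 1; the other range is empty (start > end),
! so loops over it simply never execute.
if (tropo(icol,1)) then                            ! layer 1 in lower atmosphere
  itropo_lower(icol,1) = 1; itropo_lower(icol,2) = nlay
  itropo_upper(icol,1) = 1; itropo_upper(icol,2) = 0
else                                               ! layer 1 in upper atmosphere
  itropo_upper(icol,1) = 1; itropo_upper(icol,2) = nlay
  itropo_lower(icol,1) = 1; itropo_lower(icol,2) = 0
end if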
This is more of a question than an issue: is the index for ilev in tau(:,ilev,igpt) correct here? Based on when top_at_1 is true, it looks like there might be a +1 missing, but I'm not sure.
I have been trying to use the provided cloud optics with RTE+RRTMGP, but it looks like there might be a potential issue with the table lookup function compute_all_from_table. In particular, line 650 in mo_cloud_optics.F90 computes an index into the table, but this will result in an index <= 0 if the effective radius re is less than offset. The code in question is:
index = min(floor((re(icol,ilay) - offset)/step_size)+1, nsteps-1)
Perhaps this is missing something like a max(index, 1)? I.e., it seems like this would fix the problem:
index = max(min(floor((re(icol,ilay) - offset)/step_size)+1, nsteps-1), 1)
but perhaps I am misunderstanding. I suppose it could also be assumed that it is the user's responsibility to restrict the cloud mask to cases with effective radii within the bounds of the table, in which case maybe it is appropriate that this code errors out when it encounters a value of re below the minimum specified in the table. Still, it would probably be better to throw an explicit error message rather than letting the code segfault.
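A sketch of such an explicit check, using the names from the excerpt (the error-reporting mechanism is an assumption):

! Fail loudly instead of computing an out-of-bounds table index.
if (re(icol,ilay) < offset) then
  error_msg = 'compute_all_from_table: effective radius below table minimum'
  return
end if
index = min(floor((re(icol,ilay) - offset)/step_size)+1, nsteps-1)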
On Summit, we get the following failure with stage_files.py, using the following from a fresh conda install:
module load python/3.7.0-anaconda3-5.3.0
conda create -n newenv python=3.7 anaconda
source activate newenv
conda install netcdf4 xarray numpy
python ./stage_files.py
Conda env here: https://gist.github.com/mrnorman/be751789d5357b45453d56c8a4c917db
(newenv) [imn@login2:~/rte-rrtmgp/examples/rfmip-clear-sky] 8-) python ./stage_files.py
Dowloading RFMIP input files
Dowloading scripts for generating output templates
Traceback (most recent call last):
File "./stage_files.py", line 34, in <module>
urllib.request.urlretrieve(templ_scr_url, templ_scr)
File "/ccs/home/imn/.conda/envs/newenv/lib/python3.7/urllib/request.py", line 247, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/ccs/home/imn/.conda/envs/newenv/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/ccs/home/imn/.conda/envs/newenv/lib/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/ccs/home/imn/.conda/envs/newenv/lib/python3.7/urllib/request.py", line 543, in _open
'_open', req)
File "/ccs/home/imn/.conda/envs/newenv/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/ccs/home/imn/.conda/envs/newenv/lib/python3.7/urllib/request.py", line 1360, in https_open
context=self._context, check_hostname=self._check_hostname)
File "/ccs/home/imn/.conda/envs/newenv/lib/python3.7/urllib/request.py", line 1285, in do_open
h = http_class(host, timeout=req.timeout, **http_conn_args)
File "/ccs/home/imn/.conda/envs/newenv/lib/python3.7/http/client.py", line 1368, in __init__
context = ssl._create_default_https_context()
File "/ccs/home/imn/.conda/envs/newenv/lib/python3.7/ssl.py", line 565, in create_default_context
context = SSLContext(PROTOCOL_TLS)
File "/ccs/home/imn/.conda/envs/newenv/lib/python3.7/ssl.py", line 388, in __new__
self = _SSLContext.__new__(cls, protocol)
ssl.SSLError: [SSL] malloc failure (_ssl.c:2994)
(newenv) [imn@login2:~/rte-rrtmgp/examples/rfmip-clear-sky] >:O
There are folders rte-rrtmgp/rte/kernels and rte-rrtmgp/rte/kernels-openacc, and rte-rrtmgp/rrtmgp/kernels and rte-rrtmgp/rrtmgp/kernels-openacc.
The problem is that there is no file mo_rrtmgp_util_reorder_kernels.F90 in the rte-rrtmgp/rrtmgp/kernels-openacc folder, but it is used by mo_rrtmgp_util_reorder.F90 of rte-rrtmgp/rrtmgp/.
Similarly, mo_fluxes_broadband_kernels.F90 resides in rte/kernels but not in rte/kernels-openacc, yet it is used in mo_fluxes.F90, which is compiled in both cases...
PR#50
TL;DR: Is there an "official" way to run RRTMGP in single-precision?
So this is a bit of a long ramble of a question so bear with me...
In trying to get RRTMGP v1.5 working in GEOS, we found it was crashing with a segfault... and the previous version didn't seem to, the last time we tested.
Eventually, I was able to figure out that, for some reason, the set of Intel Fortran flags we switched to recently--which enable vectorization model-wide--leads to a segfault in RRTMGP, and we made that switch in between tests. Now, I don't think the actual issue is in RRTMGP, since changing the flags back to our old non-vectorized flags lets it work. I've tried to track down which of our flags cause the issue, but no luck so far.
But I got curious and thought, "I wonder if I can run RRTMGP in single precision?" for some extra speed (since part of my job is code optimization). So I changed mo_rte_kind.F90, recompiled, and boom. Big crash! But a bit of GitHub searching led me to this commit on the rte-rrtmgp-nn repo of @peterukk. So I added a k_min, made similar edits, and I can run RRTMGP in single precision! Even weirder, it works with both the non-vectorized and the vectorized flags (which really makes me think someone else is messing with RRTMGP's memory state).
And this got me thinking: is there a "correct" way to run single-precision RRTMGP? My naive attempt at just changing wp failed, but with the tweak from @peterukk it did seem to work, and the output wasn't a different planet...
The repository includes some examples/ and a few tests/ exercising various options in the code. These are cobbled together into a makeshift continuous integration that ensures changes to the code don't change the answers (unless the point of reference is also changed).
It would be terrific to add automated code coverage to the continuous integration, so we know how much of the code is being exercised and so new code is always paired with new tests. We are signed up with codecov.io, but I have not been able to figure out how to produce the .xml or .json files needed with the Intel compiler (see branch feature-code-coverage), nor have I been able to figure out how to manage coverage using the GNU toolchain when source code and executables are scattered across directories.
We'd welcome help or a PR that showed us how to generate code coverage reports during continuous integration.
To date, data for the RRTMGP schemes have been distributed alongside code in the repository, while data for verification and validation, including continuous integration, are distributed via FTP.
I propose to package the data used by the schemes, along with validation and verification data, via e.g. Zenodo, where each version of the data gets a new (but related) DOI. Something like:
rrtmgp_data/
rrtmgp_data/gas_optics/
rrtmgp_data/cloud_optics
rrtmgp_verification_data/
rrtmgp_verification_data/examples/
rrtmgp_verification_data/examples/rfmip-clear-sky
rrtmgp_verification_data/examples/all-sky
Files could be fetched for continuous integration and/or local use with the shell (e.g. wget https://zenodo.org/record/XXX/files/FILENAME.tgz) and/or with Pooch in Python.
@Chiil @vectorflux @skosukhin Any thoughts?
@naromero77 from Intel had some suggestions via email for using the Intel oneAPI compilers to check aspects of the implementation of the OpenMP offload.
I recently had some ideas that may help you continue OpenMP offload development and possibly do CI. With the Intel oneAPI compilers, you can do the following:
Test OpenMP offload with the target being the CPU instead of the GPU:
ifx -c -fiopenmp -fopenmp-targets=x86_64 <files>
The code should compile and run AS IS. However, you cannot detect issues with data transfers.
Test OpenMP offload with the target being the GPU, but with OpenMP offload DISABLED at runtime:
ifx -c -fiopenmp -fopenmp-targets=spir64 <files>
You will need to set the runtime environment variable OMP_TARGET_OFFLOAD=DISABLED. Again, you cannot detect issues with data transfers.
The first method should be supported on any Intel CPU. The second method should also be supported on any Intel CPU and would be the preferred method.
We would like to distribute the RTE and RRTMGP libraries via conda and possibly other channels (e.g. apt-get). Libraries would be updated at each release.
For conda we will need to write a recipe and get it adopted as a feedstock. I think conda uses CMake, which we don't currently support.
We can build separate libraries for the kernels and the Fortran frontend, or libraries that combine both. The Python frontend needs just the kernels - is that what we'd start with?
I think there may be a small indexing error in the subroutine create_flavor() in mo_gas_optics_rrtmgp.F90.
https://github.com/earth-system-radiation/rte-rrtmgp/blob/main/rrtmgp/mo_gas_optics_rrtmgp.F90#L1558
To be consistent with the k-distribution data format, line 1558:
do iatm=1,size(key_species,1)
should be replaced with:
do iatm=1,size(key_species,2)
From the LW k-distribution file, key_species is ordered (pair, atmos_layer, bnd), i.e. [2,2,6]. Fortunately this has no impact, since pair = atmos_layer.
Running the RFMIP test cases with gfortran 6.4.0 with -fcheck=all throws array out-of-bounds errors. The error I got is below. I can look more into this once I finish the E3SM CRM port, but someone else might be able to get to this in the meantime. The code should probably be run with this flag until it runs clean. I also plan to run it through valgrind after it runs through -fcheck=all cleanly.
[imn@login1:~/rte-rrtmgp/branches/master] >:O cat ../../scripts/master/fcheck_lw.out
At line 161 of file rrtmgp_rfmip_lw.F90
Fortran runtime error: Index '0' of dimension 1 of array 'gases_to_use' below lower bound of 1
It looks like mo_compute_bc is mistakenly using the ty_gas_optics abstract type directly. I think the type(ty_gas_optics) either needs to change to class(ty_gas_optics), or it needs to change to type(ty_gas_optics_rrtmgp). Note that this prevents the code from compiling.
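For reference, a minimal sketch: Fortran forbids declaring a non-polymorphic variable of an abstract type, so the declaration must either become polymorphic or name the concrete extension.

! type(ty_gas_optics)             :: kd   ! invalid: the type is abstract
class(ty_gas_optics), allocatable :: kd1  ! OK: polymorphic reference
type(ty_gas_optics_rrtmgp)        :: kd2  ! OK: concrete extension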
The indicated compiler flags for PGI in Compiler-flags.md are Intel-specific (i.e., identical to the Intel flags). Using them in combination with nvfortran obviously breaks compilation.