
Comments (13)

durack1 commented on July 4, 2024

In the presentation (noted above), E3SM atmospheric data contributed to CMIP6 had a compression ratio of 1.8 with lossless compression (netCDF deflate), i.e. roughly half the size of vanilla uncompressed writes.

To gauge the impact of using lossy compression on data usability, a couple of example cases would be great to test:

  1. the impact of lossy compression (BitRound, BitGroom, GranularBR) when multiple layered operations occur (e.g. ocean data is regridded, then multiple separate fields (thetao, so, ...) are used to calculate a derived quantity such as ocean heat content); some very preliminary discussion is contained in Griffies et al., 2016, Appendix A
  2. the impact of lossy compression on radiation calculations, e.g. whether correct radiation properties can be recovered from a degraded cloud area fraction (e.g. @martinjuckes)
  3. the impact of unit changes. In CMIP5 ocean potential temperature (thetao) was stored/requested in units of K, whereas in CMIP6 we mapped to degC. I recall that this had an impact on deflation (one was far better than the other), but do not recall the specifics, or which one was better - maybe degC gave us better compression, so an improvement in CMIP6?
  4. also consider cases where 70% of the grid is masked/missing (land variables), or 30% (ocean variables)
  5. impacts on read/write performance/compute vs storage benefits/reduction
  6. impacts on the calculation of long-term (centennial) drift in ocean properties, which need to be accounted for in piControl simulations to get at forced vs unforced variable responses

ping @juliettelavoie @geo-rao - note that suggestions 1-6 above are unrelated to CMOR development, but I figured it would be useful to co-locate this information so that people outside this repo can start to familiarize themselves with some of the dev discussions.


taylor13 commented on July 4, 2024

I think several criteria will need to be considered before we specify an appropriate truncation of the mantissa of numbers (a rough sketch of what such truncation amounts to follows this list). Here are a few:

  1. The precision of any observations of the field.
  2. The number of digits needed to recover at least 99% (or whatever limit we agree on) of the information content (see Klower et al., https://www.nature.com/articles/s43588-021-00156-2). I think we need to be careful relying on the Klower approach, which was developed to serve the needs of weather prediction. After a quick skim of that article, it appears to me they consider only the spatial variations of a field to determine its information content. There may be climate applications where the time evolution of a field is just as important, and it is possible that adequately resolving spatial variations might not be sufficient to resolve temporal variations.
  3. The kinds of difference calculations that will be performed on a field. As Martin pointed out, in diagnosing models, the spatial or temporal differences in a variable may be meaningful even if we don't believe the absolute values to anywhere near that precision. It may be difficult to determine ahead of time all of the differences (and similar types of diagnostic calculations) which will be performed, and therefore what precision is needed.
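For concreteness, here is a minimal sketch of what rounding the mantissa to a fixed number of kept bits amounts to. This is illustrative only: it is not the exact BitRound implementation in netCDF, and the keepbits value used below is arbitrary rather than a recommendation.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Round a float so that only 'keepbits' of its 23 explicit mantissa bits
 * are retained (round to nearest, ties away from zero).  NaN/Inf are not
 * handled in this sketch. */
static float bitround(float value, int keepbits)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);        /* reinterpret float as uint32   */

    int dropbits = 23 - keepbits;              /* mantissa bits to discard      */
    if (dropbits <= 0)
        return value;

    uint32_t half = 1u << (dropbits - 1);      /* half a unit of kept precision */
    bits += half;                              /* round to nearest              */
    bits &= ~((1u << dropbits) - 1u);          /* zero the discarded bits       */

    memcpy(&value, &bits, sizeof value);
    return value;
}

int main(void)
{
    float t = 285.637f;                        /* e.g. a temperature in K */
    printf("%.6f -> %.6f\n", t, bitround(t, 7));
    return 0;
}

Keeping, say, 7 mantissa bits bounds the relative error at roughly 2^-8 (about 0.4%), which is the kind of number one would weigh against criteria 1-3 above.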


durack1 commented on July 4, 2024

@taylor13 agreed, it will be useful to catch these comments and redirect them to another place so that testing of the impacts on data usability and access can be undertaken.

This particular issue can be closed when we've ascertained how to expose these new netCDF functions through CMOR, and whether this will be available in the CMOR3 or (future) CMOR4 releases. If it is possible with relatively little work, having these available in a soon-to-be-released version would be my preference.


mauzey1 commented on July 4, 2024

It should be easy to expose nc_def_var_quantize and nc_def_var_zstandard in the same way we do with nc_def_var_deflate and cmor_set_deflate.

We might need to add a check for the version of netCDF4 being used to determine whether the functions are supported, similar to the following code; a rough sketch of how the new functions might be guarded follows the snippet below.

cmor/Src/cmor.c

Lines 25 to 42 in 047fd2c

/* ==================================================================== */
/* this is defining NETCDF4 variable if we are                          */
/* using NETCDF3 not used anywhere else                                 */
/* ==================================================================== */
#ifndef NC_NETCDF4
#define NC_NETCDF4 0
#define NC_CLASSIC_MODEL 0
int nc_def_var_deflate(int i, int j, int k, int l, int m)
{
    return (0);
};
int nc_def_var_chunking(int i, int j, int k, size_t * l)
{
    return (0);
};
#endif
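
As a very rough sketch (not CMOR code; the feature macros and their placement are my assumption), the new entry points could be stubbed in the same spirit:

#include <netcdf_meta.h>   /* feature macros such as NC_HAS_QUANTIZE / NC_HAS_ZSTD */

#if !defined(NC_HAS_QUANTIZE) || !NC_HAS_QUANTIZE
/* Library lacks quantization: provide a no-op stub so CMOR still links,
 * mirroring the nc_def_var_deflate stub above (one could instead return an
 * error code to surface the missing feature). */
int nc_def_var_quantize(int ncid, int varid, int quantize_mode, int nsd)
{
    (void)ncid; (void)varid; (void)quantize_mode; (void)nsd;
    return (0);
}
#endif

#if !defined(NC_HAS_ZSTD) || !NC_HAS_ZSTD
/* Library lacks the Zstandard filter: same no-op fallback. */
int nc_def_var_zstandard(int ncid, int varid, int level)
{
    (void)ncid; (void)varid; (void)level;
    return (0);
}
#endif

At variable-definition time the call would then look something like nc_def_var_quantize(ncid, varid, NC_QUANTIZE_BITROUND, nsd), where nsd is interpreted as significant decimal digits for BitGroom/GranularBR and as significant bits for BitRound; whatever CMOR-level wrapper ends up exposing this (e.g. a cmor_set_quantize analogue of cmor_set_deflate) is still to be decided.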


matthew-mizielinski commented on July 4, 2024

Another piece of context here: Baker et al. added some data with lossy compression to an ensemble of CESM simulations. Several of the figures here highlight the benefits of, and issues created by, introducing it.


durack1 commented on July 4, 2024

Another piece of context here: Baker et al. added some data with lossy compression to an ensemble of CESM simulations. Several of the figures here highlight the benefits of, and issues created by, introducing it.

Nice catch! It looks like it does matter, depending on what you're calculating (Figures 5, 6, 9, 13). Their Section 6, "Lessons learned", is certainly worth reading; it notes that relationships between variables matter, illustrated by their surface energy balance anomaly built from 4 separate variables, and points out that commonly derived variables need to be a primary consideration, as do high-frequency/precipitation extremes and other very data-sensitive analyses. They also point out that how missing/fill values are treated needs to be a consideration. It could be useful to loop around with Allison and Gary (@strandwg) to see if follow-on analyses are available.


taylor13 commented on July 4, 2024

Thanks for the link to Baker. Looks like useful information.


strandwg commented on July 4, 2024

Allison is definitely the person to talk to.


sashakames commented on July 4, 2024

Hey, one consideration in favour of sticking with "lossy" compression and skipping "lossless" compression is that the latter is done in hardware at the storage layer and is much more efficient that way; you save CPU cycles. Just another consideration, and I didn't have a chance to raise it with Charlie yesterday.


durack1 commented on July 4, 2024

the latter is done in hardware at the storage layer and is much more efficient that way; you save CPU cycles.

Is this on the GPFS hardware that you're talking about? So this is infrastructure/hardware dependent, right?


sashakames commented on July 4, 2024

Typically the block storage controller. All the US data centers have it for sure; it's a fairly standard feature that has been around for a while, but true, there's no guarantee that every center will enable it. (But I would be surprised if they don't by now.)


sashakames commented on July 4, 2024

OTOH, the end user will end up with a larger file size if it is downloaded locally, so there would be a trade-off (but the downloading tool could then rewrite the file with level 3 lossless compression to save space locally; I'd hope that by CMIP7 wget is a relic and most people use a more robust client that would offer such a feature).
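
For example, something like "nccopy -d 3 downloaded_file.nc recompressed_file.nc" (nccopy's -d option sets the deflate level) could be used to re-apply lossless deflate after download; the file names here are placeholders, and whether a download client would automate this is speculation on my part.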


czender commented on July 4, 2024

That's news to me. What type of lossless compression do these block storage controllers implement? Zstandard? DEFLATE? Or...?

