Comments (13)
In the presentation (noted above), E3SM atmospheric data contributed to CMIP6 achieved a compression ratio of 1.8 with lossless netCDF deflate, i.e. a little over half the size of vanilla (uncompressed) writes.
To gauge the impact of using lossy compression on data usability, a couple of example cases would be great to test:
- the impact of lossy compression (BitRound, BitGroom, GranularBR) when multiple layered operations occur, e.g. ocean data is regridded, then multiple separate fields (thetao, so, ...) are used to calculate a derived quantity such as ocean heat content; a very preliminary discussion is contained in Griffies et al., 2016, Appendix A
- the impact of lossy compression on radiation calculations, e.g. whether correct radiation properties can be recovered from a degraded cloud area fraction (cc @martinjuckes)
- the impact of unit changes. In CMIP5, ocean potential temperature (thetao) was stored/requested in units of K, whereas in CMIP6 we mapped to degC. I recall that this had an impact on deflation (one was far better than the other), but do not recall the specifics or which was better; maybe degC gave us better compression, so an improvement in CMIP6?
- also consider cases where 70% of the grid is masked/missing (land variables), or 30% (ocean variables)
- impacts on read/write performance and compute cost vs the storage benefits/reduction
- impacts on the calculation of long-term (centennial) drift in ocean properties, which needs to be accounted for in piControl simulations to separate forced from unforced responses
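To make the layered-operations concern concrete, here is a minimal, hypothetical sketch in plain Python (not CMOR or netCDF code) of a BitRound-style mantissa truncation applied to two fields before a derived quantity is computed. The field values, the keepbits choice, and the folded rho*cp constant are all illustrative, and the rounding is a simplified round-half-up stand-in for netCDF's actual BitRound mode:

```python
import struct

def bitround(x: float, keepbits: int) -> float:
    """Round the float32 form of x to `keepbits` mantissa bits.

    Simplified round-half-up stand-in for netCDF's BitRound quantize
    mode; positive finite values only (sign/inf/NaN handling omitted).
    """
    ui = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - keepbits  # float32 carries 23 explicit mantissa bits
    if drop <= 0:
        return x
    ui = (ui + (1 << (drop - 1))) & ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", ui))[0]

# Toy "ocean heat content" built from two independently quantized fields:
# layer temperatures (degC) and layer thicknesses (m); rho*cp is folded
# into one nominal constant. All numbers are made up for illustration.
RHO_CP = 4.1e6                       # J m-3 K-1, nominal seawater value
thetao = [10.37, 8.92, 5.11, 2.04]   # hypothetical layer temperatures
dz = [10.0, 25.0, 50.0, 100.0]       # hypothetical layer thicknesses

exact = sum(RHO_CP * t * d for t, d in zip(thetao, dz))
lossy = sum(RHO_CP * bitround(t, 9) * bitround(d, 9)
            for t, d in zip(thetao, dz))
rel_err = abs(lossy - exact) / exact
print(f"relative error in derived quantity at 9 keepbits: {rel_err:.2e}")
```

With 9 kept bits each input is accurate to roughly 2^-10 in relative terms, and the derived sum stays at a similar level in this toy case; the worry raised in the bullet above is that regridding plus multi-field arithmetic can amplify such errors, which is exactly what a real test would quantify.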
ping @juliettelavoie @geo-rao - note the suggestions (1-n) above are unrelated to CMOR development, but I figured it would be useful to co-locate this information so that discussions outside this repo can start to familiarize themselves with some of the dev discussions
from cmor.
I think several criteria will need to be considered before we specify an appropriate truncation of the mantissa of numbers. Here are a few:
- The precision of any observations of the field.
- The number of digits needed to recover at least 99% (or whatever limit we agree on) of the information content (see Klöwer et al., https://www.nature.com/articles/s43588-021-00156-2). We need to be careful, I think, in relying on the Klöwer approach, which was developed to serve the needs of weather prediction. After quickly skimming that article, it appears to me that they consider only the spatial variations of a field when determining its information content. There may be climate applications where the time evolution of a field is just as important, and simply resolving spatial variations adequately might not be sufficient to resolve temporal variations.
- The kinds of differences that will be performed on a field. As Martin pointed out, in diagnosing models, the spatial or temporal differences in a variable may be meaningful even if we don't believe the absolute values to anywhere near that precision. It may be difficult to determine ahead of time all of the differences (and similar types of diagnostic calculations) which will be performed and therefore what precision is needed.
from cmor.
@taylor13 agreed, it will be useful to catch these comments and redirect them to another place so that testing of the impacts on data usability and access can be undertaken.
This particular issue can be closed when we've ascertained how to expose these new netCDF functions through CMOR, and whether this will land in the CMOR3 or CMOR4 (future) release. If it is possible, and relatively little work, having these available in a soon-to-be-released version would be my preference.
from cmor.
It should be easy to expose nc_def_var_quantize and nc_def_var_zstandard in the same way we do with nc_def_var_deflate and cmor_set_deflate.
We might need to add a check for the version of netCDF4 being used to determine whether these functions are supported, similar to the following code (lines 25 to 42 in 047fd2c).
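As a sketch of the gating logic only: the real check lives in CMOR's C code, and this Python helper and its name are invented for illustration. The grounding fact is that nc_def_var_quantize and nc_def_var_zstandard first appeared in netCDF-C 4.9.0, so any wrapper would gate on that version, mirroring the existing deflate capability check:

```python
def netcdf_supports_quantize(version: str) -> bool:
    """Hypothetical helper: quantize and zstandard APIs
    (nc_def_var_quantize, nc_def_var_zstandard) first appeared in
    netCDF-C 4.9.0, so exposure through CMOR would be gated on that
    version, much like the existing deflate check."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= (4, 9)

print(netcdf_supports_quantize("4.9.2"))  # True
print(netcdf_supports_quantize("4.8.1"))  # False
```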
from cmor.
Another piece of context here: Baker et al. added some data with lossy compression to an ensemble of CESM simulations. Several of the figures here highlight both the benefits of, and the issues created by, introducing it.
from cmor.
> Another piece of context here; Baker et al added some data with lossy compression to an ensemble of CESM simulations. Several of the figures here highlight the benefits of and issues created by introducing it
Nice catch! It looks like it does matter, depending on what you're calculating (Figures 5, 6, 9, 13). Their Section 6, "Lessons learned", is certainly worth reading; it notes that relationships between variables matter (their surface energy balance anomaly is derived from 4 separate variables), pointing out that commonly derived variables need to be a primary consideration, as do high-frequency/precipitation extremes and other very data-sensitive analyses. They also point out that the treatment of missing/fill values needs consideration. It could be useful to loop around with Allison and Gary (@strandwg) to see if follow-on analyses are available
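On the missing/fill-value point, a hedged sketch of why quantization must leave fill values untouched. Pure Python; the fill constant, helper names, and the simplified BitRound-style rounding are illustrative, not CMOR or netCDF internals (it is worth verifying how the netCDF quantize implementation handles the declared _FillValue):

```python
import struct

FILL = 1.0e20  # a common CMIP _FillValue for float fields (illustrative)

def bitround(x: float, keepbits: int) -> float:
    # Simplified round-half-up BitRound-style mantissa truncation
    # (float32, positive finite values only).
    ui = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - keepbits
    if drop <= 0:
        return x
    ui = (ui + (1 << (drop - 1))) & ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", ui))[0]

def quantize_field(values, keepbits):
    # Leave fill values bit-for-bit intact so land/ocean masks survive;
    # quantizing them would silently unmask cells in downstream analyses.
    return [v if v == FILL else bitround(v, keepbits) for v in values]

field = [15.2, FILL, 3.7, FILL]   # e.g. SST with two land cells masked
out = quantize_field(field, 9)
print(out[1] == FILL and out[3] == FILL)  # True: the mask is preserved
```

If the fill value were quantized along with the data, equality tests against _FillValue would fail and masked cells would appear as (huge) valid values, which is one of the failure modes the Baker et al. lessons warn about.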
from cmor.
Thanks for the link to Baker. Looks like useful information.
from cmor.
Allison is definitely the person to talk to.
from cmor.
Hey, one consideration for sticking with lossy compression and skipping the lossless compression is that the latter is done in hardware on the storage layer, which is much more efficient and saves CPU cycles. Just another consideration; I didn't have a chance to raise it with Charlie yesterday.
from cmor.
> the latter is done in hardware on the storage layer and much more efficient that way, you are saving CPU cycles that way.
Is this on the GPFS hardware that you're talking about? So this is infrastructure/hardware dependent, right?
from cmor.
Typically the block-storage controller. All the US datacenters have it for sure; it's a fairly standard feature that has been around for a while, though admittedly not a guarantee that every center will enable it (but I would be surprised if they don't by now).
from cmor.
OTOH, the end user will end up with a larger file size if it is downloaded locally, so there would be a trade-off (but the downloading tool could then rewrite the file with L3 lossless compression to save space locally; I'd hope that by CMIP7 wget is a relic and most use a more robust client that would offer such a feature).
from cmor.
That's news to me. What type of lossless compression do these blockstorage controllers implement? Zstandard? DEFLATE? Or...?
from cmor.