ncar / ldcpy

Statistical and visual tools for gathering metrics and comparing Earth System Model data files. A common use case is comparing data that has been lossily compressed with the original data.

Home Page: https://ldcpy.readthedocs.io/

License: Apache License 2.0

Languages: Python 99.23%, Makefile 0.51%, Jupyter Notebook 0.26%
Topics: xarray, lossy-data-compression, zfp

ldcpy's Introduction

[Badges: GitHub Workflow CI status, code style, documentation status, PyPI, conda-forge]

Large Data Comparison for Python

ldcpy is a utility for gathering and plotting metrics from NetCDF or Zarr files using the Pangeo stack. It provides a number of statistical and visual tools for comparing Earth System Model data files, such as lossily compressed output against the original data.

AUTHORS

Alex Pinard, Allison Baker, Anderson Banihirwe, Dorit Hammerling

COPYRIGHT

2020 University Corporation for Atmospheric Research

LICENSE

Apache 2.0

Documentation and usage examples are available at https://ldcpy.readthedocs.io/.

Reference to ldcpy paper

  1. A. Pinard, D. M. Hammerling, and A. H. Baker. Assessing differences in large spatiotemporal climate datasets with a new Python package. In the 2020 IEEE International Workshop on Big Data Reduction, 2020. doi: 10.1109/BigData50022.2020.9378100.

Link to paper: https://doi.org/10.1109/BigData50022.2020.9378100

Installation

Ensure conda is up to date and create a clean Python (3.6+) environment:

conda update conda
conda create --name ldcpy python=3.8
conda activate ldcpy

Now install ldcpy:

conda install -c conda-forge ldcpy

Alternative Installation

Ensure pip is up to date, and your version of python is at least 3.6:

pip install --upgrade pip
python --version

Install cartopy using the instructions provided at https://scitools.org.uk/cartopy/docs/latest/installing.html.

Then install ldcpy:

pip install ldcpy
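
As a quick check that the install worked, the sketch below mirrors the plotting call used in this repository's tests (shown later on this page). Here ds is assumed to be an xarray dataset holding both the original and reconstructed data under the labels 'orig' and 'recon'; the keyword names may differ in newer ldcpy releases:

import ldcpy

# ds holds both data collections; see the tutorial notebook for how it is built
ldcpy.plot(
    ds,
    'PRECT',
    c0='orig',
    c1='recon',
    metric='mean',
    metric_type='ratio',
    plot_type='time_series',
)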

Accessing the tutorial

If you want access to the tutorial notebook, clone the repository (this will create a local repository in the current directory):

git clone https://github.com/NCAR/ldcpy.git

Start by enabling Hinterland for code completion and code hinting in Jupyter Notebook and then opening the tutorial notebook:

jupyter nbextension enable hinterland/hinterland
jupyter notebook

The tutorial notebook can be found in docs/source/notebooks/TutorialNotebook.ipynb. Feel free to gather your own metrics or create your own plots in this notebook!

Other example notebooks that use the sample data in this repository include PopData.ipynb and MetricsNotebook.ipynb.

The AWSDataNotebook grabs data from AWS, so it can be run on a laptop, with the caveat that the files are large.

The following notebooks assume that you are using NCAR's JupyterHub (https://jupyterhub.hpc.ucar.edu): LargeDataGladenotebook.ipynb, CAMNotebook.ipynb, and error_bias.ipynb.

Re-create notebooks with Pangeo Binder

Try the notebooks hosted in this repo on Pangeo Binder. Note that the session is ephemeral. Your home directory will not persist, so remember to download your notebooks if you make changes that you need to use at a later time!

Note: All example notebooks are in docs/source/notebooks (the easiest ones to start with in Binder are TutorialNotebook.ipynb and PopData.ipynb).


ldcpy's People

Contributors

allibco, andersy005, dependabot[bot], mnlevy1981, pinarda, pre-commit-ci[bot]


Forkers

kmpaul, mnlevy1981

ldcpy's Issues

Reduce package size

The limit on PyPI is 100 MB; the data files currently use 443 MB and the rest of the package is 3.1 MB. We probably want to remove data files like the CAM-SE data and reduce the number of time slices in the rest of the data. Also, rename the data directory to sample-data.

get package on PyPI

We want three ways to install: PyPI with pip, conda-forge with conda, and a dev install with conda and the environment-dev.yml file. We may run into an issue with cartopy.

OrderedDict in error metrics?

It would be nice if print_stats() printed statistics in the same order every time; I think using an OrderedDict underneath the JSON will accomplish that.
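
A minimal sketch of the idea (the key names here are illustrative, not the actual print_stats output): build the stats in a fixed insertion order and serialize them; json.dumps preserves an OrderedDict's key order.

from collections import OrderedDict
import json

stats = OrderedDict()
stats['mean'] = 1.23           # insert keys in the order they should print
stats['variance'] = 0.45
stats['max_abs_error'] = 0.067
print(json.dumps(stats, indent=2))  # keys always appear in the same order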

get ldcpy package on conda-forge

We want three ways to install: PyPI with pip, conda-forge with conda, and a dev install with conda and the environment-dev.yml file.

Plots with slider bars

One group in the hackathon had interactive graphics that allowed users to hover over a map and get lat/lon-specific data.

Add support for se datasets

This will require reworking some of the metrics functionality, and adding some new plot functionality as well.

Interactive Mapping

Add ability to select point on spatial map and get time-series plot for that point. (See Plots with Slider Bars issue)

normalize error for plots and metrics

Rather than (or in addition to?) absolute RMSE, it would be nice to normalize it based on the field being analyzed. This could also carry over to plot.mean_error().
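
One possible normalization, as a sketch (a hypothetical helper, not part of the ldcpy API); dividing by the field mean instead of the dynamic range would be another reasonable choice:

import numpy as np

def normalized_rmse(orig, recon):
    # RMSE scaled by the dynamic range of the original field
    orig = np.asarray(orig, dtype=float)
    recon = np.asarray(recon, dtype=float)
    rmse = np.sqrt(np.mean((orig - recon) ** 2))
    return rmse / (orig.max() - orig.min())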

New Metrics

From the technote, we want the ability to plot:

  • Pooled variance ratio (fig 16)
  • Error lag-1 correlations (fig 19; see the sketch after this list)
  • Amplitude of the annual error harmonic (fig 18)
  • Min/max MAE (currently waiting on xarray version 0.15.2 for the idxmax() function)
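
For the lag-1 item, a sketch of the computation (again a hypothetical helper, not ldcpy API): the lag-1 autocorrelation of the pointwise error series between the original and reconstructed data.

import numpy as np

def error_lag1_correlation(orig, recon):
    # error series, centered, then the lag-1 autocorrelation estimate
    e = np.asarray(orig, dtype=float) - np.asarray(recon, dtype=float)
    e = e - e.mean()
    return np.sum(e[:-1] * e[1:]) / np.sum(e * e)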

Update ReadMe

Update README.md with a complete list of steps for a development install.

CircleCI builds failing occasionally

This test works fine on a local machine and in GitHub workflows, so it is probably not an issue with the code.

Output:

test_subset_lat_lon_ratio_time_series - tests.test_plot.TestPlot
tests/test_plot.py
self = <tests.test_plot.TestPlot testMethod=test_subset_lat_lon_ratio_time_series>

def test_subset_lat_lon_ratio_time_series(self):
    ldcpy.plot(
        ds2,
        'PRECT',
        c0='orig',
        metric='mean',
        c1='recon',
        metric_type='ratio',
        group_by=None,
        subset='first50',
        lat=44.76,
        lon=-93.75,
        plot_type='time_series',
    )

tests/test_plot.py:140:


ldcpy/plot.py:622: in plot
mp.time_series_plot(plot_data_c0, title_c0)
ldcpy/plot.py:355: in time_series_plot
self._label_offset(ax)
ldcpy/plot.py:174: in _label_offset
ax.figure.canvas.draw()
/opt/conda/lib/python3.7/site-packages/matplotlib/backends/backend_agg.py:393: in draw
self.figure.draw(self.renderer)
/opt/conda/lib/python3.7/site-packages/matplotlib/artist.py:38: in draw_wrapper
return draw(artist, renderer, *args, **kwargs)
/opt/conda/lib/python3.7/site-packages/matplotlib/figure.py:1736: in draw
renderer, self, artists, self.suppressComposite)
/opt/conda/lib/python3.7/site-packages/matplotlib/image.py:137: in _draw_list_compositing_images
a.draw(renderer)
/opt/conda/lib/python3.7/site-packages/matplotlib/artist.py:38: in draw_wrapper
return draw(artist, renderer, *args, **kwargs)
/opt/conda/lib/python3.7/site-packages/cartopy/mpl/geoaxes.py:479: in draw
return matplotlib.axes.Axes.draw(self, renderer=renderer, **kwargs)
/opt/conda/lib/python3.7/site-packages/matplotlib/artist.py:38: in draw_wrapper
return draw(artist, renderer, *args, **kwargs)
/opt/conda/lib/python3.7/site-packages/matplotlib/axes/_base.py:2630: in draw
mimage._draw_list_compositing_images(renderer, self, artists)
/opt/conda/lib/python3.7/site-packages/matplotlib/image.py:137: in _draw_list_compositing_images
a.draw(renderer)
/opt/conda/lib/python3.7/site-packages/matplotlib/artist.py:38: in draw_wrapper
return draw(artist, renderer, *args, **kwargs)
/opt/conda/lib/python3.7/site-packages/cartopy/mpl/feature_artist.py:155: in draw
geoms = self._feature.intersecting_geometries(extent)
/opt/conda/lib/python3.7/site-packages/cartopy/feature/__init__.py:302: in intersecting_geometries
return super(NaturalEarthFeature, self).intersecting_geometries(extent)
/opt/conda/lib/python3.7/site-packages/cartopy/feature/__init__.py:110: in intersecting_geometries
return (geom for geom in self.geometries() if
/opt/conda/lib/python3.7/site-packages/cartopy/feature/__init__.py:287: in geometries
geometries = tuple(shapereader.Reader(path).geometries())
/opt/conda/lib/python3.7/site-packages/cartopy/io/shapereader.py:166: in geometries
shape = self._reader.shape(i)
/opt/conda/lib/python3.7/site-packages/shapefile.py:854: in shape
return self.__shape()


self = <shapefile.Reader object at 0x7f994232e690>

def __shape(self):
    """Returns the header info and geometry for a single shape."""
    f = self.__getFileObj(self.shp)
    record = Shape()
    nParts = nPoints = zmin = zmax = mmin = mmax = None
    (recNum, recLength) = unpack(">2i", f.read(8))
    # Determine the start of the next record
    next = f.tell() + (2 * recLength)
    shapeType = unpack("<i", f.read(4))[0]
    record.shapeType = shapeType
    # For Null shapes create an empty points list for consistency
    if shapeType == 0:
        record.points = []
    # All shape types capable of having a bounding box
    elif shapeType in (3,5,8,13,15,18,23,25,28,31):
        record.bbox = _Array('d', unpack("<4d", f.read(32)))
    # Shape types with parts
    if shapeType in (3,5,13,15,23,25,31):
        nParts = unpack("<i", f.read(4))[0]
    # Shape types with points
    if shapeType in (3,5,8,13,15,18,23,25,28,31):
        nPoints = unpack("<i", f.read(4))[0]
    # Read parts
    if nParts:
        record.parts = _Array('i', unpack("<%si" % nParts, f.read(nParts * 4)))
    # Read part types for Multipatch - 31
    if shapeType == 31:
        record.partTypes = _Array('i', unpack("<%si" % nParts, f.read(nParts * 4)))
    # Read points - produces a list of [x,y] values
    if nPoints:
      flat = unpack("<%sd" % (2 * nPoints), f.read(16*nPoints))

E struct.error: unpack requires a buffer of 432 bytes

/opt/conda/lib/python3.7/site-packages/shapefile.py:777: error

Turn SampleNotebook into TutorialNotebook

This will require an overview section, links to the documentation, explanations of the plotting options (especially plot_type and metric_type), a list of the required plot arguments, and explanations of what the metadata commands (print_stats, ds) do.

Can we use open_mfdataset?

utils.open_datasets() was built on older code that called xr.open_dataset() several times, but those calls can probably be replaced with a single xr.open_mfdataset() call... we just need to make sure everything is concatenated correctly (see the sketch below).
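
A sketch of the replacement (the file paths and dimension name are illustrative): open all files at once and stack them along a new collection dimension instead of calling xr.open_dataset() per file.

import xarray as xr

paths = ['orig.nc', 'recon.nc']  # placeholder file names
# concatenate the files along a new 'collection' dimension
ds = xr.open_mfdataset(paths, combine='nested', concat_dim='collection')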

Most of the functions should probably be put into a class to avoid passing so many parameters around all the time. Also, we probably want to add some sort of checking of the input parameters to plot() to make sure the combination of parameters is valid. Alternatively, write time_series_plot, spatial_plot, etc. functions that fix some parameters and then call plot() (one version of that wrapper idea is sketched below).

Originally posted by @pinarda in #40 (comment)
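
A sketch of the wrapper idea, assuming the keyword names from the test output above (they may differ across ldcpy versions):

import ldcpy

def time_series_plot(ds, varname, **kwargs):
    # fix the plot type and forward everything else to plot()
    kwargs['plot_type'] = 'time_series'
    return ldcpy.plot(ds, varname, **kwargs)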

Add units to colorbar

  1. There is a units property in the dataset, but the metrics array returned by a call to get_metrics does not have a units property, so we need to add that property before we return the array.

  2. Then, add the units to the color bar title.
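
A minimal sketch of the two steps with synthetic data (get_metrics itself is not called here; metrics is a stand-in for its return value):

import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

da = xr.DataArray(np.random.rand(4, 4), attrs={'units': 'm/s'})
metrics = da - da.mean()                    # stand-in for a get_metrics result
metrics.attrs['units'] = da.attrs['units']  # step 1: carry the units over

mesh = plt.pcolormesh(metrics.values)
plt.colorbar(mesh, label=metrics.attrs['units'])  # step 2: label the colorbar
plt.show()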
