xgcm / xhistogram Goto Github PK

View Code? Open in Web Editor NEW

90.0 90.0 20.0 2.87 MB

Fast, flexible, label-aware histograms for numpy and xarray

Home Page: https://xhistogram.readthedocs.io

License: MIT License

Python 100.00%

xhistogram's Introduction

xgcm: General Circulation Model Postprocessing with xarray

Binder Examples

Link	Provider	Description
	mybinder.org	Basic self-contained example
	Pangeo Binder	More complex examples integrated with other Pangeo tools (dask, zarr, etc.)

Description

xgcm is a python package for working with the datasets produced by numerical General Circulation Models (GCMs) and similar gridded datasets that are amenable to finite volume analysis. In these datasets, different variables are located at different positions with respect to a volume or area element (e.g. cell center, cell face, etc.) xgcm solves the problem of how to interpolate and difference these variables from one position to another.

xgcm consumes and produces xarray data structures, which are coordinate and metadata-rich representations of multidimensional array data. xarray is ideal for analyzing GCM data in many ways, providing convenient indexing and grouping, coordinate-aware data transformations, and (via dask) parallel, out-of-core array computation. On top of this, xgcm adds an understanding of the finite volume Arakawa Grids commonly used in ocean and atmospheric models and differential and integral operators suited to these grids.

xgcm was motivated by the rapid growth in the numerical resolution of ocean, atmosphere, and climate models. While highly parallel supercomputers can now easily generate tera- and petascale datasets, common post-processing workflows struggle with these volumes. Furthermore, we believe that a flexible, evolving, open-source, python-based framework for GCM analysis will enhance the productivity of the field as a whole, accelerating the rate of discovery in climate science. xgcm is part of the Pangeo initiative.

Getting Started

To learn how to install and use xgcm for your dataset, visit the xgcm documentation.

xhistogram's People

Contributors

Stargazers

Watchers

xhistogram's Issues

Support binning over datetime64 objects

The method of adding a small increment to the bin (in core.py) leads to an error when trying to do time binning over a numpy datetime64-indexed coordinate, which are quite common in xarray. It would be really nice if this was supported.

feature: multi-dimensional bins

currently, xhistogram only allows bins to be one-dimensional.

however, when the bin edges vary in time (seasonality) or space (locations of the globe) xhistogram cannot be used with multi-dim bins. there is a hard-coded requirement for bins elements to be 1-d.
One of such multi-dim bin applications is the ranked probability score rps we use in xskillscore.rps, where we want to know how many forecasts fell into which bins. Bins are often defined as terciles of the forecast distribution and the bins for these terciles (forecast_with_lon_lat_time_dims.quantile(q=[.33,.66],dim='time')) depend on lon and lat.

How we solved this in xskillscore.rps:
< gives us CDFs, and diff brings it back to histograms. maybe have to throw away the upper edge

Fc = (forecasts < forecasts_edges).mean('member').diff(bin_dim)
Oc = (observations < observations_edges).astype("int").diff(bin_dim)

https://github.com/xarray-contrib/xskillscore/blob/493f9afd7b5acefb4baa47bec6ad65fca19965bd/xskillscore/core/probabilistic.py#L680

I first implemented rps with xhistogram, then with the snippet above, yields same results.

However, I am not sure whether such multi-dimensional bins would be an interesting addition to xhistogram or are out-of-scope.

make `dask.array` dependency not `dask` to speedup pip install

When installing xhistogram via pip many packages (bokeh etc) are installed due to the dask dependency. https://github.com/xgcm/xhistogram/network/dependencies

However, we actually only/mostly use dask.array. Could we make dask.array the dependency, not dask?
I am unsure whether this makes the difference I want to see

dask.is_dask_collection is used in

xhistogram/xhistogram/core.py

Line 339 in 5c12e17

is_dask_array = any([dask.is_dask_collection(a) for a in args])

, but this could be replaced as in

xhistogram/xhistogram/core.py

Line 34 in 5c12e17

return any(isinstance(a, dsa.core.Array) for a in args)

by isinstance(a, dsa.core.Array)

x-ref to xarray-contrib/xskillscore#359

Alternative dask-powered histogram algorithm using xarray.groupby and numpy_groupies

After @shoyer mentioned earlier today that he had an example of dealing with with the ND-histogram problem in xarray by using xarray.apply_ufunc and numpy_groupies, I made this notebook to try it out for creating histograms in xarray.

The basic idea is that da.groupby_bins(bins).apply(count) essentially creates a histogram, and numpy_groupies can speed up the groupby_bins hugely.

I think its pretty cool that it even works, but you'll see in the notebook that I don't think the performance compares favourably with xhistogram's dask.blockwise implementation (see #49), though I didn't manage to get numba-powered groupies working yet. The dask task graphs are also not as nice.

@rabernat this is the sort of thing I had in mind originally.

@gjoseph92 you might find this interesting as an alternate solution to your blockwise one.

@dcherian and @max-sixty you might find this example interesting as I know you've been working on using numpy_groupies in pydata/xarray#4473 .

DOCS: Add example how to use xhistogram to split a dataset into regions with a mask?

This is based on a brilliant suggestion by @dcherian, which I have used in multiple projects.

I think it would be very nice to document this little 'trick' to efficiently compute a histogram in 'regions' of an array which are encoded in a numbered mask (this is often how e.g. ocean basins are masked in climate models). Instead of masking out one region at a time with something like this:

for num_region in range(4):
    da_masked = da.where(da.mask==num_region)
    histogram(da_masked, ...)

One can simply add the mask as a second input variable and with appropriate bins, the results will be segmented into the appropriate regions (saving n-1 computations steps; where n is the number of regions).

I wrote up a quick example:

import numpy as np
import xarray as xr
from xhistogram.xarray import histogram

# make a quadrant mask
mask = np.concatenate([
    np.concatenate([np.full((2,2), 1), np.full((2,2), 2)]),
    np.concatenate([np.full((2,2), 3), np.full((2,2), 4)]),
],axis=1)
# create data that has similar values but with some added noise
data = (np.random.rand(*mask.shape)-0.5)*0.5+mask

da = xr.DataArray(data, dims=['x','y'], name='data', coords={'mask':xr.DataArray(mask, dims=['x','y'], name='mask')})
da

data_bins = np.linspace(0,5, 50)
mask_bins = np.arange(0.5, 5.5)

hist = histogram(da, da.mask, bins=[data_bins, mask_bins])
hist.plot(col='mask_bin')

You can see that this successfully segemented the regions and computed the histogram for each (the data is slightly spread around the actual mask number).

I think this really is a brilliant application case for so many science applications, but it is not obvious untils one tries it (at least it wasnt for me until @dcherian suggested it). So I think it would make a great example for the docs?

Id be happy to add this if ppl think it provides value. Maybe @shanicetbailey is interested in collaborating on this, since this might be very helpful in her current project?

use ravel_multi_index from dask

there's a note here (https://github.com/xgcm/xhistogram/blob/master/xhistogram/duck_array_ops.py#L40) that dask doesn't have ravel_multi_index.

This recently got added to dask: https://github.com/dask/dask/blob/main/dask/array/routines.py#L1431

Bins argument of type int doesn't work

Currently the documentation suggests that one can pass an int xhist(bins=...) similarly to that of numpy, but it doesn't work and looks like it isn't tested for in the pytest suite.

import numpy as np
import xarray as xr

# Demo data
A = np.arange(100)
da = xr.DataArray(A, dims='time').rename('test')

Case 1:
xhist(da, bins=10)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-48-1c6580238521> in <module>
----> 1 xhist(da, bins=10)

~/miniconda3/envs/analysis/lib/python3.6/site-packages/xhistogram/xarray.py in histogram(bins, dim, weights, density, block_size, bin_dim_suffix, bin_edge_suffix, *args)
     72     # some sanity checks
     73     # TODO: replace this with a more robust function
---> 74     assert len(bins)==N_args
     75     for bin in bins:
     76         assert isinstance(bin, np.ndarray), 'all bins must be numpy arrays'

TypeError: object of type 'int' has no len()

Case 2:
xhist(da, bins=[10])

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-49-85ba491a8442> in <module>
----> 1 xhist(da, bins=[10])

~/miniconda3/envs/analysis/lib/python3.6/site-packages/xhistogram/xarray.py in histogram(bins, dim, weights, density, block_size, bin_dim_suffix, bin_edge_suffix, *args)
     74     assert len(bins)==N_args
     75     for bin in bins:
---> 76         assert isinstance(bin, np.ndarray), 'all bins must be numpy arrays'
     77 
     78     for a in args:

AssertionError: all bins must be numpy arrays

Case 3:
xhist(da, bins=[np.array(10)])

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-50-394f2e7f4bcc> in <module>
----> 1 xhist(da, bins=[np.array(10)])

~/miniconda3/envs/analysis/lib/python3.6/site-packages/xhistogram/xarray.py in histogram(bins, dim, weights, density, block_size, bin_dim_suffix, bin_edge_suffix, *args)
    127 
    128     h_data = _histogram(*args_data, weights=weights_data, bins=bins, axis=axis,
--> 129                         block_size=block_size)
    130 
    131     # create output dims

~/miniconda3/envs/analysis/lib/python3.6/site-packages/xhistogram/core.py in histogram(bins, axis, weights, density, block_size, *args)
    245     h = _histogram_2d_vectorized(*all_args_reshaped, bins=bins,
    246                                  weights=weights_reshaped,
--> 247                                  density=density, block_size=block_size)
    248 
    249     if h.shape[0] == 1:

~/miniconda3/envs/analysis/lib/python3.6/site-packages/xhistogram/core.py in _histogram_2d_vectorized(bins, weights, density, right, block_size, *args)
     95     for a, b in zip(args, bins):
     96         assert a.ndim == 2
---> 97         assert b.ndim == 1
     98         assert a.shape == a0.shape
     99     if weights is not None:

AssertionError:

Versioneer seems to be broken on python3.12

Traceback

error: builder for '/nix/store/klb2gzpffqv0hpwydv2v1rba6yarmlyr-python3.12-xhistogram-0.3.2.drv' failed with exit code 1;
       last 25 log lines:
       >   File "/nix/store/j3xg6ln22ng7b3qcs1dpha12x1kyldzd-python3.12-pyproject-hooks-1.0.0/lib/python3.12/site-packages/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
       >     return hook(config_settings)
       >            ^^^^^^^^^^^^^^^^^^^^^
       >   File "/nix/store/xnd1jdqjqdva8aqcbgw9ynmnfphpkzcy-python3.12-setuptools-70.0.0/lib/python3.12/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
       >     return self._get_build_requires(config_settings, requirements=['wheel'])
       >            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       >   File "/nix/store/xnd1jdqjqdva8aqcbgw9ynmnfphpkzcy-python3.12-setuptools-70.0.0/lib/python3.12/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
       >     self.run_setup()
       >   File "/nix/store/xnd1jdqjqdva8aqcbgw9ynmnfphpkzcy-python3.12-setuptools-70.0.0/lib/python3.12/site-packages/setuptools/build_meta.py", line 487, in run_setup
       >     super().run_setup(setup_script=setup_script)
       >   File "/nix/store/xnd1jdqjqdva8aqcbgw9ynmnfphpkzcy-python3.12-setuptools-70.0.0/lib/python3.12/site-packages/setuptools/build_meta.py", line 311, in run_setup
       >     exec(code, locals())
       >   File "<string>", line 36, in <module>
       >   File "/private/tmp/nix-build-python3.12-xhistogram-0.3.2.drv-0/xhistogram-0.3.2/versioneer.py", line 1524, in get_version
       >     return get_versions()["version"]
       >            ^^^^^^^^^^^^^^
       >   File "/private/tmp/nix-build-python3.12-xhistogram-0.3.2.drv-0/xhistogram-0.3.2/versioneer.py", line 1451, in get_versions
       >     cfg = get_config_from_root(root)
       >           ^^^^^^^^^^^^^^^^^^^^^^^^^^
       >   File "/private/tmp/nix-build-python3.12-xhistogram-0.3.2.drv-0/xhistogram-0.3.2/versioneer.py", line 346, in get_config_from_root
       >     parser = configparser.SafeConfigParser()
       >              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       > AttributeError: module 'configparser' has no attribute 'SafeConfigParser'. Did you mean: 'RawConfigParser'?

axes not used for making hist got shifted

Say we have two xarray.DataArrays with dims (t:4, z:10, x:100, y:100). If we make histagram based on dimensions x and y, what we have in the result is (t:10, z:4, x_bin, y_bin). The t and z get exchanged.

import numpy as np
import xarray as xr
from xhistogram.xarray import histogram


# create data for testing

time_axis = range(4)
depth_axis = range(10)
X_axis = range(30)
Y_axis = range(30)

dat1 = np.random.randint(low = 0, high = 100,size=(len(time_axis),len(depth_axis),len(X_axis),len(Y_axis)))
array1 = xr.DataArray(dat1, coords = [time_axis,depth_axis,X_axis,Y_axis], dims = ['time','depth','X','Y'])

dat2 = np.random.randint(low = 0, high = 50,size=(len(time_axis),len(depth_axis),len(X_axis),len(Y_axis)))
array2 = xr.DataArray(dat2, coords = [time_axis,depth_axis,X_axis,Y_axis], dims = ['time','depth','X','Y'])

# create bins and rename arrays

bins1 = np.linspace(0, 100, 50)
bins2 = np.linspace(0,50,25)

array1 = array1.rename('one')
array2 = array2.rename('two')

result= histogram(array1,array2,dim = ['X','Y'] , bins=[bins1,bins2])

The dimensions of result is (time: 10, depth: 4, one_bin: 49, two_bin: 24) instead of (time: 4, depth: 10, one_bin: 49, two_bin: 24). Is this a bug of the code or just my misusing of the function?

Division by zero error with SOSE data

Hi, I am just rerunning old notebooks but am now getting an error I haven't gotten before with this code. Below I have attached a simplified version using xhistogram's histogram function on regular SOSE salt data and only specifying salt_bins as an argument (no weights or dims specified for simplicity). Below is the error message. I have run this on regular ECCO data and it works with this template...not sure what's going on and would appreciate anyone shedding some light. Thanks!

import xarray as xr
from xhistogram.xarray import histogram
import gcsfs

#Load in data
fs = gcsfs.GCSFileSystem(requester_pays=True)
mapping = fs.get_mapper('gcs://pangeo-ecco-sose')
ds = xr.open_zarr(mapping, consolidated=True)
ds

#create salt bins
salt_bins = np.arange(32, 38.1, 0.1)

test = histogram(ds.SALT, bins=[salt_bins])
test.load() #this is where the error occurs

Error message:

ZeroDivisionError: integer division or modulo by zero

Opaque error when a non-xarray object is passed to xhist

Currently when passing, say, an ndarray to xhist, the error that comes back is as follows:

import numpy as np
from xhistogram.xarray import histogram as xhist

# Demo data
A = np.arange(100)
xhist(A, bins=[np.linspace(0, 100, 11)])

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-75-0c14fe219de4> in <module>
----> 1 xhist(A, bins=[np.linspace(0, 100, 11)])

~/miniconda3/envs/analysis/lib/python3.6/site-packages/xhistogram/xarray.py in histogram(bins, dim, weights, density, block_size, bin_dim_suffix, bin_edge_suffix, *args)
     78     for a in args:
     79         # TODO: make this a more robust check
---> 80         assert a.name is not None, 'all arrays must have a name'
     81 
     82     # we drop coords to simplify alignment

AttributeError: 'numpy.ndarray' object has no attribute 'name'

This is minor, but it would be good to have a quick check that sees whether the object is an xarray.DataArray and if not, throw a clear error that that's the issue.

Release v0.3?

We have done a fair bit of work on xhistogram recently. Once #44 is in, do we want to do a release?

Bug with density calculation when NaNs are present

There is a bug in the way histograms are normalised to densities that manifests when there are NaNs in the input data:

import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
from xhistogram.xarray import histogram

data = np.random.normal(size=(10,2))
# Add some nans
N_nans = 6
data.ravel()[np.random.choice(data.size, N_nans, replace=False)] = np.nan
bins = np.linspace(-5,5,5)
bin_centers = 0.5 * (bins[:-1] + bins[1:])

# np.histogram -----
h, _ = np.histogram(data[:,0], bins, density=True)
plt.plot(bin_centers, h, label='numpy histogram')
  
# xhistogram -----
da = xr.DataArray(
    data, dims=['s', 'x'], 
    coords=[range(data.shape[0]), 
            range(data.shape[1])]).rename('test')
h2 = histogram(da, bins=[bins], dim=['s'], density=True)
plt.plot(bin_centers, h2[0,:], linestyle='--', label='xhistogram')

plt.legend()
plt.xlabel('bins')
plt.ylabel('pdf')

This bug comes about when there are dimensions that are not being histogram'd ("bystander" dimensions). Currently we sum over all axis to estimate the area/volume of our histogram and then account bystander dimensions as a secondary step. However, this can produce incorrect results when NaNs are present because there may be a different number of NaNs along each bystander dimension.

Compliance with Dask version 2021.03 ?

First of all thank you for making this useful tool.
It has worked well for me until Dask version 2021.03, after which I am getting error as below when calling histogram. I apologize that this might be not be sufficient information to reproduce.

../../opendrift/models/basemodel.py:3707: in get_density_xarray
    h = histogram(self.ds.lon, self.ds.lat, bins=[lonbin, latbin],
../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/xhistogram/xarray.py:133: in histogram
    h_data = _histogram(*args_data, weights=weights_data, bins=bins, axis=axis,
../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/xhistogram/core.py:261: in histogram
    bin_counts = _histogram_2d_vectorized(*all_args_reshaped, bins=bins,
../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/xhistogram/core.py:133: in _histogram_2d_vectorized
    bin_counts = _dispatch_bincount(bin_indices, weights, N, hist_shapes,
../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/xhistogram/core.py:81: in _dispatch_bincount
    return _bincount_2d(bin_indices, weights, N, hist_shapes)
../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/xhistogram/core.py:32: in _bincount_2d
    return bc_offset.reshape(final_shape)
../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/dask/array/core.py:1919: in reshape
    return reshape(self, shape, merge_chunks=merge_chunks)

        if reduce(mul, shape, 1) != x.size:
>           raise ValueError("total size of new array must be unchanged")
E           ValueError: total size of new array must be unchanged

../../../../miniconda3/envs/opendrift/lib/python3.9/site-packages/dask/array/reshape.py:206: ValueError

Weighted hist example

It is not completely clear what the weighted histogram does.

If I have data :
x = [0,0,1,1]
weights = [-2,-2, -3,-4]

Is the weighted histogram
the sum (xh = [ -4, -7]) or is it the mean (xh = [-2, -3.5])?

I think it is the sum, but best to be clear.

NaNs in histogram weights

Was trying to plot sigma distribution time series and had trouble understanding how xhistogram was treating NaNs in the input data. Turns out there are NaNs in the input weights and it seems like when there's one NaNs value in a bin, xhist makes the entire bin Nan.

This is also noted in the documentation tutorial. A fix to this was to run .fillna() on the weights. I think clear documentation on this feature will be helpful.

Maintenance work and maintainers needed

We need to do some grunt work to keep this package maintainable

Migrate to github workflows (travis-ci.org will soon shut down)
Add automatic linting to CI (black, pep8)
Set up readthedocs integration for PRs
Start adding mypy type hints

We welcome any and all volunteers who want to help with these unglamorous but important tasks.

More generally, "we" is actually still just me. There are no other maintainers of xhistogram. I would love to add more, especially since my response time can be very slow. Just say the word and I will grant you privs.

Error with multiple dask `args`

As of 0.1.3, xhistogram now fails when provided multiple dask args. This is because the dask version of ravel_multi_index that was introduced into xhistogram in 0.1.3 does not actually replicate the API of the equivalent numpy function. Specifically, the numpy implementation can receive a tuple of integer arrays, whereas the dask version cannot.

We actually don't have a test in our suite for multiple dask args, which is why this bug went uncaught until now.

I've opened an issue with dask, but we probably need to provide a more immediate fix for xhistogram. @rabernat, what do you think about reverting back to using your wrapper of np.ravel_multi_index until the issue is resolved with dask?

TypeError: Cannot cast array data from dtype('O') to dtype('int64') when applying histogram to image time series

I've tried to create a minimal reproducible example (similar to the tutorial, which I'm able to run), but working with a 3D DataArray. But I get an error:

from xhistogram.xarray import histogram 
import numpy as np
import xarray as xr
def tseries_histogram_labels(data_arr, rmin, rmax, nbins):
    bin_arr = np.linspace(rmin, rmax, nbins)
    time_len = data_arr.time.shape[0]
    bin_arrs = list(bin_arr * np.ones((time_len,1)))
    return histogram(*data_arr, bins=bin_arrs, dim=['x','y'])
ny, nx = 100, 100
nt = 44
data_arr = xr.DataArray(np.random.randn(nt,ny,nx), dims = ['time', 'y', 'x'], name='blue reflectance')
tseries_histogram_labels(data_arr, -4, 4, 50)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in 
      9 nt = 44
     10 data_arr = xr.DataArray(np.random.randn(nt,ny,nx), dims = ['time', 'y', 'x'], name='blue reflectance')
---> 11 tseries_histogram_labels(data_arr, -4, 4, 50)

 in tseries_histogram_labels(data_arr, rmin, rmax, nbins)
      5     time_len = data_arr.time.shape[0]
      6     bin_arrs = list(bin_arr * np.ones((time_len,1)))
----> 7     return histogram(*data_arr, bins=bin_arrs, dim=['x','y'])
      8 ny, nx = 100, 100
      9 nt = 44

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/xarray.py in histogram(bins, dim, weights, density, block_size, bin_dim_suffix, bin_edge_suffix, *args)
    127 
    128     h_data = _histogram(*args_data, weights=weights_data, bins=bins, axis=axis,
--> 129                         block_size=block_size)
    130 
    131     # create output dims

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/core.py in histogram(bins, axis, weights, density, block_size, *args)
    245     h = _histogram_2d_vectorized(*all_args_reshaped, bins=bins,
    246                                  weights=weights_reshaped,
--> 247                                  density=density, block_size=block_size)
    248 
    249     if h.shape[0] == 1:

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/core.py in _histogram_2d_vectorized(bins, weights, density, right, block_size, *args)
    124 
    125     bin_counts = _dispatch_bincount(bin_indices, weights, N, hist_shapes,
--> 126                                     block_size=block_size)
    127 
    128     # just throw out everything outside of the bins, as np.histogram does

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/core.py in _dispatch_bincount(bin_indices, weights, N, hist_shapes, block_size)
     79     if len(block_chunks) == 1:
     80         # single global chunk, don't need a loop over chunks
---> 81         return _bincount_2d(bin_indices, weights, N, hist_shapes)
     82     else:
     83         return _bincount_loop(bin_indices, weights, N, hist_shapes, block_chunks)

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/core.py in _bincount_2d(bin_indices, weights, N, hist_shapes)
     28     bin_indices_offset = (bin_indices + (N * np.arange(M)[:, None])).ravel()
     29     bc_offset = bincount(bin_indices_offset, weights=weights,
---> 30                             minlength=N*M)
     31     final_shape = (M,) + tuple(hist_shapes)
     32     return bc_offset.reshape(final_shape)

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/duck_array_ops.py in f(*args, **kwargs)
     23             else:
     24                 module = eager_module
---> 25             return getattr(module, name)(*args, **kwargs)
     26     else:
     27         def f(data, *args, **kwargs):

<__array_function__ internals> in bincount(*args, **kwargs)

TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'

I'm not sure what is causing this error.

error in keep_coords option in xarray histogram?

Thanks to the developers of this fabulous tool! I'm getting the following error when using histogram from xhistogram.xarray with keep_coords==True:

TypeError: can only concatenate tuple (not "list") to tuple

I'm sending in a single xarray as args and want to retain the coordinates from my weights xarray. The error occurs at line 122 of xarray.py--since args is not a list, the += fails. I'm still a python newbie and don't know what the fix should be (maybe args=[args] in the line above?), but thought I'd bring it to your attention. Thanks again.

"all arrays must have name"

Hi all,

I really appreciate this package, thanks for putting it together!

I am just playing around with it currently, and I keep getting an error message "all arrays must have name" when I play with my data inside the histogram function. Example screenshot below. Would it be an idea to have an option to plot the histogram without the name of the array, so that the code below works? Especially since I can manipulate the x-label after plotting. It would really help for some quick checks without having to redefine an array with a name every time I perform a manipulation on the data (which might be just for testing purposes).

bins = np.linspace(-800, 0, 16)
h_chi = histogram(ds_chi.Jq.isel(timeSeries=2).rolling(time=24*5).mean(dim='time'), bins=[bins], name='test')
h_chi.plot()

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-90-c6841c4a53fa> in <module>
      1 bins = np.linspace(-800, 0, 16)
----> 2 h_chi = histogram(ds_chi.Jq.isel(timeSeries=2).rolling(time=24*5).mean(dim='time'), bins=[bins])
      3 h_chi.plot()

~/miniconda3/envs/dcpy/lib/python3.8/site-packages/xhistogram/xarray.py in histogram(bins, dim, weights, density, block_size, bin_dim_suffix, bin_edge_suffix, *args)
     78     for a in args:
     79         # TODO: make this a more robust check
---> 80         assert a.name is not None, 'all arrays must have a name'
     81 
     82     # we drop coords to simplify alignment

AssertionError: all arrays must have a name

Another option of course is to not have the name dropped by xarray when I do a rolling mean.. But I have a feeling the change here might be easier?!

Refactor to use searchsorted instead of digitize

numpy uses searchsorted to compute the bin number each sample falls into (see here). xhistogram currently use digitize (see here).

Refactoring to use searchsorted would come with performance improvements and would simplify how we align our bin definitions with numpy, potentially also solving #25.

Profiling `_bincount` and attempted acceleration with numba

I tried profiling the core _bincount function, and then accelerating it with numba, a naive application of which only made it about 30% faster overall. Notebook here (and gist here in case the notebook cache hasn't updated yet).

However I've never used numba before, and not done much profiling in python, so maybe there is something else I should have tried.

I also profilied to see if the block_size argument has much effect.

Also I noticed that #49 introduced a regression where the block_size argument is no-longer passed down to _bincount, so doesn't actually do anything. Fixed in #62.

@rabernat @gjoseph92 @jrbourbeau

Histogram of weights/ n-D histograms

Is it possible to easily extend the functionality to include the histogram of the weights for a particular bin?
The current weights options gives a sum, but sometimes it would useful to look at the distribution of the points within a certain bin.

numpy / dask verison compatibility bug

There is a bug in xhistogram with numpy version >= 1.17 and dask version < 2.3

import xarray as xr
import numpy as np
from xhistogram.xarray import histogram

nt, nx = 100, 30
da = xr.DataArray(np.random.randn(nt, nx), dims=['time', 'x'],
                  name='foo').chunk({'time': 1})
bins = np.linspace(-4, 4, 20)
h = histogram(da, bins=[bins], dim=['x'])

This should be lazy. However, as reported by @stb2145 in pangeo-data/pangeo#690, certain numpy / dask combination produce the warning

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/array/core.py:1263: FutureWarning: The `numpy.moveaxis` function is not implemented by Dask array. You may want to use the da.map_blocks function or something similar to silence this warning. Your code may stop working in a future release.
  FutureWarning,

and evaluates eagerly.

In pangeo-data/pangeo#690 we found a workaround involving setting the environment variable NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=0. However, in the meantime, the root issues was fixed in dask (dask/dask#2559).

We can avoid this bug by requiring dask >= 2.3.0 as a dependency for xhistogram. I guess that's the easiest way to go.

DOI for citation

Could the developers add a Zenodo DOI so I can cite this package?
Thanks for the work you've put into this :)

TIME SENSITIVE: Zenodo DOI

I need a doi for xhistogram today to cite it in a paper submission. I linked the package to my own zenodo and would need to create a minor release to trigger the archival.

Is everyone ok with me releasing v0.3.1 just for this?

Intermittent? test CI failures

I mentioned this over in #73, but it might be worth raising a separate issue here. It seems that there is a flaky test in the CI suite.

See these three runs:

They should all contain the exact same code, but some of them pass and one fails.

I just restarted the same tests twice in #73 and the issue went away. It seems that this is only happening in the Tests / test (macos-latest, 3.7) (pull_request) run.

cc @TomNicholas

Support for density=True keyword

Numpy histogram has the density keyword, which normalized the outputs by the bin spacing.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html

We can easily support this in xhistogram, but it has to .be plugged in. Just opening this issue to make a note of it.

Release v0.3.2?

We've had a couple of bug fixes recently (#12, #78). Shall we do a release?

Trouble creating histogram labels for each 2D array in time series

Hi @rabernat thanks for the suggestion on this stack overflow post: https://stackoverflow.com/questions/57419541/how-to-calculate-histogram-bins-for-each-image-in-an-xarray-dataarray-time-serie. I'm posting here since Stack Overflow's strict character limit cut out some of the code blocks below.

I've tried out the example with what I think are the correct arguments like so

import numpy as np
import xarray as xr
def tseries_histogram_labels(data_arr, rmin, rmax, nbins):
    bin_arr = np.linspace(rmin, rmax, nbins)
    time_len = data_arr.time.shape[0]
    bin_arrs = list(bin_arr * np.ones((time_len,1)))
    return histogram(*data_arr, bins=bin_arrs, dim=['x','y'])
t_series['reflectance'].name = 'reflectance'
tseries_histogram_labels(t_series['reflectance'].sel(band=1), -4, 4, 50)

where t_series.sel(blue=1) is a DataArray with time, x, and y dimensions. Calling *data_arr in the return statement unpacks the DataArray along the time dimension into separate DataArrays, if this is not done there is a length mismatch between the bin array list and the DataArray.

The list of bin edge arrays and the list of Data Arrays match, and I want to take the histogram of each Data Array. I've tried with and without specifying dim=['x', 'y'], but I get the same error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in 
      7     return histogram(*data_arr, bins=bin_arrs, dim=['x','y'])
      8 t_series['reflectance'].name = 'reflectance'
----> 9 tseries_histogram_labels(t_series['reflectance'].sel(band=1), -4, 4, 50)

 in tseries_histogram_labels(data_arr, rmin, rmax, nbins)
      5     time_len = data_arr.time.shape[0]
      6     bin_arrs = list(bin_arr * np.ones((time_len,1)))
----> 7     return histogram(*data_arr, bins=bin_arrs, dim=['x','y'])
      8 t_series['reflectance'].name = 'reflectance'
      9 tseries_histogram_labels(t_series['reflectance'].sel(band=1), -4, 4, 50)

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/xarray.py in histogram(bins, dim, weights, density, block_size, bin_dim_suffix, bin_edge_suffix, *args)
    127 
    128     h_data = _histogram(*args_data, weights=weights_data, bins=bins, axis=axis,
--> 129                         block_size=block_size)
    130 
    131     # create output dims

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/core.py in histogram(bins, axis, weights, density, block_size, *args)
    245     h = _histogram_2d_vectorized(*all_args_reshaped, bins=bins,
    246                                  weights=weights_reshaped,
--> 247                                  density=density, block_size=block_size)
    248 
    249     if h.shape[0] == 1:

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/core.py in _histogram_2d_vectorized(bins, weights, density, right, block_size, *args)
    117     # product of the bins gives the joint distribution
    118     if N_inputs > 1:
--> 119         bin_indices = ravel_multi_index(each_bin_indices, hist_shapes)
    120     else:
    121         bin_indices = each_bin_indices[0]

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xhistogram/duck_array_ops.py in ravel_multi_index(multi_index, dims, order)
     54                                  for n in range(1, len(dims))]
     55     full_index = [of * ix for of, ix in zip(offset_factors, multi_index)]
---> 56     return sum(full_index)
     57 
     58 

TypeError: unsupported operand type(s) for +: 'int' and 'Array'

I'm not sure if this is a bug or if I'm doing something wrong. I tried to create a minimal example that doesn't require a file, but ran into a different issue, #9

Happy to make this an example of how to use xhistogram with dask if we can sort this out and any feedback is appreciated!

CF style 'bounds' for histogram output?

I would find it very useful to preserve the input bins as CF bound coordinates in the output. I can try to put in a PR some time soon if folks think this would be useful.

Add type annotations

We are now type checking xhistogram using mypy (#38). However, none of the code is static typed at this point.

It would be good to type annotate the existing methods.

time-varying bins

Thanks to you guys for this handy package.

I had to use time-varying bins when I tried to calculate the area enclosed by tracer contours, as the tracer changes with time. However, the bins kwarg neither support xarray nor support a mulidimendional numpy array for hist along contour space.

I also need to convert PDF of xhist to CDF for area. It seems that xhist does not directly support this. A way to do this is to use numpy.cumsum. Is it possible to integrate this functionality into xhist?

Carry coordinates

I noticed that all coordinates are lost when passing through xhistogram.

Do you think pydata/xarray#3453, would make it feasible to carry those (see xgcm/xgcm#150 for discussion), or is there a specific reason why the coordinates are dropped by xhistogram?

Align definition of `bins` with numpy.histogram

When bin edges are specified in numpy.histogram, the last bin is closed on both sides. From the numpy docs (https://numpy.org/doc/stable/reference/generated/numpy.histogram.html):

All but the last (righthand-most) bin is half-open. In other words, if bins is:

[1, 2, 3, 4]
then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4.

This is not the case for xhistogram, e.g.:

import numpy as np
import xarray as xr
from xhistogram.xarray import histogram as xhist

data = np.ones(10)
da =  xr.DataArray(data, coords=[range(10)], dims=['x']).rename('test')

bins = np.array([0, 0.5, 1])

print(f'xhistogram: {xhist(da, bins=[bins]).values}')
print(f'numpy histogram: {np.histogram(data, bins=bins)[0]}')

xhistogram: [0 0]
numpy histogram: [ 0 10]

Could we make it the case?

divide by zero error

from xhistogram.core import histogram

data = np.zeros(36048471)
bins = np.array([1, 2, 3])
histogram(data, bins=[bins])

gives

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-84-530a0f6cf98c> in <module>
      3 data = np.zeros(36048471)
      4 bins = np.array([1, 2, 3])
----> 5 histogram(data, bins=[bins])

/srv/conda/envs/notebook/lib/python3.7/site-packages/xhistogram/core.py in histogram(bins, axis, weights, density, block_size, *args)
    245     h = _histogram_2d_vectorized(*all_args_reshaped, bins=bins,
    246                                  weights=weights_reshaped,
--> 247                                  density=density, block_size=block_size)
    248 
    249     if h.shape[0] == 1:

/srv/conda/envs/notebook/lib/python3.7/site-packages/xhistogram/core.py in _histogram_2d_vectorized(bins, weights, density, right, block_size, *args)
    124 
    125     bin_counts = _dispatch_bincount(bin_indices, weights, N, hist_shapes,
--> 126                                     block_size=block_size)
    127 
    128     # just throw out everything outside of the bins, as np.histogram does

/srv/conda/envs/notebook/lib/python3.7/site-packages/xhistogram/core.py in _dispatch_bincount(bin_indices, weights, N, hist_shapes, block_size)
     76     # block_chunks is like a dask chunk, a tuple that divides up the first
     77     # axis of bin_indices
---> 78     block_chunks = _determine_block_chunks(bin_indices, block_size)
     79     if len(block_chunks) == 1:
     80         # single global chunk, don't need a loop over chunks

/srv/conda/envs/notebook/lib/python3.7/site-packages/xhistogram/core.py in _determine_block_chunks(bin_indices, block_size)
     64             block_size = min(_MAX_CHUNK_SIZE // N, M)
     65     assert isinstance(block_size, int)
---> 66     num_chunks = M // block_size
     67     block_chunks = num_chunks * (block_size,)
     68     residual = M % block_size

ZeroDivisionError: integer division or modulo by zero

Smaller sized arrays work fine.

overlapping bins

xhistogram is great tool for one of my project that bins tropical cyclone (TC) records into gridded data as the frequency of occurrence. However, it is done using a non-overlapping bins so that the result may appear noisy when TCs are less frequent. I just wonder if there is an option of overlapping bins could be used, so the result of PDF could be smoothed somehow. This could obtain a relatively high-resolution result while keep more data in each bin to ensure statistical significance.