
Comments (8)

kkraus14 commented on August 26, 2024

I think asking for a column / DataFrame in a single chunk is something reasonable (whether or not it's part of the standard or interchange protocol). If we had the ability to get the column as a single chunk then the utility function becomes really straightforward or just becomes something like a np.asarray call.

from dataframe-api.

jorisvandenbossche commented on August 26, 2024

If we had the ability to get the column as a single chunk

We already have this ability with the currently documented API, I think (the methods get_column()/get_column_by_name() and num_chunks / get_chunks() on the column should be sufficient to get a single-chunk column).
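That chunk-handling step could be sketched as follows (a sketch against the documented Column API; `num_chunks()` and `get_chunks()` are from the protocol spec, while the helper name itself is made up):

```python
def iter_single_chunk_columns(col):
    """Yield single-chunk Column objects from a possibly chunked column.

    Per the interchange protocol, Column.get_chunks() returns an
    iterable of Column objects that each hold exactly one chunk.
    """
    if col.num_chunks() == 1:
        yield col
    else:
        yield from col.get_chunks()
```

A consumer can then convert each single-chunk column independently.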

the utility function becomes really straightforward or just becomes something like a np.asarray call.

A np.asarray call only works if we add something like __array__ or __array_interface__ to Column, which we currently don't specify (cfr #48).

In case you meant calling it on the individual Buffer, that in itself will become trivial once numpy supports dlpack, yes.
You still need to handle the different buffers and dtypes etc. A quick attempt at a version with only limited functionality:

def column_to_numpy_array(col):
    assert col.num_chunks() == 1  # for now only deal with single chunks
    kind, _, format_str, _ = col.dtype
    if kind not in (0, 1, 2, 22):
        raise TypeError("only numeric and datetime dtypes are supported")
    if col.describe_null[0] not in (0, 1):
        raise NotImplementedError("Null values represented as masks or "
                                  "sentinel values not handled yet")
    buffer, dtype = col.get_buffers()["data"]
    arr = buffer_to_ndarray(buffer, dtype)  # this can become `np.asarray` or `np.from_dlpack` in the future
    if kind == 22:  # datetime
        unit = format_str.split(":")[-1]
        arr = arr.view(f"datetime64[{unit}]")
    return arr

where buffer_to_ndarray is currently a hand-written helper with a signature like

def buffer_to_ndarray(_buffer, _dtype) -> np.ndarray:
    ...

but in the future it can become a single numpy call once numpy supports DLPack.
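For reference, such a hand-written helper might look like this (a sketch, assuming a protocol Buffer exposing `ptr` and `bufsize` as in the spec, and covering only int/uint/float dtypes):

```python
import ctypes
import numpy as np

def buffer_to_ndarray(_buffer, _dtype) -> np.ndarray:
    kind, bit_width, _, _ = _dtype
    # Map the protocol's dtype kind to a numpy type character:
    # 0 = int, 1 = uint, 2 = float.
    np_kinds = {0: "i", 1: "u", 2: "f"}
    if kind not in np_kinds:
        raise NotImplementedError(f"dtype kind {kind} not handled")
    column_dtype = np.dtype(f"{np_kinds[kind]}{bit_width // 8}")
    # Wrap the raw pointer zero-copy via ctypes.
    ctypes_type = np.ctypeslib.as_ctypes_type(column_dtype)
    data_pointer = ctypes.cast(_buffer.ptr, ctypes.POINTER(ctypes_type))
    n_elements = _buffer.bufsize // (bit_width // 8)
    return np.ctypeslib.as_array(data_pointer, shape=(n_elements,))
```

Note that this is exactly the kind of low-level detail the single future numpy call would hide.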

That's certainly relatively straightforward code, but it also deals with a lot of protocol details, and IMO it's not something many end users should have to implement themselves.


kkraus14 commented on August 26, 2024

We already have this ability with the currently documented API, I think (the methods get_column()/get_column_by_name() and num_chunks / get_chunks() on the column should be sufficient to get a single-chunk column).

I meant more along the lines of given a column with multiple chunks, requesting the column to combine its chunks into a single chunk so that it has a contiguous buffer under the hood.
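Until the protocol offers such an operation, a consumer could emulate it by converting chunk by chunk and concatenating on the numpy side (a sketch; `chunk_to_numpy` stands in for any per-chunk converter, such as the column_to_numpy_array sketch above):

```python
import numpy as np

def column_to_contiguous_array(col, chunk_to_numpy):
    """Combine all chunks of a column into one contiguous ndarray.

    `chunk_to_numpy` is a caller-supplied converter from a
    single-chunk column to an ndarray (a hypothetical argument,
    not part of the protocol).
    """
    chunks = [chunk_to_numpy(chunk) for chunk in col.get_chunks()]
    return np.ascontiguousarray(np.concatenate(chunks))
```

This necessarily copies, which is why doing the combining inside the producing library, which knows its own memory layout, would be preferable.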


rgommers commented on August 26, 2024

This is a nice example, thanks @jorisvandenbossche. I feel like df.get_column_by_name("x").get_buffers() is taking a wrong turn though - an end user library should indeed not need to deal with buffers directly.

xref the plot() docs, see under "Plotting labelled data".

I think this would work:

df_obj = obj.__dataframe__().get_columns_by_name([x, y])
df = pd.from_dataframe(df_obj)
xvals = df[x].values
yvals = df[y].values

Currently they require obj["x"] to give the desired data

That's not the actual requirement today I think - it's that np.asarray(obj[x]) returns the data as a numpy array. Which is a fairly specific requirement - but even so it can be made to work just fine, on the condition that if the user uses the data=obj syntax, they have Pandas installed. That is not an unreasonable optional dependency to require, because other dataframe libraries may not be able to provide the guarantee that that np.asarray call succeeds.


rgommers commented on August 26, 2024

I am not a matplotlib developer, I also don't know if they for example have efforts to add support for generic array-likes (but it's nonetheless a typical example use case, I think)

I'm not sure about Matplotlib, but I do know that Napari would like this and has tried to improve compatibility with PyTorch and other libraries.


jorisvandenbossche commented on August 26, 2024

on the condition that if the user uses the data=obj syntax, they have Pandas installed. That is not an unreasonable optional dependency to require, because other dataframe libraries may not be able to provide the guarantee that that np.asarray call succeeds.

IMO that's the big downside of your code snippet. As a pandas maintainer I of course don't mind that people need pandas :), but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?
Also, for the "guarantee that np.asarray call succeeds", that's basically something you can do based on the buffers in the interchange protocol (#66 (comment)), if the original dataframe library doesn't support it directly. But then we get back to the point that ideally library users of the protocol shouldn't get down to the buffer level.


rgommers commented on August 26, 2024

but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?

Well, I see it as: the protocol supports turning one kind of dataframe into another kind, so as a downstream library if you support one specific library, you get all the other ones for free.

Really what Matplotlib wants here is: turn a single column into a numpy.ndarray. But if we support that, it should either be generic (like a potentially non-zero-copy way to use DLPack and/or the buffer protocol on a column), or we should support other array libraries too. Otherwise it's pretty ad-hoc imho.

but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?

Second thought: that is step two in the Consortium efforts - you need the generic public API, not just the interchange protocol. That's also what's said at https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html#progression-timeline.


rgommers commented on August 26, 2024

We discussed this in a call, and the sentiment was that it would be very nice to have this Matplotlib use case work, and not have it wait for another API that is still to be designed.

For a column one can get from the dataframe interchange protocol, it would be very useful if that could be turned into an array (any kind of array which the consuming library - Matplotlib in this case - wants). Options to achieve that include:

  • inside the protocol, to get an array object from a column (but we decided against that previously, when for example considering whether __dlpack__ should live on the column or the buffer level, and for __array_interface__ et al.)
  • inside each array library: each could have a from_column function to create its own kind of array
  • in each consumer library (so Matplotlib would implement a Column -> numpy.ndarray path)
  • in a separate utility library that is designed to be vendored or depended upon by consumer libraries

The separate utility library likely makes the most sense. Benefits are: this code then only has to be written once, it keeps things outside of the protocol/standard, and it can be made available fairly quickly (no need to wait for multiple array libraries to implement something and then do a release).

To make the code independent of any array or dataframe library, it may have to look something like:

def array_from_column(
    df: DataFrame,
    column_name: str,
    xp: Any,  # object/namespace implementing the array API
) -> "array":
    """
    Produces an array from a column, if possible.

    Will raise a ValueError in case the column contains missing data or
    has a dtype that is not supported by the array API standard.
    """

It's likely also practical to have a separate column_to_numpy function, given that Matplotlib (a) wants a numpy.ndarray rather than the numpy.array_api array object, and (b) needs things to work with numpy releases that are two years old. If this lives in a separate utility library and is in no way directly incorporated in the standard, the objections to incorporating numpy-specific things should not apply here.

