
Comments (8)

kkraus14 commented on August 26, 2024

I think asking for a column / DataFrame in a single chunk is something reasonable (whether or not it's part of the standard or interchange protocol). If we had the ability to get the column as a single chunk then the utility function becomes really straightforward or just becomes something like a np.asarray call.

from dataframe-api.

jorisvandenbossche commented on August 26, 2024

If we had the ability to get the column as a single chunk

We already have this ability with the currently documented API, I think (the methods get_column()/get_column_by_name() and num_chunks / get_chunks() on the column should be sufficient to get a single-chunk column).
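That chunk-handling step could be sketched as follows (a sketch against the documented Column API; `num_chunks()` and `get_chunks()` are from the protocol spec, while the helper name itself is made up):

```python
def iter_single_chunk_columns(col):
    """Yield single-chunk Column objects from a possibly chunked column.

    Per the interchange protocol, Column.get_chunks() returns an
    iterable of Column objects that each hold exactly one chunk.
    """
    if col.num_chunks() == 1:
        yield col
    else:
        yield from col.get_chunks()
```

A consumer can then convert each single-chunk column independently.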

the utility function becomes really straightforward or just becomes something like a np.asarray call.

A np.asarray call only works if we add something like __array__ or __array_interface__ to Column, which we currently don't specify (cfr #48).

In case you meant calling it on the individual Buffer, that in itself will become trivial once numpy supports dlpack, yes.
You still need to handle the different buffers and dtypes etc. A quick attempt at a version with only limited functionality:

def column_to_numpy_array(col):
    assert col.num_chunks() == 1  # for now only deal with single chunks
    kind, _, format_str, _ = col.dtype
    if kind not in (0, 1, 2, 22):
        raise TypeError("only numeric and datetime dtypes are supported")
    if col.describe_null[0] not in (0, 1):
        raise NotImplementedError("Null values represented as masks or "
                                  "sentinel values not handled yet")
    buffer, dtype = col.get_buffers()["data"]
    arr = buffer_to_ndarray(buffer, dtype)  # this can become `np.asarray` or `np.from_dlpack` in the future
    if kind == 22:  # datetime
        unit = format_str.split(":")[-1]
        arr = arr.view(f"datetime64[{unit}]")
    return arr

where buffer_to_ndarray is currently a hand-written helper with a signature like

def buffer_to_ndarray(_buffer, _dtype) -> np.ndarray:
    ...

but in the future it can become a single numpy call once numpy supports DLPack.
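For reference, such a hand-written helper might look like this (a sketch, assuming a protocol Buffer exposing `ptr` and `bufsize` as in the spec, and covering only int/uint/float dtypes):

```python
import ctypes
import numpy as np

def buffer_to_ndarray(_buffer, _dtype) -> np.ndarray:
    kind, bit_width, _, _ = _dtype
    # Map the protocol's dtype kind to a numpy type character:
    # 0 = int, 1 = uint, 2 = float.
    np_kinds = {0: "i", 1: "u", 2: "f"}
    if kind not in np_kinds:
        raise NotImplementedError(f"dtype kind {kind} not handled")
    column_dtype = np.dtype(f"{np_kinds[kind]}{bit_width // 8}")
    # Wrap the raw pointer zero-copy via ctypes.
    ctypes_type = np.ctypeslib.as_ctypes_type(column_dtype)
    data_pointer = ctypes.cast(_buffer.ptr, ctypes.POINTER(ctypes_type))
    n_elements = _buffer.bufsize // (bit_width // 8)
    return np.ctypeslib.as_array(data_pointer, shape=(n_elements,))
```

Note that this is exactly the kind of low-level detail the single future numpy call would hide.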

That's certainly relatively straightforward code, but it also deals with a lot of protocol details, and IMO it's not something many end users should have to implement themselves.


kkraus14 commented on August 26, 2024

We already have this ability with the currently documented API, I think (the methods get_column()/get_column_by_name() and num_chunks / get_chunks() on the column should be sufficient to get a single-chunk column).

I meant more along the lines of given a column with multiple chunks, requesting the column to combine its chunks into a single chunk so that it has a contiguous buffer under the hood.
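Until the protocol offers such an operation, a consumer could emulate it by converting chunk by chunk and concatenating on the numpy side (a sketch; `chunk_to_numpy` stands in for any per-chunk converter, such as the column_to_numpy_array sketch above):

```python
import numpy as np

def column_to_contiguous_array(col, chunk_to_numpy):
    """Combine all chunks of a column into one contiguous ndarray.

    `chunk_to_numpy` is a caller-supplied converter from a
    single-chunk column to an ndarray (a hypothetical argument,
    not part of the protocol).
    """
    chunks = [chunk_to_numpy(chunk) for chunk in col.get_chunks()]
    return np.ascontiguousarray(np.concatenate(chunks))
```

This necessarily copies, which is why doing the combining inside the producing library, which knows its own memory layout, would be preferable.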


rgommers commented on August 26, 2024

This is a nice example, thanks @jorisvandenbossche. I feel like df.get_column_by_name("x").get_buffers() is taking a wrong turn though - an end user library should indeed not need to deal with buffers directly.

xref the plot() docs, see under "Plotting labelled data".

I think this would work:

df_obj = obj.__dataframe__().get_columns_by_name([x, y])
df = pd.from_dataframe(df_obj)
xvals = df[x].values
yvals = df[y].values

Currently they require obj["x"] to give the desired data

That's not the actual requirement today I think - it's that np.asarray(obj[x]) returns the data as a numpy array. Which is a fairly specific requirement - but even so it can be made to work just fine, on the condition that if the user uses the data=obj syntax, they have Pandas installed. That is not an unreasonable optional dependency to require, because other dataframe libraries may not be able to provide the guarantee that that np.asarray call succeeds.


rgommers commented on August 26, 2024

I am not a matplotlib developer, I also don't know if they for example have efforts to add support for generic array-likes (but it's nonetheless a typical example use case, I think)

I'm not sure about Matplotlib, but I do know that Napari would like this and has tried to improve compatibility with PyTorch and other libraries.


jorisvandenbossche commented on August 26, 2024

on the condition that if the user uses the data=obj syntax, they have Pandas installed. That is not an unreasonable optional dependency to require, because other dataframe libraries may not be able to provide the guarantee that that np.asarray call succeeds.

IMO that's the big downside of your code snippet. As a pandas maintainer I of course don't mind that people need pandas :), but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?
Also, for the "guarantee that np.asarray call succeeds", that's basically something you can do based on the buffers in the interchange protocol (#66 (comment)), if the original dataframe library doesn't support it directly. But then we get back to the point that ideally library users of the protocol shouldn't get down to the buffer level.


rgommers commented on August 26, 2024

but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?

Well, I see it as: the protocol supports turning one kind of dataframe into another kind, so as a downstream library if you support one specific library, you get all the other ones for free.

Really what Matplotlib wants here is: turn a single column into a numpy.ndarray. But if we support that, it should either be generic (like a potentially non-zero-copy way to use DLPack and/or the buffer protocol on a column), or we should support other array libraries too. Otherwise it's pretty ad-hoc imho.

but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?

Second thought: that is step two in the Consortium efforts - you need the generic public API, not just the interchange protocol. That's also what's said at https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html#progression-timeline.


rgommers commented on August 26, 2024

We discussed this in a call, and the sentiment was that it would be very nice to have this Matplotlib use case work, and not have it wait for another API that is still to be designed.

For a column one can get from the dataframe interchange protocol, it would be very useful if that could be turned into an array (any kind of array which the consuming library - Matplotlib in this case - wants). Options to achieve that include:

  • inside the protocol, to get an array object from a column (but we decided against that previously, when for example considering whether __dlpack__ should live on the column or the buffer level, and for __array_interface__ et al.)
  • inside each array library: each could have a from_column function to create its own kind of array
  • in each consumer library (so Matplotlib would implement a Column -> numpy.ndarray path)
  • in a separate utility library that is designed to be vendored or depended upon by consumer libraries

The separate utility library likely makes the most sense. Benefits are: this code then only has to be written once, it keeps things outside of the protocol/standard, and it can be made available fairly quickly (no need to wait for multiple array libraries to implement something and then do a release).

To make the code independent of any array or dataframe library, it may have to look something like:

def array_from_column(
    df: DataFrame,
    column_name: str,
    xp: Any,  # object/namespace implementing the array API
) -> "array":
    """
    Produces an array from a column, if possible.

    Will raise a ValueError in case the column contains missing data or
    has a dtype that is not supported by the array API standard.
    """

It's likely also practical to have a separate column_to_numpy function, given that Matplotlib (a) wants a numpy.ndarray rather than the numpy.array_api array object, and (b) needs things to work with numpy releases that are two years old. If this lives in a separate utility library and is in no way directly incorporated in the standard, the objections to incorporating numpy-specific things should not apply here.

