Comments (8)
I think asking for a column / DataFrame in a single chunk is something reasonable (whether or not it's part of the standard or interchange protocol). If we had the ability to get the column as a single chunk, then the utility function becomes really straightforward, or just becomes something like a `np.asarray` call.
from dataframe-api.
> If we had the ability to get the column as a single chunk
We already have this ability with the currently documented API, I think (the methods `get_column()` / `get_column_by_name()` and `num_chunks` / `get_chunks()` on the column should be sufficient to get a single-chunk column).
> the utility function becomes really straightforward or just becomes something like a `np.asarray` call.
A `np.asarray` call only works if we add something like `__array__` or `__array_interface__` to `Column`, which we currently don't specify (cf. #48). In case you meant calling it on the individual `Buffer`, that in itself will become trivial once NumPy supports DLPack, yes. You still need to handle the different buffers, dtypes, etc. A quick attempt at a version with only limited functionality:
```python
def column_to_numpy_array(col):
    # for now only deal with single-chunk columns
    assert col.num_chunks() == 1
    kind, _, format_str, _ = col.dtype
    if kind not in (0, 1, 2, 22):  # INT, UINT, FLOAT, DATETIME
        raise TypeError("only numeric and datetime dtypes are supported")
    if col.describe_null[0] not in (0, 1):
        raise NotImplementedError("Null values represented as masks or "
                                  "sentinel values not handled yet")
    buffer, dtype = col.get_buffers()["data"]
    # this can become `np.asarray` or `np.from_dlpack` in the future
    arr = buffer_to_ndarray(buffer, dtype)
    if kind == 22:  # datetime
        unit = format_str.split(":")[-1]
        arr = arr.view(f"datetime64[{unit}]")
    return arr
```
where `buffer_to_ndarray` is currently a small helper that wraps the raw buffer in an ndarray.
That's certainly relatively straightforward code, but it also deals with a lot of the protocol's details, and IMO it's not something many end users should have to implement themselves.
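Such a helper could look roughly like the following sketch. This is not the actual implementation referenced in the thread; it assumes the protocol `Buffer` exposes `ptr` and `bufsize`, and that the dtype tuple has the shape `(kind, bit_width, format_str, byte_order)`:

```python
import ctypes

import numpy as np


def buffer_to_ndarray(buffer, dtype):
    # Wrap the raw memory described by an interchange-protocol Buffer
    # (exposing `ptr` and `bufsize`) in a NumPy array, zero-copy.
    kind, bit_width, _, _ = dtype
    # DtypeKind values: 0 = INT, 1 = UINT, 2 = FLOAT; DATETIME (22) is
    # stored as integer ticks and viewed as datetime64 by the caller.
    np_kind = {0: "i", 1: "u", 2: "f", 22: "i"}[kind]
    np_dtype = np.dtype(f"{np_kind}{bit_width // 8}")
    # Build a ctypes view over the existing memory, then reinterpret it.
    raw = (ctypes.c_char * buffer.bufsize).from_address(buffer.ptr)
    return np.frombuffer(raw, dtype=np_dtype)
```

Note that the caller must keep the producing object alive for as long as the returned array is used, since no reference to the owner is retained here.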
from dataframe-api.
> We already have this ability with the currently documented API, I think (the methods `get_column()` / `get_column_by_name()` and `num_chunks` / `get_chunks()` on the column should be sufficient to get a single-chunk column).
I meant more along the lines of: given a column with multiple chunks, requesting that the column combine its chunks into a single chunk, so that it has one contiguous buffer under the hood.
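Until the protocol offers such a "rechunk" request, a consumer can emulate it by converting chunk by chunk and concatenating. A minimal sketch, where `chunk_to_array` stands for any single-chunk converter (such as the `column_to_numpy_array` snippet earlier in the thread):

```python
import numpy as np


def combine_chunks(col, chunk_to_array):
    # Consumer-side fallback: convert each chunk of an interchange-protocol
    # column separately and concatenate the results. This copies, which is
    # exactly what a producer-side "combine into one chunk" could sometimes
    # avoid by returning an already-contiguous buffer.
    arrays = [chunk_to_array(chunk) for chunk in col.get_chunks()]
    return arrays[0] if len(arrays) == 1 else np.concatenate(arrays)
```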
from dataframe-api.
This is a nice example, thanks @jorisvandenbossche. I feel like `df.get_column_by_name("x").get_buffers()` is taking a wrong turn though - an end user library should indeed not need to deal with buffers directly. xref the `plot()` docs, see under "Plotting labelled data".
I think this would work:
```python
df_obj = obj.__dataframe__().get_columns_by_name([x, y])
df = pd.from_dataframe(df_obj)
xvals = df[x].values
yvals = df[y].values
```
> Currently they require `obj["x"]` to give the desired data
That's not the actual requirement today, I think - it's that `np.asarray(obj[x])` returns the data as a numpy array. Which is a fairly specific requirement - but even so, it can be made to work just fine, on the condition that if the user uses the `data=obj` syntax, they have pandas installed. That is not an unreasonable optional dependency to require, because other dataframe libraries may not be able to provide the guarantee that that `np.asarray` call succeeds.
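The requirement being discussed boils down to a two-step lookup. A sketch of that shape (`resolve_data_arg` is a hypothetical name for illustration, not matplotlib's actual code):

```python
import numpy as np


def resolve_data_arg(data, key):
    # Shape of the `data=` requirement described above: index the
    # container, then coerce the result with np.asarray. This works for
    # dicts, pandas objects, and anything else where np.asarray(data[key])
    # yields the expected values; other dataframe libraries may not
    # guarantee that coercion succeeds.
    return np.asarray(data[key])
```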
from dataframe-api.
> I am not a matplotlib developer, and I also don't know if they, for example, have efforts to add support for generic array-likes (but it's nonetheless a typical example use case, I think)
I'm not sure about Matplotlib, but I do know that Napari would like this and has tried to improve compatibility with PyTorch and other libraries.
from dataframe-api.
> on the condition that if the user uses the `data=obj` syntax, they have pandas installed. That is not an unreasonable optional dependency to require, because other dataframe libraries may not be able to provide the guarantee that that `np.asarray` call succeeds.
IMO that's the big downside of your code snippet. As a pandas maintainer I of course don't mind that people need pandas :), but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?
Also, for the "guarantee that the `np.asarray` call succeeds": that's basically something you can do based on the buffers in the interchange protocol (#66 (comment)), if the original dataframe library doesn't support it directly. But then we get back to the point that ideally library users of the protocol shouldn't have to get down to the buffer level.
from dataframe-api.
> but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?
Well, I see it as: the protocol supports turning one kind of dataframe into another kind, so as a downstream library, if you support one specific library you get all the other ones for free.
Really what Matplotlib wants here is: turn a single column into a `numpy.ndarray`. But if we support that, it should either be generic (like a potentially non-zero-copy way to use DLPack and/or the buffer protocol on a column), or we should support other array libraries too. Otherwise it's pretty ad hoc imho.
> but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?
Second thought: that is step two in the Consortium efforts - you need the generic public API, not just the interchange protocol. That's also what's said at https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html#progression-timeline.
from dataframe-api.
We discussed this in a call, and the sentiment was that it would be very nice to have this Matplotlib use case work, and not have it wait for another API that is still to be designed.
For a column one can get from the dataframe interchange protocol, it would be very useful if that could be turned into an array (any kind of array which the consuming library - Matplotlib in this case - wants). Options to achieve that include:
- inside the protocol: a way to get an array object from a column (but we decided against that previously, for example when considering whether `__dlpack__` should live on the column or the buffer level, and for `__array_interface__` et al.)
- inside each array library: it could have a `from_column` function there to create its own kind of array
- in each consumer library (so Matplotlib would implement a `Column` -> `numpy.ndarray` path)
- in a separate utility library that is designed to be vendored or depended upon by consumer libraries
The separate utility library likely makes the most sense. Benefits are: this code then only has to be written once, it keeps things outside of the protocol/standard, and it can be made available fairly quickly (no need to wait for multiple array libraries to implement something and then do a release).
To make the code independent of any array or dataframe library, it may have to look something like:
```python
def array_from_column(
    df: DataFrame,
    column_name: str,
    xp: Any,  # object/namespace implementing the array API
):  # returns an array of the kind produced by `xp`
    """
    Produce an array from a column, if possible.

    Raises a ValueError in case the column contains missing data or has a
    dtype that is not supported by the array API standard.
    """
```
It's likely also practical to have a separate `column_to_numpy` function, given that Matplotlib (a) wants a `numpy.ndarray` rather than the `numpy.array_api` array object, and (b) needs things to work with 2-year-old numpy releases. If this is in a separate utility library and in no way directly incorporated in the standard, the objections to incorporating numpy-specific things should not apply here.
from dataframe-api.