data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
Home Page: https://data-apis.org/dataframe-api/draft/index.html
License: MIT License
One of the main uses of the interchange protocol is to consume a `DataFrameObject` coming from another dataframe library:

myDFLibrary.from_dataframe(anotherDFLibrary.__dataframe__())

Currently, we can only write tests with a `DataFrameObject` from the same library we are implementing the protocol in:

myDFLibrary.from_dataframe(myDFLibrary.__dataframe__())

In the case of cudf (a GPU dataframe), we can't write tests for the scenario where the `DataFrameObject` is on the CPU, as it is for pandas. This is a very important use case for the protocol specification to test, as it makes sure the device transfer is handled properly.

I don't know how to go about it. The objective of this issue is to think collectively about a way to mock up these kinds of use cases.

An example where mocking is relatively simple is the chunks feature, which cudf does not support the way pandas does. To mock up chunks, we can create many `DataFrameObject`s from chunks of a few rows of a given `DataFrameObject`.
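For illustration, a mock along those lines could look like this sketch (`num_rows` and `select_rows` are hypothetical helpers, not part of the protocol):

```python
def mock_chunked(df, chunk_size=2):
    """Split one DataFrameObject into many small ones, emulating a
    producer that exposes its data as multiple chunks."""
    n = df.num_rows()
    return [
        df.select_rows(range(start, min(start + chunk_size, n)))
        for start in range(0, n, chunk_size)
    ]
```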
This is not even half-baked, but I wanted to gauge interest/feasibility for the spec to encapsulate n-dimensional "columns" of data, equivalent to xarray's DataArrays. In that case, the currently-envisioned columns would be the 1D specific case of a higher-D general case. We've found that in some use cases we need these in napari (napari/napari#2592, napari/napari#2917), and it would be awesome to conform to the dataframe API and be compatible with both xarray and pandas.
Of course the other way around this is to ignore the higher-D libraries, and have them conform to the API once it's settled. That might be more reasonable, in which case, I'm perfectly happy for this to be closed.
A validity mask is a missing-value representation that depends on the Column in the protocol; `describe_null()` is meant to describe missing values at the column level for a given dtype.

In the case where a column has no missing values, shall we still provide a validity array with 1 (valid) at all entries, or shall we raise an exception?

From my perspective, the latter is better, because we can just check `null_count == 0` without allocating and filling a whole array with the same value. That is how it works in the cudf dataframe: if there are no missing values, accessing the attribute `nullmask` (which holds the validity array) raises an exception with the message "Column has no null mask".
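On the consumer side, that design leads to code like this sketch (hypothetical `Column` attributes, following the cudf behavior described above):

```python
def get_validity(column):
    """Return the validity array, or None when nothing is missing."""
    if column.null_count == 0:
        # No missing values: avoid allocating an all-ones mask.
        return None
    return column.nullmask  # would raise if the column had no null mask
```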
This was just asked about at https://twitter.com/__AlexMonahan__/status/1430522318854377475. I'd say we should have a similar argument as https://data-apis.org/array-api/latest/design_topics/copies_views_and_mutation.html. We cannot prevent mutations in the protocol itself, and existing libraries already may have APIs that do mutate in-place. So we should recommend against it (but of course power users that really understand what they are doing could go right ahead and mutate to their hearts' content - this is the same as for the buffer protocol, DLPack et al.).
There's an earlier discussion at #10 (which was unrelated to the protocol). The general sentiment was that in-place operations must be avoided for a full dataframe API.
Right now https://data-apis.org/dataframe-protocol/latest/ says nothing about mutability, this should be added.
I have spent a lot of time trying to understand users and their behaviors in order to optimize for them. As a part of this work, I have done numerous studies on what gets used in pandas.
This will be extremely useful when it comes to defining a dataframe standard, because what people are using can help inform us on what behaviors to support.
For this study, we scraped the top 6000 notebooks from Kaggle by upvote.
Repo here, reproduction script included: https://github.com/modin-project/study_kaggle_usage
Results here: results.csv
Based on what is defined in wesm/dataframe-protocol#1, the idea is not to support a single format to exchange data, but to support multiple (e.g. Arrow, NumPy). Using a code example here, to see what this approach implies.

1. Dataframe implementations should implement the `__dataframe__` method, returning the exchange format we are defining.

For example, let's assume Vaex is using Arrow, and it wants to offer its data in Arrow format to consumers:
import pyarrow

class VaexExchangeDataFrame:
    """
    The format defined by our spec.

    Besides `to_arrow` and `to_numpy`, it should implement the rest of
    the spec: `num_rows`, `num_columns`, `column_names`...
    """
    def __init__(self, arrow_data):
        self.arrow_data = arrow_data

    def to_arrow(self):
        return self.arrow_data

    def to_numpy(self):
        raise NotImplementedError('numpy format not implemented')


class VaexDataFrame:
    """
    The public Vaex dataframe class.

    For simplicity of the example, this just wraps an arrow object received
    in the constructor, but this would be the whole `vaex.DataFrame`.
    """
    def __init__(self, arrow_data):
        self.arrow_data = arrow_data

    def __dataframe__(self):
        return VaexExchangeDataFrame(self.arrow_data)


# Creating an instance of the Vaex public dataframe
vaex_df = VaexDataFrame(
    pyarrow.RecordBatch.from_arrays(
        [pyarrow.array(['pandas', 'vaex', 'modin'], type='string'),
         pyarrow.array([26_300, 4_900, 5_200], type='uint32')],
        ['name', 'github_stars']))
Other implementations could use formats different from Arrow, for example, let's assume Modin wants to offer its data as numpy arrays:
import numpy

class ModinExchangeDataFrame:
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def to_arrow(self):
        raise NotImplementedError('arrow format not implemented')

    def to_numpy(self):
        return self.numpy_data


class ModinDataFrame:
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def __dataframe__(self):
        return ModinExchangeDataFrame(self.numpy_data)


modin_df = ModinDataFrame(
    {'name': numpy.array(['pandas', 'vaex', 'modin'], dtype='object'),
     'github_stars': numpy.array([26_300, 4_900, 5_200], dtype='uint32')})
2. Direct consumers should be able to understand all formats.

For example, pandas could implement a `from_dataframe` function to create a pandas dataframe from different formats:
import pandas

def from_dataframe(dataframe):
    known_formats = {'numpy': lambda df: pandas.DataFrame(df),
                     'arrow': lambda df: df.to_pandas()}
    exchange_dataframe = dataframe.__dataframe__()
    for format_ in known_formats:
        try:
            data = getattr(exchange_dataframe, f'to_{format_}')()
        except NotImplementedError:
            pass
        else:
            return known_formats[format_](data)
    raise RuntimeError('Dataframe does not support any known format')

pandas.from_dataframe = from_dataframe
This would allow pandas users to load data from other formats:
pandas_df_1 = pandas.from_dataframe(vaex_df)
pandas_df_2 = pandas.from_dataframe(modin_df)
Vaex, Modin and any other implementation could implement an equivalent function to load data from other libraries into their own format.
3. Indirect consumers can pick an implementation, and use it to standardize their input.

For example, Seaborn may want to accept any dataframe implementation, but write its data-access code against pandas. It could convert any dataframe to pandas using the `from_dataframe` function from the previous section:
def seaborn_bar_plot(any_dataframe, x, y):
    pandas_df = pandas.from_dataframe(any_dataframe)
    return pandas_df.plot(kind='bar', x=x, y=y)
seaborn_bar_plot(vaex_df, x='name', y='github_stars')
Are people happy with this approach?
CC: @rgommers
In #10, it's been discussed that it would be convenient if the dataframe API allows method chaining. For example:
import pandas

(pandas.read_csv('countries.csv')
    .rename(columns={'name': 'country'})
    .assign(area_km2=lambda df: df['area_m2'].astype(float) / 1_000_000)
    .query('(continent.str.lower() != "antarctica") | (population < area_km2)'))
This implies that most functionality is implemented as methods of the dataframe class. Based on pandas, the number of methods can be 300 or more, so it may be problematic to implement everything in the same namespace. pandas uses a mixed approach, with different techniques to try to organize the API.
- `df.sum()`, `df.astype()`: many of the methods are implemented directly as methods of the dataframe.
- `df.to_csv()`, `df.to_parquet()`: some of the methods are grouped with a common prefix.
- `df.str.lower()`, `df.dt.hour()`: accessors are properties of the dataframe (or series, but assuming only one dataframe class for simplicity) that group some methods under them.
- `pandas.wide_to_long(df)`, `pandas.melt(df)`: in some cases, functions are used instead of methods.
- `df.apply(func)`, `df.applymap(func)`: pandas also provides a more functional API, where functions can be passed as parameters.

I guess we will agree that a uniform and consistent API would be better for the standard. That should make things easier to implement, and also a more intuitive experience for the user.
Also, I think it would be good for the API to be easily extensible. A couple of examples of how pandas can be extended with custom functions:
@pd.api.extensions.register_dataframe_accessor('my_accessor')
class MyAccessor:
    def my_custom_method(self):
        return True

df.my_accessor.my_custom_method()

df.apply(my_custom_function)
df.apply(numpy.sum)
Conceptually, I think there are some methods that should go together, grouped not so much by topic as by the API they follow. The clearest example is reductions, and there was some discussion in #11 (comment).

I think no solution will be perfect, and the options that we have are (feel free to add to the list if I'm missing any option worth considering):

- `df.sum()`
- `df.reduce_sum()`
- `df.reduce.sum()`
- `mod.reductions.sum(df)` (where `mod` represents the implementation module, e.g. `pandas`)
- `df.reduce(mod.reductions.sum)`

Personally, my preference is the functional API. I think it's the simplest that keeps things organized, and the simplest to extend. The main drawback is its readability: it may be too verbose. There is the option to allow using a string instead of the function for known functions (e.g. `df.reduce('sum')`).
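As a rough sketch of the string-accepting variant (all names here are hypothetical, not part of any spec):

```python
known_reductions = {
    'sum': lambda col: sum(col.to_list()),
    'max': lambda col: max(col.to_list()),
}

class Column:
    def __init__(self, values):
        self._values = list(values)

    def to_list(self):
        return list(self._values)

    def reduce(self, func):
        # Accept either the name of a known reduction or any callable,
        # so reduce('sum') and reduce(mod.reductions.sum) both work.
        if isinstance(func, str):
            func = known_reductions[func]
        return func(self)

assert Column([1, 2, 3]).reduce('sum') == 6
assert Column([1, 2, 3]).reduce(lambda col: max(col.to_list())) == 3
```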
Thoughts? Other ideas?
One optional feature to consider including in our specification is nested "DataFrames"/Tables (or whatever name we decide to use there).
riptide does not currently support this concept, but I've been thinking recently that maybe it ought to since it provides a cleaner, more elegant solution for supporting "super columns".
Such "super columns" arise (for example) when performing multiple aggregates on a DataFrame, especially in the common case of grouping by the values/keys in one or more columns then calculating per-group reductions over some subset of the remaining columns.
pandas currently handles this scenario using the concepts of index "levels" and row labeling. This solves the problem but adds a good bit of complexity to the API, including having a stateful `DataFrame` class. (One could argue the statefulness of `DataFrame` is a pandas implementation detail of the approach and not inherently a drawback of the approach itself.)

riptide currently handles this scenario by having a `Multiset` class derived from our `Dataset` class, where `Multiset` is basically just a dictionary of named `Dataset`s. This works ok and isn't that far removed from the notion of having nested `Dataset` instances -- and if you squint just right when you look at `Multiset`, it's not really that different from the pandas system of index levels + row labels. However, `Multiset` has its own drawbacks. Most notably, deriving from our `Dataset` class means any function that knows how to operate on a `Dataset` also (in many cases) needs to know how to work with a `Multiset` in order to produce the "correct" output (in terms of the type, dimensions, and data) -- what's the expected output if one calls a 'merge'-type function with a `Multiset` and a `Dataset`, or with two `Multiset`s?
pyarrow's documentation for pandas interoperability says its `Table` class already supports nested DataFrames / column groups, although that's the only mention of this behavior I can find.
Nested DataFrames provide a clean solution for representing the "super columns" resulting from these multi-aggregation operations; specifically, they:

- remove the need for a separate `Multiset`-style class: each "column" in a DataFrame is either a 1D array (or maybe an array scalar?) or another DataFrame of the same length;
- let functions distribute naturally: e.g. calling `.sum()` on a DataFrame can distribute that function call over its columns by just iterating over them and calling `.sum()` on each of them.

One way the nesting makes things more complicated (maybe) is what to return from a property like `DataFrame.num_cols` -- should it be the number of columns as seen by that DataFrame instance, or should it be a flattened value (counting the columns from any nested DataFrames as well)? I think this could be disambiguated by having two properties, like `DataFrame.num_cols` and `DataFrame.num_cols_flat`.
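A minimal sketch of the two-property idea (hypothetical `NestedDataFrame`; columns are either 1D lists or nested frames):

```python
class NestedDataFrame:
    def __init__(self, columns):
        # `columns` maps a name to either a 1D list or another NestedDataFrame
        self._columns = columns

    @property
    def num_cols(self):
        """Number of columns as seen by this instance."""
        return len(self._columns)

    @property
    def num_cols_flat(self):
        """Number of leaf columns, recursing into nested frames."""
        return sum(
            col.num_cols_flat if isinstance(col, NestedDataFrame) else 1
            for col in self._columns.values()
        )
```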
I think there is consensus (correct me if I'm wrong), on having a 2-D structure where (at least) columns are labelled, and where a whole column share a type. More specific discussions about this structure can be made in #2.
In this issue, I'd like to discuss how we should name the class representing this structure. We've been using dataframe for the concept so far, and it's how the class is named in pandas, vaex, Modin, R and others. But in #14 (comment) it was proposed that we consider other names. I list here the proposed options in the comment and couple more. I propose that people write their username next to their preferred option, and use the comments to expand on why if needed.
Also, I think we should decide about capitalization. I guess these are the only options (using dataframe as the example, but applied to the preferred option from the above list):

- `DataFrame` or `Dataframe`, to be consistent with Python class capitalization
- `dataframe`, using Python type capitalization (as in `int`, `datetime.datetime`, `numpy.array`)
Hello everyone!
I've been mulling over introducing the Dataframe Exchange protocol in Pandas and Modin, and I think it would be beneficial for every end library implementing the protocol to have the exact same base.
Right now the protocol interface is defined by code, but that code is not "published" as ready-to-use Python source.

I would like to make it a real PyPI package, to use it in type hinting and (ideally) mypy type checking, and to enable other libraries to do the same.

I propose to publish the package as `dataframe-protocol` or `df-protocol`, and to rename the `protocol/` directory to `df_protocol`, turning it into a real Python package.

That way any library implementing the protocol would just `from df_protocol import exchange` and use it for type hints (and for enum values - as now they're embedded in docstrings, which just looks really weird to me).
Am I missing something here? Are there any objections?
I can make the PR with the necessary changes if there is agreement, and can keep it published both on PyPI and conda-forge (or hand the publishing over to someone else in the consortium).
P.S. Keeping the top-level `df_protocol` would allow us to add another subpackage for the cross-operation API if/when we feel ready for that (keeping this future-proof).
Probably a bit early to the discussion, but I think this will need to be discussed eventually.
Is a separate object representing a single column needed? Like having Series, instead of just using a one-column DataFrame.
Having two separate objects IMO adds a decent amount of complexity, both in the implementation and for the user. Whether this complexity is worth it or not, I don't know. But I think this shouldn't be replicated from pandas without a discussion.
For dataframe interchange, the smallest building block is a "buffer" (see gh-35, gh-38) - a block of memory. Interpreting that is nontrivial, especially if the goal is to build an interchange protocol in Python. That's why DLPack, the buffer protocol, `__array_interface__`, `__cuda_array_interface__`, `__array__` and `__arrow_array__` all exist, and are still complicated.

For what a buffer is, currently it's only a data pointer (`ptr`) and a size (`bufsize`) which together describe a contiguous block of memory, plus a device attribute (`__dlpack_device__`) and optionally DLPack support (`__dlpack__`). One open question is:
The other, larger question is how to make buffers nice to deal with for implementers of the protocol. The current Pandas prototype shows the issue:
import ctypes

import numpy as np


def convert_column_to_ndarray(col: ColumnObject) -> np.ndarray:
    """
    Convert an integer, float or boolean column to a numpy array.
    """
    if col.offset != 0:
        raise NotImplementedError("column.offset > 0 not handled yet")

    if col.describe_null not in (0, 1):
        raise NotImplementedError("Null values represented as masks or "
                                  "sentinel values not handled yet")

    # Handle the dtype
    _dtype = col.dtype
    kind = _dtype[0]
    bitwidth = _dtype[1]
    if kind not in (0, 1, 2, 20):
        raise RuntimeError("Not a boolean, integer or floating-point dtype")

    _ints = {8: np.int8, 16: np.int16, 32: np.int32, 64: np.int64}
    _uints = {8: np.uint8, 16: np.uint16, 32: np.uint32, 64: np.uint64}
    _floats = {32: np.float32, 64: np.float64}
    _np_dtypes = {0: _ints, 1: _uints, 2: _floats, 20: {8: bool}}
    column_dtype = _np_dtypes[kind][bitwidth]

    # No DLPack yet, so need to construct a new ndarray from the data pointer
    # and size in the buffer plus the dtype on the column
    _buffer = col.get_data_buffer()
    ctypes_type = np.ctypeslib.as_ctypes_type(column_dtype)
    data_pointer = ctypes.cast(_buffer.ptr, ctypes.POINTER(ctypes_type))

    # NOTE: `x` does not own its memory, so the caller of this function must
    # either make a copy or hold on to a reference of the column or
    # buffer! (not done yet, this is pretty awful ...)
    x = np.ctypeslib.as_array(data_pointer,
                              shape=(_buffer.bufsize // (bitwidth // 8),))
    return x
From #38 (review) (@kkraus14 & @rgommers):

> In `__cuda_array_interface__` we've generally stated that holding a reference to the producing object must guarantee the lifetime of the memory, and that has worked relatively well.

Yes, that works and I've thought about it. The trouble is where to hold the reference. You really need one reference per buffer, not just store a reference to the whole exchange dataframe object (buffers can end up elsewhere, outside the new pandas dataframe here). And given that a buffer just has a raw pointer plus a size, there's nothing to hold on to. I don't think there's a sane pure Python solution. `__cuda_array_interface__` is directly attached to the object you need to hold on to, which is not the case for this `Buffer`.

> I'd argue this is a place where we should really align with the array interchange protocol though, as the same problem is being solved there.

Yep, for numerical data types the solution can simply be: hurry up with implementing `__dlpack__`, and the problem goes away. The dtypes that DLPack does not support are more of an issue.
From #38 (comment) (@jorisvandenbossche):

> I personally think it would be useful to keep those existing interface methods (`__array__` or `__arrow_array__`). For people that are using those interfaces, that will be easier to interface with the interchange protocol than manually converting the buffers.
We could change the plain memory description + `__dlpack__` to:

1. (MUST) A plain memory description: `ptr`, `bufsize`, and device.
2. (MAY) A `native` enum attribute; if both producer and consumer happen to use that native format, they can call the corresponding protocol (`__arrow_array__` or `__array__`).
3. (MAY) Other existing interchange protocols (`__cuda_array_interface__`, the buffer protocol, `__array_interface__`).

(1) is required for any implementation to be able to talk to any other implementation, but it is also the most clunky to support, because it needs to solve the "who owns this memory and how do you prevent it from being freed" problem all over again. What is needed there is:

The advantage of (2) and (3) is that they have the most hairy issue already solved, and will likely be faster.

And the MUST/MAY should address @kkraus14's concern that people will just standardize on the lowest common denominator (numpy).
A summary of why this is hard is:
So what we are aiming for (ambitiously) is:
The "holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well" seems necessary for supporting the raw memory description. This probably means that (a) the Buffer
object should include the right Python object to keep a reference to (for Pandas that would typically be a 1-D numpy array), and (b) there must be some machinery to keep this reference alive (TBD what that looks like, likely not pure Python) in the implementation.
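A sketch of (a), assuming a numpy-backed producer (attribute names follow the prototype; the `_owner` field is the addition being discussed):

```python
import numpy as np

class Buffer:
    def __init__(self, array: np.ndarray):
        # Holding `_owner` ties the lifetime of the memory to the lifetime
        # of this Buffer, mirroring the __cuda_array_interface__ convention.
        self._owner = array
        self.ptr = array.__array_interface__['data'][0]
        self.bufsize = array.nbytes
```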
This is a first version of the analysis of pandas usage in Kaggle notebooks.
We've fetched Python notebooks from Kaggle and ran them using record_api to analyze the number of calls to the main objects of the pandas API. A total of 895 notebooks could be analyzed.
In a separate column, information about the page views in the pandas documentation has been added. The page views are normalized by 1,000 (so the page with more views in the pandas documentation would have a value of 1,000 in the column).
For simplicity, the attributes of `DataFrame`, `Series` and the `pandas` top-level module have been merged. So, `pandas.sum()`, `Series.sum()` and `DataFrame.sum()` would all appear in the list as simply `sum`.
The different sections are to help reading the document, and not an "official" categorization of the API. Feedback is welcome if something feels misplaced.
The source code to generate the table is available at this repo.
Notes:

- Magic methods (e.g. `__add__`) are merged with their equivalent method (e.g. `add`)
- `__getitem__` is used both to access a column (`df[col]`) and to filter (`df[condition]`)
- Columns can also be accessed via `__getattr__` (e.g. `df.col_name`), but this has not been captured

Object | Kaggle calls
---|---
`__getitem__` | 143992
`__setitem__` | 40059
`eq` | 3018
`mul` | 2799
`add` | 2768
`groupby` | 2267
`loc` | 1667
`drop` | 1618
`fillna` | 1609
`columns` | 1583
`head` | 1575
`truediv` | 1442
`shape` | 1267
`sub` | 1144
`isnull` | 1057
`sort_values` | 1015
`and` | 957
`values` | 953
`sum` | 898
`astype` | 728
`value_counts` | 706
`index` | 664
`gt` | 622
`apply` | 538
`to_frame` | 479
Object | Kaggle calls | Docs views
---|---|---
`info` | 275 | 22
`empty` | 0 | 32
`describe` | 303 | 146
`value_counts` | 706 | 161
`dtypes` | 175 | 64
`memory_usage` | 83 | 2
`ndim` | 0 | 1
`shape` | 1267 | 17
`size` | 3 | 45
`values` | 953 | 113
`attrs` | 0 | 0
`array` | 0 | 0
`unique` | 193 | 106
`dtype` | 149 | 8
`nbytes` | 0 | 0
Object | Kaggle calls | Docs views
---|---|---
`__getitem__` | 143992 | 0
`__setitem__` | 40059 | 0
`axes` | 0 | 4
`columns` | 1583 | 31
`set_index` | 72 | 278
`swapaxes` | 0 | 0
`select_dtypes` | 180 | 36
`lookup` | 0 | 11
`xs` | 5 | 16
`loc` | 1667 | 232
`iloc` | 427 | 122
`index` | 664 | 164
`reindex` | 11 | 136
`reindex_like` | 0 | 2
`reset_index` | 305 | 279
`add_prefix` | 16 | 6
`add_suffix` | 0 | 3
`get` | 0 | 16
`iat` | 1 | 17
`keys` | 13 | 16
`at` | 4 | 40
`filter` | 3 | 170
`rename` | 401 | 355
`rename_axis` | 0 | 13
`idxmax` | 7 | 49
`idxmin` | 0 | 10
`droplevel` | 0 | 0
`truncate` | 0 | 7
`swaplevel` | 0 | 7
`take` | 0 | 5
`reorder_levels` | 0 | 5
`sort_index` | 32 | 90
`set_axis` | 0 | 1
`pop` | 14 | 9
`searchsorted` | 0 | 3
`name` | 113 | 13
`item` | 0 | 3
`argmax` | 0 | 2
`argmin` | 0 | 1
`argsort` | 0 | 3
Object | Kaggle calls | Docs views
---|---|---
`nlargest` | 25 | 17
`nsmallest` | 1 | 8
`head` | 1575 | 108
`tail` | 60 | 12
`drop_duplicates` | 20 | 194
`sort_values` | 1015 | 457
`sample` | 63 | 102
`query` | 12 | 69
Object | Kaggle calls | Docs views
---|---|---
`add` | 2768 | 104
`div` | 2 | 10
`dot` | 0 | 9
`eq` | 3018 | 1
`equals` | 0 | 35
`floordiv` | 3 | 0
`ge` | 68 | 1
`gt` | 622 | 1
`le` | 197 | 0
`lt` | 8 | 0
`mod` | 11 | 1
`mul` | 2799 | 4
`ne` | 163 | 1
`pow` | 29 | 2
`product` | 0 | 3
`radd` | 0 | 6
`rdiv` | 0 | 0
`rfloordiv` | 0 | 0
`rmod` | 0 | 0
`rmul` | 0 | 2
`rpow` | 0 | 0
`rsub` | 0 | 2
`rtruediv` | 0 | 2
`sub` | 1144 | 7
`truediv` | 1442 | 0
Object | Kaggle calls | Docs views
---|---|---
`isnull` | 1057 | 90
`notnull` | 60 | 40
`dropna` | 193 | 346
`fillna` | 1609 | 248
`interpolate` | 3 | 39
`isna` | 108 | 27
`notna` | 5 | 11
`hasnans` | 0 | 0
Object | Kaggle calls | Docs views
---|---|---
`cut` | 59 | 84
`eval` | 0 | 12
`corrwith` | 1 | 11
`applymap` | 2 | 49
`astype` | 728 | 234
`rank` | 2 | 34
`clip` | 4 | 13
`where` | 10 | 105
`mask` | 14 | 25
`combine` | 0 | 12
`combine_first` | 0 | 11
`isin` | 86 | 138
`abs` | 25 | 12
`replace` | 463 | 216
`apply` | 538 | 379
`round` | 14 | 68
`transform` | 10 | 39
`factorize` | 3 | 15
`map` | 420 | 91
`between` | 1 | 12
Object | Kaggle calls | Docs views
---|---|---
`cov` | 0 | 9
`quantile` | 47 | 78
`var` | 4 | 11
`skew` | 88 | 5
`std` | 140 | 39
`sum` | 898 | 114
`kurt` | 60 | 1
`kurtosis` | 23 | 3
`count` | 109 | 107
`max` | 131 | 70
`mean` | 390 | 107
`median` | 228 | 21
`min` | 107 | 26
`mode` | 205 | 18
`prod` | 1 | 1
`nunique` | 15 | 27
`all` | 9 | 16
`any` | 87 | 22
`mad` | 3 | 2
`sem` | 0 | 2
`corr` | 239 | 105
`is_monotonic` | 0 | 0
`is_monotonic_decreasing` | 0 | 0
`is_monotonic_increasing` | 0 | 0
`is_unique` | 0 | 1
`cov` | 0 | 9
`autocorr` | 0 | 7
`quantile` | 47 | 78
Object | Kaggle calls | Docs views
---|---|---
`iterrows` | 39 | 102
`style` | 84 | 76
`itertuples` | 0 | 36
`bool` | 0 | 5
`squeeze` | 0 | 2
`update` | 8 | 56
`pipe` | 3 | 7
`__iter__` | 0 | 1
`items` | 1 | 6
`iteritems` | 3 | 37
`view` | 0 | 0
Object | Kaggle calls | Docs views
---|---|---
`get_dummies` | 258 | 152
`crosstab` | 58 | 40
`concat` | 432 | 315
`merge_asof` | 0 | 16
`merge_ordered` | 0 | 4
`wide_to_long` | 0 | 7
`pivot` | 29 | 95
`pivot_table` | 54 | 144
`join` | 159 | 225
`melt` | 18 | 75
`stack` | 0 | 36
`transpose` | 9 | 76
`assign` | 19 | 74
`insert` | 17 | 57
`merge` | 425 | 413
`drop` | 1618 | 625
`explode` | 0 | 0
`align` | 3 | 10
`append` | 439 | 515
`T` | 55 | 6
`unstack` | 17 | 58
`repeat` | 0 | 5
`ravel` | 0 | 5
Object | Kaggle calls | Docs views
---|---|---
`agg` | 0 | 16
`aggregate` | 3 | 58
`groupby` | 2267 | 719
Object | Kaggle calls | Docs views
---|---|---
`cummax` | 0 | 2
`cummin` | 0 | 0
`cumprod` | 0 | 5
`cumsum` | 8 | 29
`pct_change` | 0 | 34
`rolling` | 42 | 140
`ewm` | 0 | 33
`expanding` | 0 | 11
`duplicated` | 14 | 90
`diff` | 1 | 54
On the call yesterday, the topic of mutability came up in the vaex demo.
The short version is that it may be difficult or impossible for some systems to implement in-place mutation of dataframes. For example, I believe that neither vaex nor Dask implements the following:
In [8]: df = pd.DataFrame({"A": [1, 2]})
In [9]: df
Out[9]:
A
0 1
1 2
In [10]: df.loc[0, 'A'] = 0
In [11]: df
Out[11]:
A
0 0
1 2
I think in the name of simplicity, the API standard should just not define any methods that mutate existing data in-place.

There is one mutation-adjacent area that might be considered: using `DataFrame.__setitem__` to add an additional column:
In [12]: df['B'] = [1, 2]
In [13]: df
Out[13]:
A B
0 0 1
1 2 2
Or perhaps to update the contents of an entire column
In [14]: df['B'] = [3, 4]
In [15]: df
Out[15]:
A B
0 0 3
1 2 4
In these cases, no values are actually being mutated in-place. Is that acceptable?
This issue is to discuss how to obtain the size of a dataframe. I'll show with an example, and base it in the pandas API.
Given a dataframe:
import pandas

data = {'col1': [1, 2, 3, 4],
        'col2': [5, 6, 7, 8]}
df = pandas.DataFrame(data)
I think the Pythonic and simpler way to get the number of rows and columns is to just use Python's `len`, which is what pandas does:
>>> len(df) # number of rows
4
>>> len(df.columns) # number of columns
2
I guess an alternative could be to use `df.num_rows` and `df.num_columns`, but IMHO it doesn't add much value, and just makes the API more complex.
One thing to note is that pandas mostly implements the `dict` API for a dataframe (as if it were a dictionary of lists, like in the example `data`). But when returning the number of rows with `len(df)`, this is inconsistent with the `dict` API, which would return the number of columns (keys). So, with the proposed API, `len(data) != len(df)`. I think being fully consistent with the `dict` API would be misleading, but it is worth considering.
Then, pandas offers some extra properties:
df.ndim == 2
df.shape == (len(df), len(df.columns))
df.size == len(df) * len(df.columns)
I guess the reason for the first two is that pandas originally implemented `Panel`, a three-dimensional data structure, and `ndim` and `shape` made sense with it. But I don't think they add much value now.

I don't think `size` is that commonly used (will check once we have the data from analyzing pandas usage), and it's trivial for users to implement, so I wouldn't add it to the API.
So, to summarize, I propose:

- `len(df)` returning the number of rows
- `len(df.columns)` returning the number of columns

And nothing else regarding the shape of a dataframe.
This issue is dedicated to discussing the large topic of "missing" data.

First, a bit on names. I think we can reasonably choose between `NA`, `null`, or `missing` as a general name for "missing" values. We'd use that to inform decisions on method names like `DataFrame.isna()` vs. `DataFrame.isnull()` vs. ... Pandas favors `NA`, databases might favor `null`, Julia uses `missing`. I don't have a strong opinion here.
Some topics of discussion:
I think we'd like the introduction of missing data to not fundamentally change the dtype of a column.
This is not the case with pandas:
In [5]: df1 = pd.DataFrame({"A": ['a', 'b'], "B": [1, 2]})
In [6]: df2 = pd.DataFrame({"A": ['a', 'c'], "C": [3, 4]})
In [7]: df1.dtypes
Out[7]:
A object
B int64
dtype: object
In [8]: pd.merge(df1, df2, on="A", how="outer")
Out[8]:
A B C
0 a 1.0 3.0
1 b 2.0 NaN
2 c NaN 4.0
In [9]: _.dtypes
Out[9]:
A object
B float64
C float64
In pandas, for int-dtype data `NaN` is used as the missing value indicator. `NaN` is a float, and so the column is cast to float64 dtype.

Ideally `Out[9]` would preserve the int dtype for `B` and `C`. At this moment, I don't have a strong opinion on whether the dtype for `B` should be a plain `int64`, or something like a `Union[int64, NA]`.
In general, missing values should propagate in arithmetic and comparison operations (using `<NA>` as a marker for a missing value).
>>> df1 = DataFrame({"A": [1, None, 3]})
>>> df1 + 1
A
0 2
1 <NA>
2 4
>>> df1 == 1
A
0 True
1 <NA>
2 False
There might be a few exceptions. For example, `0 ** NA` might be 1 rather than `NA`, since the result doesn't depend on what value `NA` takes on.
For boolean logical operations (and, or, xor), libraries should implement three-valued or Kleene logic. The pandas docs have a table.

The short version is that the result should be `NA` only if it depends on the value of the `NA` operand. For example, `True | NA` is `True`, since it doesn't matter whether that `NA` is "really" True or False.
Libraries might need to implement a scalar `NA` value, but I'm not sure. As a user, you would get this from indexing to get a scalar, or from an operation that produces an NA result.
>>> df = pd.DataFrame({"A": [None]})
>>> df.iloc[0, 0] # no comment on the indexing API
<NA>
What semantics should this scalar NA have? In particular, should it be typed? This is something we've struggled with in recent versions of pandas. There's a desire to preserve a property along the lines of the following:

(arr1 + arr2)[0].dtype == (arr1 + arr2[0]).dtype

where the first value in the second array is `NA`. If you have a single `NA` without any dtype, you can't implement that property.
There's a long thread on this at pandas-dev/pandas#28095.
There was a question on the sync call today about defining "what is a data frame?". People may have different perspectives, but I wanted to offer mine:
A "data frame" is a programming interface for expressing data manipulations and analytical operations on tabular datasets (a dataset, in turn, is a collection of columns each having their own logical data type) in a general purpose programming language. The interface often exposes imperative, composable constructs where operations consist of multiple statements or function calls. This contrasts with the declarative interface of query languages like SQL.
Things that IMHO should not be included in the definition, being implementation-specific concerns that any given "data frame library" may handle differently:
Hopefully one objective of this group will be to define a standardized programming interface that avoids commingling implementation-specific details into the interface.
That said, there may be people that want to create "RPandas" (see RPython) -- i.e. to provide for substituting new objects into existing code that uses pandas. If that's what some people want, we will need to clarify that up front.
We've already got several useful discussions open, on different topics. To try to give a bit of structure to the conversations, I propose we try to start with an initial MVP (minimum viable product), and we build iterating over it.
This is a draft of the topics that we may want to discuss, and a possible order to discuss them:

- Accessing columns (e.g. `df[col]`, `df[col1, col2]`), and calling methods on 1 vs N columns

The idea would be to discuss and decide about each topic incrementally, and keep defining an API that can be used end to end (with very limited functionality at the beginning). So, focusing on being able to write code with the API, we should identify, for each topic, the questions that need to be answered to construct the API, and then add the API definition to the RFC based on the agreements.
Next there is a very simple example of dataframe usage. And the questions that need to be answered to define a basic API for them.
>>> from whatever import dataframe
>>> data = {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}
>>> df = dataframe.load(data, format='dict')
>>> df
a b c
-----
1 3 5
2 4 6
>>> len(df)
2
>>> len(df.columns)
3
>>> df.dtypes
[int, int, int]
>>> df.columns
['a', 'b', 'c']
>>> df.columns = 'x', 'y', 'z'
>>> df.columns
['x', 'y', 'z']
>>> df
x y z
-----
1 3 5
2 4 6
>>> df['q'] = [7, 8]
>>> df
x y z q
-------
1 3 5 7
2 4 6 8
>>> df['y']
y
-
3
4
>>> df['z', 'x']
z x
---
5 1
6 2
>>> df.dump(format='dict')
{'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'q': [7, 8]}
The simpler questions that need to be answered to define this MVP API are:

- How should the dataframe class be named?
  - `DataFrame` or `Dataframe`, to be consistent with Python class capitalization
  - `dataframe`, using Python type capitalization (as in `int`, `bool`, `datetime.datetime`...)
- How should the dimensions of the dataframe be obtained?
  - properties (`num_columns`, `num_rows`)
  - `len`: `len(df)`, `len(df.columns)`
  - `shape` (it allows for N dimensions, which for a dataframe is not needed, since it's always 2D)
- How should the column types be obtained (is a `dtypes` property enough?)
- How should the column names be obtained and set (`columns`, `column_names`...)?
The next two questions are also needed, but they are more complex, so I'll be creating separate issues for them:

Loading and exporting data

- Should there be specific functions for loading data from disk (`pandas.read_csv`...) and for loading data from memory (`DataFrame.from_dict`)? Or is a standard way for all loading/exporting preferred?

How to access and set columns in a dataframe

- `__getitem__` directly (`df[col]` / `df[col] = foo`)
- `__getitem__` over a property (`df.col[col]` / `df.col[col] = foo`)
- getter/setter methods (`df.get(col)` / `df.set(col=foo)`)
The bulk of the dataframe interchange protocol was done in gh-38. There were still a number of TODOs however, and more will likely pop up once we have multiple implementations so we can actually turn one type of dataframe into another type. This is the tracking issue for those TODOs and issues:

- Handling of `null` as a category: it should not have a specified meaning, it's just another category that should (e.g.) roundtrip correctly. See conversation in the 8 Apr meeting.
- The dtype is described as `(kind, bitwidth, format_str, endianness)`, with categorical being a value of the `kind` enum. Is making a 5th element in the dtype, with that element being another dtype 4-tuple, thereby allowing for nesting, sensible?
- A `metadata` attribute that can be used to store library-specific things. For example, Vaex should be able to store expressions for its virtual columns there. See PR gh-43.
- String data needs an `offsets` and a `data` buffer (see #38 (comment)). See PR gh-45.
- Should there be a `from_dataframe` protocol? See #42 and the meeting of 20 May.
- An `owner` attribute is perhaps needed. See meeting minutes 4 March, #39, and comments on this PR.

Should a dedicated API/column metadata to efficiently support sparse columns be part of the spec?
It can be the case that a given column has more than 99% of its values null or missing (or some other repeated constant value), and therefore we would waste both memory and computation by using a dense memory representation that materializes these repeated values explicitly.

Example use cases (incomplete, feel free to edit or comment):

- computing `nanmean` and `nanstd` of a sparse column where more than 99% of the values are missing
- representing a mostly-constant column compactly (e.g. with a `fill_value` param)
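A minimal sketch of what such a representation could look like (field names are assumptions, not a spec proposal):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SparseColumn:
    """COO-style sparse column: only values different from `fill_value`
    are materialized."""
    length: int                 # logical number of rows
    indices: np.ndarray         # positions of the materialized values
    values: np.ndarray          # the materialized values themselves
    fill_value: float = np.nan  # value implied at every other position

    def nanmean(self) -> float:
        # With a NaN fill value, only the materialized entries count,
        # so no dense array ever needs to be allocated.
        return float(np.nanmean(self.values))
```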
param)From my own experience, there are (at least) two very different use cases to use dataframes:
While I think pandas (and later Vaex, Dask, Modin...) did a very reasonable job at building a single tool that solves both use cases. There are trade offs, that IMO will bias any API towards one or the other.
Some specific examples:
I wrote a post about this that describes this point of view in more detail.
I think it can help the discussions to keep in mind that there are at least two main use cases, and that there will be trade offs among them.
Feedback here very welcome.
What data types should be part of the standard? For the array API, the types have been discussed here.
A good reference for data types for data frames is the Arrow data types documentation. The page probably contains many more types than the ones we want to support in the standard.
Topics to make decisions on:
These are IMO the main types (feel free to disagree):
Some other types that could be considered:
And also types based on other types that could be considered:
This is a follow-up of the discussions in:

- pandas has parameters (`bool_only`, `numeric_only`) to apply the operation only over columns of certain types. Do we want that?

See this example:
>>> df[['name', 'population']].mean()
population 2.729748e+07
dtype: float64
Even though the `name` column is selected, it is ignored, since the mean of a string column does not make sense, as opposed to raising an exception.

Many reductions implement a parameter to control this behavior:
df[['name', 'population']].mean(numeric_only=False)
TypeError: could not convert string to float:
If we consider more methods to be applied directly over a dataframe, for example:

>>> df[['first_name', 'last_name']].str.lower()
We may end up with a huge number of `string_only`, `bool_only`, `numeric_only` parameters, all meaning something similar, but IMO adding a decent amount of complexity, and making it difficult to keep the behavior consistent.
My preference would be to always raise, but being a software engineer I'm biased, and I guess many users may want this "magic".
So, I guess implementing an option, for example `pandas.options.mode.invalid_dtype` with values {`raise`, `skip`}, could make more sense.
The main problem with this approach is probably that it's not as easy to define the behavior for each operation:
(df.mean(numeric_only=True)
.mean(numeric_only=False))
Personally, I don't see this as an issue. IMO, the behavior depends more on the user than on the operation. I'd say for production code, having to be explicit, and selecting the columns to operate with, makes more sense. While in a notebook, avoiding exceptions with this sort of "magic" seems to be more useful.
I guess for a `Series`/1-column `DataFrame` (see #6) it always makes sense to raise an exception.
Thoughts?
In March '20 there was a very detailed discussion about introducing a new `__dataframe__` protocol: https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267. The purpose of it is to be able to exchange data between different implementations, or export data to (e.g.) Apache Arrow or NumPy.
There's a strawman implementation at wesm/dataframe-protocol#1.
The discussion went a little all over the place, with many people misunderstanding that the main purpose was data exchange rather than providing an API to manipulate or do computations with a dataframe. The latter would be a much larger topic, and something this consortium aims to deliver an RFC for.
That said, the `__dataframe__` topic is very much related, and is also a potentially interesting example of a cross-dataframe-library topic that could really benefit from having a detailed RFC with requirements and use cases. We should consider picking that topic up, and applying lessons from it in community engagement.
When researching all possible dtypes with missing values in Vaex, and observing how this is handled in the pandas implementation, I found that there is a `BooleanDtype` in pandas that gives an error.
def test_bool():
    df = pd.DataFrame({"A": [True, False, False, True]})
    df["B"] = pd.array([True, False, pd.NA, True], dtype="boolean")
    df2 = from_dataframe(df)
    tm.assert_frame_equal(df, df2)
My question is: when thinking of all possible entries into a Vaex dataframe, should one stick to the common cases, or should one dissect all possibilities at this level?
I couldn't work out whether the interchange dataframe (i.e. the dataframe returned from `__dataframe__()`) should also have a `__dataframe__()` method itself, e.g.
>>> import pandas as pd
>>> df = pd.DataFrame() # i.e. the top-level dataframe
>>> interchange_df1 = df.__dataframe__()
>>> interchange_df2 = interchange_df1.__dataframe__()
Among the upstreams of the current adopters, we have a split on whether the interchange dataframe has this method:

Library | Top-level | Interchange
---|---|---
pandas | ✔️ | ✔️
vaex | ✔️ | ❌
modin | ✔️ | ✔️
cuDF | ✔️ | ❌
I had assumed that interchange dataframes should have `__dataframe__()` by virtue of it being a method on the DataFrame API object. I think it makes sense: `from_dataframe()`-like functions then only need to check for `__dataframe__()` to support interchanging both top-level and interchange dataframes of different libraries.

If there is explicit specification somewhere in this regard then please give me a pointer! In any case, it might be worth clarifying in the `__dataframe__()` docstring where this method should reside.
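For what it's worth, the self-returning behavior is trivial to provide; a sketch (the `nan_as_null`/`allow_copy` arguments follow the signature used in the protocol prototypes):

```python
class InterchangeDataFrame:
    """The object a library returns from DataFrame.__dataframe__()."""

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        # Already in interchange form, so just return self. This lets
        # from_dataframe()-like functions treat top-level and interchange
        # dataframes uniformly.
        return self
```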
In order to not lose information that is encoded in DataFrames and Columns that is not covered by our API, we may want to provide extra metadata slots for these.
One may argue that this should be covered in the API, and this defeats the purpose of a standard, but I think it's a very pragmatic approach to guarantee lossless roundtripping for information outside of this standard and help adoption (because there is an escape hatch).
Example metadata for a dataframe:

Example metadata for a column:

- a unit (e.g. `'km/s'`, `'parsec'`, `'furlong'`)
- whether the column is (part of) the `index` in Pandas
An implementation could be a def get_metadata(self) -> dict[str, Any]
where we recommend prefixing keys with implementation specific names, like 'arrow.extention_type'
, 'vaex.unit'
, 'pandas.extension_type_name'
etc.
Commonly used keys could be upgraded to be part of the API in the future (becoming non-prefixed keys that we formalize and document).
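A sketch of what that might look like on the producer side (the keys and values are purely illustrative):

```python
from typing import Any

class Column:
    def get_metadata(self) -> dict[str, Any]:
        # Implementation-prefixed keys provide an escape hatch for
        # information the standard does not cover, so it can roundtrip.
        return {
            'vaex.unit': 'km/s',
            'pandas.is_index': False,
        }
```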
FYI: metadata is a first-class citizen in the Clojure language https://clojure.org/reference/metadata
Next are listed the reductions over numerical types defined in pandas. These can be applied:

- to a `Series`
- to a `DataFrame`

pandas is not consistent in letting any reduction be applied to any of the above. Each method is independent (`Series.sum`, `GroupBy.sum`, `Window.sum`...). Some reductions are not implemented for some of the classes, and the signatures can change (e.g. `Series.var(ddof)` vs `EWM.var(bias)`).
I propose to have standard signatures for the reductions, and have all reductions available to all classes.
- `all()`
- `any()`
- `count()`
- `nunique()` # maybe the name could be `count_unique`, `count_distinct`...?
- `mode()` # what to do if there is more than one mode? Ideally we would like all reductions to return a scalar
- `min()`
- `max()`
- `median()`
- `quantile(q, interpolation='linear')` # in pandas `q` is 0.5 by default, but I think it's better to require it; interpolation can be {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
- `sum()`
- `prod()`
- `mean()`
- `var(ddof=1)` # delta degrees of freedom (for some classes `bias` is used)
- `std(ddof=1)`
- `skew()`
- `kurt()` # pandas also has the alias `kurtosis`
- `sem(ddof=1)` # standard error of the mean
- `mad()` # mean absolute deviation
- `autocorr(lag=1)`
- `is_unique()` # in pandas this is a property
- `is_monotonic()` # in pandas this is a property
- `is_monotonic_decreasing()` # in pandas this is a property
- `is_monotonic_increasing()` # in pandas this is a property
# in pandas is a propertyReductions that may depend on row labels (and could potentially return a list, like mode
):
idxmax()
/ argmax()
idxmin()
/ argmin()
These need an extra column other
:
cov(other, ddof=1)
corr(other, method='pearson')
# method can be {โpearsonโ, โkendallโ, โspearmanโ}bool_only
, numeric_only
) to let only apply the operation over columns of certain types only. Do we want it?
df.select_columns_by_dtype(int).sum()
would be preferrable than a parameter to all or some reductionslevel
parameter in many reductions, for MultiIndex. If Indexing/MultiIndexing is part of the API, do we want to have it?min_count
/min_periods
parameter in some reductions (e.g. sum
, min
), to return NA
if less than min_count
values are present. Do we want to keep it?df[col].sum()
)df[col].reduce.sum()
)reduce
function, and passing the specific functions as a parameter (e.g. df[col].reduce(sum)
)One of the uncontroversial points from #2 is that DataFrames have column labels / names. I'd like to discuss two specific points on this before merging the results into that issue.
I'm a bit unsure whether these are getting too far into the implementation side of things. Should we just take no stance on either of these?
My responses:
Operations like `crosstab`/`pivot` place a column from the input dataframe into the column labels of the output.

We'll need to be careful with how this interacts with the indexing API, since a label like the tuple `('my', 'label')` might introduce ambiguities (e.g. when the full list of labels is `['my', 'label', ('my', 'label')]`).
Is it reasonable to require each label to be hashable? Pandas requires this, to facilitate lookup in a hashtable.

Dataframes are commonly used to wrangle real-world data into shape, and real-world data is messy. If an implementation wants to ensure uniqueness (perhaps on a per-object basis) then it can offer that separately. But the API should at least allow for it.
We currently list "datetime support" in the design document, and also listed it in the dtype docstring (`protocol/dataframe_protocol.py`, line 142 at 27b8e1c).

But at the moment the spec doesn't say anything about how the datetime is stored (which resolution, or whether it supports multiple resolutions with some parametrization).

Updating the spec to say it should be nanoseconds might be the obvious solution (since that's the only resolution pandas currently supports), but I think we should make this more flexible and allow different units (hopefully pandas will support non-nanosecond resolutions in the future, and other systems might use other resolutions by default).
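For reference, NumPy already parametrizes `datetime64` by unit, which is the kind of flexibility suggested here:

```python
import numpy as np

# The same date stored at three different resolutions
ns = np.array(['2021-01-01'], dtype='datetime64[ns]')  # nanoseconds
us = np.array(['2021-01-01'], dtype='datetime64[us]')  # microseconds
s = np.array(['2021-01-01'], dtype='datetime64[s]')    # seconds
```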
In `describe_null` we currently list the following options:

While looking at the pandas implementation, I was wondering if we shouldn't treat NaT differently from NaN and see it as a sentinel value (option 2 in the list above).

While NaN could also be seen as a kind of sentinel value, there are some clear differences: NaN is a floating-point concept backed by the IEEE 754 standard, while as far as I know NaT is quite numpy-specific (e.g. Arrow doesn't support it). NaNs also evaluate as non-equal (following the standard); for datetime64 with NaT that's also the case in numpy, but if you view the data as int64 it's not (and e.g. for DLPack those values will be regarded as int64, and the actual Buffer object might be agnostic to it).
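The difference is easy to demonstrate with NumPy (the int64 sentinel behind NaT is the minimum int64 value):

```python
import numpy as np

nat = np.array(['NaT'], dtype='datetime64[ns]')
print(nat == nat)              # [False] - NaN-like, non-equal semantics
print(nat.view('int64'))       # [-9223372036854775808] - just a sentinel
print(nat.view('int64') == nat.view('int64'))  # [ True]
```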
This issue supersedes #1 and #14. As agreed in the 6 Aug call, the first milestone in the definition of the dataframe API will be the part to interchange data. As a sample use case, Matplotlib being able to receive dataframes from different implementations (e.g. pandas, vaex, modin, cudf, etc.).
This work was originally discussed in OSSData, and an initial draft was later proposed here: wesm/dataframe-protocol#1.
The topics to discuss and decide on are next:

- the protocol method itself (`__dataframe__`)

The procedure to include this part in the standard RFC will be as follows:
In today's meeting it was discussed what's the goal of the API, and which are its target users.
@maartenbreddels and @devin-petersohn, if I understood correctly, see the API we're defining here as something they'd like to implement internally in Vaex and Modin, but not make their public API. Not sure what pandas' point of view on that is.
I think that's perfectly fine, and it makes sense. But I have the question of whether it would make sense for those public APIs to be independent wrappers, in the same way Seaborn wraps Matplotlib, or HoloViews wraps Bokeh. Let me expand on what I mean here.
From the discussions we had, I think people mentioned that they were interested in defining a more "pure" and less "magic" API than the existing one. Not sure if the previous sentence makes a lot of sense, but I guess some of the principles for the API could be:
Personally, I think this API should be great for software developers. Like developers of libraries like us, who want to build on top of it. Or developers of downstream software. And I'd say, also to data engineers, and people who want to write production code with dataframes.
Then, I understand that some users (e.g. data analysts) prefer more "magic" API's, that automatically fix problems they don't want to care about. As an example, let's think of the dataframe constructor.
As a data analyst, or other non-software person, I think it's very reasonable/convenient for the next code to work:
DataFrame({'a': [1, 2], 'b': [3, 4]})
DataFrame([{'a': 1, 'b': 3}, {'a': 3, 'b': 4}])
DataFrame(json.loads(value))
But as software engineer, I may want to have a more explicit and less magic syntax, for example:
DataFrame.load({'a': [1, 2], 'b': [3, 4]}, kind='dict')
DataFrame.load([{'a': 1, 'b': 3}, {'a': 3, 'b': 4}], kind='list_of_dict')
DataFrame.load(json.loads(value), kind='dict')
Correct me if I'm wrong, but I think there is mostly agreement that we want to focus the consortium API on the latter style. If Vaex, Modin, pandas... provide this API, then there is easy compatibility across the ecosystem. For example, Scikit-learn or Matplotlib can take a "dataframe" as a parameter and operate with it, since they know it will follow the standard API.
But then, implementations like Modin, Vaex, or pandas, may want to keep their existing API's. Or provide a different user API, more targeted to specific users (e.g. data analysts, who want the library making guesses, that make their lives easier).
Then my question is, does it make sense that this alternative API live in the implementations? For example, let's consider I see pandas as this API on top of numpy, Vaex on top of memory maps, and Modin on top of Ray (excuse the simplification). Then, if Modin wants to implement an SQLite-like API. Could make sense that this is an independent project, of an SQLite-like API that wraps the standard API? Instead of a Modin API? I guess that could make sense.
Then, I guess there is the case of an implementation, let's say pandas, which is planning to expose the API to users, but is going to add some extra magic (let's say the standard for filtering is `df.filter(condition)`, but pandas wants to keep supporting `df[condition]` for backward compatibility). Or Vaex having some specific syntax for expressions on top of the standard API.
I see there is a whole range between these options:
Would be great to know other people thoughts. I think most people have an idea on how this API is expected to be used, but not sure if we're all in the same page.
Hi all, great to see some continued work on this project after the original discussion from last year. I still think it's useful to allow libraries to "throw data over the wall" without forcing eager serialization to a particular format (like pandas or Arrow).
> TBD: Arrow has a separate "null" dtype, and has no separate mask concept.
> Instead, it seems to use "children" for both columns with a bit mask,
> and for nested dtypes. Unclear whether this is elegant or confusing.
> This design requires checking the null representation explicitly.

Could you clarify what is confusing? I do not understand the statement 'Instead, it seems to use "children" for both columns with a bit mask and for nested dtypes.'
Later:

> The Arrow design requires checking:
> 1. the ARROW_FLAG_NULLABLE (for sentinel values)
> 2. if a column has two children, combined with one of those children
>    having a null dtype.
>
> Making the mask concept explicit seems useful. One null dtype would
> not be enough to cover both bit and byte masks, so that would mean
> even more checking if we did it the Arrow way.
You mean the Arrow C interface here. Could you clarify what these other things mean?
Re: "One null dtype would not be enough to cover both bit and byte masks, so that would even more checking if we did it the Arrow way.", I don't know what this means, could you clarify?
@property
def null_count(self) -> Optional[int]:
    """
    Number of null elements, if known.

    Note: Arrow uses -1 to indicate "unknown", but None seems cleaner.
    """
    pass

Here you should indicate that you mean the Arrow C interface (where the `null_count` is an `int64`).
- The protocol requires an `offsets` buffer, forcing serialization always for variable-size binary data. I haven't thought through what the alternatives would be for string data that do not necessarily force this serialization (similar to the API proposal from wesm/dataframe-protocol#1); if the goal of this API is to reduce the need to serialize, having some alternative here might be worthwhile.
- It could be useful to have `to_*` methods on Column (like `Column.to_arrow`). I guess that `pyarrow` could implement a built-in implementation of this interface, but some producers might be able to produce Arrow or NumPy arrays directly and skip the lower-level memory export that's provided here.
- Nested types are mentioned in the `dtype` docstring, but I would encourage you to think about them up front rather than bolting them on later.
docstring, but I would encourage you to think about them up front rather than bolting on later.In other issues we find some detailed analyses of how the pandas API is used today, e.g. gh-3 (on Kaggle notebooks) and in https://github.com/data-apis/python-record-api/tree/master/data/api (for a set of well-known packages). That data is either not relevant for a developer-focused API though, or is so detailed that it's hard to get a good feel for what's important. So I thought it'd be useful to revisit the topic. I used https://libraries.io/pypi/pandas and looked at some of the top repos that declare a dependency on pandas
.
Top 10 listed:
Perhaps the most interesting pandas usage. It's a hard dependency, is used a fair amount and for more than just data access; however it all still seems fairly standard and common, so may be a reasonable target to make work with multiple libraries. Uses a lot of `isinstance` checks (on `pd.DataFrame`, `pd.Series`).
- `seaborn/_core.py`: `Series`, `to_numeric`
- `seaborn/matrix.py`: `DataFrame`, `isnull`, `.index.equals`, `.column.equals`
- `seaborn/utils.py`: `DataFrame`, `Categorical`, `notnull`
- `seaborn/regression.py`: only `pd.notnull`
- `seaborn/distributions.py`: `.values`, `.copy`, `.iloc`, `.loc`, `.reset_index`, `.index`, `set_index`, `MultiIndex.from_arrays`, `Index`, `Series`, `concat`, `merge`
- `seaborn/relational.py`: `DataFrame`, `merge`, `.rename`
- `seaborn/categorical.py`: `DataFrame`, `iteritems`, `Series`, `notnull`, `option_context`, `isnull`, `groupby`, `get_group`
- `seaborn/_statistics.py`: only `Series`
just a single non-test usage, in pd.py:
def validate_location(location):  # noqa: C901
    """..."""
    if isinstance(location, np.ndarray) \
            or (pd is not None and isinstance(location, pd.DataFrame)):
        location = np.squeeze(location).tolist()
def if_pandas_df_convert_to_numpy(obj):
    """Return a Numpy array from a Pandas dataframe.

    Iterating over a DataFrame has weird side effects, such as the first
    row being the column names. Converting to Numpy is more safe.
    """
    if pd is not None and isinstance(obj, pd.DataFrame):
        return obj.values
    else:
        return obj
Interesting/unusual common pattern, which extends `pd.DataFrame` through pandas_flavor with either accessors or methods. E.g. from [janitor/biology.py](https://github.com/pyjanitor-devs/pyjanitor/blob/a6832d47d2cc86b0aef101bfbdf03404bba01f3e/janitor/biology.py):
import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def join_fasta(
    df: pd.DataFrame, filename: str, id_col: str, column_name: str
) -> pd.DataFrame:
    """
    Convenience method to join in a FASTA file as a column.
    """
    ...
    return df
A huge amount of usage, using a large API surface in a messy way - not easy to do anything with or draw conclusions from.
Mostly just conversions to support pandas dataframes as input/output values. E.g., from convert.py and convert_matrix.py:
def to_networkx_graph(data, create_using=None, multigraph_input=False):
    """Make a NetworkX graph from a known data structure."""
    # Pandas DataFrame
    try:
        import pandas as pd

        if isinstance(data, pd.DataFrame):
            if data.shape[0] == data.shape[1]:
                try:
                    return nx.from_pandas_adjacency(data, create_using=create_using)
                except Exception as err:
                    msg = "Input is not a correct Pandas DataFrame adjacency matrix."
                    raise nx.NetworkXError(msg) from err
            else:
                try:
                    return nx.from_pandas_edgelist(
                        data, edge_attr=True, create_using=create_using
                    )
                except Exception as err:
                    msg = "Input is not a correct Pandas DataFrame edge-list."
                    raise nx.NetworkXError(msg) from err
    except ImportError:
        warnings.warn("pandas not found, skipping conversion test.", ImportWarning)
def from_pandas_adjacency(df, create_using=None):
    try:
        df = df[df.index]
    except Exception as err:
        missing = list(set(df.index).difference(set(df.columns)))
        msg = f"{missing} not in columns"
        raise nx.NetworkXError("Columns must match Indices.", msg) from err

    A = df.values
    G = from_numpy_array(A, create_using=create_using)

    nx.relabel.relabel_nodes(G, dict(enumerate(df.columns)), copy=False)
    return G
And using the .drop method in group.py:
def prominent_group(
    G, k, weight=None, C=None, endpoints=False, normalized=True, greedy=False
):
    import pandas as pd
    ...
    betweenness = pd.DataFrame.from_dict(PB)
    if C is not None:
        for node in C:
            # remove from the betweenness all the nodes not part of the group
            betweenness.drop(index=node, inplace=True)
            betweenness.drop(columns=node, inplace=True)
    CL = [node for _, node in sorted(zip(np.diag(betweenness), nodes), reverse=True)]
A multi-language (streaming) viz and analytics library. The Python version uses pandas in core/pd.py. It uses a small but nontrivial amount of the API, including MultiIndex, CategoricalDtype, and time series functionality.
TODO: the usage of pandas in scikit-learn is very much in flux, and more support for "dataframe in, dataframe out" is being added. So it did not seem to make much sense to just look at the code; rather, it makes sense to have a chat with the people doing the work there.
Added because it comes up a lot. Matplotlib uses just a "dictionary of array-likes" approach, no dependence on pandas directly. So it will work today with other dataframe libraries as well, as long as their columns can convert to a numpy array.
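For reference, the pattern in question is matplotlib's data keyword, which accepts any mapping of array-likes (this is standard matplotlib API, shown here with a plain dict):

import matplotlib.pyplot as plt

# Any mapping of array-likes works; no pandas dependency needed.
data = {"x": [1, 2, 3], "y": [4, 5, 6]}
plt.plot("x", "y", data=data)
plt.show()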
Split from the discussions in #2.
To avoid the trap of "let's just match pandas", let's collect a list of specific problems with the pandas API, which we'll intentionally deviate from. To the extent possible we should limit this discussion to issues with the API, rather than implementation.
- collections.abc.Mapping cannot be implemented, because .values is a property returning an array, rather than a method. (added by @TomAugspurger)
- Indexing via __getitem__ or loc/iloc: it is not explicitly clear to new users what the difference is between df[["a","b","c"]], df[slice(5)], and df[lambda idx: idx % 5 == 0]. (added by @devin-petersohn)
- Attribute access (__getattr__) to get columns, which causes problems for columns that share names with other APIs. (added by @devin-petersohn)
- Duplicated aliases: isna and isnull, multiply and mul, etc. (added by @devin-petersohn)
- Multiple ways to express the same filter: query("a > b") and df[df["a"] > df["b"]]. (added by @devin-petersohn)
- Overlapping ways to access and manipulate data: __getitem__, __getattr__, loc, iloc, apply, drop. (added by @devin-petersohn)
- merge and join call each other and are confusing for new users. (added by @devin-petersohn)
- A separate class for a single column (Series), creating all the complexity of having to reimplement most functionality of dataframe, and not providing a consistent way of applying operations to N columns (including 1). #6 (added by @datapythonista)
- Downstream libraries hard-code support for the pandas types (DataFrame, Series). This makes pandas work well with these libraries, but means it's not easy (or even possible) for other DataFrame implementations to be supported. Lack of interop support between alternative DataFrame implementations and these libraries can be a small but constant annoyance for users, and in some cases a performance issue as well (if data needs to be converted to a pandas object just to get something to work).

- Our Dataset class provides more-detailed data compared to what a "static" tool like Jedi can return; compared to dir, our protocol allows our Dataset class to control which columns, properties, etc. are returned for display in autocomplete dropdowns.
- For each column we can provide the dtype or array subclass name; for Categoricals, we can provide the number of labels/categories.
- This could build on the DataFrame and/or Array APIs (rather than a protocol with a method that returns a more-complex data structure / dictionary).

xref #20 (comment)
It was discussed that the API should be agnostic of execution, including eager/lazy evaluation. I think this is easy when operations return data frames (or columns). For example:
df['value'] + 1
If df['value'] is an in-memory representation, or a lazy expression, the result will likely be the same, and no assumptions need to be made.
But if instead, the result is a scalar:
df['value'].sum()
The output type defined in the API can force certain executions and prevent others, for example if the return type defined in the API is a Python int or float. See this example:
df['value'] + df['value'].sum()
While an implementation could want to keep the result of df['value'].sum() as its C representation for the next operation (the addition), making sum() return a Python object would force the conversion from C to Python and then back to C.
Another example could be Ibis or other SQL-backed implementations. Returning a Python object would cause them to execute a first query for df['value'].sum() and use the result in a second query, while in this example a single SQL query would likely be enough if the computation is delayed until the end.
For the array API it was discussed to use a 0-dimensional array to prevent a similar problem. Assuming we want to do the same for dataframes (and not return a Python object directly), I see two main options:

- Return a 0-dimensional column. In the df['value'] example this could make sense; I'm not so sure in other cases like df.count_rows() (see #20), where we could possibly be interested in applying further operations.
- Return a scalar type/class that wraps a scalar and can be used by implementations to decide how the data is represented, and when it is converted to Python objects... For example, a toy implementation storing the data as a numpy object could look like:

>>> import numpy
>>> class scalar:
...     def __init__(self, value, dtype):
...         self.value = numpy.array(value, dtype=dtype)
...
...     def __repr__(self):
...         return str(self.value)
...
...     def __add__(self, other):
...         return self.value + other
...
>>> result = scalar(12, dtype='int64')
>>> result
12
>>> result + 3
15
This question got asked recently by @mmccarty (and others have brought it up before), so it's worth taking a stab at an answer. Note that this is slightly speculative, given that we only have fragments of a dataframe API rather than a mostly complete syntax + semantics.
A future API, or individual design elements of it, will certainly have (a) new API surface, and (b) backwards-incompatible changes compared to what dataframe libraries already implement. So how should it be made available?
Options include:

- a separate namespace, like .array_api in NumPy/CuPy,
- a new protocol method, like __array_namespace__,
- changing behavior with deprecation cycles (as NumPy did with __array_function__ and more recently with dtype casting rules changes),
- a context manager,
- a from __future__ import new_behavior type import (i.e., new features on a per-module basis).

One important difference between arrays and dataframes is that for the former we only have to think about functions, while for the latter we're dealing with methods on the main dataframe objects. Hiding/unhiding methods is a little more tricky of course: it can be done based on an environment variable set at import time, but it's more annoying with a context manager.
For behavior it's kind of the opposite: likely not all code will work with new behavior, so granular control helps, and a context manager is probably better.
The short summary of this is: the plan is to let the main numpy namespace converge to the array API standard. This takes time because of backwards compatibility constraints, but it will avoid the "double namespaces" problem and have multiple other benefits, for example solving long-standing issues that Numba, CuPy etc. are running into. Therefore, using a separate namespace to implement dataframe API standard features/compatibility should likely not be the preferred solution.
Pandas already has a context manager, namely pandas.option_context. This is used for existing options, see pd.describe_option(). While most features are related to display, styling and I/O, some features that can be controlled are quite large and similar in style to what we'd expect to see in a dataframe API standard. Examples:

- mode.chained_assignment (raise, warn, or ignore)
- mode.data_manager ("block" or "array")
- mode.use_inf_as_null (bool)

It could be used similarly to currently available options, one option per feature:
with pd.option_context('mode.casting_rules', 'api-standard'):
    do_stuff()
Or there could be a single option to switch to "API-compliant mode":
with pd.option_context('mode.api_standard', True):
    do_stuff()
Or both of those together.
Question: do other dataframe libraries have a similar context manager?
from __future__ import

It looks like it's possible to implement features with a from __future__ import itself, via import hooks (see Reference 3 below). That way the spelling would be uniform across libraries, which is nice. Alternatively, a from dflib.__future__ import X is easier (no import hooks), however it runs into the problem also described in Ref 3: it is not desirable to propagate options to nested scopes:
from pandas.__future__ import api_standard_unique
# should use the `unique` behavior described in the API standard
df.unique()
from other_lib import do_stuff
# should NOT use the `unique` behavior described in the API standard,
# because that other library is likely not prepared for that.
do_stuff(df)
Now of course this scope propagation is also what a context manager does. However, the point of a from __future__ import and jumping through the hoops required to make that work (= more esoteric than a context manager) is to gain a switch that is local to the Python module in which it is used.
Context manager vs. from __future__ import

For new functions, methods and objects, both are pretty much equivalent, since they will only be used on purpose (the scope propagation issue above is irrelevant).

For changes to existing functions or methods, both will work too. The module-local behavior of a from __future__ import is probably preferred, because code that's imported from another library that happens to use the same functionality under the hood may not expect the different result/behavior.

For behavior changes there's an issue with the from __future__ import: the import hooks will rely on AST transforms, so there must be some syntax to trigger on. With something that's very implicit, like casting rules, there is no such syntax. So it seems like there will be no good way to toggle that behavior on a module-scope level.
A from __future__ import xxx is perhaps best for adoption of changes to existing functions or methods: it has a configurable level of granularity and is explicit, so it should be more robust there than a context manager.

Is its use similar to that in Arrow, such that if you slice a string array, it is still backed by the same buffers, with the offset and length of the column conveying which part of the buffer should be used?
If that is the case, the offset can always be 0 for NumPy and primitive Arrow arrays (except for Arrow booleans, since they are bits), because we can always slice them, right?
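As a quick check of that reading, pyarrow's slicing is indeed zero-copy, with the offset recording where the view starts (pa.Array.slice and the offset attribute are existing pyarrow APIs):

import pyarrow as pa

arr = pa.array(["gold", "silver", "bronze"])
view = arr.slice(1, 2)  # zero-copy: shares the parent's validity/offsets/data buffers

print(view.offset)  # 1, i.e. where the view starts within the shared buffers
print(view)         # ["silver", "bronze"]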
Regarding column names, the following proposal, similar to what pandas currently does, uses a columns property to set and get column names.
In #7, the preference is to restrict column names to string, and not allow duplicates.
The proposed API with an example is:
>>> df = dataframe({'col1': [1, 2], 'col2': [3, 4]})
>>> df.columns = 'foo', 'bar'
>>> df.columns = ['foo', 'bar']
>>> df.columns = map(str.upper, df.columns)
>>> df.columns
['FOO', 'BAR']
And the following cases would fail:
>>> df.columns = 1
TypeError: Columns must be an iterable, not int
>>> df.columns = 'foo'
TypeError: Columns must be an iterable, not str
>>> df.columns = 'foo', 1
TypeError: Column names must be str, int found
>>> df.columns = 'foo', 'bar', 'foobar'
ValueError: Expected 2 column names, found 3
>>> df.columns = 'foo', 'foo'
ValueError: Column names cannot be duplicated. Found duplicates: foo
Some things that people may want to discuss:

- The name of the property (columns, or something else such as column_names).
- Whether setting a single element should be allowed, e.g. df.columns[0] = 'foo' (the proposal doesn't allow it).
- Whether a scalar should be allowed for a single-column dataframe, e.g. df.columns = 'foo' (the proposal requires an iterable, so df.columns = ['foo'] or equivalent is needed).

In case it's useful, this is the implementation of the examples:
import collections.abc
import typing

class dataframe:
    def __init__(self, data):
        self._columns = list(data)

    @property
    def columns(self) -> typing.List[str]:
        return self._columns

    @columns.setter
    def columns(self, names: typing.Iterable[str]):
        if not isinstance(names, collections.abc.Iterable) or isinstance(names, str):
            raise TypeError(f'Columns must be an iterable, not {type(names).__name__}')
        names = list(names)
        for name in names:
            if not isinstance(name, str):
                raise TypeError(f'Column names must be str, {type(name).__name__} found')
        if len(names) != len(self._columns):
            raise ValueError(f'Expected {len(self._columns)} column names, found {len(names)}')
        if len(set(names)) != len(names):
            duplicates = set(name for name in names if names.count(name) > 1)
            raise ValueError(f'Column names cannot be duplicated. Found duplicates: {", ".join(duplicates)}')
        self._columns = names
One of the "to be decided" items at https://github.com/data-apis/dataframe-api/blob/dataframe-interchange-protocol/protocol/dataframe_protocol_summary.md#to-be-decided is:
Should there be a standard from_dataframe constructor function? This isn't completely necessary, however it's expected that a full dataframe API standard will have such a function. The array API standard also has such a function, namely from_dlpack. Adding at least a recommendation on syntax for this function would make sense, e.g., from_dataframe(df, stream=None). Discussion at #29 (comment) is relevant.
In the announcement blog post draft I tentatively answered that with "yes", and added an example. The question is what the desired signature should be. The Pandas prototype currently has the most basic signature one can think of:
def from_dataframe(df : DataFrameObject) -> pd.DataFrame:
    """
    Construct a pandas DataFrame from ``df`` if it supports ``__dataframe__``
    """
    if isinstance(df, pd.DataFrame):
        return df

    if not hasattr(df, '__dataframe__'):
        raise ValueError("`df` does not support __dataframe__")

    return _from_dataframe(df.__dataframe__())
The above just takes any dataframe supporting the protocol and turns the whole thing into the "library-native" dataframe. Now of course, it's possible to add functionality to it, to extract only a subset of the data. Most obviously, named columns:
def from_dataframe(df: DataFrameObject, *, colnames: Optional[Iterable[str]] = None) -> pd.DataFrame:
Other things we may or may not want to support: selecting columns by position, or retrieving only a subset of chunks.

My personal feeling is:

- column selection by position could be supported via a col_indices=None keyword;
- for chunks, it's better to call __dataframe__ first, then inspect some metadata, and only then decide what chunks to get.

Thoughts?
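For concreteness, a minimal sketch of the colnames variant; convert_column here is a hypothetical helper, standing in for the per-column parsing of dtype, describe_null and get_buffers:

from typing import Iterable, Optional

import pandas as pd

def convert_column(col) -> pd.Series:
    # Hypothetical helper: real code would parse col.dtype, col.describe_null
    # and col.get_buffers(); elided to keep the sketch focused on the signature.
    raise NotImplementedError

def from_dataframe(df, *, colnames: Optional[Iterable[str]] = None) -> pd.DataFrame:
    dfobj = df.__dataframe__()
    names = list(colnames) if colnames is not None else list(dfobj.column_names())
    return pd.DataFrame({name: convert_column(dfobj.get_column_by_name(name)) for name in names})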
Interactive users will want to control how the data is displayed. This might include sorting the view; coloring cells, columns, or rows; precision digits; or moving columns to the left. It may also interact with auto complete.
It is common practice to separate the view from the data (many applications can display data in a SQL database in different ways).
I believe that we need to define an interface for the display data class (for instance, an ordered dictionary mapping strings to arrays is the simplest interface; additional kwargs might include display attributes for rows or columns, and there might be header or footer information).

Thus, I believe it is in scope to define an interface so that multiple developers can write their own display data class. Almost every demonstration needs a way to display large amounts of data well.
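To make that concrete, a minimal sketch of such an interface; every name here is hypothetical, only meant to show the shape of the idea:

from typing import Any, Dict, Sequence

def get_display_data(df, max_rows: int = 100, **display_attrs: Any) -> Dict[str, Sequence[Any]]:
    """Hypothetical hook: return an ordered mapping of column name to
    display-ready values, truncated for large data; display_attrs could
    carry per-row/per-column styling, headers or footers."""
    return {name: list(df[name][:max_rows]) for name in df.columns}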
This was brought up by @jorisvandenbossche: if two libraries both use the same library for in-memory data storage (e.g. buffers/columns are backed by NumPy or Arrow arrays), can we avoid iterating through each buffer on each column by directly handing over that native representation?
This is a similar question to https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol_summary.md#what-is-wrong-with-to_numpy-and-to_arrow - but it's not the same, there is one important difference. The key point of that FAQ entry is that it's consumers who should rely on NumPy/Arrow, and not producers. Having a to_numpy() method somewhere is at odds with that. Here is an alternative:
1. A Column instance may define __array__ or __arrow_array__ if and only if the column itself is backed by a single NumPy array or a single Arrow array.
2. DataFrame and Buffer instances must not define __array__ or __arrow_array__.

(1) is motivated by wanting a simple shortcut like this:
# inside `from_dataframe` constructor
for name in df.column_names():
    col = df.get_column_by_name(name)
    # say my library natively uses Arrow:
    if hasattr(col, '__arrow_array__'):
        # apparently we're both using Arrow, take the shortcut
        columns[name] = col.__arrow_array__()
    elif ...:  # continue parsing dtypes, null values, etc.
However, there are other constraints then. For __array__ this also implies:

- nulls must be represented as NaN or a sentinel value (and this needs checking first in the code above, otherwise the consumer may still misinterpret the data).
I cannot think of issues right away. Of course the producer should also be careful to ensure that there are no differences in behavior due to adding one of these methods. For example, if there's a dataframe with a nested dtype that is supported by Arrow but not by the protocol, calling __dataframe__()
should raise because of the unsupported dtype.
The main pro of doing this is: a fast path that avoids walking through buffers and dtypes when producer and consumer happen to share the same backing library.

The main con is: it adds constraints on producers and extra ways for consumers to misinterpret the data.
My impression is: this may be useful to do for __arrow_array__; I don't think it's a good idea for __array__, because the gain is fairly limited and there are too many constraints or ways to get it wrong (e.g. describe_null must always be checked before using __array__). If __array__ is to be added, then maybe at the Buffer level, where it plays the same role as __dlpack__.
xref gh-26 for some discussion on categorical dtypes.

The dtype is called category in pandas. See the pandas.Categorical docs:
>>> df = pd.DataFrame({"A": [1, 2, 5, 1]})
>>> df["B"] = df["A"].astype("category")
>>> df.dtypes
A int64
B category
dtype: object
>>> col = df['B']
>>> col.dtype
CategoricalDtype(categories=[1, 2, 5], ordered=False)
>>> col.values.ordered
False
>>> col.values.codes
array([0, 1, 2, 0], dtype=int8)
>>> col.values.categories
Int64Index([1, 2, 5], dtype='int64')
>>> col.values.categories.values
array([1, 2, 5])
The dtype is called "dictionary-encoded" in Arrow, so an array with a categorical dtype is called a "dictionary-encoded array" there.
See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions for details.
A practical example (from @kkraus14 in gh-38), for a categorical column of ['gold', 'bronze', 'silver', null, 'bronze', 'silver', 'gold'] with categories of ['gold' < 'silver' < 'bronze']:
categorical column: {
mask_buffer: [119], # 01110111 in binary
data_buffer: [0, 2, 1, 127, 2, 1, 0], # the 127 value in here is undefined since it's null
children: [
string column: {
mask_buffer: None,
offsets_buffer: [0, 4, 10, 16],
data_buffer: [103, 111, 108, 100, 115, 105, 108, 118, 101, 114, 98, 114, 111, 110, 122, 101]
}
]
}
struct ArrowSchema {
// Array type description
const char* format;
const char* name;
const char* metadata;
int64_t flags;
int64_t n_children;
struct ArrowSchema** children;
struct ArrowSchema* dictionary; // the categories
...
};
struct ArrowArray {
// Array data description
int64_t length;
int64_t null_count;
int64_t offset;
int64_t n_buffers;
int64_t n_children;
const void** buffers;
struct ArrowArray** children;
struct ArrowArray* dictionary;
...
};
Also see https://arrow.apache.org/docs/python/data.html#dictionary-arrays for what PyArrow does - it matches the current exchange protocol more closely than the Arrow C Data Interface. E.g., it uses an actual Python dictionary for the mapping of values to categories.
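For example, using existing pyarrow APIs (the indices/dictionary split mirrors the child-column layout above):

import pyarrow as pa

arr = pa.array(['gold', 'bronze', 'silver', None, 'bronze', 'silver', 'gold'])
dict_arr = arr.dictionary_encode()

print(dict_arr.indices)     # [0, 1, 2, null, 1, 2, 0], the codes, with a validity bit for null
print(dict_arr.dictionary)  # ["gold", "bronze", "silver"], the categories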
EDIT: Vaex's API was done pre Arrow integration, and will change to match Arrow in the future.
>>> import vaex
>>> df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
>>> df = df.categorize('year', min_value=2020, max_value=2019)
>>> df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
>>> df.dtypes
year int64
weekday int64
dtype: object
>>> df.is_category('year')
True
>>> df.is_category('weekday')
True
>>> df._categories
{'year': {'labels': [], 'N': 0, 'min_value': 2020}, 'weekday': {'labels': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], 'N': 7, 'min_value': 0}}
This is the current form in gh-38 for the Pandas implementation of the exchange protocol:
>>> col = df.__dataframe__().get_column_by_name('B')
>>> col
<__main__._PandasColumn object at 0x7f0202973211>
>>> col.dtype # kind, bitwidth, format-string, endianness
(23, 64, '|O08', '=')
>>> col.describe_categorical # is_ordered, is_dictionary, mapping
(False, True, {0: 1, 1: 2, 2: 5})
>>> col.describe_null # kind (2 = sentinel value), value
(2, -1)
What we already determined needs changing:

- Add a get_children() method, and store the mapping that is now in Column.describe_categorical in a child column instead. Note that child columns are also needed for variable-length strings.

To discuss:
- If dtype is the logical dtype for the column, where do we store how to interpret the actual data buffer? Right now this is done not in a static attribute, but by returning the dtype along with the buffer when accessing it:

def get_data_buffer(self) -> Tuple[_PandasBuffer, _Dtype]:
    """
    Return the buffer containing the data.
    """
    _k = _DtypeKind
    if self.dtype[0] in (_k.INT, _k.UINT, _k.FLOAT, _k.BOOL):
        buffer = _PandasBuffer(self._col.to_numpy())
        dtype = self.dtype
    elif self.dtype[0] == _k.CATEGORICAL:
        codes = self._col.values.codes
        buffer = _PandasBuffer(codes)
        dtype = self._dtype_from_pandasdtype(codes.dtype)
    else:
        raise NotImplementedError(f"Data type {self._col.dtype} not handled yet")

    return buffer, dtype
- What goes in the data buffer on the column? The category-encoded data (the codes) makes sense, because the buffer needs to be the same size as the column (number of elements); otherwise it would be inconsistent with other dtypes.
This issue is meant to collect libraries that we should be aware of and perhaps take into account (data on how their API looks, impact of choices on those libraries, etc.).
See data-apis/array-api#3 for relevant array libraries.
The following is a list of API candidates for standardization.
abs, floordiv, pow, round, truediv, add, diff, div, mod, mul, sub

Any need for the r* variants (e.g., radd, rmul, etc.)?
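One reason the r* variants matter: Python only dispatches to the right-hand operand when the left-hand type gives up, and the named methods additionally expose things like fill_value in pandas. A small illustration using standard pandas methods:

import pandas as pd

s = pd.Series([1, 2, 3])

# 5 - s works because int.__sub__ returns NotImplemented and Python then
# calls Series.__rsub__(5); without a reflected variant this would raise.
print(5 - s)      # [4, 3, 2]
print(s.rsub(5))  # the named method, equivalent here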
corr, count, cov, eq, ge, gt, le, lt, ne, isin, isna, notna

isna or isnull? notna or notnull?

where, append, assign, copy, drop, drop_duplicates, dropna, fillna, head, join, pop, rename, replace, set_index, tail, take, sort_values, all, any
In #2 there seems to be some agreement that row-labels are an important component of a dataframe. Pandas takes this a step further by using them for alignment in many operations involving multiple dataframes.
In [10]: a = pd.DataFrame({"A": [1, 2, 3]}, index=['a', 'b', 'c'])
In [11]: b = pd.DataFrame({"A": [2, 3, 1]}, index=['b', 'c', 'a'])
In [12]: a
Out[12]:
A
a 1
b 2
c 3
In [13]: b
Out[13]:
A
b 2
c 3
a 1
In [14]: a + b
Out[14]:
A
a 2
b 4
c 6
In the background there's an implicit a.align(b), which reindexes the dataframes to a common index. The resulting index will be the union of the two indices.
A few other places this occurs: pd.concat.

Do we want to adopt this behavior for the standard?
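Worth noting when deciding: with partially overlapping indices, the implicit alignment also introduces missing values. A small pandas example:

import pandas as pd

a = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])
b = pd.DataFrame({"A": [2, 3, 1]}, index=["b", "c", "d"])

# The result is reindexed to the union of the indices; labels present in
# only one operand produce NaN (and the column dtype becomes float64).
print(a + b)
#      A
# a  NaN
# b  4.0
# c  6.0
# d  NaN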
I think it's useful to think through concrete use cases on how the interchange protocol could be used, to see if it covers those use cases / the desired APIs are available.
One example use case could be matplotlib's plot("x", "y", data=obj), where matplotlib already supports getting the x and y columns of any "indexable" object. Currently they require obj["x"] to give the desired data, so in theory this support could be extended to any object that supports the dataframe interchange protocol. At the same time, matplotlib currently also needs those data (AFAIK) as numpy arrays, because the low-level plotting code is implemented in such a way.
With the current API, matplotlib could do something like:
df = obj.__dataframe__()
x_values = some_utility_func(df.get_column_by_name("x").get_buffers())
where some_utility_func can convert the dict of Buffer objects to a numpy array (once numpy supports dlpack, converting the Buffer objects to numpy will become easy, but the function will then still need to handle potentially multiple buffers returned from get_buffers()).
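For a sense of what some_utility_func involves, a rough sketch covering only the simplest case; it assumes a single-chunk numeric column with no validity or offsets buffers, and that the protocol's integer dtype-kind codes are e.g. INT=0 and FLOAT=2:

import ctypes

import numpy as np

def some_utility_func(buffers, length):
    # Sketch only: handles a contiguous numeric "data" buffer and ignores
    # the "validity" and "offsets" entries that get_buffers() may return.
    buf, (kind, bitwidth, fmt, endianness) = buffers["data"]
    np_dtype = {(0, 64): np.int64, (2, 64): np.float64}[(kind, bitwidth)]  # tiny subset
    # Reinterpret the raw memory behind the protocol Buffer (ptr/bufsize).
    raw = (ctypes.c_char * buf.bufsize).from_address(buf.ptr)
    return np.frombuffer(raw, dtype=np_dtype, count=length)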
That doesn't seem ideal: 1) writing the some_utility_func to do the conversion to numpy is non-trivial to implement for all the different cases, and 2) should an end-user library have to go down to the Buffer objects?
This isn't a pure interchange from one dataframe library to another, so we could also say that this use case is out-of-scope at the moment. But on the other hand, it seems a typical use case example, and could in theory already be supported right now (it only needs the "dataframe api" to get a column, which is one of the few things we already provide).
(disclaimer: I am not a matplotlib developer, I also don't know if they for example have efforts to add support for generic array-likes (but it's nonetheless a typical example use case, I think))
The code now uses NumPy format strings, while the docs for Column.dtype specify it must use the format string from the Apache Arrow C Data Interface (similar but slightly different). So we need a utility to map NumPy to Arrow format strings here.

Example - it should say 'b', not '|b1':
>>> df = pd.DataFrame({"A": [True, False, False, True]})
>>> df.__dataframe__().get_column_by_name('A').dtype
(<_DtypeKind.BOOL: 20>, 8, '|b1', '|')
Sources:
https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings
https://numpy.org/doc/stable/reference/arrays.interface.html#arrays-interface
https://numpy.org/doc/stable/reference/generated/numpy.dtype.itemsize.html
https://numpy.org/doc/stable/reference/generated/numpy.dtype.byteorder.html
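A minimal sketch of such a mapping utility; the Arrow format characters come from the C Data Interface spec linked above, while the helper itself and its coverage are illustrative only:

import numpy as np

# NumPy typestr (without byteorder) -> Arrow C Data Interface format string
_NP_TO_ARROW = {
    "b1": "b",                        # boolean
    "i1": "c", "u1": "C",             # int8 / uint8
    "i2": "s", "u2": "S",             # int16 / uint16
    "i4": "i", "u4": "I",             # int32 / uint32
    "i8": "l", "u8": "L",             # int64 / uint64
    "f2": "e", "f4": "f", "f8": "g",  # float16 / float32 / float64
}

def arrow_format_string(dtype) -> str:
    dtype = np.dtype(dtype)
    key = dtype.str.lstrip("<>|=")  # NumPy typestrs look like '|b1' or '<i8'
    try:
        return _NP_TO_ARROW[key]
    except KeyError:
        raise NotImplementedError(f"no Arrow format string for {dtype}")

# e.g. arrow_format_string('|b1') -> 'b', arrow_format_string('<i8') -> 'l'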
In some cases users like to use Array API functions (for example where) on DataFrame objects (in particular Series). Is this something that we would like to support in the API? If not, how would we recommend users approach these kinds of problems?
For an example of this please see issue ( dask/distributed#5224 )