data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard

Home Page: https://data-apis.org/dataframe-api/draft/index.html

License: MIT License


dataframe-api's Issues

Testing a `DataFrameObject` consumption from another dataframe library

One of the main uses of the interchange protocol is consuming a DataFrameObject produced by another dataframe library:

myDFLibrary.from_dataframe(anotherDFLibrary.__dataframe__())

Currently, we can only write tests with a DataFrameObject from the same library in which we are implementing the protocol:

myDFLibrary.from_dataframe(myDFLibrary.__dataframe__())

In the case of cudf (GPU dataframe), we can't write tests for the scenario where the DataFrameObject lives on the CPU, as it does for pandas. This is a very important use case of the protocol specification to test, since it verifies that device transfer is handled properly.

I don't know how to go about this. The objective of this issue is to think collectively about a way to mock up these kinds of use cases.
An example where mocking up is relatively simple is the chunks feature, which cudf does not support the way pandas does. To mock up chunks, we can create many DataFrameObjects from slices of a few rows of a given DataFrameObject, as sketched below.
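For example, a minimal sketch of the chunk mock-up idea, assuming a producer (here, a pandas version that implements __dataframe__()); the helper name is made up:

import pandas as pd

def mock_chunked_dataframe_objects(df: pd.DataFrame, rows_per_chunk: int = 2):
    # Yield one interchange object per slice of `rows_per_chunk` rows, so a
    # consumer can be tested against "chunked" input even if the producer
    # itself never chunks its data.
    for start in range(0, len(df), rows_per_chunk):
        yield df.iloc[start:start + rows_per_chunk].__dataframe__()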

Higher-dimensional "columns"

This is not even half-baked, but I wanted to gauge interest/feasibility for the spec to encapsulate n-dimensional "columns" of data, equivalent to xarray's DataArrays. In that case, the currently-envisioned columns would be the 1D specific case of a higher-D general case. We've found that in some use cases we need these in napari (napari/napari#2592, napari/napari#2917), and it would be awesome to conform to the dataframe API and be compatible with both xarray and pandas.

Of course, the other way around is to ignore the higher-D libraries, and have them conform to the API once it's settled. That might be more reasonable, in which case I'm perfectly happy for this to be closed. 😊

Null description in case of no missing value

In the protocol, a validity mask is one of the missing-value representations a Column can use, and describe_null() is meant to describe how missing values are represented at the column level for a given dtype.

In the case where there are no missing values, shall we still provide a validity array with 1 (valid) at all entries, or
shall we raise an exception?
From my perspective, the latter is better, because we can just check null_count == 0 without allocating and filling a whole array with the same value. That is how it works in the cudf dataframe: if there are no missing values, accessing the nullmask attribute (which holds the validity array) raises an exception with the message "Column has no null mask".
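A sketch of the consumer-side logic this would enable; the accessor name below is hypothetical, loosely modelled on the prototype:

def get_validity_or_none(column):
    # Only request a validity mask when the column actually has missing values,
    # so producers never need to materialize an all-valid mask.
    if column.null_count == 0:
        return None
    return column.get_mask()  # hypothetical accessor for the validity buffer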

Document the mutability assumptions for the interop protocol

This was just asked about at https://twitter.com/__AlexMonahan__/status/1430522318854377475. I'd say we should make a similar argument to the one in https://data-apis.org/array-api/latest/design_topics/copies_views_and_mutation.html. We cannot prevent mutations in the protocol itself, and existing libraries may already have APIs that mutate in-place. So we should recommend against it (but of course power users that really understand what they are doing could go right ahead and mutate to their hearts' content - this is the same as for the buffer protocol, DLPack et al.).

There's an earlier discussion at #10 (which was unrelated to the protocol). The general sentiment was that in-place operations must be avoided for a full dataframe API.

Right now https://data-apis.org/dataframe-protocol/latest/ says nothing about mutability; this should be added.

Study on the pandas API: What is the most commonly used?

I have spent a lot of time trying to understand users and their behaviors in order to optimize for them. As a part of this work, I have done numerous studies on what gets used in pandas.

This will be extremely useful when it comes to defining a dataframe standard, because what people are using can help inform us on what behaviors to support.

For this study, we scraped the top 6000 notebooks from Kaggle by upvote.

Repo here, reproduction script included: https://github.com/modin-project/study_kaggle_usage

Results here: results.csv

Data exchange formats

Based on what is defined in wesm/dataframe-protocol#1, the idea is not to support a single format to exchange data, but to support multiple (e.g. Arrow, NumPy).

Here is a code example to illustrate what this approach implies.

1. Dataframe implementations should implement __dataframe__, returning the exchange object we are defining

For example, let's assume Vaex is using Arrow, and it wants to offer its data in Arrow format to consumers:

import pyarrow


class VaexExchangeDataFrame:
    """
    The format defined by our spec.
    
    Besides `to_arrow` and `to_numpy`, it should implement the rest of
    the spec: `num_rows`, `num_columns`, `column_names`, ...
    """
    def __init__(self, arrow_data):
        self.arrow_data = arrow_data

    def to_arrow(self):
        return self.arrow_data

    def to_numpy(self):
        raise NotImplementedError('numpy format not implemented')
    
class VaexDataFrame:
    """
    The public Vaex dataframe class.

    For simplicity of the example, this just wraps an arrow object received in the constructor,
    but this would be the whole `vaex.DataFrame`.
    """
    def __init__(self, arrow_data):
        self.arrow_data = arrow_data
        
    def __dataframe__(self):
        return VaexExchangeDataFrame(self.arrow_data)

# Creating an instance of the Vaex public dataframe
vaex_df = VaexDataFrame(pyarrow.RecordBatch.from_arrays([pyarrow.array(['pandas', 'vaex', 'modin'],
                                                                       type='string'),
                                                         pyarrow.array([26_300, 4_900, 5_200],
                                                                       type='uint32')],
                                                        ['name', 'github_stars']))

Other implementations could use formats different from Arrow, for example, let's assume Modin wants to offer its data as numpy arrays:

import numpy


class ModinExchangeDataFrame:
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def to_arrow(self):
        raise NotImplementedError('arrow format not implemented')

    def to_numpy(self):
        return self.numpy_data


class ModinDataFrame:
    def __init__(self, numpy_data):
        self.numpy_data = numpy_data

    def __dataframe__(self):
        return ModinExchangeDataFrame(self.numpy_data)


modin_df = ModinDataFrame({'name': numpy.array(['pandas', 'vaex', 'modin'], dtype='object'),
                           'github_stars': numpy.array([26_300, 4_900, 5_200], dtype='uint32')})

2. Direct consumers should be able to understand all formats

For example, pandas could implement a from_dataframe function to create a pandas dataframe from different formats:

import pandas

def from_dataframe(dataframe):
    known_formats = {'numpy': lambda df: pandas.DataFrame(df),
                     'arrow': lambda df: df.to_pandas()}

    exchange_dataframe = dataframe.__dataframe__()
    for format_ in known_formats:
        try:
            data = getattr(exchange_dataframe, f'to_{format_}')()
        except NotImplementedError:
            pass
        else:
            return known_formats[format_](data)

    raise RuntimeError('Dataframe does not support any known format')

pandas.from_dataframe = from_dataframe

This would allow pandas users to load data from other formats:

pandas_df_1 = pandas.from_dataframe(vaex_df)
pandas_df_2 = pandas.from_dataframe(modin_df)

Vaex, Modin and any other implementation could implement an equivalent function to load data from other
libraries into their formats.

3. Indirect consumers can pick an implementation, and use it to standardize their input

For example, Seaborn may want to accept any dataframe implementation, but wants to write its code (the access to the data) against pandas. It could convert any dataframe to pandas, using the from_dataframe function from the previous section:

def seaborn_bar_plot(any_dataframe, x, y):
    pandas_df = pandas.from_dataframe(any_dataframe)
    return pandas_df.plot(kind='bar', x=x, y=y)

seaborn_bar_plot(vaex_df, x='name', y='github_stars')

Are people happy with this approach?

CC: @rgommers

Dataframe namespaces

In #10, it's been discussed that it would be convenient if the dataframe API allows method chaining. For example:

import pandas

(pandas.read_csv('countries.csv')
       .rename(columns={'name': 'country'})
       .assign(area_km2=lambda df: df['area_m2'].astype(float) / 1_000_000)
       .query('(continent.str.lower() != "antarctica") | (population < area_km2)'))

This implies that most functionality is implemented as methods of the dataframe class. Based on pandas, the number of methods can be 300 or more, so it may be problematic to implement everything in the same namespace. pandas uses a mixed approach, with different techniques to try to organize the API.

Approaches

Top-level methods

df.sum()
df.astype()

Many of the methods are simply implemented directly as methods of dataframe.

Prefixed methods

df.to_csv()
df.to_parquet()

Some of the methods are grouped with a common prefix.

Accessors

df.str.lower()
df.dt.hour()

Accessors are a property of dataframe (or series, but assuming only one dataframe class for simplicity) that groups some methods under it.

Functions

pandas.wide_to_long(df)
pandas.melt(df)

In some cases, functions are used instead of methods.

Functional API

df.apply(func)
df.applymap(func)

pandas also provides a more functional API, where functions can be passed as parameters.

Standard API

I guess we will agree that a uniform and consistent API would be better for the standard. That should make things easier to implement, and also provide a more intuitive experience for the user.

Also, I think it would be good for the API to be easily extensible. A couple of examples of how pandas can be extended with custom functions:

@pd.api.extensions.register_dataframe_accessor('my_accessor')
class MyAccessor:
    def my_custom_method(self):
        return True

df.my_accessor.my_custom_method()
df.apply(my_custom_function)
df.apply(numpy.sum)

Conceptually, I think there are some methods that should be grouped together, not so much by topic as by the API they follow. The clearest example is reductions, and there was some discussion in #11 (comment).

I think no solution will be perfect, and the options that we have are (feel free to add to the list if I'm missing any option worth considering):

Top-level methods

df.sum()

Prefixed methods

df.reduce_sum()

Accessors

df.reduce.sum()

Functions

mod.reductions.sum(df)

mod represents the implementation module (e.g. pandas)

Functional API

df.reduce(mod.reductions.sum)

Personally, my preference is the functional API. I think it's the simplest option that keeps things organized, and the simplest to extend. The main drawback is readability: it may be too verbose. There is the option of allowing a string instead of the function for known functions (e.g. df.reduce('sum')), as sketched below.
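As a rough illustration of that option (all names and classes here are made up, not part of any library):

KNOWN_REDUCTIONS = {}

def register_reduction(name):
    # Register a reduction under a string name so it can be referenced as
    # df.reduce('name') as well as df.reduce(function).
    def decorator(func):
        KNOWN_REDUCTIONS[name] = func
        return func
    return decorator

@register_reduction('sum')
def reduction_sum(values):
    return sum(values)

class DataFrame:
    def __init__(self, data):
        self._data = data  # dict of column name -> list of values

    def reduce(self, func):
        # Accept either a callable or the string name of a known reduction.
        if isinstance(func, str):
            func = KNOWN_REDUCTIONS[func]
        return {name: func(values) for name, values in self._data.items()}

# DataFrame({'a': [1, 2]}).reduce('sum') and DataFrame({'a': [1, 2]}).reduce(reduction_sum)
# both return {'a': 3}.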

Thoughts? Other ideas?

Super columns as nested dataframes

One optional feature to consider including in our specification is nested "DataFrames"/Tables (or whatever name we decide to use there).

riptide does not currently support this concept, but I've been thinking recently that maybe it ought to since it provides a cleaner, more elegant solution for supporting "super columns".

Such "super columns" arise (for example) when performing multiple aggregates on a DataFrame, especially in the common case of grouping by the values/keys in one or more columns then calculating per-group reductions over some subset of the remaining columns.

pandas currently handles this scenario using the concepts of index "levels" and row labeling. This solves the problem but adds a good bit of complexity to the API, including having a stateful DataFrame class. (One could argue the statefulness of DataFrame is a pandas implementation detail of the approach and not inherently a drawback of the approach itself.)

riptide currently handles this scenario by having a Multiset class derived from our Dataset class, where Multiset is basically just a dictionary of named Datasets. This works ok and isn't that far removed from the notion of having nested Dataset instances -- and if you squint just right when you look at Multiset, it's not really that different from the pandas system of index levels + row labels. However, Multiset has its own drawbacks. Most notably, deriving from our Dataset class means any function that knows how to operate on a Dataset also (in many cases) needs to know how to work with a Multiset in order to produce the "correct" output (in terms of the type, dimensions, and data) -- what's the expected output if one calls a 'merge'-type function with a Multiset and a Dataset, or with two Multisets?

pyarrow's documentation for pandas interoperability says its Table class already supports nested DataFrames / column groups, although that's the only mention of this behavior I can find.

Nested DataFrames provide a clean solution for representing the "super columns" resulting from these multi-aggregation operations; specifically, they:

  1. Remove the complexity of index levels / row labels, or at least the need to support them. This means the spec will be simpler and easier to implement.
  2. Fix the messy OO / inheritance situation I described above with riptide's Multiset class. Each "column" in a DataFrame is either a 1D array (or maybe an array scalar?) or another DataFrame of the same length.
  3. Allow for (some) unification of the array and DataFrame APIs. For example, implementing a function like .sum() on a DataFrame can distribute that function call over its columns by just iterating over them and calling .sum() on each of them.
  4. (maybe) Could be used as a way to interoperate with other DataFrame implementations which do allow multiple occurrences of a column with the same name. When converted to the nested representation, those columns might (e.g.) get names like "1" and "2", "x" and "y", or use the name of the DataFrame they came from; they're then added to a new DataFrame containing just those columns, and that DataFrame is added as a (nested) column to the resulting/outer DataFrame.

One way the nesting makes things more complicated (maybe) is what to return from a property like DataFrame.num_cols -- should it be the number of columns as seen by that DataFrame instance, or should it be a flattened value (counting the columns from any nested DataFrames as well)? I think this could be disambiguated by having two properties like DataFrame.num_cols and DataFrame.num_cols_flat, as sketched below.
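A toy sketch of how those two properties could behave; this is purely illustrative, not an existing implementation:

class NestedDataFrame:
    def __init__(self, columns):
        # columns: dict mapping name -> 1-D array-like or NestedDataFrame
        self._columns = columns

    @property
    def num_cols(self):
        # Columns as seen by this instance (a nested frame counts as one column).
        return len(self._columns)

    @property
    def num_cols_flat(self):
        # Recursively count the leaf columns inside any nested dataframes.
        total = 0
        for col in self._columns.values():
            if isinstance(col, NestedDataFrame):
                total += col.num_cols_flat
            else:
                total += 1
        return total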

Dataframe class name

I think there is consensus (correct me if I'm wrong) on having a 2-D structure where (at least) columns are labelled, and where a whole column shares a type. More specific discussions about this structure can be made in #2.

In this issue, I'd like to discuss how we should name the class representing this structure. We've been using dataframe for the concept so far, and it's how the class is named in pandas, vaex, Modin, R and others. But in #14 (comment) it was proposed that we consider other names. I list here the options proposed in that comment and a couple more. I propose that people write their username next to their preferred option, and use the comments to expand on why if needed.

Also, I think we should decide on capitalization. I guess these are the only options (using dataframe as an example, but applied to the preferred option from the list above):

Release the spec as PyPI package

Hello everyone!

I've been mulling over introducing the Dataframe Exchange protocol in Pandas and Modin, and I think it would be beneficial for every end library implementing the protocol to have the exact same base.

Right now the protocol interface is defined by code, but said code is not "published" as a ready-to-use Python package.

I would like to make it a real PyPI package to use it in type hinting and (ideally) mypy type checking and to enable other libraries to do the same.

I propose to publish the package as dataframe-protocol or df-protocol and rename the protocol/ directory to df_protocol, turning it into a real Python package.
That way, any library implementing the protocol could just from df_protocol import exchange and use it for type hints (and for enum values - right now they're embedded in docstrings, which just looks really weird to me).

Am I missing something here? Are there any objections?

I can make the PR with the necessary changes if there is agreement, and can keep it published on both PyPI and conda-forge (or hand publishing over to someone else in the consortium).

P.S. Keeping the top-level df_protocol would allow us to add another subpackage for a cross-operation API if/when we feel ready for that (keeping this future-proof).
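To make the idea concrete, here is a sketch of what such a published exchange module could contain. Everything below (module layout, class and enum names) is an assumption for illustration, not an existing package, though the kind codes mirror the ones used in the prototype code elsewhere in this document:

from __future__ import annotations

import enum
from typing import Any, Iterable, Protocol

class DtypeKind(enum.IntEnum):
    # Kind codes as used in the prototype (assumed here for illustration).
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20

class Column(Protocol):
    @property
    def dtype(self) -> tuple[Any, ...]: ...
    @property
    def null_count(self) -> int: ...

class DataFrame(Protocol):
    def num_rows(self) -> int: ...
    def num_columns(self) -> int: ...
    def column_names(self) -> Iterable[str]: ...
    def get_column_by_name(self, name: str) -> Column: ...

# An implementing library could then write, e.g.:
#   def from_dataframe(df: DataFrame) -> MyLibraryDataFrame: ...
# and have mypy check the protocol usage.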

Separate object for a dataframe column? (is Series needed?)

Probably a bit early to the discussion, but I think this will need to be discussed eventually.

Is a separate object representing a single column needed? For example, having Series, instead of just using a one-column DataFrame.

Having two separate objects IMO adds a decent amount of complexity, both in the implementation and for the user. Whether this complexity is worth it or not, I don't know. But I think this shouldn't be replicated from pandas without a discussion.

How to consume a single buffer & connection to array interchange

For dataframe interchange, the smallest building block is a "buffer" (see gh-35, gh-38) - a block of memory. Interpreting that is nontrivial, especially if the goal is to build an interchange protocol in Python. That's why DLPack, buffer protocol, __array_interface__, __cuda_array_interface__, __array__ and __arrow_array__ all exist, and are still complicated.

As for what a buffer is: currently it's only a data pointer (ptr) and a size (bufsize), which together describe a contiguous block of memory, plus a device attribute (__dlpack_device__) and optionally DLPack support (__dlpack__). One open question is:

  • Should a buffer support strides? This helps describe actual memory layout, e.g. Pandas can have strided columns (see #38 (comment))

The other, larger question is how to make buffers nice to deal with for implementers of the protocol. The current Pandas prototype shows the issue:

def convert_column_to_ndarray(col: ColumnObject) -> np.ndarray:
    """
    Convert a column holding a boolean, integer or floating-point dtype
    to a NumPy ndarray backed by the column's data buffer.
    """
    if col.offset != 0:
        raise NotImplementedError("column.offset > 0 not handled yet")

    if col.describe_null not in (0, 1):
        raise NotImplementedError("Null values represented as masks or "
                                  "sentinel values not handled yet")

    # Handle the dtype
    _dtype = col.dtype
    kind = _dtype[0]
    bitwidth = _dtype[1]
    if _dtype[0] not in (0, 1, 2, 20):
        raise RuntimeError("Not a boolean, integer or floating-point dtype")

    _ints = {8: np.int8, 16: np.int16, 32: np.int32, 64: np.int64}
    _uints = {8: np.uint8, 16: np.uint16, 32: np.uint32, 64: np.uint64}
    _floats = {32: np.float32, 64: np.float64}
    _np_dtypes = {0: _ints, 1: _uints, 2: _floats, 20: {8: bool}}
    column_dtype = _np_dtypes[kind][bitwidth]

    # No DLPack yet, so need to construct a new ndarray from the data pointer
    # and size in the buffer plus the dtype on the column
    _buffer = col.get_data_buffer()
    ctypes_type = np.ctypeslib.as_ctypes_type(column_dtype)
    data_pointer = ctypes.cast(_buffer.ptr, ctypes.POINTER(ctypes_type))

    # NOTE: `x` does not own its memory, so the caller of this function must
    #       either make a copy or hold on to a reference of the column or
    #       buffer! (not done yet, this is pretty awful ...)
    x = np.ctypeslib.as_array(data_pointer,
                              shape=(_buffer.bufsize // (bitwidth//8),))

    return x

From #38 (review) (@kkraus14 & @rgommers):

In __cuda_array_interface__ we've generally stated that holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well.

Yes, that works and I've thought about it. The trouble is where to hold the reference. You really need one reference per buffer, not just a reference to the whole exchange dataframe object (buffers can end up elsewhere, outside the new pandas dataframe here). And given that a buffer just has a raw pointer plus a size, there's nothing to hold on to. I don't think there's a sane pure Python solution.

__cuda_array_interface__ is directly attached to the object you need to hold on to, which is not the case for this Buffer.

I'd argue this is a place where we should really align with the array interchange protocol though as the same problem is being solved there.

Yep, for numerical data types the solution can simply be: hurry up with implementing __dlpack__, and the problem goes away. The dtypes that DLPack does not support are more of an issue.

From #38 (comment) (@jorisvandenbossche):

I personally think it would be useful to keep those existing interface methods (__array__ or __arrow_array__). For people that are using those interfaces, that will make it easier to work with the interchange protocol than manually converting the buffers.

Alternative/extension to the current design

We could change the plain memory description + __dlpack__ to:

  1. Implementations MUST support a memory description with ptr, bufsize, and device
  2. Implementations MAY support buffers in their native format (e.g. add a native enum attribute, and if both producer and consumer happen to use that native format, they can call the corresponding protocol - __arrow_array__ or __array__)
  3. Implementations MAY support any exchange protocol (DLPack, __cuda_array_interface__, buffer protocol, __array_interface__).

(1) is required for any implementation to be able to talk to any other implementation, but it is also the most clunky to support, because it needs to solve the "who owns this memory and how do you prevent it from being freed" problem all over again. What is needed there is summarized below under "What is missing for dealing with memory buffers".

The advantage of (2) and (3) is that they have the hairiest issue already solved, and will likely be faster.

And the MUST/MAY should address @kkraus14's concern that people will just standardize on the lowest common denominator (numpy).

What is missing for dealing with memory buffers

A summary of why this is hard is:

  1. Underlying implementations are not compatible. E.g., NumPy doesn't support variable length strings or bit masks, Arrow does not support strided arrays or byte masks.
  2. DLPack is the only protocol with device support, but it does not support all dtypes that are needed.

So what we are aiming for (ambitiously) is:

  • Something flexible enough to be a superset of NumPy and Arrow, with full device support.
  • In pure Python

The "holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well" seems necessary for supporting the raw memory description. This probably means that (a) the Buffer object should include the right Python object to keep a reference to (for Pandas that would typically be a 1-D numpy array), and (b) there must be some machinery to keep this reference alive (TBD what that looks like, likely not pure Python) in the implementation.

Kaggle notebooks analysis

This is a first version of the analysis of pandas usage in Kaggle notebooks.

We've fetched Python notebooks from Kaggle and ran them using record_api to count the calls to the main objects of the pandas API. A total of 895 notebooks could be analyzed.

In a separate column, information about the page views in the pandas documentation has been added. The page views are normalized to 1,000 (so the most-viewed page in the pandas documentation has a value of 1,000 in that column).

For simplicity, the attributes of DataFrame, Series and the pandas top-level module have been merged. So pandas.sum(), Series.sum() and DataFrame.sum() all appear in the list as simply sum.

The different sections are there to help with reading the document, and are not an "official" categorization of the API. Feedback is welcome if something feels misplaced.

The source code to generate the table is available at this repo.

Top 25 called methods

Notes:

  • Operators (e.g. __add__) are merged with their equivalent method (e.g. add)
  • __getitem__ is both used to access a column df[col] and to filter df[condition]
  • Accessing a column is also possible via __getattr__ (e.g. df.col_name), but this has not been captured
Object Kaggle calls
__getitem__ 143992
__setitem__ 40059
eq 3018
mul 2799
add 2768
groupby 2267
loc 1667
drop 1618
fillna 1609
columns 1583
head 1575
truediv 1442
shape 1267
sub 1144
isnull 1057
sort_values 1015
and 957
values 953
sum 898
astype 728
value_counts 706
index 664
gt 622
apply 538
to_frame 479

Main items by category

Data summary and info

Object Kaggle calls Docs views
info 275 22
empty 0 32
describe 303 146
value_counts 706 161
dtypes 175 64
memory_usage 83 2
ndim 0 1
shape 1267 17
size 3 45
values 953 113
attrs 0 0
array 0 0
unique 193 106
dtype 149 8
nbytes 0 0

Indexing

Object Kaggle calls Docs views
__getitem__ 143992 0
__setitem__ 40059 0
axes 0 4
columns 1583 31
set_index 72 278
swapaxes 0 0
select_dtypes 180 36
lookup 0 11
xs 5 16
loc 1667 232
iloc 427 122
index 664 164
reindex 11 136
reindex_like 0 2
reset_index 305 279
add_prefix 16 6
add_suffix 0 3
get 0 16
iat 1 17
keys 13 16
at 4 40
filter 3 170
rename 401 355
rename_axis 0 13
idxmax 7 49
idxmin 0 10
droplevel 0 0
truncate 0 7
swaplevel 0 7
take 0 5
reorder_levels 0 5
sort_index 32 90
set_axis 0 1
pop 14 9
searchsorted 0 3
name 113 13
item 0 3
argmax 0 2
argmin 0 1
argsort 0 3

Filter, select, sort

Object Kaggle calls Docs views
nlargest 25 17
nsmallest 1 8
head 1575 108
tail 60 12
drop_duplicates 20 194
sort_values 1015 457
sample 63 102
query 12 69

Operators

Object Kaggle calls Docs views
add 2768 104
div 2 10
dot 0 9
eq 3018 1
equals 0 35
floordiv 3 0
ge 68 1
gt 622 1
le 197 0
lt 8 0
mod 11 1
mul 2799 4
ne 163 1
pow 29 2
product 0 3
radd 0 6
rdiv 0 0
rfloordiv 0 0
rmod 0 0
rmul 0 2
rpow 0 0
rsub 0 2
rtruediv 0 2
sub 1144 7
truediv 1442 0

Missing values

Object Kaggle calls Docs views
isnull 1057 90
notnull 60 40
dropna 193 346
fillna 1609 248
interpolate 3 39
isna 108 27
notna 5 11
hasnans 0 0

Map

Object Kaggle calls Docs views
cut 59 84
eval 0 12
corrwith 1 11
applymap 2 49
astype 728 234
rank 2 34
clip 4 13
where 10 105
mask 14 25
combine 0 12
combine_first 0 11
isin 86 138
abs 25 12
replace 463 216
apply 538 379
round 14 68
transform 10 39
factorize 3 15
map 420 91
between 1 12

Reduce

Object Kaggle calls Docs views
cov 0 9
quantile 47 78
var 4 11
skew 88 5
std 140 39
sum 898 114
kurt 60 1
kurtosis 23 3
count 109 107
max 131 70
mean 390 107
median 228 21
min 107 26
mode 205 18
prod 1 1
nunique 15 27
all 9 16
any 87 22
mad 3 2
sem 0 2
corr 239 105
is_monotonic 0 0
is_monotonic_decreasing 0 0
is_monotonic_increasing 0 0
is_unique 0 1
autocorr 0 7

Misc

Object Kaggle calls Docs views
iterrows 39 102
style 84 76
itertuples 0 36
bool 0 5
squeeze 0 2
update 8 56
pipe 3 7
__iter__ 0 1
items 1 6
iteritems 3 37
view 0 0

Reshape / Join / Concat...

Object Kaggle calls Docs views
get_dummies 258 152
crosstab 58 40
concat 432 315
merge_asof 0 16
merge_ordered 0 4
wide_to_long 0 7
pivot 29 95
pivot_table 54 144
join 159 225
melt 18 75
stack 0 36
transpose 9 76
assign 19 74
insert 17 57
merge 425 413
drop 1618 625
explode 0 0
align 3 10
append 439 515
T 55 6
unstack 17 58
repeat 0 5
ravel 0 5

Group

Object Kaggle calls Docs views
agg 0 16
aggregate 3 58
groupby 2267 719

Window

Object Kaggle calls Docs views
cummax 0 2
cummin 0 0
cumprod 0 5
cumsum 8 29
pct_change 0 34
rolling 42 140
ewm 0 33
expanding 0 11
duplicated 14 90
diff 1 54

Mutability

On the call yesterday, the topic of mutability came up in the vaex demo.

The short version is that it may be difficult or impossible for some systems to implement in-place mutation of dataframes. For example, I believe that neither vaex nor Dask implements the following:

In [8]: df = pd.DataFrame({"A": [1, 2]})

In [9]: df
Out[9]:
   A
0  1
1  2

In [10]: df.loc[0, 'A'] = 0

In [11]: df
Out[11]:
   A
0  0
1  2

I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.

There is one mutation-adjacent area that might be considered: using DataFrame.__setitem__ to add an additional column

In [12]: df['B'] = [1, 2]

In [13]: df
Out[13]:
   A  B
0  0  1
1  2  2

Or perhaps to update the contents of an entire column

In [14]: df['B'] = [3, 4]

In [15]: df
Out[15]:
   A  B
0  0  3
1  2  4

In these cases, no values are actually being mutated in place. Is that acceptable?

Get number of rows and columns

This issue is to discuss how to obtain the size of a dataframe. I'll show it with an example, and base it on the pandas API.

Given a dataframe:

import pandas

data = {'col1': [1, 2, 3, 4],
        'col2': [5, 6, 7, 8]}

df = pandas.DataFrame(data)

I think the Pythonic and simpler way to get the number of rows and columns is to just use Python's len, which is what pandas does:

>>> len(df)  # number of rows
4
>>> len(df.columns)  # number of columns
2

I guess an alternative could be to use df.num_rows and df.num_columns, but IMHO it doesn't add much value, and just makes the API more complex.

One thing to note is that pandas mostly implements the dict API for a dataframe (as if it were a dictionary of lists, like the example data). But returning the number of rows from len(df) is inconsistent with the dict API, which would return the number of columns (keys). So, with the proposed API, len(data) != len(df). I think being fully consistent with the dict API would be misleading, but it is worth considering.
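To make the inconsistency with the dict API concrete, continuing the example above:

>>> len(data)  # dict API: number of keys, i.e. columns
2
>>> len(df)    # proposed dataframe API: number of rows
4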

Then, pandas offers some extra properties:

df.ndim == 2

df.shape == (len(df), len(df.columns))

df.size == len(df) * len(df.columns)

I guess the reason for the first two is that pandas originally implemented Panel, a three dimensional data structure, and ndim and shape made sense with it. But I don't think they add much value now.

I don't think size is that commonly used (I will check once we have the pandas usage analysis data), and it's trivial for users to implement it, so I wouldn't add it to the API.

Proposal

  • len(df) returning the number of rows
  • len(df.columns) returning the number of columns

And nothing else regarding the shape of a dataframe.

Missing Data

This issue is dedicated to discussing the large topic of "missing" data.

First, a bit on names. I think we can reasonably choose between NA, null, or missing as a general name for "missing" values. We'd use that to inform decisions on method names like DataFrame.isna() vs. DataFrame.isnull() vs. ...
Pandas favors NA, databases might favor null, Julia uses missing. I don't have a strong opinion here.

Some topics of discussion:

  1. Data types should be nullable

I think we'd like the introduction of missing data not to fundamentally change the dtype of a column.
This is not the case with pandas:

In [5]: df1 = pd.DataFrame({"A": ['a', 'b'], "B": [1, 2]})

In [6]: df2 = pd.DataFrame({"A": ['a', 'c'], "C": [3, 4]})

In [7]: df1.dtypes
Out[7]:
A    object
B     int64
dtype: object

In [8]: pd.merge(df1, df2, on="A", how="outer")
Out[8]:
   A    B    C
0  a  1.0  3.0
1  b  2.0  NaN
2  c  NaN  4.0

In [9]: _.dtypes
Out[9]:
A     object
B    float64
C    float64

In pandas, for int-dtype data NaN is used as the missing value indicator. NaN is a float, and so the column is cast to float64 dtype.

Ideally Out[9] would preserve the int dtype for B and C. At this moment, I don't have a strong opinion on whether the dtype for B should be a plain int64, or something like a Union[int64, NA].

  2. Semantics in arithmetic and comparison operations

In general, missing values should propagate in arithmetic and comparison operations (using <NA> as a marker for a missing value).

>>> df1 = DataFrame({"A": [1, None, 3]})
>>> df1 + 1
      A
0     2
1  <NA>
2     4

>>> df1 == 1
       A
0   True
1   <NA>
2  False

There might be a few exceptions. For example, 1 ** NA might be 1 rather than NA, since the result doesn't depend on exactly what value NA takes on.

  3. Semantics in logical operations

For boolean logical operations (and, or, xor), libraries should implement three-valued or Kleene logic. The pandas docs have a table.
The short version is that the result should be NA only if it depends on whether the NA operand is True or False. For example, True | NA is True, since it doesn't matter whether that NA is "really" True or False.
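For concreteness, a short illustration using pandas' nullable boolean dtype, which already implements Kleene logic (other libraries would be expected to follow the same rules):

>>> import pandas as pd
>>> a = pd.array([True, False, pd.NA], dtype="boolean")
>>> a | True   # x | True is True for every x, including NA
<BooleanArray>
[True, True, True]
Length: 3, dtype: boolean
>>> a & True   # x & True is x, so NA stays NA
<BooleanArray>
[True, False, <NA>]
Length: 3, dtype: boolean
>>> a | pd.NA  # True | NA is True; False | NA and NA | NA are NA
<BooleanArray>
[True, <NA>, <NA>]
Length: 3, dtype: boolean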

  4. The need for a scalar NA?

Libraries might need to implement a scalar NA value, but I'm not sure. As a user, you would get this from indexing to get a scalar, or in an operation that produces an NA result.

>>> df = pd.DataFrame({"A": [None]})
>>> df.iloc[0, 0]  # no comment on the indexing API
<NA>

What semantics should this scalar NA have? In particular, should it be typed? This is something we've struggled with in recent versions of pandas. There's a desire to preserve a property along the lines of the following

(arr1 + arr2)[0].dtype == (arr1 + arr2[0]).dtype

Where the first value in the second array is NA. If you have a single NA without any dtype, you can't implement that property.
There's a long thread on this at pandas-dev/pandas#28095.

Trying to define "data frame"

There was a question on the sync call today about defining "what is a data frame?". People may have different perspectives, but I wanted to offer mine:


A "data frame" is a programming interface for expressing data manipulations and analytical operations on tabular datasets (a dataset, in turn, is a collection of columns each having their own logical data type) in a general purpose programming language. The interface often exposes imperative, composable constructs where operations consist of multiple statements or function calls. This contrasts with the declarative interface of query languages like SQL.


Things that IMHO should not be included in the definition, because they are implementation-specific concerns on which any given "data frame library" may work differently:

  • Presumptions of data representation (for example, most data frame libraries in Python have a bespoke / custom representation based on lower-level data containers). This includes type-specific questions like "how is a timestamp represented" or "how is categorical data represented", since these are implementation dependent. Also, just because the programming interface has "columns" does not guarantee that the data representation is columnar.
  • Presumptions of data locality (in-memory, distributed in-memory, out-of-core, remote)
  • Presumptions of execution semantics (eager vs. deferred)

Hopefully one objective of this group will be to define a standardized programming interface that avoids commingling implementation-specific details into the interface.

That said, there may be people that want to create "RPandas" (see RPython) -- i.e. to provide for substituting new objects into existing code that uses pandas. If that's what some people want, we will need to clarify that up front.

Dataframe MVP

We've already got several useful discussions open, on different topics. To give a bit of structure to the conversations, I propose we start with an initial MVP (minimum viable product), and build by iterating on it.

This is a draft of the topics that we may want to discuss, and a possible order to discuss them:

  • Dataframe class name #17
  • Get number of rows and columns #20
  • Get and set column names #21
  • Selecting/accessing columns (df[col], df[col1, col2]), and calling methods in 1 vs N columns
  • Filter data
  • Indexing, row labels #12
  • Sorting data
  • Missing data #9
  • Constructor and loading/dumping data
  • Map operations (abs, isin, clip, str.lower,...)
  • Map operations with Python operators (+, *,...)
  • Reductions (sum, mean,...) #11
  • Aggregating data and window functions
  • Joining dataframes
  • Reshaping data (pivot, stack, get dummies...)
  • Setting data (mutability, adding new columns) #10
  • Displaying data, visualization, plotting #15
  • Time series operations
  • Sparse data

The idea would be to discuss and decide on each topic incrementally, and keep defining an API that can be used end to end (with very limited functionality at the beginning). So, focusing on being able to write code with the API, we should identify, for each topic, the questions that need to be answered to construct the API, and then add the API definition to the RFC based on the agreements.

Next there is a very simple example of dataframe usage. And the questions that need to be answered to define a basic API for them.

>>> from whatever import dataframe

>>> data = {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}

>>> df = dataframe.load(data, format='dict')
>>> df
a b c
-----
1 3 5
2 4 6

>>> len(df)
2
>>> len(df.columns)
3

>>> df.dtypes
[int, int, int]

>>> df.columns
['a', 'b', 'c']

>>> df.columns = 'x', 'y', 'z'
>>> df.columns
['x', 'y', 'z']

>>> df
x y z
-----
1 3 5
2 4 6

>>> df['q'] = [7, 8]
>>> df
x y z q
-------
1 3 5 7
2 4 6 8

>>> df['y']
y
-
3
4

>>> df['z', 'x']
z x
---
5 1
6 2

>>> df.dump(format='dict')
{'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'q': [7, 8]}

The simpler questions that need to be answered to define this MVP API are:

  • Name of the dataframe class. I can think of two main options (feel free to propose more):
    • DataFrame or Dataframe, to be consistent with Python class capitalization
    • dataframe, using Python type capitalization (as in int, bool, datetime.datetime, ...)
  • How to obtain the size of the dataframe?
    • Properties (num_columns, num_rows)
    • Using Python len: len(df), len(df.columns)
    • shape (it allows for N dimensions, which for a dataframe is not needed, since it's always 2D)
  • How to obtain the dtypes (is a dtypes property enough?)
  • Setting and getting column names
    • Is using a Python property enough?
    • What should be the name? columns, column_names...

The next two questions are also needed, but they are more complex, so I'll be creating separate issues for them:

  • Loading and exporting data

    • Should the dataframe class provide a constructor? If it does, should it support different formats (like pandas)?
    • Should we have different syntax (as in pandas) for loading data from disk (pandas.read_csv...) and for loading data from memory (DataFrame.from_dict)? Or is a standard way for all loading/exporting preferred?
  • How to access and set columns in a dataframe

    • With __getitem__ directly (df[col] / df[col] = foo)
    • With __getitem__ over a property (df.col[col] / df.col[col] = foo)
    • With methods (df.get(col) / df.set(col=foo))
    • Is more than one way needed/preferred?

Tracking issue: dataframe protocol implementation

The bulk of the dataframe interchange protocol was done in gh-38. There were still a number of TODOs however, and more will likely pop up once we have multiple implementations so we can actually turn one type of dataframe into another type. This is the tracking issue for those TODOs and issues:

  • Categorical dtypes: we should allow having null as a category; it should not have a specified meaning, it's just another category that should (e.g.) roundtrip correctly. See conversation in 8 Apr meeting.
  • Categorical dtypes: should they be a dtype in themselves, or should they be a part of the dtype tuple? Currently dtype is (kind, bitwidth, format_str, endianness), with categorical being a value of the kind enum. Is making a 5th element in the dtype, with that element being another dtype 4-tuple, thereby allowing for nesting, sensible?
  • Add a metadata attribute that can be used to store library-specific things. For example, Vaex should be able to store expressions for its virtual columns there. See PR gh-43
  • Add a flag to throw an exception if the export cannot be zero-copy. (e.g. for pandas, possible due to block manager where rows are contiguous and columns are not - add a test for that). See PR gh-44
  • Add a string dtype, with variable-length strings implemented with the same scheme as Arrow uses (an offsets and a data buffer, see #38 (comment)). See PR gh-45
  • Signature of the from_dataframe protocol? See #42 and meeting of 20 May.
  • What can be reused between implementations in different libraries, and can/should we have a reference implementation? --> question needs answering somewhere.
  • What is the ownership for buffers, who owns the memory? This should be clearly spelled out in the docs. An owner attribute is perhaps needed. See meeting minutes 4 March, #39, and comments on this PR.

Sparse columns

Should a dedicated API/column metadata to efficiently support sparse columns be part of the spec?

Context

It can be the case that more than 99% of a given column's values are null or missing (or some other repeated constant value); in that case, memory and computation would be wasted unless a dedicated memory representation is used that does not explicitly materialize these repeated values.

Use cases

  • efficient computation: e.g. computing the mean and standard deviation of a sparse column with more than 99% zeros
  • efficient computation: e.g. computing the nanmean and nanstd of a sparse column where more than 99% of the values are missing
  • some machine learning estimators have special treatments of sparse columns (e.g. for memory efficient representation of one-hot encoded categorical data), but often they could (in theory) be changed to handle categorical variables using a different representation if explicitly tagged as such.

Limitations

  • treating sparsity at the single-column level can be limiting: some machine learning algorithms that leverage sparsity can only do so when considering many sparse columns together as a sparse matrix using a Compressed Sparse Rows (CSR) representation (e.g. logistic regression with non-coordinate-based gradient-based solvers (SGD, L-BFGS...) and kernel machines (support vector machines, Gaussian processes, kernel approximation methods...))
  • others can leverage sparsity in a column-wise manner, typically by accepting Compressed Sparse Columns (CSC) data (e.g. coordinate descent solvers for the Lasso, random forests, gradient boosting trees...)

Survey of existing support

(incomplete, feel free to edit or comment)

Questions:

  • Should sparse datastructures be allowed to represent both missingness and nullness or only one of those? (I assume both would be useful as pandas does with the fill_value param)
  • Should this be some kind of optional module / extension of the main dataframe API spec?

APIs for both building pipelines and data analysis

From my own experience, there are (at least) two very different use cases for dataframes:

  1. Doing some real-time data analysis in notebooks
  2. Building production pipelines

While I think pandas (and later Vaex, Dask, Modin...) did a very reasonable job of building a single tool that serves both use cases, there are trade-offs that IMO will bias any API towards one or the other.

Some specific examples:

  • Eager/lazy modes: In case 1 (data analysis) eager mode is probably preferable, while in case 2 (pipelines), lazy mode has more advantages
  • Automatic type inference/casting: I see the advantage of having all sorts of magic for inferring types and guessing when the user is able to check the results of the operations at every step. But when, instead of executing cell by cell, a big pipeline is executed as a batch process, I see this as problematic. I think it's worth having to be explicit and avoiding any magic: it helps prevent bugs, and errors are not propagated to later stages in the pipeline, where they would be difficult to identify.

I wrote a post about this that describes this point of view in more detail.

I think it can help the discussions to keep in mind that there are at least two main use cases, and that there will be trade offs among them.

Feedback here very welcome.

Data types to support

What data types should be part of the standard? For the array API, the types have been discussed here.

A good reference for data types for data frames is the Arrow data types documentation. The page probably contains many more types than the ones we want to support in the standard.

Topics to make decisions on:

  • Which data types should be supported by the standard?
  • Are implementations expected to provide extra data types? Should we have a list of optional types, or consider types not part of the standard out of scope?
  • Missing data is discussed separately in #9

These are IMO the main types (feel free to disagree):

  • boolean
  • int8 / uint8
  • int16 / uint16
  • int32 / uint32
  • int64 / uint64
  • float32
  • float64
  • string (I guess the main use case is variable-length strings, but should we consider fixed-length strings?)
  • categorical (would it make sense to have categorical8, categorical16, ... for different representations of the categories with uint8, uint16, ...?)
  • datetime64 (requires discussion; pandas uses nanoseconds since epoch as the unit, which can represent the years 1677 to 2262)

Some other types that could be considered:

  • decimal
  • python object
  • binary
  • date
  • time
  • timedelta
  • period
  • complex

And also types based on other types that could be considered:

  • date + timezone
  • numeric + unit
  • interval
  • struct
  • list
  • mapping

Calling methods on invalid types

This is a follow up of the discussions in:

  • #6 (comment)
  • #11 (question: pandas has parameters (bool_only, numeric_only) to apply the operation only over columns of certain types. Do we want that?)

See this example:

>>> df[['name', 'population']].mean()
population    2.729748e+07
dtype: float64

Even though the name column is selected, it is ignored, since the mean of a string column does not make sense, as opposed to raising an exception.

Many reductions implement a parameter to control this behavior:

df[['name', 'population']].mean(numeric_only=False)
TypeError: could not convert string to float:

If we consider more methods to be applied directly over a dataframe, for example:

>>> df[['first_name', 'last_name']].str.lower()

We may end up with a huge number of string_only, bool_only, numeric_only parameters, all meaning something similar, but IMO adding a decent amount of complexity and making it difficult to keep the behavior consistent.

My preference would be to always raise, but being a software engineer I'm biased, and I guess many users may want this "magic".

So I guess implementing an option, for example pandas.options.mode.invalid_dtype with values {'raise', 'skip'}, could make more sense.

The main problem with this approach is probably that it's not as easy to define the behavior for each operation:

(df.mean(numeric_only=True)
   .mean(numeric_only=False))

Personally, I don't see this as an issue. IMO, the behavior depends more on the user than on the operation. I'd say for production code, having to be explicit, and selecting the columns to operate with, makes more sense. While in a notebook, avoiding exceptions with this sort of "magic" seems to be more useful.

I guess for Series/1-column DataFrame (see #6) it always makes sense to raise an exception.

Thoughts?

Related topic: dataframe protocol for data interchange/export

In March '20 there was a very detailed discussion about introducing a new __dataframe__ protocol: https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267. Its purpose is to be able to exchange data between different implementations, or to export data to (e.g.) Apache Arrow or NumPy.

There's a strawman implementation at wesm/dataframe-protocol#1.

The discussion went a little all over the place, with many people misunderstanding that the main purpose was data exchange rather than providing an API to manipulate or do computations with a dataframe. The latter would be a much larger topic, and something this consortium aims to deliver an RFC for.

That said, the __dataframe__ topic is very much related, and is also a potentially interesting example of a cross-dataframe-library topic that could really benefit from having a detailed RFC with requirements and use cases. We should consider picking up that topic, and consider lessons from it in community engagement.

Pandas implementation and BooleanDtype

When researching all possible dtypes with missing values in Vaex and observing how this is handled in the Pandas implementation, I found that Pandas' BooleanDtype gives an error:

import pandas as pd
import pandas._testing as tm
# `from_dataframe` here refers to the protocol prototype's conversion function.

def test_bool():
    df = pd.DataFrame({"A": [True, False, False, True]})
    df["B"] = pd.array([True, False, pd.NA, True], dtype="boolean")
    df2 = from_dataframe(df)
    tm.assert_frame_equal(df, df2)

My question is: when thinking of all possible entries into a Vaex dataframe, should one stick to the common cases, or should one dissect all possibilities at this level?

Clarify if interchange dataframes should also have `__dataframe__()`

I couldn't work out if the interchange dataframe (i.e. the dataframe returned from __dataframe__()) should also have a __dataframe__() method, e.g.

>>> import pandas as pd
>>> df = pd.DataFrame()  # i.e. the top-level dataframe
>>> interchange_df1 = df.__dataframe__()
>>> interchange_df2 = interchange_df1.__dataframe__()

Among the current adopters' upstream implementations, there is a split on whether the interchange dataframe has this method.

Library Top-level Interchange
pandas ✔️ ✔️
vaex ✔️ ❌
modin ✔️ ✔️
cuDF ✔️ ❌

I had assumed that interchange dataframes should have __dataframe__() by virtue of it being a method in the DataFrame API object. I think it makes sense, as then from_dataframe()-like functions only need to check for __dataframe__() to support interchanging both top-level and interchange dataframes of different libraries.
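For illustration, the kind of consumer-side check this enables (a sketch only, not any library's actual implementation):

def from_dataframe(df):
    # Works for both a "top-level" dataframe and an interchange dataframe,
    # provided both expose __dataframe__().
    if not hasattr(df, "__dataframe__"):
        raise TypeError("object does not support the dataframe interchange protocol")
    interchange_df = df.__dataframe__()
    # ... convert `interchange_df` column by column into the consumer's format ...
    return interchange_df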

If there is an explicit specification somewhere in this regard, please give me a pointer! In any case, it might be worth clarifying in the __dataframe__() docstring where this method should reside.

Meta data for DataFrame and Column

In order not to lose information encoded in DataFrames and Columns that is not covered by our API, we may want to provide extra metadata slots for these.

One may argue that this should be covered by the API and that it defeats the purpose of a standard, but I think it's a very pragmatic way to guarantee lossless roundtripping of information outside this standard and to help adoption (because there is an escape hatch).

Example metadata for a dataframe

  • path: for when it's backed by a file or remote
  • description: metadata describing the dataframe
  • license: CC0, MIT
  • history: log of how the data was produced

Example metadata for a column:

  • unit: string that describes the unit ('km/s', 'parsec', 'furlong')
  • description: metadata describing the column
  • expression: in vaex, this is the expression in string form
  • is_index: an indicator that this column is the index in Pandas.

This could also help to round trip Arrow extension types: https://arrow.apache.org/docs/python/extending_types.html and I guess the same holds for Pandas.

An implementation could be a def get_metadata(self) -> dict[str, Any], where we recommend prefixing keys with implementation-specific names, like 'arrow.extension_type', 'vaex.unit', 'pandas.extension_type_name', etc., as sketched below.
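A sketch of what that could look like; the signature follows the suggestion above, and the keys are just examples of the prefixed naming:

from __future__ import annotations

from typing import Any

class Column:
    def get_metadata(self) -> dict[str, Any]:
        # Implementation-prefixed keys carry library-specific information; a
        # non-prefixed key like 'unit' could later be standardized.
        return {
            'vaex.expression': 'sqrt(x**2 + y**2)',
            'pandas.is_index': False,
            'unit': 'km/s',
        }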

Commonly used keys could be upgraded to be part of the API in the future (non-prefixed keys) that we formalize and document.

FYI: metadata is a first-class citizen in the Clojure language https://clojure.org/reference/metadata

Reductions

Listed below are the reductions over numerical types defined in pandas. They can be applied:

  • To Series
  • To N columns of a DataFrame
  • To group by operations
  • As window functions (window, rolling, expanding or ewm)
  • In resample operations

pandas is not consistent in letting any reduction be applied to any of the above. Each method is
independent (Series.sum, GroupBy.sum, Window.sum, ...), some reductions are not implemented for
some of the classes, and the signatures can change (e.g. Series.var(ddof) vs EWM.var(bias)).

I propose to have standard signatures for the reductions, and have all reductions available to all classes.

Reductions for numerical data types and proposed signatures

  • all()
  • any()
  • count()
  • nunique() # maybe the name could be count_unique, count_distinct, ...?
  • mode() # what to do if there is more than one mode? Ideally we would like all reductions to return a scalar
  • min()
  • max()
  • median()
  • quantile(q, interpolation='linear') # in pandas q is by default 0.5, but I think it's better to require it; interpolation can be {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
  • sum()
  • prod()
  • mean()
  • var(ddof=1) # delta degrees of freedom (for some classes bias is used)
  • std(ddof=1)
  • skew()
  • kurt() # pandas has also the alias kurtosis
  • sem(ddof=1) # standard error of the mean
  • mad() # mean absolute deviation
  • autocorr(lag=1)
  • is_unique() # in pandas is a property
  • is_monotonic() # in pandas is a property
  • is_monotonic_decreasing() # in pandas is a property
  • is_monotonic_increasing() # in pandas is a property

Reductions that may depend on row labels (and could potentially return a list, like mode):

  • idxmax() / argmax()
  • idxmin() / argmin()

These need an extra column other:

  • cov(other, ddof=1)
  • corr(other, method='pearson') # method can be {'pearson', 'kendall', 'spearman'}

Questions

  • Allow reductions over rows, or only over columns?
  • What to do with NA?
  • pandas has parameters (bool_only, numeric_only) to let only apply the operation over columns of certain types only. Do we want it?
    • I think something like df.select_columns_by_dtype(int).sum() would be preferable to a parameter on all or some reductions
  • pandas has a level parameter in many reductions, for MultiIndex. If Indexing/MultiIndexing is part of the API, do we want to have it?
  • pandas has a min_count/min_periods parameter in some reductions (e.g. sum, min), to return NA if less than min_count values are present. Do we want to keep it?
  • How should reductions be applied?
    • In the top-level namespace, as pandas (e.g. df[col].sum())
    • Using an accessor (e.g. df[col].reduce.sum())
    • Having a reduce function, and passing the specific functions as a parameter (e.g. df[col].reduce(sum))
    • Other ideas
  • Would it make sense to have a third-party package implementing reductions that can be reused by projects?

Frequency of usage

[figure: pandas_reductions]

Restrictions on column labels

One of the uncontroversial points from #2 is that DataFrames have column labels / names. I'd like to discuss two specific points on this before merging the results into that issue.

  1. What type can the column labels be? Should they be limited to just strings?
  2. Do we require uniqueness of column labels?

I'm a bit unsure whether these are getting too far into the implementation side of things. Should we just take no stance on either of these?


My responses:

  1. We should probably allow labels to be any type.

Operations like crosstab / pivot place a column from the input dataframe into the column labels of the output.

We'll need to be careful with how this interacts with the indexing API, since a label like the tuple ('my', 'label') might introduce ambiguities (e.g. when the full list of labels is ['my', 'label', ('my', 'label')]).

Is it reasonable to require each label to be hashable? Pandas requires this, to facilitate lookup in a hashtable.

  2. We cannot require uniqueness.

dataframes are commonly used to wrangle real-world data into shape, and real-world data is messy. If an implementation wants to ensure uniqueness (perhaps on a per-object basis), then it can offer that separately. But the API should at least allow for duplicates.

DataFrame interchange protocol: datetime units

We currently list "datetime support" in the design document, and also list it in the dtype docstring.

But at the moment the spec doesn't say anything about how the datetime is stored (which resolution, or whether it supports multiple resolutions with some parametrization).

Updating the spec to mention it should be nanoseconds might be the obvious solution (since that's the only resolution pandas currently supports), but I think we should make this more flexible and allow different units (hopefully pandas will support non-nanosecond resolutions in the future, and other systems might use other resolutions by default).
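For reference, the Arrow C Data Interface already parametrizes the unit (and optionally a timezone) in its format strings; something along these lines could be one way to encode it here. A minimal sketch, not a decision:

# Arrow C Data Interface format strings for timestamps encode the unit,
# with an optional timezone after the colon:
ARROW_TIMESTAMP_FORMATS = {
    "tss:": "timestamp, seconds",
    "tsm:": "timestamp, milliseconds",
    "tsu:": "timestamp, microseconds",
    "tsn:": "timestamp, nanoseconds",
    "tsn:UTC": "timestamp, nanoseconds, UTC timezone",
}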

DataFrame interchange protocol: should NaT be like NaN or a sentinel?

In the describe_null we currently list the following options:

  • 0 : non-nullable
  • 1 : NaN/NaT
  • 2 : sentinel value
  • 3 : bit mask
  • 4 : byte mask

While looking at the pandas implementation, I was wondering if we shouldn't treat NaT differently from NaN and see it as a sentinel value (option 2 in the list above).

While NaN could also be seen as a kind of sentinel value, there are some clear differences: NaN is a floating-point concept backed by the IEEE 754 standard (while, as far as I know, "NaT" is quite numpy-specific; e.g. Arrow doesn't support it). NaNs also evaluate as non-equal (following the standard), and while for datetime64 with NaT that's also the case in numpy, if you view the data as int64 it's not (and e.g. for dlpack those values will be regarded as int64, and the actual Buffer object might be agnostic to it).
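A small numpy illustration of the difference: once you look past the datetime64 view, NaT is just a fixed int64 sentinel.

import numpy as np

arr = np.array(["2021-01-01", "NaT"], dtype="datetime64[ns]")
print(arr[1] != arr[1])        # True: NaT compares unequal, like NaN
print(arr.view("int64")[1])    # -9223372036854775808
print(arr.view("int64")[1] == np.iinfo(np.int64).min)  # True: a plain int64 sentinel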

Dataframe interchange protocol

This issue supersedes #1 and #14. As agreed in the 6 Aug call, the first milestone in the definition of the dataframe API will be the part to interchange data. As a sample use case, Matplotlib being able to receive dataframes from different implementations (e.g. pandas, vaex, modin, cudf, etc.).

This work was originally discussed in OSSData, and an initial draft was later proposed here: wesm/dataframe-protocol#1.

The topics to discuss and decide on are:

  • Access dataframe properties/metadata:
    • Get number of rows and columns #20
    • Get column names #21
    • Get column data types
  • Selecting a column and accessing the underlying data (I think it'll require a decision on #6, whether we want a separate column/Series object)
  • Data types (which are part of the standard, how they are represented, etc.) #26
  • How downstream libraries will access this API (implemented in the dataframe directly, or returned object via __dataframe__)
  • Should row labels be part of the standard?

The procedure to include this part in the standard RFC will be as follows:

  1. Define the goals, requirements, target audience, scope and use cases, and include them in the RFC
  2. Discuss and build a standalone document specific to the data interchange based on the above topics
  3. Review internally, and post publicly for additional feedback
  4. Update the prototype with the agreed API
  5. Finalize and approve the API and the prototype, and add them to the RFC document

How the API is expected to be used

In today's meeting we discussed what the goal of the API is, and who its target users are.

@maartenbreddels and @devin-petersohn, if I understood correctly, see the API we're defining here as something they'd like to implement internally in Vaex and Modin, but not make their public API. Not sure what pandas' point of view on that is.

I think that's perfectly fine, and it makes sense. But I wonder whether it would make sense for those public APIs to be independent wrappers, in the same way Seaborn wraps Matplotlib, or HoloViews wraps Bokeh. Let me expand on what I mean here.

From the discussions we had, I think people mentioned that they were interested in defining a more "pure" and less "magic" API than the existing one. Not sure if the previous sentence makes a lot of sense, but I guess some of the principles for the API could be:

  • Explicit is better than implicit
  • There should be one-- and preferably only one --obvious way to do it
  • Avoid ambiguity
  • In general, avoid making the library have to guess

Personally, I think this API should be great for software developers: developers of libraries like ours who want to build on top of it, or developers of downstream software. And I'd say also data engineers, and people who want to write production code with dataframes.

Then, I understand that some users (e.g. data analysts) prefer more "magic" APIs that automatically fix problems they don't want to care about. As an example, let's think of the dataframe constructor.

As a data analyst, or someone who is not a software engineer, I think it is very reasonable/convenient for the following code to work:

DataFrame({'a': [1, 2], 'b': [3, 4]})
DataFrame([{'a': 1, 'b': 3}, {'a': 3, 'b': 4}])
DataFrame(json.loads(value))

But as software engineer, I may want to have a more explicit and less magic syntax, for example:

DataFrame.load({'a': [1, 2], 'b': [3, 4]}, kind='dict')
DataFrame.load([{'a': 1, 'b': 3}, {'a': 3, 'b': 4}], kind='list_of_dict')
DataFrame.load(json.loads(value), kind='dict')

Correct me if I'm wrong, but I think there is mostly agreement that we want to focus the consortium API on the latter style. If Vaex, Modin, pandas... provide this API, then there is easy compatibility in the ecosystem. For example, Scikit-learn or Matplotlib can get a "dataframe" as a parameter, and operate with it, since they know it will follow the standard API.

But then, implementations like Modin, Vaex, or pandas may want to keep their existing APIs. Or provide a different user API, more targeted at specific users (e.g. data analysts, who want the library to make guesses that make their lives easier).

Then my question is, does it make sense for this alternative API to live in the implementations? For example, let's say I see pandas as this API on top of numpy, Vaex on top of memory maps, and Modin on top of Ray (excuse the simplification). Then, if Modin wants to implement an SQLite-like API, could it make sense for that to be an independent project, an SQLite-like API that wraps the standard API, instead of a Modin API? I guess that could make sense.

Then, I guess there is the case of an implementation, let's say pandas, which is planning to expose the API to users, but is going to add some extra magic (let's say the standard for filtering is df.filter(condition), but pandas wants to keep supporting df[condition] for backward compatibility). Or Vaex having some specific syntax for expressions on top of the standard API.

I see there is a whole range between these options:

  • All implementations offer exactly the same API
  • Implementations offer the standard API, but add some functionality to it (for their target users, or specific to their backends)
  • Backends (e.g. dataframes over numpy, over ray, over memory maps, over Arrow...) implement the same API, but users use libraries built on top of it. For example, the existing pandas API, could be a layer on top of the standard API, and work on top of Vaex, Modin...

It would be great to know other people's thoughts. I think most people have an idea of how this API is expected to be used, but I'm not sure we're all on the same page.

Some comments on interchange API from an Arrow developer

hi all, great to see some continued work on this project after the original discussion from last year. I still think it's useful to allow libraries to "throw data over the wall" without forcing eager serialization to a particular format (like pandas or Arrow)


From Column docstring:

    TBD: Arrow has a separate "null" dtype, and has no separate mask concept.
         Instead, it seems to use "children" for both columns with a bit mask,
         and for nested dtypes. Unclear whether this is elegant or confusing.
         This design requires checking the null representation explicitly.

Could you clarify what is confusing? I do not understand the statement 'Instead, it seems to use "children" for both columns with a bit mask, and for nested dtypes.'

Later

         The Arrow design requires checking:
         1. the ARROW_FLAG_NULLABLE (for sentinel values)
         2. if a column has two children, combined with one of those children
            having a null dtype.

         Making the mask concept explicit seems useful. One null dtype would
         not be enough to cover both bit and byte masks, so that would mean
         even more checking if we did it the Arrow way.

You mean the Arrow C interface here. Could you clarify what these other things mean?

  • If ARROW_FLAG_NULLABLE is set, then the first buffer (the validity bitmap) governs nullness.
  • The 2nd item here "if a column has two children, combined with one of those children having a null dtype." does not make sense to me, because nulls in Arrow are exclusively determined by the validity bitmap. Arrow does have a "Null" type, but this represents data whose values are always all null and has no associated buffers.

Re: "One null dtype would not be enough to cover both bit and byte masks, so that would even more checking if we did it the Arrow way.", I don't know what this means, could you clarify?


Column methods

    @property
    def null_count(self) -> Optional[int]:
        """
        Number of null elements, if known.
        Note: Arrow uses -1 to indicate "unknown", but None seems cleaner.
        """
        pass

Here you should indicate that you mean the Arrow C interface (where the null_count is an int64).


General comments / questions

  • It feels a little unfortunate to go "halfway to Arrow" by adding the string offsets buffer, always requiring serialization for variable-size binary data. I haven't thought through what the alternatives for string data would be that do not necessarily force this serialization (similar to the API proposal from wesm/dataframe-protocol#1); if the goal of this API is to reduce the need to serialize, having some alternative here might be worthwhile
  • It seems like it would be useful to provide some to_* methods on Column (like Column.to_arrow). I guess that pyarrow could implement a built-in implementation of this interface, but some producers might be able to produce Arrow or NumPy arrays directly and skip the lower-level memory export that's provided here.
  • Is there interest in supporting nested / non-flat data in this interchange API? I notice that they are explicitly excluded from the dtype docstring, but I would encourage you to think about them up front rather than bolting them on later.

potentially relevant usage patterns / targets for a developer-focused API

In other issues we find some detailed analyses of how the pandas API is used today, e.g. gh-3 (on Kaggle notebooks) and in https://github.com/data-apis/python-record-api/tree/master/data/api (for a set of well-known packages). That data is either not relevant for a developer-focused API though, or is so detailed that it's hard to get a good feel for what's important. So I thought it'd be useful to revisit the topic. I used https://libraries.io/pypi/pandas and looked at some of the top repos that declare a dependency on pandas.

Top 10 listed:

(image: top 10 packages that depend on pandas, from libraries.io)

Seaborn

Perhaps the most interesting pandas usage. It's a hard dependency, used a fair amount and for more than just data access; however, it all still seems fairly standard and common, so it may be a reasonable target to make work with multiple libraries. Uses a lot of isinstance checks (on pd.DataFrame, pd.Series).

Folium

Just a single non-test usage, in pd.py:

def validate_location(location):  # noqa: C901
    "...J
    if isinstance(location, np.ndarray) \
            or (pd is not None and isinstance(location, pd.DataFrame)):
        location = np.squeeze(location).tolist()


def if_pandas_df_convert_to_numpy(obj):
    """Return a Numpy array from a Pandas dataframe.
    Iterating over a DataFrame has weird side effects, such as the first
    row being the column names. Converting to Numpy is more safe.
    """
    if pd is not None and isinstance(obj, pd.DataFrame):
        return obj.values
    else:
        return obj

PyJanitor

Interesting/unusual common pattern, which extends pd.DataFrame through pandas_flavor with either accessors or methods. E.g. from [janitor/biology.py](https://github.com/pyjanitor-devs/pyjanitor/blob/a6832d47d2cc86b0aef101bfbdf03404bba01f3e/janitor/biology.py):

import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def join_fasta(
    df: pd.DataFrame, filename: str, id_col: str, column_name: str
) -> pd.DataFrame:
    """
    Convenience method to join in a FASTA file as a column.
    """
    ...
    return df

Statsmodels

A huge amount of usage, using a large API surface in a messy way - not easy to do anything with or draw conclusions from.

NetworkX

Mostly just conversions to support pandas dataframes as input/output values. E.g., from convert.py and convert_matrix.py:

def to_networkx_graph(data, create_using=None, multigraph_input=False):
    """Make a NetworkX graph from a known data structure."""
    # Pandas DataFrame
    try:
        import pandas as pd

        if isinstance(data, pd.DataFrame):
            if data.shape[0] == data.shape[1]:
                try:
                    return nx.from_pandas_adjacency(data, create_using=create_using)
                except Exception as err:
                    msg = "Input is not a correct Pandas DataFrame adjacency matrix."
                    raise nx.NetworkXError(msg) from err
            else:
                try:
                    return nx.from_pandas_edgelist(
                        data, edge_attr=True, create_using=create_using
                    )
                except Exception as err:
                    msg = "Input is not a correct Pandas DataFrame edge-list."
                    raise nx.NetworkXError(msg) from err
    except ImportError:
        warnings.warn("pandas not found, skipping conversion test.", ImportWarning)


def from_pandas_adjacency(df, create_using=None):
    try:
        df = df[df.index]
    except Exception as err:
        missing = list(set(df.index).difference(set(df.columns)))
        msg = f"{missing} not in columns"
        raise nx.NetworkXError("Columns must match Indices.", msg) from err

    A = df.values
    G = from_numpy_array(A, create_using=create_using)

    nx.relabel.relabel_nodes(G, dict(enumerate(df.columns)), copy=False)
    return G

And using the .drop method in group.py:

def prominent_group(
    G, k, weight=None, C=None, endpoints=False, normalized=True, greedy=False
):
    import pandas as pd
    ...
    betweenness = pd.DataFrame.from_dict(PB)
    if C is not None:
        for node in C:
            # remove from the betweenness all the nodes not part of the group
            betweenness.drop(index=node, inplace=True)
            betweenness.drop(columns=node, inplace=True)
    CL = [node for _, node in sorted(zip(np.diag(betweenness), nodes), reverse=True)]

Perspective

A multi-language (streaming) viz and analytics library. The Python version uses pandas in core/pd.py. It uses a small but nontrivial amount of the API, including MultiIndex, CategoricalDtype, and time series functionality.

Scikit-learn

TODO: the usage of Pandas in scikit-learn is very much in flux, and more support for "dataframe in, dataframe out" is being added. So it did not seem to make much sense to just look at the code; rather, it makes sense to have a chat with the people doing the work there.

Matplotlib

Added because it comes up a lot. Matplotlib uses just a "dictionary of array-likes" approach, no dependence on pandas directly. So it will work today with other dataframe libraries as well, as long as their columns can convert to a numpy array.
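For example, matplotlib's data keyword already accepts any mapping whose values convert to numpy arrays:

import matplotlib.pyplot as plt
import numpy as np

# matplotlib only indexes `data` by column name and converts the values to arrays
data = {"x": np.arange(5), "y": np.arange(5) ** 2}
plt.plot("x", "y", data=data)
plt.show()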

Avoiding the "pandas trap"

Split from the discussions in #2.

To avoid the trap of "let's just match pandas", let's collect a list of specific problems with the pandas API, which we'll intentionally deviate from. To the extent possible we should limit this discussion to issues with the API, rather than the implementation.


  • pandas.DataFrame can't implement collections.abc.Mapping because .values is a property returning an array, rather than a method. (added by @TomAugspurger)
  • The "groupby-apply" pattern of passing opaque functions for non-trivial aggregations that are otherwise able to be expressed easily in e.g. SQL (consider: any aggregation expression that involves more than one column) (added by @wesm)
  • Indexing in pandas accepts a variety of different inputs, which each have their own semantics, e.g. passing a function to the __getitem__ or loc/iloc. It is not explicitly clear to new users the difference between df[["a","b","c"]] and df[slice(5)] and df[lambda idx: idx % 5 == 0]. (added by @devin-petersohn)
  • pandas allows the dot operator (__getattr__) to get columns, which causes problems for columns that share names with other APIs. (added by @devin-petersohn)
  • Duplicate APIs:
    • Simple aliases: e.g. isna and isnull, multiply and mul, etc.(added by @devin-petersohn)
    • More complex duplication: e.g. query("a > b") and df[df["a"] > df["b"]] (added by @devin-petersohn)
    • Indexing, there are 7 or 8 ways to get one or more columns in pandas: e.g. __getitem__, __getattr__, loc, iloc, apply, drop (added by @devin-petersohn)
    • merge and join call each other and are confusing for new users (added by @devin-petersohn)
  • Having a separate object to represent a one-column dataframe (i.e. Series), creating all the complexity of having to reimplement most dataframe functionality, and not providing a consistent way of applying operations to N columns (including 1). #6 @datapythonista
  • "missing" APIs / extension points:
    • These are APIs or extension points that pandas and/or numpy lack, and which -- for one reason or another -- have led libraries needing to consume pandas objects (e.g. DataFrame, Series) to hard-code support for these types. This makes pandas work well with these libraries, but means it's not easy (or even possible) for other DataFrame implementations to be supported. Lack of interop support between alternative DataFrame implementations and these libraries can be a small but constant annoyance for users, and in some cases a performance issue as well (if data needs to be converted to a pandas object just to get something to work).
    • Introspection API for autocomplete / "IntelliSense" APIs.
      • In riptide we've implemented a hook + protocol and implemented it on our dataframe class Dataset. This provides more-detailed data compared to what a "static" tool like Jedi can return; compared to dir, our protocol allows our Dataset class to control which columns, properties, etc. are returned for display in autocomplete dropdowns.
      • Our protocol also allows us to provide richer metadata for data columns. For example, the dtype or array subclass name; for Categoricals, we can provide the number of labels/categories.
      • The features mentioned above could alternatively be implemented through some property(ies) on the standardized DataFrame and/or Array APIs (rather than a protocol with a method that returns a more-complex data structure / dictionary).
        ...

Scalar representation

xref #20 (comment)

It was discussed that the API should be agnostic of execution, including eager/lazy evaluation. I think this is easy when operations return data frames (or columns). For example:

df['value'] + 1

If df['value'] is an in-memory representation, or a lazy expression, the result will likely be the same, and no assumptions need to be made.

But if instead, the result is a scalar:

df['value'].sum()

The output type defined in the API can force certain execution strategies and prevent others. For example, if the return type defined in the API is a Python int or float. See this example:

df['value'] + df['value'].sum()

While an implementation might want to keep the result of df['value'].sum() in its C representation for the next operation (the addition), making sum() return a Python object would force the conversion from C to Python and then back to C.

Another example could be Ibis or other SQL-backed implementations. Returning a Python object would cause them to execute a first query for df['value'].sum() and use the result in a second query, while in this example a single SQL query would likely be enough if the computation is delayed until the end.

For the array API it was discussed to use a 0-dimensional array to prevent a similar problem. Assuming we want to do the same for data frames (and not return a Python object directly), I see two main options:

  • Using a 1x1 data frame. I think this could make sense in the df['value'] example; not so sure in other cases like df.count_rows() (see #20), where we could possibly be interested in applying...
  • Creating a scalar type/class that wraps a scalar and can be used by implementations to decide how the data is represented, when it is converted to Python objects... For example, a toy implementation storing the data as a numpy object could look like:
>>> import numpy

>>> class scalar:
...     def __init__(self, value, dtype):
...         self.value = numpy.array(value, dtype=dtype)
...
...     def __repr__(self):
...         return str(self.value)
...
...     def __add__(self, other):
...         return self.value + other

>>> result = scalar(12, dtype='int64')
>>> result
12
>>> result + 3
15

CC: @markusweimer @kkraus14

How to make a future dataframe API available?

This question got asked recently by @mmccarty (and others have brought it up before), so it's worth taking a stab at an answer. Note that this is slightly speculative, given that we only have fragments of a dataframe API rather than a mostly complete syntax + semantics.

A future API, or individual design elements of it, will certainly have (a) new API surface, and (b) backwards-incompatible changes compared to what dataframe libraries already implement. So how should it be made available?

Options include:

  1. In a separate namespace, ala .array_api in NumPy/CuPy,
  2. In a separate retrievable-only namespace, ala __array_namespace__,
  3. Behind an environment variable (NumPy has done this a couple of times, for example with __array_function__ and more recently with dtype casting rules changes),
  4. With a context manager,
  5. With a from __future__ import new_behavior type import (i.e., new features on a per-module basis),
  6. As an external package, which may for example monkeypatch internals (added for completeness, not preferred).

One important difference between arrays and dataframes is that for the former we only have to think about functions, while for the latter we're dealing with methods on the main dataframe objects. Hiding/unhiding methods is a little trickier of course - it can be done based on an environment variable set at import time, but it's more annoying with a context manager.
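As a toy sketch of the import-time environment variable approach for methods (the variable name and method names here are hypothetical; this shows one possible mechanism, not a recommendation):

import os


class DataFrame:
    """Stand-in for a library's dataframe class."""

    def _unique_legacy(self):
        ...  # existing behaviour

    def _unique_standard(self):
        ...  # behaviour described by the API standard


# Decide once, at import time, which behaviour the public method exposes.
if os.environ.get("DATAFRAME_API_STANDARD") == "1":
    DataFrame.unique = DataFrame._unique_standard
else:
    DataFrame.unique = DataFrame._unique_legacy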

For behavior it's kind of the opposite: likely not all code will work with new behavior, so granular control helps, and a context manager is probably better.

Experiences with a separate namespace for the array API standard

The short summary of this is:

  • there's a problem where we now have two array objects, and supporting both in a code base is cumbersome and requires bi-directional conversions.
  • a summary of this problem and approaches taken in scikit-learn and SciPy to work around it are described in data-apis/array-api#400
  • in NumPy the preferred solution direction longer term is to make the main numpy namespace converge to the array API standard; this takes time because of backwards compatibility constraints, but will avoid the "double namespaces" problem and have multiple other benefits, for example solving long-standing issues that Numba, CuPy etc. are running into.

Therefore, using a separate namespace to implement dataframe API standard features/compatibility should likely not be the preferred solution.

Using a context manager

Pandas already has a context manager, namely pandas.option_context. This is used for existing options, see pd.describe_option(). While most features are related to display, styling and I/O, some features that can be controlled are quite large and similar in style to what we'd expect to see in a dataframe API standard. Examples:

  • mode.chained_assignment (raise, warn, or ignore)
  • mode.data_manager ("block" or "array")
  • mode.use_inf_as_null (bool)

It could be used similarly to currently available options, one option per feature:

 with pd.option_context('mode.casting_rules', 'api-standard'):
     do_stuff()

Or there could be a single option to switch to "API-compliant mode":

 with pd.option_context('mode.api_standard', True):
     do_stuff()

Or both of those together.

Question: do other dataframe libraries have a similar context manager?

Using a from __future__ import

It looks like it's possible to implement features with a from __future__ import itself, via import hooks (see Reference 3 below). That way the spelling would be uniform across libraries, which is nice. Alternatively, a from dflib.__future__ import X is easier (no import hooks); however, it runs into the problem also described in Ref 3: it is not desirable to propagate options to nested scopes:

from pandas.__future__ import api_standard_unique

# should use the `unique` behavior described in the API standard
df.unique()

from other_lib import do_stuff

# should NOT use the `unique` behavior described in the API standard,
# because that other library is likely not prepared for that.
do_stuff(df)

Now of course this scope propagation is also what a context manager does. However, the point of a from __future__ import and jumping through the hoops required to make that work (= more esoteric than a context manager) is to gain a switch that is local to the Python module in which it is used.

Comparing a context manager and a from __future__ import

For new functions, methods and objects both are pretty much equivalent, since they will only be used on purpose (the scope propagation issue above is irrelevant).

For changes to existing functions or methods, both will work too. The module-local behavior of a from __future__ import is probably preferred, because code that's imported from another library that happens to use the same functionality under the hood may not expect the different result/behavior.

For behavior changes there's an issue with the from __future__ import. The import hooks will rely on AST transforms, so there must be some syntax to trigger on. With something that's very implicit, like casting rules, there is no such syntax. So it seems like there will be no good way to toggle that behavior on a module-scope level.

My current impression

  • A separate namespace is not desired, and a separate dataframe object is really not desired,
  • An environment variable is easy to implement, but pretty coarse - and given the fairly extensive backwards-compatibility issues that are likely, probably not good enough,
  • A context manager is nicest for behavior, and fine for new methods/functions
  • The from __future__ import xxx is perhaps best for adoption of changes to existing functions or methods; it has a configurable level of granularity and is explicit, so it should be more robust there than a context manager.

References

  1. somewhat related discussion on dataframe namespaces: #23
  2. data-apis/array-api#16
  3. https://stackoverflow.com/questions/29905278/using-future-style-imports-for-module-specific-features-in-python (by @shoyer)

Meaning of Column.offset?

Is its use similar to that in Arrow, such that if you slice a string array it is still backed by the same buffers, but the offset and length of the column convey which part of the buffer should be used?
If that is the case, this can always be 0 for numpy and primitive Arrow arrays (except for Arrow booleans, since those are bits), since we can always slice them, right?
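For reference, this is how slicing behaves in pyarrow: the sliced array keeps the original buffers and only adjusts offset and length.

import pyarrow as pa

arr = pa.array(["a", "bb", "ccc", "dddd"])
sliced = arr.slice(1, 2)           # no copy: same buffers as `arr`
print(sliced.offset, len(sliced))  # 1 2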

Get and set column names

Regarding column names, the following proposal, similar to what pandas currently does, uses a columns property to get and set column names.

In #7, the preference is to restrict column names to strings, and not allow duplicates.

The proposed API with an example is:

>>> df = dataframe({'col1': [1, 2], 'col2': [3, 4]})
>>> df.columns = 'foo', 'bar'
>>> df.columns = ['foo', 'bar']
>>> df.columns = map(str.upper, df.columns)
>>> df.columns
['FOO', 'BAR']

And the following cases would fail:

>>> df.columns = 1
TypeError: Columns must be an iterable, not int
>>> df.columns = 'foo'
TypeError: Columns must be an iterable, not str
>>> df.columns = 'foo', 1
TypeError: Column names must be str, int found
>>> df.columns = 'foo', 'bar', 'foobar'
ValueError: Expected 2 column names, found 3
>>> df.columns = 'foo', 'foo'
ValueError: Column names cannot be duplicated. Found duplicates: foo

Some things that people may want to discuss:

  • Using a different name for the property (e.g. column_names)
  • Being able to set a single column with df.columns[0] = 'foo' (the proposal doesn't allow it)
  • The return type of the columns (the proposal returns a Python list, pandas returns an Index)
  • Setting the columns of a dataframe with a single column via df.columns = 'foo' (the proposal requires an iterable, so df.columns = ['foo'] or equivalent is needed).

In case it's useful, this is the implementation of the examples:

import collections
import typing


class dataframe:
    def __init__(self, data):
        self._columns = list(data)

    @property
    def columns(self) -> typing.List[str]:
        return self._columns
    
    @columns.setter
    def columns(self, names: typing.Iterable[str]):
        if not isinstance(names, collections.abc.Iterable) or isinstance(names, str):
            raise TypeError(f'Columns must be an iterable, not {type(names).__name__}')

        names = list(names)

        for name in names:
            if not isinstance(name, str):
                raise TypeError(f'Column names must be str, {type(name).__name__} found')
        
        if len(names) != len(self._columns):
            raise ValueError(f'Expected {len(self._columns)} column names, found {len(names)}')

        if len(set(names)) != len(self._columns):
            duplicates = set(name for name in names if names.count(name) > 1)
            raise ValueError(f'Column names cannot be duplicated. Found duplicates: {", ".join(duplicates)}')

        self._columns = names

Signature for a standard `from_dataframe` constructor function

One of the "to be decided" items at https://github.com/data-apis/dataframe-api/blob/dataframe-interchange-protocol/protocol/dataframe_protocol_summary.md#to-be-decided is:

Should there be a standard from_dataframe constructor function? This isn't completely necessary, however it's expected that a full dataframe API standard will have such a function. The array API standard also has such a function, namely from_dlpack. Adding at least a recommendation on syntax for this function would make sense, e.g., from_dataframe(df, stream=None). Discussion at #29 (comment) is relevant.

In the announcement blog post draft I tentatively answered that with "yes", and added an example. The question is what the desired signature should be. The Pandas prototype currently has the most basic signature one can think of:

def from_dataframe(df : DataFrameObject) -> pd.DataFrame:
    """
    Construct a pandas DataFrame from ``df`` if it supports ``__dataframe__``
    """
    if isinstance(df, pd.DataFrame):
        return df

    if not hasattr(df, '__dataframe__'):
        raise ValueError("`df` does not support __dataframe__")

    return _from_dataframe(df.__dataframe__())

The above just takes any dataframe supporting the protocol, and turns the whole thing into the "library-native" dataframe. Now of course, it's possible to add functionality to it, to extract only a subset of the data. Most obviously, named columns:

def from_dataframe(df: DataFrameObject, *, colnames: Optional[Iterable[str]] = None) -> pd.DataFrame:
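A rough sketch of how such a keyword could work on top of the pandas prototype above (select_columns_by_name is a hypothetical method on the protocol object; _from_dataframe is the prototype helper already shown):

def from_dataframe(df, *, colnames=None) -> pd.DataFrame:
    """
    Construct a pandas DataFrame from ``df``, optionally restricted to the
    named columns.
    """
    if not hasattr(df, '__dataframe__'):
        raise ValueError("`df` does not support __dataframe__")

    dfobj = df.__dataframe__()
    if colnames is not None:
        # hypothetical: ask the protocol object for a column subset
        dfobj = dfobj.select_columns_by_name(list(colnames))
    return _from_dataframe(dfobj)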

Other things we may or may not want to support:

  • columns by index
  • get a subset of chunks

My personal feeling is:

  • columns by index: maybe, and if we do then with a separate keyword like col_indices=None
  • a subset of chunks: probably not. This is more advanced usage, and if one needs it it's likely one wants to get the object returned by __dataframe__ first, then inspect some metadata, and only then decide what chunks to get.

Thoughts?

API for viewing the frame

Interactive users will want to control how the data is displayed. This might include sorting the view; coloring cells, columns, or rows; precision digits; or moving columns to the left. It may also interact with autocomplete.

It is common practice to separate the view from the data (many applications can display data in a SQL database in different ways).

I believe that we need to define an interface to the display data class (for instance, an ordered dictionary of strings containing arrays is the simplest interface; additional kwargs might include display attributes for rows or columns, and there might be header or footer information).

Thus, I believe it is in scope to define an interface so that multiple developers can write their own display data class. Almost every demonstration needs a way to display large amounts of data well.

Interchange between two dataframe types which use the same native storage representation

This was brought up by @jorisvandenbossche: if two libraries both use the same library for in-memory data storage (e.g. buffers/columns are backed by NumPy or Arrow arrays), can we avoid iterating through each buffer on each column by directly handing over that native representation?

This is a similar question to https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol_summary.md#what-is-wrong-with-to_numpy-and-to_arrow - but it's not the same, there is one important difference. The key point of that FAQ entry is that it's consumers who should rely on NumPy/Arrow, and not producers. Having a to_numpy() method somewhere is at odds with that. Here is an alternative:

  1. A Column instance may define __array__ or __arrow_array__ if and only if the column itself is backed by a single NumPy or an Arrow array.
  2. DataFrame and Buffer instance must not define __array__ or __arrow_array__.

(1) is motivated by wanting a simple shortcut like this:

    # inside `from_dataframe` constructor
    for name in df.column_names():
        col = df.get_column_by_name(name)
        # say my library natively uses Arrow:
        if hasattr(col, '__arrow_array__'):
            # apparently we're both using Arrow, take the shortcut
            columns[name] = col.__arrow_array__()
        elif ...: # continue parsing dtypes, null values, etc.

However, there are other constraints then. For __array__ this then also implies:

  • the column has either no missing values or uses NaN or a sentinel value for nulls (and this needs checking first in the code above - otherwise the consumer may still misinterpret the data)
  • this does not work for categorical or string dtypes - those are not representable by a single array

For __arrow_array__ I cannot think of issues right away. Of course the producer should also be careful to ensure that there are no differences in behavior due to adding one of these methods. For example, if there's a dataframe with a nested dtype that is supported by Arrow but not by the protocol, calling __dataframe__() should raise because of the unsupported dtype.

The main pro of doing this is:

  • A potential performance gain in the dataframe conversion (TBD how significant)

The main con is:

  • Extra code complexity to get that performance gain, because now there are two code paths on the consumer side and both must be equivalent.

My impression is: this may be useful to do for __arrow_array__; I don't think it's a good idea for __array__, because the gain is fairly limited and there are too many constraints or ways to get it wrong (e.g. describe_null must always be checked before using __array__). If __array__ is to be added, then maybe at the Buffer level, where it plays the same role as __dlpack__.

storing & exchange of categorical dtypes

Categorical dtypes

xref gh-26 for some discussion on categorical dtypes.

What it looks like in different libraries

Pandas

The dtype is called category there. See pandas.Categorical docs:

>>> df = pd.DataFrame({"A": [1, 2, 5, 1]})
>>> df["B"] = df["A"].astype("category")

>>> df.dtypes
A       int64
B    category
dtype: object

>>> col = df['B']
>>> col.dtype
CategoricalDtype(categories=[1, 2, 5], ordered=False)

>>> col.values.ordered
False
>>> col.values.codes
array([0, 1, 2, 0], dtype=int8)
>>> col.values.categories
Int64Index([1, 2, 5], dtype='int64')
>>> col.values.categories.values
array([1, 2, 5])

Apache Arrow

The dtype is called "dictionary-encoded" in Arrow - so a column with a categorical dtype is called a "dictionary-encoded array" there.
See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions for details.

A practical example (from @kkraus14 in gh-38), for a categorical column of
['gold', 'bronze', 'silver', null, 'bronze', 'silver', 'gold'] with categories of
['gold' < 'silver' < 'bronze']:

categorical column: {
    mask_buffer: [119], # 01110111 in binary
    data_buffer: [0, 2, 1, 127, 2, 1, 0], # the 127 value in here is undefined since it's null
    children: [
        string column: {
            mask_buffer: None,
            offsets_buffer: [0, 4, 10, 16],
            data_buffer: [103, 111, 108, 100, 115, 105, 108, 118, 101, 114, 98, 114, 111, 110, 122, 101]
        }
    ]
}
struct ArrowSchema {
  // Array type description
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;  // the categories
  ...
};

struct ArrowArray {
  // Array data description
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  ...
};

Also see https://arrow.apache.org/docs/python/data.html#dictionary-arrays for what PyArrow does - it matches the current exchange protocol more closely than the Arrow C Data Interface. E.g., it uses an actual Python dictionary for the mapping of values to categories.

Vaex

EDIT: Vaex's API was done pre Arrow integration, and will change to match Arrow in the future.

>>> import vaex
>>> df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
>>> df = df.categorize('year', min_value=2020, max_value=2019)
>>> df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
>>> 
>>> df.dtypes
year       int64
weekday    int64
dtype: object
>>> df.is_category('year')
True
>>> df.is_category('weekday')
True
>>> df._categories
{'year': {'labels': [], 'N': 0, 'min_value': 2020}, 'weekday': {'labels': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], 'N': 7, 'min_value': 0}}

Other libraries

  • Modin follows Pandas
  • Dask follows Pandas
  • Koalas does not support categorical dtypes at all

Exchange protocol

This is the current form in gh-38 for the Pandas implementation of the exchange protocol:

>>> col = df.__dataframe__().get_column_by_name('B')
>>> col
<__main__._PandasColumn object at 0x7f0202973211>
>>> col.dtype  # kind, bitwidth, format-string, endianness
(23, 64, '|O08', '=')

>>> col.describe_categorical  # is_ordered, is_dictionary, mapping
(False, True, {0: 1, 1: 2, 2: 5})

>>> col.describe_null  # kind (2 = sentinel value), value
(2, -1)
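As a small consumer-side sketch, this is how the values could be reconstructed from the codes buffer plus the metadata shown above (codes from the pandas example, the mapping from describe_categorical, and the -1 sentinel from describe_null):

import numpy as np

codes = np.array([0, 1, 2, 0], dtype=np.int8)   # contents of the data buffer
mapping = {0: 1, 1: 2, 2: 5}                    # from describe_categorical
sentinel = -1                                   # from describe_null

values = [None if c == sentinel else mapping[c] for c in codes]
print(values)  # [1, 2, 5, 1]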

Changes needed & discussion points

What we already determined needs changing:

  1. Add get_children() method, and store the mapping that is now in Column.describe_categorical in a child column instead. Note that child columns are also needed for variable-length strings.

To discuss:

  1. If dtype is the logical dtype for the column, where to store how to interpret the actual data buffer? Right now this is done not in a static attribute but by returning the dtype along with the buffer when accessing it:
    def get_data_buffer(self) -> Tuple[_PandasBuffer, _Dtype]:
        """
        Return the buffer containing the data.
        """
        _k = _DtypeKind
        if self.dtype[0] in (_k.INT, _k.UINT, _k.FLOAT, _k.BOOL):
            buffer = _PandasBuffer(self._col.to_numpy())
            dtype = self.dtype
        elif self.dtype[0] == _k.CATEGORICAL:
            codes = self._col.values.codes
            buffer = _PandasBuffer(codes)
            dtype = self._dtype_from_pandasdtype(codes.dtype)
        else:
            raise NotImplementedError(f"Data type {self._col.dtype} not handled yet")

        return buffer, dtype
  2. What goes in the data buffer on the column? The category-encoded data makes sense, because the buffer needs to be the same size as the column (number of elements), otherwise it would be inconsistent with other dtypes.

    • What happens when the data is strings?

API candidates for standardization

The following is a list of API candidates for standardization.

Notes

  • The list is derived from #22, #3, and API comparison data.
  • The list does not include statistics APIs which have already been spec'd (see #33).
  • The list is not exhaustive. This list is intended to provide an initial focus for standardization.
  • The categories are loosely defined and follow precedent established in the array API specification.

Math

abs
floordiv
pow
round
truediv

add
diff
div
mod
mul
sub

Any need for the r* variants? (e.g., radd, rmul, etc)

Statistics

corr
count
cov

Comparison

eq
ge
gt
le
lt
ne

Logical

isin
isna
notna

isna or isnull? notna or notnull?

Searching

where

Creation/Manipulation

append
assign
copy
drop
drop_duplicates
dropna
fillna
head
join
pop
rename
replace
set_index
tail
take

Sorting

sort_values

Utilities

all
any

Implicit alignment in operations

In #2 there seems to be some agreement that row-labels are an important component of a dataframe. Pandas takes this a step further by using them for alignment in many operations involving multiple dataframes.

In [10]: a = pd.DataFrame({"A": [1, 2, 3]}, index=['a', 'b', 'c'])

In [11]: b = pd.DataFrame({"A": [2, 3, 1]}, index=['b', 'c', 'a'])

In [12]: a
Out[12]:
   A
a  1
b  2
c  3

In [13]: b
Out[13]:
   A
b  2
c  3
a  1

In [14]: a + b
Out[14]:
   A
a  2
b  4
c  6

In the background there's an implicit a.align(b), which reindexes the dataframes to a common index. The resulting index will be the union of the two indices.
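With non-overlapping labels, the union behaviour becomes visible through the introduced missing values (plain pandas illustration):

c = pd.DataFrame({"A": [1, 2]}, index=['a', 'b'])
d = pd.DataFrame({"A": [10, 20]}, index=['b', 'c'])

c + d
#       A
# a   NaN
# b  12.0
# c   NaN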

A few other places this occurs

  • Indexing a DataFrame / Series with an integer or boolean series
  • pd.concat
  • DataFrame constructor

Do we want to adopt this behavior for the standard?

Interchange protocol use case: getting certain columns as numpy array

I think it's useful to think through concrete use cases of how the interchange protocol could be used, to see whether it covers those use cases and whether the desired APIs are available.
One example use case could be matplotlib's plot("x", "y", data=obj), where matplotlib already supports getting the x and y column of any "indexable" object. Currently they require obj["x"] to give the desired data, so in theory this support could be extended to any object that supports the dataframe interchange protocol. But at the same time, matplotlib currently also needs that data (AFAIK) as numpy arrays, because the low-level plotting code is implemented that way.

With the current API, matplotlib could do something like:

df = obj.__dataframe__()
x_values = some_utility_func(df.get_column_by_name("x").get_buffers())

where some_utility_func can convert the dict of Buffer objects to a numpy array (once numpy supports dlpack, converting the Buffer objects to numpy will become easy, but the function will then still need to handle potentially multiple buffers returned from get_buffers()).
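A minimal sketch of what such a helper might look like for the simplest case, assuming get_buffers() returns a dict with "data", "validity" and "offsets" entries (the "data" entry being a (Buffer, dtype) pair), and that numpy's DLPack support (numpy >= 1.22) is available:

import numpy as np


def some_utility_func(buffers):
    """Convert the result of ``Column.get_buffers()`` to a numpy array.

    Only handles the simple case: fixed-width data with no validity or
    offsets buffers (strings, masked columns, etc. need more work).
    """
    if buffers.get("validity") is not None or buffers.get("offsets") is not None:
        raise NotImplementedError("only plain fixed-width columns are handled here")
    data_buffer, _data_dtype = buffers["data"]
    # zero-copy handover once the producer's Buffer implements __dlpack__
    return np.from_dlpack(data_buffer)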

That doesn't seem ideal: 1) writing the some_utility_func to do the conversion to numpy is non-trivial to implement for all different cases, 2) should an end-user library have to go down to the Buffer objects?

This isn't a pure interchange from one dataframe library to another, so we could also say that this use case is out-of-scope at the moment. But on the other hand, it seems a typical use case example, and could in theory already be supported right now (it only needs the "dataframe api" to get a column, which is one of the few things we already provide).

(disclaimer: I am not a matplotlib developer, I also don't know if they for example have efforts to add support for generic array-likes (but it's nonetheless a typical example use case, I think))

Arrow format string for format_str in _dtype_from_vaexdtype()

The code now uses NumPy format strings, while the docs for Column.dtype specify it must use the format string from the Apache Arrow C Data Interface (similar but slightly different). So we need a utility to map NumPy to Arrow format here.

Example - should say 'b', not '|b1':

df = pd.DataFrame({"A": [True, False, False, True]})

>>> df.__dataframe__().get_column_by_name('A').dtype
(<_DtypeKind.BOOL: 20>, 8, '|b1', '|')

Source:
https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings
https://numpy.org/doc/stable/reference/arrays.interface.html#arrays-interface
https://numpy.org/doc/stable/reference/generated/numpy.dtype.itemsize.html
https://numpy.org/doc/stable/reference/generated/numpy.dtype.byteorder.html
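A minimal sketch of such a utility, covering only the basic fixed-width types (the full mapping would also need datetimes, strings, etc.; the format characters follow the Arrow C Data Interface spec linked above):

import numpy as np

# (kind, itemsize) -> Arrow format character
_NP_TO_ARROW = {
    ("b", 1): "b",                                               # boolean
    ("i", 1): "c", ("i", 2): "s", ("i", 4): "i", ("i", 8): "l",  # signed ints
    ("u", 1): "C", ("u", 2): "S", ("u", 4): "I", ("u", 8): "L",  # unsigned ints
    ("f", 2): "e", ("f", 4): "f", ("f", 8): "g",                 # floats
}


def arrow_format_str(np_dtype) -> str:
    """Map a NumPy dtype to the corresponding Arrow format string."""
    dt = np.dtype(np_dtype)
    try:
        return _NP_TO_ARROW[(dt.kind, dt.itemsize)]
    except KeyError:
        raise NotImplementedError(f"no Arrow format string known for {dt}")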

Using Array API functions on DataFrame objects

In some cases users like to use Array API functions (for example where) on DataFrame objects (in particular Series). Is this something that we would like to support in the API? If not, how would we recommend users approach these kinds of problems?

For an example of this, please see dask/distributed#5224.
