wesm / pandas2

Design documents and code for the pandas 2.0 effort.
Home Page: https://pandas-dev.github.io/pandas2/
nuff said
at least where possible / documented
(Moved from https://github.com/wesm/pandas2-design/issues/1)
Disclaimer: I'm not involved in pandas development so my opinion here is not very informed. Sorry about that. :-/
According to my (little) experience in software development, huge refactors and changes in a library are often more difficult to drive forward and expose a larger bug surface than small, incremental changes. Moreover, releasing new versions with incremental changes helps users to test new functionality/changes and allocate enough time to identify and report new bugs.
Could changes proposed in this document be broken into independent "modules" that can be worked on separately and incrementally? For example:
Even though (2) is probably required to be able to solve (3) and (4), we could get (2) done in a 2.0 release, then (3) in 2.1, (4) in 2.2... etc.
Does that sound reasonable or do you believe it is better to make a big release with all these changes in a row? Are these changes more coupled than I think?
We have `merge()` and `merge_asof()`. There may even come a time when we perform functions on overlapping columns. As someone who wants to join two tables together, I just want a single mechanism to do so.
I wonder if it's possible to have a single API like:

```python
merge(
    left,     # DataFrame or Table
    right,    # DataFrame or Table
    on,       # one or more columns
    asof,     # one or more columns
    how,      # 'left', 'right', 'inner', 'outer'
    overlap,  # optional function to apply to overlapping column names
)
```
Users must specify at least one of `on` or `asof`. There can also be `left_on`/`right_on` and `left_asof`/`right_asof`. We could even have `left_index`/`right_index` for the poor souls who still have indexed data (#17).
The `overlap` argument is for when the same column name appears in both tables. Currently those columns are renamed with a suffix (though I'd be in favor of just raising an error). But there are times when I want to apply a function instead. There are ways to do this with arithmetic operations (#30), though I think any function of two arguments would be nice, including overwriting the left with the right (for handling cases of missing data with a "fill" result).
Note that this doesn't handle my proposed `merge_window()` (pandas-dev/pandas#13959). The semantics there are very specific and I'm not sure how to fit them into a unified structure as above, though I'd love to hear any ideas.
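A rough sketch of how such a unified front end might dispatch to today's `merge()` and `merge_asof()`. The `unified_merge` name and the dispatch rule are hypothetical, purely for illustration; `overlap` handling is omitted:

```python
import pandas as pd

def unified_merge(left, right, on=None, asof=None, how='inner', overlap=None):
    # Hypothetical dispatcher: an asof key routes to merge_asof(),
    # otherwise we fall back to an ordinary merge().
    # (overlap handling not sketched here)
    if asof is not None:
        # merge_asof requires the asof key to be sorted
        return pd.merge_asof(left.sort_values(asof), right.sort_values(asof),
                             on=asof, by=on)
    return pd.merge(left, right, on=on, how=how)

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'y': [3, 4]})
result = unified_merge(left, right, on='key', how='inner')
```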
We'll want to develop a benchmark suite to compare pandas 2.0 performance versus 1.x, especially for microbenchmarks. What's the right tool for this? asv, vbench?
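For reference, an asv benchmark is just a class with a `setup()` method and `time_*` methods that asv discovers and times. The benchmark below is illustrative, not part of any existing suite:

```python
import numpy as np
import pandas as pd

class GroupByAgg:
    # asv calls setup() before timing each time_* method
    def setup(self):
        n = 100_000
        self.df = pd.DataFrame({'key': np.random.randint(0, 100, n),
                                'value': np.random.randn(n)})

    def time_groupby_sum(self):
        # the body is what gets timed
        self.df.groupby('key')['value'].sum()
```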
IIRC this is from the design docs, but I wanted to make an issue to remember. We want to have a set of lazily computed array attributes. Sometimes these can be set at creation time based on the creation method / dtype. If the array is immutable then these are not affected by indexing checks.
- immutability / read-only, xref pandas-dev/pandas#14359
- `unique`
- `is_monotonic*`
- `has_nulls`
- `is_hashable` - only on non-homogeneous dtypes, usually true on `object` dtypes (but NOT if the elements are mutable). The issue is that this can currently be expensive to figure out (as you need to iterate over and call `hash` on each element). xref

E.g. imagine a `pd.date_range(....., ...)`; then `unique`, `monotonic`, and `has_nulls` are trivial to compute at creation time. Since this is currently an `Index` in pandas it is immutable by definition.
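A minimal Python sketch of the lazy-attribute idea using `functools.cached_property`; the `LazyStats` class and its attribute names are illustrative only:

```python
from functools import cached_property

import numpy as np

class LazyStats:
    """Sketch: statistics are computed on first access and cached;
    an immutable array never needs to invalidate them."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype='float64')
        self._values.setflags(write=False)  # immutable -> caches stay valid

    @cached_property
    def has_nulls(self):
        # computed lazily, then cached on the instance
        return bool(np.isnan(self._values).any())

    @cached_property
    def is_monotonic_increasing(self):
        return bool(np.all(self._values[1:] >= self._values[:-1]))

arr = LazyStats([1.0, 2.0, 3.0])
```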
`.lookup` (pandas-dev/pandas#7138), for coordinate access, is useful but is not incorporated in a generalized indexer.
pandas would potentially benefit from a more efficient decimal dtype, possibly using libmpdec (what CPython uses internally) for the internal implementation.
Similar to the `ARRAY` type found in SQL variants with nested types. See also the `List` type in Apache Arrow.
Copying my comment from pandas-dev/pandas#10000 (comment):
We should consider making arithmetic between a Series and a DataFrame broadcast across the columns of the dataframe, i.e., aligning `series.index` with `df.index`, rather than the current behavior of aligning `series.index` with `df.columns`.
I think this would be far more useful than the current behavior, because it's much more common to want to do arithmetic between a series and all columns of a DataFrame. This would make broadcasting in pandas inconsistent with NumPy, but I think that's OK for a library that focuses on 1D/2D rather than N-dimensional data structures.
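To make the two alignment behaviors concrete with today's API (`df.add(s, axis=0)` is the current explicit spelling of the proposed default):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]}, index=['x', 'y'])
s_cols = pd.Series({'a': 1, 'b': 2})
s_rows = pd.Series({'x': 1, 'y': 2})

# Current default: series index aligns with df.columns
by_cols = df + s_cols            # adds 1 to column 'a', 2 to column 'b'

# The proposed default (index alignment) must be spelled explicitly today
by_rows = df.add(s_rows, axis=0)  # adds 1 to row 'x', 2 to row 'y'
```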
On some contemplation, I am thinking it may lead to overall cleaner C++ code if we use exceptions for error reporting instead of status codes. These exceptions should be truly exceptional -- for example, it would be better to deal with argument validation and "can't fail" exceptions with DCHECK
macros. User-visible argument validation can be handled in the Python wrapper layer.
Obvious / currently supported see here
xref #20
Informational, may want to think about the desirability of adding later
non-support (try to raise informative errors & point to ready-made solns)
object)

Currently, there is no way to index a list of values in pandas without inserting NA for missing values. It could be nice to make this possible, either by making a variation of `.loc` that raises `KeyError` in such cases or by changing the behavior of `.loc` altogether.
In xarray, `.loc` only works with pre-existing index labels for getitem (`df.loc[key]`) and assignment (`df.loc[key] = values`; inserting new columns is OK). Reindexing behavior can still be obtained with explicit calls to `.reindex`.
Conceivably, we could make things work in the same way for pandas. Two major implications of such a change:

`pandas/core/indexing.py` is some of the trickiest code in pandas, in part because it handles cases like inserting NAs and in part because it tries to handle all possible variations of indexing with minimal code duplication. I don't envy anyone who takes on the task of translating such logic to C++.

Patterns like this would stop working:

```python
df = pd.DataFrame()
for row, col, value in data:
    df.loc[row, col] = value
```

In my view this is a positive, but it would certainly be a big backwards compatibility break.
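For comparison, the NA-inserting behavior is already available as an explicit `.reindex` call, which is the spelling the proposal would push users toward:

```python
import pandas as pd

s = pd.Series([1, 2], index=['a', 'b'])

# Explicit reindexing keeps the NA-inserting behavior available:
expanded = s.reindex(['a', 'c'])   # 'c' is filled with NaN

# Under the proposal, label-based getitem with a missing key would raise
# (recent pandas already raises KeyError for s.loc[['a', 'c']]).
```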
R's "new" pipes, combined with easily added functions, have made R's data handling much easier to read and to extend than pandas. The advantage is IMO twofold:

Piping (the `.` method-chaining notation in Python) is much easier to read than nested function calls: `df %>% func(...) %>% func2(...)` and `df.func(...).func2(...)` vs `func2(func(df, ...), ...)`.

Pandas nowadays has a `df.pipe()` method, but that looks much clumsier compared to the elegance of a separate pipe operator.
So I would like to see pandas2 reserve one of the less-needed operators (e.g. `>>`?) for a pipe interface. The pipe interface would let users define new functions, which return a small object that would be used in the `>>` operator; probably doable with a decorator around the function.

As this wasn't possible before because it is an API-breaking change, I would like to propose that it be done in pandas2.
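A minimal sketch of how a `>>` pipe could be built with a decorator and `__rrshift__`; the `Pipeable`/`pipeable` names are made up for illustration:

```python
class Pipeable:
    """Sketch: wraps a function call so that `obj >> func(args)` works."""

    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def __rrshift__(self, obj):
        # invoked for `obj >> self` when obj doesn't handle >> itself
        return self.func(obj, *self.args, **self.kwargs)

def pipeable(func):
    """Decorator: calling the function returns a Pipeable, deferring
    evaluation until the left-hand operand arrives via >>."""
    def wrapper(*args, **kwargs):
        return Pipeable(func, *args, **kwargs)
    return wrapper

@pipeable
def add(x, n):
    return x + n

result = 1 >> add(2) >> add(3)
```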
The rules for exactly what `DataFrame.__getitem__`/`__setitem__` does (pandas-dev/pandas#9595) are sufficiently complex and inconsistent that they are impossible to understand without extensive experimentation.

This makes for a rather embarrassing situation that we really should fix for pandas 2.0.
I made a proposal when this came up last year:
I still like my proposal, but more importantly, it satisfies two important criteria (`df['foo']`, `df[['foo', 'bar']]`, and `df[df['foo'] == 'bar']` might cover 80% of use cases).

I have closed that issue, but we should do this when implementing the pandas 2.0 memory allocator.
I really like the proposal so far 👍
Are there any plans to provide a semi-stable C++/Cython API that could be used by other projects for things beyond simple pandas integration? (As in being able to write your own alternative to `DataFrame`.)
Specifically, I'm thinking of writing a "pandas for structured data" that allows you to perform in-memory queries on records with optional and repeating fields, using the trick from the Dremel paper to store the data as contiguous arrays.
It would be cool if I could make use of some of the interfaces in this proposal to simplify and speed up interop with pandas.
Another example is xarray, which is inspired by pandas, but exposes an alternative data structure for working with N-dimensional datasets.
we may consider a fixed-size memory pool (which could be managed with an LRU stack) for hash table data to avoid excess internal index hash tables
xref #9
Maybe we can collect a list of pandas issues that have happened in and around this.
I've found it's valuable to be able to consistently compute statistics including the NA values, especially with multiple group keys. I haven't kept track of how pandas handles these now in all cases, but it would be nice to come up with a strategy to make NA behave like any other group in a group by setting.
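For reference, pandas eventually added a `dropna` keyword to `groupby` (in 1.1) that makes NA behave like any other group key:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', None], 'v': [1, 2, 3]})

# Default: rows with an NA key are dropped from the result
dropped = df.groupby('key')['v'].sum()

# dropna=False (pandas >= 1.1) keeps NA as its own group
kept = df.groupby('key', dropna=False)['v'].sum()
```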
pandas `dtype` coercion is useful in most cases, but there are situations where it should be prohibited to avoid unexpected values being included. The behavior should be switchable, like:

```python
s = pd.Series([1, 2, 3])
s[1] = 1.2
s
# 0    1
# 1    1
# 2    3
# dtype: int64

s[1] = 'x'
s
# 0    1
# 1    x
# 2    3
# dtype: object

s = pd.Series([1, 2, 3], cast=False)  # proposed keyword
s[1] = 'x'
# TypeError...
```
If any array/Series statistics have been computed, we should serialize them:
I'm thinking we can come up with a plan to yield a better `.append` implementation that defers stitching together arrays until it's actually needed for computations. We can do this by having a virtual `pandas::Table` interface that consolidates fragmented columns only when they are requested. Will think some more about this.
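A Python sketch of the deferred-consolidation idea; the `ChunkedColumn` class is hypothetical:

```python
import numpy as np

class ChunkedColumn:
    """Sketch: appends collect chunks; concatenation is deferred until
    the values are actually needed for a computation."""

    def __init__(self):
        self._chunks = []
        self._materialized = None

    def append(self, values):
        # O(1): just record the new chunk and invalidate the cache
        self._chunks.append(np.asarray(values))
        self._materialized = None

    def values(self):
        # consolidate fragmented chunks only on first access
        if self._materialized is None:
            self._materialized = np.concatenate(self._chunks)
        return self._materialized

col = ChunkedColumn()
col.append([1, 2])
col.append([3])
```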
As I've been prototyping I've copied over a bunch of C++ code from Arrow (https://github.com/apache/arrow). I'm not sure maintaining near clones of the same code in two places makes sense (see @xhochy's comment here b982d96#commitcomment-19406430).
The code in question is:

- `src/pandas/util/bit-util.h`

Sharing this code means adding libarrow as a build / runtime dependency; if this causes problems in some way, we can absorb the bits of the library that are being used in pandas. We should definitely set `using` aliases so that we are not using the `arrow::` namespace directly in the code for these low-level bits.
Later, we can also potentially take advantage of `arrow::io`, a small IO subsystem for dealing with files, memory maps, etc. This may be useful for revamping the CSV reader.
When we look at adding nested data types to pandas, or even a new string array type, we may want to consider using the Arrow memory layout, so having this in the build toolchain may make life easier in a number of ways.
xref #15
I brought this up at SciPy 2015, but there's a significant performance win available in expressions like:

```python
df[boolean_cond].groupby(grouping_exprs).agg(agg_expr)
```

If you do this currently, it will produce a fully materialized copy of `df` even if the groupby only touches a small portion of the DataFrame. Ideally, we'd have:

```python
df.groupby(grouping_exprs, where=boolean_cond).agg(...)
```

I put this as a design / pandas2 issue because the boolean bytes / bits will need to get pushed down into the various C-level groupby subroutines.
Do we want to continue to use NaN? There are possible computational benefits to doing so, but with pandas 2.0 we have the opportunity to change to using bitmaps everywhere, which brings a lot of consistency to how missing data is handled (for example: `isnull` becomes free under this scenario). We may want to do some benchmarking to understand the overhead of using bitmaps in arithmetic and aggregations (rather than letting the CPU propagate NaN where relevant).
One benefit of the bitmap route is that we can use hardware popcount to skip null checking on groups of 64 or 128 bits at a time (or more, if there are AVX popcount instructions available, not actually sure about this), so the performance in aggregations may actually get a lot better on mostly non-null data.
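The popcount idea in miniature, as a pure-Python stand-in for the hardware instruction (the `count_valid` helper is illustrative):

```python
# A validity bitmap packed into 64-bit words: a set bit means "not null".
# Hardware popcount (or int.bit_count() in Python 3.10+) makes it cheap to
# skip fully-valid (all ones) or fully-null (all zeros) words in bulk.
def count_valid(bitmap_words):
    return sum(bin(word).count('1') for word in bitmap_words)

words = [0xFFFFFFFFFFFFFFFF,  # 64 valid values -> no null checks needed
         0x0,                 # 64 nulls -> skip the whole word
         0b1011]              # mixed word: check element by element
```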
I don't think GitHub can support the level of code scrutiny that we're going to want as part of the pandas 2.0 development process, particularly for C/C++ code that may need to go through multiple rounds of iterations.
Once we get going on the work, I'd suggest we give Gerrit a trial run, maybe via Gerrithub (which I haven't tried, but interested in doing -- have used Gerrit itself quite a bit with good experience), but we can also set up a dedicated instance on a VM (Linode, Digital Ocean) if needed
The benefits of Gerrit IMHO are:
More reading: https://www.beepsend.com/2016/04/05/abandoning-gitflow-github-favour-gerrit/
pandas's row indexes introduce a level of semantic incompatibility with other systems that occasionally causes problems for users who are using both pandas and some other system.
Functionally, this mainly means returning the group keys as data columns rather than as a row index. In the case of `.apply`, it may make sense to discard the group keys altogether.

We may also discuss a means to make specifying more complex aggregations easier in a separate issue.
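Note that today's `as_index=False` already gives the keys-as-columns behavior for standard aggregations:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'v': [1, 2, 3]})

# as_index=False returns the group keys as a regular data column,
# matching the semantics of most SQL-like systems
out = df.groupby('key', as_index=False)['v'].sum()
```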
There are a number of places where we "guess" a type (e.g. `np.float64`) when there is no reasonable choice, e.g. in the CSV parser code.
Many databases have the notion of a "null" type which can be casted to any other type implicitly. For example, if a column in a DataFrame has null type, then you could cast it to float64 or string and obtain an equivalent column of all NA/null values. This would flow through to concat operations.
Figuring this out doesn't strike me as urgent, but it would be good to assess how invasive this change would be (in theory it would help with some rough edges, but it may well break user code where a "float64 column of all NaNs" was assumed before).
Consider the case of a DataFrame with a large number of distinct groups:
```python
import numpy as np
import pandas as pd

arr = np.random.randn(5000000)
df = pd.DataFrame({'group': arr.astype('str').repeat(2)})
df['values'] = np.random.randn(len(df))
df.groupby('group').apply(lambda g: len(g))
```
I have
```
In [17]: %time result = df.groupby('group').apply(lambda g: len(g))
CPU times: user 6.45 s, sys: 68 ms, total: 6.52 s
Wall time: 6.51 s
```
The per-group overhead is fairly fixed -- with 5 million groups we have:
```
In [22]: %time result = df.groupby('group').apply(lambda g: len(g))
CPU times: user 31 s, sys: 108 ms, total: 31.1 s
Wall time: 31.1 s
```
It would be interesting to see whether, by pushing down the disassembly-reassembly of DataFrame objects into C++, we can take the overhead from the current ~6 microseconds per group to under a microsecond or even less.
Note that the effects of bad memory locality are also a factor. We could look into tricks like using a background thread which "prepares" groups (up to a certain size / buffer threshold) while user apply functions are executing, to at least mitigate the time aspect of the groupby evaluation.
Placeholder for this part of the project
There were a few comments about this in the initial pull requests.
Some of my arguments for dropping Python 2.7 support:

It would enable more modern C++ (`auto`, anyone?) and higher-quality software in the libpandas internals.

It would be great for us to reach some decision about this and also join the Python 3 statement: https://python3statement.github.io/ (per @takluyver's suggestion).
The `pandas.Index` is fantastically useful, but in many cases pandas's insistence on always having an index gets in the way.

Usually it can be safely ignored when not relevant, especially now that we have `RangeIndex` (which makes the cost of creating the index minimal), but this is not always the case:

- Implicitly joining on a default `RangeIndex` is actively harmful. It would be better to raise an error when implicitly joining on an index between two datasets with a default index.
- IO routines need a keyword argument (e.g. `index=True`) for controlling whether or not to include the index.

I propose that we make the index optional, e.g., by allowing it to be set to `None`. This entails a need for some rules to handle missing indexes:

- Alignment-dependent operations (e.g. `.loc` and `join`) should raise `TypeError` when called on objects without an index.

Somewhat related: #15
@jorisvandenbossche the automated docs builds here don't have the requisite Python packages to run the IPython directives, it seems. We can revert to manual builds if necessary...
potentially related to #9
numpy ufuncs have an identity, which pandas follows with respect to missing data:

```
np.sum([], dtype='float64')
Out[33]: 0.0

np.nansum([np.nan], dtype='float64')
Out[35]: 0.0

pd.Series([np.nan]).sum()
Out[36]: 0.0
```

I don't feel strongly one way or the other, but there's definitely a case to be made that `[36]` should be `NA`. The number of bug reports indicates that, at minimum, people get tripped up by this; xref pandas-dev/pandas#9422.

So we could consider modifying the identity concept for pandas 2.0, since there will be less binding to numpy semantics.
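For what it's worth, pandas later added a `min_count` keyword (in 0.22) that lets users opt into the NA-returning behavior per call:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan])

total = s.sum()               # 0.0: the ufunc identity, NaNs skipped
strict = s.sum(min_count=1)   # NaN: require at least one non-null value
```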
I will work on a full document for this to get the conversation started, but this can be the placeholder for our discussion about COW
parallel `.apply`, see here
xref dask & distributed
This issue is a placeholder for discussion w.r.t. how much pandas 2.0 should be in charge of out-of-core / parallel operations.
For example in IO operations (see #38), it makes sense to do parallel reading, where pandas can simply front for dask inline without the user directly knowing that this is happening. Sure, a more sophisticated user can control this (maybe through some limited options, or by directly using dask), but the default should simply be that it works (in appropriate scenarios) as parallel reads.
One could argue that `.apply` and `.groupby` are similar scenarios (again assuming that the dataset is 'big' enough and we have a GIL-releasing function).

We can also include the current use of `numexpr` for `.query` evaluation.

So we are hiding dask / numexpr (or potentially other parallel / out-of-core processing libraries) from the user directly. Things just work, and are faster.

So how far should we take this? Certainly users can directly use these libraries, but it could be a potential win to include dask as an optional dependency (even though it depends on pandas), and dispatch computation for out-of-core / parallelism as appropriate.
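A toy sketch of transparent chunked parallelism; the `parallel_apply` helper is hypothetical, and a real implementation would live behind the normal API (and would only pay off when the applied function releases the GIL or is I/O bound):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def parallel_apply(df, func, n_chunks=4):
    """Sketch: split a frame into chunks, apply func in a thread pool,
    then reassemble the results in the original row order."""
    chunks = [df.iloc[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        results = list(pool.map(func, chunks))
    return pd.concat(results).sort_index()

df = pd.DataFrame({'v': range(8)})
doubled = parallel_apply(df, lambda chunk: chunk * 2)
```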
I was poking around pandas to see if someone had implemented the equivalent of a vectorized if-then-else; for example, similar to `np.where`, but pandas-friendly:

```python
df['category'].isin(values).ifelse(df.group_a, df.group_b)
```

I put this here as it will be easier to implement later (in theory).
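Today the closest pandas-friendly spellings are `Series.where` and `Series.mask`, which keep the index intact, unlike `np.where`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 5, 10])
cond = s > 3

# numpy: vectorized if-then-else, but returns a bare ndarray
arr = np.where(cond, 'big', 'small')

# pandas: keep values where cond is True, else substitute
out = s.where(cond, other=-1)

# mask() is the complement: replace where cond is True
masked = s.mask(cond, other=0)
```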
pandas already supports lazy evaluation via numexpr; adding numba allows the possibility of generating lazy expressions (and recent versions support ahead-of-time compilation). I would simply make these requirements; they are available on all platforms and just help.
xref: pandas-dev/pandas#14324
xref: https://github.com/shoyer/numbagg
xref #7
Please close if not relevant, but I'm not sure that you want this file committed publicly:
https://github.com/pydata/pandas-design/blob/master/github_deploy_key.enc
I haven't been able to wrap my head around these behaviors:

```
In [10]: s = pd.Series()

In [11]: s['a'] = 7

In [12]: s
Out[12]:
a    7
dtype: int64
```

or

```
In [2]: s = pd.Series([1, 2, 3])

In [3]: s['4'] = 'b'

In [4]: s
Out[4]:
0    1
1    2
2    3
4    b
dtype: object
```
On first principles, I think these should raise KeyError and ValueError/TypeError, respectively. I'm concerned that preserving these APIs (especially implicit type changes) is going to be problematic for pandas 2.0 if our goal is to provide more precise / consistent / explicit behavior, particularly with respect to data types. If you want to change the type of something, you should cast (unless there is a standard implicit cast, e.g. int -> float in arithmetic). In the case of implicit size mutation, it seems like a recipe for bugs to me (or simply working around coding anti-patterns -- better to explicitly reindex then assign values).
Let me know other thoughts about this.
https://github.com/llllllllll/libpy looks like an interesting way to construct expressions.
This may not actually be an issue, as we aren't using a float `np.nan` as our missing marker, but we tend to have some subtle issues when int64 is downcast to float64, IOW when we have missing values in an integer array. We currently end up storing such arrays as `object` to avoid this precision loss.
Just a reminder to test for things like this.
xref pandas-dev/pandas#14020 as an example
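The precision loss in question is easy to demonstrate: float64 has a 53-bit significand, so large int64 values do not round-trip through a NaN-capable float representation:

```python
import numpy as np

# 2**53 + 1 cannot be represented exactly in float64; it is silently
# rounded to 2**53 on conversion
big = np.int64(2**53 + 1)
as_float = np.float64(big)
```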
Given that we seem to be on board with only supporting Python 3 (#5) in pandas 2.0, let's take full advantage of that and require type annotations for all internal pandas code. Users, of course, are free to ignore them as they please, but we can run mypy or pytype as part of our continuous integration tests to check for bugs.
Why?
Quite simply, this is just good software engineering.
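For illustration, the kind of annotated internal helper mypy/pytype would check; the `take` function here is a made-up example, not the actual pandas implementation:

```python
from typing import Optional, Sequence

def take(values: Sequence[int], indices: Sequence[int],
         fill_value: Optional[int] = None) -> list:
    """Gather values[i] for each index, substituting fill_value for -1.
    The annotations let a static checker catch e.g. passing a str
    fill_value at call sites."""
    return [fill_value if i == -1 else values[i] for i in indices]

result = take([10, 20, 30], [2, -1, 0], fill_value=0)
```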