wesm / pandas2

Design documents and code for the pandas 2.0 effort.
Home Page: https://pandas-dev.github.io/pandas2/
nuff said
at least where possible / documented
(Moved from https://github.com/wesm/pandas2-design/issues/1)
Disclaimer: I'm not involved in pandas development so my opinion here is not very informed. Sorry about that. :-/
According to my (little) experience in software development, huge refactors and changes in a library are often more difficult to drive forward and expose a larger bug surface than small, incremental changes. Moreover, releasing new versions with incremental changes helps users to test new functionality/changes and allocate enough time to identify and report new bugs.
Could changes proposed in this document be broken into independent "modules" that can be worked on separately and incrementally? For example:
Even though (2) is probably required to be able to solve (3) and (4), we could get (2) done in a 2.0 release, then (3) in 2.1, (4) in 2.2... etc.
Does that sound reasonable or do you believe it is better to make a big release with all these changes in a row? Are these changes more coupled than I think?
We have `merge()` and `merge_asof()`. There may even come a time when we perform functions on overlapping columns. As someone who wants to join two tables together, I just want a single mechanism to do so.
I wonder if it's possible to have a single API like:

```python
merge(
    left,     # DataFrame or Table
    right,    # DataFrame or Table
    on,       # one or more columns
    asof,     # one or more columns
    how,      # 'left', 'right', 'inner', 'outer'
    overlap,  # optional function to apply to overlapping column names
)
```
Users must specify at least one of `on` or `asof`. There can also be `left_on`/`right_on` and `left_asof`/`right_asof`. We could even have `left_index`/`right_index` for the poor souls who still have indexed data (#17).
The `overlap` argument is for when the same column name appears in both tables. Currently those columns are renamed with a suffix (though I'd be in favor of just raising an error). But there are times when I want to apply a function instead. There are ways to do this with arithmetic operations (#30), though I think any function of two arguments would be nice, including overwriting the left with the right (for handling cases of missing data with a "fill" result).
Note that this doesn't handle my proposed `merge_window()` (pandas-dev/pandas#13959). The semantics there are very specific and I'm not sure how to fit them into a unified structure as above, though I'd love to hear any ideas.
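A rough sketch of how such a unified front end might dispatch to today's `merge()` and `merge_asof()`. The `unified_merge` name and the dispatch rule are hypothetical, purely for illustration; `overlap` handling is omitted:

```python
import pandas as pd

def unified_merge(left, right, on=None, asof=None, how='inner', overlap=None):
    # Hypothetical dispatcher: an asof key routes to merge_asof(),
    # otherwise we fall back to an ordinary merge().
    # (overlap handling not sketched here)
    if asof is not None:
        # merge_asof requires the asof key to be sorted
        return pd.merge_asof(left.sort_values(asof), right.sort_values(asof),
                             on=asof, by=on)
    return pd.merge(left, right, on=on, how=how)

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'y': [3, 4]})
result = unified_merge(left, right, on='key', how='inner')
```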
We'll want to develop a benchmark suite to compare pandas 2.0 performance versus 1.x, especially for microbenchmarks. What's the right tool for this? asv, vbench?
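For reference, an asv benchmark is just a class with a `setup()` method and `time_*` methods that asv discovers and times. The benchmark below is illustrative, not part of any existing suite:

```python
import numpy as np
import pandas as pd

class GroupByAgg:
    # asv calls setup() before timing each time_* method
    def setup(self):
        n = 100_000
        self.df = pd.DataFrame({'key': np.random.randint(0, 100, n),
                                'value': np.random.randn(n)})

    def time_groupby_sum(self):
        # the body is what gets timed
        self.df.groupby('key')['value'].sum()
```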
IIRC this is from the design docs, but I wanted to make an issue to remember. We want to have a set of lazily computed array attributes. Sometimes these can be set at creation time based on the creation method / dtype. If the array is immutable then these are not affected by indexing checks.
- immutability / read-only, xref pandas-dev/pandas#14359
- `unique`
- `is_monotonic*`
- `has_nulls`
- `is_hashable` - only on non-homogeneous dtypes, usually true on `object` dtypes (but NOT if the elements are mutable). The issue is that this can currently be expensive to figure out (as you need to iterate over and call `hash` on each element). xref

E.g. imagine a `pd.date_range(....., ...)`; then `unique`, `monotonic`, and `has_nulls` are trivial to compute at creation time. Since this is currently an `Index` in pandas it is immutable by definition.
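A minimal Python sketch of the lazy-attribute idea using `functools.cached_property`; the `LazyStats` class and its attribute names are illustrative only:

```python
from functools import cached_property

import numpy as np

class LazyStats:
    """Sketch: statistics are computed on first access and cached;
    an immutable array never needs to invalidate them."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype='float64')
        self._values.setflags(write=False)  # immutable -> caches stay valid

    @cached_property
    def has_nulls(self):
        # computed lazily, then cached on the instance
        return bool(np.isnan(self._values).any())

    @cached_property
    def is_monotonic_increasing(self):
        return bool(np.all(self._values[1:] >= self._values[:-1]))

arr = LazyStats([1.0, 2.0, 3.0])
```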
`.lookup` (pandas-dev/pandas#7138), for coordinate access, is useful but is not incorporated in a generalized indexer.
pandas would potentially benefit from a more efficient decimal dtype, possibly using libmpdec (what CPython uses internally) for the internal implementation.
Similar to the `ARRAY` type found in SQL variants with nested types. See also the `List` type in Apache Arrow.
Copying my comment from pandas-dev/pandas#10000 (comment):
We should consider making arithmetic between a Series and a DataFrame broadcast across the columns of the dataframe, i.e., aligning `series.index` with `df.index`, rather than the current behavior of aligning `series.index` with `df.columns`.
I think this would be far more useful than the current behavior, because it's much more common to want to do arithmetic between a series and all columns of a DataFrame. This would make broadcasting in pandas inconsistent with NumPy, but I think that's OK for a library that focuses on 1D/2D rather than N-dimensional data structures.
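To make the two alignment behaviors concrete with today's API (`df.add(s, axis=0)` is the current explicit spelling of the proposed default):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]}, index=['x', 'y'])
s_cols = pd.Series({'a': 1, 'b': 2})
s_rows = pd.Series({'x': 1, 'y': 2})

# Current default: series index aligns with df.columns
by_cols = df + s_cols            # adds 1 to column 'a', 2 to column 'b'

# The proposed default (index alignment) must be spelled explicitly today
by_rows = df.add(s_rows, axis=0)  # adds 1 to row 'x', 2 to row 'y'
```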
On some contemplation, I am thinking it may lead to overall cleaner C++ code if we use exceptions for error reporting instead of status codes. These exceptions should be truly exceptional -- for example, it would be better to deal with argument validation and "can't fail" exceptions with DCHECK
macros. User-visible argument validation can be handled in the Python wrapper layer.
Obvious / currently supported see here
xref #20
Informational, may want to think about the desirability of adding later
non-support (try to raise informative errors & point to ready-made solns)
object)

Currently, there is no way to index a list of values in pandas without inserting NA for missing values. It could be nice to make this possible, either by making a variation of `.loc` that raises `KeyError` in such cases or by changing the behavior of `.loc` altogether.
In xarray, `.loc` only works with pre-existing index labels for getitem (`df.loc[key]`) and assignment (`df.loc[key] = values`; inserting new columns is OK). Reindexing behavior can still be obtained with explicit calls to `.reindex`.
Conceivably, we could make things work in the same way for pandas. Two major implications of such a change:

`pandas/core/indexing.py` is some of the trickiest code in pandas, in part because it handles cases like inserting NAs and in part because it tries to handle all possible variations of indexing with minimal code duplication. I don't envy anyone who takes on the task of translating such logic to C++.

Patterns like this would stop working:

```python
df = pd.DataFrame()
for row, col, value in data:
    df.loc[row, col] = value
```

In my view this is a positive, but it would certainly be a big backwards compatibility break.
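For comparison, the NA-inserting behavior is already available as an explicit `.reindex` call, which is the spelling the proposal would push users toward:

```python
import pandas as pd

s = pd.Series([1, 2], index=['a', 'b'])

# Explicit reindexing keeps the NA-inserting behavior available:
expanded = s.reindex(['a', 'c'])   # 'c' is filled with NaN

# Under the proposal, label-based getitem with a missing key would raise
# (recent pandas already raises KeyError for s.loc[['a', 'c']]).
```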
R's "new" pipes, combined with easily added functions, have made R's data handling much easier to read and to extend than pandas. The advantage is IMO twofold:

Piping (the `.` method-chaining notation in Python) is much easier to read than nested function calls: `df %>% func(...) %>% func2(...)` and `df.func(...).func2(...)` vs `func2(func(df, ...), ...)`.

Pandas nowadays has a `df.pipe()` method, but that looks much clumsier compared to the elegance of a separate pipe operator.
So I would like to see pandas2 reserve one of the less-needed operators (e.g. `>>`?) for a pipe interface. The pipe interface would let users define new functions, which return a small object that would be used in the `>>` operator; probably doable with a decorator around the function.

As this wasn't possible before because it is an API-breaking change, I would like to propose that it be done in pandas2.
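A minimal sketch of how a `>>` pipe could be built with a decorator and `__rrshift__`; the `Pipeable`/`pipeable` names are made up for illustration:

```python
class Pipeable:
    """Sketch: wraps a function call so that `obj >> func(args)` works."""

    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def __rrshift__(self, obj):
        # invoked for `obj >> self` when obj doesn't handle >> itself
        return self.func(obj, *self.args, **self.kwargs)

def pipeable(func):
    """Decorator: calling the function returns a Pipeable, deferring
    evaluation until the left-hand operand arrives via >>."""
    def wrapper(*args, **kwargs):
        return Pipeable(func, *args, **kwargs)
    return wrapper

@pipeable
def add(x, n):
    return x + n

result = 1 >> add(2) >> add(3)
```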
The rules for exactly what `DataFrame.__getitem__`/`__setitem__` does (pandas-dev/pandas#9595) are sufficiently complex and inconsistent that they are impossible to understand without extensive experimentation.

This makes for a rather embarrassing situation that we really should fix for pandas 2.0.
I made a proposal when this came up last year:
I still like my proposal, but more importantly, it satisfies two important criteria (`df['foo']`, `df[['foo', 'bar']]`, and `df[df['foo'] == 'bar']` might cover 80% of use cases).

I have closed that issue, but we should do this when implementing the pandas 2.0 memory allocator.
I really like the proposal so far 👍
Are there any plans to provide a semi-stable C++/Cython API that could be used by other projects for things beyond simple pandas integration? (As in being able to write your own alternative to `DataFrame`.)
Specifically, I'm thinking of writing a "pandas for structured data" that allows you to perform in-memory queries on records with optional and repeating fields, using the trick from the Dremel paper to store the data as contiguous arrays.
It would be cool if I could make use of some of the interfaces in this proposal to simplify and speed up interop with pandas.
Another example is xarray, which is inspired by pandas, but exposes an alternative data structure for working with N-dimensional datasets.
we may consider a fixed-size memory pool (which could be managed with an LRU stack) for hash table data to avoid excess internal index hash tables
xref #9
Maybe we can collect a list of pandas issues that have happened in and around this.
I've found it's valuable to be able to consistently compute statistics including the NA values, especially with multiple group keys. I haven't kept track of how pandas handles these now in all cases, but it would be nice to come up with a strategy to make NA behave like any other group in a group by setting.
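For reference, pandas eventually added a `dropna` keyword to `groupby` (in 1.1) that makes NA behave like any other group key:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', None], 'v': [1, 2, 3]})

# Default: rows with an NA key are dropped from the result
dropped = df.groupby('key')['v'].sum()

# dropna=False (pandas >= 1.1) keeps NA as its own group
kept = df.groupby('key', dropna=False)['v'].sum()
```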
pandas `dtype` coercion is useful in most cases, but there are situations where it should be prohibited to avoid unexpected values being included. The behavior should be switchable, like:

```python
s = pd.Series([1, 2, 3])
s[1] = 1.2
s
# 0    1
# 1    1
# 2    3
# dtype: int64

s[1] = 'x'
s
# 0    1
# 1    x
# 2    3
# dtype: object

s = pd.Series([1, 2, 3], cast=False)  # proposed keyword
s[1] = 'x'
# TypeError...
```
If any array/Series statistics have been computed, we should serialize them:
I'm thinking we can come up with a plan to yield a better `.append` implementation that defers stitching together arrays until it's actually needed for computations. We can do this by having a virtual `pandas::Table` interface that consolidates fragmented columns only when they are requested. Will think some more about this.
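A Python sketch of the deferred-consolidation idea; the `ChunkedColumn` class is hypothetical:

```python
import numpy as np

class ChunkedColumn:
    """Sketch: appends collect chunks; concatenation is deferred until
    the values are actually needed for a computation."""

    def __init__(self):
        self._chunks = []
        self._materialized = None

    def append(self, values):
        # O(1): just record the new chunk and invalidate the cache
        self._chunks.append(np.asarray(values))
        self._materialized = None

    def values(self):
        # consolidate fragmented chunks only on first access
        if self._materialized is None:
            self._materialized = np.concatenate(self._chunks)
        return self._materialized

col = ChunkedColumn()
col.append([1, 2])
col.append([3])
```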
As I've been prototyping I've copied over a bunch of C++ code from Arrow (https://github.com/apache/arrow). I'm not sure maintaining near clones of the same code in two places makes sense (see @xhochy's comment here b982d96#commitcomment-19406430).
The code in question is:

- `src/pandas/util/bit-util.h`

Sharing this code means adding libarrow as a build / runtime dependency; if this causes problems in some way, we can absorb the bits of the library that are being used in pandas. We should definitely set `using` aliases so that we are not using the `arrow::` namespace directly in the code for these low-level bits.
Later, we can also potentially take advantage of `arrow::io`, a small IO subsystem for dealing with files, memory maps, etc. This may be useful for revamping the CSV reader.
When we look at adding nested data types to pandas, or even a new string array type, we may want to consider using the Arrow memory layout, so having this in the build toolchain may make life easier in a number of ways.
xref #15
I brought this up at SciPy 2015, but there's a significant performance win available in expressions like:

```python
df[boolean_cond].groupby(grouping_exprs).agg(agg_expr)
```

If you do this currently, it will produce a fully materialized copy of `df` even if the groupby only touches a small portion of the DataFrame. Ideally, we'd have:

```python
df.groupby(grouping_exprs, where=boolean_cond).agg(...)
```

I put this as a design / pandas2 issue because the boolean bytes / bits will need to get pushed down into the various C-level groupby subroutines.
Do we want to continue to use NaN? There are possible computational benefits to doing so, but with pandas 2.0 we have the opportunity to change to using bitmaps everywhere, which brings a lot of consistency to how missing data is handled (for example: `isnull` becomes free under this scenario). We may want to do some benchmarking to understand the overhead of using bitmaps in arithmetic and aggregations (rather than letting the CPU propagate NaN where relevant).
One benefit of the bitmap route is that we can use hardware popcount to skip null checking on groups of 64 or 128 bits at a time (or more, if there are AVX popcount instructions available, not actually sure about this), so the performance in aggregations may actually get a lot better on mostly non-null data.
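The popcount idea in miniature, as a pure-Python stand-in for the hardware instruction (the `count_valid` helper is illustrative):

```python
# A validity bitmap packed into 64-bit words: a set bit means "not null".
# Hardware popcount (or int.bit_count() in Python 3.10+) makes it cheap to
# skip fully-valid (all ones) or fully-null (all zeros) words in bulk.
def count_valid(bitmap_words):
    return sum(bin(word).count('1') for word in bitmap_words)

words = [0xFFFFFFFFFFFFFFFF,  # 64 valid values -> no null checks needed
         0x0,                 # 64 nulls -> skip the whole word
         0b1011]              # mixed word: check element by element
```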
I don't think GitHub can support the level of code scrutiny that we're going to want as part of the pandas 2.0 development process, particularly for C/C++ code that may need to go through multiple rounds of iterations.
Once we get going on the work, I'd suggest we give Gerrit a trial run, maybe via Gerrithub (which I haven't tried, but interested in doing -- have used Gerrit itself quite a bit with good experience), but we can also set up a dedicated instance on a VM (Linode, Digital Ocean) if needed
The benefits of Gerrit IMHO are:
More reading: https://www.beepsend.com/2016/04/05/abandoning-gitflow-github-favour-gerrit/
pandas's row indexes introduce a level of semantic incompatibility with other systems that occasionally causes problems for users who are using both pandas and some other system.
Functionally, this mainly means returning the group keys as data columns rather than as a row index. In the case of `.apply`, it may make sense to discard the group keys altogether.

We may also discuss a means to make specifying more complex aggregations easier in a separate issue.
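Note that today's `as_index=False` already gives the keys-as-columns behavior for standard aggregations:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'v': [1, 2, 3]})

# as_index=False returns the group keys as a regular data column,
# matching the semantics of most SQL-like systems
out = df.groupby('key', as_index=False)['v'].sum()
```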
There are a number of places where we "guess" a type (e.g. `np.float64`) when there is no reasonable choice, e.g. in the CSV parser code.
Many databases have the notion of a "null" type which can be casted to any other type implicitly. For example, if a column in a DataFrame has null type, then you could cast it to float64 or string and obtain an equivalent column of all NA/null values. This would flow through to concat operations.
Figuring this out doesn't strike me as urgent, but it would be good to assess how invasive this change would be (in theory it would help with some rough edges, but it may well break user code where a "float64 column of all NaNs" was assumed before).
Consider the case of a DataFrame with a large number of distinct groups:
```python
import numpy as np
import pandas as pd

arr = np.random.randn(5000000)
df = pd.DataFrame({'group': arr.astype('str').repeat(2)})
df['values'] = np.random.randn(len(df))
df.groupby('group').apply(lambda g: len(g))
```
I have
```
In [17]: %time result = df.groupby('group').apply(lambda g: len(g))
CPU times: user 6.45 s, sys: 68 ms, total: 6.52 s
Wall time: 6.51 s
```
The per-group overhead is fairly fixed -- with 5 million groups we have:
```
In [22]: %time result = df.groupby('group').apply(lambda g: len(g))
CPU times: user 31 s, sys: 108 ms, total: 31.1 s
Wall time: 31.1 s
```
It would be interesting to see whether, by pushing down the disassembly-reassembly of DataFrame objects into C++, we can take the overhead from the current ~6 microseconds per group to under a microsecond or even less.
Note that the effects of bad memory locality are also a factor. We could look into tricks like using a background thread which "prepares" groups (up to a certain size / buffer threshold) while user apply functions are executing, to at least mitigate the time aspect of the groupby evaluation.
Placeholder for this part of the project
There were a few comments about this in the initial pull requests.
Some of my arguments for dropping Python 2.7 support:

It would enable more modern C++ (`auto`, anyone?) and higher-quality software in the libpandas internals.

It would be great for us to reach some decision about this and also join the Python 3 statement: https://python3statement.github.io/ (per @takluyver's suggestion).
The `pandas.Index` is fantastically useful, but in many cases pandas's insistence on always having an index gets in the way.

Usually it can be safely ignored when not relevant, especially now that we have `RangeIndex` (which makes the cost of creating the index minimal), but this is not always the case:

- Implicitly joining on a default `RangeIndex` is actively harmful. It would be better to raise an error when implicitly joining on an index between two datasets with a default index.
- IO routines need a keyword argument (e.g. `index=True`) for controlling whether or not to include the index.

I propose that we make the index optional, e.g., by allowing it to be set to `None`. This entails a need for some rules to handle missing indexes:

- Alignment-dependent operations (e.g. `.loc` and `join`) should raise `TypeError` when called on objects without an index.

Somewhat related: #15
@jorisvandenbossche the automated docs builds here don't have the requisite Python packages to run the IPython directives, it seems. We can revert to manual builds if necessary...
potentially related to #9
numpy ufuncs have an identity, which pandas follows with respect to missing data:

```
np.sum([], dtype='float64')
Out[33]: 0.0

np.nansum([np.nan], dtype='float64')
Out[35]: 0.0

pd.Series([np.nan]).sum()
Out[36]: 0.0
```

I don't feel strongly one way or the other, but there's definitely a case to be made that `[36]` should be `NA`. The number of bug reports indicates that, at minimum, people get tripped up by this; xref pandas-dev/pandas#9422.

So we could consider modifying the identity concept for pandas 2.0, since there will be less binding to numpy semantics.
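For what it's worth, pandas later added a `min_count` keyword (in 0.22) that lets users opt into the NA-returning behavior per call:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan])

total = s.sum()               # 0.0: the ufunc identity, NaNs skipped
strict = s.sum(min_count=1)   # NaN: require at least one non-null value
```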
I will work on a full document for this to get the conversation started, but this can be the placeholder for our discussion about COW
parallel `.apply`, see here
xref dask & distributed
This issue is a placeholder for discussion w.r.t. how much pandas 2.0 should be in charge of out-of-core / parallel operations.
For example in IO operations (see #38), it makes sense to do parallel reading, where pandas can simply front for dask inline without the user directly knowing that this is happening. Sure, a more sophisticated user can control this (maybe through some limited options, or by directly using dask), but the default should simply be that it works (in appropriate scenarios) as parallel reads.
One could argue that `.apply` and `.groupby` are similar scenarios (again assuming that the dataset is 'big' enough and we have a GIL-releasing function).

We can also include the current use of `numexpr` for `.query` evaluation.

So we are hiding dask / numexpr (or potentially other parallel / out-of-core processing libraries) from the user directly. Things just work, and are faster.

So how far should we take this? Certainly users can directly use these libraries, but it could be a potential win to include dask as an optional dependency (even though it depends on pandas), and dispatch computation for out-of-core / parallelism as appropriate.
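A toy sketch of transparent chunked parallelism; the `parallel_apply` helper is hypothetical, and a real implementation would live behind the normal API (and would only pay off when the applied function releases the GIL or is I/O bound):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def parallel_apply(df, func, n_chunks=4):
    """Sketch: split a frame into chunks, apply func in a thread pool,
    then reassemble the results in the original row order."""
    chunks = [df.iloc[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        results = list(pool.map(func, chunks))
    return pd.concat(results).sort_index()

df = pd.DataFrame({'v': range(8)})
doubled = parallel_apply(df, lambda chunk: chunk * 2)
```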
I was poking around pandas to see if someone had implemented the equivalent of a vectorized if-then-else; for example, similar to `np.where`, but pandas-friendly:

```python
df['category'].isin(values).ifelse(df.group_a, df.group_b)
```

I put this here as it will be easier to implement later (in theory).
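Today the closest pandas-friendly spellings are `Series.where` and `Series.mask`, which keep the index intact, unlike `np.where`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 5, 10])
cond = s > 3

# numpy: vectorized if-then-else, but returns a bare ndarray
arr = np.where(cond, 'big', 'small')

# pandas: keep values where cond is True, else substitute
out = s.where(cond, other=-1)

# mask() is the complement: replace where cond is True
masked = s.mask(cond, other=0)
```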
pandas already supports lazy evaluation via numexpr; adding numba allows the possibility of generating lazy expressions (and recent versions support ahead-of-time compilation). I would simply make these requirements; they are available on all platforms and just help.
xref: pandas-dev/pandas#14324
xref: https://github.com/shoyer/numbagg
xref #7
Please close if not relevant, but I'm not sure that you want this file committed publicly:
https://github.com/pydata/pandas-design/blob/master/github_deploy_key.enc
I haven't been able to wrap my head around these behaviors:

```
In [10]: s = pd.Series()

In [11]: s['a'] = 7

In [12]: s
Out[12]:
a    7
dtype: int64
```

or

```
In [2]: s = pd.Series([1, 2, 3])

In [3]: s['4'] = 'b'

In [4]: s
Out[4]:
0    1
1    2
2    3
4    b
dtype: object
```
On first principles, I think these should raise KeyError and ValueError/TypeError, respectively. I'm concerned that preserving these APIs (especially implicit type changes) is going to be problematic for pandas 2.0 if our goal is to provide more precise / consistent / explicit behavior, particularly with respect to data types. If you want to change the type of something, you should cast (unless there is a standard implicit cast, e.g. int -> float in arithmetic). In the case of implicit size mutation, it seems like a recipe for bugs to me (or simply working around coding anti-patterns -- better to explicitly reindex then assign values).
Let me know other thoughts about this.
https://github.com/llllllllll/libpy looks like an interesting way to construct expressions.
This may not actually be an issue, as we aren't using a float `np.nan` as our missing marker, but we tend to have some subtle issues when int64 is downcast to float64, IOW when we have missing values in an integer array. We currently end up storing such arrays as `object` to avoid this precision loss.
Just a reminder to test for things like this.
xref pandas-dev/pandas#14020 as an example
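The precision loss in question is easy to demonstrate: float64 has a 53-bit significand, so large int64 values do not round-trip through a NaN-capable float representation:

```python
import numpy as np

# 2**53 + 1 cannot be represented exactly in float64; it is silently
# rounded to 2**53 on conversion
big = np.int64(2**53 + 1)
as_float = np.float64(big)
```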
Given that we seem to be on board with only supporting Python 3 (#5) in pandas 2.0, let's take full advantage of that and require type annotations for all internal pandas code. Users, of course, are free to ignore them as they please, but we can run mypy or pytype as part of our continuous integration tests to check for bugs.
Why?
Quite simply, this is just good software engineering.
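For illustration, the kind of annotated internal helper mypy/pytype would check; the `take` function here is a made-up example, not the actual pandas implementation:

```python
from typing import Optional, Sequence

def take(values: Sequence[int], indices: Sequence[int],
         fill_value: Optional[int] = None) -> list:
    """Gather values[i] for each index, substituting fill_value for -1.
    The annotations let a static checker catch e.g. passing a str
    fill_value at call sites."""
    return [fill_value if i == -1 else values[i] for i in indices]

result = take([10, 20, 30], [2, -1, 0], fill_value=0)
```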