Comments (6)
Thanks for the report.
It would be nice if Pandas would just fall back to not sorting the output.
I'm opposed here. This makes the output harder to predict for users and can hide bugs.
As illustrated by the case with the 'string' value, Pandas is lenient and sorts booleans and strings without complaining, which is not technically correct, and it would be convenient if this behavior is extended to other types.
What is meant by "is extended to other types"? In the first half of the line above you seem to suggest pandas shouldn't be sorting Booleans and strings, and the second half you seem to suggest pandas should be sorting more combinations of dtypes.
In this case, however, it would be good to get a hint that sort=False would solve the issue.
I think this is a good idea, but the difficulty will be in the implementation. The failure happens within factorize, which is used in a variety of places in pandas. Since we can be attempting to sort any Python object, we can't predict the type of error a sorting failure will raise. So it seems to me we can't reliably catch the exception in the groupby code, nor can we raise a message like "use sort=False instead" in factorize.
Related: #54847 (comment)
from pandas.
I suppose we could wrap the sorting in factorize:
pandas/pandas/core/algorithms.py
Lines 807 to 814 in 9a6c8f0
with something like:
try:
... = safe_sort(...)
except Exception as err:
raise type(err)("factorize failed sorting") from err
and then we could reliably catch and rethrow this in groupby.
cc @mroeschke - any thoughts?
from pandas.
Is the exception from safe_sort
be captured in groupby and rethrown instead?
from pandas.
Thanks for picking up this issue!
It would be nice if Pandas would just fall back to not sorting the output.
I'm opposed here. This makes the output harder to predict for users and can hide bugs.
As illustrated by the case with the 'string' value, Pandas is lenient and sorts booleans and strings without complaining, which is not technically correct, and it would be convenient if this behavior is extended to other types.
What is meant by "is extended to other types"? In the first half of the line above you seem to suggest pandas shouldn't be sorting Booleans and strings, and the second half you seem to suggest pandas should be sorting more combinations of dtypes.
Sorry if that was a bit confusing.
What I meant here is that I found the current behavior inconsistent in the sense that some values (strings) get special treatment and work with sort=True
despite fudamentally not being sortable with booleans, while others don't (my tuples in the example above).
Essentially, I see two options to get more consistency:
sort=True
fails with any values for whicha < b
would raise a basic python exception, like e.g.'string' < False
does. E.g. my 'alternative solution' above- Be more lenient as currently is done for strings.
What I proposed as the desired feature was to go with option 2 to resolve this inconsistency. My main reason was that it would not change the current behavior of not raising an exception with string values. With 'not technically correct', I meant that there fundamentally just is no order between booleans and strings, but pandas doesn't complain despite sort=True
, and I can only guess what it actually does. With "is extended to other types" I meant that pandas should not complain in other scenarios that are not sortable either.
This brings up the strong argument of output predictability. I think, though that a focus on predictability would support option 1. I didn't dig deeper in the code or docs, so I don't know how pandas actually sorts bools and strings for groupby, so predictability in the current state might already not be as good as it could be. For comparison, I just did a quick check and in 2.1.4 Pandas refuses to sort a pd.Series with bool and string values. Maybe it would be better to also be more strict on groupby group values (option 1).
from pandas.
Is the exception from
safe_sort
be captured in groupby and rethrown instead?
I think you mean to try to catch the exception thrown from safe_sort
as-is (that is, without changing safe_sort
). Because it can be sorting an arbitrary Python object, I don't think there is a reliable way to catch error thrown from safe_sort
within the groupby code while letting others propagate.
Essentially, I see two options to get more consistency:
I think strings and numeric data occur common enough in index labels (e.g. pandas will default to creating column labels as integers, whereas strings are very common) that this may be special enough to warrant an exception, but that does not mean we have to "sort the world" 😆 In any case, if we are going to allow strings, numerics, and NA values and nothing else, it should at least be documented better.
Maybe it would be better to also be more strict on groupby group values (option 1).
If we were designing the API from scratch, this would be my recommendation. However I think we have to consider the potential disruption to users. The benefit of having a more predictable/consistent "what is sortable" seems minor to the disruption it might cause, but of course I'm only guessing here. Unfortunately I see no way to be sure. I wouldn't be opposed if this gains support.
from pandas.
I don't think there is a reliable way to catch error thrown from safe_sort within the groupby code while letting others propagate.
OK then doing what you suggested in #57525 (comment) seems reasonable
from pandas.
Related Issues (20)
- DOC: missed doc for `DataFrame.isetitem` in https://pandas.pydata.org/docs/search.html?q=isetitem HOT 1
- BUG: Possible bad api side effect of replacing FrozenList with tuple HOT 8
- BUG: queries on categorical string columns in HDFStore.select() return unexpected results
- ENH: `(Styler|DataFrame).to_typst()` HOT 2
- Potential regression induced by PR #57328 HOT 3
- Potential regression induced by PR #57302
- Potential regression induced by PR #57402 HOT 1
- Potential regression induced by PR #57445 HOT 4
- Potential regression induced by PR #57479 HOT 4
- Potential regression induced by PR #57534 HOT 1
- pandas 2.2.1 introduce API changes, for example for `groupby().first`, `groupby().last` and `series.argsort` HOT 9
- ENH: Make ExtensionArray a Protocol HOT 6
- BUG: .str.replace repl string incorrectly parsed with pyarrow string dtype HOT 2
- Potential regression induced by #57347 HOT 1
- Potential regression induced by PR #56921 HOT 2
- QST: Grouping at fixed time intervals (hours and minutes) regardeless the date and the first row time
- BUG: interchange protocol with nullable datatypes a non-null validity provides nonsense results HOT 2
- BUG: differing syntax for datetime column dtype specification causes failure with `assert_frame_equal()` HOT 3
- BUG: Cannot use numpy FLS as indicies since pandas 2.2.1 HOT 1
- BUG: integers are being converted to floats when accessed from DataFrame HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pandas.