Coder Social home page Coder Social logo

Comments (6)

rhshadrach avatar rhshadrach commented on June 15, 2024

Thanks for the report.

It would be nice if Pandas would just fall back to not sorting the output.

I'm opposed here. This makes the output harder to predict for users and can hide bugs.

As illustrated by the case with the 'string' value, Pandas is lenient and sorts booleans and strings without complaining, which is not technically correct, and it would be convenient if this behavior is extended to other types.

What is meant by "is extended to other types"? In the first half of the line above you seem to suggest pandas shouldn't be sorting Booleans and strings, and the second half you seem to suggest pandas should be sorting more combinations of dtypes.

In this case, however, it would be good to get a hint that sort=False would solve the issue.

I think this is a good idea, but the difficulty will be in the implementation. The failure happens within factorize, which is used in a variety of places in pandas. Since we can be attempting to sort any Python object, we can't predict the type of error a sorting failure will raise. So it seems to me we can't reliably catch the exception in the groupby code, nor can we raise a message like "use sort=False instead" in factorize.

Related: #54847 (comment)

from pandas.

rhshadrach avatar rhshadrach commented on June 15, 2024

I suppose we could wrap the sorting in factorize:

if sort and len(uniques) > 0:
uniques, codes = safe_sort(
uniques,
codes,
use_na_sentinel=use_na_sentinel,
assume_unique=True,
verify=False,
)

with something like:

try:
    ... = safe_sort(...)
except Exception as err:
    raise type(err)("factorize failed sorting") from err

and then we could reliably catch and rethrow this in groupby.

cc @mroeschke - any thoughts?

from pandas.

mroeschke avatar mroeschke commented on June 15, 2024

Is the exception from safe_sort be captured in groupby and rethrown instead?

from pandas.

gabuzi avatar gabuzi commented on June 15, 2024

Thanks for picking up this issue!

It would be nice if Pandas would just fall back to not sorting the output.

I'm opposed here. This makes the output harder to predict for users and can hide bugs.

As illustrated by the case with the 'string' value, Pandas is lenient and sorts booleans and strings without complaining, which is not technically correct, and it would be convenient if this behavior is extended to other types.

What is meant by "is extended to other types"? In the first half of the line above you seem to suggest pandas shouldn't be sorting Booleans and strings, and the second half you seem to suggest pandas should be sorting more combinations of dtypes.

Sorry if that was a bit confusing.

What I meant here is that I found the current behavior inconsistent in the sense that some values (strings) get special treatment and work with sort=True despite fudamentally not being sortable with booleans, while others don't (my tuples in the example above).
Essentially, I see two options to get more consistency:

  1. sort=True fails with any values for which a < b would raise a basic python exception, like e.g. 'string' < False does. E.g. my 'alternative solution' above
  2. Be more lenient as currently is done for strings.

What I proposed as the desired feature was to go with option 2 to resolve this inconsistency. My main reason was that it would not change the current behavior of not raising an exception with string values. With 'not technically correct', I meant that there fundamentally just is no order between booleans and strings, but pandas doesn't complain despite sort=True, and I can only guess what it actually does. With "is extended to other types" I meant that pandas should not complain in other scenarios that are not sortable either.

This brings up the strong argument of output predictability. I think, though that a focus on predictability would support option 1. I didn't dig deeper in the code or docs, so I don't know how pandas actually sorts bools and strings for groupby, so predictability in the current state might already not be as good as it could be. For comparison, I just did a quick check and in 2.1.4 Pandas refuses to sort a pd.Series with bool and string values. Maybe it would be better to also be more strict on groupby group values (option 1).

from pandas.

rhshadrach avatar rhshadrach commented on June 15, 2024

@mroeschke

Is the exception from safe_sort be captured in groupby and rethrown instead?

I think you mean to try to catch the exception thrown from safe_sort as-is (that is, without changing safe_sort). Because it can be sorting an arbitrary Python object, I don't think there is a reliable way to catch error thrown from safe_sort within the groupby code while letting others propagate.

@gabuzi

Essentially, I see two options to get more consistency:

I think strings and numeric data occur common enough in index labels (e.g. pandas will default to creating column labels as integers, whereas strings are very common) that this may be special enough to warrant an exception, but that does not mean we have to "sort the world" 😆 In any case, if we are going to allow strings, numerics, and NA values and nothing else, it should at least be documented better.

Maybe it would be better to also be more strict on groupby group values (option 1).

If we were designing the API from scratch, this would be my recommendation. However I think we have to consider the potential disruption to users. The benefit of having a more predictable/consistent "what is sortable" seems minor to the disruption it might cause, but of course I'm only guessing here. Unfortunately I see no way to be sure. I wouldn't be opposed if this gains support.

from pandas.

mroeschke avatar mroeschke commented on June 15, 2024

I don't think there is a reliable way to catch error thrown from safe_sort within the groupby code while letting others propagate.

OK then doing what you suggested in #57525 (comment) seems reasonable

from pandas.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.