I suppose we could wrap the sorting in factorize: <div class="Bo

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

ENH: Better documentation or default behavior for GroupBy for columns with non-sortable values about pandas HOT 6 OPEN

gabuzi commented on June 15, 2024

ENH: Better documentation or default behavior for GroupBy for columns with non-sortable values

from pandas.

Comments (6)

rhshadrach commented on June 15, 2024

Thanks for the report.

It would be nice if Pandas would just fall back to not sorting the output.

I'm opposed here. This makes the output harder to predict for users and can hide bugs.

As illustrated by the case with the 'string' value, Pandas is lenient and sorts booleans and strings without complaining, which is not technically correct, and it would be convenient if this behavior is extended to other types.

What is meant by "is extended to other types"? In the first half of the line above you seem to suggest pandas shouldn't be sorting Booleans and strings, and the second half you seem to suggest pandas should be sorting more combinations of dtypes.

In this case, however, it would be good to get a hint that sort=False would solve the issue.

I think this is a good idea, but the difficulty will be in the implementation. The failure happens within factorize, which is used in a variety of places in pandas. Since we can be attempting to sort any Python object, we can't predict the type of error a sorting failure will raise. So it seems to me we can't reliably catch the exception in the groupby code, nor can we raise a message like "use sort=False instead" in factorize.

Related: #54847 (comment)

from pandas.

rhshadrach commented on June 15, 2024

I suppose we could wrap the sorting in factorize:

pandas/pandas/core/algorithms.py

Lines 807 to 814 in 9a6c8f0

    
           if sort and len(uniques) > 0: 
        
               uniques, codes = safe_sort( 
        
                   uniques, 
        
                   codes, 
        
                   use_na_sentinel=use_na_sentinel, 
        
                   assume_unique=True, 
        
                   verify=False, 
        
               )

with something like:

try:
    ... = safe_sort(...)
except Exception as err:
    raise type(err)("factorize failed sorting") from err

and then we could reliably catch and rethrow this in groupby.

cc @mroeschke - any thoughts?

from pandas.

mroeschke commented on June 15, 2024

Is the exception from safe_sort be captured in groupby and rethrown instead?

from pandas.

gabuzi commented on June 15, 2024

Thanks for picking up this issue!

It would be nice if Pandas would just fall back to not sorting the output.

I'm opposed here. This makes the output harder to predict for users and can hide bugs.

As illustrated by the case with the 'string' value, Pandas is lenient and sorts booleans and strings without complaining, which is not technically correct, and it would be convenient if this behavior is extended to other types.

What is meant by "is extended to other types"? In the first half of the line above you seem to suggest pandas shouldn't be sorting Booleans and strings, and the second half you seem to suggest pandas should be sorting more combinations of dtypes.

Sorry if that was a bit confusing.

What I meant here is that I found the current behavior inconsistent in the sense that some values (strings) get special treatment and work with sort=True despite fudamentally not being sortable with booleans, while others don't (my tuples in the example above).
Essentially, I see two options to get more consistency:

sort=True fails with any values for which a < b would raise a basic python exception, like e.g. 'string' < False does. E.g. my 'alternative solution' above
Be more lenient as currently is done for strings.

What I proposed as the desired feature was to go with option 2 to resolve this inconsistency. My main reason was that it would not change the current behavior of not raising an exception with string values. With 'not technically correct', I meant that there fundamentally just is no order between booleans and strings, but pandas doesn't complain despite sort=True, and I can only guess what it actually does. With "is extended to other types" I meant that pandas should not complain in other scenarios that are not sortable either.

This brings up the strong argument of output predictability. I think, though that a focus on predictability would support option 1. I didn't dig deeper in the code or docs, so I don't know how pandas actually sorts bools and strings for groupby, so predictability in the current state might already not be as good as it could be. For comparison, I just did a quick check and in 2.1.4 Pandas refuses to sort a pd.Series with bool and string values. Maybe it would be better to also be more strict on groupby group values (option 1).

from pandas.

rhshadrach commented on June 15, 2024

@mroeschke

Is the exception from safe_sort be captured in groupby and rethrown instead?

I think you mean to try to catch the exception thrown from safe_sort as-is (that is, without changing safe_sort). Because it can be sorting an arbitrary Python object, I don't think there is a reliable way to catch error thrown from safe_sort within the groupby code while letting others propagate.

@gabuzi

Essentially, I see two options to get more consistency:

I think strings and numeric data occur common enough in index labels (e.g. pandas will default to creating column labels as integers, whereas strings are very common) that this may be special enough to warrant an exception, but that does not mean we have to "sort the world" 😆 In any case, if we are going to allow strings, numerics, and NA values and nothing else, it should at least be documented better.

Maybe it would be better to also be more strict on groupby group values (option 1).

If we were designing the API from scratch, this would be my recommendation. However I think we have to consider the potential disruption to users. The benefit of having a more predictable/consistent "what is sortable" seems minor to the disruption it might cause, but of course I'm only guessing here. Unfortunately I see no way to be sure. I wouldn't be opposed if this gains support.

from pandas.

mroeschke commented on June 15, 2024

I don't think there is a reliable way to catch error thrown from safe_sort within the groupby code while letting others propagate.

OK then doing what you suggested in #57525 (comment) seems reasonable

from pandas.

ENH: Better documentation or default behavior for GroupBy for columns with non-sortable values about pandas HOT 6 OPEN

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

	if sort and len(uniques) > 0:
	uniques, codes = safe_sort(
	uniques,
	codes,
	use_na_sentinel=use_na_sentinel,
	assume_unique=True,
	verify=False,
	)