
Comments (9)

cgnorthcutt commented on May 22, 2024

Cleanlab requires at least 1 example in every class, otherwise there is nothing to train on.

Dealing with inputs like yours is on the roadmap, but for now you need to reformat your input so that the labels are 0, 1, 2, ..., and neither psx nor s may contain any zero-size classes.
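
One way to do the reformatting described above is sketched below. This is only a sketch, not part of cleanlab: compact_labels is a hypothetical helper, the toy data is made up, and it assumes the columns of psx are indexed by the original integer label values, which may not hold for your data.

```python
import numpy as np


def compact_labels(s, psx):
    """Hypothetical helper (not a cleanlab function).

    Relabels s as 0..K-1 and keeps only the psx columns of classes
    that actually appear in s, renormalizing each row to sum to 1.
    Assumes column j of psx corresponds to original label value j.
    """
    s = np.asarray(s)
    psx = np.asarray(psx, dtype=float)
    # classes: sorted original label values; s_compact: labels in 0..K-1
    classes, s_compact = np.unique(s, return_inverse=True)
    psx_compact = psx[:, classes]
    # Renormalize rows (assumes no row sums to zero after dropping columns).
    psx_compact = psx_compact / psx_compact.sum(axis=1, keepdims=True)
    return s_compact, psx_compact, classes


# Toy data: classes 1 and 2 never appear in s.
s = np.array([0, 0, 3, 3])
psx = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.1, 0.1, 0.2],
    [0.2, 0.1, 0.1, 0.6],
    [0.1, 0.1, 0.1, 0.7],
])

s_compact, psx_compact, classes = compact_labels(s, psx)
print(s_compact)           # [0 0 1 1]
print(psx_compact.shape)   # (4, 2)
print(classes[s_compact])  # [0 0 3 3], the original labels recovered
```

Keeping the classes array around makes it possible to map any result computed on the compact labels back to the original label values.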

from cleanlab.

Mickey-Guo commented on May 22, 2024

Cleanlab requires at least 1 example in every class, otherwise there is nothing to train on.

Dealing with inputs like yours is on the roadmap, but for now you need to reformat your input so that the labels are 0, 1, 2, ..., and neither psx nor s may contain any zero-size classes.

That is, if s has no zero-size class but the predicted probabilities psx do (which I guess could happen when some class is rather small), cleanlab cannot handle this situation for now, right?


cgnorthcutt commented on May 22, 2024

Thanks for the follow-up. Cleanlab will never prune a class down to fewer than 5 examples; that is guaranteed. If your issue remains, could you explain further?


Mickey-Guo commented on May 22, 2024

Thanks for your answer. That makes sense: I see that MIN_NUM_PER_CLASS = 5 is defined in pruning.py to guarantee this.
I found another problem that can occur if I set thresholds manually rather than computing them from psx. If I set one class's threshold very high, say 0.999, it runs into the following code in the function cleanlab.latent_estimation.compute_confident_joint:

y_confident = true_label_guess[at_least_one_confident]
s_confident = s[at_least_one_confident]
confident_joint = confusion_matrix(y_confident, s_confident).T 

The union of y_confident and s_confident may not cover all classes. In that case the confusion matrix has the wrong size, and the following steps of Confident Learning fail.
So I think that if I want to set thresholds manually, I have to make sure that y_confident and s_confident cover all classes. What do you think?
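
A quick way to check this before calling cleanlab might look like the sketch below. It only mirrors the thresholding logic of compute_confident_joint as quoted in this thread; uncovered_classes is a hypothetical helper, not a cleanlab function, and the toy psx and s are illustrative.

```python
import numpy as np


def uncovered_classes(s, psx, thresholds):
    """Hypothetical helper (not a cleanlab function).

    Returns the classes that appear in neither y_confident nor
    s_confident for the given per-class thresholds.
    """
    s = np.asarray(s)
    psx = np.asarray(psx)
    thresholds = np.asarray(thresholds)
    K = psx.shape[1]

    # Same thresholding steps as in the code quoted in this thread.
    psx_bool = psx >= thresholds - 1e-6
    num_confident_bins = psx_bool.sum(axis=1)
    at_least_one_confident = num_confident_bins > 0
    true_label_guess = np.where(
        num_confident_bins > 1,
        psx.argmax(axis=1),
        psx_bool.argmax(axis=1),
    )
    covered = np.union1d(
        true_label_guess[at_least_one_confident],
        s[at_least_one_confident],
    )
    return np.setdiff1d(np.arange(K), covered)


psx = np.array([
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.2, 0.1, 0.7],
])
s = np.array([0, 0, 1, 2])

missing_high = uncovered_classes(s, psx, [0.85, 0.999, 0.7])
missing_mean = uncovered_classes(s, psx, [0.85, 0.7, 0.7])
print(missing_high)  # [1]
print(missing_mean)  # []
```

An empty result means every class is covered, so the confusion matrix would come out with the expected K x K shape.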


cgnorthcutt commented on May 22, 2024

Hi @Mickey-Guo, thanks for this feedback. Can you share a complete, minimal code example that reproduces this error? I can look into it.


Slicerkao commented on May 22, 2024

Hi @cgnorthcutt, in my use case I also encounter an error message like "ValueError: operands could not be broadcast together with shapes" due to a mismatch between the shapes of thresholds and psx. Is there any theoretical guidance on how to deal with classes that have no training data? Specifically, how should the threshold t_k be set for a class k that does not appear in the data?


cgnorthcutt commented on May 22, 2024

Hi @Slicerkao, I'm releasing the next version of cleanlab, 0.2, in the coming months, and I plan to simplify this issue for cleanlab users. It would help me a great deal if you could create the smallest possible complete, working code example that reproduces this issue. I'll add it to my tests and make sure that cleanlab 0.2 no longer has this issue.

If you're interested, essentially what I'll be doing is creating maps between the labels provided, the psx indices, and an internal representation of the labels; cleanlab will then run on these mapped labels instead of the actual inputs. This will take me some time, but I can roll it out this year, and if you both, @Mickey-Guo and @Slicerkao, can share complete, minimal code examples that reproduce the error, it will help. Thanks!


Mickey-Guo commented on May 22, 2024

Hi @cgnorthcutt! Here is a simple example; I hope it helps.
When we do not set thresholds manually, the call to confusion_matrix inside compute_confident_joint runs successfully:

from cleanlab.latent_estimation import (
    calibrate_confident_joint,
    compute_confident_joint
)
import numpy as np

psx = np.array([
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.2, 0.1, 0.7]
])
s = np.array([0, 0, 1, 2])
confident_joint = compute_confident_joint(s, psx)
print(confident_joint)

output:

array([[2, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])

But if we set one class's threshold very high, confusion_matrix returns a matrix of the wrong size and an error occurs downstream:

K = len(np.unique(s))
thresholds = np.array([np.mean(psx[:, k][s == k]) for k in range(K)])
# change one threshold
thresholds[1] = 0.9
print(thresholds)
confident_joint = compute_confident_joint(s, psx, thresholds=thresholds)
print(confident_joint)

output:

[0.85 0.9  0.7 ]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-9d666ecbdee1> in <module>
----> 1 confident_joint = compute_confident_joint(s, psx, thresholds=thresholds)
      2 confident_joint

~/anaconda3/envs/tensorflow1/lib/python3.6/site-packages/cleanlab/latent_estimation.py in compute_confident_joint(s, psx, K, thresholds, calibrate, multi_label, return_indices_of_off_diagonals)
    355 
    356     if calibrate:
--> 357         confident_joint = calibrate_confident_joint(confident_joint, s)
    358 
    359     if return_indices_of_off_diagonals:

~/anaconda3/envs/tensorflow1/lib/python3.6/site-packages/cleanlab/latent_estimation.py in calibrate_confident_joint(confident_joint, s, multi_label)
    119     # Calibrate confident joint to have correct p(s) prior on noisy labels.
    120     calibrated_cj = (
--> 121             confident_joint.T / confident_joint.sum(axis=1) * s_counts
    122     ).T
    123     # Calibrate confident joint to sum to:

ValueError: operands could not be broadcast together with shapes (2,2) (3,)

Line 121 in the traceback shows that confident_joint has shape (2,2) while s_counts has shape (3,).
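
The shape mismatch can be reproduced in isolation with plain NumPy. The snippet below is only an illustration: it reuses the expression from line 121 of the traceback with hand-written arrays of the shapes reported there, not the actual cleanlab internals.

```python
import numpy as np

# Shapes taken from the traceback: a 2x2 confident joint but 3 label counts.
confident_joint = np.array([[1, 0], [0, 1]])
s_counts = np.array([2, 1, 1])

message = None
try:
    # Same expression as line 121 in the traceback above.
    calibrated_cj = (confident_joint.T / confident_joint.sum(axis=1) * s_counts).T
except ValueError as err:
    message = str(err)

print(message)
```

The division works because both operands are built from the 2x2 matrix; the multiplication by the length-3 s_counts is the step that cannot be broadcast.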

If we dive into the code of compute_confident_joint, we can see that when one class's threshold is set very high, y_confident and s_confident may not cover all classes.

K = len(np.unique(s))

# we manually set thresholds
thresholds = np.asarray([0.85, 0.9, 0.7 ])

psx_bool = (psx >= thresholds - 1e-6)
num_confident_bins = psx_bool.sum(axis=1)
at_least_one_confident = num_confident_bins > 0
more_than_one_confident = num_confident_bins > 1
psx_argmax = psx.argmax(axis=1)
confident_argmax = psx_bool.argmax(axis=1)

true_label_guess = np.where(
    more_than_one_confident,
    psx_argmax,
    confident_argmax,
)

y_confident = true_label_guess[at_least_one_confident]
s_confident = s[at_least_one_confident]

print(set(y_confident.tolist() + s_confident.tolist()))

output:

{0, 2}

The output above shows that neither y_confident nor s_confident includes class 1. This results in a 2x2 confusion matrix instead of a 3x3 one:

from sklearn.metrics import confusion_matrix
confident_joint = confusion_matrix(y_confident, s_confident).T 
print(confident_joint)

output:

[[1 0]
 [0 1]]

But we expect the confusion matrix to be

[[1 0 0]
 [0 0 0]
 [0 0 1]]

This will cause the next step to fail.

If we compute the thresholds with thresholds = np.array([np.mean(psx[:, k][s == k]) for k in range(K)]), the error does not occur, because y_confident and s_confident are guaranteed to cover all classes. The confusion matrix is then the right size, and calibrate_confident_joint runs successfully. If we set thresholds manually, the error may occur.

So I think the confusion matrix computation may need to be rewritten rather than just calling sklearn.metrics.confusion_matrix.
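
A minimal sketch of such a rewrite, assuming the number of classes K is known up front, is shown below. count_confident_joint is a hypothetical helper, not cleanlab's implementation. Passing labels=range(K) to sklearn.metrics.confusion_matrix should give the same fixed-size result.

```python
import numpy as np


def count_confident_joint(s_confident, y_confident, K):
    """Hypothetical helper (not cleanlab's implementation).

    Builds a K x K count matrix whose rows are the given labels s and
    whose columns are the guessed true labels y, matching the transposed
    confusion matrix used in the thread. Classes with no confident
    examples keep an all-zero row and column.
    """
    confident_joint = np.zeros((K, K), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(confident_joint, (s_confident, y_confident), 1)
    return confident_joint


# Values from the example above: class 1 is missing from both arrays.
y_confident = np.array([0, 2])
s_confident = np.array([0, 2])

confident_joint = count_confident_joint(s_confident, y_confident, K=3)
print(confident_joint)
# [[1 0 0]
#  [0 0 0]
#  [0 0 1]]
```

Note that the calibration line quoted in the traceback divides by each row sum, so an all-zero row would likely still need separate handling there.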


jwmueller commented on May 22, 2024

Cleanlab now supports datasets with some classes missing (just added to the developer version).
The official number of classes is now determined by the dimensionality of pred_probs, so the package should now be more usable for iterative applications like active learning where the set of unique data labels may change over time.
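
The idea of taking the number of classes from pred_probs rather than from the labels can be illustrated in plain NumPy. This is only a sketch of the principle with made-up data, not cleanlab's internal code.

```python
import numpy as np

# Toy data: three classes in pred_probs, but class 2 never appears in labels.
pred_probs = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
])
labels = np.array([0, 1, 0])

# Number of classes comes from the columns of pred_probs ...
num_classes = pred_probs.shape[1]
# ... so per-class counts keep a slot (with count 0) for the missing class.
label_counts = np.bincount(labels, minlength=num_classes)

print(num_classes)                   # 3
print(len(np.unique(labels)))        # 2, what the label-based count would give
print(label_counts)                  # [2 1 0]
```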

This support was added in:
#511
#518

Feel free to reopen this issue if you still encounter any problems (using latest developer version)!

