Comments (9)
Cleanlab requires at least 1 example in every class, otherwise there is nothing to train on.
Dealing with inputs like yours is on the roadmap, but for now you need to reformat your input so that the labels are 0, 1, 2, ..., and psx and s must not have any zero-size classes.
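A minimal sketch of that reformatting step, assuming the original labels are arbitrary integers (raw_labels is a hypothetical name) and that every class has at least one example. It uses NumPy only and is not part of cleanlab:

```python
import numpy as np

# Hypothetical raw labels that are not in the 0, 1, 2, ... format.
raw_labels = np.array([10, 10, 30, 50])

# classes holds the sorted unique labels; s holds, for each example,
# the index of its label within classes, so s takes values 0..K-1.
classes, s = np.unique(raw_labels, return_inverse=True)

print(classes)  # [10 30 50]
print(s)        # [0 0 1 2]

# The mapping is invertible: indexing classes with s restores the input.
restored = classes[s]
```

The columns of psx would then need to follow the same order as classes.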
Cleanlab requires at least 1 example in every class, otherwise there is nothing to train on.
Dealing with inputs like yours is on the roadmap, but for now you need to reformat your input so that the labels are 0, 1, 2, ..., and psx and s must not have any zero-size classes.
That is, if s has no zero-size class but the predicted result psx does, which I guess would happen if some class is rather small, cleanlab cannot handle this situation for now. Right?
Thanks for the follow-up. Cleanlab will never prune a class down to fewer than 5 examples; that is guaranteed. If your issue remains, could you explain further?
Thanks for your answer. I understand this, as I noticed that MIN_NUM_PER_CLASS = 5 is defined in pruning.py to guarantee it.
I found another problem that might happen if I set thresholds manually rather than calculating them from psx. Suppose I set some class's threshold very high, like 0.999. Then, because of the following code in the function cleanlab.latent_estimation.compute_confident_joint:
y_confident = true_label_guess[at_least_one_confident]
s_confident = s[at_least_one_confident]
confident_joint = confusion_matrix(y_confident, s_confident).T
the union of y_confident and s_confident may not cover all classes. This would produce a confusion matrix of the wrong size, and the following step of Confident Learning would fail.
So I think that if I want to set thresholds manually, I have to make sure that y_confident and s_confident cover all classes. What do you think?
Hi @Mickey-Guo, thanks for this feedback. Can you share a complete, minimal code example that reproduces this error? I can look into it.
Hi @cgnorthcutt, in my use case I also encounter an error message like "ValueError: operands could not be broadcast together with shapes" due to a mismatch between the thresholds and psx shapes. Is there any theoretical suggestion on how to deal with classes that have no training data? Specifically, how should the threshold t_k be set for a class k that does not appear in the data?
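One possible way to keep the shapes consistent, sketched here under a stated assumption rather than as a theoretical recommendation: take the number of classes from psx instead of from s, and give every class that is absent from s an unreachable threshold (np.inf), so that no example is ever counted as confidently belonging to it. The choice of np.inf is an assumption of this sketch, not something cleanlab prescribes.

```python
import numpy as np

psx = np.array([
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.2, 0.1, 0.7],
])
s = np.array([0, 0, 2])  # class 1 has no examples

K = psx.shape[1]  # number of classes taken from psx, not from np.unique(s)

# Assumption: absent classes get a threshold that can never be reached.
thresholds = np.full(K, np.inf)
for k in range(K):
    mask = s == k
    if mask.any():
        # Same rule as in the thread: mean predicted probability of class k
        # over the examples labeled k.
        thresholds[k] = psx[mask, k].mean()

print(thresholds.shape)  # (3,)
```

With this convention, thresholds always has one entry per column of psx, which avoids the shape mismatch, but whether the resulting estimates are meaningful for the empty class is a separate question.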
Hi @Slicerkao, I'm releasing the next version of cleanlab, 0.2, in the coming months, and I plan to simplify this issue for cleanlab users. It would help me a great deal if you could create the smallest possible complete, working code example that reproduces this issue. I'll add it to my tests and make sure that cleanlab 0.2 no longer has this issue.
If you're interested, essentially what I'll be doing is creating maps between the labels provided, the psx indices, and an internal representation of the labels, and cleanlab will run on these mapped labels instead of the actual inputs. This will take me some time, but I can roll it out this year, and if you, @Mickey-Guo and @Slicerkao, can share complete, minimal code examples that reproduce the error, it will help. Thanks!
Hi @cgnorthcutt! The following is a simple example. I hope it helps.
When we do not set thresholds manually, the function confusion_matrix runs successfully:
from cleanlab.latent_estimation import (
    calibrate_confident_joint,
    compute_confident_joint,
)
import numpy as np
psx = np.array([
[0.9, 0.05, 0.05],
[0.8, 0.1, 0.1],
[0.2, 0.7, 0.1],
[0.2, 0.1, 0.7]
])
s = np.array([0, 0, 1, 2])
confident_joint = compute_confident_joint(s, psx)
print(confident_joint)
output:
array([[2, 0, 0],
[0, 1, 0],
[0, 0, 1]])
But if we set one class's threshold very high, an error occurs after confusion_matrix runs.
K = len(np.unique(s))
thresholds = np.array([np.mean(psx[:, k][s == k]) for k in range(K)])
# change one threshold
thresholds[1] = 0.9
print(thresholds)
confident_joint = compute_confident_joint(s, psx, thresholds=thresholds)
print(confident_joint)
output:
[0.85 0.9 0.7 ]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-9d666ecbdee1> in <module>
----> 1 confident_joint = compute_confident_joint(s, psx, thresholds=thresholds)
2 confident_joint
~/anaconda3/envs/tensorflow1/lib/python3.6/site-packages/cleanlab/latent_estimation.py in compute_confident_joint(s, psx, K, thresholds, calibrate, multi_label, return_indices_of_off_diagonals)
355
356 if calibrate:
--> 357 confident_joint = calibrate_confident_joint(confident_joint, s)
358
359 if return_indices_of_off_diagonals:
~/anaconda3/envs/tensorflow1/lib/python3.6/site-packages/cleanlab/latent_estimation.py in calibrate_confident_joint(confident_joint, s, multi_label)
119 # Calibrate confident joint to have correct p(s) prior on noisy labels.
120 calibrated_cj = (
--> 121 confident_joint.T / confident_joint.sum(axis=1) * s_counts
122 ).T
123 # Calibrate confident joint to sum to:
ValueError: operands could not be broadcast together with shapes (2,2) (3,)
Line number 121 suggests that confident_joint has shape (2,2) but s_counts has shape (3,).
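The mismatch can be reproduced with NumPy alone. The sketch below reuses the expression from line 121 of the traceback with hand-written inputs (a 2x2 matrix and the label counts for s = [0, 0, 1, 2]); it does not call cleanlab:

```python
import numpy as np

# 2x2 matrix, as produced when one class is missing from the confident examples.
confident_joint = np.array([[1, 0], [0, 1]])
# Label counts for all 3 classes in s = [0, 0, 1, 2].
s_counts = np.array([2, 1, 1])

message = None
try:
    calibrated_cj = (
        confident_joint.T / confident_joint.sum(axis=1) * s_counts
    ).T
except ValueError as err:
    # A (2, 2) array cannot be broadcast against a length-3 vector.
    message = str(err)
    print(message)
```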
If we dive into the code of the function compute_confident_joint, we can see that when one class's threshold is set very high, y_confident and s_confident do not cover all classes.
K = len(np.unique(s))
# we manually set thresholds
thresholds = np.asarray([0.85, 0.9, 0.7 ])
psx_bool = (psx >= thresholds - 1e-6)
num_confident_bins = psx_bool.sum(axis=1)
at_least_one_confident = num_confident_bins > 0
more_than_one_confident = num_confident_bins > 1
psx_argmax = psx.argmax(axis=1)
confident_argmax = psx_bool.argmax(axis=1)
true_label_guess = np.where(
more_than_one_confident,
psx_argmax,
confident_argmax,
)
y_confident = true_label_guess[at_least_one_confident]
s_confident = s[at_least_one_confident]
print(set(y_confident.tolist() + s_confident.tolist()))
output:
{0, 2}
The above output shows that y_confident and s_confident do not include class 1. This results in a 2x2 confusion matrix instead of a 3x3 one:
from sklearn.metrics import confusion_matrix
confident_joint = confusion_matrix(y_confident, s_confident).T
print(confident_joint)
output:
[[1 0]
[0 1]]
But we expect the confusion matrix to be
[[1 0 0]
[0 0 0]
[0 0 1]]
This will cause the next step to fail.
If we calculate thresholds by thresholds = np.array([np.mean(psx[:, k][s == k]) for k in range(K)]), the error does not occur because y_confident and s_confident cover all classes. The confusion matrix is then the right size, and the function calibrate_confident_joint runs successfully. If we set thresholds manually, the error may occur.
So I think the confusion matrix computation may need to be rewritten rather than just calling sklearn.metrics.confusion_matrix.
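A minimal sketch of such a rewrite, assuming the number of classes K is known in advance. The helper name fixed_size_confusion_matrix is hypothetical and is not part of cleanlab; it uses NumPy only and takes the y_confident and s_confident values from the example above:

```python
import numpy as np

def fixed_size_confusion_matrix(y_true, y_pred, K):
    """Count (true, predicted) pairs into a K x K matrix, keeping empty classes."""
    cm = np.zeros((K, K), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Values from the example above: class 1 never appears.
y_confident = np.array([0, 2])
s_confident = np.array([0, 2])

confident_joint = fixed_size_confusion_matrix(y_confident, s_confident, K=3).T
print(confident_joint)
```

sklearn.metrics.confusion_matrix also accepts a labels argument (for example labels=range(K)), which should fix the output shape in the same way without a custom helper.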
Cleanlab now supports datasets with some classes missing (just added to the developer version).
The official number of classes is now determined by the dimensionality of pred_probs, so the package should now be more usable for iterative applications like active learning where the set of unique data labels may change over time.
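To illustrate the difference between the two ways of counting classes, here is a small NumPy sketch with made-up data. It only shows the idea and is not cleanlab's internal code:

```python
import numpy as np

pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.2, 0.1, 0.7],
    [0.1, 0.2, 0.7],
])
labels = np.array([0, 0, 2, 2])  # class 1 is missing from the labels

# Number of classes from the predicted probabilities: one per column.
num_classes = pred_probs.shape[1]
# Number of classes from the labels alone undercounts when a class is missing.
num_observed = len(np.unique(labels))
# Per-class counts padded to num_classes, so missing classes show up as 0.
class_counts = np.bincount(labels, minlength=num_classes)

print(num_classes)   # 3
print(num_observed)  # 2
print(class_counts)  # [2 0 2]
```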
This support was added in:
#511
#518
Feel free to reopen this issue if you still encounter any problems (using latest developer version)!