Comments (4)
Thanks for opening this! Sorry I don't have a more helpful update, but I just wanted to say that I'm looking into this, and I do think there's a bug here. I'm making some regression tests using the Iris dataset, with a single label-encoded target column, so I'll probably ask you to try to reproduce some of my results when I'm further along in the bug hunt.
In the meantime, have you tried adjusting the `do_predict_proba` kwarg of your `Environment`? Are you expecting `log_loss` to be called with a single column of label-encoded predictions, or four columns of class probabilities? I believe the former won't work, as `log_loss` automatically assumes a 1-dimensional `y_pred` to be binary...
Like I said, I need to investigate some more, but I'd really appreciate you commenting any of your findings here!
Edit: Thanks for looking for related issues, as well!
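For reference, sklearn's behavior here can be shown directly, independent of hyperparameter_hunter. A minimal sketch (the toy values are arbitrary):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [0, 1, 2, 3]

# A 1-D y_pred is interpreted as P(positive class) for a BINARY problem,
# so it cannot match the four classes found in y_true
try:
    log_loss(y_true, [0.1, 0.9, 0.2, 0.8])
except ValueError as err:
    print(err)  # complains that y_true and y_pred contain different numbers of classes

# An (n_samples, n_classes) matrix of probabilities works as expected
proba = np.full((4, 4), 0.1)
np.fill_diagonal(proba, 0.7)  # each row sums to 1.0
print(log_loss(y_true, proba))
```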
Thanks for your quick reply!
Sure, I tried setting `do_predict_proba=True`, but it didn't help; it seems to refuse multi-column predictions for some reason.
I have to say `log_loss` is a bit special, because it requires a `(n_samples, n_classes)` `y_pred`, while the other metrics you tested before, I guess, forced `y_pred` to be a single column.
Here is the code I used to test:
```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

from hyperparameter_hunter import Environment, CVExperiment, BayesianOptPro

# Make a toy 4-class dataset with a single label-encoded target column
x, y = make_classification(n_samples=1000, n_classes=4, n_informative=10)
train_df = pd.DataFrame(x, columns=range(x.shape[1]))
train_df["y"] = y

# NOTE: each test below was run on its own; CVExperiment always picks up
# the most recently created (active) Environment

'''
TEST 1
metrics=["log_loss"]
do_predict_proba=False
ValueError: y_true and y_pred contain different number of classes 4, 2.
Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1 2 3]
'''
env1 = Environment(
    train_dataset=train_df,
    results_path="HyperparameterHunterAssets",
    target_column="y",
    metrics=["log_loss"],
    do_predict_proba=False,
    cv_type="StratifiedKFold",
    # recent sklearn requires shuffle=True whenever random_state is set
    cv_params=dict(n_splits=5, shuffle=True, random_state=32),
    verbose=1,
)

'''
TEST 2
metrics=dict(logloss=lambda y_true, y_pred: metrics.log_loss(y_true, y_pred, labels=[0,1,2,3]))
do_predict_proba=False
ValueError: The number of classes in labels is different from that in y_pred.
Classes found in labels: [0 1 2 3]
'''
env2 = Environment(
    train_dataset=train_df,
    results_path="HyperparameterHunterAssets",
    target_column="y",
    metrics=dict(logloss=lambda y_true, y_pred: metrics.log_loss(y_true, y_pred, labels=[0, 1, 2, 3])),
    do_predict_proba=False,
    cv_type="StratifiedKFold",
    cv_params=dict(n_splits=5, shuffle=True, random_state=32),
    verbose=1,
)

'''
TEST 3
metrics=["log_loss"]
do_predict_proba=True
ValueError: Wrong number of items passed 4, placement implies 1
'''
env3 = Environment(
    train_dataset=train_df,
    results_path="HyperparameterHunterAssets",
    target_column="y",
    metrics=["log_loss"],
    do_predict_proba=True,
    cv_type="StratifiedKFold",
    cv_params=dict(n_splits=5, shuffle=True, random_state=32),
    verbose=1,
)

'''
TEST 4
metrics=dict(logloss=lambda y_true, y_pred: metrics.log_loss(y_true, y_pred, labels=[0,1,2,3]))
do_predict_proba=True
ValueError: Wrong number of items passed 4, placement implies 1
'''
env4 = Environment(
    train_dataset=train_df,
    results_path="HyperparameterHunterAssets",
    target_column="y",
    metrics=dict(logloss=lambda y_true, y_pred: metrics.log_loss(y_true, y_pred, labels=[0, 1, 2, 3])),
    do_predict_proba=True,
    cv_type="StratifiedKFold",
    cv_params=dict(n_splits=5, shuffle=True, random_state=32),
    verbose=1,
)

experiment = CVExperiment(
    model_initializer=RandomForestClassifier,
    model_init_params=dict(n_estimators=10),
)
```
Thanks for posting your sample code! It's very helpful! Sorry for the delay, but I've been busy with other things lately. I'm looking at this issue again today, and I have to agree with you: `log_loss` does seem rather weird. Although I may just be thinking that because I haven't done too much experimentation with other metrics.
Do you know of any other metrics that behave similarly or might cause other problems?
Also, do you think that another `Environment` kwarg might be necessary to clear up behavior in situations like this? `do_predict_proba` seems like half of the solution... But I'm thinking we need one kwarg to declare how predictions should be passed to metrics, then a second to declare how predictions should be saved in a situation like this. I'd love to hear your thoughts!
Sorry for my late reply; I've been a bit busy these days...
I think there are only two metrics that accept multi-column predicted probabilities: `log_loss` and `hinge_loss`. I think `do_predict_proba` is enough, as you already indicated in the documentation:

> If True, it will call models.Model.model.predict_proba(), and the values in all columns will be used as the actual prediction values

I know that in most cases there is no need to take `proba` into consideration, and even log loss is mostly applied as a loss function rather than as a metric, but in my recent case I have to evaluate how much "confidence" the model has in its results so I can improve it.