catboost / benchmarks

Comparison tools

License: Apache License 2.0

Python 4.93% Shell 0.05% Jupyter Notebook 93.59% R 1.04% TeX 0.39% Dockerfile 0.01%

benchmark comparison quality speed gpu ranking classification regression catboost xgboost

benchmarks's Introduction

Website | Documentation | Tutorials | Installation | Release Notes

CatBoost is a machine learning method based on gradient boosting over decision trees.

Main advantages of CatBoost:

Superior quality when compared with other GBDT libraries on many datasets.
Best-in-class prediction speed.
Support for both numerical and categorical features.
Fast GPU and multi-GPU support for training out of the box.
Visualization tools included.
Fast and reproducible distributed training with Apache Spark and CLI.

Get Started and Documentation

All CatBoost documentation is available here.

Install CatBoost by following the guide for the

Next you may want to investigate:

Tutorials
Training modes and metrics
Cross-validation
Parameter tuning
Feature importance calculation
Regular and staged predictions
CatBoost for Apache Spark videos: Introduction and Architecture

If you cannot open the documentation in your browser, try adding yastatic.net and yastat.net to the list of allowed domains in Privacy Badger.

CatBoost models in production

If you want to evaluate a CatBoost model in your application, read the model API documentation.

Questions and bug reports

To report bugs, please use the catboost/bugreport page.
Ask a question on Stack Overflow with the catboost tag; we monitor it for new questions.
Seek prompt advice in the Telegram group or the Russian-speaking Telegram chat.

Help to Make CatBoost Better

Check out open problems and help wanted issues to see what can be improved, or open an issue if you want something.
Add your stories and experience to Awesome CatBoost.
To contribute to CatBoost, first read the CLA text and state in your pull request that you agree to the terms of the CLA. More information can be found in CONTRIBUTING.md.
Instructions for contributors can be found here.

Reference Paper

Anna Veronika Dorogush, Andrey Gulin, Gleb Gusev, Nikita Kazeev, Liudmila Ostroumova Prokhorenkova, Aleksandr Vorobev "Fighting biases with dynamic boosting". arXiv:1706.09516, 2017.

Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin "CatBoost: gradient boosting with categorical features support". Workshop on ML Systems at NIPS 2017.

benchmarks's Issues

TypeError: fmin() got an unexpected keyword argument 'rseed'

Hi guys, does anyone know why this parameter is not recognized in /quality_benchmarks/experiments.py?

TypeError: fmin() got an unexpected keyword argument 'rseed'
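
This error usually means that the installed version of the library providing fmin no longer accepts an rseed keyword. If the benchmark uses hyperopt, newer releases, to my knowledge, take an rstate argument (a NumPy random state or generator, depending on the version) instead. A minimal sketch, under that assumption, for checking which seed keyword a given fmin accepts; seed_keyword is a hypothetical helper and is not part of hyperopt or this repository:

```python
import inspect


def seed_keyword(fmin_func):
    """Return the seed-related keyword that fmin_func accepts, or None."""
    params = inspect.signature(fmin_func).parameters
    # Check the newer name first, then the older one (assumed rename).
    for name in ('rstate', 'rseed'):
        if name in params:
            return name
    return None


# Stand-ins for an old and a new fmin signature (illustration only).
def old_style_fmin(fn, space, rseed=0):
    return None


def new_style_fmin(fn, space, rstate=None):
    return None


print(seed_keyword(old_style_fmin))
print(seed_keyword(new_style_fmin))
```

With the real library you would pass the imported fmin to seed_keyword and then update the call in experiments.py to use whichever keyword is reported.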

CatBoost with categorical features fails to work with sklearn CalibratedClassifierCV

I am trying to calibrate my CatBoostClassifier model using sklearn's CalibratedClassifierCV. Fitting runs fine, but predicting with the calibrated model fails. I already tried LGBMClassifier, which takes the same categorical_features, and it runs fine. Is there any solution for this issue? Here is the code that I use:

from catboost import CatBoostClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
import pandas as pd
X, y = make_classification(n_samples=100, n_features=3, n_redundant=0, random_state=42)
X = pd.DataFrame(X, columns=['a', 'b', 'c'])
X['d'] = [1, 2, 3, 4, 5] * 20  # integer column used as the categorical feature
model = CatBoostClassifier()
model.fit(X, y, verbose=False, cat_features=[3])
model_cat = CalibratedClassifierCV(base_estimator=model, cv='prefit')
model_cat.fit(X, y)
model_cat.predict(X)  # this call raises the CatBoostError

CatBoostError                             Traceback (most recent call last)
/tmp/ipykernel_3228/1832915274.py in <module>
----> 1 model_cat.predict(X)

~/anaconda3/lib/python3.8/site-packages/sklearn/calibration.py in predict(self, X)
    383         """
    384         check_is_fitted(self)
--> 385         return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
    386 
    387     def _more_tags(self):

~/anaconda3/lib/python3.8/site-packages/sklearn/calibration.py in predict_proba(self, X)
    360         mean_proba = np.zeros((X.shape[0], len(self.classes_)))
    361         for calibrated_classifier in self.calibrated_classifiers_:
--> 362             proba = calibrated_classifier.predict_proba(X)
    363             mean_proba += proba
    364 

~/anaconda3/lib/python3.8/site-packages/sklearn/calibration.py in predict_proba(self, X)
    637         n_classes = len(self.classes)
    638         pred_method = _get_prediction_method(self.base_estimator)
--> 639         predictions = _compute_predictions(pred_method, X, n_classes)
    640 
    641         label_encoder = LabelEncoder().fit(self.classes)

~/anaconda3/lib/python3.8/site-packages/sklearn/calibration.py in _compute_predictions(pred_method, X, n_classes)
    499         (X.shape[0], 1).
    500     """
--> 501     predictions = pred_method(X=X)
    502     if hasattr(pred_method, '__name__'):
    503         method_name = pred_method.__name__

~/anaconda3/lib/python3.8/site-packages/catboost/core.py in predict_proba(self, X, ntree_start, ntree_end, thread_count, verbose, task_type)
   4767                 with probability for every class for each object.
   4768         """
-> 4769         return self._predict(X, 'Probability', ntree_start, ntree_end, thread_count, verbose, 'predict_proba', task_type)
   4770 
   4771 

~/anaconda3/lib/python3.8/site-packages/catboost/core.py in _predict(self, data, prediction_type, ntree_start, ntree_end, thread_count, verbose, parent_method_name, task_type)
   2175         if verbose is None:
   2176             verbose = False
-> 2177         data, data_is_single_object = self._process_predict_input_data(data, parent_method_name, thread_count)
   2178         self._validate_prediction_type(prediction_type)
   2179 

~/anaconda3/lib/python3.8/site-packages/catboost/core.py in _process_predict_input_data(self, data, parent_method_name, thread_count, label)
   2155         is_single_object = _is_data_single_object(data)
   2156         if not isinstance(data, Pool):
-> 2157             data = Pool(
   2158                 data=[data] if is_single_object else data,
   2159                 label=label,

~/anaconda3/lib/python3.8/site-packages/catboost/core.py in __init__(self, data, label, cat_features, text_features, embedding_features, column_description, pairs, delimiter, has_header, ignore_csv_quoting, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count, log_cout, log_cerr)
    580                 elif isinstance(data, np.ndarray):
    581                     if (data.dtype.kind == 'f') and (cat_features is not None) and (len(cat_features) > 0):
--> 582                         raise CatBoostError(
    583                             "'data' is numpy array of floating point numerical type, it means no categorical features,"
    584                             " but 'cat_features' parameter specifies nonzero number of categorical features"

CatBoostError: 'data' is numpy array of floating point numerical type, it means no categorical features, but 'cat_features' parameter specifies nonzero number of categorical features

Bug in experiments.py

File "training_speed/experiments.py", line 126
return ''.join(map(lambda (key, value): '{}[{}]'.format(key, str(value)), params.items()))

SyntaxError: invalid syntax

I'm using Python 3 and it gave me this error on the use of lambda.
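
The lambda (key, value): ... form on the reported line is Python 2 tuple-parameter unpacking, which was removed in Python 3. A minimal sketch of an equivalent Python 3 expression; the function name params_to_str is hypothetical, since the issue does not show the enclosing function:

```python
def params_to_str(params):
    """Format a dict as 'key[value]key[value]...' without tuple-unpacking lambdas."""
    # Unpack each (key, value) pair in the generator expression instead.
    return ''.join('{}[{}]'.format(key, str(value)) for key, value in params.items())


print(params_to_str({'depth': 6, 'iterations': 100}))
```

Other Python 2 constructs may remain elsewhere in the script, so this change alone may not be enough to run it under Python 3.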

What are recommended params?

I want to run benchmark (e.g. gpu_vs_cpu_training_speed/run_experiment_catboost.py) on my device.

What command line arguments should I use?

questions about the pre-processing of categorical features

From the pdf, I found:

I want to ask:

Is this processing used in all tools? If yes, why is CatBoost still much better when using the same dataset without the original categorical features? (I thought the advantage of CatBoost was its handling of categorical features.)
What is the range of k for c_ij and d_ij?

Thanks !

'data' is numpy array of floating point numerical type, it means no categorical features, but 'cat_features' parameter specifies nonzero number of categorical features

Hi everyone, it's me again. I ran this code and got the error below:
pool = Pool(data, label, cat_features=cat_cols)

The error:
'data' is numpy array of floating point numerical type, it means no categorical features," _catboost.CatBoostError: 'data' is numpy array of floating point numerical type, it means no categorical features, but 'cat_features' parameter specifies nonzero number of categorical features

Does anyone know what is happening? I didn't change any of the code, so the error may be caused by my train and test files, but I don't know what structure the train and test files should have.

Here is the link:
https://github.com/yandexdataschool/catboost_research/blob/master/experiments/comparison_description.pdf
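
The error message indicates that the array passed to Pool has a floating point dtype while cat_features is non-empty. A minimal sketch of one possible workaround, assuming the categorical columns hold integer codes that were loaded as floats: copy the data into an object array and turn those columns into strings, which CatBoost, to my knowledge, accepts as categorical values. The helper name to_catboost_array is hypothetical and is not part of CatBoost or this repository:

```python
import numpy as np


def to_catboost_array(data, cat_cols):
    """Return an object-dtype copy whose categorical columns hold strings."""
    data = np.asarray(data)
    out = data.astype(object)  # numeric columns stay as Python floats
    for col in cat_cols:
        # Assumption: categorical values are integer codes stored as floats.
        out[:, col] = [str(int(v)) for v in data[:, col]]
    return out


raw = np.array([[0.5, 1.0], [0.7, 2.0]])
prepared = to_catboost_array(raw, cat_cols=[1])
print(prepared.dtype, prepared[0, 1])
```

The prepared array would then be passed to Pool in place of the original float array, with the same cat_features indices.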

issue in the function computing NDCG

Hi,

I just found an issue in the following function:

def ndcg(y_pred, y_true, top):
   assert y_pred.shape[0] == y_true.shape[0]
   top = min(top, y_pred.shape[0])

   first_k_docs = sorted(zip(y_true, y_pred), key=cmp_to_key(doc_comparator))
   first_k_docs = np.array(first_k_docs)[:top,0]

   top_k_idxs = np.argsort(y_true)[::-1][:top]
   top_k_docs = y_true[top_k_idxs]

   dcg = cumulative_gain(first_k_docs)
   idcg = cumulative_gain(top_k_docs)

   return dcg / idcg if idcg > 0 else 1.

How can ndcg be 1 if idcg == 0? If idcg == 0, you should just ignore that query. Returning 1 definitely makes the NDCG look higher than it should be.

Best,

Ruocheng Guo
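
The suggestion in this issue can be sketched as follows: queries whose ideal DCG is zero are excluded from the average instead of being scored as 1. This is a standalone illustration, not the repository's code. The gain and discount in dcg are an assumed common definition, because cumulative_gain and doc_comparator are not shown in the issue, and ties in predicted score are broken by input order here.

```python
import math


def dcg(relevances):
    # Assumed definition: (2^rel - 1) / log2(rank + 1), ranks starting at 1.
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg_or_none(y_pred, y_true, top):
    """Return NDCG@top for one query, or None when the ideal DCG is zero."""
    assert len(y_pred) == len(y_true)
    top = min(top, len(y_pred))
    # Relevance labels ordered by descending predicted score.
    ranked = [rel for _, rel in sorted(zip(y_pred, y_true), key=lambda p: p[0], reverse=True)]
    idcg = dcg(sorted(y_true, reverse=True)[:top])
    if idcg == 0:
        return None  # no relevant documents, so the query is skipped
    return dcg(ranked[:top]) / idcg


def mean_ndcg(queries, top):
    """Average NDCG@top over the queries that have a non-zero ideal DCG."""
    scores = [ndcg_or_none(pred, true, top) for pred, true in queries]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None


queries = [([0.9, 0.1], [1, 0]), ([0.5, 0.4], [0, 0])]
print(mean_ndcg(queries, top=2))
```

In this example the second query has no relevant documents, so only the first query contributes to the mean.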