
catboost / benchmarks


Comparison tools

License: Apache License 2.0

Python 4.93% Shell 0.05% Jupyter Notebook 93.59% R 1.04% TeX 0.39% Dockerfile 0.01%
benchmark comparison quality speed gpu ranking classification regression catboost xgboost

benchmarks's Introduction



CatBoost is a machine learning method based on gradient boosting over decision trees.

Main advantages of CatBoost:

Get Started and Documentation

All CatBoost documentation is available here.

Install CatBoost by following the guide for the

Next you may want to investigate:

If you cannot open the documentation in your browser, try adding yastatic.net and yastat.net to the list of allowed domains in Privacy Badger.

CatBoost models in production

If you want to evaluate a CatBoost model in your application, read the model API documentation.

Questions and bug reports

Help to Make CatBoost Better

  • Check out open problems and help wanted issues to see what can be improved, or open an issue if you want something.
  • Add your stories and experience to Awesome CatBoost.
  • To contribute to CatBoost, first read the CLA text and state in your pull request that you agree to the terms of the CLA. More information can be found in CONTRIBUTING.md.
  • Instructions for contributors can be found here.

News

The latest news is published on Twitter.

Reference Paper

Anna Veronika Dorogush, Andrey Gulin, Gleb Gusev, Nikita Kazeev, Liudmila Ostroumova Prokhorenkova, Aleksandr Vorobev "Fighting biases with dynamic boosting". arXiv:1706.09516, 2017.

Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin "CatBoost: gradient boosting with categorical features support". Workshop on ML Systems at NIPS 2017.

License

© YANDEX LLC, 2017-2024. Licensed under the Apache License, Version 2.0. See LICENSE file for more details.


benchmarks's Issues

CatBoost with categorical features fails to work with sklearn CalibratedClassifierCV

I am trying to calibrate my CatBoostClassifier model using sklearn's CalibratedClassifierCV. Fitting runs fine, but predicting with the calibrated model fails. I have already tried LGBMClassifier, since it has the same kind of categorical-feature support, and it runs fine. Is there a solution for this issue? Here is the code I use:

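from catboost import CatBoostClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification  # import missing from the original snippet
import pandas as pd

# Three numeric features plus an integer-coded categorical column 'd'
X, y = make_classification(n_samples=100, n_features=3, n_redundant=0, random_state=42)
X = pd.DataFrame(X, columns=['a', 'b', 'c'])
X['d'] = [1, 2, 3, 4, 5] * 20

model = CatBoostClassifier()
model.fit(X, y, verbose=False, cat_features=[3])

# Calibrate the already-fitted model
model_cat = CalibratedClassifierCV(base_estimator=model, cv='prefit')
model_cat.fit(X, y)   # runs without error
model_cat.predict(X)  # raises the CatBoostError shown in the traceback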

CatBoostError                             Traceback (most recent call last)
/tmp/ipykernel_3228/1832915274.py in <module>
----> 1 model_cat.predict(X)

~/anaconda3/lib/python3.8/site-packages/sklearn/calibration.py in predict(self, X)
    383         """
    384         check_is_fitted(self)
--> 385         return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
    386 
    387     def _more_tags(self):

~/anaconda3/lib/python3.8/site-packages/sklearn/calibration.py in predict_proba(self, X)
    360         mean_proba = np.zeros((X.shape[0], len(self.classes_)))
    361         for calibrated_classifier in self.calibrated_classifiers_:
--> 362             proba = calibrated_classifier.predict_proba(X)
    363             mean_proba += proba
    364 

~/anaconda3/lib/python3.8/site-packages/sklearn/calibration.py in predict_proba(self, X)
    637         n_classes = len(self.classes)
    638         pred_method = _get_prediction_method(self.base_estimator)
--> 639         predictions = _compute_predictions(pred_method, X, n_classes)
    640 
    641         label_encoder = LabelEncoder().fit(self.classes)

~/anaconda3/lib/python3.8/site-packages/sklearn/calibration.py in _compute_predictions(pred_method, X, n_classes)
    499         (X.shape[0], 1).
    500     """
--> 501     predictions = pred_method(X=X)
    502     if hasattr(pred_method, '__name__'):
    503         method_name = pred_method.__name__

~/anaconda3/lib/python3.8/site-packages/catboost/core.py in predict_proba(self, X, ntree_start, ntree_end, thread_count, verbose, task_type)
   4767                 with probability for every class for each object.
   4768         """
-> 4769         return self._predict(X, 'Probability', ntree_start, ntree_end, thread_count, verbose, 'predict_proba', task_type)
   4770 
   4771 

~/anaconda3/lib/python3.8/site-packages/catboost/core.py in _predict(self, data, prediction_type, ntree_start, ntree_end, thread_count, verbose, parent_method_name, task_type)
   2175         if verbose is None:
   2176             verbose = False
-> 2177         data, data_is_single_object = self._process_predict_input_data(data, parent_method_name, thread_count)
   2178         self._validate_prediction_type(prediction_type)
   2179 

~/anaconda3/lib/python3.8/site-packages/catboost/core.py in _process_predict_input_data(self, data, parent_method_name, thread_count, label)
   2155         is_single_object = _is_data_single_object(data)
   2156         if not isinstance(data, Pool):
-> 2157             data = Pool(
   2158                 data=[data] if is_single_object else data,
   2159                 label=label,

~/anaconda3/lib/python3.8/site-packages/catboost/core.py in __init__(self, data, label, cat_features, text_features, embedding_features, column_description, pairs, delimiter, has_header, ignore_csv_quoting, weight, group_id, group_weight, subgroup_id, pairs_weight, baseline, feature_names, thread_count, log_cout, log_cerr)
    580                 elif isinstance(data, np.ndarray):
    581                     if (data.dtype.kind == 'f') and (cat_features is not None) and (len(cat_features) > 0):
--> 582                         raise CatBoostError(
    583                             "'data' is numpy array of floating point numerical type, it means no categorical features,"
    584                             " but 'cat_features' parameter specifies nonzero number of categorical features"

CatBoostError: 'data' is numpy array of floating point numerical type, it means no categorical features, but 'cat_features' parameter specifies nonzero number of categorical features
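
The traceback suggests that by the time the data reaches CatBoost's Pool it is a floating-point NumPy array rather than the original DataFrame, so the integer-coded column can no longer be treated as categorical. The sketch below only illustrates that dtype effect with NumPy; it does not call CatBoost or sklearn, and the helper `would_trigger_error` is a hypothetical name that simply mirrors the condition visible in the traceback.

```python
import numpy as np

# Three float columns plus one integer-coded categorical column,
# mirroring the shape of the data in the report above.
float_part = np.random.RandomState(42).rand(100, 3)
cat_column = np.array([1, 2, 3, 4, 5] * 20)

# Packing both into one homogeneous array upcasts every column to float64.
combined = np.column_stack([float_part, cat_column])
print(combined.dtype)       # float64
print(combined.dtype.kind)  # f


def would_trigger_error(data, cat_features):
    """Hypothetical helper: same condition as the check shown in the traceback."""
    return (
        isinstance(data, np.ndarray)
        and data.dtype.kind == 'f'
        and cat_features is not None
        and len(cat_features) > 0
    )


print(would_trigger_error(combined, [3]))                 # True
print(would_trigger_error(combined.astype(object), [3]))  # False
```

If this reading is right, any workaround has to keep the categorical column in a non-float form (for example, strings or integers in a DataFrame or object array) all the way to the CatBoost call.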

Bug in experiments.py

File "training_speed/experiments.py", line 126
return ''.join(map(lambda (key, value): '{}[{}]'.format(key, str(value)), params.items()))

SyntaxError: invalid syntax

I'm using Python 3, and it gives me this error on the use of lambda.
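
Python 3 removed tuple unpacking in function and lambda signatures (PEP 3113), which is why the quoted line fails to parse. A minimal sketch of an equivalent expression that works on Python 3 follows; the function name `params_to_str` and the sample dictionary are hypothetical and not taken from experiments.py.

```python
def params_to_str(params):
    # Unpack each (key, value) pair in the loop instead of in a lambda signature.
    return ''.join('{}[{}]'.format(key, value) for key, value in params.items())


example = {'depth': 6, 'learning_rate': 0.03}
print(params_to_str(example))  # depth[6]learning_rate[0.03]
```

The explicit `str(value)` from the original line is dropped because `'{}'.format` already converts the value to its string form.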

What are recommended params?

I want to run benchmark (e.g. gpu_vs_cpu_training_speed/run_experiment_catboost.py) on my device.

What command line arguments should I use?

Questions about the pre-processing of categorical features

From the pdf, I found:
image

I want to ask:

  1. Is this processing used in all tools? If so, why is CatBoost still much better on the same dataset without the original categorical features? (I thought the advantage of CatBoost was its handling of categorical features.)
  2. What is the range of k for c_ij and d_ij?

Thanks !

'data' is numpy array of floating point numerical type, it means no categorical features, but 'cat_features' parameter specifies nonzero number of categorical features

Hi everyone, it's me again. I ran this code and got the error below:
pool = Pool(data, label, cat_features=cat_cols)

The error:
'data' is numpy array of floating point numerical type, it means no categorical features," _catboost.CatBoostError: 'data' is numpy array of floating point numerical type, it means no categorical features, but 'cat_features' parameter specifies nonzero number of categorical features

Does anyone know what is happening? I didn't change any of the code, so the error may be caused by my train and test files, but I don't know what structure those files should have.

Amazon/cd

Hi, could you provide the amazon/cd file? I have no idea what a cd file is, and it is needed to launch the code.
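
In CatBoost, "cd" normally refers to a column description file: a tab-separated text file in which each line gives a column index and a type such as Label or Categ, with unlisted columns treated as numeric. The contents of the actual amazon/cd file are not shown here, so the sketch below writes a purely hypothetical layout (label in column 0, categorical features in columns 1 to 3); the file name `train.cd` and the helper `write_column_description` are also assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical layout: column 0 is the label, columns 1-3 are categorical.
columns = [(0, 'Label'), (1, 'Categ'), (2, 'Categ'), (3, 'Categ')]


def write_column_description(path, columns):
    """Write one 'index<TAB>type' line per described column."""
    with open(path, 'w') as f:
        for index, kind in columns:
            f.write('{}\t{}\n'.format(index, kind))


cd_path = os.path.join(tempfile.mkdtemp(), 'train.cd')
write_column_description(cd_path, columns)

with open(cd_path) as f:
    print(f.read())
```

A file like this is typically passed alongside the data file when a dataset is loaded from disk, so the indices must match the real column order of the training file.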

Issue in the function computing NDCG

Hi,

I just found an issue in the following function:

def ndcg(y_pred, y_true, top):
   assert y_pred.shape[0] == y_true.shape[0]
   top = min(top, y_pred.shape[0])

   first_k_docs = sorted(zip(y_true, y_pred), key=cmp_to_key(doc_comparator))
   first_k_docs = np.array(first_k_docs)[:top,0]

   top_k_idxs = np.argsort(y_true)[::-1][:top]
   top_k_docs = y_true[top_k_idxs]

   dcg = cumulative_gain(first_k_docs)
   idcg = cumulative_gain(top_k_docs)

   return dcg / idcg if idcg > 0 else 1.

How can NDCG be 1 if idcg == 0? If idcg == 0, that query should simply be ignored. Returning 1 makes the averaged NDCG look higher than it should be.
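
One way to implement the suggestion above is to return no score for queries whose ideal DCG is zero and to leave them out of the average. The following is a pure-Python sketch under stated assumptions, not the repository's code: the gain formula (2^rel - 1) / log2(rank + 1), the tie-breaking order, and the names `dcg`, `ndcg_or_none`, and `mean_ndcg` are all hypothetical stand-ins for `cumulative_gain`, `doc_comparator`, and the surrounding evaluation loop.

```python
import math


def dcg(relevances):
    # Assumed gain and discount: (2^rel - 1) / log2(rank + 1), ranks starting at 1.
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg_or_none(y_pred, y_true, top):
    """Return NDCG@top, or None when the query has no relevant documents."""
    assert len(y_pred) == len(y_true)
    top = min(top, len(y_pred))
    # Order true labels by predicted score, highest first (ties keep input order).
    pairs = sorted(zip(y_pred, y_true), key=lambda p: p[0], reverse=True)
    ranked = [label for _, label in pairs][:top]
    ideal = sorted(y_true, reverse=True)[:top]
    idcg = dcg(ideal)
    return dcg(ranked) / idcg if idcg > 0 else None


def mean_ndcg(queries, top):
    """Average NDCG over queries, skipping those where it is undefined."""
    scores = [ndcg_or_none(pred, true, top) for pred, true in queries]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else float('nan')


queries = [
    ([0.1, 0.9], [1, 0]),  # relevant document ranked second
    ([0.5, 0.4], [0, 0]),  # no relevant documents: skipped, not scored as 1
]
print(mean_ndcg(queries, top=2))
```

With the convention in the quoted function, the second query would contribute a score of 1 and pull the average up; here it does not contribute at all.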

Best,

Ruocheng Guo
