zhiningliu1998 / self-paced-ensemble

ICDE'20 | A general & effective ensemble framework for imbalanced classification. | A general, efficient, and robust framework for class-imbalanced learning

Home Page: https://arxiv.org/abs/1909.03500v3

License: MIT License

Python 100.00%

imbalance-classification ensemble-learning imbalanced-learning imbalanced-data ensemble-methods machine-learning python3 ensemble ensemble-model classification

self-paced-ensemble's Introduction

Self-paced Ensemble for Highly Imbalanced Massive Data Classification (ICDE 2020)

Links: Paper | Slides | Video | arXiv | PyPI | API Reference | Related Projects | Zhihu/知乎

Self-paced Ensemble (SPE) is an ensemble learning framework for classification on massive, highly imbalanced data. It is an easy-to-use solution to class-imbalance problems that features outstanding computing efficiency, good performance, and wide compatibility with different learning models. This SPE implementation supports multi-class classification.

Note: SPE is now a part of imbalanced-ensemble [Doc, PyPI]. Try it for more methods and advanced features!

Cite Us

If you find this repository helpful in your work or research, we would greatly appreciate citations to the following paper:

@inproceedings{liu2020self-paced-ensemble,
    title={Self-paced Ensemble for Highly Imbalanced Massive Data Classification},
    author={Liu, Zhining and Cao, Wei and Gao, Zhifeng and Bian, Jiang and Chen, Hechang and Chang, Yi and Liu, Tie-Yan},
    booktitle={2020 IEEE 36th International Conference on Data Engineering (ICDE)},
    pages={841--852},
    year={2020},
    organization={IEEE}
}

Installation

It is recommended to use pip for installation.
Please make sure the latest version is installed to avoid potential problems:

$ pip install self-paced-ensemble            # normal install
$ pip install --upgrade self-paced-ensemble  # update if needed

Or you can install SPE by cloning this repository:

$ git clone https://github.com/ZhiningLiu1998/self-paced-ensemble.git
$ cd self-paced-ensemble
$ python setup.py install

The following dependencies are required:

python (>=3.6)
numpy (>=1.13.3)
scipy (>=0.19.1)
joblib (>=0.11)
scikit-learn (>=0.24)
imblearn (>=0.7.0)
imbalanced-ensemble (>=0.1.3)

Table of Contents

Cite Us
Installation
Background
Documentation
Examples
Results
Miscellaneous
References
Related Projects
Contributors ✨

Background

SPE performs strictly balanced under-sampling in each iteration and is therefore very computationally efficient. In addition, SPE does not rely on calculating the distance between samples to perform resampling, so it can be applied without modification to datasets that lack well-defined distance metrics (e.g., with categorical features or missing values). Moreover, as a generic ensemble framework, SPE can be easily adapted to most existing learning methods (e.g., C4.5, SVM, GBDT, and neural networks) to boost their performance on imbalanced data. Compared to existing imbalance learning methods, SPE works particularly well on datasets that are large-scale, noisy, and highly imbalanced (e.g., with an imbalance ratio greater than 100:1). Such data are common in real-world industrial applications. The figure below gives an overview of the SPE framework.
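To make the idea of strictly balanced, distance-free under-sampling concrete, here is a minimal sketch. It is not the SPE sampler: SPE chooses majority samples with the help of a hardness function, whereas this illustration picks them uniformly at random. The helper name `balanced_undersample` and the toy data are assumptions made for this example only.

```python
import numpy as np

def balanced_undersample(X, y, majority_label, minority_label, rng):
    """Keep every minority sample plus an equal number of randomly
    chosen majority samples. No distances between samples are computed."""
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y == minority_label)
    # Draw exactly as many majority samples as there are minority samples.
    picked = rng.choice(maj_idx, size=min_idx.size, replace=False)
    keep = np.concatenate([picked, min_idx])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(1010, 3))
y = np.array([0] * 1000 + [1] * 10)  # imbalance ratio of 100:1

X_bal, y_bal = balanced_undersample(X, y, majority_label=0, minority_label=1, rng=rng)
print(np.bincount(y_bal))  # [10 10]
```

Each base learner then trains on only 20 samples instead of 1010, which is where the efficiency of balanced under-sampling comes from; an ensemble compensates for the information discarded in any single subset.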

Documentation

Our SPE implementation can be used much in the same way as the sklearn.ensemble classifiers. Detailed documentation of SelfPacedEnsembleClassifier can be found HERE.

Examples

Check out the examples using SPE for more comprehensive usage.

API demo

from self_paced_ensemble import SelfPacedEnsembleClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Prepare class-imbalanced train & test data
X, y = make_classification(n_classes=2, random_state=42, weights=[0.1, 0.9])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

# Train an SPE classifier
clf = SelfPacedEnsembleClassifier(
    base_estimator=DecisionTreeClassifier(),
    n_estimators=10,
).fit(X_train, y_train)

# Predict with an SPE classifier
y_pred = clf.predict(X_test)

Advanced usage example

Please see usage_example.ipynb.

Save & Load model

We recommend using joblib or pickle to save and load SPE models, e.g.:

from joblib import dump, load

# save the model
dump(clf, filename='clf.joblib')
# load the model
clf = load('clf.joblib')

You can also use the alternative APIs provided in SPE:

from self_paced_ensemble.utils import save_model, load_model

# save the model
clf.save('clf.joblib')        # option 1
save_model(clf, 'clf.joblib') # option 2
# load the model
clf = load_model('clf.joblib')

Compare SPE with other methods

Please see comparison_example.ipynb.

Results

Dataset links: Credit Fraud, KDDCUP, Record Linkage, Payment Simulation.

Comparisons of SPE with traditional resampling/ensemble methods in terms of performance & computational efficiency.

Miscellaneous

This repository contains:

Implementation of Self-paced Ensemble
Implementation of 5 ensemble-based imbalance learning baselines
- SMOTEBoost [1]
- SMOTEBagging [2]
- RUSBoost [3]
- UnderBagging [4]
- BalanceCascade [5]
Implementation of resampling-based imbalance learning baselines [6]
Additional experimental results

NOTE: The implementations of other ensemble and resampling methods are based on imbalanced-ensemble and imbalanced-learn.

References

#	Reference
[1]	N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, “SMOTEBoost: Improving prediction of the minority class in boosting,” in European Conference on Principles of Data Mining and Knowledge Discovery. Springer, 2003, pp. 107–119.
[2]	S. Wang and X. Yao, “Diversity analysis on imbalanced data sets by using ensemble models,” in 2009 IEEE Symposium on Computational Intelligence and Data Mining. IEEE, 2009, pp. 324–331.
[3]	C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, “RUSBoost: A hybrid approach to alleviating class imbalance,” IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 40, no. 1, pp. 185–197, 2010.
[4]	R. Barandela, R. M. Valdovinos, and J. S. Sánchez, “New applications of ensembles of classifiers,” Pattern Analysis & Applications, vol. 6, no. 3, pp. 245–256, 2003.
[5]	X.-Y. Liu, J. Wu, and Z.-H. Zhou, “Exploratory undersampling for class-imbalance learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 539–550, 2009.
[6]	G. Lemaître, F. Nogueira, and C. K. Aridas, “Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning,” Journal of Machine Learning Research, vol. 18, no. 17, pp. 1–5, 2017.

Related Projects

Check out Zhining's other open-source projects!

Imbalanced-Ensemble [PythonLib]
Imbalanced Learning [Awesome]
Machine Learning [Awesome]
Meta-Sampler [NeurIPS]

Contributors ✨

Thanks go to these wonderful people (emoji key):

Zhining Liu 💻 📖 💡
Yuming Fu 💻 🐛
Thúlio Costa 💻 🐛
Neko Null 🚧
lirenjieArthur 🐛
AC手动机 🐛
Carlo Moro 🤔

This project follows the all-contributors specification. Contributions of any kind welcome!


self-paced-ensemble's Issues

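How should the hardness function be designed to get the most improvement?

On my own machine I tried several common tree-based machine learning models on a class-imbalanced binary classification dataset. Once n_estimators reaches a certain size there is no further improvement, and only changing the hardness function makes any difference. But no matter how I change the parameters, every metric is still lower than when training directly with LGBM, which puzzles me.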

[typo] slide link in README is broken

Perhaps this is the correct one: https://zhiningliu.com/files/ICDE_2020_SPE_slides.pdf

cannot import name 'if_delegate_has_method'

if_delegate_has_method has been removed from metaestimators.py

...
File C:/Users/nuran/hello/.venv/lib/site-packages/imbalanced_ensemble/pipeline.py
---> 21 from sklearn.utils.metaestimators import if_delegate_has_method
     22 from sklearn.utils.validation import check_memory
     24 __all__ = ["Pipeline", "make_pipeline"]

ImportError: cannot import name 'if_delegate_has_method' from 'sklearn.utils.metaestimators' (c:\Users\nuran\hello\.venv\lib\site-packages\sklearn\utils\metaestimators.py)

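An excellent model and a great piece of engineering! But how do I save and load the model?

I tried SPE on my own dataset and some of the results are excellent! Thank you for sharing. I have one question: how do I save and load a model that has already been fitted?
Thank you!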

How to use GBDT/XGBOOST in SPE?

valueError

I'm really sorry to bother you. I have read your code and article, and I have some problems.

1. When I used my own dataset, your code raised an error:
“ValueError: The target 'y' needs to have more than 1 class. Got 1 class instead.”
The method used is 'SMOTEBagging', with n_min=5 and n_maj=219 in my dataset.

2. The methods you compare against are all older ones. Have you tried comparing with neural network methods such as focal loss?

Looking forward to your reply. Thank you

TypeError: 'numpy.ndarray' object is not callable

I tried to run the following example, but got a TypeError exception!

https://github.com/ZhiningLiu1998/self-paced-ensemble#examples

TypeError: 'numpy.ndarray' object is not callable

ImportError: cannot import name 'if_delegate_has_method' from 'sklearn.utils.metaestimators'

ImportError: cannot import name 'if_delegate_has_method' from 'sklearn.utils.metaestimators' (C:\Users\user\AppData\Local\miniconda3\envs\crashes2\Lib\site-packages\sklearn\utils\metaestimators.py)

bug at self_paced_ensemble.py:318

Hello, thanks for your excellent work.
But when I used this package as follows:
spe_clf.fit(X_train, y_train, label_maj=0, label_min=3)
I got this error:
RuntimeWarning: The specified minority class 3 has no data samples, please check usage.

I think there is a mistake at self_paced_ensemble.py:318, where y is replaced by the return value of np.unique(y, return_inverse=True). The new value of y is "the indices to reconstruct the original array from the unique array", so every entry is either 0 or 1.
As a result, at line 334, maj_index, min_index = (y==label_maj), (y==label_min), no element satisfies y==3, so min_index.sum()==0 and _n_samples_min==0.

Perhaps line 318 can just be changed to self.classes_ = np.unique(y)?
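The behaviour described in this report can be reproduced with NumPy alone. The sketch below does not copy the library's code; the label values and variable names are assumptions chosen to mirror the report. It shows why comparing re-encoded labels against the original label value matches nothing, and one possible way to translate the label first.

```python
import numpy as np

y = np.array([0, 0, 3, 0, 3, 0])

# return_inverse=True yields indices into `classes`, not the original labels.
classes, y_encoded = np.unique(y, return_inverse=True)
print(classes)    # [0 3]
print(y_encoded)  # [0 0 1 0 1 0]

# The original label 3 no longer appears in the re-encoded array.
print((y_encoded == 3).sum())  # 0

# Translating the label to its encoded index recovers the minority samples.
label_min = 3
encoded_min = np.searchsorted(classes, label_min)
print((y_encoded == encoded_min).sum())  # 2
```

Either keeping the original `y` (as the report suggests) or mapping `label_maj` and `label_min` through `classes` in this way would keep the comparison consistent.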

Does it only support two categories?

Import error under scikit-learn>=1.3.0

utils.metaestimators.if_delegate_has_method is deprecated and will be removed in version 1.3. Please use utils.metaestimators.available_if instead.

Please fix the dependency file or update your package.

Hope you are well.

Using the estimator_params parameter in gridsearchcv

How can we use the estimator_params parameter in general? It says it needs to be a list of str, but that's not working.

I would appreciate it if you could explain how to use it. Thanks.
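One possible approach, offered as a sketch rather than as the maintainers' answer: in scikit-learn ensembles, `estimator_params` lists the names of attributes copied onto each base estimator, so it is not where candidate values go. Base-estimator hyperparameters are usually tuned through nested parameter names of the form `<parameter>__<sub-parameter>`. The example uses `BaggingClassifier` as a stand-in, on the assumption that the SPE classifier follows the same scikit-learn conventions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# Stand-in ensemble (assumption): swap in the SPE classifier if it is installed.
ensemble = BaggingClassifier(DecisionTreeClassifier(), n_estimators=5, random_state=0)

# Look the nested name up instead of hard-coding it, because the prefix
# ("estimator__" or "base_estimator__") depends on the library version.
depth_key = next(k for k in ensemble.get_params() if k.endswith("__max_depth"))

search = GridSearchCV(ensemble, {depth_key: [2, 4]}, scoring="average_precision", cv=3)
search.fit(X, y)
print(search.best_params_)
```

Calling `get_params()` on any scikit-learn-compatible estimator lists every name that `GridSearchCV` will accept in its parameter grid.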

[ENH] keep sync with imbalanced_ensemble implementation

Keep in sync with the imbalanced_ensemble implementation of SelfPacedEnsembleClassifier.

Adding Contributors

Extract feature importances

Hello, is it possible to extract feature importances of the ensemble ?

As in scikit, we would use something like

model.feature_importances_

or shap values
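A common workaround for ensembles that lack a built-in importance attribute is to average the importances of the fitted base estimators. The sketch below uses scikit-learn's `BaggingClassifier` as a stand-in; it assumes that a fitted SPE classifier exposes its base models through `estimators_` in the same way and that the base estimator provides `feature_importances_`. Neither assumption is confirmed here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0)

# Stand-in ensemble (assumption): swap in the SPE classifier if it is installed.
ensemble = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)
ensemble.fit(X, y)

# Average the per-tree importances across all fitted base estimators.
importances = np.mean([est.feature_importances_ for est in ensemble.estimators_], axis=0)
print(importances.shape)  # (8,)
```

This only works when every base estimator sees all features in the same order; with feature subsampling, each estimator's importances would first have to be mapped back to the original columns.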

[Install Failed] pip install failed

ERROR: Could not find a version that satisfies the requirement pandas>=1.1.3scipy>=0.19.1 (from self-paced-ensemble) (from versions: 0.1, 0.2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.21.1, 0.22.0, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.25.2, 0.25.3)
ERROR: No matching distribution found for pandas>=1.1.3scipy>=0.19.1

Supports the generation of PMML files

Hi, IMBENS is a great machine learning library, but I have a problem: I want to deploy the SPE algorithm as an online service in a production environment. How can I generate PMML files successfully? I have tried Nyoka and sklearn2pmml so far, but both failed.
