Open Source Contributions in Learning
from Imbalanced and Overlapped Data

Code availability is crucial for the reproducibility of results. Well-established methods in the field of Imbalanced Learning are available in several open-source implementations. Some of the most popular are the KEEL Software Tool and the WEKA workbench, alongside various R and Python packages. This repository further identifies existing resources (code and/or data) related to the joint study of Class Imbalance, Class Overlap and Data Complexity.

Dataset Benchmarking and Data Generation

Benchmark Datasets:

Well-known data repositories include:

UCI Machine Learning Repository [Datasets]
Kaggle Repository [Datasets]
OpenML [Datasets]
KEEL Dataset Repository [Datasets]

Regarding specific data characteristics, KEEL is perhaps the most popular repository. It provides a collection of standard datasets as well as datasets targeted at imbalanced learning, noisy and borderline example detection, and singular problems (multi-instance and multi-label datasets) [Imbalanced Datasets].
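When selecting benchmark datasets for imbalanced learning, a common first descriptor is the imbalance ratio (majority class size divided by minority class size). The snippet below is a minimal sketch of that computation; the function name and the toy labels are illustrative assumptions and are not taken from any of the repositories above.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-to-minority class size ratio of a label sequence."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return max(counts.values()) / min(counts.values())


# Hypothetical toy labels: 90 majority examples and 10 minority examples.
labels = ["neg"] * 90 + ["pos"] * 10
print(imbalance_ratio(labels))  # 9.0
```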

Data Generation and Visualisation:

Data Generation: Within the scope of artificially generated data, we recommend the data generator used in [Wojciechowski2017]. A comprehensive description of the generator may be found in this repository.
Instance Space Analysis (ISA): Regarding the characterisation of datasets included in well-known repositories, exploring MATILDA (Melbourne Algorithm Test Instance Library with Data Analytics) is an interesting direction. It allows the visualisation of the distribution and diversity of existing benchmark and real-world instances, and the generation of new synthetic test instances at specific locations of the instance space (e.g. real-world-like instances, or instances with controllable properties) [Code][Munoz2018]. Another recent tool for ISA is PyHard, which assesses the complexity of individual examples within a dataset [Code][Paiva2021].
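For quick experiments, artificial data with a controllable degree of imbalance and overlap can also be sketched in a few lines. The example below is only an illustration under simple assumptions (two 2-D Gaussian classes, with a hypothetical `separation` parameter controlling overlap); it is not the generator of [Wojciechowski2017] and does not reproduce its difficulty factors.

```python
import random


def make_overlapped_imbalanced(n_majority=200, n_minority=20, separation=1.0, seed=0):
    """Sample two 2-D Gaussian classes; a smaller `separation` means more overlap."""
    rng = random.Random(seed)
    X, y = [], []
    # Majority class (label 0) centred at the origin.
    for _ in range(n_majority):
        X.append((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))
        y.append(0)
    # Minority class (label 1) centred at (separation, separation).
    for _ in range(n_minority):
        X.append((rng.gauss(separation, 1.0), rng.gauss(separation, 1.0)))
        y.append(1)
    return X, y


X, y = make_overlapped_imbalanced(separation=0.5)
print(len(X), sum(y))  # 220 examples in total, 20 of them in the minority class
```

Varying `n_minority` changes the imbalance ratio, while varying `separation` changes how strongly the two classes overlap.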

Class Overlap-Based Approaches:

Class Overlap-Based Approaches aim to address class imbalance and overlap simultaneously [Santos2021a]. Although they often derive from distribution-based approaches to some extent (frequently focused on undersampling or oversampling), their inner operations are more attentive to class overlap. For a comprehensive review of class overlap-based approaches, please refer to [Santos2021a] and [Vuttipittayamongkol2021].

The following approaches include open-source implementations:

OBU: Overlap-based Undersampling [Code][Vuttipittayamongkol2018].
BoostOBU: Improved Overlap-based Undersampling [Code][Vuttipittayamongkol2020a].
NB-Basic, NB-Tomek, NB-Comm, NB-Rec: Neighbourhood-based Undersampling [Code][Vuttipittayamongkol2020b].
A-SUWO: Adaptive semi-unsupervised weighted oversampling [Code][Nekooeimehr2016]
PAIO: Position characteristic-Aware Interpolation Oversampling [Code][Zhu2020]
CCR: Combined Cleaning and Resampling algorithm [Code][Koziarski2017]
G-SMOTE: Geometric SMOTE [Code][Douzas2019]
EFIS-MOEA: Ensemble with Feature and Instance Selection with Multi-Objective Evolutionary Algorithm [Code][Fernandez2017]
SPE: Self-paced Ensemble for Highly Imbalanced Massive Data Classification [Code][Liu2020]
MOSNS and MOSS: Minimising Overlapping Selection under No-Sampling and Minimising Overlapping Selection under SMOTE [Code][Fu2020]
ImGrid: Imbalanced Grid Clusterer [Code][Lango2017]
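To give a flavour of how an overlap-aware undersampler differs from random undersampling, the sketch below removes only those majority examples that lie in the immediate neighbourhood of a minority example. It is loosely inspired by the neighbourhood-based idea and is not the implementation of any method listed above; the function name, the parameter `k` and the toy data are assumptions made for illustration.

```python
import math


def neighbourhood_undersample(X, y, minority_label, k=3):
    """Drop majority examples found among the k nearest neighbours of any minority example."""
    to_remove = set()
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # Rank all other examples by Euclidean distance to this minority example.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        # Only majority neighbours are marked for removal.
        to_remove.update(j for j in neighbours if y[j] != minority_label)
    keep = [j for j in range(len(X)) if j not in to_remove]
    return [X[j] for j in keep], [y[j] for j in keep]


# Hypothetical toy data: one minority point with two close majority points.
X = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (6.0, 6.0)]
y = [1, 0, 0, 0, 0]
X_res, y_res = neighbourhood_undersample(X, y, minority_label=1, k=2)
print(X_res)  # [(0.0, 0.0), (5.0, 5.0), (6.0, 6.0)]
```

Majority examples far from the minority class are left untouched, so the class boundary is cleaned without discarding information elsewhere.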

Data Complexity Measures

Data Complexity Measures (DCM) are commonly used to characterise the difficulty of classification tasks through the analysis of certain data characteristics. Existing open-source implementations of complexity measures include:

DCoL: Data Complexity Library (C++) [Code][Orriols-Puig2010]
ECoL: Extended Complexity Library (R) [Code][Lorena2019]
ImbCoL: Data Complexity Measures for Imbalanced Classification (R) [Code][Barella2018]
SCoL: Simulated Complexity Library (R) [Code][Garcia2020]
mfe: Meta-Feature Extractor (R) [Code][Alcobaca2020]
pymfe: Python Meta-Feature Extractor (Python) [Code][Alcobaca2020]
Metalearn: Library of Meta-Learning Tools (Python) [Code]
PyHard: Instance Hardness Python package (Python) [Code][Smith2014]
pycol: Python Class Overlap Library [Code]
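As a concrete example of what a complexity measure captures, the sketch below computes the largest per-feature Fisher discriminant ratio for a two-class dataset: the squared difference of class means divided by the sum of class variances. Larger values suggest that at least one feature separates the classes well. This is a simplified illustration, not the API of any library listed above; those libraries may normalise or transform the value differently, and the function name is an assumption.

```python
from statistics import mean, pvariance


def max_fisher_ratio(X, y):
    """Return the largest per-feature Fisher discriminant ratio for a two-class dataset."""
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    best = 0.0
    for f in range(len(X[0])):
        a = [x[f] for x, label in zip(X, y) if label == classes[0]]
        b = [x[f] for x, label in zip(X, y) if label == classes[1]]
        spread = pvariance(a) + pvariance(b)
        if spread == 0:
            continue  # simplification: features constant within both classes are skipped
        best = max(best, (mean(a) - mean(b)) ** 2 / spread)
    return best


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [(0.0, 1.0), (2.0, 1.5), (10.0, 1.0), (12.0, 1.5)]
y = [0, 0, 1, 1]
print(max_fisher_ratio(X, y))  # 50.0
```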

Imbalanced Data Learning Software

unbalanced: Racing for Unbalanced Methods Selection (R) [Code]
smotefamily: A Collection of Oversampling Techniques for Class Imbalance Problem Based on SMOTE (R) [Code]
ROSE: Random Over-Sampling Examples (R) [Code]
imbalance: Preprocessing Algorithms for Imbalanced Datasets (R) [Code]
imbalanced-learn: Resampling techniques for Imbalanced Data (Python) [Code]
smote_variants: SMOTE variants for Imbalanced Learning (Python) [Code][Kovacs2019]
multi-imbalance: Python library for Multi-Class Imbalanced Classification (Python) [Code][Grycza2020]
Multi_Imbalance: An open-source software for multi-class imbalance learning [Code][Zhang2019]
cluster-over-sampling: Clustering-based Oversampling Algorithms (Python) [Code][Douzas2017][Douzas2018]
undersampling: A Scala library for undersampling in Imbalanced Classification (Scala) [Code]

References:

[Santos2021a] M. S. Santos, P. H. Abreu, N. Japkowicz, A. Fernández, C. Soares, S. Wilk, J. Santos, On the joint-effect of Class Imbalance and Overlap: A Critical Review, 2021.

[Santos2021b] M. S. Santos, P. H. Abreu, N. Japkowicz, A. Fernández, J. Santos, A Unifying View of Class Imbalance and Overlap Key Concepts, Panorama and Open Avenues for Research, 2021.

[Vuttipittayamongkol2021] P. Vuttipittayamongkol, E. Elyan, A. Petrovski, On the class overlap problem in imbalanced data classification, Knowledge-based systems, 2021. [Link]

[Vuttipittayamongkol2018] P. Vuttipittayamongkol, E. Elyan, A. Petrovski, C. Jayne, Overlap-based undersampling for improving imbalanced data classification, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2018, pp. 689–697. [Link]

[Vuttipittayamongkol2020a] P. Vuttipittayamongkol, E. Elyan, Improved overlap-based undersampling for imbalanced dataset classification with application to epilepsy and parkinson’s disease, International journal of neural systems, 30(8), 2020. [Link]

[Vuttipittayamongkol2020b] P. Vuttipittayamongkol, E. Elyan, Neighbourhood-based undersampling approach for handling imbalanced and overlapped data, Information Sciences 509 (2020) 47–70. [Link]

[Nekooeimehr2016] I. Nekooeimehr, S. K. Lai-Yuen, Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets, Expert Systems with Applications 46 (2016) 405–416. [Link]

[Zhu2020] T. Zhu, Y. Lin, Y. Liu, Improving interpolation-based oversampling for imbalanced data learning, Knowledge-Based Systems 187 (2020). [Link]

[Koziarski2017] M. Koziarski, M. Wozniak, CCR: A combined cleaning and resampling algorithm for imbalanced data classification, International Journal of Applied Mathematics and Computer Science 27 (4) (2017) 727–736. [Link]

[Douzas2019] G. Douzas, F. Bacao, Geometric smote a geometrically enhanced drop-in replacement for smote, Information sciences 501 (2019) 118–135. [Link]

[Fernandez2017] A. Fernández, C. J. Carmona, M. Jose del Jesus, F. Herrera, A pareto-based ensemble with feature and instance selection for learning from multi-class imbalanced datasets, International Journal of neural systems, 27(6), 2017. [Link]

[Liu2020] Z. Liu, W. Cao, Z. Gao, J. Bian, H. Chen, Y. Chang, T. Y. Liu, Self-paced ensemble for highly imbalanced massive data classification, in: 36th IEEE International Conference on Data Engineering (ICDE), 2020, pp. 841–852. [Link]

[Fu2020] G. H. Fu, Y. J. Wu, M. J. Zong, L. Z. Yi, Feature selection and classification by minimizing overlap degree for class-imbalanced data in metabolomics, Chemometrics and Intelligent Laboratory Systems 196, 2020. [Link]

[Lango2017] M. Lango, D. Brzezinski, S. Firlik, J. Stefanowski, Discovering Minority Sub-clusters and Local Difficulty Factors from Imbalanced Data, International Conference on Discovery Science. Springer, Cham, 2017. [Link]

[Wojciechowski2017] S. Wojciechowski, S. Wilk, Difficulty factors and preprocessing in imbalanced data sets: an experimental study on artificial data, Foundations of Computing and Decision Sciences 42 (2) (2017) 149–176. [Link]

[Munoz2018] M. A. Muñoz, L. Villanova, D. Baatar, K. Smith-Miles, Instance spaces for machine learning classification, Machine Learning 107 (1) (2018) 109–147. [Link]

[Paiva2021] P. Y. A. Paiva, K. Smith-Miles, M. G. Valeriano, A. C. Lorena, PyHard: a novel tool for generating hardness embeddings to support data-centric analysis, arXiv preprint arXiv:2109.14430. [Link]

[Orriols-Puig2010] A. Orriols-Puig, N. Macia, T. K. Ho, Documentation for the data complexity library in c++, Universitat Ramon Llull, La Salle 196 (2010) 1–40. [Link]

[Lorena2019] A. C. Lorena, L. P. Garcia, J. Lehmann, M. C. Souto, T. K. Ho, How complex is your classification problem? a survey on measuring classification complexity, ACM Computing Surveys (CSUR) 52 (5) (2019) 1–34. [Link]

[Barella2018] V. H. Barella, L. P. Garcia, M. P. de Souto, A. C. Lorena, A. de Carvalho, Data complexity measures for imbalanced classification tasks, in: 2018 International Joint Conference on Neural Networks (IJCNN), IEEE, 2018, pp. 1–8. [Link]

[Garcia2020] L. P. Garcia, A. Rivolli, E. Alcobaça, A. C. Lorena, A. C. de Carvalho, Boosting meta-learning with simulated data complexity measures, Intelligent Data Analysis 24 (5) (2020) 1011–1028. [Link]

[Alcobaca2020] E. Alcobaça, F. Siqueira, A. Rivolli, L. P. F. Garcia, J. T. Oliva, A. C. P. L. F. de Carvalho, Mfe: Towards reproducible meta-feature extraction, Journal of Machine Learning Research 21 (111) (2020) 1–5. [Link]

[Smith2014] M. R. Smith, T. Martinez, C. Giraud-Carrier, An instance level analysis of data complexity, Machine learning 95 (2) (2014) 225–256. [Link]

[Kovacs2019] G. Kovács, An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets, Applied Soft Computing 83 (2019). [Link]

[Grycza2020] J. Grycza, D. Horna, H. Klimczak, K. Pluciński, Multi-imbalance: Python package for multi-class imbalance learning, Poznan University of Technology, Poland, 2020. [Link]

[Douzas2017] G. Douzas, F. Bacao, Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning, Expert systems with Applications 82 (2017): 40-52. [Link]

[Douzas2018] G. Douzas, F. Bacao, F. Last, Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE, Information Sciences 465 (2018): 1-20. [Link]

[Zhang2019] C. Zhang, J. Bi, S. Xu, E. Ramentol, G. Fan, B. Qiao, H. Fujita, Multi-imbalance: An open-source software for multi-class imbalance learning. Knowledge-Based Systems, 174, 137-143, 2019. [Link]
