
miriamspsantos / open-source-imbalance-overlap


A collection of Open Source Contributions in Learning from Imbalanced and Overlapped Data

imbalanced-data imbalanced-learning imbalanced-classification class-overlap supervised-learning machine-learning open-source reproducible-science resources datasets


Open Source Contributions in Learning from Imbalanced and Overlapped Data

Code availability is crucial for the reproducibility of results. Well-established methods in the field of Imbalanced Learning are commonly available in several open-source implementations. Some of the most popular are the KEEL Software Tool and the WEKA workbench, alongside various R and Python packages. This repository further identifies existing resources (code and/or data) related to the joint study of Class Imbalance, Class Overlap and Data Complexity.

Dataset Benchmarking and Data Generation

Benchmark Datasets:

Well-known data repositories include:

Regarding specific data characteristics, KEEL is perhaps the most popular repository. It provides a collection of standard datasets as well as datasets targeted at imbalanced learning, noisy and borderline example detection, and singular problems (multi-instance and multi-label datasets) [Imbalanced Datasets].

Data Generation and Visualisation:

  • Data Generation: Within the scope of artificially generated data, we recommend the data generator used in [Wojciechowski2017]. A comprehensive description of the generator may be found in this repository.

  • Instance Space Analysis (ISA): To characterise the datasets contained in well-known repositories, MATILDA (Melbourne Algorithm Test Instance Library with Data Analytics) is worth exploring. It allows the visualisation of the distribution and diversity of existing benchmark and real-world instances, and the generation of new synthetic test instances at specific locations of the instance space (e.g. real-world-like instances, or instances with controllable properties) [Code][Munoz2018]. Another recent tool for ISA is PyHard, which assesses the complexity of individual examples within a dataset [Code][Paiva2021].
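To illustrate what "artificially generated data with controllable properties" can mean in practice, here is a minimal sketch of a two-class generator. It is not the generator of [Wojciechowski2017] nor anything produced by MATILDA; the function name and the `imbalance_ratio` and `separation` parameters are hypothetical and chosen only for illustration. It draws both classes from unit-variance Gaussians, so a smaller `separation` yields more class overlap.

```python
import random


def make_imbalanced_overlap(n_samples=1000, imbalance_ratio=9.0,
                            separation=1.0, seed=0):
    """Draw a 2-D, two-class sample from unit-variance Gaussians.

    imbalance_ratio: majority size divided by minority size (assumed name).
    separation: distance between the class means along the first feature;
                smaller values produce more overlap (assumed name).
    """
    rng = random.Random(seed)
    n_min = max(1, round(n_samples / (1.0 + imbalance_ratio)))
    n_maj = n_samples - n_min
    X, y = [], []
    # Class 0 is the majority (centred at 0), class 1 the minority.
    for label, n, centre in ((0, n_maj, 0.0), (1, n_min, separation)):
        for _ in range(n):
            X.append((rng.gauss(centre, 1.0), rng.gauss(0.0, 1.0)))
            y.append(label)
    return X, y


X, y = make_imbalanced_overlap(n_samples=1000, imbalance_ratio=9.0,
                               separation=1.0)
print(len(X), sum(y))  # 1000 examples, 100 of them in the minority class
```

Varying `imbalance_ratio` and `separation` independently gives a simple way to study the two difficulty factors separately and jointly.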

Class Overlap-Based Approaches:

Class Overlap-Based Approaches aim to address class imbalance and overlap simultaneously [Santos2021a]. Although they often derive to some extent from distribution-based approaches (frequently focused on undersampling or oversampling), their inner operations pay closer attention to class overlap. For a comprehensive review of class overlap-based approaches, please refer to [Santos2021a] and [Vuttipittayamongkol2021].
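As a rough illustration of how resampling can be made "attentive to class overlap", the sketch below removes only those majority examples that lie among the k nearest neighbours of some minority example, rather than removing majority examples at random. This is a simplified, hypothetical procedure written for this README; it does not reproduce any of the cited methods, and the function name and parameters are assumptions.

```python
import math


def overlap_undersample(X, y, minority=1, k=3):
    """Drop majority examples found among the k nearest neighbours
    of any minority example (Euclidean distance)."""
    to_drop = set()
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        # Rank every other example by its distance to this minority example.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        # Only majority neighbours are candidates for removal.
        to_drop.update(j for j in neighbours if y[j] != minority)
    keep = [j for j in range(len(X)) if j not in to_drop]
    return [X[j] for j in keep], [y[j] for j in keep]


# Toy data: one minority point sitting inside a small majority cluster,
# plus two majority points far away from it.
X = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (0.15, 0.05)]
y = [0, 0, 0, 0, 0, 1]
X_res, y_res = overlap_undersample(X, y, minority=1, k=2)
print(y_res)  # [0, 0, 0, 1]
```

Majority examples far from the minority class are left untouched, which is the main difference from random undersampling.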

The following approaches include open-source implementations:

Data Complexity Measures

Data Complexity Measures (DCM) are commonly used to characterise the difficulty of classification tasks through the analysis of certain data characteristics. Existing open-source implementations of complexity measures include:
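To make the idea of a complexity measure concrete, the following sketch computes the classic two-class form of Fisher's discriminant ratio, taking the maximum over features of (difference of class means)² / (sum of class variances). It is a minimal illustration rather than the implementation of any particular library; some libraries report a transformed variant of this quantity, so values may not be directly comparable. The function name is an assumption.

```python
from statistics import mean, pvariance


def fisher_ratio(X, y):
    """Maximum Fisher discriminant ratio over features (two classes only).

    Higher values suggest that at least one feature separates the classes
    well, i.e. lower overlap along that feature.
    """
    a, b = sorted(set(y))  # raises ValueError unless exactly two classes
    ratios = []
    for f in range(len(X[0])):
        col_a = [x[f] for x, label in zip(X, y) if label == a]
        col_b = [x[f] for x, label in zip(X, y) if label == b]
        denom = pvariance(col_a) + pvariance(col_b)
        # Simplification: features with zero variance in both classes are skipped.
        if denom > 0:
            ratios.append((mean(col_a) - mean(col_b)) ** 2 / denom)
    return max(ratios, default=0.0)


X = [(0.0, 1.0), (2.0, 1.5), (10.0, 1.2), (12.0, 0.8)]
y = [0, 0, 1, 1]
print(fisher_ratio(X, y))  # 50.0, driven by the first feature
```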

Imbalanced Data Learning Software

  • unbalanced: Racing for Unbalanced Methods Selection (R) [Code]

  • smotefamily: A Collection of Oversampling Techniques for Class Imbalance Problem Based on SMOTE (R) [Code]

  • ROSE: Random Over-Sampling Examples (R) [Code]

  • imbalance: Preprocessing Algorithms for Imbalanced Datasets (R) [Code]

  • imbalanced-learn: Resampling techniques for Imbalanced Data (Python) [Code]

  • smote_variants: SMOTE variants for Imbalanced Learning (Python) [Code][Kovacs2019]

  • multi-imbalance: Python library for Multi-Class Imbalanced Classification (Python) [Code][Grycza2020]

  • Multi_Imbalance: An open-source software for multi-class imbalance learning [Code][Zhang2019]

  • cluster-over-sampling: Clustering based Oversampling Algorithms (Python) [Code][Douzas2017][Douzas2018]

  • undersampling: A Scala library for undersampling in Imbalanced Classification (Scala) [Code]
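Several of the packages above build on SMOTE-style oversampling, which creates synthetic minority examples by interpolating between a minority example and one of its nearest minority neighbours. The sketch below shows that core step using only the Python standard library. It is a simplified illustration under stated assumptions (the function name, the default `k`, and the tuple-based data layout are hypothetical), not the code of any listed package; for real work, the tested implementations in those packages should be preferred.

```python
import math
import random


def smote_like(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by linear interpolation.

    X_min must contain at least two minority examples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(X_min))
        base = X_min[i]
        # k nearest minority neighbours of the chosen point (excluding itself).
        neighbours = sorted(
            (j for j in range(len(X_min)) if j != i),
            key=lambda j: math.dist(base, X_min[j]),
        )[:k]
        other = X_min[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(
            tuple(b + gap * (o - b) for b, o in zip(base, other))
        )
    return synthetic


X_min = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like(X_min, n_new=5)
print(len(new_points))  # 5
```

In imbalanced-learn, comparable functionality is available through `SMOTE` in `imblearn.over_sampling` and its `fit_resample` method.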

References:

[Santos2021a] M. S. Santos, P. H. Abreu, N. Japkowicz, A. Fernández, C. Soares, S. Wilk, J. Santos, On the joint-effect of Class Imbalance and Overlap: A Critical Review, 2021.

[Santos2021b] M. S. Santos, P. H. Abreu, N. Japkowicz, A. Fernández, J. Santos, A Unifying View of Class Imbalance and Overlap Key Concepts, Panorama and Open Avenues for Research, 2021.

[Vuttipittayamongkol2021] P. Vuttipittayamongkol, E. Elyan, A. Petrovski, On the class overlap problem in imbalanced data classification, Knowledge-based systems, 2021. [Link]

[Vuttipittayamongkol2018] P. Vuttipittayamongkol, E. Elyan, A. Petrovski, C. Jayne, Overlap-based undersampling for improving imbalanced data classification, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2018, pp. 689–697. [Link]

[Vuttipittayamongkol2020a] P. Vuttipittayamongkol, E. Elyan, Improved overlap-based undersampling for imbalanced dataset classification with application to epilepsy and parkinson’s disease, International journal of neural systems, 30(8), 2020. [Link]

[Vuttipittayamongkol2020b] P. Vuttipittayamongkol, E. Elyan, Neighbourhood-based undersampling approach for handling imbalanced and overlapped data, Information Sciences 509 (2020) 47–70. [Link]

[Nekooeimehr2016] I. Nekooeimehr, S. K. Lai-Yuen, Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets, Expert Systems with Applications 46 (2016) 405–416. [Link]

[Zhu2020] T. Zhu, Y. Lin, Y. Liu, Improving interpolation-based oversampling for imbalanced data learning, Knowledge-Based Systems 187 (2020). [Link]

[Koziarski2017] M. Koziarski, M. Wozniak, CCR: A combined cleaning and resampling algorithm for imbalanced data classification, International Journal of Applied Mathematics and Computer Science 27 (4) (2017) 727–736. [Link]

[Douzas2019] G. Douzas, F. Bacao, Geometric SMOTE: a geometrically enhanced drop-in replacement for SMOTE, Information Sciences 501 (2019) 118–135. [Link]

[Fernandez2017] A. Fernández, C. J. Carmona, M. Jose del Jesus, F. Herrera, A pareto-based ensemble with feature and instance selection for learning from multi-class imbalanced datasets, International Journal of neural systems, 27(6), 2017. [Link]

[Liu2020] Z. Liu, W. Cao, Z. Gao, J. Bian, H. Chen, Y. Chang, T. Y. Liu, Self-paced ensemble for highly imbalanced massive data classification, in: 36th IEEE International Conference on Data Engineering (ICDE), 2020, pp. 841–852. [Link]

[Fu2020] G. H. Fu, Y. J. Wu, M. J. Zong, L. Z. Yi, Feature selection and classification by minimizing overlap degree for class-imbalanced data in metabolomics, Chemometrics and Intelligent Laboratory Systems 196, 2020. [Link]

[Lango2017] M. Lango, D. Brzezinski, S. Firlik, J. Stefanowski, Discovering Minority Sub-clusters and Local Difficulty Factors from Imbalanced Data, International Conference on Discovery Science. Springer, Cham, 2017. [Link]

[Wojciechowski2017] S. Wojciechowski, S. Wilk, Difficulty factors and preprocessing in imbalanced data sets: an experimental study on artificial data, Foundations of Computing and Decision Sciences 42 (2) (2017) 149–176. [Link]

[Munoz2018] M. A. Muñoz, L. Villanova, D. Baatar, K. Smith-Miles, Instance spaces for machine learning classification, Machine Learning 107 (1) (2018) 109–147. [Link]

[Paiva2021] P. Y. A. Paiva, K. Smith-Miles, M. G. Valeriano, A. C. Lorena, PyHard: a novel tool for generating hardness embeddings to support data-centric analysis, arXiv preprint arXiv:2109.14430. [Link]

[Orriols-Puig2010] A. Orriols-Puig, N. Macia, T. K. Ho, Documentation for the data complexity library in c++, Universitat Ramon Llull, La Salle 196 (2010) 1–40. [Link]

[Lorena2019] A. C. Lorena, L. P. Garcia, J. Lehmann, M. C. Souto, T. K. Ho, How complex is your classification problem? a survey on measuring classification complexity, ACM Computing Surveys (CSUR) 52 (5) (2019) 1–34. [Link]

[Barella2018] V. H. Barella, L. P. Garcia, M. P. de Souto, A. C. Lorena, A. de Carvalho, Data complexity measures for imbalanced classification tasks, in: 2018 International Joint Conference on Neural Networks (IJCNN), IEEE, 2018, pp. 1–8. [Link]

[Garcia2020] L. P. Garcia, A. Rivolli, E. Alcobaça, A. C. Lorena, A. C. de Carvalho, Boosting meta-learning with simulated data complexity measures, Intelligent Data Analysis 24 (5) (2020) 1011–1028. [Link]

[Alcobaca2020] E. Alcobaça, F. Siqueira, A. Rivolli, L. P. F. Garcia, J. T. Oliva, A. C. P. L. F. de Carvalho, Mfe: Towards reproducible meta-feature extraction, Journal of Machine Learning Research 21 (111) (2020) 1–5. [Link]

[Smith2014] M. R. Smith, T. Martinez, C. Giraud-Carrier, An instance level analysis of data complexity, Machine learning 95 (2) (2014) 225–256. [Link]

[Kovacs2019] G. Kovács, An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets, Applied Soft Computing 83 (2019). [Link]

[Grycza2020] J. Grycza, D. Horna, H. Klimczak, K. Pluciński, Multi-imbalance: Python package for multi-class imbalance learning, Poznan University of Technology, Poland, 2020. [Link]

[Douzas2017] G. Douzas, F. Bacao, Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning, Expert systems with Applications 82 (2017): 40-52. [Link]

[Douzas2018] G. Douzas, F. Bacao, F. Last, Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE, Information Sciences 465 (2018): 1-20. [Link]

[Zhang2019] C. Zhang, J. Bi, S. Xu, E. Ramentol, G. Fan, B. Qiao, H. Fujita, Multi-imbalance: An open-source software for multi-class imbalance learning. Knowledge-Based Systems, 174, 137-143, 2019. [Link]
