c4pub / deodel

A mixed attributes predictive algorithm implemented in Python.

deodel

Update: improved accuracy with deodel2

Predictive Algorithm for Very Mixed Data

A predictive algorithm with a unique combination of features:

  • supports multi-class target classification
  • supports nominal and continuous attributes
  • admits missing values in the training and query/test data
  • admits mixed types, categorical and numerical, in the same attribute column
  • performs classification intermixed with regression
  • good accuracy

Deodel started as a Python implementation of the "Deodata Delanga" classifier [1]. It has since been extended to make predictions on very heterogeneous data. It supports attributes/features that are continuous, categorical, or a mix of both. The same applies to the target values, so it can do classification, regression, or a mix of both within the same training data. In regression mode, the algorithm implements the technique described in [2].

Its main characteristics are versatility, accuracy, and robustness.

The operation of the module is similar to that of standard classifiers from "sklearn". It accepts as input not only numpy and pandas data but also tables formatted as lists of lists.

Using a list of lists enables operation with both numerical and categorical attributes. All attribute values of type "int" or "float" are treated as numerical/continuous. A value of "None" is interpreted as "missing". Values of any other type are treated as discrete/nominal multi-valued attributes. Note that mixing categorical and numerical types within the same attribute column is also supported.
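The typing rules above can be sketched in a few lines of plain Python. This is only an illustration of the stated rules, not deodel's internal code: the function name `attribute_kind` is hypothetical, and excluding `bool` (which subclasses `int` in Python) from the numerical case is an assumption the text does not address.

```python
def attribute_kind(value):
    """Label one table cell as 'missing', 'numerical', or 'nominal'."""
    if value is None:
        return "missing"
    # Assumption: bool is treated as nominal even though it subclasses int.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "numerical"
    return "nominal"


# A single attribute column that mixes all three kinds of values.
column = ["juice", 0.06, 21, None, "coffee"]
kinds = [attribute_kind(v) for v in column]
print(kinds)
```

A column like this one is what the text calls a mixed attribute: numerical and nominal cells coexist, and missing cells are tolerated.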

The target/outcome values can also be of any type, including a mix of categorical and numerical. By default, the module will determine automatically if the numerical values are categorical outcomes or if they represent continuous numerical ones. In the latter case, the entries will be considered for regression instead of classification.

Very mixed example

To illustrate this "very mixed" ability consider this toy dataset:

#                       ____ _____________ enjoyed party
#                      /     _____________ age
#                     /     /    _________ gender
#                    /     /    /      ___ alcohol conc.
#                   /     /    /      /      
party_data = [  #  A  :  B  :  C  :  D       
                [ 'n',  12,   'm',  'juice'  ],
                [ 'y',  16,   'f',  'coffee' ],
                [ 'y',  20,   'm',   0.06    ],
                [ 'y',  21,   'f',   0.04    ],
                [ 'n',  29,   'm',   0.39    ],
             ]

Let's assume that the dataset represents data about a hypothetical party. Column D, the alcohol concentration of the beverage consumed, is an example of a mix of categorical and numerical values. If column A is used as the prediction target, it is a classification problem: did the participant enjoy the party, given their age, gender, etc.? If column D is used as the prediction target, it becomes a mixed regression/classification problem: what is the alcohol concentration of the beverage consumed? It could be a number (e.g., 0.05 for a beer) or it could be a category of non-alcoholic beverage. Deodel will generate a prediction in either case.
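To make the two problem setups concrete, the table can be split into an attribute table and a target list for whichever column is being predicted. The helper `split_target` below is a hypothetical name, not part of deodel; the sketch only prepares the data and makes no call into the library.

```python
party_data = [
    ['n', 12, 'm', 'juice'],
    ['y', 16, 'f', 'coffee'],
    ['y', 20, 'm', 0.06],
    ['y', 21, 'f', 0.04],
    ['n', 29, 'm', 0.39],
]


def split_target(table, target_idx):
    """Return (attribute rows, target values) for the chosen target column."""
    X = [[v for i, v in enumerate(row) if i != target_idx] for row in table]
    y = [row[target_idx] for row in table]
    return X, y


# Column A as target: a pure classification problem.
X_cls, y_cls = split_target(party_data, 0)

# Column D as target: numbers and categories mixed in the same target list.
X_mix, y_mix = split_target(party_data, 3)

print(y_cls)
print(y_mix)
```

In both cases the resulting lists of lists keep their original Python types, which is what lets the mixed columns be recognized as such.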

How does deodel deal with mixed attributes?

The "Deodata Delanga" algorithm is a variation of the nearest neighbor type of supervised classifier. It doesn't seek a fixed number of neighbors, but rather a nearest neighborhood with a variable number of neighbors. Unlike k-NN, which is typically used on data with continuous numerical attributes, the "Deodata Delanga" algorithm was intended to deal with discrete attributes, much like the ID3 decision tree. The deodel classifier can be viewed as a flattened/collapsed ID3 decision tree. Each branch of the flattened tree corresponds to a unique combination of attribute values, rather than just a single one, as is the case with conventional decision trees.
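The idea of a variable-size neighborhood on discrete attributes can be illustrated with a simplified sketch. It is not the actual Deodata Delanga scoring, which is defined in [1]; here the similarity is assumed to be a plain count of matching attribute values, the neighborhood is every training row that reaches the best count, and the prediction is a majority vote. The function name and the toy data are hypothetical.

```python
from collections import Counter


def predict_by_neighborhood(train_rows, train_targets, query):
    """Vote among all training rows that share the highest match count."""
    # Score = number of attribute positions where row and query agree.
    scores = [sum(a == b for a, b in zip(row, query)) for row in train_rows]
    best = max(scores)
    # Every row reaching the best score is kept, so the size is not fixed.
    neighborhood = [t for s, t in zip(scores, train_targets) if s == best]
    winner, _ = Counter(neighborhood).most_common(1)[0]
    return winner, len(neighborhood)


train_rows = [
    ['sunny', 'hot', 'high'],
    ['sunny', 'mild', 'high'],
    ['rainy', 'mild', 'normal'],
    ['rainy', 'cool', 'normal'],
    ['sunny', 'cool', 'normal'],
]
train_targets = ['no', 'no', 'yes', 'yes', 'yes']

# Two rows tie for the best score, so the neighborhood has size 2.
print(predict_by_neighborhood(train_rows, train_targets, ['rainy', 'hot', 'normal']))
# One row matches on every attribute, so the neighborhood has size 1.
print(predict_by_neighborhood(train_rows, train_targets, ['rainy', 'cool', 'normal']))
```

The two queries show why no value of k is needed: the neighborhood grows or shrinks depending on how many training rows are equally close.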

Deodel starts from a categorical data classifier and adapts it to continuous/numerical attributes by discretizing them. So, unlike many other algorithms that convert discrete data into numerical data (for example through one-hot encoding), deodel discretizes the continuous data. Obviously, this entails a loss of information. However, the loss doesn't seem to be severe, and the approach provides surprisingly good results in many settings.

Deodel is essentially non-parametric. However, there are configuration adjustments that can be tuned through an optional dictionary structure specified at initialization. The main adjustment is the number of bins used to discretize numerical attributes. By default, it is set to three. It is also possible to choose the discretization method: "equal-width" vs "equal-frequency". Also, the automatic usage of regression can be overridden.
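The difference between the two discretization methods can be seen in a small standalone sketch with three bins. The function names and sample values are hypothetical, and the code is a textbook version of each method rather than deodel's own binning; in particular, the equal-frequency variant below ranks values without special handling of ties.

```python
def equal_width_bins(values, n_bins=3):
    """Split the value range into n_bins intervals of equal length."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins=3):
    """Assign bins by rank so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


ages = [12, 16, 20, 21, 29, 60]  # hypothetical values with one outlier
print(equal_width_bins(ages))      # [0, 0, 0, 0, 1, 2]
print(equal_frequency_bins(ages))  # [0, 0, 1, 1, 2, 2]
```

With an outlier present, equal-width binning crowds most values into the first bin, while equal-frequency binning keeps the bins balanced.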

In terms of accuracy, deodel performs well on datasets with heterogeneous features. On datasets with only categorical/nominal data, the algorithm exhibits accuracy convergence: it approaches the maximum achievable accuracy as more training data is provided.

Occasionally, deodel outperforms more established algorithms like RandomForest, GradientBoostingClassifier, MLPClassifier, etc., in terms of accuracy. An example can be seen here.

Deodel is coded in Python and is compact, fitting in one file/module.

Deodel2 update (Nov 20, 2023)

Deodel2 is a new iteration of the algorithm that substantially improves accuracy. The difference in operation appears in the processing of numerical attribute values. The original version simply discretized the values and compared them as if they were categorical. This didn't take into account the distance separating the two compared values. Whether they were adjacent or at opposite extremes of their range, they would be categorized only as a mismatch. Deodel2 takes this distance into account and substantially improves accuracy for datasets containing many continuous attribute values. If the dataset consists only of categorical attributes, the accuracy will be identical for the two versions. As most datasets do contain many numerical attributes, the accuracy increases substantially and approaches that of ensemble classifiers like Random Forest.
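The limitation described above can be illustrated by comparing an all-or-nothing bin match with a distance-aware score. The source does not give deodel2's actual formula, so `graded_match` below is a purely hypothetical stand-in (one minus the distance normalized by the range), and both function names are assumptions made for this sketch.

```python
def binned_match(a, b, lo, hi, n_bins=3):
    """Original-style comparison: 1.0 if both values share a bin, else 0.0."""
    width = (hi - lo) / n_bins
    bin_a = min(int((a - lo) / width), n_bins - 1)
    bin_b = min(int((b - lo) / width), n_bins - 1)
    return 1.0 if bin_a == bin_b else 0.0


def graded_match(a, b, lo, hi):
    """Hypothetical distance-aware score: 1.0 for equal values, 0.0 at opposite extremes."""
    return 1.0 - abs(a - b) / (hi - lo)


lo, hi = 0, 30  # assumed attribute range, with hi > lo

# 9 and 11 are close but fall in adjacent bins; 1 and 29 are far apart.
print(binned_match(9, 11, lo, hi), binned_match(1, 29, lo, hi))
print(graded_match(9, 11, lo, hi), graded_match(1, 29, lo, hi))
```

Both pairs count as the same mismatch under binning, while the graded score separates the near pair from the far pair, which is the kind of information the text says the original version discarded.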

modules

  • deodel.py

    • It contains all that is needed for the operation of the classifier/regressor.
  • deodel2.py

    • A new iteration of the algorithm with better processing of numerical attributes.
  • main.py

    • Module that serves as a starting point / launchpad for use applications.
  • usap_demo.py

    • Module that contains a demo of the classifier usage.
  • usap_common.py

    • Module that contains common code used by applications.
  • usap_utest.py

    • Module that implements a non-systematic set of sanity/unit tests.
  • usap_utest2.py

    • Unit testing for deodel2.
  • usap_cmp_binning.py

  • usap_csv_eval.py


