
Java implementation of a decision tree classifier for the "Titanic - Machine Learning from Disaster" dataset.

License: GNU General Public License v3.0

Language: Java 100.00%
Topics: java, decision-tree, titanic-kaggle, titanic-survival-prediction, titanic-dataset, machinelearning

DecisionTree-Java-Implementation

A Java decision tree implementation for the "Titanic - Machine Learning from Disaster" dataset [1].

The classifier achieves an accuracy of 0.7249 on the Kaggle test data set.

Algorithm

The implementation is based on the C4.5 algorithm developed by Ross Quinlan [2]. Pseudocode and a detailed explanation of its main concepts are provided in the following sections.

  1. Pseudocode
  2. C4.5 Main concepts
    2.1 Entropy
    2.2 Information gain

Pseudocode

train(EXAMPLES, TARGET_ATTRIBUTE, ATTRIBUTES)
   if all EXAMPLES positive
     return ROOT (label = +)
   else if all EXAMPLES negative
     return ROOT (label = -)
   else if ATTRIBUTES is empty
     return ROOT (label = mcv(TARGET_ATTRIBUTE)) // mcv = most common value
   else
     A <- the attribute of ATTRIBUTES that best classifies EXAMPLES
     set A as the decision attribute of ROOT
     for each value v of A
       create new child c_v of ROOT
       create subset EXAMPLES_v of all EXAMPLES with value v for A
       if EXAMPLES_v is empty
         attach leaf below c_v (label = mcv(TARGET_ATTRIBUTE))
       else
         attach subtree train(EXAMPLES_v, TARGET_ATTRIBUTE, ATTRIBUTES - {A}) below c_v

The recursive algorithm generates a decision tree from a set of training data using the concept of (information) entropy. At each node of the tree, C4.5 chooses the attribute that most effectively splits the remaining training data into subsets. To find this "best attribute", the information gain (related to entropy, but not the same) is calculated for each attribute, and the attribute with the highest gain is chosen. The algorithm then recurses on the partitioned data subsets.
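As a rough orientation, the following is a compact, self-contained Java sketch of this training procedure. All class and method names are illustrative and do not mirror the repository's actual API; examples are modeled here as maps from attribute name to value.

import java.util.*;
import java.util.stream.Collectors;

final class DecisionTreeSketch {

    static final class Node {
        String label;                       // set on leaves
        String attribute;                   // set on inner nodes
        final Map<String, Node> children = new HashMap<>();
    }

    static Node train(List<Map<String, String>> examples, String target, Set<String> attributes) {
        Node root = new Node();
        Map<String, Long> labelCounts = examples.stream()
                .collect(Collectors.groupingBy(e -> e.get(target), Collectors.counting()));
        // All examples carry the same label, or no attributes are left to split on:
        // return a leaf labeled with the most common value (mcv). This covers the
        // first three cases of the pseudocode in one branch.
        if (labelCounts.size() == 1 || attributes.isEmpty()) {
            root.label = mostCommonValue(labelCounts);
            return root;
        }
        // The parent entropy is identical for every candidate attribute, so picking
        // the attribute with the highest information gain is equivalent to picking
        // the one with the lowest weighted child entropy (see the sections below).
        String best = attributes.stream()
                .min(Comparator.comparingDouble((String a) -> weightedEntropy(examples, a, target)))
                .orElseThrow();
        root.attribute = best;
        Set<String> remaining = new HashSet<>(attributes);
        remaining.remove(best);
        // Partitioning only yields values that actually occur in EXAMPLES, so the
        // "EXAMPLES_v is empty" case from the pseudocode cannot arise here.
        for (Map.Entry<String, List<Map<String, String>>> part
                : partition(examples, best).entrySet()) {
            root.children.put(part.getKey(), train(part.getValue(), target, remaining));
        }
        return root;
    }

    static Map<String, List<Map<String, String>>> partition(
            List<Map<String, String>> examples, String attribute) {
        return examples.stream().collect(Collectors.groupingBy(e -> e.get(attribute)));
    }

    // Weighted sum of the entropies of the subsets produced by splitting on `attribute`.
    static double weightedEntropy(List<Map<String, String>> examples, String attribute, String target) {
        double sum = 0.0;
        for (List<Map<String, String>> subset : partition(examples, attribute).values())
            sum += (double) subset.size() / examples.size() * entropy(subset, target);
        return sum;
    }

    static double entropy(List<Map<String, String>> examples, String target) {
        double h = 0.0;
        for (long count : examples.stream()
                .collect(Collectors.groupingBy(e -> e.get(target), Collectors.counting()))
                .values()) {
            double p = (double) count / examples.size();
            h -= p * Math.log(p) / Math.log(2.0);
        }
        return h;
    }

    static String mostCommonValue(Map<String, Long> labelCounts) {
        return labelCounts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .orElseThrow()
                .getKey();
    }
}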

Main concepts

The aim of the algorithm is to train a classifier with as little effort as possible while making as few errors as possible. This is achieved by classifying large parts of the data set as early as possible, which excludes large parts of the search space. In the following example, the two attributes Sex and PriceClass are not equally well suited for this.
[Figure: comparison of pure and impure subsets]
When splitting on the attribute Sex, the 24 data points are distributed over the two values male and female. The resulting subsets are homogeneous with regard to the target attribute (survived), which means the tree does not have to be trained any further. Based on the training data, the best possible decision can be made right away, namely: male passengers are classified as survived, female passengers as not survived. When splitting on PriceClass, the decision is not that simple. The 8 passengers in price class 1 are divided into 4 survivors and 4 non-survivors; when classifying a new passenger of price class 1, we could only guess the label, which is why such impure subsets have to be split further on the basis of additional attributes (if any remain). The same applies to the subsets for price class 2 and price class 3.
In a nutshell: the optimal case when training the classifier is to obtain completely pure / homogeneous subsets when splitting on an attribute (here: Sex); the worst case is completely impure / heterogeneous subsets (here: PriceClass).

Entropy

A metric for measuring the uncertainty of the class label in a subset of examples is entropy [3].
H(S) = -p_+ · log2(p_+) - p_- · log2(p_-)

where p_+ and p_- are the proportions of positive (survived) and negative (not survived) examples in the subset S.
It can be interpreted as the average number of decisions (bits) needed to distinguish the positive (survived) from the negative (not survived) examples.
The following applies to our pure subsets:
H = -1 · log2(1) - 0 · log2(0) = 0 (with the convention 0 · log2(0) = 0)
And for our completely impure subsets:
H = -0.5 · log2(0.5) - 0.5 · log2(0.5) = 1
The course of the entropy for two possible events is also nicely illustrated in the following graphic:
[Figure: entropy as a function of the probability p of one of two possible events; it is 0 at p = 0 and p = 1 and reaches its maximum of 1 at p = 0.5]

The more certain a decision, the lower the entropy. Conversely, an attribute is better the more it reduces the entropy, i.e. the more information is gained by splitting on it.
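As a small, self-contained illustration in Java (the class and method names are illustrative, not taken from the repository), the entropy of a binary subset can be computed like this:

public final class EntropyDemo {

    // Binary entropy for a subset with pos positive and neg negative examples,
    // using the convention 0 * log2(0) = 0.
    static double entropy(int pos, int neg) {
        double total = pos + neg;
        return term(pos / total) + term(neg / total);
    }

    private static double term(double p) {
        return p == 0.0 ? 0.0 : -p * Math.log(p) / Math.log(2.0);
    }

    public static void main(String[] args) {
        System.out.println(entropy(12, 0)); // pure subset   -> 0.0
        System.out.println(entropy(4, 4));  // impure subset -> 1.0
    }
}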

Information gain

Entropy only measures how pure a single subset is. Since splitting on an attribute produces several subsets, the individual entropies have to be combined into a single value in order to judge how good the split is as a whole. The keyword here is information gain.
Gain(S, A) = H(S) - Σ_{v ∈ values(A)} (|S_v| / |S|) · H(S_v)

where S_v is the subset of S for which attribute A has the value v.

Information gain is simply the difference between the entropy before and after the split on a given attribute. In the entropy after the split, each subset is weighted by its relative size, so that subsets containing more data carry more weight.
The following applies to our two examples:
[Figure: calculated information gain for the attribute Sex]
[Figure: calculated information gain for the attribute PriceClass]
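Spelled out, and assuming (as the figures above suggest) that the 24 examples split evenly into 12 survivors and 12 non-survivors, and that each of the three price classes contains 8 passengers with a 4/4 split:

H(S) = -(12/24) · log2(12/24) - (12/24) · log2(12/24) = 1
Gain(S, Sex) = 1 - ((12/24) · 0 + (12/24) · 0) = 1
Gain(S, PriceClass) = 1 - ((8/24) · 1 + (8/24) · 1 + (8/24) · 1) = 0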

For a binary target attribute, the information gain lies between 0 and 1. The higher the value, the more certain the classification decisions after the split. The best attribute is therefore the one with the highest information gain.
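The same computation as a small, runnable Java sketch (again with illustrative names, not the repository's API), reproducing the two gains above:

public final class InformationGainDemo {

    // Binary entropy for a subset given as {positives, negatives}.
    static double entropy(int[] counts) {
        double total = counts[0] + counts[1];
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;           // convention: 0 * log2(0) = 0
            double p = c / total;
            h -= p * Math.log(p) / Math.log(2.0);
        }
        return h;
    }

    // Gain(S, A) = H(S) - sum over subsets S_v of (|S_v| / |S|) * H(S_v)
    static double informationGain(int[] parent, int[][] subsets) {
        double total = parent[0] + parent[1];
        double weighted = 0.0;
        for (int[] s : subsets)
            weighted += (s[0] + s[1]) / total * entropy(s);
        return entropy(parent) - weighted;
    }

    public static void main(String[] args) {
        // Sex: two pure subsets of 12 -> gain 1.0
        System.out.println(informationGain(
                new int[]{12, 12}, new int[][]{{12, 0}, {0, 12}}));
        // PriceClass: three 4/4 subsets of 8 -> gain 0.0
        System.out.println(informationGain(
                new int[]{12, 12}, new int[][]{{4, 4}, {4, 4}, {4, 4}}));
    }
}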

A more detailed (and visualized) explanation can be found in the YouTube playlist by Victor Lavrenko [4].

References

[1] https://www.kaggle.com/c/titanic
[2] https://www.rulequest.com/Personal/
[3] https://en.wikipedia.org/wiki/Entropy_(information_theory)
[4] https://www.youtube.com/playlist?list=PLBv09BD7ez_4temBw7vLA19p3tdQH6FYO
