
C4.5.ts

🌳 Ross Quinlan's C4.5 Decision Tree Algorithm [1], implemented in TypeScript. Handles Numeric (continuous) and Categorical (discrete) attributes. Uses the Information Gain Ratio to perform splits. Performs reduced error pruning based on confidence intervals.

Getting Started

Dependencies

What you need to run this app:

  • node and npm (brew install node)
  • Ensure you're running recent versions: Node v8.x.x+ (or v9.x.x) and npm 5.x.x+
  • A Markdown viewer to visualize the results (See MacDown, Typora, or The MarkdownPreview plugin for Atom).

If you have nvm installed (brew install nvm), which is highly recommended, you can run nvm install --lts && nvm use in your shell to switch to the latest Node LTS. You can also have zsh do this for you automatically.

Once you have those, you should install these globals with npm install --global:

  • typescript (npm install --global typescript)
  • ts-node (npm install --global ts-node)

Installing

  • fork this repo
  • clone your fork
  • npm install to install required dependencies

Running the app

You can run the sample application (ts-node sample) to build a decision tree for the 'owls' test data set.

Usage Guide

Adding a DataSet

Datasets are stored in the test folder. This application only supports unlabelled .csv files.

Adding a Schema

The application learns about the structure of your data from a schema .json file, which is placed in the test folder.

In the schema you must specify the target attribute you wish to build the classification model on, as well as the names and types of all the attributes in the dataset.

C4.5 distinguishes between two types of attributes:

  • CATEGORICAL - The attribute is a discrete set of strings. (e.g. 'dog', 'cat', etc...)
  • NUMERIC - The attribute is a numerical value with no specified range. (e.g. 2.0, 4.5, 1.3, etc...)

Schemas are defined with a single target string and a list of attributes (name and type pairs).

A sample schema is shown below:

{
    "target": "PlayGolf",
    "attributes": [
        {
            "name": "Outlook",
            "type": "CATEGORICAL"
        },
        {
            "name": "Temperature",
            "type": "NUMERIC"
        },
        {
            "name": "Humidity",
            "type": "CATEGORICAL"
        },
        {
            "name": "Windy",
            "type": "CATEGORICAL"
        },
        {
            "name": "PlayGolf",
            "type": "CATEGORICAL"
        }
    ]
}
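
For reference, the same structure can be described in TypeScript. This is an illustrative sketch only; the type names below (Schema, SchemaAttribute, AttributeType) are assumptions, not the library's actual exports:

type AttributeType = 'CATEGORICAL' | 'NUMERIC';

interface SchemaAttribute {
    name: string;        // column name as it appears in the dataset
    type: AttributeType; // how C4.5 treats the column when splitting
}

interface Schema {
    target: string;                // the attribute the tree predicts
    attributes: SchemaAttribute[]; // every attribute in the .csv
}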

Constructing a DataSet

const myDataSet = new DataSet('test/tennis.csv', 'test/tennis-schema.json', 70);

The DataSet class takes 3 arguments:

  • The path to your data file (e.g. tennis.csv)
  • The path to your schema file (e.g. tennis-schema.json)
  • The split percentage you wish to use for training and testing the decision tree (e.g. 66 indicates 66% of the data is used for training, and the remaining 34% is used for testing)

Constructing a Decision Tree

const myC45 = new C45(myDataSet);

The C45 class takes 1 argument: the dataset you wish to build a decision tree for. Once instantiated, you can call the following methods:

  • C45.train() - Builds the unpruned decision tree.
  • C45.prune() - Prunes the decision tree using reduced error pruning.
  • C45.test() - Validates the tree against the withheld test data.
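
For example, after constructing myC45 as above, a typical sequence (a minimal sketch, assuming train, prune, and test are called in that order, since pruning and testing operate on a trained tree) is:

myC45.train(); // build the full, unpruned decision tree from the training split
myC45.prune(); // apply reduced error pruning
myC45.test();  // validate the pruned tree against the withheld test data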

Visualizing the Results

const myVisualizer = new Display('mytree', myC45);
// Prints the Markdown file to 'results/mytree-tennis.md'
myVisualizer.display();

The Display class prints the results of the classification process to a Markdown file. It takes 2 arguments:

  • filePrefix - The prefix for the resulting .md file
  • c45 - The C45 object you wish to display

Once instantiated, you can call the display method to print the results to results/filePrefix-filename.md

Exploring the Results

The Markdown file generated by the Display class contains the following information:

  • The name of the dataset
  • The total number of data points (instances)
  • The test/train split
  • The unpruned decision tree
  • The pruned decision tree and the results from the test set validation. Each prediction node contains a right/wrong number pair
  • The test results (accuracy and number of correctly classified instances)
  • A table that breaks down each test instance, along with its predicted and actual classifications

Sample Results File

Results for 'owls15.csv' Data Set

Statistics

Number of Instances: 135

Percentage Train/Test Split: 66%

Unpruned Decision Tree

Split on Numeric Attribute: wing-width, Value - 0.6

------Condition: lessThanEqual: 0.6, Choose: LongEaredOwl (27/0)

------Condition: greaterThan: 0.6, Split on Numeric Attribute: wing-width, Value - 1.7

------------Condition: lessThanEqual: 1.7, Split on Numeric Attribute: body-width, Value - 4.8

------------------Condition: lessThanEqual: 4.8, Split on Numeric Attribute: wing-width, Value - 1.6

------------------------Condition: lessThanEqual: 1.6, Choose: BarnOwl (25/0)

------------------------Condition: greaterThan: 1.6, Choose: SnowyOwl (1/0)

------------------Condition: greaterThan: 4.8, Split on Numeric Attribute: wing-width, Value - 1.6

------------------------Condition: lessThanEqual: 1.6, Split on Numeric Attribute: wing-length, Value - 6

------------------------------Condition: lessThanEqual: 6, Choose: BarnOwl (1/1)

------------------------------Condition: greaterThan: 6, Choose: SnowyOwl (3/0)

------------------------Condition: greaterThan: 1.6, Choose: BarnOwl (1/0)

------------Condition: greaterThan: 1.7, Split on Numeric Attribute: body-width, Value - 4.8

------------------Condition: lessThanEqual: 4.8, Split on Numeric Attribute: wing-length, Value - 5.9

------------------------Condition: lessThanEqual: 5.9, Choose: BarnOwl (1/0)

------------------------Condition: greaterThan: 5.9, Choose: SnowyOwl (1/0)

------------------Condition: greaterThan: 4.8, Choose: SnowyOwl (28/0)

Pruned Decision Tree

Split on Numeric Attribute: wing-width, Value - 0.6

------Condition: lessThanEqual: 0.6, Choose: LongEaredOwl (18/0)

------Condition: greaterThan: 0.6, Split on Numeric Attribute: wing-width, Value - 1.7

------------Condition: lessThanEqual: 1.7, Split on Numeric Attribute: body-width, Value - 4.8

------------------Condition: lessThanEqual: 4.8, Split on Numeric Attribute: wing-width, Value - 1.6

------------------------Condition: lessThanEqual: 1.6, Choose: BarnOwl (15/0)

------------------------Condition: greaterThan: 1.6, Choose: SnowyOwl (0/0)

------------------Condition: greaterThan: 4.8, Choose: BarnOwl (2/0)

------------Condition: greaterThan: 1.7, Choose: SnowyOwl (11/0)

Test Results

Correctly Classified 46 out of 46 Instances

Accuracy: 100.0%

Instance Breakdown:

| instance number | body-length | wing-length | body-width | wing-width | predicted | actual |
| --- | --- | --- | --- | --- | --- | --- |
| 113 | 2.3 | 5 | 3.3 | 1 | BarnOwl | BarnOwl |
| 101 | 2.4 | 5.5 | 3.8 | 1.1 | BarnOwl | BarnOwl |
| 32 | 3.1 | 4.6 | 1.5 | 0.2 | LongEaredOwl | LongEaredOwl |
| 29 | 3.5 | 5 | 1.3 | 0.3 | LongEaredOwl | LongEaredOwl |
| 83 | 2.8 | 7.4 | 6.1 | 1.9 | SnowyOwl | SnowyOwl |
| 2 | 3.4 | 4.6 | 1.4 | 0.3 | LongEaredOwl | LongEaredOwl |
| 114 | 2.9 | 5.7 | 4.2 | 1.3 | BarnOwl | BarnOwl |
| 7 | 3.4 | 4.8 | 1.6 | 0.2 | LongEaredOwl | LongEaredOwl |
| 11 | 3.9 | 5.4 | 1.7 | 0.4 | LongEaredOwl | LongEaredOwl |
| 36 | 3.4 | 5 | 1.6 | 0.4 | LongEaredOwl | LongEaredOwl |
| 129 | 2.2 | 6 | 4 | 1 | BarnOwl | BarnOwl |
| 124 | 2.7 | 5.2 | 3.9 | 1.4 | BarnOwl | BarnOwl |
| 6 | 3.3 | 5.1 | 1.7 | 0.5 | LongEaredOwl | LongEaredOwl |
| 18 | 3 | 4.8 | 1.4 | 0.3 | LongEaredOwl | LongEaredOwl |
| 31 | 3.9 | 5.4 | 1.3 | 0.4 | LongEaredOwl | LongEaredOwl |
| 25 | 3.4 | 4.8 | 1.9 | 0.2 | LongEaredOwl | LongEaredOwl |
| 100 | 3.1 | 6.9 | 4.9 | 1.5 | BarnOwl | BarnOwl |
| 90 | 2.7 | 5.6 | 4.2 | 1.3 | BarnOwl | BarnOwl |
| 56 | 3.8 | 7.7 | 6.7 | 2.2 | SnowyOwl | SnowyOwl |
| 131 | 3.1 | 6.7 | 4.7 | 1.5 | BarnOwl | BarnOwl |
| 89 | 2.7 | 6.4 | 5.3 | 1.9 | SnowyOwl | SnowyOwl |
| 128 | 2.9 | 5.6 | 3.6 | 1.3 | BarnOwl | BarnOwl |
| 125 | 2.5 | 6.3 | 4.9 | 1.5 | BarnOwl | BarnOwl |
| 47 | 2.8 | 6.2 | 4.8 | 1.8 | SnowyOwl | SnowyOwl |
| 57 | 3.4 | 6.2 | 5.4 | 2.3 | SnowyOwl | SnowyOwl |
| 133 | 3 | 5.6 | 4.1 | 1.3 | BarnOwl | BarnOwl |
| 78 | 2.5 | 5.7 | 5 | 2 | SnowyOwl | SnowyOwl |
| 44 | 2.3 | 4.5 | 1.3 | 0.3 | LongEaredOwl | LongEaredOwl |
| 109 | 3 | 5.7 | 4.2 | 1.2 | BarnOwl | BarnOwl |
| 105 | 2.9 | 6.4 | 4.3 | 1.3 | BarnOwl | BarnOwl |
| 95 | 3 | 5.6 | 4.5 | 1.5 | BarnOwl | BarnOwl |
| 13 | 2.9 | 4.4 | 1.4 | 0.2 | LongEaredOwl | LongEaredOwl |
| 8 | 3.7 | 5.1 | 1.5 | 0.4 | LongEaredOwl | LongEaredOwl |
| 84 | 2.8 | 7.7 | 6.7 | 2 | SnowyOwl | SnowyOwl |
| 99 | 2.9 | 6.2 | 4.3 | 1.3 | BarnOwl | BarnOwl |
| 5 | 3 | 4.9 | 1.4 | 0.2 | LongEaredOwl | LongEaredOwl |
| 60 | 3 | 6.8 | 5.5 | 2.1 | SnowyOwl | SnowyOwl |
| 107 | 2.7 | 5.8 | 4.1 | 1 | BarnOwl | BarnOwl |
| 61 | 3.3 | 6.7 | 5.7 | 2.5 | SnowyOwl | SnowyOwl |
| 12 | 3 | 4.3 | 1.1 | 0.1 | LongEaredOwl | LongEaredOwl |
| 110 | 2.4 | 5.5 | 3.7 | 1 | BarnOwl | BarnOwl |
| 34 | 4 | 5.8 | 1.2 | 0.2 | LongEaredOwl | LongEaredOwl |
| 3 | 3.6 | 5 | 1.4 | 0.2 | LongEaredOwl | LongEaredOwl |
| 81 | 3 | 6.5 | 5.2 | 2 | SnowyOwl | SnowyOwl |
| 71 | 2.9 | 6.3 | 5.6 | 1.8 | SnowyOwl | SnowyOwl |
| 23 | 3.2 | 4.7 | 1.3 | 0.2 | LongEaredOwl | LongEaredOwl |

Running Your Own Trials With Trial.ts

ts-node test/trial -d owls15 -s owls15-schema -p 66 -n 1000

The 'test/' folder contains a command line script to run multiple C4.5 trials. It will output the results of each trial to the console and print the highest performing Decision Tree to a markdown file ([number of trials]-trials-best-[filename].md).

Trial.ts accepts the following arguments:

  • -d or --dataFile - The name of your .csv data file in the test/ folder.
  • -s or --schemaFile - The name of your .json schema file in the test/ folder.
  • -p or --percentageSplit - The train/test split you wish to use. Default value is 66%.
  • -n or --trials - The number of trials you wish to run. Default value is 1000.
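
For example, an equivalent invocation using the long-form flags (assuming the test/tennis.csv and test/tennis-schema.json files from the Usage Guide are present) might look like:

ts-node test/trial --dataFile tennis --schemaFile tennis-schema --percentageSplit 70 --trials 500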

Hyperparameters

The constants.ts file in the Constants folder contains several editable hyperparameters to adjust the decision tree building and pruning processes:

  • useMinInstancePruning: boolean - Default: false. If true, any leaf nodes that are only reached by a number of training instances less than the minInstancesPerLeaf value are pruned away.
  • minInstancesPerLeaf: number - Default: 2. Minimum number of training instances that must reach a leaf node to spare it from pruning.
  • zValue: number - Default: 0.69. The z-value for confidence interval calculations. Default confidence is 75% -> z-value of 0.69.
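
As an illustration, the defaults listed above would correspond to something like the following in constants.ts (a sketch of the shape, not a verbatim copy of the file):

// Constants/constants.ts (illustrative sketch; names and defaults taken from the list above)
export const useMinInstancePruning: boolean = false; // prune leaves reached by too few training instances
export const minInstancesPerLeaf: number = 2;        // threshold applied when useMinInstancePruning is true
export const zValue: number = 0.69;                  // z-value for the 75% confidence interval used in pruning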

Running the Tests

There are several unit tests that validate the entropy and information gain calculations. You can run these with:

npm run test
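
For reference, below is a minimal TypeScript sketch of the two quantities those tests exercise: Shannon entropy and the information gain ratio. The function names and signatures are illustrative assumptions, not the library's actual API.

// Shannon entropy of a set of class labels, in bits
function entropy(labels: string[]): number {
    const counts = new Map<string, number>();
    for (const label of labels) {
        counts.set(label, (counts.get(label) || 0) + 1);
    }
    let h = 0;
    for (const count of counts.values()) {
        const p = count / labels.length;
        h -= p * Math.log2(p);
    }
    return h;
}

// Information gain ratio of a candidate split: information gain divided by split information
function gainRatio(parent: string[], subsets: string[][]): number {
    let remainder = 0; // weighted entropy of the subsets after the split
    let splitInfo = 0; // intrinsic information of the split itself
    for (const subset of subsets) {
        const weight = subset.length / parent.length;
        if (weight === 0) { continue; }
        remainder += weight * entropy(subset);
        splitInfo -= weight * Math.log2(weight);
    }
    const infoGain = entropy(parent) - remainder;
    return splitInfo === 0 ? 0 : infoGain / splitInfo;
}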

References

[1] Quinlan, J. Ross. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., 1993.

To Do

  • Learning Curves
  • ROC Curves
  • Cross-Validation
    • N-fold Cross-Validation
    • Leave-One-Out Cross-Validation
  • Additional Hyperparameters
