oner

A 1R implementation in Rust

Re-implementing the 1R experiments described in Holte, 1993.

1R is a baseline rule learning algorithm.

The algorithm generates a rule for each attribute, and then picks the "one rule" that has the best accuracy.

For example, given a data set of drinking habits with attributes such as age, time of day, and mood, 1R might produce a rule of the form:

if time="morning" then drink="coffee"
if time="afternoon" then drink="tea"
if time="evening" then drink="water"

The rule might only have, say, 60% accuracy. That's a baseline to compare to other algorithms.
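As a hypothetical sketch (not this repository's actual code), the idea can be expressed directly: for each attribute, map each attribute value to its majority class, score the resulting rule by how many training rows it classifies correctly, and keep the best attribute:

```rust
use std::collections::HashMap;

// A hypothetical, minimal sketch of 1R rule discovery (not this crate's API).
// `rows` holds the attribute values per instance; `class` the matching labels.
// Returns the winning attribute's index and its value -> class rule.
fn one_r(rows: &[Vec<&str>], class: &[&str]) -> (usize, HashMap<String, String>) {
    let n_attrs = rows[0].len();
    let mut best: Option<(usize, usize, HashMap<String, String>)> = None;
    for attr in 0..n_attrs {
        // Count class frequencies for each value of this attribute.
        let mut counts: HashMap<&str, HashMap<&str, usize>> = HashMap::new();
        for (row, c) in rows.iter().zip(class) {
            *counts.entry(row[attr]).or_default().entry(*c).or_default() += 1;
        }
        // Each value predicts its majority class; sum the correct predictions.
        let mut rule = HashMap::new();
        let mut correct = 0;
        for (value, by_class) in &counts {
            let (maj, count) = by_class.iter().max_by_key(|(_, n)| **n).unwrap();
            rule.insert(value.to_string(), maj.to_string());
            correct += *count;
        }
        // Strict `>` keeps the first attribute on ties.
        if best.as_ref().map_or(true, |b| correct > b.0) {
            best = Some((correct, attr, rule));
        }
    }
    let (_, attr, rule) = best.expect("at least one attribute");
    (attr, rule)
}
```

Accuracy here is training-set accuracy of the rule, i.e. `correct / rows.len()`; the drinks example above would come from a data set where `time` scores highest.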

Example run

New to Rust? 👋 Start by installing rustup to get various tools, including the cargo command. Then...

$ cargo build --quiet --release
โฏ ./target/release/oner -d data/fake-house/house.csv -w
Config { data: "data/fake-house/house.csv", seed: 1, training_fraction: 0.6666666666666666, hide_rules: false, use_whole_dataset: true, repeats: 25, distinct_above: 6, small: 6, missing: "?" }
// Training set accuracy: 0.70
IF size IS small THEN low
IF size IS big THEN high
IF size IS medium THEN medium

Example data sets

This application assumes attributes (features) are the columns and rows are the instances (examples).

I have taken data sets and converted them to CSV where necessary, including adding header rows.

The data folder contains the data from various sources. Unless otherwise specified, the source is the UCI Machine Learning Repository.

bc

A breast cancer dataset.

In the CSV version I have moved the class from the first column to the last column (which is what this code expects). I did this with:

awk -F, '{print $2,$3,$4,$5,$6,$7,$8,$9,$10,$1}' OFS=, < breast-cancer.data > bc.csv

Holte's experiments (section 2.2 of Holte, 1993) used a random 2/3 of the data for training and 1/3 for testing, repeated 25 times. The experiment resulted in a 0.687 classification accuracy on the test set (Holte, table 3) against a baseline (0R) accuracy of 0.703 (table 2).

| Model | Accuracy % |
| --- | --- |
| 0R | 70.3 |
| 1R | 68.7 |
| This code (mean of 10 seeds) | 68.4 |
| This code (median of 10 seeds) | 68.3 |
| This code (range of 10 seeds) | 67.6 - 69.6 |

ch

The Chess (King-Rook vs. King-Pawn) dataset.

| Model | Accuracy % |
| --- | --- |
| 0R | 52.5 |
| 1R | 67.6 |
| This code (mean of 10 seeds) | 67.6 |
| This code (median of 10 seeds) | 67.6 |
| This code (range of 10 seeds) | 67.2 - 67.8 |

ir

The Iris dataset. The CSV version was created with:

$ echo "SepalLengthInCm,SepalWidthInCm,PetalLengthInCm,PetalWidthInCm,Class" > iris.csv
$ cat iris.data >> iris.csv

| Model | Accuracy % |
| --- | --- |
| 0R | 33.3 |
| 1R | 93.5 |
| This code (mean of 10 seeds) | 95.1 |
| This code (median of 10 seeds) | 95.0 |
| This code (range of 10 seeds) | 94.5 - 95.9 |

Using the whole data set:

โฏ ./target/release/oner -d data/ir/iris.csv -w
Config { data: "data/ir/iris.csv", seed: 1, training_fraction: 0.6666666666666666, hide_rules: false, use_whole_dataset: true, repeats: 25, distinct_above: 6, small: 6, missing: "?" }
// Training set accuracy: 0.960
IF PetalWidthInCm IS < 1 THEN Iris-setosa
IF PetalWidthInCm IS >= 1 and < 1.7 THEN Iris-versicolor
IF PetalWidthInCm IS >= 1.7 THEN Iris-virginica

fake-house

The dataset used to introduce 1R in Interpretable Machine Learning (published under CC BY-NC-SA 4.0). To run the example use the -w flag to use the whole dataset for rule discovery.

Configuration

oner 0.2.0

USAGE:
    oner [FLAGS] [OPTIONS] --data <filename>

FLAGS:
        --help                 Prints help information
    -h, --hide-rules           Suppress printing of rules at end of run
    -w, --use-whole-dataset    Use all the data for training and testing (overrides -t)
    -V, --version              Prints version information

OPTIONS:
        --distinct-above <distinct-above>
            An attribute must have more than this number of distinct values for a column to be detected as numeric
            (and so quantized) [default: 6]
    -d, --data <filename>                          Complete data set to learn from (in CSV format, with header row)
    -m, --missing <missing>
            When quantizing, a value to treat as a missing value (in addition to blank attribute values) [default: ?]

    -r, --repeats <repeats>
            Number of times to repeat an experiment to report average accuracy [default: 25]

    -s, --seed <seed>                              Random seed [default: 1]
        --small <small>
            When quantizing, an interval's dominant class must occur more than this many times [default: 6]

    -t, --training-fraction <training-fraction>
            Fraction of the data to use for training (vs. testing) [default: 0.6666666666666666]

Licence

Copyright 2020 Richard Dallaway

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/.

oner's People

Contributors: d6y
oner's Issues

Handle ties in rule discovery

During discovery, if two rules have the same accuracy, how should we pick between them?

The current implementation takes the first rule with the highest accuracy, which I think means the result depends on the order of the attributes in the dataset.
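A hypothetical sketch of that behaviour: scanning (attribute index, accuracy) pairs with a strict `>` keeps the first maximum, so reordering tied attributes changes the winner.

```rust
// Hypothetical illustration (not this crate's code): pick the attribute with
// the best accuracy. The strict `>` means the first of several tied
// attributes wins, so the result depends on column order.
fn pick_best(scores: &[(usize, f64)]) -> usize {
    let mut best = scores[0];
    for &s in &scores[1..] {
        if s.1 > best.1 {
            best = s;
        }
    }
    best.0
}
```

A deterministic alternative would be to break ties on something data-independent, such as the attribute name.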

Support missing values

From Holte (1993):

It handles missing values by treating "missing" as a legitimate value. Appendix A gives pseudocode for 1R.

Support continuous values in features

From Holte (1993):

It treats all numerically valued attributes as continuous and uses a straightforward method to divide the range of values into several disjoint intervals.

and

To be counted, in table 2, as continuous (column entitled "cont") an attribute must have more than six numerical values

and

In dividing the continuous range of values into a finite number of intervals, it is tempting to make each interval "pure," i.e., containing examples that are all of the same class. But just as overfitting may result from deepening a decision tree until all the leaves are pure, so too overfitting may result from subdividing an interval until all the subintervals are pure. To avoid this, 1R requires all intervals (except the rightmost) to contain more than a predefined number of examples in the same class. Based on the results in Holte et al. (1989), the threshold was set at six for all datasets except for the datasets with fewest examples (LA, SO) where the threshold was set at three.

and Appendix A:

The user also sets SMALL, the "small disjunct" threshold

  1. FOR EACH NUMERICAL ATTRIBUTE, A, create a nominal version of A by defining a finite number of intervals of values. These intervals become the "values" of the nominal version of A. For example, if A's numerical values are partitioned into three intervals, the nominal version of A will have three values: "interval 1," "interval 2," and "interval 3." [...]

Definitions:
Class C is optimal for attribute A, value V, if it maximizes COUNT[C,V,A].
Class C is optimal for attribute A, interval I, if it maximizes COUNT[C, "interval I",A].

Values are partitioned into intervals so that every interval satisfies the following constraints:
(a) there is at least one class that is "optimal" for more than SMALL of the values in the interval (this constraint does not apply to the rightmost interval); and
(b) if V[I] is the smallest value for attribute A in the training set that is larger than the values in interval I, then there is no class C that is optimal both for V[I] and for interval I.
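A much-simplified, hypothetical sketch of this quantization (not this crate's implementation, and only approximating constraints (a) and (b)): sort the (value, class) pairs, grow an interval until its dominant class occurs more than `small` times, and only close the interval at a boundary where the class changes, so a cut never falls inside a run of one class.

```rust
use std::collections::HashMap;

// Simplified, hypothetical quantization sketch: returns cut points between
// intervals. `small` plays the role of Holte's SMALL threshold.
fn cut_points(mut data: Vec<(f64, &str)>, small: usize) -> Vec<f64> {
    data.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    let mut cuts = Vec::new();
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for window in data.windows(2) {
        let (v, class) = window[0];
        let (next_v, next_class) = window[1];
        *counts.entry(class).or_default() += 1; // count the left-hand example
        let dominant = counts.values().copied().max().unwrap_or(0);
        // Close the interval once the dominant class exceeds `small` and the
        // class changes at this boundary (a crude stand-in for constraint (b)).
        if dominant > small && class != next_class && v < next_v {
            cuts.push((v + next_v) / 2.0); // midpoint as the cut value
            counts.clear(); // start the next interval
        }
    }
    cuts
}
```

The rightmost interval is implicitly exempt from the threshold, as in the quoted constraint (a), because any remaining examples after the last cut simply form the final interval.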

Repeats is off by one

If you ask for 25 repeats of an experiment, you see output 1 to 24. I'd expect to see 25 lines of output.
