oner

A 1R implementation in Rust

Re-implementing the 1R experiments described in Holte, 1993.

1R is a baseline rule learning algorithm.

The algorithm generates a rule for each attribute, and then picks the "one rule" that has the best accuracy.

For example, given a data set of drinking habits with attributes such as age, time of day, and mood, 1R might produce a rule of the form:

if time="morning" then drink="coffee"
if time="afternoon" then drink="tea"
if time="evening" then drink="water"

The rule might only have, say, 60% accuracy. That's a baseline to compare to other algorithms.
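As a hypothetical sketch (not this repository's actual code), the idea can be expressed directly: for each attribute, map each attribute value to its majority class, score the resulting rule by how many training rows it classifies correctly, and keep the best attribute:

```rust
use std::collections::HashMap;

// A hypothetical, minimal sketch of 1R rule discovery (not this crate's API).
// `rows` holds the attribute values per instance; `class` the matching labels.
// Returns the winning attribute's index and its value -> class rule.
fn one_r(rows: &[Vec<&str>], class: &[&str]) -> (usize, HashMap<String, String>) {
    let n_attrs = rows[0].len();
    let mut best: Option<(usize, usize, HashMap<String, String>)> = None;
    for attr in 0..n_attrs {
        // Count class frequencies for each value of this attribute.
        let mut counts: HashMap<&str, HashMap<&str, usize>> = HashMap::new();
        for (row, c) in rows.iter().zip(class) {
            *counts.entry(row[attr]).or_default().entry(*c).or_default() += 1;
        }
        // Each value predicts its majority class; sum the correct predictions.
        let mut rule = HashMap::new();
        let mut correct = 0;
        for (value, by_class) in &counts {
            let (maj, count) = by_class.iter().max_by_key(|(_, n)| **n).unwrap();
            rule.insert(value.to_string(), maj.to_string());
            correct += *count;
        }
        // Strict `>` keeps the first attribute on ties.
        if best.as_ref().map_or(true, |b| correct > b.0) {
            best = Some((correct, attr, rule));
        }
    }
    let (_, attr, rule) = best.expect("at least one attribute");
    (attr, rule)
}
```

Accuracy here is training-set accuracy of the rule, i.e. `correct / rows.len()`; the drinks example above would come from a data set where `time` scores highest.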

Example run

New to Rust? 👋 Start by installing rustup to get various tools, including the cargo command. Then...

$ cargo build --quiet --release
โฏ ./target/release/oner -d data/fake-house/house.csv -w
Config { data: "data/fake-house/house.csv", seed: 1, training_fraction: 0.6666666666666666, hide_rules: false, use_whole_dataset: true, repeats: 25, distinct_above: 6, small: 6, missing: "?" }
// Training set accuracy: 0.70
IF size IS small THEN low
IF size IS big THEN high
IF size IS medium THEN medium

Example data sets

This application assumes attributes (features) are the columns and rows are the instances (examples).

I have taken data sets and converted them to CSV where necessary, including adding header rows.

The data folder contains the data from various sources. Unless otherwise specified, the source is the UCI Machine Learning Repository.

bc

A breast cancer dataset.

In the CSV version I have moved the class from the first column to the last column (which is what this code expects). I did this with:

awk -F, '{print $2,$3,$4,$5,$6,$7,$8,$9,$10,$1}' OFS=, < breast-cancer.data > bc.csv

Holte's experiments (section 2.2 of Holte, 1993) used a random 2/3 of the data for training and 1/3 for testing, repeated 25 times. The experiment resulted in a 0.687 classification accuracy on the test set (Holte, table 3) against a baseline (0R) accuracy of 0.703 (table 2).

| Model | Accuracy % |
| --- | --- |
| 0R | 70.3 |
| 1R | 68.7 |
| This code (mean of 10 seeds) | 68.4 |
| This code (median of 10 seeds) | 68.3 |
| This code (range of 10 seeds) | 67.6 - 69.6 |

ch

The Chess (King-Rook vs. King-Pawn) dataset.

| Model | Accuracy % |
| --- | --- |
| 0R | 52.5 |
| 1R | 67.6 |
| This code (mean of 10 seeds) | 67.6 |
| This code (median of 10 seeds) | 67.6 |
| This code (range of 10 seeds) | 67.2 - 67.8 |

ir

The Iris dataset. The CSV version was created with:

$ echo "SepalLengthInCm,SepalWidthInCm,PetalLengthInCm,PetalWidthInCm,Class" > iris.csv
$ cat iris.data >> iris.csv

| Model | Accuracy % |
| --- | --- |
| 0R | 33.3 |
| 1R | 93.5 |
| This code (mean of 10 seeds) | 95.1 |
| This code (median of 10 seeds) | 95.0 |
| This code (range of 10 seeds) | 94.5 - 95.9 |

Using the whole data set:

โฏ ./target/release/oner -d data/ir/iris.csv -w
Config { data: "data/ir/iris.csv", seed: 1, training_fraction: 0.6666666666666666, hide_rules: false, use_whole_dataset: true, repeats: 25, distinct_above: 6, small: 6, missing: "?" }
// Training set accuracy: 0.960
IF PetalWidthInCm IS < 1 THEN Iris-setosa
IF PetalWidthInCm IS >= 1 and < 1.7 THEN Iris-versicolor
IF PetalWidthInCm IS >= 1.7 THEN Iris-virginica

fake-house

The dataset used to introduce 1R in Interpretable Machine Learning (published under CC BY-NC-SA 4.0). To run the example use the -w flag to use the whole dataset for rule discovery.

Configuration

oner 0.2.0

USAGE:
    oner [FLAGS] [OPTIONS] --data <filename>

FLAGS:
        --help                 Prints help information
    -h, --hide-rules           Suppress printing of rules at end of run
    -w, --use-whole-dataset    Use all the data for training and testing (overrides -t)
    -V, --version              Prints version information

OPTIONS:
        --distinct-above <distinct-above>
            An attribute must have more than this number of distinct values for a column to be detected as numeric
            (and so quantized) [default: 6]
    -d, --data <filename>                          Complete data set to learn from (in CSV format, with header row)
    -m, --missing <missing>
            When quantizing, a value to treat as a missing value (in addition to blank attribute values) [default: ?]

    -r, --repeats <repeats>
            Number of times to repeat an experiment to report average accuracy [default: 25]

    -s, --seed <seed>                              Random seed [default: 1]
        --small <small>
            When quantizing, an interval's dominant class must occur more than this many times [default: 6]

    -t, --training-fraction <training-fraction>
            Fraction of the data to use for training (vs. testing) [default: 0.6666666666666666]

Licence

Copyright 2020 Richard Dallaway

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/.

oner's People

Contributors: d6y
oner's Issues

Handle ties in rule discovery

During discovery, if two rules have the same accuracy, how should we pick between them?

The current implementation takes the first rule with the highest accuracy, which I think means the result depends on the order of the attributes in the dataset.
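A hypothetical sketch of that behaviour: scanning (attribute index, accuracy) pairs with a strict `>` keeps the first maximum, so reordering tied attributes changes the winner.

```rust
// Hypothetical illustration (not this crate's code): pick the attribute with
// the best accuracy. The strict `>` means the first of several tied
// attributes wins, so the result depends on column order.
fn pick_best(scores: &[(usize, f64)]) -> usize {
    let mut best = scores[0];
    for &s in &scores[1..] {
        if s.1 > best.1 {
            best = s;
        }
    }
    best.0
}
```

A deterministic alternative would be to break ties on something data-independent, such as the attribute name.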

Support missing values

From Holte (1993):

It handles missing values by treating "missing" as a legitimate value. Appendix A gives pseudocode for 1R.

Support continuous values in features

From Holte (1993):

It treats all numerically valued attributes as continuous and uses a straightforward method to divide the range of values into several disjoint intervals.

and

To be counted, in table 2, as continuous (column entitled "cont") an attribute must have more than six numerical values

and

In dividing the continuous range of values into a finite number of intervals, it is tempting to make each interval "pure," i.e., containing examples that are all of the same class. But just as overfitting may result from deepening a decision tree until all the leaves are pure, so too overfitting may result from subdividing an interval until all the subintervals are pure. To avoid this, 1R requires all intervals (except the rightmost) to contain more than a predefined number of examples in the same class. Based on the results in Holte et al. (1989), the threshold was set at six for all datasets except for the datasets with fewest examples (LA, SO) where the threshold was set at three.

and Appendix A:

The user also sets SMALL, the "small disjunct" threshold

  1. FOR EACH NUMERICAL ATTRIBUTE, A, create a nominal version of A by defining a finite number of intervals of values. These intervals become the "values" of the nominal version of A. For example, if A's numerical values are partitioned into three intervals, the nominal version of A will have three values: "interval 1," "interval 2," and "interval 3." [...]

Definitions:
Class C is optimal for attribute A, value V, if it maximizes COUNT[C,V,A].
Class C is optimal for attribute A, interval I, if it maximizes COUNT[C, "interval I",A].

Values are partitioned into intervals so that every interval satisfies the following constraints:
(a) there is at least one class that is "optimal" for more than SMALL of the values in the interval (this constraint does not apply to the rightmost interval); and
(b) if V[I] is the smallest value for attribute A in the training set that is larger than the values in interval I, then there is no class C that is optimal both for V[I] and for interval I.
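A much-simplified, hypothetical sketch of this quantization (not this crate's implementation, and only approximating constraints (a) and (b)): sort the (value, class) pairs, grow an interval until its dominant class occurs more than `small` times, and only close the interval at a boundary where the class changes, so a cut never falls inside a run of one class.

```rust
use std::collections::HashMap;

// Simplified, hypothetical quantization sketch: returns cut points between
// intervals. `small` plays the role of Holte's SMALL threshold.
fn cut_points(mut data: Vec<(f64, &str)>, small: usize) -> Vec<f64> {
    data.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    let mut cuts = Vec::new();
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for window in data.windows(2) {
        let (v, class) = window[0];
        let (next_v, next_class) = window[1];
        *counts.entry(class).or_default() += 1; // count the left-hand example
        let dominant = counts.values().copied().max().unwrap_or(0);
        // Close the interval once the dominant class exceeds `small` and the
        // class changes at this boundary (a crude stand-in for constraint (b)).
        if dominant > small && class != next_class && v < next_v {
            cuts.push((v + next_v) / 2.0); // midpoint as the cut value
            counts.clear(); // start the next interval
        }
    }
    cuts
}
```

The rightmost interval is implicitly exempt from the threshold, as in the quoted constraint (a), because any remaining examples after the last cut simply form the final interval.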

Repeats is off by one

If you ask for 25 repeats of an experiment, you see output 1 to 24. I'd expect to see 25 lines of output.
