Branch-and-bound algorithm, with caching, for decision lists.
The main function is bbound
in code/serial_priority.py
and the work in the inner loop is done by incremental
in code/branch_bound.py
cd paper
make
Python 2.7x
numpy
tabular
matplotlib
gmpy2
These can be installed on Mac OS X with brew install
.
mpfr
libmpc
tdata_R.out : set of rules
tdata_R.labels : labels
nrules : 377
ndata : 639
total pairs : 70876
commuting pairs : 22445
Note that the labeled training data is biased in the sense that 63.8% (408/639) have label 1. We thus expect short rule lists to predict 1 by default.
By design, the set of rules has 377 elements and does not include all possible rules a human might think to write down for this problem. Specifically, it does not include two of the rules for 'o' winning: {c1=o,c2=o,c3=o} and {c7=o,c8=o,c9=o}.
It turns out that the greedy algorithm identifies as the first rule a rule for 'o' winning: {c2=o,c5=o,c8=o}. As a result, the greedy algorithm continues to add the other rules for 'o' winning. Since only 6 out of 8 such rules exist in the set of rules, the greedy algorithm, with maximum rule list length set to 8, outputs a list that only achieves an accuracy of 0.912 on the training set. This is fine for us, since we are only using the greedy algorithm to find a reasonable warm start rule list.
greedily constructed prefix = (153, 372, 176, 178, 134, 51, 121, 125)
Note that if we initialize the greedy algorithm with a rule for 'x' winning, it will find a perfect rule list. Furthermore, the rules in this rule list perfectly partition the dataset.
adult_R.out : set of rules
adult_R.labels : labels
nrules : 284
ndata : 30081
sum(ones) : 22645
bias : 0.7528
total pairs : 40186
commuting pairs : 7235
This module contains functions and data structures used by variants of the branch-and-bound algorithm, including an object for a cache element.
This module contains a serial implementation of the branch-and-bound algorithm, with a cache to support incremental computation. Prefixes are added to the queue greedily, which causes the queue to grow exponentially fast. It's here because it's a bit easier to understand.
This module contains a serial implementation of the branch-and-bound algorithm, with a cache to support incremental computation and a queue that grows lazily. The queue's length is bounded by the total number of rules.
How many prefixes get stored in the cache? We empirically measure this, for different (simulated) warm starts. Note that to simulate a warm start with a particular accuracy, we don't actually need to generate a prefix with that accuracy -- we simply set max_accuracy to the desired level. For the tic-tac-toe dataset, we know that a perfect rule list can be generated from the given rules.
This module contains a serial implementation of the branch-and-bound algorithm, with a cache to support incremental computation, a queue that grows lazily, and garbage collection. The queue's length is bounded by the total number of rules. This will replace serial_lazy.py, as includes introduces garbage collection in a modular and optional fashion.
Each round, we track groups of prefixes that are equivalent up to permutation. Since the prefixes in such a group capture the same data, we only keep one that has the highest accuracy within the group.
cache_size[i] + captured_zero[i] + dead_prefix[i] + inferior[i]
= (nrules - i + 1) * (cache_size[i-1] - dead_prefix_start[i] - stunted_prefix[i])
Update: additional symmetry-based pruning based on (all sets of) rules that commute.
This module contains a serial implementation of the branch-and-bound algorithm, with:
-
A cache that supports incremental computation. The cache data structure is a hash map (Python
dict
) with keys encoding prefixes and also has a tree structure reflecting relationships among prefixes. -
A priority queue to manage different scheduling policies (implemented using Python's
heapq
module). -
Symmetry-aware garbage collection for sets of prefixes that are equivalent up to permutation -- we only keep the best.
-
Symmetry-aware pruning for equivalence classes of prefixes that contain (possibly multiple) adjacent pairs of commuting rules -- we only evaluate one.
A small amount of work is done once upfront to compute pairs of rules that commute globally due to zero overlap, and to compute pairs of rules where one dominates another, i.e., A dominates B if A captures all the data that B captures.
Note that the maximum length prefix we have to check is
M = (best observed objective) / c < 1 / c.
Furthermore, if we ever encounter a perfect prefix of length L, then we can set
M = L - 1.
(If the policy is breadth first, we can just stop. Otherwise, the algorithm stops when the queue is empty.)
Suppose we take prefix P of length K off the queue. That means we've already evaluated P and placed it in the cache.
First we check whether P is still in the cache (it could have been garbage collected, e.g., because we found a new objective smaller than its lower bound, or because we found a permutation with lower objective). You could say that we garbage collect the queue lazily. If P is not in the cache, we stop (and continue to the next prefix in the queue).
Next we construct the list of new rules to consider. Naively, this would be all rules not in the prefix. Let R be the last rule in P. We eliminate the following rules from consideration:
-
rules that have zero overlap with R and have a larger index
-
rules that are dominated by (any rule in) P
-
rules that we already know would be rejected by P
A rule is rejected by a prefix if it doesn't correctly capture enough data
(it must correctly capture >= c * ndata
).
Let Q be P's parent. If Q rejects a rule, then P will also reject that rule.
This 'inheritance' of rejected rules only depends on which data are captured
by Q, and doesn't actually depend on the order of rules in Q.
Let S be the set of rules formed from (K-1) rules of P, in any order.
P inherits rejected rules from any elements of S.
Because of our symmetry-based garbage collection of prefixes equivalent up
to a permutation, there are at most K elements of S in the cache;
we can identify these via the inverse canonical map (ICM) that maps an
ordered prefix to its permutation in the cache.
We thus lazily initialize the list of P's reject list of rejected rules.
- Aside: This depends on finding elements of S, which depends on what's in the cache. When is a prefix not in the cache? Either it hasn't yet been evaluated, or it has been partially evaluated and not inserted, or evaluated and (not inserted, or inserted and later deleted). The cache is thus complemented by information that either isn't inserted or gets deleted -- are we throwing away something useful here?
Now we have a list of candidate new rules. For each candidate rule R, we compute the following:
(1) We compute which data are captured by R. If R captures < (ndata * c)
data, then R captured insufficient data.
We add R to the reject list of P and continue to the next candidate rule.
(2) If R captures all data uncaptured by P, then R now behaves like the default rule, but at the cost of adding a rule. We add R to the reject list of P and continue.
(3) We compute the majority class of data captured by R,
and count how many data it correctly predicts.
If this number is < (ndata * c)
, then R correctly captured insufficient data.
We add R to the reject list of P and continue.
Let P' be prefix P extended with rule R. We compute the lower bound of the objective of R
(number of mistakes) / ndata + c * len(P')
(4) If it's greater than the smallest observed objective, then P' is a dead prefix and we continue to the next candidate.
We compute the default prediction for P', which lets us compute the objective. If it's smaller than the best objective we've seen, we update that.
(5) The children of P' are one longer that P' (i.e., are length K + 2). If this length exceeds M, the maximum prefix length we have to check, then there's no point to pursuing P', it is a dead prefix and we continue to the next candidate.
Also because the children of P' are one longer that P', we can give them a tighter lower bound,
(number of mistakes) / ndata + c * len(P') + c
(6) If this is greater than the smallest observed objective, then there's no point in pursuing P', it is a dead prefix and we continue to the next candidate.
(7) We check whether there's a permutation of P' in the ICM.
- If not, then add an entry for P' to the ICM.
- Otherwise, call the permutation T.
- If the objective of P' is lower than T, then T is inferior and we replace it with P'.
- Otherwise, P' is inferior and we continue to the next candidate.
If P' reaches this far, then we construct a cache entry for it.
The cache data structure is a hash map that also encodes a tree. The keys of the hash map encode prefixes; retrieving a cache entry for a given prefix is thus an O(1) operation. The tree structure reflects relationships among prefixes, and these relationships enable incremental computation in our branch-and-bound algorithm.
Each node of the tree is indexed by a prefix.
The root node is the empty prefix and nodes at depth d
(below) the root
correspond to prefixes of length d
.
The children of a node indexed by prefix
are prefixes starting with prefix
and longer by one rule (not in prefix
).
The parent of a node indexed by prefix
of length k > 0
is therefore indexed
by trim(prefix)
, where this operation returns a prefix whose rules are given
by the first k - 1
rules of prefix
.
It is thus efficient to find the parent and other ancestors of a given prefix,
i.e., to traverse the tree up towards the root.
To support efficient traversals in the direction of descendents, a node must
also keep track of its children; our implementation is detailed below, in our
description of cache entry attributes.
The cache's tree structure also encodes the state of computation and contains
two kinds of nodes, used
and unused
.
When a cache entry is first inserted, it is unused
in the sense that it has
not yet been (completely) used in incremental computations related to its children.
If it is ever used for such computations, we call it used
.
An unused
cache entry has the following attributes:
-
prefix : n-tuple of integers corresponding to rules (redundant w.r.t. key)
-
prediction : binary n-tuple of corresponding predictions
-
default_rule : binary value encoding the default rule
-
accuracy : real value in [0, 1], accuracy (not necessary)
-
upper_bound : upper bound on the accuracy for rule lists starting with prefix (not necessary)
-
objective : objective value of the prefix (not necessary)
-
lower_bound : lower bound on the objective value for rule lists starting with prefix
-
num_captured : number of data captured by the prefix
-
num_captured_correct : number of data captured by the prefix and predicted correctly
-
not_captured : binary vector encoding which data are not captured by the prefix (represented as an integer using
mpz
) -
curiosity : curiosity (not necessary for cache but can be used by priority queue)
-
children : set of integers encoding child nodes
-
num_children : number of children
-
reject_list : tuple of integers encoding rejected rules that should never be appended to
prefix
(if a rule captures insufficient data when appended to a prefix, then it will be insufficient for any rule list that starts with that prefix)
If an unused
cache entry becomes used
, then we delete these attributes used
in the computation because they will not be used again:
prefix, prediction, default_rule, accuracy, objective,
num_captured, num_captured_correct, not_captured, curiosity
A used
cache entry thus retains four attributes:
-
lower_bound : the lower bounds of interior nodes are used to prune the cache when the minimum observed objective decreases (via the
garbage_collect
routine) -
children : this attribute encodes the tree structure
-
num_children : if
num_children
reaches zero, theused
cache entry is adead end
and we delete it (via theprune_up
routine, which also removesdead end
ancestors) -
reject_list : when an
unused
cache entry for prefix P of length L is retrieved, this attribute is lazily initialized by taking the union ofreject_list
elements from any cached prefix in S, where S is the set of prefixes formed by taking subsets of (L-1) rules from P; this is efficient due to our symmetry-aware garbage collection and an associated data structure.
-
insert(prefix, cache_entry) : insert an
unused
cache entry forprefix
(and update its parent) -
delete(prefix) : delete
prefix
and all of its descendents (and update its parent, callingprune_up
if relevant)
-
insert(prefix, cache_entry) : symmetry-aware garbage collection happens on cache entry insertion; we discuss this below
-
prune_up(prefix) : if
prefix
corresponds to adead end
, i.e., aused
cache entry with no children, then remove cache entries forprefix
as well as anydead end
ancestors -
garbage_collect(min_objective) : delete all prefixes with
lower_bound > min_objective
; called whenmin_objective
decreases
Each is a function that maps a prefix to a value (bounded below by zero, corresponding to the highest priority).
Prefix length : Implements breadth-first search
Objective lower bound : The lower bound on the objective is the sum of the lower bound on the number of mistakes and the regularization term.
priority = (# captured and incorrect) / (# data) + c * (prefix length)
Curiosity : Expected objective if all data are captured.
priority = (prefix misclassification) + c * (expected prefix length)
Where
prefix misclassification = (# captured and incorrect) / (# captured)
and
expected prefix length = (prefix length) * (# data) / (# captured)
Objective : The objective is the sum of the fraction of mistakes and a regularization term.
priority = (# incorrect) / (# data) + c * (prefix length)
If two rules A and B capture non-intersecting subsets of data, they
commute globally (type I).
If rules A and B commute, then a rule list where A and B are adjacent is
equivalent to another rule list where A and B swap positions.
More generally, a rule list containing possibly multiple, possibly overlapping,
pairs of commuting rules is equivalent to any other rule list that can be
generated by swapping one or more such pairs of rules.
We avoid evaluating multiple such equivalent rule lists by eliminating all but one.
We achieve this by restricting how we grow a prefix (by appending a rule),
as informed by a hash map cdict
that maps each rule R_i
to a set of rules
S_i = {R_j}
such that R_i
commutes with every R_j
in S_i
and j > i
.
When growing a prefix that ends with rule R_i
, we only append a rule R_j
if
it is not in S_i
.
For tdata
, cdict
maps each rule to a set that on average has 59 rules
(15% of 377 total).
For adult
, cdict
maps each rule to a set that on average has 25 rules
(8% of 284 total).
Note that we could also define a local version of (type I) commuting.
If two rules A and B are adjacent in a rule list and the both predict "the same way" (i.e., both predict "0" or both predict "1" for captured data), then they commute locally (II). The way we handle this currently could be made more efficient.
Rule A dominates rule B if the data captured by B is a subset of the data
captured by A.
Thus, rule B should never follow rule A in a prefix, because it will never
capture additional data.
Similar to symmetry-aware pruning, we achieve this by restricting how we grow a
prefix, as informed by a hash map rdict
that maps a rule R_i
to a set of
rules T_i = {R_k}
such that R_i
dominates every R_k
in T_i
.
When growing a prefix that ends with rule R_i
, we only append a rule R_k
if
it is not in T_i
.
Notice that the intersection of S_i
and T_i
is the empty set, thus the
mappings represented by cdict
and rdict
can easily be combined into a single
mapping.
For tdata
, rdict
maps each rule to a set that on average has 3 rules.
For adult
, rdict
maps each rule to a set that on average has 2 rules.
If two prefixes P and Q are composed of the same rules and equivalent up to a permutation, then they also capture the same data. The two corresponding rule lists need not yield the same objective, since the objective depends on rule order. Obtain a rule list P' by appending P with some ordered list of unique rules not contained in P, and Q' by appending Q with the same ordered list. The performance of P' compared to P will be the same as that of Q' compared to Q, e.g.,
objective(P') - objective(P) = objective(Q') - objective (Q).
Thus, we can delete whichever of P or Q performs worse than the other.
We call this symmetry-aware garbage collection.
Since a prefix of length k
belongs to an equivalence class of k!
prefixes
equivalent up to permutation, this garbage collection dramatically prunes the
search space.
(More generally, when two prefixes capture the same data, we only need to keep the better of the two; we have not investigated the utility of this idea.)
We perform symmetry-aware garbage collection on cache insertion, which enforces
that the cache never contains multiple prefixes equivalent up to permutation.
We support this via a hash map data structure (Python dict
) that we call the
inverse canonical map (ICM).
A set of all prefixes equivalent up to permutation is represented by any member
of the set; we choose the prefix with rules ordered from least to greatest to
represent this set, and call this prefix canonical.
The ICM maps canonical prefixes to cached prefixes (and for convenience,
their objectives) and contains one entry for each cache entry, and lets us
quickly check whether a prefix is equivalent to a cached prefix up to
permutation, and if so, determine which is better.
We couple ICM and cache updates: cache insertions and deletions trigger
corresponding ICM operations.
Semantics should buy us more than we're already achieving. Let's optimize the crap out of tic-tac-toe!
From Cynthia: If the rules make the same prediction on all of their overlapping points, then they commute.
Compute commutes as rule is added!
Curiosity doesn't seem to be helping adult
with c = 0.01
find the best
prefix of length 3, compared to objective
and breadth_first
.
Run breadth_first
on all prefixes of at least length 4.
Does the FP growth code have ideas that would be useful to us?
Conditional commutativity?
Insertions and deletions into pdict
(and priority_queue
?) should happen in
parallel with analogous cache operations.
May want to completely remove dependence of incremental on cache.
Should we skip symmetry-based garbage collection (via pdict
) when
len(prefix) == max_prefix_len_check
?
Not properly updating num_children
in cache entries or metrics.cache_size
.
Should we maintain a (lazily / partially materialized?) priority queue ordered by lower bound (to facilitate garbage collection when the min_objective decreases)?
More efficient way to encode lists of integers (currently tuples of integers).
Combine objective-based priority with restriction (prevent excessive exploration of a single subtree).
Restrict search to sub-tree. (Could be implemented by thresholding rules
.
May want this as a subroutine to enable smaller cache entries, and perhaps as
part of a parallel scheme.)
Implement back-off. (Something like put the cache entries back on the queue and
increase the lying/optimistic min_objective
.)
An actual pruning routine.
Rename max_accuracy
to best_prefix_accuracy
.
Remove prefix from cache entry (it's already the corresponding key).
Fix the greedy algorithm to stop early based on evaluating the default rule.
Think about useful heuristics to cut down on the size of the search space.
Cynthia's optimization.
We might want to explore notions of approximate equivalence.