dtqec / anatevka Goto Github PK

View Code? Open in Web Editor NEW

6.0 6.0 1.0 51 KB

A distributed blossom algorithm for minimum-weight perfect matching

License: MIT License

Makefile 0.50% Common Lisp 99.50%

blossom-algorithm common-lisp distributed-computing graph-theory

anatevka's People

Contributors

Stargazers

Watchers

Forkers

gregoryschwing

anatevka's Issues

Convergecast-ify scanning or other larger operations

Follow-on tracking issue for leftover work from previous internal PR

Shut down blossom processes at anatevka's dryad's termination

We leave the subprocesses running, and this is unclean.

Optionally enforce blossom solution to be present in &body

In the blossom test framework, a test passes if it returns a matching which has the same weight as the ones provided in the test definition &body. If the test returns a matching not in the &body, but it is still a perfect match of the same weight as those provided in the solutions, the test still passes and it emits a "found new match" message. For small-scale tests where it is easy to write out all the right answers, I think it is useful to have the option to make the success condition even stronger, enforcing that the found solution is one-of the ones in the &body.

write a solver wrapper for anatevka

To ease things for the novice programmer, we might provide a wrapper that consumes a graph and produces a solution, dealing with all the distributed system set-up / tear-down internally.

Factor out UNLESS ABORTING? supervisor checks

A lingering TODO in src/supervisor.lisp, courtesy of @ecpeterson:

rather than cluttering all the command definitions with checks for UNLESS ABORTING?, consider wrapping that segment of script into an UNLESS node!

Calculate average runtime of blossom tests & enforce

Underlying randomness in the blossom algorithm simulations means that, from run to run, time-to-solution can vary wildly. However, it would be good to gain a better understanding of the average blossom algorithm runtime for different problem graphs, and the blossom test suite offers an opportunity to collect such data. Once we know the average runtime for a particular problem graph, we can then track & enforce that (plus maybe a worst-case for a single run), and as we improve performance we can make the tests stricter.

Log messages don't seem to pretty print anymore

When running reduce-log after the aether refactor, I get this for the recommendation message:

135.9: SUPERVISOR @ #<ADDRESS CHANNEL318588117> logged GOT-RECOMMENDATION:
    (SOURCE-ROOT #<ADDRESS CHANNEL318584031> RECOMMENDATION TRADE-BOUNDARIES
     WEIGHT 0 EDGES
     (#<[#<ADDRESS CHANNEL318572636>: #<ADDRESS CHANNEL318543724>--]--->#<ADDRESS CHANNEL318543726>>))

Think further on the ALL vs SOFT distinction

Especially in MULTIREWEIGHT, which is the only place where we have an ALL pingability after being locked. In a lot of places, SOFT is used when doing a "staleness guard". It also allows for roots that are simultaneously reweighting to communicate. Maybe it matters, maybe it doesn't?

Migrate `CONTRACT` procedure onto the blossom

Currently, the SUPERVISOR is responsible for parsing out a PONG, acquiring locks on the relevant subtrees, sending the subtrees manipulation instructions, and releasing the locks. For all of the tree operations save one, this looks like sending a "trigger" instruction to the relevant sub-tree (e.g., just the particular node, or to the root to be propagated down). However, for CONTRACT, much of the logic is contained inside in the SUPERVISOR's command definitions themselves. This should be moved into command definitions on the BLOSSOM-NODEs, perhaps to be executed "on start-up" as the freshly instantiated blossom begins to process events.

Only adjoin-root up to the topmost blossom

We can save some time on pongs by storing the root address at each node. Then we'd only have to travel to our oldest blossom ancestor to figure out what our total internal weight is before replying with a recommendation. Obviously this means that the root will have to be modified by various operations, but I don't think that will be very difficult. See #22 for additional ideas -- we could potentially pong right away if we're pistil-less, and skip the adjoin-root message altogether.

NB: Not sure if there are any live/deadlock considerations here. Upon changing this, a child could send a reply pong even if the root is starting to lock (before that wouldn't happen because all pongs exit through the root). I don't think there is a problem, but just wanted to point it out.

Refernce to sb-ext means it cannot do the test on other systems except sbcl.

anatevka/tests/blossom.lisp

Line 204 in 7823b02

(sb-ext:gc :full t))

You can try to use trivial-garbage:gc from the system "trivial-garbage" https://github.com/trivial-garbage/trivial-garbage

Write blossom test with random problem graphs

Writing up specific problem graphs is a useful way to further test coverage of the blossom algorithm, but we'll also want more holistic tests of the correctness and performance of the algorithm. One way to achieve this is by writing a random-problem-instance generator, feeding the instances into the algorithm, and checking the results for correctness. This form of testing, though maybe harder to reason about, is likely to make it much easier to test some of the more complicated subroutines of the algorithm. In order to test the correctness of the output, we can either brute-force solve the problems in parallel (which will work for small-scale instances) or use a serial implementation of the blossom algorithm (#37).

Installation and Running Example problems

I installed sbcl and quicklisp. I am installing "aether" and "anatevka" using quicklisp.

I load anatevka and aether using quicklisp

0] (ql:quickload "anatevka")
To load "anatevka":
Load 1 ASDF system:
anatevka
; Loading "anatevka"

("anatevka")

0[2] (ql:quickload "aether")
To load "aether":
Load 1 ASDF system:
aether
; Loading "aether"

("aether")

Then I past the code boxes from the readme into sbcl.

0[2] (defstruct demo-id
"A wrapper for a vertex ID used in the Mathematica blossom demo."
(value nil :type (integer 1 8)))

(defmethod anatevka::vertex-vertex-distance ((id-v demo-id) (id-w demo-id))
(let ((v (demo-id-value id-v))
(w (demo-id-value id-w)))
;; index into the following weighted adjacency matrix
(aref #2A(( 0 40 52 50 46 70 36 46)
(40 0 34 54 28 64 20 6)
(52 34 0 28 34 24 2 30)
(50 54 28 0 42 18 36 8)
(46 28 34 42 0 14 80 22)
(70 64 24 18 14 0 22 64)
(36 20 2 36 80 22 0 80)
(46 6 30 8 22 64 80 0))
(1- v) (1- w))))
DEMO-ID
0[2]

#<STANDARD-METHOD ANATEVKA:VERTEX-VERTEX-DISTANCE (DEMO-ID DEMO-ID) {1001A42053}>

0[2] (let* ((simulation (make-simulation))
;; aether requires us to bind *local-courier*' before spawning processes. (*local-courier* (make-courier :processing-clock-rate 300)) ;; The edges discovered by the algorithm will be announced on this address. (match-address (register)) ;; This process manages graph discovery. (dryad (spawn-process 'dryad :process-clock-rate 20 :debug? t :match-address match-address))) ;; Set up the core simulation components: the network host and the dryad. (simulation-add-event simulation (make-event :callback *local-courier* :time 0)) (simulation-add-event simulation (make-event :callback dryad :time 0)) ;; Prime the dryad with messages to spawn workers for the eight vertices. (loop :for j :from 1 :to 8 :for id := (make-demo-id :value j) :do (send-message (process-public-address dryad) (anatevka::make-message-sow :id id))) ;; Run simulation until maximally matched (i.e., until the dryad terminates). (simulation-run simulation :canary (canary-process dryad)) ;; Read out the match edges from the match-address' mailbox.
(labels ((drain-match-address (&optional acc)
(receive-message (match-address message)
(message-reap
(drain-match-address (list* (message-reap-ids message) acc)))
(otherwise
acc))))

;; Calculate the weight of the matching.
(loop :for (left right) :in (drain-match-address)
      :do (format t "~d --~02d-- ~d~%"
                  (demo-id-value left)
                  (vertex-vertex-distance left right)
                  (demo-id-value right))
      :sum (anatevka::vertex-vertex-distance left right))))

; in: LAMBDA (#:G476)
; (LET* ((SIMULATION (MAKE-SIMULATION))
; (LOCAL-COURIER (MAKE-COURIER :PROCESSING-CLOCK-RATE 300))
; (MATCH-ADDRESS (REGISTER))
; (DRYAD
; (SPAWN-PROCESS 'DRYAD :PROCESS-CLOCK-RATE 20 :DEBUG? T :MATCH-ADDRESS
; MATCH-ADDRESS)))
; (SIMULATION-ADD-EVENT SIMULATION
; (MAKE-EVENT :CALLBACK LOCAL-COURIER :TIME 0))
; (SIMULATION-ADD-EVENT SIMULATION (MAKE-EVENT :CALLBACK DRYAD :TIME 0))
; (LOOP :FOR J :FROM 1 :TO 8
; :FOR ID := (MAKE-DEMO-ID :VALUE J)
; :DO ...)
; (SIMULATION-RUN SIMULATION :CANARY (CANARY-PROCESS DRYAD))
; (LABELS ((DRAIN-MATCH-ADDRESS (&OPTIONAL ACC)
; (RECEIVE-MESSAGE # # #)))
; (LOOP :FOR (LEFT RIGHT) :IN (DRAIN-MATCH-ADDRESS)
; :DO (FORMAT T "~d --~02d-- d%" (DEMO-ID-VALUE LEFT)
; (VERTEX-VERTEX-DISTANCE LEFT RIGHT)
; (DEMO-ID-VALUE RIGHT))
; :SUM (ANATEVKA:VERTEX-VERTEX-DISTANCE LEFT RIGHT))))
;
; caught STYLE-WARNING:
; using the lexical binding of the symbol (COMMON-LISP-USER::LOCAL-COURIER), not the
; dynamic binding, even though the name follows
; the usual naming convention (names like FOO) for special variables
; in: LAMBDA (#:G476)
; (CANARY-PROCESS DRYAD)
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::CANARY-PROCESS

; (MAKE-COURIER :PROCESSING-CLOCK-RATE 300)
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::MAKE-COURIER

; (MAKE-EVENT :CALLBACK LOCAL-COURIER :TIME 0)
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::MAKE-EVENT

; (MAKE-SIMULATION)
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::MAKE-SIMULATION

; (MATCH-ADDRESS MESSAGE)
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::MATCH-ADDRESS
;
; caught WARNING:
; undefined variable: COMMON-LISP-USER::MESSAGE

; (MESSAGE-REAP (DRAIN-MATCH-ADDRESS (LIST* (MESSAGE-REAP-IDS MESSAGE) ACC)))
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::MESSAGE-REAP

; (MESSAGE-REAP-IDS MESSAGE)
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::MESSAGE-REAP-IDS

; (OTHERWISE ACC)
;
; caught WARNING:
; The function OTHERWISE is undefined, and its name is reserved by ANSI CL so
; that even if it were defined later, the code doing so would not be portable.

; (PROCESS-PUBLIC-ADDRESS DRYAD)
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::PROCESS-PUBLIC-ADDRESS

; (RECEIVE-MESSAGE (MATCH-ADDRESS MESSAGE)
; (MESSAGE-REAP (DRAIN-MATCH-ADDRESS (LIST* (MESSAGE-REAP-IDS MESSAGE) ACC)))
; (OTHERWISE ACC))
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::RECEIVE-MESSAGE

; (REGISTER)
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::REGISTER

; (SEND-MESSAGE (PROCESS-PUBLIC-ADDRESS DRYAD)
; (ANATEVKA:MAKE-MESSAGE-SOW :ID ID))
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::SEND-MESSAGE

; (SIMULATION-ADD-EVENT SIMULATION
; (MAKE-EVENT :CALLBACK LOCAL-COURIER :TIME 0))
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::SIMULATION-ADD-EVENT

; (SIMULATION-RUN SIMULATION :CANARY (CANARY-PROCESS DRYAD))
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::SIMULATION-RUN

; (SPAWN-PROCESS 'DRYAD :PROCESS-CLOCK-RATE 20 :DEBUG? T :MATCH-ADDRESS
; MATCH-ADDRESS)
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::SPAWN-PROCESS

; (VERTEX-VERTEX-DISTANCE LEFT RIGHT)
;
; caught STYLE-WARNING:
; undefined function: COMMON-LISP-USER::VERTEX-VERTEX-DISTANCE
;
; compilation unit finished
; Undefined functions:
; CANARY-PROCESS MAKE-COURIER MAKE-EVENT MAKE-SIMULATION MATCH-ADDRESS MESSAGE-REAP MESSAGE-REAP-IDS OTHERWISE PROCESS-PUBLIC-ADDRESS RECEIVE-MESSAGE REGISTER SEND-MESSAGE SIMULATION-ADD-EVENT SIMULATION-RUN SPAWN-PROCESS VERTEX-VERTEX-DISTANCE
; Undefined variable:
; MESSAGE
; caught 2 WARNING conditions
; caught 16 STYLE-WARNING conditions

debugger invoked on a UNDEFINED-FUNCTION in thread
#<THREAD "main thread" RUNNING {1000560083}>:
The function COMMON-LISP-USER::MAKE-SIMULATION is undefined.

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
0: [CONTINUE ] Retry calling MAKE-SIMULATION.
1: [USE-VALUE ] Call specified function.
2: [RETURN-VALUE ] Return specified values.
3: [RETURN-NOTHING] Return zero values.
4: [ABORT ] Reduce debugger level (to debug level 2).
5: Reduce debugger level (to debug level 1).
6: Exit debugger, returning to top level.

("undefined function")
0[3]

Install local assertions into the blossom algorithm

In tracing through the algorithm as part of an earlier internal PR, @karalekas suggested a couple of assertions that could be made while respecting memory locality:

When installing a master edge, check that the source node of the proposed edge matches your address.
When substituting one address for another over a collection of slave targets, check that exactly one such substitution occurs.

There are probably many more assertions that could be introduced which would not violate locality or consume too much runtime—but, in my experience, they are hard to list off without reading through the code line-by-line. This issue is an instruction to take the time to do exactly that.

Deal with partially connected graphs in anatevka

Anatevka's dryad's termination condition assumes that the problem graph is connected. We either need to warn the user or remove this restriction.

Rename forward/finish to broadcast/convergecast

From my readings on distributed algorithms (e.g. these Yale course notes) there is a common pattern wrt gathering information about a distributed network, which looks a lot like our FORWARD-SCAN/SCAN/PING/FINISH-SCAN paradigm, called broadcast and convergecast. In these algorithms, the root of a distributed tree tells all of its leaves to gather some information (broadcast) and then the leaves relay that information back up the tree, with some protocol for merging the information as it is aggregated up the tree (convergecast). This is essentially what we're doing to come up with a supervisor recommendation, but the root and internal (non-leaf) nodes also participate in the information-gathering, and our protocol (unify-pongs) is homogeneously applied to both the gathering (PING) and aggregating (pong throw) stages.

One thing: It's important to note that FINISH-SCAN actually serves two functions: (1) if we were told to scan (i.e. if we're not the root), it just sends our pong up the tree (and then it's merged as part of a sync receive in FORWARD-SCAN) and (2) if we initiated the scan (i.e. we are the root), it uses the merged info to spawn a supervisor. We might be able to break these up into two commands.

Random note: the root necessarily has to gather (PING) after all of its children in order for its information to be merged into the final recommendation.

Use ascii art graphs as test definitions

As Eric commented, with sufficient free time the ascii art pictograph comments in blossom.lisp could actually serve as test definitions, rather than solely visual aids.

A regular hexagon of events:

       o              o              O              O
       |              |              |              |
   o - + - o      o - +   *      o   + - O      o   +   *
   |   |   |  =>          |  ||  |          ||  |   |   |
   o - + - o      O - +   *      o   + - *      o   +   *
       |              |              |              |
       o              O              *              O

Reconsider `WIND-DOWN`

The anatevka dryad "winds down" at close, to allow blossoms to wrap up whatever stale queries they're performing. (If it closes immediately, then an ongoing stale SCAN e.g. might cause an RTS on a blossom, which causes a crash.) It might be better to make the blossoms responsive to RTSes (e.g., by removing themselves from the computation) than to install this arbitrary delay.

Have separate toggle for blossom-node debug logging

Right now, whatever is passed to the initialization of the dryad for PROCESS-DEBUG? is forwarded along to every blossom node that it creates. It might be nice to decouple these settings, and might even help with performance in certain cases where dryad logging is required but blossom logging is not?

Add simulate-until-command to support simultaneity tests

For example, it would be great to test fully-rewinding multireweighting, but there's no way that case will be reached just by setting up a simple supervisor test. This is because a zero-weight edge will be seen earlier in the multireweighting process, causing it to abort. We would need to simulate until reaching MULTIREWEIGHT-BROADCAST-REWEIGHT, then reach in and modify the weight of a nearby node so that there is then a weightless edge between it and the HOLD cluster, and then simulate until dead. I think this feature will also allow us to write tests for SCANing -- we could inject the START-SCAN command and simulate until IDLE.

(defun simulate-until-command (simulation process command &key (start-time 0) timeout)
  "Runs `SIMULATION' until `PROCESS' reaches `COMMAND'."
  (loop :with stopping-time := start-time
        :with delta := (/ (anatevka::process-clock-rate process))
        :while (and (process-command-stack process)
                    (not (= command (anatevka::peek (process-command-stack process)))))
        :when
          (not (process-command-stack process))
          :do (error "SIMULATE-UNTIL-COMMAND supervisor died.")
        :when (and timeout (< timeout (- stopping-time start-time)))
          :do (error "SIMULATE-UNTIL-COMMAND hit a timeout.")
        :do (incf stopping-time delta)
            (simulation-run simulation :until stopping-time)
        :finally (return (values simulation stopping-time))))

Eliminate ADJOIN-ROOT

The current mechanism for replying to a received ping message is to create an empty pong message, wrap both the ping and a pong in an adjoin-root message, send adjoin-root to loopback, use an adjoin-root handler to percolate up the tree and fill out the pong message, and finally (at the top of the tree) to reply with the fully-crafted pong.

This envelope may not be necessary: anyone receiving a ping can tell that they aren't the root and forward the ping appropriately. It would probably cut down a small bit on code complexity to remove this envelope and distribute its slots appropriately.

Implement tie-breaking

A significant amount of simulation time is wasted as blossoms defer to one another's attempts to perform this or that action (typically: a reweighting), waiting for the injected random communication delays to eventually pry them far enough apart from each other that one can sneak in an uninterrupted action. It would be far preferable for the blossoms to defer to each other in a selective way (i.e., according to some priority), so that global action is taken more quickly.

The positivep attribute is degenerate, calculate on the fly

As I was doing some debugging this week, one of the final things I had to fiddle with before inner blossom expansion started working was the positivep attribute -- even when the tree structure was set up correctly, an incorrect setting for positivep would throw off future recommendations and result in invalid states such as a node having two mates. A node's positivity is entirely determined by local information about the tree it is participating in -- for example, a bare vertex is positive, a vertex with a mate but no master is neither positive nor negative, a vertex with a mate that is also its master is positive, and a vertex with a mate that is also its slave is negative. Thus, there is no need to store an additional boolean positivep -- it opens us up to more possibilities for errors during blossom operations, and it can be calculated on the fly.

We don't need to send address in discovery message

As @ecpeterson pointed out, we could use the location to filter the discovery list, rather than sending along the address as well. I'll change this in a follow-up PR.

Code cleanliness: Perform scan in START-INNER-MULTIREWEIGHT

If we do this, then the custom MULTIREWEIGHT-BROADCAST-SCAN, MULTIREWEIGHT-BROADCAST-REWEIGHT, and MULTIREWEIGHT-CHECK-REWINDING commands can be dropped & replaced with their REWEIGHT counterparts.

Write the serial version of the blossom algorithm

It is probably worth writing out the simpler, serial version of the blossom algorithm. It could be used as a stepping stone in some simple demonstrations of how the distributed decoder works, and could be used as a way to check the correctness of matches in some blossom algorithm tests (rather than using a brute-force checker).

Write a validation routine for blossoms & alternating trees

Spin off from an earlier internal PR. We've written a lot of tests now for supervisor and dryad operations, but their usefulness is predicated on the fact that the starting trees in the tests actually are well-formed -- otherwise the behavior might be unexpected. It would be worthwhile to write routines that verify that what we say is a blossom actually has all the characteristics of being a blossom, and that what we say is an alternating tree actually has all the characteristics of being an alternating tree, and then use these in the tests (likely during compile-time).

Write a utility for setting up the dryad in matchmaker tests

It's a bit silly to be copying this from test to test.

(dolist (node original-tree)
  (let* ((location (slot-value node 'anatevka::id))
   (space (anatevka::spacetime-coordinate-space location))
   (address (process-public-address node))
   (internal-weight (anatevka::blossom-node-internal-weight node))
   ;; NOTE: if this is too small, then it's possible that we get
   ;;       conflicting recommendations for the final unify-pong
   (search-distance 4)
   (radius (cons internal-weight search-distance))
   (sprouted? (not (null (anatevka::blossom-node-mate-edge node)))))
    (setf (gethash address (anatevka::dryad-locations dryad))        location
          (gethash location (anatevka::dryad-boundary-mates dryad))  (register)
          (gethash location (anatevka::dryad-search-sizes dryad))    radius
          (gethash space (anatevka::dryad-future-horizons dryad))    100
          (gethash address (anatevka::dryad-sprouted? dryad))        sprouted?)

Add more complicated augment and attach-barbell tests

Currently, we have some simple test coverage for augmentation and barbell attaching, but it would be worthwhile to add some more complex ones that, for example, include trees that branch, or trees that have blossoms.

Tackle STYLE-WARNINGs

Compiling the code from scratch results in a variety of STYLE-WARNINGs. For example, some of the code refactors have left some straggling LET bindings behind. Go through and mop these up.

Consider filtering out fully rewound supervisors

In a previous internal PR, I added rewound supervisors to the reduced log, which is important for doing by-hand debugging when a supervisor partially rewinds its reweighting. However, when a supervisor fully rewinds, it does not need to be included.

Expand blossom test coverage to spacetime graphs

Currently the blossom test suite only supports providing spacelike problem graphs. The issue Eric encountered last week that caused him to change the behavior of the blossom algorithm (s.t. scanning nodes wait for inner/odd-height nodes to become active and return their pings before considering the scan finished) originated from a problem graph that spanned two timesteps (two identical equilateral triangles stacked one atop the other, one timestep apart). Thus, it would be beneficial to extend the define-blossom-test framework to allow coordinates to be sowed at different times. Some "timelike" (i.e. only in the time direction) tests should be included.

Gather hold cluster before spawning supervisor

Follow-on work from previous internal PR -- we could easily gather the multireweight hold cluster in FINISH-SCAN, which eliminates superfluous supervisor spawning, and greatly simplifies the script for the multireweight supervisor operation.