
iris's Introduction

Iris - Decentralized cloud messaging

Iris is an attempt at bringing the simplicity and elegance of cloud computing to the application layer. Consumer clouds provide unlimited virtual machines at the click of a button, but leave it to the developer to wire them together. Iris ensures that you can forget about networking challenges and instead focus on solving your own domain problems.

It is a completely decentralized messaging solution for simplifying the design and implementation of cloud services. Among other things, Iris features zero-configuration (i.e. start it up and it will do its magic), semantic addressing (i.e. applications use textual names to address each other), clusters as units (i.e. automatic load balancing between apps of the same name) and perfect secrecy (i.e. all network traffic is encrypted).
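Since the README lacks a quickstart (see the documentation issue further down), here is a rough sketch of what semantic addressing and cluster load balancing look like from the Go binding. The import path, function names and the ServiceHandler interface are recalled from the iris-go v1 binding and may not match the released API exactly, so treat this as illustrative only:

```go
package main

import (
	"fmt"
	"time"

	iris "gopkg.in/project-iris/iris-go.v1" // import path assumed from the iris-go v1 README
)

// EchoHandler implements the (assumed) iris.ServiceHandler interface: every node
// registering under the cluster name "echo" becomes one load-balanced instance.
type EchoHandler struct{}

func (e *EchoHandler) Init(conn *iris.Connection) error         { return nil }
func (e *EchoHandler) HandleBroadcast(msg []byte)               {}
func (e *EchoHandler) HandleRequest(req []byte) ([]byte, error) { return req, nil }
func (e *EchoHandler) HandleTunnel(tun *iris.Tunnel)            { tun.Close() }
func (e *EchoHandler) HandleDrop(reason error)                  {}

func main() {
	// Register into the "echo" cluster through the local relay (default port 55555).
	service, err := iris.Register(55555, "echo", new(EchoHandler), nil)
	if err != nil {
		panic(err)
	}
	defer service.Unregister()

	// A plain client connection: requests are addressed by cluster name only,
	// Iris picks an instance and balances the load between all "echo" members.
	conn, err := iris.Connect(55555)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	reply, err := conn.Request("echo", []byte("hello"), time.Second)
	fmt.Println(string(reply), err)
}
```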

You can find further information on the Iris website and details of the above features in the Core concepts section of The book of Iris. For the scientifically inclined, a small collection of papers featuring Iris is also available.

There is a growing community on Twitter (@iriscmf), Google Groups (project-iris) and GitHub (project-iris).

Compatibility

Since the relay protocol (the communication between Iris and the client libraries) changes from time to time, the compatibility matrix below was introduced to make it easier to find which versions of the client libraries (columns) match which versions of the Iris node (rows).

         iris-erl   iris-go   iris-java   iris-scala
v0.3.x   v1         v1        v1          v1
v0.2.x   v0         v0        -           -
v0.1.x   v0         v0        -           -

Releases

  • Development:
    • Fix Google Compute Engine netmask issue (i.e. retrieve real network configs).
    • Seamlessly use local CoreOS/etcd service as bootstrap seed server.
  • Version 0.3.2: October 4, 2014
    • Use 4x available CPU cores by default (will need a flag for this later).
  • Version 0.3.1: September 22, 2014
    • Open local relay endpoint on both IPv4 and IPv6 (bindings can remain oblivious).
    • Fix bootstrap crash in case of single-host networks (host space < 2 bits).
    • Fix race condition between tunnel construction request and finalization.
  • Version 0.3.0: August 11, 2014
    • Work around upstream Go bug #5395 on Windows.
    • Fix memory leak caused by unreleased connection references.
    • Fix tunnel lingering caused by missing close invocation in Iris overlay.
    • Fix message loss caused by clearing scheduled messages during a relay close.
    • Fix race condition between tunnel construction and operation.
    • Rewrite relay protocol to v1.0-draft2.
      • Proper protocol negotiation (magic string, version numbers).
      • Built-in error fields for remote requests; no need for user wrappers.
      • Tunnel data chunking to support arbitrarily large messages.
      • Size-based tunnel throttling, as opposed to the previous message-count based throttling.
    • Migrate from github.com/karalabe to github.com/project-iris.
  • Version 0.2.0: March 31, 2014
    • Redesigned tunnels based on direct TCP connections.
    • Prioritized system messages over separate control connections.
    • Graceful connection and overlay tear-downs (still plenty to do).
    • Countless stability fixes (too many to enumerate).
  • Version 0.1-pre2 (hotfix): September 11, 2013
    • Fix fast subscription reply only if subscription succeeds.
    • Fix topic self report after a node failure.
    • Fix heart mechanism to report errors not panics, check duplicate monitoring.
    • Fix late carrier heartbeat startup.
    • Fix panic caused by balance requests pending during topic termination.
    • Fix corrupt topic balancer caused by stale parent after removal.
  • Version 0.1-pre: August 26, 2013
    • Initial RFC release.

Contributions

Currently my development aims are to stabilize the project and its language bindings. Hence, although I'm open to and very happy about any and all contributions, the most valuable ones are tests, benchmarks and bug fixes.

Due to the already significant complexity of the project, I kindly ask anyone willing to pitch in to first file an issue with their plans, to achieve the best possible integration :).

Additionally, to prevent copyright disputes and the like, a signed contributor license agreement must be on file before any material can be accepted into the official repositories. These can be filled out online via either the Individual Contributor License Agreement or the Corporate Contributor License Agreement.

iris's People

Contributors

abursavich, funkygao, karalabe


iris's Issues

Tunnel construction race

There's a small race condition during tunnel construction. Sometimes, if there are a lot of threads running, the scheduler pauses the construction thread after it sends the tunneling request and fails to reschedule it until construction finishes; at finalization, if the origin isn't waiting for the result, it is assumed that a timeout occurred.
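A minimal, Iris-free sketch of that pattern: the finalizing side hands back the construction result with a non-blocking send, so if the origin goroutine is parked past its deadline, the result is silently dropped and the origin reports a timeout.

```go
package main

import (
	"fmt"
	"time"
)

// buildTunnel models the origin: it fires a construction request and waits a
// bounded time for the result. A scheduler pause is simulated with a sleep.
func buildTunnel(done chan<- string) {
	result := make(chan string) // unbuffered: the receiver must be actively waiting

	go finalize(result) // remote side finishing the construction

	time.Sleep(20 * time.Millisecond) // origin parked by the scheduler

	select {
	case res := <-result:
		done <- res
	case <-time.After(10 * time.Millisecond):
		done <- "timeout (origin gave up)"
	}
}

// finalize models the remote side: if the origin isn't waiting at the exact
// moment the result is ready, the result is dropped as if a timeout occurred.
func finalize(result chan<- string) {
	select {
	case result <- "tunnel established":
	default:
		// origin not waiting -> construction result lost
	}
}

func main() {
	done := make(chan string, 1)
	buildTunnel(done)
	fmt.Println(<-done)
}
```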

Tunnel close and send race

There's a small race condition (at least it looks like one) when a batch of messages is pushed into a tunnel and the tunnel is then immediately closed. It's unclear whether the issue is on the binding or the relay side; it will need a proper test and/or investigation.

Disallow sending empty messages

It should be decided whether empty messages make sense or not; if not, they should be explicitly disallowed and violating connections dropped to prevent faulty API use.

The current relay protocol spec cannot accept empty tunnel messages, so either a rework is needed there, or empty messages should simply be disabled everywhere.
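If empty messages end up being disallowed, the check itself is cheap on the relay side. A hedged sketch (not the actual relay code) of dropping a violating connection:

```go
package relay

import (
	"errors"
	"net"
)

// handleTunnelData is illustrative only: it validates an inbound tunnel
// message and drops the client connection on a protocol violation.
func handleTunnelData(conn net.Conn, msg []byte) error {
	if len(msg) == 0 {
		// Faulty API use (or a malicious client): terminate the relay connection.
		conn.Close()
		return errors.New("relay: empty tunnel message, connection dropped")
	}
	// ... normal forwarding path would continue here ...
	return nil
}
```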

Publish anomalies

During the pub/sub throughput benchmarks a few anomalies appear (a lot of out-of-bounds events, seemingly looping deliveries). This is worth investigating.

Bootstrapper probing tests

Since the bootstrapper was transitioned from IPs to IPNets (issue #8), the missing probing tests can now be implemented, as it's possible to specify a small enough IP mask.

There are probably some minor adjustments to be made to disable scanning during the probing tests (and vice versa would also be nice).

Client bindings: add support for direct messaging (resource management)

Issue #37 is a prerequisite for this.

Currently Iris supports only group messaging primitives. Although this works for many scenarios, there is one specific use case which it makes quite hard and expensive: custom resource management.

When an application wishes to use its own custom logic for managing some specific resources, it usually entails an origin doing a search for candidates, and those candidates then contacting the origin out of band for/with the resource. This out-of-band addressing currently requires the origin to form a unique cluster in which it is the only participant, to allow remote nodes to find it.

Two specific scenarios come to mind:

  • In the case of the decentralized raytracer, when a user node wishes to have a model rendered, it needs to find idle workers (broadcast or publish), but then also needs to distribute the model to those workers and have the workers respond to the correct "job assigner". This would require a tunnel from the workers to the origin... but how do the workers find the origin?
  • In the case of a workflow system, an originating node issues a search into the system for destinations which contain a combination of needed resources. But those destinations then need to respond to the searcher specifically.

The question is how to achieve this without making it too easy for users to abuse the concept and introduce single points of failure, or whether the issue could be solved without giving access to direct addresses at all.
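For reference, the current single-member-cluster workaround mentioned above roughly looks like this. The message flow in the comments mirrors the hypothetical binding calls sketched in the introduction; none of it is a proposed API.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// replyClusterName generates a unique cluster name that the origin registers
// under, so that workers found via broadcast/publish can address it directly.
// This models the "single participant cluster" workaround, nothing more.
func replyClusterName() string {
	buf := make([]byte, 8)
	rand.Read(buf)
	return "reply-" + hex.EncodeToString(buf)
}

func main() {
	name := replyClusterName()
	fmt.Println("origin registers as:", name)
	// 1. the origin registers a service under `name` (a cluster of one)
	// 2. the origin broadcasts/publishes the job, embedding `name` in the payload
	// 3. each worker opens a tunnel / sends its reply to cluster `name`
}
```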

Proper thread pool termination results in a deadlock?

A proposed fix (issue #4) for the thread pool termination bug (not waiting on workers to finish) results in the iris package tests deadlocking. This might be a bug in the refactored thread pool or buggy usage of pool termination. Further investigation is needed.

Messaging without registered service

[Memo to self]

An Iris client should support some form of messaging calls (typically requests) without registering as a service. The main usage would be initializing the service before accepting inbound messages.

A current hack could be to join as a temporary app group, initialize and then join as the service itself. Yep, ugly.

Describe use cases

We all know the use cases for projects like ZeroRPC and Serf. It seems that they provide similar features (especially Serf). How is Iris any different?

Pastry graceful close timeout leads to some deadlock

The iris req/rep tests sometimes display a deadlock after a few pastry nodes don't clean up fast enough. But it might also be that the deadlock comes first and causes the teardown to time out. It's unclear; this will need investigation.

Maybe race condition in tunnel build/close

In the Erlang binding, if I construct a tunnel and immediately close it, the remote side usually hangs with, I think, Iris never replying to the tear-down request. This seems to be caused by the Iris node not having finished the tunnel construction while already receiving the tear-down, and losing the close message somewhere along the way.

This should never happen, as a tunnel should be fully constructed before either side receives the ack. So this needs investigation; it seems I introduced a bug somewhere along the way.

The commit which illustrates this bug is in iris-erl 9b817dcedeb3b03bc22536e97dce6c46a84b9029

Separate routing and naming hashes?

An interesting idea is whether it would be worthwhile to separate the routing hashes and the naming hashes.

Since the number of nodes is relatively small (thousands, most probably), the Pastry hashes shouldn't exceed 40-48 bits, otherwise the routing tables become sparse.

However, we could allow longer Scribe hashes to support significantly more topics. In essence, only the first N bits of the Scribe hash would be used for routing, and the rest only for demultiplexing among topics with the same rendezvous point.
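A small sketch of the split being proposed: only the leading N bits of a name hash drive routing, while the full hash keeps topics with the same rendezvous point apart. The use of SHA-1 and the 40-bit prefix length here are placeholders, not the actual Iris parameters.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"math/big"
)

const routingBits = 40 // hypothetical routing prefix length

// splitHash returns the routing prefix (first routingBits bits) and the full
// naming hash of a textual topic/cluster name.
func splitHash(name string) (routing, naming *big.Int) {
	sum := sha1.Sum([]byte(name))
	naming = new(big.Int).SetBytes(sum[:])
	// Keep only the top routingBits bits for routing; the remaining bits only
	// demultiplex among topics sharing the same rendezvous point.
	routing = new(big.Int).Rsh(naming, uint(len(sum)*8-routingBits))
	return routing, naming
}

func main() {
	r, n := splitHash("some topic")
	fmt.Printf("routing: %x\nnaming:  %x\n", r, n)
}
```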

Bootstrap probe panic

Iris panics when trying to probe.

eth0 details:

          inet addr:10.240.145.75  Bcast:10.240.145.75  Mask:255.255.255.255
          inet6 addr: fe80::4001:aff:fef0:914b/64 Scope:Link
$ ./iris -dev
Entering developer mode
Generating random RSA key... done.
Generating random cluster name... done.

2014/08/20 04:40:52 main: booting iris overlay...
2014/08/20 04:40:52 scribe: booting with id 1021648966721.
panic: invalid argument to Intn

goroutine 30 [running]:
runtime.panic(0x5ad400, 0xc208001250)
    /usr/local/go/src/pkg/runtime/panic.c:279 +0xf5
math/rand.(*Rand).Intn(0xc2080001e0, 0xffffffffffffffff, 0x0)
    /usr/local/go/src/pkg/math/rand/rand.go:95 +0x71
math/rand.Intn(0xffffffffffffffff, 0x10)
    /usr/local/go/src/pkg/math/rand/rand.go:195 +0x34
github.com/project-iris/iris/proto/bootstrap.(*Bootstrapper).probe(0xc208046480)
    /go/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:242 +0x1b5
created by github.com/project-iris/iris/proto/bootstrap.(*Bootstrapper).Boot
    /go/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:142 +0x78

goroutine 16 [semacquire]:
sync.runtime_Semacquire(0xc208000788)
    /usr/local/go/src/pkg/runtime/sema.goc:199 +0x30
sync.(*WaitGroup).Wait(0xc2080960c8)
    /usr/local/go/src/pkg/sync/waitgroup.go:129 +0x14b
github.com/project-iris/iris/proto/pastry.(*Overlay).Boot(0xc208096000, 0x0, 0x0, 0x0)
    /go/src/github.com/project-iris/iris/proto/pastry/overlay.go:168 +0x426
github.com/project-iris/iris/proto/scribe.(*Overlay).Boot(0xc2080a16d0, 0xc2080009e8, 0x0, 0x0)
    /go/src/github.com/project-iris/iris/proto/scribe/overlay.go:81 +0xe5
github.com/project-iris/iris/proto/iris.(*Overlay).Boot(0xc2080987e0, 0x1f, 0x0, 0x0)
    /go/src/github.com/project-iris/iris/proto/iris/overlay.go:64 +0x50
main.main()
    /go/src/github.com/project-iris/iris/main.go:177 +0x522

goroutine 19 [finalizer wait]:
runtime.park(0x418600, 0x7852e8, 0x783dc9)
    /usr/local/go/src/pkg/runtime/proc.c:1369 +0x89
runtime.parkunlock(0x7852e8, 0x783dc9)
    /usr/local/go/src/pkg/runtime/proc.c:1385 +0x3b
runfinq()
    /usr/local/go/src/pkg/runtime/mgc0.c:2644 +0xcf
runtime.goexit()
    /usr/local/go/src/pkg/runtime/proc.c:1445

goroutine 21 [chan receive]:
github.com/project-iris/iris/system.func·001()
    /go/src/github.com/project-iris/iris/system/system.go:56 +0x54
created by github.com/project-iris/iris/system.init·1
    /go/src/github.com/project-iris/iris/system/system.go:59 +0x3d

goroutine 22 [syscall]:
os/signal.loop()
    /usr/local/go/src/pkg/os/signal/signal_unix.go:21 +0x1e
created by os/signal.init·1
    /usr/local/go/src/pkg/os/signal/signal_unix.go:27 +0x32

goroutine 23 [select]:
github.com/project-iris/iris/heart.(*Heart).beater(0xc20801a910)
    /go/src/github.com/project-iris/iris/heart/heart.go:131 +0x46b
created by github.com/project-iris/iris/heart.(*Heart).Start
    /go/src/github.com/project-iris/iris/heart/heart.go:63 +0x2f

goroutine 24 [select]:
github.com/project-iris/iris/proto/pastry.(*Overlay).acceptor(0xc208096000, 0xc2080a23c0, 0xc2080044e0)
    /go/src/github.com/project-iris/iris/proto/pastry/handshake.go:81 +0xd4a
created by github.com/project-iris/iris/proto/pastry.(*Overlay).Boot
    /go/src/github.com/project-iris/iris/proto/pastry/overlay.go:155 +0x339

goroutine 25 [select]:
github.com/project-iris/iris/proto/pastry.(*Overlay).manager(0xc208096000)
    /go/src/github.com/project-iris/iris/proto/pastry/maintenance.go:76 +0x1006
created by github.com/project-iris/iris/proto/pastry.(*Overlay).Boot
    /go/src/github.com/project-iris/iris/proto/pastry/overlay.go:160 +0x3a3

goroutine 26 [select]:
github.com/project-iris/iris/heart.(*Heart).beater(0xc20801a8c0)
    /go/src/github.com/project-iris/iris/heart/heart.go:131 +0x46b
created by github.com/project-iris/iris/heart.(*Heart).Start
    /go/src/github.com/project-iris/iris/heart/heart.go:63 +0x2f

goroutine 27 [IO wait]:
net.runtime_pollWait(0x7f42c7915580, 0x72, 0x0)
    /usr/local/go/src/pkg/runtime/netpoll.goc:146 +0x66
net.(*pollDesc).Wait(0xc208098220, 0x72, 0x0, 0x0)
    /usr/local/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(*pollDesc).WaitRead(0xc208098220, 0x0, 0x0)
    /usr/local/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(*netFD).accept(0xc2080981c0, 0x6b2d60, 0x0, 0x7f42c79123c8, 0xb)
    /usr/local/go/src/pkg/net/fd_unix.go:409 +0x343
net.(*TCPListener).AcceptTCP(0xc20803c030, 0xecb861dd4, 0x0, 0x0)
    /usr/local/go/src/pkg/net/tcpsock_posix.go:234 +0x5d
github.com/project-iris/iris/proto/stream.(*Listener).accepter(0xc2080a4380, 0x3b9aca00)
    /go/src/github.com/project-iris/iris/proto/stream/stream.go:98 +0x241
created by github.com/project-iris/iris/proto/stream.(*Listener).Accept
    /go/src/github.com/project-iris/iris/proto/stream/stream.go:74 +0x39

goroutine 28 [select]:
github.com/project-iris/iris/proto/session.(*Listener).accepter(0xc208004600, 0x3b9aca00)
    /go/src/github.com/project-iris/iris/proto/session/handshake.go:128 +0x42b
created by github.com/project-iris/iris/proto/session.(*Listener).Accept
    /go/src/github.com/project-iris/iris/proto/session/handshake.go:109 +0x58

goroutine 29 [IO wait]:
net.runtime_pollWait(0x7f42c79154d0, 0x72, 0x0)
    /usr/local/go/src/pkg/runtime/netpoll.goc:146 +0x66
net.(*pollDesc).Wait(0xc208098290, 0x72, 0x0, 0x0)
    /usr/local/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(*pollDesc).WaitRead(0xc208098290, 0x0, 0x0)
    /usr/local/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(*netFD).readFrom(0xc208098230, 0x7f42c4e638a4, 0x5dc, 0x5dc, 0x0, 0x0, 0x0, 0x7f42c79123c8, 0xb)
    /usr/local/go/src/pkg/net/fd_unix.go:259 +0x3db
net.(*UDPConn).ReadFromUDP(0xc20803c038, 0x7f42c4e638a4, 0x5dc, 0x5dc, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/pkg/net/udpsock_posix.go:67 +0x129
github.com/project-iris/iris/proto/bootstrap.(*Bootstrapper).accept(0xc208046480)
    /go/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:198 +0x24a
created by github.com/project-iris/iris/proto/bootstrap.(*Bootstrapper).Boot
    /go/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:141 +0x60

goroutine 31 [runnable]:
github.com/project-iris/iris/proto/bootstrap.(*Bootstrapper).scan(0xc208046480)
    /go/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:279
created by github.com/project-iris/iris/proto/bootstrap.(*Bootstrapper).Boot
    /go/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:143 +0x90
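A plausible reading of the panic above, consistent with the v0.3.1 note about single-host networks (host space < 2 bits): with the /32 netmask reported on eth0, the probed subnet contains exactly one address, so the random host offset ends up being drawn from a non-positive range and math/rand.Intn panics. The snippet below only reproduces the arithmetic; it is not the actual bootstrap code, and the "-2" exclusion is a guess matching the -1 argument visible in the trace.

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
)

func main() {
	// eth0 from the report: 10.240.145.75/32 -> a host space of exactly one address.
	_, ipnet, _ := net.ParseCIDR("10.240.145.75/32")
	ones, bits := ipnet.Mask.Size()
	hosts := 1 << uint(bits-ones) // 1 for a /32

	fmt.Println("addresses in subnet:", hosts)

	// If the probe excludes a couple of addresses (e.g. the node's own) before
	// picking a random target, the argument to Intn becomes -1 here, which is
	// exactly the "invalid argument to Intn" panic.
	defer func() { fmt.Println("recovered:", recover()) }()
	_ = rand.Intn(hosts - 2)
}
```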

Scribe broadcast multiplied

There is a bug during churn when a Scribe node joins while a publish is in flight. If the joining node gets a publish event, it will forward it towards the topic root (even if the event has already been there and begun dissemination), leading to message multiplication.

The reason is that the current Scribe implementation does not track whether a publish is still searching for the topic multicast tree or is already in the dissemination phase. Using two separate operations instead of a single opPub should solve this. It may still lose messages (it will anyway), but it will not duplicate them.

Go routine leak under heavy failures

During system overload (i.e. Pastry DOS tests), when connections get dropped all over the place, a few goroutines leak. Probably some tasks are not pooled and/or tracked, but rather started directly with go <...>.

Relay protocol docs

Although the protocol between a client library and Iris is simple, it is essential to document it properly to allow third-party binding implementations (or even official binding contributions).

Group bursting messages on peer sessions

Currently all messages pass through the peer session links individually. It would probably be worth caching small messages and sending them out as one "message group" to significantly reduce the processing (especially encryption) overhead.

Note that only already-buffered messages should be grouped, and no waiting whatsoever is allowed, to ensure packet latency doesn't jump (I'm looking at you, Erlang). The max size limit could be around 1KB, but this obviously needs benchmarking.
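A minimal sketch of that grouping rule: drain whatever is already buffered on the outbound queue into one batch, stop at a size limit, and never wait for more data. Names and the 1KB limit are placeholders.

```go
package main

import "fmt"

const maxBatchBytes = 1024 // ~1KB, needs benchmarking as noted above

// nextBatch groups messages that are already buffered in the outbound queue.
// It never blocks waiting for additional messages, so latency is unaffected.
func nextBatch(outbound <-chan []byte) [][]byte {
	var batch [][]byte
	size := 0

	// Block only for the first message; everything after is opportunistic.
	msg, ok := <-outbound
	if !ok {
		return nil
	}
	batch, size = append(batch, msg), len(msg)

	for size < maxBatchBytes {
		select {
		case msg, ok := <-outbound:
			if !ok {
				return batch
			}
			batch, size = append(batch, msg), size+len(msg)
		default:
			return batch // nothing else buffered: send what we have
		}
	}
	return batch
}

func main() {
	out := make(chan []byte, 16)
	for i := 0; i < 5; i++ {
		out <- []byte("small message")
	}
	fmt.Println("grouped", len(nextBatch(out)), "messages into one burst")
}
```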

Remove the topic load balancer

Currently Iris clusters and topics use the same implementation internally. However, since topics aren't meant to support requests and tunnels, the load balancer can and should be removed from underneath them. This would also allow spawning significantly more topics, since they would become much more lightweight.

[qos] A failed listener socket panics

If a system is overloaded with connections, after reaching the maximum open file allowance, the listener socket fails, closing the session sink. Since the overlay has no logic to handle listeners that fail at runtime, it currently panics.

Although this should never occur during normal operation, a DOS attack might take advantage of it. Eventually, when Iris reaches high enough stability, this should be addressed.

Subscribe/publish race

There's a race condition (already known, maybe an open issue somewhere) where a subscribe and an immediate publish may lose events. This might be easily fixable with the same trick that solved the tunnel init/allow race.

The issue is visible in the iris-go: pub/sub tests.

Relay handshake timeout

Enforce a time limit within which the relay handshake must complete. It should be on the order of a hundred milliseconds. This should help filter out faulty clients and/or other processes connecting by accident.
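On the relay side this is essentially a deadline around the handshake. A sketch (the 100ms figure is the rough order of magnitude mentioned above, not a decided value, and the types are illustrative):

```go
package relay

import (
	"net"
	"time"
)

const handshakeTimeout = 100 * time.Millisecond // order of magnitude only

// initClient guards the relay handshake with a deadline: faulty clients (or
// unrelated processes connecting by accident) get dropped instead of holding
// a connection open indefinitely.
func initClient(conn net.Conn, handshake func(net.Conn) error) error {
	conn.SetDeadline(time.Now().Add(handshakeTimeout))
	if err := handshake(conn); err != nil {
		conn.Close()
		return err
	}
	// Handshake done: clear the deadline for normal operation.
	return conn.SetDeadline(time.Time{})
}
```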

Windows: tunnel requests don't include dial addresses

When a node requests a tunnel into a remote one, no connection address is specified, and hence the remote side crashes since it's not prepared for an empty address list.

Will need to:

  • Figure out why the address list is empty
  • Make sure a faulty empty list doesn't cause a crash

Lockup during massive churn

While hunting for message loss during churn, I managed to lock up the whole system. Not sure whether routing or handshaking is the culprit. Will need to add churn tests to both and see.

Unclosed tunnels sometimes linger

There's a strange anomaly where a tunnel lingers while both endpoints are already dead. Probably some cleanup is missing during binding closure.

Relay graceful close

After the client requests a close, there should be a grace timeout during which no new requests are accepted, but already running ones are allowed to complete or time out. Additionally, inbound requests should be either forwarded to the client or re-routed to possible new clients.

Graceful tunnel termination should also be inspected, but that will require a larger redesign, so it will probably be done later. Just don't forget about it.

Relay should listen on both IPv4 and IPv6

Currently, languages decide quite randomly whether to use IPv4 or IPv6 when connecting to a service, depending, among other things, probably on the position of the moon too. Since this is a painful game to play in the individual bindings, just have the relay listen on both and let individual clients pick whatever floats their boat.
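The fix sketched out: open explicit IPv4 and IPv6 listeners on the loopback interface and accept on both, so bindings can connect over whichever stack their runtime happens to prefer. The port is the conventional local relay default; the handler is a placeholder.

```go
package main

import (
	"fmt"
	"net"
)

func handleRelay(conn net.Conn) { conn.Close() } // placeholder for the relay logic

func main() {
	// Explicit v4 and v6 loopback endpoints for the local relay endpoint.
	l4, err := net.Listen("tcp4", "127.0.0.1:55555")
	if err != nil {
		panic(err)
	}
	l6, err := net.Listen("tcp6", "[::1]:55555")
	if err != nil {
		panic(err)
	}
	fmt.Println("relay listening on", l4.Addr(), "and", l6.Addr())

	// Accept on both stacks; the relay logic itself is stack-agnostic.
	for _, l := range []net.Listener{l4, l6} {
		go func(l net.Listener) {
			for {
				conn, err := l.Accept()
				if err != nil {
					return
				}
				go handleRelay(conn)
			}
		}(l)
	}
	select {} // block forever
}
```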

Tunnel timeout returns success

It seems that a tunnel which timed out is reported as confirmed by the relay. Will need to write a test case for this in the Go binding and track it down.

Reorganize project

Iris is currently organized as one repo for the core messaging middleware and separate repos for the individual language bindings. Adding Docker support would introduce yet another repo, and opening up the talks and playground would again add a lot of stuff.

It could be worthwhile to reorganize the Iris family of repos into a single large one (i.e. similar to how Camlistore is laid out). Proposals are welcome as to how best to do this. The important thing is that individual libs and the like should be easily and naturally accessible using the core tools of the specific language environments.

Message security check fails during forward

A small bug was introduced with the verification that ensures no plaintext is allowed to cross a wire.

The message was flagged as secure after being encrypted, but the security flag is not transmitted, hence on the other side the flag got unset. This was no problem for delivery (hence why all tests passed), but during forwarding the message doesn't get re-encrypted, so the packet is dropped.

This has been corrected by knowingly flagging all arriving messages as secure. The patch will be committed soon, but it is part of a large refactor still in progress. I've added this bug to mark the branch unstable for the moment.

Relay internals: Add lookup + numerical addressing

Currently all addressing on the binding side is done through strings. On the client and API side this is good and should not change. However, there is an optimization opportunity in the relay protocol.

If the client uses long identification strings (cluster/topic names), it places a double burden on the system: every time a message is sent to that address, the binding needs to pass the long string to the relay node, and on the relay side Iris needs to hash that string into its final numerical Scribe/Pastry address for every message.

This could be made significantly faster by introducing a lookup mechanism into the relay protocol. Whenever a binding encounters a new id string, it looks up the string's hash id (via a relay request) and caches it. Whenever a message is to be sent to an address, the cached numerical id is used instead of the textual one.

The advantage is that Iris currently uses 6-byte internal ids for clusters and topics. It would thus be a much smaller address blob to transfer between binding and relay, and it would also remove the need on the relay side to post-process the address.
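On the binding side the proposed mechanism reduces to a small cache. A sketch, where lookupRemote stands in for the hypothetical relay lookup request (here it just hashes locally and truncates to the 6-byte internal id size mentioned above; the actual hash is not specified here):

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"sync"
)

// addrCache maps textual cluster/topic names to their short numerical ids so
// the long string only needs to be resolved once per binding.
type addrCache struct {
	mu  sync.Mutex
	ids map[string][]byte
}

// lookupRemote is a stand-in for the hypothetical relay lookup request.
func lookupRemote(name string) []byte {
	sum := sha1.Sum([]byte(name))
	return sum[:6]
}

func (c *addrCache) resolve(name string) []byte {
	c.mu.Lock()
	defer c.mu.Unlock()
	if id, ok := c.ids[name]; ok {
		return id
	}
	id := lookupRemote(name)
	c.ids[name] = id
	return id
}

func main() {
	cache := &addrCache{ids: make(map[string][]byte)}
	fmt.Printf("cluster id: %x\n", cache.resolve("some very long cluster name"))
	fmt.Printf("cached:     %x\n", cache.resolve("some very long cluster name"))
}
```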


The question that arises is how to differentiate between clusters and topics in this new scheme, since the old one concatenated prefix values onto the textual address. Prefixing the binary address is obviously not good, since it ruins the uniform distribution of the hashed addresses. Suffixing, on the other hand, might work. This will need a bit of exploration.

The same question arises for the sub-group optimization in the Scribe layer. The solution will probably be the same.

Worker pool - recursively adding items?

Hi everyone,

I've played a bit with the worker pool, which works fine when I'm adding tasks to it that it can crunch on.
However, while experimenting with a piece of C code using OpenMP and trying to see how it would look in Go, I realized that, the way the pool is designed around a FIFO queue, there's a case where the pool deadlocks: creating tasks recursively while waiting for a result.
What seems like an academic, contrived example actually comes up in practice (an irregular algorithm where n tasks are created recursively for benchmarking purposes).

Here's an example: http://play.golang.org/p/eSeh1ijy1g (I've simply copy-pasted the work pool package into the play sandbox, with an example at the top, for the sake of demonstration).

Basically, the recursively called func is scheduled, blocking a place in the queue and waiting for a result that never comes, because the last recursion step that would actually unwind the chain of recursion never gets scheduled, since the pool's queue is already full.

You can increase the pool size, of course, but that only heals the symptoms; you run into the problem again as soon as the recursion is deeper than the size of the pool.

Even though this kind of problem isn't that common, I thought this is a good place to ask, because I'm sure someone has already thought about how this could be solved elegantly in Go and can point me in the right direction.

Thank you,
Artjom
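For reference, a self-contained model of the deadlock described above, using a plain bounded worker pool instead of the Iris pool package: a recursive task ties up a worker while waiting on children that can no longer be scheduled.

```go
package main

import "fmt"

// pool is a toy bounded FIFO worker pool: queue holds pending tasks and a
// fixed number of workers drains it.
type pool struct{ queue chan func() }

func newPool(workers, backlog int) *pool {
	p := &pool{queue: make(chan func(), backlog)}
	for i := 0; i < workers; i++ {
		go func() {
			for task := range p.queue {
				task()
			}
		}()
	}
	return p
}

// fib schedules its sub-problems on the pool and then waits for their results,
// tying up a worker (and queue slots) until the children have run.
func fib(p *pool, n int) int {
	if n < 2 {
		return n
	}
	res := make(chan int, 2)
	p.queue <- func() { res <- fib(p, n-1) }
	p.queue <- func() { res <- fib(p, n-2) }
	return <-res + <-res
}

func main() {
	p := newPool(2, 4)
	// fib(p, 3) still completes. Raise n and every worker ends up blocked inside
	// a parent task, waiting on children that sit in the queue with nobody left
	// to run them: the runtime reports "all goroutines are asleep - deadlock!".
	fmt.Println(fib(p, 3))
}
```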

Documentation and code examples are missing

The README file should at least contain a quickstart section, and the documentation (The book of Iris) should be completed.

What exactly needs to be done in order to proceed?

Missing sections of the book:

  • Overview / Run, Forrest, Run
  • Nuts and bolts / Request/Reply pattern
  • Nuts and bolts / Broadcast pattern
  • Nuts and bolts / Tunnel pattern
  • Nuts and bolts / Publish/Subscribe
  • Skyscrapers & spaceships / Iris inside docker
  • Case studies / Embree decentralized
  • Case studies / RegionRank internals
  • Epilogue / The night is dark and full of terrors

Pending tunnels leave stale goroutines with racey quit

If there are tunnel constructions pending when a client connection drops (usually because of a crashed client or a faulty protocol implementation), tunnels that are successfully established afterwards are left with running goroutines, preventing the GC from cleaning them up.

This bug is a combination of multiple ones in both proto/iris/tunnel and service/relay.

Overlay message loss during churn (graceful shutdown)

Currently the Pastry overlay features no graceful shutdown mechanism to avoid message loss. Graceful termination has already been implemented at the lower session level, but a leave operation should be added to Pastry to prevent peers from sending further messages.

proto/pastry/routing_test.go has been extended to simulate messaging during churn, but a powerful enough machine is needed to actually catch the bug (i.e. enough cores to have one linger at exactly the "wrong" place during shutdown).

Crash during heavy subscribe/unsubscribe

If a topic is subscribed to and immediately unsubscribed, there is probably a race where certain fields aren't initialized yet, leading to a nil pointer dereference.

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x8 pc=0x5553ae]

goroutine 3169783 [running]:
runtime.panic(0x6166a0, 0x781273)
/opt/google/go/src/pkg/runtime/panic.c:279 +0xf5
github.com/project-iris/iris/container/queue.(_Queue).Reset(0x0)
/home/karalabe/work/iris/src/github.com/project-iris/iris/container/queue/queue.go:107 +0x1e
github.com/project-iris/iris/pool.(_ThreadPool).Terminate(0xc208056570, 0x63f101)
/home/karalabe/work/iris/src/github.com/project-iris/iris/pool/thread.go:84 +0x87
github.com/project-iris/iris/service/relay.(_relay).process(0xc20cf3c0c0)
/home/karalabe/work/iris/src/github.com/project-iris/iris/service/relay/proto.go:618 +0x16e
created by github.com/project-iris/iris/service/relay.(_Relay).acceptRelay
/home/karalabe/work/iris/src/github.com/project-iris/iris/service/relay/relay.go:126 +0x572

goroutine 16 [chan receive, 83 minutes]:
main.main()
/home/karalabe/work/iris/src/github.com/project-iris/iris/main.go:200 +0x914

goroutine 19 [finalizer wait]:
runtime.park(0x415a20, 0x7852c8, 0x783da9)
/opt/google/go/src/pkg/runtime/proc.c:1369 +0x89
runtime.parkunlock(0x7852c8, 0x783da9)
/opt/google/go/src/pkg/runtime/proc.c:1385 +0x3b
runfinq()
/opt/google/go/src/pkg/runtime/mgc0.c:2644 +0xcf
runtime.goexit()
/opt/google/go/src/pkg/runtime/proc.c:1445

goroutine 21 [chan receive]:
github.com/project-iris/iris/system.func·001()
/home/karalabe/work/iris/src/github.com/project-iris/iris/system/system.go:56 +0x54
created by github.com/project-iris/iris/system.init·1
/home/karalabe/work/iris/src/github.com/project-iris/iris/system/system.go:59 +0x3d

goroutine 22 [syscall, 83 minutes]:
os/signal.loop()
/opt/google/go/src/pkg/os/signal/signal_unix.go:21 +0x1e
created by os/signal.init·1
/opt/google/go/src/pkg/os/signal/signal_unix.go:27 +0x32

goroutine 23 [select]:
github.com/project-iris/iris/heart.(_Heart).beater(0xc208040910)
/home/karalabe/work/iris/src/github.com/project-iris/iris/heart/heart.go:131 +0x46b
created by github.com/project-iris/iris/heart.(_Heart).Start
/home/karalabe/work/iris/src/github.com/project-iris/iris/heart/heart.go:63 +0x2f

goroutine 24 [select, 83 minutes]:
github.com/project-iris/iris/proto/pastry.(_Overlay).acceptor(0xc20812e000, 0xc2080e8390, 0xc208044540)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/pastry/handshake.go:81 +0xd4a
created by github.com/project-iris/iris/proto/pastry.(_Overlay).Boot
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/pastry/overlay.go:155 +0x339

goroutine 25 [select, 83 minutes]:
github.com/project-iris/iris/proto/pastry.(_Overlay).acceptor(0xc20812e000, 0xc2080e83c0, 0xc2080445a0)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/pastry/handshake.go:81 +0xd4a
created by github.com/project-iris/iris/proto/pastry.(_Overlay).Boot
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/pastry/overlay.go:155 +0x339

goroutine 26 [select]:
github.com/project-iris/iris/proto/pastry.(_Overlay).manager(0xc20812e000)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/pastry/maintenance.go:76 +0x1006
created by github.com/project-iris/iris/proto/pastry.(_Overlay).Boot
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/pastry/overlay.go:160 +0x3a3

goroutine 27 [select]:
github.com/project-iris/iris/heart.(_Heart).beater(0xc208040050)
/home/karalabe/work/iris/src/github.com/project-iris/iris/heart/heart.go:131 +0x46b
created by github.com/project-iris/iris/heart.(_Heart).Start
/home/karalabe/work/iris/src/github.com/project-iris/iris/heart/heart.go:63 +0x2f

goroutine 28 [IO wait]:
net.runtime_pollWait(0x7fc1f8e4a200, 0x72, 0x0)
/opt/google/go/src/pkg/runtime/netpoll.goc:146 +0x66
net.(_pollDesc).Wait(0xc2080e20d0, 0x72, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(_pollDesc).WaitRead(0xc2080e20d0, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(_netFD).accept(0xc2080e2070, 0x6b2e68, 0x0, 0x7fc1f8e433c8, 0xb)
/opt/google/go/src/pkg/net/fd_unix.go:419 +0x343
net.(_TCPListener).AcceptTCP(0xc208076028, 0xecbacce57, 0x0, 0x0)
/opt/google/go/src/pkg/net/tcpsock_posix.go:234 +0x5d
github.com/project-iris/iris/proto/stream.(_Listener).accepter(0xc2080ee3e0, 0x3b9aca00)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/stream/stream.go:98 +0x241
created by github.com/project-iris/iris/proto/stream.(_Listener).Accept
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/stream/stream.go:74 +0x39

goroutine 29 [select, 83 minutes]:
github.com/project-iris/iris/proto/session.(_Listener).accepter(0xc2080446c0, 0x3b9aca00)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/session/handshake.go:128 +0x42b
created by github.com/project-iris/iris/proto/session.(_Listener).Accept
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/session/handshake.go:109 +0x58

goroutine 32 [IO wait]:
net.runtime_pollWait(0x7fc1f8e4a150, 0x72, 0x0)
/opt/google/go/src/pkg/runtime/netpoll.goc:146 +0x66
net.(_pollDesc).Wait(0xc20809c220, 0x72, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(_pollDesc).WaitRead(0xc20809c220, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(_netFD).accept(0xc20809c1c0, 0x6b2e68, 0x0, 0x7fc1f8e433c8, 0xb)
/opt/google/go/src/pkg/net/fd_unix.go:419 +0x343
net.(_TCPListener).AcceptTCP(0xc2080d0010, 0xecbacce57, 0x0, 0x0)
/opt/google/go/src/pkg/net/tcpsock_posix.go:234 +0x5d
github.com/project-iris/iris/proto/stream.(_Listener).accepter(0xc20800e720, 0x3b9aca00)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/stream/stream.go:98 +0x241
created by github.com/project-iris/iris/proto/stream.(_Listener).Accept
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/stream/stream.go:74 +0x39

goroutine 33 [select, 83 minutes]:
github.com/project-iris/iris/proto/session.(_Listener).accepter(0xc208004120, 0x3b9aca00)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/session/handshake.go:128 +0x42b
created by github.com/project-iris/iris/proto/session.(_Listener).Accept
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/session/handshake.go:109 +0x58

goroutine 30 [IO wait]:
net.runtime_pollWait(0x7fc1f8e4a0a0, 0x72, 0x0)
/opt/google/go/src/pkg/runtime/netpoll.goc:146 +0x66
net.(_pollDesc).Wait(0xc2080e2140, 0x72, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(_pollDesc).WaitRead(0xc2080e2140, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(_netFD).readFrom(0xc2080e20e0, 0x7fc1f45898a4, 0x5dc, 0x5dc, 0x0, 0x0, 0x0, 0x7fc1f8e433c8, 0xb)
/opt/google/go/src/pkg/net/fd_unix.go:269 +0x3db
net.(_UDPConn).ReadFromUDP(0xc208076030, 0x7fc1f45898a4, 0x5dc, 0x5dc, 0x0, 0x0, 0x0, 0x0)
/opt/google/go/src/pkg/net/udpsock_posix.go:67 +0x129
github.com/project-iris/iris/proto/bootstrap.(_Bootstrapper).accept(0xc2080e0480)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:198 +0x24a
created by github.com/project-iris/iris/proto/bootstrap.(_Bootstrapper).Boot
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:141 +0x60

goroutine 31 [select]:
github.com/project-iris/iris/proto/bootstrap.(_Bootstrapper).probe(0xc2080e0480)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:267 +0x680
created by github.com/project-iris/iris/proto/bootstrap.(_Bootstrapper).Boot
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:142 +0x78

goroutine 48 [chan receive, 82 minutes]:
github.com/project-iris/iris/proto/bootstrap.(_Bootstrapper).scan(0xc2080e0480)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:349 +0x218
created by github.com/project-iris/iris/proto/bootstrap.(_Bootstrapper).Boot
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:143 +0x90

goroutine 64 [IO wait]:
net.runtime_pollWait(0x7fc1f8e49ff0, 0x72, 0x0)
/opt/google/go/src/pkg/runtime/netpoll.goc:146 +0x66
net.(_pollDesc).Wait(0xc20809c290, 0x72, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(_pollDesc).WaitRead(0xc20809c290, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(_netFD).readFrom(0xc20809c230, 0x7fc1f45378a4, 0x5dc, 0x5dc, 0x0, 0x0, 0x0, 0x7fc1f8e433c8, 0xb)
/opt/google/go/src/pkg/net/fd_unix.go:269 +0x3db
net.(_UDPConn).ReadFromUDP(0xc2080d0018, 0x7fc1f45378a4, 0x5dc, 0x5dc, 0x0, 0x0, 0x0, 0x0)
/opt/google/go/src/pkg/net/udpsock_posix.go:67 +0x129
github.com/project-iris/iris/proto/bootstrap.(_Bootstrapper).accept(0xc208080680)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:198 +0x24a
created by github.com/project-iris/iris/proto/bootstrap.(_Bootstrapper).Boot
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:141 +0x60

goroutine 65 [select]:
github.com/project-iris/iris/proto/bootstrap.(_Bootstrapper).probe(0xc208080680)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:267 +0x680
created by github.com/project-iris/iris/proto/bootstrap.(_Bootstrapper).Boot
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:142 +0x78

goroutine 66 [select]:
github.com/project-iris/iris/proto/bootstrap.(_Bootstrapper).scan(0xc208080680)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:341 +0x59b
created by github.com/project-iris/iris/proto/bootstrap.(_Bootstrapper).Boot
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/bootstrap/bootstrap.go:143 +0x90

goroutine 34 [select, 83 minutes]:
github.com/project-iris/iris/proto/iris.(_Overlay).tunneler(0xc2080e2460, 0xc2081d6580, 0x10, 0x10, 0xc208004540, 0xc2080044e0)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/iris/tunnel.go:87 +0x9cb
created by github.com/project-iris/iris/proto/iris.(_Overlay).Boot
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/iris/overlay.go:93 +0x416

goroutine 35 [IO wait]:
net.runtime_pollWait(0x7fc1f8e49f40, 0x72, 0x0)
/opt/google/go/src/pkg/runtime/netpoll.goc:146 +0x66
net.(_pollDesc).Wait(0xc20809de90, 0x72, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(_pollDesc).WaitRead(0xc20809de90, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(_netFD).accept(0xc20809de30, 0x6b2e68, 0x0, 0x7fc1f8e433c8, 0xb)
/opt/google/go/src/pkg/net/fd_unix.go:419 +0x343
net.(_TCPListener).AcceptTCP(0xc2080d0048, 0xecbacce57, 0x0, 0x0)
/opt/google/go/src/pkg/net/tcpsock_posix.go:234 +0x5d
github.com/project-iris/iris/proto/stream.(_Listener).accepter(0xc2081d89e0, 0x3b9aca00)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/stream/stream.go:98 +0x241
created by github.com/project-iris/iris/proto/stream.(_Listener).Accept
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/stream/stream.go:74 +0x39

goroutine 67 [select, 50 minutes]:
github.com/project-iris/iris/proto/iris.(_Overlay).tunneler(0xc2080e2460, 0xc2081d6590, 0x10, 0x10, 0xc2080fe660, 0xc2080fe600)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/iris/tunnel.go:87 +0x9cb
created by github.com/project-iris/iris/proto/iris.(_Overlay).Boot
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/iris/overlay.go:93 +0x416

goroutine 68 [IO wait]:
net.runtime_pollWait(0x7fc1f8e49e90, 0x72, 0x0)
/opt/google/go/src/pkg/runtime/netpoll.goc:146 +0x66
net.(_pollDesc).Wait(0xc2081e03e0, 0x72, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(_pollDesc).WaitRead(0xc2081e03e0, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(_netFD).accept(0xc2081e0380, 0x6b2e68, 0x0, 0x7fc1f8e433c8, 0xb)
/opt/google/go/src/pkg/net/fd_unix.go:419 +0x343
net.(_TCPListener).AcceptTCP(0xc2080fa2a8, 0xecbacce57, 0x0, 0x0)
/opt/google/go/src/pkg/net/tcpsock_posix.go:234 +0x5d
github.com/project-iris/iris/proto/stream.(_Listener).accepter(0xc2081e2a00, 0x3b9aca00)
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/stream/stream.go:98 +0x241
created by github.com/project-iris/iris/proto/stream.(_Listener).Accept
/home/karalabe/work/iris/src/github.com/project-iris/iris/proto/stream/stream.go:74 +0x39

goroutine 65 [syscall, 83 minutes]:
runtime.goexit()
/opt/google/go/src/pkg/runtime/proc.c:1445

goroutine 69 [IO wait]:
net.runtime_pollWait(0x7fc1f8e49de0, 0x72, 0x0)
/opt/google/go/src/pkg/runtime/netpoll.goc:146 +0x66
net.(_pollDesc).Wait(0xc2081e0450, 0x72, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:84 +0x46
net.(_pollDesc).WaitRead(0xc2081e0450, 0x0, 0x0)
/opt/google/go/src/pkg/net/fd_poll_runtime.go:89 +0x42
net.(_netFD).accept(0xc2081e03f0, 0x6b2e68, 0x0, 0x7fc1f8e433c8, 0xb)
/opt/google/go/src/pkg/net/fd_unix.go:419 +0x343
net.(_TCPListener).AcceptTCP(0xc2080fa2b8, 0xecbacce58, 0x0, 0x0)
/opt/google/go/src/pkg/net/tcpsock_posix.go:234 +0x5d
net.(_TCPListener).Accept(0xc2080fa2b8, 0x0, 0x0, 0x0, 0x0)
/opt/google/go/src/pkg/net/tcpsock_posix.go:244 +0x4b
github.com/project-iris/iris/service/relay.(_Relay).acceptor(0xc2081b5b90)
/home/karalabe/work/iris/src/github.com/project-iris/iris/service/relay/service.go:101 +0x62d
created by github.com/project-iris/iris/service/relay.(*Relay).Boot
/home/karalabe/work/iris/src/github.com/project-iris/iris/service/relay/service.go:72 +0x9b

goroutine 3168815 [runnable]:
github.com/project-iris/iris/service/relay.(_relay).process(0xc208140180)
/home/karalabe/work/iris/src/github.com/project-iris/iris/service/relay/proto.go:570
created by github.com/project-iris/iris/service/relay.(_Relay).acceptRelay
/home/karalabe/work/iris/src/github.com/project-iris/iris/service/relay/relay.go:126 +0x572
exit status 2

Relay kill switch

Currently almost everything can be tested easily in binding implementations, except proper clean-up in case of an Iris node failure. Testing that would require each such test to kill off the Iris node after setting up, which would be very impractical.

A better solution would be to introduce a small kill switch into the relay protocol, through which each connection could trigger an immediate drop by the Iris node. This way we could simulate an Iris crash without having to go through the hassle of actually producing one.

On relay close, clean first, reply later

Currently the relay service immediately acknowledges a client connection closure request. This however leads to a race condition where clients think they've been cleaned up when in reality they haven't been yet.

This becomes obvious during testing (iris-go: req/rep), where one test influences another. If the tests are run individually, all is okay; if multiple run one after the other, new requests get assigned to stale old connections.
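A sketch of the ordering fix: tear everything down before acknowledging the close, so the binding only returns from its close call once the relay has truly forgotten the connection. The type and field names are illustrative, not the actual relay service internals.

```go
package relay

// connection is an illustrative stand-in for the relay's per-client state.
type connection struct {
	terminateWorkers func() // stop/await in-flight request handlers
	closeTunnels     func() // release open tunnels
	sendAck          func() // write the close acknowledgement to the client
}

// handleClose processes a client close request: clean up first, acknowledge
// last, so a subsequent test/client cannot be matched against stale state.
func (c *connection) handleClose() {
	c.terminateWorkers() // wait for running requests to finish or time out
	c.closeTunnels()     // release tunnel resources
	c.sendAck()          // only now tell the client it is safe to proceed
}
```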
