lasp-lang / partisan

High-performance, high-scalability distributed computing for the BEAM.

Home Page: https://partisan.dev

License: Apache License 2.0

Makefile 0.33% Erlang 99.02% Shell 0.61% Dockerfile 0.03%
erlang elixir distributed-systems failure-detection membership gossip-round gossip tcp

partisan's People

Contributors

ankhers, aramallo, benoitc, brunosantiagovazquez, cmeiklejohn, evanmcc, getong, gorbak25, junghunyoo, kianmeng, lrascao, michalmuskala, seancribbs, tsloughter, vagabond, vitorenesduarte, xtian

partisan's Issues

Delayed disconnect can break overlay

Disconnect messages contain neither an epoch nor a birth number, so their delivery is subject to a race condition. A node can be connected to the overlay via random walk to the same node twice in quick succession (with an interleaved leave operation), and a late-arriving disconnect message can then permanently disconnect the node, leaving the overlay disconnected.
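One way to picture a possible fix is to tag every connect with an epoch and to drop disconnects that carry an older one. The following is a minimal, language-neutral sketch in Python, not Partisan's Erlang code; the `Overlay` class and its method names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Overlay:
    """Hypothetical model: remembers the epoch of the latest connect per peer."""
    epochs: dict = field(default_factory=dict)    # peer -> latest connect epoch
    connected: set = field(default_factory=set)   # peers currently connected

    def connect(self, peer: str) -> int:
        # Every connect bumps the epoch; the value would travel with
        # any disconnect later sent for this connection.
        epoch = self.epochs.get(peer, 0) + 1
        self.epochs[peer] = epoch
        self.connected.add(peer)
        return epoch

    def on_disconnect(self, peer: str, epoch: int) -> bool:
        # A disconnect tagged with an older epoch belongs to a previous
        # connection, so it is ignored instead of tearing down the new one.
        if epoch < self.epochs.get(peer, 0):
            return False
        self.connected.discard(peer)
        return True


overlay = Overlay()
first = overlay.connect("b")    # initial connection
second = overlay.connect("b")   # reconnect after an interleaved leave
stale_applied = overlay.on_disconnect("b", first)  # late message from the first epoch
```

With the epoch check in place, the late disconnect is rejected and the second connection survives; a disconnect carrying the current epoch is still honored.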

Test failures for TLS

cc: @seancribbs

2017-06-04 22:08:10.417
Ending test case default_manager_test

%%% partisan_SUITE ==> default_manager_test (group with_tls): FAILED
%%% partisan_SUITE ==> {{assertEqual,[{module,partisan_SUITE},
               {line,888},
               {expression,"ExpectedRand"},
               {expected,0.4626377977231828},
               {value,0.7612390866620362}]},

partisan.cloud down

partisan.cloud appears to have expired and the site is no longer up. Are there plans to restore it?

Registrar URL: whois.enom.com
Updated Date: 2021-01-15T10:20:31Z
Creation Date: 2019-01-14T22:50:35Z
Registry Expiry Date: 2021-01-14T22:50:35Z

update_members should only accept a list of maps

The default peer service manager currently supports the nodes list as both a list of atoms and a list of maps. This results in some confusing hacks and logic to handle.

Requiring the list to use the map representation of a node would clear this up and make it possible to simply use sets to find the difference and intersection of the current membership and the updated membership.
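The set-based comparison could look roughly like the sketch below. It is written in Python for illustration rather than in Partisan's Erlang, and it assumes each node map carries a unique `name` key as in the join examples elsewhere on this page; `diff_membership` is a hypothetical helper name.

```python
def diff_membership(current, updated):
    """Return (to_leave, to_join, unchanged) for two lists of node maps."""
    # Maps are indexed by node name so that set operations apply to the keys.
    cur = {node["name"]: node for node in current}
    new = {node["name"]: node for node in updated}
    to_leave = [cur[name] for name in sorted(cur.keys() - new.keys())]
    to_join = [new[name] for name in sorted(new.keys() - cur.keys())]
    unchanged = [new[name] for name in sorted(cur.keys() & new.keys())]
    return to_leave, to_join, unchanged


node_a = {"name": "a@host", "listen_addrs": [{"ip": "10.0.0.1", "port": 10200}]}
node_b = {"name": "b@host", "listen_addrs": [{"ip": "10.0.0.2", "port": 10200}]}
node_c = {"name": "c@host", "listen_addrs": [{"ip": "10.0.0.3", "port": 10200}]}

to_leave, to_join, unchanged = diff_membership([node_a, node_b], [node_b, node_c])
```

Here `to_leave` holds node a, `to_join` holds node c, and `unchanged` holds node b. Accepting a single representation removes the need to normalize atoms into maps before comparing.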

build with OTP 22 fail

I get the following error while building latest master with OTP 22:

===> Compiling _build/default/lib/partisan/src/partisan_gen_server.erl failed
_build/default/lib/partisan/src/partisan_gen_server.erl:937: sys:get_debug/3: Deprecated function. Incorrectly documented and in fact only for internal use. Can often be replaced with sys:get_log/1.

update members and removing nodes that never became members

I'll be digging in to see how to fix this in partisan in the morning, but I'm opening this now while it is fresh in my mind.

I've discovered that, when using partisan_default_peer_service_manager:update_members/1, partisan will keep trying to connect to a node that is no longer in the list given to update_members. If it is initially called with [A, B] and then with [B], and it was never able to connect to A, then A will not be in the state's membership. As a result, partisan does not issue a leave for A (or whatever the term is for stopping connection attempts to a node) as it would have if A had been connected.
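The behavior described above suggests tracking nodes that are still being dialed separately from connected members, so that both can be reconciled against the new list. The sketch below is an illustrative Python model under that assumption, not the actual manager; the `Manager` class and its fields are hypothetical.

```python
class Manager:
    """Hypothetical model of a peer service manager's membership state."""

    def __init__(self):
        self.members = set()   # nodes with an established connection
        self.pending = set()   # nodes we are still trying to reach

    def update_members(self, nodes):
        wanted = set(nodes)
        left = self.members - wanted        # connected nodes to leave
        cancelled = self.pending - wanted   # never-connected nodes to stop dialing
        self.members -= left
        self.pending = wanted - self.members
        return left, cancelled

    def connected(self, node):
        # Called when a connection attempt for a pending node succeeds.
        if node in self.pending:
            self.pending.discard(node)
            self.members.add(node)


manager = Manager()
manager.update_members(["a", "b"])   # start dialing both
manager.connected("b")               # only b ever connects
left, cancelled = manager.update_members(["b"])
```

In this scenario `left` is empty, because a was never a member, but `cancelled` contains a, so the connection attempts to it can be stopped.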

RPC test fails in 19.3 build

This is because the RPC call relies on string:split, which doesn't exist in 19.3. I've just commented the test out for now, since we hope not to have to support 19.3 for very much longer.

Container issues

When running in GKE -- instances performing the second test of the test suite are running into eval_work_list failures with dets -- presumably there's some sort of table corruption, even though the tables are supposed to not be shared across invocations.

We need to either trace to figure out what is happening, or log the directory and make sure the state directory is nuked between different iterations.

Remove identifiers

In HyParView, when the shuffle occurs and members of the remote passive view are replaced, the nodes that are randomly evicted from the passive view of the recipient should be returned to the sender for addition into its passive view. Right now, the selection of nodes on the recipient side for transmission to the sender is completely random.
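One reading of this proposal is sketched below in Python: the recipient merges the incoming nodes into its passive view and reports which entries it evicted, so that those could be sent back in the shuffle reply. This is an interpretation for illustration only, not the HyParView implementation in Partisan; `handle_shuffle` and its parameters are hypothetical names.

```python
import random


def handle_shuffle(passive, incoming, capacity, rng):
    """Merge `incoming` into `passive`; return (new_passive, evicted)."""
    # Keep only nodes we do not know yet, preserving order and dropping duplicates.
    fresh = [n for n in dict.fromkeys(incoming) if n not in passive]
    kept = list(passive)
    evicted = []
    overflow = len(kept) + len(fresh) - capacity
    # Evict random existing entries to make room for the fresh ones.
    for _ in range(max(0, min(overflow, len(kept)))):
        evicted.append(kept.pop(rng.randrange(len(kept))))
    new_passive = kept + fresh[: capacity - len(kept)]
    return new_passive, evicted


rng = random.Random(7)
new_passive, evicted = handle_shuffle(["a", "b", "c"], ["x", "y"], 4, rng)
```

The `evicted` list is what the recipient would return to the sender instead of an unrelated random sample.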

Symmetric views

The HyParView implementation needs a test added to the test suite to ensure that the views are symmetric.

Hyparview full connectivity is sometimes not being achieved

I have a 9-node cluster (started from a rebar3 shell) using HyParView with a max active size of 3. I then issue a join operation on all nodes at once, joining through the same contact node. On some runs (not all) I end up with an "island" of 3 nodes isolated from the rest. Is there any obvious thing that I might be missing before digging into it?

Send assumes connection is established for channel.

Right now, in do_send_message in the default peer service manager, it's assumed that if a node is connected, that a connection is open for the channel.

Options:

  • Since the default channel is always the first established, use the default channel for the message if a message specific channel has not been established yet; or
  • Buffer the message until the channel for those types of messages is established.
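The first option can be sketched as a lookup that falls back to the default channel. The Python below is illustrative only and does not mirror do_send_message; `pick_connection`, the `connections` mapping and the channel names are assumptions made for the example.

```python
DEFAULT_CHANNEL = "default"


def pick_connection(connections, channel):
    """Pick a connection for `channel` from a peer's {channel: connection} map."""
    conn = connections.get(channel)
    if conn is not None:
        return channel, conn
    # Fall back to the default channel, which is established first.
    if DEFAULT_CHANNEL in connections:
        return DEFAULT_CHANNEL, connections[DEFAULT_CHANNEL]
    # No usable connection yet; the caller has to buffer or drop the message.
    return None


peer = {"default": "conn-1"}
early = pick_connection(peer, "gossip")   # gossip channel not open yet
peer["gossip"] = "conn-2"
later = pick_connection(peer, "gossip")   # dedicated channel now available
```

A trade-off of this option is that messages sent early share the default channel, so any isolation the dedicated channel was meant to provide does not apply to them.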

Thoughts? @slfritchie @seancribbs @tsloughter

Add call emulation for `forward_message`.

Currently, PeerServiceManager:forward_message(Node, Target, Message) uses gen_server:cast to send its message on the recipient node. It would be nice to support the semantics of gen_server:call as well. This presents two issues:

  1. call sets up a monitor on the remote server that it is sending to (I'll address this in another issue).
  2. call blocks until there is a reply. The current code neither blocks nor conveys any reply.

I have some basic code that implements remote sends, casts, and calls. However, it doesn't really work well for calls, since the gen_server:call at the remote end (in partisan_peer_service_server) doesn't do any multiplexing and could be blocked for a long time, interfering with unrelated calls and casts on the same node to node edge. So that needs to be fixed, and replies need to be added.
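The general shape of call emulation on top of a one-way send is a unique reference per request, a blocking wait on the caller side, and a remote end that handles each request without blocking the others. The Python sketch below illustrates that pattern with threads; it is not the code mentioned above, and `Caller`, `remote` and the doubling handler are hypothetical.

```python
import itertools
import queue
import threading


class Caller:
    """Emulates a blocking call over a fire-and-forget `cast` function."""

    def __init__(self, cast):
        self._cast = cast
        self._refs = itertools.count(1)
        self._waiting = {}              # ref -> single-slot reply queue
        self._lock = threading.Lock()

    def call(self, request, timeout=1.0):
        ref = next(self._refs)
        box = queue.Queue(maxsize=1)
        with self._lock:
            self._waiting[ref] = box
        self._cast(ref, request)
        try:
            # Block until the matching reply arrives; raises queue.Empty on timeout.
            return box.get(timeout=timeout)
        finally:
            with self._lock:
                self._waiting.pop(ref, None)

    def on_reply(self, ref, reply):
        with self._lock:
            box = self._waiting.get(ref)
        if box is not None:   # replies for timed-out calls are dropped
            box.put(reply)


def remote(ref, request):
    # Each request runs in its own worker so a slow call cannot block others.
    threading.Thread(target=lambda: caller.on_reply(ref, request * 2)).start()


caller = Caller(remote)
result = caller.call(21)
```

Handling each request in a separate worker is one way to get the multiplexing that the current remote end lacks; monitoring of the remote server is left out of this sketch.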

Explicit leave operation

It's unclear how to explicitly leave from the cluster outside of a failure, given the algorithm in the paper doesn't directly discuss this.

Default membership and actor persistence

The default membership for partisan does not persist the actor for a given instance of the peer service manager: this is something we inherited from Helium's peer service. Because of the use of the ORSWOT, this doesn't necessarily cause a huge problem: each node could potentially be added and removed by two actors, doubling the size complexity. This is based on the original restriction placed on the Plumtree peer service where an actor could only remove itself.

It probably makes the most sense to have a node, when coming back online with its original state, preserve its actor identifier.

Peer service leave failed in two node cluster test

Hi!
I'm running two elixir:1.5.2 nodes using {:lasp, "~> 0.4.0"}

First, I've listed the members in node1

iex(node1@test-elixir-1)1> :partisan_peer_service.members
{:ok, [:"node1@test-elixir-1"]}

then, I've listed the members in node2 and joined node1

iex(node2@test-elixir-2)1> :partisan_peer_service.members
{:ok, [:"node2@test-elixir-2"]}

iex(node2@test-elixir-2)2> :partisan_peer_service.join :"node1@test-elixir-1"
00:29:36.255 [info] Node #{listen_addrs => [#{ip => {10,42,138,182},port => 21000}],name => 'node1@test-elixir-1'} connected, pid: <0.344.0>
:ok
iex(node2@test-elixir-2)3> 00:29:36.295 [info] Node #{listen_addrs => [#{ip => {10,42,138,182},port => 21000}],name => 'node1@test-elixir-1'} connected!
00:29:37.296 [info] Node #{channels => [undefined],listen_addrs => [#{ip => {10,42,138,182},port => 21000}],name => 'node1@test-elixir-1',parallelism => 1} connected, pid: <0.347.0>
00:29:37.345 [info] Node #{channels => [undefined],listen_addrs => [#{ip => {10,42,138,182},port => 21000}],name => 'node1@test-elixir-1',parallelism => 1} connected!

iex(node2@test-elixir-2)4> :partisan_peer_service.members
{:ok, [:"node2@test-elixir-2", :"node1@test-elixir-1"]}

now, in node1

iex(node1@test-elixir-1)2> 00:29:37.077 [info] Node #{channels => [undefined],listen_addrs => [#{ip => {10,42,250,32},port => 21000}],name => 'node2@test-elixir-2',parallelism => 1} connected, pid: <0.346.0>
00:29:37.099 [info] Node #{channels => [undefined],listen_addrs => [#{ip => {10,42,250,32},port => 21000}],name => 'node2@test-elixir-2',parallelism => 1} connected!

iex(node1@test-elixir-1)3> :partisan_peer_service.members
{:ok, [:"node2@test-elixir-2", :"node1@test-elixir-1"]}

leave in node2

iex(node2@test-elixir-2)5> :partisan_peer_service.leave :"node1@test-elixir-1"
:ok
iex(node2@test-elixir-2)6> :partisan_peer_service.members
{:ok, [:"node2@test-elixir-2"]}

and in node1

iex(node1@test-elixir-1)4> :partisan_peer_service.members
{:ok, [:"node2@test-elixir-2", :"node1@test-elixir-1"]}

and after node2 is shut down with Ctrl+C, node1 starts trying to reconnect

iex(node1@test-elixir-1)5> 00:32:16.811 [error] connection socket {connection,#Port<0.14993>,gen_tcp,inet,false} has been remotely closed
00:32:16.812 [error] connection socket {connection,#Port<0.15163>,gen_tcp,inet,false} has been remotely closed
00:32:17.056 [info] Node #{channels => [undefined],listen_addrs => [#{ip => {10,42,250,32},port => 21000}],name => 'node2@test-elixir-2',parallelism => 1} is not connected; initiating.
00:32:17.059 [error] unable to connect to #{channels => [undefined],listen_addrs => [#{ip => {10,42,250,32},port => 21000}],name => 'node2@test-elixir-2',parallelism => 1} due to {error,econnrefused}
00:32:17.059 [info] Node #{channels => [undefined],listen_addrs => [#{ip => {10,42,250,32},port => 21000}],name => 'node2@test-elixir-2',parallelism => 1} failed connection: {error,normal}.

epmd module

I had a thought today after playing with epmdless again. Since partisan already keeps track of host and port, pretty much the main functionality of epmd, it might not be out of scope to include the ability to replace epmd when using partisan.

What do you think?

How to use Partisan ?

Hello, how are you here?

Is there an example of an application written in Elixir using Partisan? How can I bring up a cluster of Elixir applications using Partisan?

rebar_erl_vsn is unmaintained

Is there any reason to use rebar_erl_vsn? It has not been updated on hex since June 2018 and triggers this warning:

===> The erlang version 22.0 is newer then the latest version known to rebar_erl_vsn (21). Features introduced between after 21 will not have flags.

partisan instead of teleport?

As I was once again looking to revive https://github.com/vagabond/teleport I realized partisan might actually be meant for this itself. Description of teleport https://vagabond.github.io/programming/2015/03/31/chasing-distributed-erlang

Basically, is partisan meant to be used for message passing between nodes in place of distributed Erlang? For some reason I had it in my head that it was for fault detection and gossip only, not for my application's processes to send direct messages to each other.

Confusing results from the new multi-ip join

I'm seeing a weird result from upgrading to the current master of partisan using the default peer service manager.

We use DNS to discover the hosts we want to connect to, including their hostnames, ips and ports:

N = #{name => Node,
      listen_addrs => [#{ip => IP, port => PartisanPort}],
      parallelism => 1},
partisan_peer_service:join(N)

In the logs though I continue to see the error:

{unexpected_peer, <node()>, #{listen_addrs => [#{ip => {0,0,0,0}, port => 10200}]}, name => Node, ...}

Those are the defaults that I set through $IP and $PEER_PORT but are never the values sent with the join that gets called above.

I have been unable to track down where it could possibly be ending up deciding it wants to connect to {0,0,0,0}.

My guess was that the node is using defaults when it is connected to from another and tries to initiate. So node-2 connects to node-1 and now node-1 tries to connect to node-2 on {0,0,0,0} even though it was already connecting to node-2 on the IP it got from the DNS query.

But I can't find where that would be happening if it were the case.

Jitter on connect.

In the default manager, when a cluster of nodes learns about a new node in the cluster, they all try to connect to it immediately. Because of the listen queue, this can lead to timeouts, causing the cluster to fail to connect to peers.

One option here is to use a jitter parameter, which would prevent nodes from connecting immediately to a new node they just learn about. In the case of parallelism 5, with 4 channels, this prevents 20 inbound connections from every node in the cluster hitting a new node at the exact same time.
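The arithmetic and the jitter idea can be sketched as follows. This Python is illustrative only; `max_jitter_ms` is an assumed parameter name, not an existing Partisan setting.

```python
import random


def inbound_connections(parallelism, channels):
    """Connections one peer opens to a new node: one per channel per lane."""
    return parallelism * channels


def connect_delay_ms(max_jitter_ms, rng=random):
    """Uniform random delay before dialing a newly learned node."""
    return rng.uniform(0, max_jitter_ms)


per_peer = inbound_connections(5, 4)   # the example above: 5 * 4
rng = random.Random(42)
delays = [connect_delay_ms(1000, rng) for _ in range(8)]   # eight peers, up to 1 s each
```

Because each peer draws its own delay, the connection bursts are spread across the jitter window instead of arriving at the new node's listen queue at the same instant.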

Ensure connections before do_send_message

Before attempting to send a message on any of the connections that exist for any of the members of the active view in HyParView, or the full membership set in the default peer service, the connections should be refreshed using the establish_connections call.

tag ?

I am wondering whether a release (tag) could be cut one of these days on the current code or one of these interesting branches. That would help a lot in releasing a product built on partisan :)

add remote monitoring

In order for the call emulation in #44 to work, and more generally for partisan to act as a full-featured disterl replacement (see #42), we'll need to add remote monitoring. A good design for this doesn't really spring to mind, so I am looking for feedback here.

My initial thought was just to add some monitoring metadata on top of the existing node to node data handling (it would work like hello, I guess?). But that can combine with remote node failures in a complicated way, so I need to read more code to have any better fleshed out ideas.

Race condition on send/join.

There's a race condition between joins and sends.

If a node calls join, and immediately attempts to send a message on that channel, the node might try to send before the connection is established with the default backend. This is because joining is async: the join returns OK immediately, but sending is unavailable until the connection is established, which is a callback into the peer service manager once the connection is open.

This is akin to nosuspend in erlang:send, but not the default behavior of partisan.

Two options exist for handling this:

  • Either buffer messages until the connection is open; however, the question remains on how we handle this buffer and ensure it doesn't grow indefinitely; or
  • Provide a sync_join option that can be used, where the join is a call and blocks until the join has been successfully processed.
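For the buffering option, one simple way to keep the buffer from growing indefinitely is a fixed capacity that drops the oldest message when full. The Python sketch below illustrates that policy only; `PeerOutbox` and its `limit` parameter are hypothetical, and dropping the oldest entry is an assumption rather than a decision made in this issue.

```python
from collections import deque


class PeerOutbox:
    """Buffers sends until the connection opens, up to `limit` messages."""

    def __init__(self, limit):
        self.open = False
        self.sent = []                       # stands in for the real socket
        self.buffer = deque(maxlen=limit)    # oldest entry is dropped when full

    def send(self, msg):
        if self.open:
            self.sent.append(msg)
        else:
            self.buffer.append(msg)

    def on_connected(self):
        # Flush buffered messages in order once the connection is established.
        self.open = True
        while self.buffer:
            self.sent.append(self.buffer.popleft())


outbox = PeerOutbox(limit=2)
for msg in (1, 2, 3):
    outbox.send(msg)        # connection not open yet; message 1 is dropped
outbox.on_connected()
outbox.send(4)
```

A sync_join avoids the drop policy question entirely, at the cost of blocking the caller until the connection is up.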

This was discovered during adapting Riak Core to use partisan, which attempts to send immediately after a join, because a join is blocking to ensure the disterl connection has been established.

Thoughts? @tsloughter @seancribbs @slfritchie

TCP acknowledgements

All of the neighbor, neighbor_accepted and neighbor_rejected messages in the HyParView implementation need acknowledgements, and the messages need to be buffered until they are acknowledged, to ensure the views are symmetric.

status, roadmap, doc?

With the coming stable and open-source release of barrel, I am asking myself about the current status of partisan. Is there a roadmap?

It is unclear what all the features are now that there is some code related to orchestration strategy. How is it used? Which gossip layers are supported? Are the features of these layers all on par?

Is there any user- and developer-oriented documentation somewhere that explains how to use partisan fully and, eventually, how to contribute to it? Even a simple getting-started guide would be helpful. Is there perhaps a paper that shows its usage (with code)?

Transient failures in default_manager_test (with ssl)

There's a transient failure with the default_manager_test that is potentially due to a race condition somewhere in the test with setup and teardown.

%%% partisan_SUITE ==> default_manager_test (group with_tls): FAILED
%%% partisan_SUITE ==> {test_case_failed,"Membership incorrect; node server_125659783_1@leviathan should have [{server_125659783_1,\n                                                                      server_125659783_1@leviathan},\n                                                                     {client_125659783_1,\n                                                                      client_125659783_1@leviathan},\n                                                                     {client_125659783_2,\n                                                                      client_125659783_2@leviathan},\n                                                                     {client_125659783_3,\n                                                                      client_125659783_3@leviathan}] but has [server_125659783_1@leviathan,\n                                                                                                              client_125659783_1@leviathan]"}
