iheartradio / kanaloa

Make your service more resilient by providing protection against traffic oversaturation

Home Page: http://iheartradio.github.io/kanaloa

License: Other

Scala 100.00%
backpressure scala traffic public

kanaloa's Introduction


Kanaloa

Make your service more resilient by providing protection against traffic oversaturation.

Check the website for details: iheartradio.github.io/kanaloa

kanaloa's People

Contributors

gitter-badger, joprice, kailuowang, nsauro, rashadarafeh, williamho


kanaloa's Issues

Randomly failing test on Travis

[error] x stop an underutilizationStreak
[error] 'Some(UnderUtilizationStreak(2015-12-14T16:16:45.093,3))' is not empty (AutoScalingSpec.scala:67)

Create automated stress tests for pushing dispatcher

Incentive

The tests should ensure the following:

  1. kanaloa doesn't introduce significant overhead, and we understand the overhead it does introduce.
  2. kanaloa is capable of delivering the backend's maximum throughput
    1. when traffic is below or at the maximum backend throughput, kanaloa delivers at the same speed as the incoming traffic
    2. when traffic is above the maximum backend throughput, kanaloa delivers the maximum backend throughput and rejects the rest of the incoming traffic.
  3. kanaloa should quickly reject the vast majority of incoming traffic when the backend is unresponsive, and it should be able to sustain such a situation for a long time.
  4. when working with a clustered backend (multiple backend instances), kanaloa should handle the following situations
    1. when one or more backends become unresponsive, kanaloa should be able to deliver the combined maximum throughput of the remaining working backends, and the vast majority of the traffic should be handled by those working backends.
    2. when a new backend joins the cluster, kanaloa should be able to deliver the new maximum combined throughput; the transition may take some time, and we need to understand how long.
  5. kanaloa should not take too many system resources under any circumstances, e.g. when the backend becomes unresponsive, or when the incoming volume is very high (it can't be unlimited, but we need to know the boundary, and it should be much higher than the backend's maximum throughput)

Ideas for implementation

Create an Akka HTTP or Play frontend serving an HTTP interface that talks to a tunable Akka mock backend mimicking a DB service. This mock backend should have a hidden optimal number of concurrent requests, χ, and its throughput should be a unimodal function of the number of concurrent requests, with its maximum at χ.
Use Gatling to test against it, tuning the backend to simulate a backend breakdown.
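
A minimal sketch of such a mock backend, with hypothetical names: the service time grows once in-flight requests exceed χ, so throughput peaks at χ.

import scala.concurrent.duration._
import akka.actor.{Actor, ActorRef}

class MockBackend(chi: Int, baseServiceTime: FiniteDuration) extends Actor {
  import context.dispatcher
  private case class Done(replyTo: ActorRef, req: Any)
  private var inFlight = 0

  // Constant service time up to chi concurrent requests, degrading beyond it,
  // so throughput (inFlight / serviceTime) has its maximum at chi.
  private def serviceTime: FiniteDuration =
    if (inFlight <= chi) baseServiceTime
    else baseServiceTime * (inFlight - chi + 1)

  def receive = {
    case Done(replyTo, req) =>
      inFlight -= 1
      replyTo ! req
    case req =>
      inFlight += 1
      context.system.scheduler.scheduleOnce(serviceTime, self, Done(sender(), req))
  }
}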

Potential for Work dropping

In the "starting" state, there exists a potential case where if there is Work already queued, and the Worker receives another Work message, it will overwrite the queued value.

case work: Work => //this only happens on restarting with a new routee because the old one died
  context become starting(Some((work, sender())), false)

This ticket might be a NOP if we decide to do away with this state entirely, but since this is such a small fix, it would be super easy to sneak in.
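
If the state stays, a minimal fix sketch (hypothetical: it assumes starting is changed to take a Vector[(Work, ActorRef)] named queued) would append instead of overwrite:

case work: Work => //append so no already-queued Work is lost
  context become starting(queued :+ ((work, sender())), false)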

Update documentation

  • Back pressure control
  • Shutdown gracefully
  • How to develop and test
  • Cluster aware back end
  • test results as follows

Test setup

  setUp(scn.inject(
    rampUsers(500) over (5 minutes) //mainly by throttle below
  )).throttle(
    reachRps(200) in (5.minutes),
    holdFor(3.minute)
  )
    .protocols(httpConf)
    .assertions(global.responseTime.percentile3.lessThan(5000)) //95% less than 5s

optimal-concurrency  = 10  //optimal concurrent requests the backend can handle
optimal-throughput =  100    //the optimal throughput (msg / second) the backend can handle
buffer-size = 5000
frontend-timeout = 30s
overload-punish-factor = 0  //between 1 and 0, one gives the maximum punishment while 0 gives none

kanaloa {

  default-dispatcher {
    workTimeout = 1m

    updateInterval = 200ms

    workerPool {
      startingPoolSize = 20
    }

    autothrottle {
      numOfAdjacentSizesToConsiderDuringOptimization = 18
      chanceOfScalingDownWhenFull = 0.3
    }

    backPressure {
      referenceDelay = 1s
      durationOfBurstAllowed = 5s
    }

    metrics {
      enabled = on // turn it off if you don't have a statsD server and hostname set as an env var KANALOA_STRESS_STATSD_HOST
      statsd {
        namespace = kanaloa-stress
        host = ${?KANALOA_STRESS_STATSD_HOST}
        eventSampleRate = 0.25
      }
    }
  }

}


Without Kanaloa

[Screenshots: Gatling response-time table and chart for the run without kanaloa]

With Kanaloa

[Screenshots: Gatling response-time tables and charts for the run with kanaloa, including the ramp-up]

Remove the requirement to have an implicit system when creating a dispatcher Props

Right now, to create a kanaloa dispatcher Props you need an implicit ActorSystem. I'm not entirely sure, but I think this is needed only by the metricsCollector. We should be able to change the code so that the metrics collector is instantiated using the actorRefFactory of the kanaloa Dispatcher. If that's the case, we wouldn't have to use an ActorSystem to create the Props, which currently feels weird.
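
A hedged sketch of that change (Settings and MetricsCollector.props are assumed names, not the actual API):

import akka.actor.{Actor, ActorRef}

class Dispatcher(settings: Settings) extends Actor {
  // context is an ActorRefFactory, so no outside ActorSystem is needed here
  val metricsCollector: ActorRef = context.actorOf(MetricsCollector.props(settings))

  def receive: Receive = { case _ => } // dispatching logic elided
}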

Release new version?

It's been a while (39 commits) since the 3.0 release. Can you release a new version? Thanks

Improve work fail error message

"work failed after 0 try(s)" doesnt make much sense. The ResultCheck catch all failure should also be user friendly enough.

Port the autoscaling logic to a router.

Basic idea:

  1. Create a new RouterPoolActor that
    1. uses the message count in all routees to detect the total queue length
    2. monitors incoming messages together with queue size to determine the dispatch rate
    3. records the recent dispatch rate into a fixed-size log
  2. Create a new Resizer that
    1. samples and records the dispatch rate in the history log
    2. does the rest of the autoscaling logic (see the sketch after this list).
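
A sketch of such a Resizer against Akka's real akka.routing.Resizer interface (the sampling and autoscaling internals are assumed, not implemented):

import scala.collection.immutable
import akka.routing.{Resizer, Routee}

class AutoscalingResizer(sampleEvery: Long) extends Resizer {
  private var rateLog = Vector.empty[Double] // fixed-size log of recent dispatch rates

  // Only run the resize logic every sampleEvery messages
  def isTimeForResize(messageCounter: Long): Boolean =
    messageCounter % sampleEvery == 0

  // Returns the change in pool size: positive to add routees, negative to remove
  def resize(currentRoutees: immutable.IndexedSeq[Routee]): Int = {
    // 1. sample the current dispatch rate and append it to rateLog (measurement assumed)
    // 2. run the autoscaling logic over rateLog to choose a delta
    0 // placeholder: keep the pool size unchanged
  }
}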

Replace WorkHistory with some circular data structure

Currently, QueueProcessor tracks ResultHistory by way of a Vector[Boolean]. This could be simplified with a circular data structure. One simple idea is to wrap an Array[Byte] and keep a pointer at the current index, which resets to 0 when it reaches the end of the array.
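
A minimal sketch of that idea (assumes the buffer has been filled at least once before successRate is read):

class CircularResultHistory(capacity: Int) {
  private val buffer = new Array[Byte](capacity)
  private var index = 0

  def record(success: Boolean): Unit = {
    buffer(index) = if (success) 1 else 0
    index = (index + 1) % capacity // wrap to 0 instead of growing the collection
  }

  def successRate: Double = buffer.count(_ == 1).toDouble / capacity
}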

Dispatcher API return a Future[Dispatcher]

When creating a Dispatcher right now, the Dispatcher constructor returns immediately, even though behind the scenes there is still some initialization going on.

We should change this to return a Future[Dispatcher], and only complete the Future when all underlying Actors have been fully initialized.

In order to achieve this, we will need to spawn an intermediate Actor which creates a Promise. When all underlying Actors are ready, they report to this Actor, which then fulfills the Promise with the Dispatcher instance.

This means that some of the existing Actors will need to report their 'ready' status to this Actor. The details can be hashed out a bit more if/when we decide to do this.
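
A hedged sketch of that intermediate Actor (Ready and the constructor arguments are assumed, not the actual API):

import scala.concurrent.Promise
import akka.actor.Actor

case object Ready // assumed message each underlying Actor sends once initialized

class InitWatcher(expected: Int, dispatcher: Dispatcher, promise: Promise[Dispatcher]) extends Actor {
  private var readyCount = 0

  def receive = {
    case Ready =>
      readyCount += 1
      if (readyCount == expected) {
        promise.success(dispatcher) // all underlying Actors reported in
        context stop self
      }
  }
}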

QueueTerminated/NoMoreWork messages can lose Work (potentially)

Just tracking this:

Right now in the waitingForRoutee state, if there is Work queued and the Worker receives a Terminated(queue) or a NoWorkLeft message, it will stop itself and not execute the Work.

This is also a non-issue if we do away with the waitingForRoutee state.

Customizable way to handle timeouts

As suggested by @nsauro
One idea is to provide a Backoff trait which gets called on every timeout. An instance would need to be created per Worker, and implementations don't need to worry about thread safety, since the trait would only be called from within the Actor's receive function. The idea is that users can provide their own implementation of how they want to handle timeouts, e.g. exponential backoff.

The trait's interface would just be

def handleTimeout(msg?) : Option[FiniteDuration]

We could easily provide a few different, basic implementations, including this algorithm here.
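
A sketch of the trait plus one basic implementation (names are illustrative). Because it is only called from within the Actor's receive, the mutable attempt counter is safe:

import scala.concurrent.duration._

trait Backoff {
  /** Some(delay) to wait before the next attempt, or None to give up. */
  def handleTimeout(msg: Any): Option[FiniteDuration]
}

class ExponentialBackoff(base: FiniteDuration, maxRetries: Int) extends Backoff {
  private var attempts = 0

  def handleTimeout(msg: Any): Option[FiniteDuration] = {
    attempts += 1
    if (attempts > maxRetries) None          // give up after maxRetries timeouts
    else Some(base * (1L << (attempts - 1))) // base, 2*base, 4*base, ...
  }
}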

Investigate respawning Workers if their Routee dies

The Worker is a bit complex, and as a natural result some things get lost in the shuffle. To start thinking of ways to reduce complexity, one idea floated is to eliminate the "waitingForRoutee" state and instead, if a Worker loses its Routee, just kill the Worker and respawn a new one.

More automated stress tests against multiple scenarios

Scenarios

  1. Load balancing when adding a new handler
  2. Load balancing when losing a handler abruptly
  3. Overhead should be less than 1ms
  4. Over-capacity throughput should equal the throughput under optimal concurrency
  5. Under-capacity traffic should have the same throughput/latency as round-robin direct through
  6. Two nodes with 200 RPS and 100 RPS capacity. Make the 200 RPS node unresponsive (or really slow), which brings the capacity below the incoming requests.
  7. Two nodes, each with 300 RPS capacity. One of the cluster nodes becomes unresponsive (or really slow); the remaining node still has enough capacity to serve the incoming requests.

Cleanup/simplify shutdown orchestration

The orchestration of shutdown between QueueProcessor, Queue and Workers is a bit duplicative in terms of who is watching whom.

Currently the QueueProcessor receives the initial Shutdown message and tells the Queue to shut down. The Queue then sends the Workers NoWorkLeft, which begins their termination. However, if the Queue still has Work messages queued, it is unclear what happens to them (I think they just get dropped?). The Queue then enters the retiring state, where it waits for a RetiringTimeout message. Once it receives this, it sends more NoWorkLeft messages to the Workers and then shuts itself down. This termination is watched by both the QueueProcessor and the Workers, which then react to it further: the QueueProcessor sends Retire signals to the Workers.

I don't think there are any real adverse effects here, other than the potential for Work being dropped (which I need to verify), but this could be simplified a bit.

statsd reporter runs out of buffer space under high traffic

realtime-job-service java.net.SocketException: No buffer space available
realtime-job-service at sun.nio.ch.DatagramChannelImpl.send0(Native Method) ~[na:1.8.0_66]
realtime-job-service at sun.nio.ch.DatagramChannelImpl.sendFromNativeBuffer(DatagramChannelImpl.java:521) ~[na:1.8.0_66]
realtime-job-service at sun.nio.ch.DatagramChannelImpl.send(DatagramChannelImpl.java:498) ~[na:1.8.0_66]
realtime-job-service at sun.nio.ch.DatagramChannelImpl.send(DatagramChannelImpl.java:462) ~[na:1.8.0_66]
realtime-job-service at kanaloa.reactive.dispatcher.metrics.StatsDActor.flush(StatsDClient.scala:225) [kanaloa_2.11-0.2.1.jar:0.2.1]
realtime-job-service at kanaloa.reactive.dispatcher.metrics.StatsDActor.kanaloa$reactive$dispatcher$metrics$StatsDActor$$doSend(StatsDClient.scala:192) [kanaloa_2.11-0.2.1.jar:0.2.1]
realtime-job-service at kanaloa.reactive.dispatcher.metrics.StatsDActor$$anonfun$receive$1.applyOrElse(StatsDClient.scala:169) [kanaloa_2.11-0.2.1.jar:0.2.1]
realtime-job-service at akka.actor.Actor$class.aroundReceive(Actor.scala:480) [akka-actor_2.11-2.4.0.jar:na]
realtime-job-service at kanaloa.reactive.dispatcher.metrics.StatsDActor.aroundReceive(StatsDClient.scala:154) [kanaloa_2.11-0.2.1.jar:0.2.1]
realtime-job-service at akka.actor.ActorCell.receiveMessage(ActorCell.scala:525) [akka-actor_2.11-2.4.0.jar:na]
realtime-job-service at akka.actor.ActorCell.invoke(ActorCell.scala:494) [akka-actor_2.11-2.4.0.jar:na]
realtime-job-service at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [akka-actor_2.11-2.4.0.jar:na]
realtime-job-service at akka.dispatch.Mailbox.run(Mailbox.scala:224) [akka-actor_2.11-2.4.0.jar:na]
realtime-job-service at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [akka-actor_2.11-2.4.0.jar:na]
realtime-job-service at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library-2.11.7.jar:na]
realtime-job-service at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library-2.11.7.jar:na]
realtime-job-service at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library-2.11.7.jar:na]
realtime-job-service at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library-2.11.7.jar:na]

remove maxProcessingTime

This was originally meant (I think) to time-cap the execution of a PullingDispatcher. It just adds complexity and can probably be removed. If there is a need to time-cap a Dispatcher, a user can easily do so by scheduling a Shutdown message.
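
A sketch of that workaround, assuming dispatcherRef is the dispatcher's ActorRef, Shutdown is its shutdown message, and system: ActorSystem is in scope:

import scala.concurrent.duration._
import system.dispatcher // ExecutionContext for the scheduler

system.scheduler.scheduleOnce(10.minutes, dispatcherRef, Shutdown)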

Migrate to Typed Actors

Now that the kanaloa system is growing bigger with more types of actors, it might be beneficial to use typed actors instead. This is up for triage.

Queue.Retiring message should be a QueueRejected reason

If something sends a message to a Queue while it is retiring, the Queue should reject it in the same fashion as it rejects work when it is overburdened. In this case, "retiring" should be a reason carried by QueueRejected.

Without this, a sender cannot correlate the rejection with the message, and as such cannot properly recover.
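
A hypothetical sketch of the shape (the reason names are made up; QueueRejected and Work are the existing types):

sealed trait RejectionReason
case object Overburdened extends RejectionReason
case object Retiring extends RejectionReason

case class QueueRejected(work: Work, reason: RejectionReason)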

A centralized way to collect metrics

Right now the dequeue rate metric is collected in Queue. This metric is used by AutoScaling, but the approach is problematic: it also counts Timeouts and Failures as successful dequeues, and we shouldn't count Timeouts as successful dequeues.
We should probably also differentiate between a backend capacity rejection and failures due to other reasons (such as invalid input or external errors). This is indicated in #79.

Cluster aware worker pool

Incentive

Right now kanaloa is not cluster aware, so in a cluster it will sit in front of a cluster-aware router and let the router forward all the messages to remote routees (backend actors). This issue is to mitigate the situation when a portion of the routees become unresponsive (not processing messages) due to some anomaly. In such a case, although other routees are still working perfectly fine, the router will continue to forward the same portion of traffic to the problematic ones. Ideally we want the problematic routees to be avoided or de-prioritized, and let the working routees take on more work.

Solution

Instead of using the cluster-aware group, we use a pool to actually deploy Workers to the routee side. This way each Worker will be bound to a backend actor (through path or ref), and its pulling speed will also be bound to the handling speed of that backend actor. We can then have worker-specific circuit breakers that break for a particular backend.

Challenges

  1. Workers need to be serializable, which includes the ResultChecker, which right now could be an anonymous function (closure). We might require a serializable ResultChecker interface for cluster-aware kanaloa dispatchers (see the sketch after this list).
  2. There might be dynamic settings (e.g. total pool size, global circuit breaker) that would need to be distributed across the cluster, although there might be a way to avoid this.
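
A hypothetical sketch of such an interface for challenge 1 (the result shape is an assumption):

// A named, serializable checker instead of an anonymous closure, so that
// Workers carrying it can be deployed to remote nodes.
trait SerializableResultChecker extends Serializable {
  def check(result: Any): Either[String, Any]
}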

Support non-actor backends

For example, take in a simple function (Req) => Future[Either[Any, Response]] and create an actor out of it.
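
A minimal sketch of such an adapter, using akka.pattern.pipe (names are illustrative):

import scala.concurrent.Future
import akka.actor.{Actor, Props}
import akka.pattern.pipe

class FunctionBackend[Req, Response](f: Req => Future[Either[Any, Response]]) extends Actor {
  import context.dispatcher // ExecutionContext for pipeTo

  def receive = {
    case req => f(req.asInstanceOf[Req]) pipeTo sender() // erasure: trust the caller sends Req
  }
}

object FunctionBackend {
  def props[Req, Response](f: Req => Future[Either[Any, Response]]): Props =
    Props(new FunctionBackend(f))
}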

`delayBeforeNextWork` values aren't applied correctly

A few issues:

  • delayBeforeNextWork is meant to be applied only once, either to sending work to the routee or to asking for more Work. Currently, it is never reset.
  • It's applied twice: when it is set, it is used both as a delay before asking for Work and as a delay before sending the Routee a message.

Let's remove all references to this as function parameters and only use the global one. That should help clear this up.
