iheartradio / kanaloa

Make your service more resilient by providing protection against traffic oversaturation

Home Page: http://iheartradio.github.io/kanaloa

License: Other

Scala 100.00%
backpressure scala traffic public

kanaloa's Introduction


Kanaloa

Make your service more resilient by providing protection against traffic oversaturation.

Check the website for details: iheartradio.github.io/kanaloa

kanaloa's People

Contributors

gitter-badger, joprice, kailuowang, nsauro, rashadarafeh, williamho


kanaloa's Issues

Randomly failing test on Travis

[error] x stop an underutilizationStreak
[error] 'Some(UnderUtilizationStreak(2015-12-14T16:16:45.093,3))' is not empty (AutoScalingSpec.scala:67)

Create automated stress tests for pushing dispatcher

Incentive

The tests should ensure the following:

  1. kanaloa doesn't introduce significant overhead, and we understand the overhead it does introduce.
  2. kanaloa is capable of delivering the backend's maximum throughput
    1. when traffic is below or at the maximum backend throughput, kanaloa delivers at the same speed as the incoming traffic
    2. when traffic is above the maximum backend throughput, kanaloa delivers the maximum backend throughput and rejects the rest of the incoming traffic.
  3. kanaloa should quickly reject the vast majority of incoming traffic when the backend is unresponsive, and it should be able to sustain such a situation for a long time.
  4. when working with a clustered backend (multiple backend instances), kanaloa should handle the following situations
    1. when one or more backends become unresponsive, kanaloa should be able to deliver the combined maximum throughput of the remaining working backends, and the vast majority of the traffic should be handled by those working backends.
    2. when a new backend joins the cluster, kanaloa should be able to deliver the new maximum combined throughput; the transition may take some time, and we need to understand how long.
  5. kanaloa should not take too many system resources under any circumstances, e.g. when the backend becomes unresponsive, or when the incoming volume is very high (it can't be unlimited, but we need to know the boundary, and it should be much higher than the backend's maximum throughput)

Ideas for implementation

Create an Akka HTTP or Play frontend serving an HTTP interface that talks to a tunable Akka mock backend mimicking a DB service. This mock backend should have a hidden optimal number of concurrent requests, χ, and its throughput should be a unimodal function of the number of concurrent requests, with its maximum at χ.
Use Gatling to test against it, tuning the backend to simulate a backend breakdown.
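
A minimal sketch of such a mock backend, with hypothetical names: the service time grows once in-flight requests exceed χ, so throughput peaks at χ.

import scala.concurrent.duration._
import akka.actor.{Actor, ActorRef}

class MockBackend(chi: Int, baseServiceTime: FiniteDuration) extends Actor {
  import context.dispatcher
  private case class Done(replyTo: ActorRef, req: Any)
  private var inFlight = 0

  // Constant service time up to chi concurrent requests, degrading beyond it,
  // so throughput (inFlight / serviceTime) has its maximum at chi.
  private def serviceTime: FiniteDuration =
    if (inFlight <= chi) baseServiceTime
    else baseServiceTime * (inFlight - chi + 1)

  def receive = {
    case Done(replyTo, req) =>
      inFlight -= 1
      replyTo ! req
    case req =>
      inFlight += 1
      context.system.scheduler.scheduleOnce(serviceTime, self, Done(sender(), req))
  }
}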

Potential for Work dropping

In the "starting" state, there exists a potential case where if there is Work already queued, and the Worker receives another Work message, it will overwrite the queued value.

case work: Work => //this only happens on restarting with a new routee because the old one died
  context become starting(Some((work, sender())), false)

This ticket might be a NOP if we decide to do away with this state entirely, but since this is such a small fix, it would be super easy to sneak in.
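
If the state stays, a minimal fix sketch (hypothetical: it assumes starting is changed to take a Vector[(Work, ActorRef)] named queued) would append instead of overwrite:

case work: Work => //append so no already-queued Work is lost
  context become starting(queued :+ ((work, sender())), false)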

Update documentation

  • Back pressure control
  • Shutdown gracefully
  • How to develop and test
  • Cluster aware back end
  • test results as follows

Test setup

  setUp(scn.inject(
    rampUsers(500) over (5 minutes) //mainly by throttle below
  )).throttle(
    reachRps(200) in (5.minutes),
    holdFor(3.minute)
  )
    .protocols(httpConf)
    .assertions(global.responseTime.percentile3.lessThan(5000)) //95% less than 5s

optimal-concurrency  = 10  //optimal concurrent requests the backend can handle
optimal-throughput =  100    //the optimal throughput (msg / second) the backend can handle
buffer-size = 5000
frontend-timeout = 30s
overload-punish-factor = 0  //between 1 and 0, one gives the maximum punishment while 0 gives none

kanaloa {

  default-dispatcher {
    workTimeout = 1m

    updateInterval = 200ms

    workerPool {
      startingPoolSize = 20
    }

    autothrottle {
      numOfAdjacentSizesToConsiderDuringOptimization = 18
      chanceOfScalingDownWhenFull = 0.3
    }

    backPressure {
      referenceDelay = 1s
      durationOfBurstAllowed = 5s
    }

    metrics {
      enabled = on // turn it off if you don't have a statsD server and hostname set as an env var KANALOA_STRESS_STATSD_HOST
      statsd {
        namespace = kanaloa-stress
        host = ${?KANALOA_STRESS_STATSD_HOST}
        eventSampleRate = 0.25
      }
    }
  }

}


Without Kanaloa

[Screenshots: Gatling response-time table and chart for the run without kanaloa]

With Kanaloa

[Screenshots: Gatling response-time tables and charts for the run with kanaloa, including the ramp-up]

Remove the requirement to have an implicit system when creating a dispatcher Props

Right now, to create a kanaloa dispatcher Props you need an implicit ActorSystem. I'm not entirely sure, but I think this is needed only by the metricsCollector. We should be able to change the code so that the metrics collector is instantiated using the actorRefFactory of the kanaloa Dispatcher. If that's the case, we wouldn't have to use an ActorSystem to create the Props, which currently feels weird.
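
A hedged sketch of that change (Settings and MetricsCollector.props are assumed names, not the actual API):

import akka.actor.{Actor, ActorRef}

class Dispatcher(settings: Settings) extends Actor {
  // context is an ActorRefFactory, so no outside ActorSystem is needed here
  val metricsCollector: ActorRef = context.actorOf(MetricsCollector.props(settings))

  def receive: Receive = { case _ => } // dispatching logic elided
}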

Release new version?

It's been a while (39 commits) since the 3.0 release. Can you release a new version? Thanks

Improve work fail error message

"work failed after 0 try(s)" doesnt make much sense. The ResultCheck catch all failure should also be user friendly enough.

Port the autoscaling logic to a router.

Basic idea:

  1. Create a new RouterPoolActor that
    1. uses the message count in all routees to detect the total queue length
    2. monitors incoming messages together with queue size to determine the dispatch rate
    3. records the recent dispatch rate into a fixed-size log
  2. Create a new Resizer that
    1. samples and records the dispatch rate in the history log
    2. does the rest of the autoscaling logic (see the sketch after this list).
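
A sketch of such a Resizer against Akka's real akka.routing.Resizer interface (the sampling and autoscaling internals are assumed, not implemented):

import scala.collection.immutable
import akka.routing.{Resizer, Routee}

class AutoscalingResizer(sampleEvery: Long) extends Resizer {
  private var rateLog = Vector.empty[Double] // fixed-size log of recent dispatch rates

  // Only run the resize logic every sampleEvery messages
  def isTimeForResize(messageCounter: Long): Boolean =
    messageCounter % sampleEvery == 0

  // Returns the change in pool size: positive to add routees, negative to remove
  def resize(currentRoutees: immutable.IndexedSeq[Routee]): Int = {
    // 1. sample the current dispatch rate and append it to rateLog (measurement assumed)
    // 2. run the autoscaling logic over rateLog to choose a delta
    0 // placeholder: keep the pool size unchanged
  }
}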

Replace WorkHistory with some circular data structure

Currently, QueueProcessor tracks ResultHistory by way of a Vector[Boolean]. This could be simplified with a circular data structure. One simple idea is to wrap an Array[Byte] and keep a pointer at the current index, which resets to 0 when it reaches the end of the array.
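
A minimal sketch of that idea (assumes the buffer has been filled at least once before successRate is read):

class CircularResultHistory(capacity: Int) {
  private val buffer = new Array[Byte](capacity)
  private var index = 0

  def record(success: Boolean): Unit = {
    buffer(index) = if (success) 1 else 0
    index = (index + 1) % capacity // wrap to 0 instead of growing the collection
  }

  def successRate: Double = buffer.count(_ == 1).toDouble / capacity
}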

Dispatcher API return a Future[Dispatcher]

When creating a Dispatcher right now, the Dispatcher constructor returns immediately, even though behind the scenes there is still some initialization going on.

We should change this to return a Future[Dispatcher], and only complete the Future when all underlying Actors have been fully initialized.

In order to achieve this, we will need to spawn an intermediate Actor which creates a Promise. When all underlying Actors are ready, they report to this Actor, which then fulfills the Promise with the Dispatcher instance.

This means that some of the existing Actors will need to report their 'ready' status to this Actor. The details can be hashed out a bit more if/when we decide to do this.
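
A hedged sketch of that intermediate Actor (Ready and the constructor arguments are assumed, not the actual API):

import scala.concurrent.Promise
import akka.actor.Actor

case object Ready // assumed message each underlying Actor sends once initialized

class InitWatcher(expected: Int, dispatcher: Dispatcher, promise: Promise[Dispatcher]) extends Actor {
  private var readyCount = 0

  def receive = {
    case Ready =>
      readyCount += 1
      if (readyCount == expected) {
        promise.success(dispatcher) // all underlying Actors reported in
        context stop self
      }
  }
}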

QueueTerminated/NoMoreWork messages can lose Work (potentially)

Just tracking this:

Right now in the waitingForRoutee state, if there is Work queued and the Worker receives a Terminated(queue) or a NoWorkLeft message, it will stop itself and not execute the Work.

This is also a non-issue if we do away with the waitingForRoutee state.

Customizable way to handle timeouts

As suggested by @nsauro
One idea is to provide a Backoff trait which gets called on every timeout. An instance would need to be created per Worker, and implementations don't need to worry about thread safety, since the trait would only be called from within the Actor's receive function. The idea is that users can provide their own implementation of how they want to handle timeouts, e.g. exponential backoff.

The trait's interface would just be

def handleTimeout(msg?) : Option[FiniteDuration]

We could easily provide a few different, basic implementations, including this algorithm here.
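
A sketch of the trait plus one basic implementation (names are illustrative). Because it is only called from within the Actor's receive, the mutable attempt counter is safe:

import scala.concurrent.duration._

trait Backoff {
  /** Some(delay) to wait before the next attempt, or None to give up. */
  def handleTimeout(msg: Any): Option[FiniteDuration]
}

class ExponentialBackoff(base: FiniteDuration, maxRetries: Int) extends Backoff {
  private var attempts = 0

  def handleTimeout(msg: Any): Option[FiniteDuration] = {
    attempts += 1
    if (attempts > maxRetries) None          // give up after maxRetries timeouts
    else Some(base * (1L << (attempts - 1))) // base, 2*base, 4*base, ...
  }
}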

Investigate respawning Workers if their Routee dies

The Worker is a bit complex, and as a natural result some things get lost in the shuffle. To start thinking of ways to reduce complexity, one idea floated is to eliminate the "waitingForRoutee" state and instead, if a Worker loses its Routee, just kill the Worker and respawn a new one.

More automated stress tests against multiple scenarios

Scenarios

  1. Load balancing when adding a new handler
  2. Load balancing when losing a handler abruptly
  3. Overhead should be less than 1ms
  4. Over-capacity throughput should equal the throughput under optimal concurrency
  5. Under-capacity traffic should have the same throughput/latency as round-robin direct through
  6. Two nodes with 200 RPS and 100 RPS capacity. Make the 200 RPS node unresponsive (or really slow), which brings the capacity below the incoming requests.
  7. Two nodes, each with 300 RPS capacity. One of the cluster nodes becomes unresponsive (or really slow); the remaining node still has enough capacity to serve the incoming requests.

Cleanup/simplify shutdown orchestration

The orchestration of shutdown between QueueProcessor, Queue and Workers is a bit duplicative in terms of who is watching whom.

Currently the QueueProcessor receives the initial Shutdown message and tells the Queue to shut down. The Queue then sends the Workers NoWorkLeft, which begins their termination. However, if the Queue still has Work messages queued, it is unclear what happens to them (I think they just get dropped?). The Queue then enters the retiring state, where it waits for a RetiringTimeout message. Once it receives this, it sends more NoWorkLeft messages to the Workers and then shuts itself down. This termination is watched by both the QueueProcessor and the Workers, which then react to it further: the QueueProcessor sends Retire signals to the Workers.

I don't think there are any real adverse effects here, other than the potential for Work being dropped (which I need to verify), but this could be simplified a bit.

statsd reporter runs out of buffer space under high traffic

realtime-job-service java.net.SocketException: No buffer space available
realtime-job-service at sun.nio.ch.DatagramChannelImpl.send0(Native Method) ~[na:1.8.0_66]
realtime-job-service at sun.nio.ch.DatagramChannelImpl.sendFromNativeBuffer(DatagramChannelImpl.java:521) ~[na:1.8.0_66]
realtime-job-service at sun.nio.ch.DatagramChannelImpl.send(DatagramChannelImpl.java:498) ~[na:1.8.0_66]
realtime-job-service at sun.nio.ch.DatagramChannelImpl.send(DatagramChannelImpl.java:462) ~[na:1.8.0_66]
realtime-job-service at kanaloa.reactive.dispatcher.metrics.StatsDActor.flush(StatsDClient.scala:225) [kanaloa_2.11-0.2.1.jar:0.2.1]
realtime-job-service at kanaloa.reactive.dispatcher.metrics.StatsDActor.kanaloa$reactive$dispatcher$metrics$StatsDActor$$doSend(StatsDClient.scala:192) [kanaloa_2.11-0.2.1.jar:0.2.1]
realtime-job-service at kanaloa.reactive.dispatcher.metrics.StatsDActor$$anonfun$receive$1.applyOrElse(StatsDClient.scala:169) [kanaloa_2.11-0.2.1.jar:0.2.1]
realtime-job-service at akka.actor.Actor$class.aroundReceive(Actor.scala:480) [akka-actor_2.11-2.4.0.jar:na]
realtime-job-service at kanaloa.reactive.dispatcher.metrics.StatsDActor.aroundReceive(StatsDClient.scala:154) [kanaloa_2.11-0.2.1.jar:0.2.1]
realtime-job-service at akka.actor.ActorCell.receiveMessage(ActorCell.scala:525) [akka-actor_2.11-2.4.0.jar:na]
realtime-job-service at akka.actor.ActorCell.invoke(ActorCell.scala:494) [akka-actor_2.11-2.4.0.jar:na]
realtime-job-service at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [akka-actor_2.11-2.4.0.jar:na]
realtime-job-service at akka.dispatch.Mailbox.run(Mailbox.scala:224) [akka-actor_2.11-2.4.0.jar:na]
realtime-job-service at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [akka-actor_2.11-2.4.0.jar:na]
realtime-job-service at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library-2.11.7.jar:na]
realtime-job-service at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library-2.11.7.jar:na]
realtime-job-service at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library-2.11.7.jar:na]
realtime-job-service at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library-2.11.7.jar:na]

remove maxProcessingTime

This was originally meant (I think) to time-cap the execution of a PullingDispatcher. It just adds complexity and can probably be removed. If there is a need to time-cap a Dispatcher, a user can easily do so by scheduling a Shutdown message.
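
A sketch of that workaround, assuming dispatcherRef is the dispatcher's ActorRef, Shutdown is its shutdown message, and system: ActorSystem is in scope:

import scala.concurrent.duration._
import system.dispatcher // ExecutionContext for the scheduler

system.scheduler.scheduleOnce(10.minutes, dispatcherRef, Shutdown)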

Migrate to Typed Actors

Now that the kanaloa system is growing bigger with more types of actors, it might be beneficial to use typed actors instead. This is up for triage.

Queue.Retiring message should be a QueueRejected reason

If something sends a message to a Queue while it is retiring, the Queue should reject it in the same fashion as it rejects work when it is overburdened. In this case, "retiring" should be a reason carried by QueueRejected.

Without this, a sender cannot correlate the rejection with the message, and as such cannot properly recover.
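
A hypothetical sketch of the shape (the reason names are made up; QueueRejected and Work are the existing types):

sealed trait RejectionReason
case object Overburdened extends RejectionReason
case object Retiring extends RejectionReason

case class QueueRejected(work: Work, reason: RejectionReason)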

A centralized way to collect metrics

Right now the dequeue rate metric is collected in Queue. This metric is used by AutoScaling, but the approach is problematic: it also counts Timeouts and Failures as successful dequeues, and we shouldn't count Timeouts as successful dequeues.
We should probably also differentiate between a backend capacity rejection and failures due to other reasons (such as invalid input or external errors). This is indicated in #79.

Cluster aware worker pool

Incentive

Right now kanaloa is not cluster aware, so in a cluster it will sit in front of a cluster-aware router and let the router forward all the messages to remote routees (backend actors). This issue is to mitigate the situation when a portion of the routees become unresponsive (not processing messages) due to some anomaly. In such a case, although other routees are still working perfectly fine, the router will continue to forward the same portion of traffic to the problematic ones. Ideally we want the problematic routees to be avoided or de-prioritized, and let the working routees take on more work.

Solution

Instead of using the cluster-aware group, we use a pool to actually deploy Workers to the routee side. This way each Worker will be bound to a backend actor (through path or ref), and its pulling speed will also be bound to the handling speed of that backend actor. We can then have worker-specific circuit breakers that break for a particular backend.

Challenges

  1. Workers need to be serializable, which includes the ResultChecker, which right now could be an anonymous function (closure). We might require a serializable ResultChecker interface for cluster-aware kanaloa dispatchers (see the sketch after this list).
  2. There might be dynamic settings (e.g. total pool size, global circuit breaker) that would need to be distributed across the cluster, although there might be a way to avoid this.
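
A hypothetical sketch of such an interface for challenge 1 (the result shape is an assumption):

// A named, serializable checker instead of an anonymous closure, so that
// Workers carrying it can be deployed to remote nodes.
trait SerializableResultChecker extends Serializable {
  def check(result: Any): Either[String, Any]
}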

Support non-actor backends

For example, take in a simple function (Req) => Future[Either[Any, Response]] and create an actor out of it.
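
A minimal sketch of such an adapter, using akka.pattern.pipe (names are illustrative):

import scala.concurrent.Future
import akka.actor.{Actor, Props}
import akka.pattern.pipe

class FunctionBackend[Req, Response](f: Req => Future[Either[Any, Response]]) extends Actor {
  import context.dispatcher // ExecutionContext for pipeTo

  def receive = {
    case req => f(req.asInstanceOf[Req]) pipeTo sender() // erasure: trust the caller sends Req
  }
}

object FunctionBackend {
  def props[Req, Response](f: Req => Future[Either[Any, Response]]): Props =
    Props(new FunctionBackend(f))
}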

`delayBeforeNextWork` values aren't applied correctly

A few issues:

  • delayBeforeNextWork is meant to be applied only once, either to sending work to the routee or to asking for more Work. Currently, it is never reset.
  • It's applied twice: when it is set, it is used both as a delay before asking for Work and as a delay before sending the Routee a message.

Let's remove all references to this as function parameters and only use the global one. That should help clear this up.
