Coder Social home page Coder Social logo

dropbox / load_management Goto Github PK

View Code? Open in Web Editor NEW
53.0 17.0 15.0 149 KB

This repository contains Go utilities for managing isolation and improving reliability of multi-tenant systems.

License: Apache License 2.0

Go 100.00%
go throttling admission-controller

load_management's Introduction

Dropbox Load Management Libraries

Build Status GoDoc

This repository contains Go utilities for managing isolation and improving reliability of multi-tenant systems. These utilities might be helpful additions to your production toolbox if:

  1. You program in Go.
  2. Your system is shared by multiple tenants and there is an expectation that one tenant's activity on the system should not adversely affect another tenant's workload.
  3. There is some way to interpose on the shared resource, like a proxy in front of the database or within the database code directly.
  4. Preventing disastrous worst case behavior is a higher priority than extracting the highest performance possible from the system.
  5. You may not have the luxury of designing limitations into your system from scratch, but may instead have to progressively apply stricter limits in production.

Contained within this repository are utilities pulled from production systems at Dropbox where they were developed to improve the reliability of multiple petabyte-scale storage system used by most of the company. While the utilities compose together to provide load management, they are also useful in isolation. The specific utilities are:

  • admission_control: Semaphores with explicit timeout and ordering properties.
  • load_manager: Combine scorecard, admission control into a coherent solution for selectively shedding load.
  • scorecard: Tag traffic and decide what to do with traffic using the tags.

Collectively, these utilities give us the pieces we need to be able to limit the number of active requests for the tagged traffic. The benefit of this is that it can protect processes from overusing shared resources at the risk of possibly underutilizing idle resources.

Quick Start

This section provides a guide to quickly understanding what's available in this repository and how it could be used to improve a production system. It complements the documentation by providing a high level overview and introduction.

First, we will setup an instance of the load manager in the global state of our application. This will allow us to maintain a single load manager throughout the lifetime of our process. It looks something like this:

const mainQueueName = "main"

var loadManager *load_manager.LoadManager
var loadManagerOnce sync.Once

func LoadManager() *load_manager.LoadManager {
    loadManagerOnce.Do(func() {
        loadManager = load_manager.NewLoadManager(
            map[string]ac.AdmissionController{
                mainQueueName: ac.NewAdmissionController(1),
            },
            ac.NewAdmissionController(1),
            scorecard.NewScorecard(scorecard.NoRules()),
            nil, // No canary scorecard
            scorecard.NoTags(), // No default tags.
        )
    })
    return loadManager
}

This will give us a single load manager that we can access from anywhere within our server process; we can then use the load manager to decide whether to accept and process traffic. In a typical usage scenario it looks something like this:

// Sleep for 10 seconds and then returns nil.
func Sleep(ctx context.Context, conn net.Conn, data []byte) error {
    maybeHostPort := conn.RemoteAddr().String()
    host, _, _ := net.SplitHostPort(maybeHostPort)
    requestSize := int(math.Log10(len(data)))
    tags := []scorecard.Tag{
        scorecard.MakeStringTag("handler", SleepHandlerName),
        scorecard.MakeStringTag("source_ip", host),
        scorecard.MakeIntTag("request_size", requestSize),
    }
    resource := LoadManager().GetResource(ctx, mainQueueName, tags)
    if !resource.Acquired() {
        return fmt.Errorf("Error acquiring resource")
    }
    defer resource.Release()
    // begin original func SleepHandler here
    time.Sleep(10 * time.Second)
    return nil
}

What we see in this example is that a preamble consisting of just a few lines can provide some protection for our original, unprotected, function. The first few lines build up a set of "tags" for the traffic that describe the context of the current execution. These tags could be anything from the name of the function Sleep to an identifier of the end user who initiated the request way off in some distant microservice. Tags are quite literally strings that describe the attributes we'd like to use when deciding to accept or reject a request.

We provide these tags to the load manager and get a resource that allows us to decide whether or not to continue processing the function Sleep. Internally, the load manager uses adminssion control to track the number of goroutines holding a resource and prevents too many from concurrently entering the original function. The code above uses a defer to make sure the resource is released back to the load manager for subsequent requests.

Out of the box, the load manager will enforce limitations on the number of concurrent active requests using the configured AdmissionController(s). To preferentially throttle load, we need to provide the load manager with a scorecard containing a set of "rules" that describe how we want to shape our traffic. A rule is a pair that consists of a pattern for matching tags and an integer specifying the number of requests that match the tag to be concurrently admitted to the system. For example, we could specify the rule source_ip:127.0.0.1,10 to specify that any given client host is permitted to hold ten tickets concurrently. An attempt to acquire an eleventh ticket will be rejected.

For maximum flexibility, the rules permit the use of wild-cards and conjunctions. For example, manually providing a "source_ip:X" rule is relatively straightforward, but as the number of values of X increases, it becomes toil-some to keep the rules in sync. Instead, we can provide a rule that matches any source, like source_ip:*,10. This will allow each tag that matches the rule, e.g. source_ip:1.1.1.1 or source_ip:2.2.2.2, to have up to 10 concurrent requests. Thus, the total number of requests permitted will be the lesser of 10 times the number of distinct source_ips in our system or the load manager's limit on the number of concurrently-outstanding tickets.

Conjunctions further increase the flexibility of rules by allowing a rule to match only when a request matches a set of tags. Such a rule combines the tags using the ";" operator and could look like this: source_ip:1.1.1.1;handler:Sleep,1. This particular rule would allow the upstream client host to have at most one concurrent request to the Sleep endpoint at any given time. Of course, conjunctions may be combined with wild cards to say something like, source_ip:*;handler:*,1 which prevent any handler from being invoked more than once concurrently from any given client host.

This full example is provided in the package simple

Usage Details

Please see Godoc for usage and implementation notes.

Contributions

Contributions to this library are welcome, but because we keep our open source release in sync with the upstream, not all pull requests will be merged. If you are working on this library, please do talk to us in advance to talk through your intended changes and how we can merge them into the upstream.

To submit a pull request, please sign our Contributor License Agreement and indicate in your pull request that you've signed it.

Further Reading

The ideas contained within these packages are not new. We were inspired to implement ideas first presented by Facebook in the process of evaluating the multi-tenancy of Edgestore, our distributed metadata storage (read more here).

The basis for our implementations can be found in the following resources:

load_management's People

Contributors

danieltahara avatar dmpichugin avatar fdai-dbx avatar lk4d4 avatar opaugam-dropbox avatar tanaylathia-dbx avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

load_management's Issues

Using dropbox open source load-management library

Hi Team,

I wish to use the rate limiter present in an Open-source project by drop-box. However, I have a question :

The library's last commit was 5 months ago. Is the library actively maintained? More explicitly will any go-lang upgrades/bug fixes be applied to the project going forward.

Regards,
jasdeep

Implement Negation Operator for Scorecard

There are cases when the distribution of things matching a given * tag is non-uniform, and we wish to set limits for all but one or two specific values. Implement a ~ or negation operator for this case.

Implement ForceTrack() for Scorecard

Ideally, for suspicious requests admitted to LoadManager, we would still be able to track the requests in scorecard. For that we need a Force function.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.