Coder Social home page Coder Social logo

veneur's Introduction

Go Stripe

Go Reference Build Status Coverage Status

The official Stripe Go client library.

Requirements

  • Go 1.15 or later

Installation

Make sure your project is using Go Modules (it will have a go.mod file in its root if it already is):

go mod init

Then, reference stripe-go in a Go program with import:

import (
	"github.com/stripe/stripe-go/v76"
	"github.com/stripe/stripe-go/v76/customer"
)

Run any of the normal go commands (build/install/test). The Go toolchain will resolve and fetch the stripe-go module automatically.

Alternatively, you can also explicitly go get the package into a project:

go get -u github.com/stripe/stripe-go/v76

Documentation

For a comprehensive list of examples, check out the API documentation.

See video demonstrations covering how to use the library.

For details on all the functionality in this library, see the Go documentation.

Below are a few simple examples:

Customers

params := &stripe.CustomerParams{
	Description:      stripe.String("Stripe Developer"),
	Email:            stripe.String("[email protected]"),
	PreferredLocales: stripe.StringSlice([]string{"en", "es"}),
}

c, err := customer.New(params)

PaymentIntents

params := &stripe.PaymentIntentListParams{
	Customer: stripe.String(customer.ID),
}

i := paymentintent.List(params)
for i.Next() {
	pi := i.PaymentIntent()
}

if err := i.Err(); err != nil {
	// handle
}

Events

i := event.List(nil)
for i.Next() {
	e := i.Event()

	// access event data via e.GetObjectValue("resource_name_based_on_type", "resource_property_name")
	// alternatively you can access values via e.Data.Object["resource_name_based_on_type"].(map[string]interface{})["resource_property_name"]

	// access previous attributes via e.GetPreviousValue("resource_name_based_on_type", "resource_property_name")
	// alternatively you can access values via e.Data.PrevPreviousAttributes["resource_name_based_on_type"].(map[string]interface{})["resource_property_name"]
}

Alternatively, you can use the event.Data.Raw property to unmarshal to the appropriate struct.

Authentication with Connect

There are two ways of authenticating requests when performing actions on behalf of a connected account, one that uses the Stripe-Account header containing an account's ID, and one that uses the account's keys. Usually the former is the recommended approach. See the documentation for more information.

To use the Stripe-Account approach, use SetStripeAccount() on a ListParams or Params class. For example:

// For a list request
listParams := &stripe.CustomerListParams{}
listParams.SetStripeAccount("acct_123")
// For any other kind of request
params := &stripe.CustomerParams{}
params.SetStripeAccount("acct_123")

To use a key, pass it to API's Init function:

import (
	"github.com/stripe/stripe-go/v76"
	"github.com/stripe/stripe-go/v76/client"
)

stripe := &client.API{}
stripe.Init("access_token", nil)

Google AppEngine

If you're running the client in a Google AppEngine environment, you'll need to create a per-request Stripe client since the http.DefaultClient is not available. Here's a sample handler:

import (
	"fmt"
	"net/http"

	"google.golang.org/appengine"
	"google.golang.org/appengine/urlfetch"

	"github.com/stripe/stripe-go/v76"
	"github.com/stripe/stripe-go/v76/client"
)

func handler(w http.ResponseWriter, r *http.Request) {
	c := appengine.NewContext(r)
	httpClient := urlfetch.Client(c)

	sc := client.New("sk_test_123", stripe.NewBackends(httpClient))

	params := &stripe.CustomerParams{
		Description: stripe.String("Stripe Developer"),
		Email:       stripe.String("[email protected]"),
	}
	customer, err := sc.Customers.New(params)
	if err != nil {
		fmt.Fprintf(w, "Could not create customer: %v", err)
	}
	fmt.Fprintf(w, "Customer created: %v", customer.ID)
}

Usage

While some resources may contain more/less APIs, the following pattern is applied throughout the library for a given $resource$:

Without a Client

If you're only dealing with a single key, you can simply import the packages required for the resources you're interacting with without the need to create a client.

import (
	"github.com/stripe/stripe-go/v76"
	"github.com/stripe/stripe-go/v76/$resource$"
)

// Setup
stripe.Key = "sk_key"

// Set backend (optional, useful for mocking)
// stripe.SetBackend("api", backend)

// Create
resource, err := $resource$.New(&stripe.$Resource$Params{})

// Get
resource, err = $resource$.Get(id, &stripe.$Resource$Params{})

// Update
resource, err = $resource$.Update(id, &stripe.$Resource$Params{})

// Delete
resourceDeleted, err := $resource$.Del(id, &stripe.$Resource$Params{})

// List
i := $resource$.List(&stripe.$Resource$ListParams{})
for i.Next() {
	resource := i.$Resource$()
}

if err := i.Err(); err != nil {
	// handle
}

With a Client

If you're dealing with multiple keys, it is recommended you use client.API. This allows you to create as many clients as needed, each with their own individual key.

import (
	"github.com/stripe/stripe-go/v76"
	"github.com/stripe/stripe-go/v76/client"
)

// Setup
sc := &client.API{}
sc.Init("sk_key", nil) // the second parameter overrides the backends used if needed for mocking

// Create
$resource$, err := sc.$Resource$s.New(&stripe.$Resource$Params{})

// Get
$resource$, err = sc.$Resource$s.Get(id, &stripe.$Resource$Params{})

// Update
$resource$, err = sc.$Resource$s.Update(id, &stripe.$Resource$Params{})

// Delete
$resource$Deleted, err := sc.$Resource$s.Del(id, &stripe.$Resource$Params{})

// List
i := sc.$Resource$s.List(&stripe.$Resource$ListParams{})
for i.Next() {
	$resource$ := i.$Resource$()
}

if err := i.Err(); err != nil {
	// handle
}

Accessing the Last Response

Use LastResponse on any APIResource to look at the API response that generated the current object:

c, err := coupon.New(...)
requestID := coupon.LastResponse.RequestID

Similarly, for List operations, the last response is available on the list object attached to the iterator:

it := coupon.List(...)
for it.Next() {
    // Last response *NOT* on the individual iterator object
    // it.Coupon().LastResponse // wrong

    // But rather on the list object, also accessible through the iterator
    requestID := it.CouponList().LastResponse.RequestID
}

See the definition of APIResponse for available fields.

Note that where API resources are nested in other API resources, only LastResponse on the top-level resource is set.

Automatic Retries

The library automatically retries requests on intermittent failures like on a connection error, timeout, or on certain API responses like a status 409 Conflict. Idempotency keys are always added to requests to make any such subsequent retries safe.

By default, it will perform up to two retries. That number can be configured with MaxNetworkRetries:

import (
	"github.com/stripe/stripe-go/v76"
	"github.com/stripe/stripe-go/v76/client"
)

config := &stripe.BackendConfig{
    MaxNetworkRetries: stripe.Int64(0), // Zero retries
}

sc := &client.API{}
sc.Init("sk_key", &stripe.Backends{
    API:     stripe.GetBackendWithConfig(stripe.APIBackend, config),
    Uploads: stripe.GetBackendWithConfig(stripe.UploadsBackend, config),
})

coupon, err := sc.Coupons.New(...)

Configuring Logging

By default, the library logs error messages only (which are sent to stderr). Configure default logging using the global DefaultLeveledLogger variable:

stripe.DefaultLeveledLogger = &stripe.LeveledLogger{
    Level: stripe.LevelInfo,
}

Or on a per-backend basis:

config := &stripe.BackendConfig{
    LeveledLogger: &stripe.LeveledLogger{
        Level: stripe.LevelInfo,
    },
}

It's possible to use non-Stripe leveled loggers as well. Stripe expects loggers to comply to the following interface:

type LeveledLoggerInterface interface {
	Debugf(format string, v ...interface{})
	Errorf(format string, v ...interface{})
	Infof(format string, v ...interface{})
	Warnf(format string, v ...interface{})
}

Some loggers like Logrus and Zap's SugaredLogger support this interface out-of-the-box so it's possible to set DefaultLeveledLogger to a *logrus.Logger or *zap.SugaredLogger directly. For others it may be necessary to write a thin shim layer to support them.

Expanding Objects

All expandable objects in stripe-go take the form of a full resource struct, but unless expansion is requested, only the ID field of that struct is populated. Expansion is requested by calling AddExpand on parameter structs. For example:

//
// *Without* expansion
//
c, _ := charge.Get("ch_123", nil)

c.Customer.ID    // Only ID is populated
c.Customer.Name  // All other fields are always empty

//
// With expansion
//
p := &stripe.ChargeParams{}
p.AddExpand("customer")
c, _ = charge.Get("ch_123", p)

c.Customer.ID    // ID is still available
c.Customer.Name  // Name is now also available (if it had a value)

How to use undocumented parameters and properties

stripe-go is a typed library and it supports all public properties or parameters.

Stripe sometimes launches private beta features which introduce new properties or parameters that are not immediately public. These will not have typed accessors in the stripe-go library but can still be used.

Parameters

To pass undocumented parameters to Stripe using stripe-go you need to use the AddExtra() method, as shown below:

	params := &stripe.CustomerParams{
		Email: stripe.String("[email protected]")
	}

	params.AddExtra("secret_feature_enabled", "true")
	params.AddExtra("secret_parameter[primary]","primary value")
	params.AddExtra("secret_parameter[secondary]","secondary value")

	customer, err := customer.Create(params)

Properties

You can access undocumented properties returned by Stripe by querying the raw response JSON object. An example of this is shown below:

customer, _ = customer.Get("cus_1234", nil);

var rawData map[string]interface{}
_ = json.Unmarshal(customer.LastResponse.RawJSON, &rawData)

secret_feature_enabled, _ := string(rawData["secret_feature_enabled"].(bool))

secret_parameter, ok := rawData["secret_parameter"].(map[string]interface{})
if ok {
	primary := secret_parameter["primary"].(string)
	secondary := secret_parameter["secondary"].(string)
} 

Webhook signing

Stripe can optionally sign the webhook events it sends to your endpoint, allowing you to validate that they were not sent by a third-party. You can read more about it here.

Testing Webhook signing

You can use stripe.webhook.GenerateTestSignedPayload to mock webhook events that come from Stripe:

payload := map[string]interface{}{
	"id":          "evt_test_webhook",
	"object":      "event",
	"api_version": stripe.APIVersion,
}
testSecret := "whsec_test_secret"

payloadBytes, err := json.Marshal(payload)

signedPayload := webhook.GenerateTestSignedPayload(&webhook.UnsignedPayload{Payload: payloadBytes, Secret: testSecret})
event, err := webhook.ConstructEvent(signedPayload.Payload, signedPayload.Header, signedPayload.Secret)

if event.ID == payload["id"] {
	// Do something with the mocked signed event
} else {
	// Handle invalid event payload
}

Writing a Plugin

If you're writing a plugin that uses the library, we'd appreciate it if you identified using stripe.SetAppInfo:

stripe.SetAppInfo(&stripe.AppInfo{
	Name:    "MyAwesomePlugin",
	URL:     "https://myawesomeplugin.info",
	Version: "1.2.34",
})

This information is passed along when the library makes calls to the Stripe API. Note that while Name is always required, URL and Version are optional.

Telemetry

By default, the library sends telemetry to Stripe regarding request latency and feature usage. These numbers help Stripe improve the overall latency of its API for all users, and improve popular features.

You can disable this behavior if you prefer:

config := &stripe.BackendConfig{
	EnableTelemetry: stripe.Bool(false),
}

Mocking clients for unit tests

To mock a Stripe client for a unit tests using GoMock:

  1. Generate a Backend type mock.
mockgen -destination=mocks/backend.go -package=mocks github.com/stripe/stripe-go/v76 Backend
  1. Use the Backend mock to initialize and call methods on the client.
import (
	"example/hello/mocks"
	"testing"

	"github.com/golang/mock/gomock"
	"github.com/stretchr/testify/assert"
	"github.com/stripe/stripe-go/v76"
	"github.com/stripe/stripe-go/v76/account"
)

func UseMockedStripeClient(t *testing.T) {
	// Create a mock controller
	mockCtrl := gomock.NewController(t)
	defer mockCtrl.Finish()
	// Create a mock stripe backend
	mockBackend := mocks.NewMockBackend(mockCtrl)
	client := account.Client{B: mockBackend, Key: "key_123"}

	// Set up a mock call
	mockBackend.EXPECT().Call("GET", "/v1/accounts/acc_123", gomock.Any(), gomock.Any(), gomock.Any()).
		// Return nil error
		Return(nil).
		Do(func(method string, path string, key string, params stripe.ParamsContainer, v *stripe.Account) {
			// Set the return value for the method
			*v = stripe.Account{
				ID: "acc_123",
			}
		}).Times(1)

	// Call the client method
	acc, _ := client.GetByID("acc_123", nil)

	// Asset the result
	assert.Equal(t, acc.ID, "acc_123")
}

Beta SDKs

Stripe has features in the beta phase that can be accessed via the beta version of this package. We would love for you to try these and share feedback with us before these features reach the stable phase. To install a beta version of stripe-go use the commit notation of the go get command to point to a beta tag:

go get -u github.com/stripe/stripe-go/[email protected]

Note There can be breaking changes between beta versions.

We highly recommend keeping an eye on when the beta feature you are interested in goes from beta to stable so that you can move from using a beta version of the SDK to the stable version.

If your beta feature requires a Stripe-Version header to be sent, set the stripe.APIVersion field using the stripe.AddBetaVersion function to set it:

Note The APIVersion can only be set in beta versions of the library.

stripe.AddBetaVersion("feature_beta", "v3")

Support

New features and bug fixes are released on the latest major version of the Stripe Go client library. If you are on an older major version, we recommend that you upgrade to the latest in order to use the new features and bug fixes including those for security vulnerabilities. Older major versions of the package will continue to be available for use, but will not be receiving any updates.

Development

Pull requests from the community are welcome. If you submit one, please keep the following guidelines in mind:

  1. Code must be go fmt compliant.
  2. All types, structs and funcs should be documented.
  3. Ensure that make test succeeds.

Test

The test suite needs testify's require package to run:

github.com/stretchr/testify/require

Before running the tests, make sure to grab all of the package's dependencies:

go get -t -v

It also depends on stripe-mock, so make sure to fetch and run it from a background terminal (stripe-mock's README also contains instructions for installing via Homebrew and other methods):

go get -u github.com/stripe/stripe-mock
stripe-mock

Run all tests:

make test

Run tests for one package:

go test ./invoice

Run a single test:

go test ./invoice -run TestInvoiceGet

For any requests, bug or comments, please open an issue or submit a pull request.

veneur's People

Contributors

aditya-stripe avatar an-stripe avatar andresgalindo-stripe avatar arnavdugar-stripe avatar asf-stripe avatar aubrey-stripe avatar chimeracoder avatar choo-stripe avatar chualan-stripe avatar clin-stripe avatar cory-stripe avatar dependabot[bot] avatar eriwo-stripe avatar evanj avatar gphat avatar joshu-stripe avatar kiran avatar kklipsch-stripe avatar krisreeves-stripe avatar mimran-stripe avatar noahgoldman avatar prudhvi avatar redsn0w422 avatar rma-stripe avatar sdboyer-stripe avatar sjung-stripe avatar tummychow avatar vasi-stripe avatar yanske avatar yasha-stripe avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

veneur's Issues

sentry support

veneur should catch panics and report them to sentry. Error reporting to sentry might also be nice, but since errors are generated on a packet-by-packet basis, the existing metrics should cover a lot of our reporting needs there.

Solutions for Tee-ing statsd traffic

I have a weird edgecase scenario and can't tell if veneur can support it. I'd like to tee off some statsd traffic that is sent to a veneur instance. I'd like to send it to both datadog and a local prometheus (I assumed via the statsd exporter). But I don't see a sink for statsd or prometheus. Any advice?

Datadog api key is leaking into logs in default config

Starting up the software with no configuration overrides etc. places logging into debug mode which results in log messages like this:

time="2018-08-24T20:13:02Z" level=debug msg="POSTed successfully" action=flush endpoint="https://app.datadoghq.com/api/v1/series?api_key=<MY ACTUAL API KEY>" request_headers="map[Content-Encoding:[deflate] Traceid:[7918023244848592696] Parentid:[3942115212271789064] Spanid:[1544772588644824963] Resource:[flush] Content-Type:[application/json]]" request_length=29313 response="{\"status\": \"ok\"}" response_headers="map[Content-Length:[16] Dd-Pool:[propjoe] X-Content-Type-Options:[nosniff] Strict-Transport-Security:[max-age=15724800;] Date:[Fri, 24 Aug 2018 20:13:02 GMT] Content-Type:[text/json]]" status="202 Accepted"

While I think there is some value in outputting the datadog apikey in a debug message I think that both for security and sanity that logging should probably not be set to debug level by default. Or that examples like this one (https://varnull.adityamukerjee.net/2018/04/05/observing-kubernetes-services-with-veneur/) should probably show how to override the log level.

Dynamic per tag api key support for signalfx sink

Hi!

The intended use case for this is having a per-service API Key, without having to hard code them all in the config file. I'd be adding this to the SignalFX sink.

This idea is @prudhvi's, and I'd be working with him to handle the meat of the implementation. He said he'd chatted with someone and that he'd gotten affirmation that a PR with this logic would be accepted, if it was generic enough.

I had a rough design in mind, let me know if it sounds reasonable. In a ticker polling loop:

Configuration Through Environment Variables?

I'm wondering if there is any interest in enabling configuration through environment variables. This seems to be the default way to configure things with Docker and would be quite easy to enable using the https://github.com/kelseyhightower/envconfig library with the Config struct that is currently used. It could either be a flag that can be used as an alternative to --config or it could default to reading from the environment if --config is not set.

If there is interest in doing this, I'd be willing to contribute the code.

datadog metrics error

Hi Friends,
I am running a standalone local veneur docker server with a valid DATADOG_APIKEY Configuration.

After the server starts up I see consistently DataDog connectivity failures. Please help

sr_veneur    | time="2019-11-01T22:05:58Z" level=warning msg="Could not POST" action=flush endpoint="https://app.datadoghq.com/api/v1/series?api_key=\"<valid key replaced for github purpose>\"" error=403 request_headers="map[Content-Encoding:[deflate] Content-Type:[application/json] Ot-Tracer-Sampled:[true] Ot-Tracer-Spanid:[72e562337c4bff4f] Ot-Tracer-Traceid:[696dc5e6b593e6c0]]" request_length=1196 response="<html><body><h1>403 Forbidden</h1>\nRequest forbidden by administrative rules.\n</body></html>\n" response_headers="map[Cache-Control:[no-cache] Content-Type:[text/html] Date:[Fri, 01 Nov 2019 22:05:58 GMT]]" status="403 Forbidden"
sr_veneur    | time="2019-11-01T22:05:58Z" level=info msg="Completed flush to Datadog" metrics=114
sr_veneur    | time="2019-11-01T22:05:58Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:05:58Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=debug msg=Flushing worker=0
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=debug msg="Checkpointing flushed spans for X-Ray" dropped_spans=6 flushed_spans=6
sr_veneur    | time="2019-11-01T22:06:08Z" level=debug msg="Worker count chosen" workers=1
sr_veneur    | time="2019-11-01T22:06:08Z" level=debug msg="Chunk size chosen" chunkSize=107
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=info msg="Completed flush to SignalFx" metrics=107 success=true
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Could not POST" action=flush endpoint="https://app.datadoghq.com/api/v1/series?api_key=\"<valid key replaced for github purpose>\"" error=403 request_headers="map[Content-Encoding:[deflate] Content-Type:[application/json] Ot-Tracer-Sampled:[true] Ot-Tracer-Spanid:[490c3bc208624814] Ot-Tracer-Traceid:[3d89f85a11d4f6d9]]" request_length=1090 response="<html><body><h1>403 Forbidden</h1>\nRequest forbidden by administrative rules.\n</body></html>\n" response_headers="map[Cache-Control:[no-cache] Content-Type:[text/html] Date:[Fri, 01 Nov 2019 22:06:08 GMT]]" status="403 Forbidden"
sr_veneur    | time="2019-11-01T22:06:08Z" level=info msg="Completed flush to Datadog" metrics=107
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:08Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:18Z" level=debug msg=Flushing worker=0
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:18Z" level=debug msg="Checkpointing flushed spans for X-Ray" dropped_spans=6 flushed_spans=7
sr_veneur    | time="2019-11-01T22:06:18Z" level=debug msg="Worker count chosen" workers=1
sr_veneur    | time="2019-11-01T22:06:18Z" level=debug msg="Chunk size chosen" chunkSize=114
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:18Z" level=info msg="Completed flush to SignalFx" metrics=114 success=true
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Could not POST" action=flush endpoint="https://app.datadoghq.com/api/v1/series?api_key=\"<valid key replaced for github purpose>\"" error=403 request_headers="map[Content-Encoding:[deflate] Content-Type:[application/json] Ot-Tracer-Sampled:[true] Ot-Tracer-Spanid:[47e7e8050e834905] Ot-Tracer-Traceid:[2a94b0994f734c21]]" request_length=1196 response="<html><body><h1>403 Forbidden</h1>\nRequest forbidden by administrative rules.\n</body></html>\n" response_headers="map[Cache-Control:[no-cache] Content-Type:[text/html] Date:[Fri, 01 Nov 2019 22:06:18 GMT]]" status="403 Forbidden"
sr_veneur    | time="2019-11-01T22:06:18Z" level=info msg="Completed flush to Datadog" metrics=114
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"
sr_veneur    | time="2019-11-01T22:06:18Z" level=warning msg="Error sending segment" error="write udp 127.0.0.1:56908->127.0.0.1:2000: write: connection refused"

S3 plugin - bucket is not in region

Created bucket in region us-west-2 "devops-veneur-test2" with inline policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<<LOCAL_ACCOUNT_ID>>:root"
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::devops-veneur-test2",
                "arn:aws:s3:::devops-veneur-test2/*"
            ]
        }
    ]
}

Created IAM user with inline policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::devops-veneur-test2/*",
                "arn:aws:s3:::devops-veneur-test2"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets",
                "s3:HeadBucket"
            ],
            "Resource": "*"
        }
    ]
}

Created Access Key / Secret Access Key for above user.

Configured veneur /root/.aws/credentials with above credentials
I am able to

aws s3 cp /etc/veneur.cfg s3://devops-veneur-test2/ --region us-west-2

Configured veneur.cfg with:

aws_access_key_id: "<<ACCESS_KEY>>"
aws_secret_access_key: "<<SECRET_ACCESS_KEY>>"
aws_region: "us-west-2"
aws_s3_bucket: "devops-veneur-test2"

Veneur logs:

time="2018-12-03T14:40:19Z" level=info msg="Set mutex profile fraction" MutexProfileFraction=0 previousMutexProfileFraction=0
time="2018-12-03T14:40:19Z" level=info msg="Set block profile rate (nanoseconds)" BlockProfileRate=0
time="2018-12-03T14:40:19Z" level=info msg="Preparing workers" number=32
time="2018-12-03T14:40:19Z" level=info msg="Successfully created AWS session"
time="2018-12-03T14:40:19Z" level=info msg="S3 archives are enabled"
time="2018-12-03T14:40:19Z" level=info msg="Starting server" version=cdcfb315b61057a0ee5c8bb3fd3b3bca0a69059b
time="2018-12-03T14:40:19Z" level=info msg="Starting span workers" n=0
time="2018-12-03T14:40:19Z" level=info msg="Starting span sink" sink=metric_extraction
time="2018-12-03T14:40:19Z" level=info msg="Listening on UDP address" address="0.0.0.0:8125" listeners=4 protocol=statsd
time="2018-12-03T14:40:19Z" level=info msg="Tracing sockets are not configured - not reading trace socket"
time="2018-12-03T14:40:19Z" level=info msg="Starting gRPC server" address="0.0.0.0:8128"
time="2018-12-03T14:40:19Z" level=info msg="Starting Event worker"
time="2018-12-03T14:40:19Z" level=info msg="HTTP server listening" address="0.0.0.0:8127"
time="2018-12-03T14:40:29Z" level=error msg="Error posting to s3" error="BucketRegionError: incorrect region, the bucket is not in 'us-west-2' region\n\tstatus code: 301, request id: , host id: " metrics=2
time="2018-12-03T14:40:39Z" level=error msg="Error posting to s3" error="BucketRegionError: incorrect region, the bucket is not in 'us-west-2' region\n\tstatus code: 301, request id: , host id: " metrics=37
time="2018-12-03T14:40:49Z" level=error msg="Error posting to s3" error="BucketRegionError: incorrect region, the bucket is not in 'us-west-2' region\n\tstatus code: 301, request id: , host id: " metrics=37

Consider a better default config file for your container.

I recently did an upgrade from a fairly old version of the tool (4.x) to 11.x and found that with the inclusion and default use of your default config file my deployment went from working to spewing errors right and left.

It's my opinion that all sinks should probably have an enabled true/false flag but also maybe not providing values that cause them to activate by default would be nice. A short list of things I can remember modifying the config file to disable, splunk, falconer, signalfx, gprc, xray...

Not sending float numbers that are less than 1 for Counters to datadog.

Hi,

We recently started using veneur to replace datadog agent and exporting data to datadog. We are not using a release code but used the code from github on July 17th. We have observed that for Counter float value that is less than 1 is not getting sent to datadog. I did a workaround to multiply the same metric with 10 (our number is around 0.2) and it is able to be successfully pushed to datadog.

Is this a known issue? Or is it fixed in the later code/release?

Thanks,
Jingyuan

Force all metrics to be sent to the global aggregator

We want to use Veneur in a mode where we run a set of "leaf" Veneurs which receive stats from applications, and forward ALL metrics to a global Veneur. This is because we don't care about host tags, and so we don't want to pay for them :). We've gotten away with having a single Veneur instance so far, but we are starting to run into CPU limits, and we figure this scheme will allow us to scale substantially farther without needing to shard.

We are going to hack this into a local fork and make sure it works for us. If it does, would you be possibly interested in a pull request to add this as some sort of configuration option? I can understand that this may be a strange enough use case to not want it, and I don't want to do the work to figure out how to make this configurable if it won't be accepted.

Thanks!

Documentation unclear on how local Veneur should be used

Hi, first of all, thanks for making this!

I would like a clarification (that I couldn't find in the current README) on how a local Veneur is intended to be used (i.e. what's the best practice?):

  • should a local Veneur instance replace DogStatsD from the dd-agent entirely?

OR

  • should DogStatsD merely be configured to forward its metrics to the local Veneur instance?

I expect the answer is the latter scenario, based on the comment for udp_address in the configuration section of the README, but I was unable to find a firm, explicit recommendation in the documentation.

veneur-proxy doesn't work on Kubernetes via GRPC

I've been trying to use veneur-proxy on our setup that only has one veneur-global and a bunch of veneur-local's running as sidecars on each pod that needs to send metrics to Datadog.

The setup with just one veneur-global works fine, but after fiddling around with the proxy, the farthest I've got was:

time="2019-11-11T15:29:14Z" level=debug msg="Found TCP port" port=8128
time="2019-11-11T15:29:14Z" level=debug msg="Got destinations" destinations="[http://172.21.153.68:8128]" service=veneur-global
time="2019-11-11T15:29:41Z" level=error msg="Proxying failed" duration=2.684632ms error="failed to forward to the host 'http://172.21.153.68:8128' (cause=forward, metrics=1): failed to send 1 metrics over gRPC: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp: address http://172.21.153.68:8128: too many colons in address\"" protocol=grpc
time="2019-11-11T15:29:44Z" level=debug msg="About to refresh destinations" acceptingForwards=false consulForwardGRPCService=veneur-global consulForwardService= consulTraceService=

That error is similar to when the forward GRPC address is added with the http:// prefix and looking at the code it seams that the "http://" is hardcoded added by the KubernetesDiscoverer while it's not added by Consul.

What is the reasoning for that? I don't want to fallback to sending this over HTTP just because the discoverer can't handle this.

Thanks for your time.

Package rename in vendored dependencies

Hello! We're working on integrating Veneur into our (mono)repo, and are running up against some vendoring issues.

Specifically, x/net/http in this repo is older than the one we're using, which contains some package renames (namely lex/httplex moved into http/httpguts).

Here's the specific change: golang/net@cbb82b5

I've bumped the version locally to the latest upstream, and everything looks good, so I'd be more than happy to offer that up as a patch.

Thanks!

Aggregated histogram metrics are not forwarded to global veneur

I have a local veneur forwarding metrics to a global veneur. If I tag a histogram metric with veneurglobalonly aggregations of that metric are not forwarded to the global veneur (e.g. $ echo "testing.histo:1234|h|#veneurglobalonly,blah:yes" | nc -w 1 -u localhost 18125)

max:
image

99percentile:
image

Can this behaviour be supported?

Non-numeric gauge blocks all metrics in flush interval from being sent

When using the latest related version of veneur inside of Stripe, I noticed that gauges with non-numeric values cause veneur to emit an error message, and fail to deliver all of the metrics contained in the flush interval.

You can reproduce this issue by running this command:

echo -n "this_is_a_bad_metric:nan|g|#shell" >/dev/udp/localhost/8200

You should see this error message in the log:

[2017-05-11 21:37:48.262346] time="2017-05-11T21:37:48Z" level=error msg="Could not render JSON" action=flush error="json: unsupported value: NaN" 

I was only able to reproduce this using a gauge, a counter didn't produce the error in the log. This somewhat confusing, because looking at the code, I thought that this packet would fail to parse:

return nil, fmt.Errorf("Invalid number for metric value: %s", valueChunk)

Which would trigger an error higher up the stack:

log.WithFields(logrus.Fields{

Causing just that one metric to be excluded. This doesn't appear to be happening for some reason.

Bucket alignment

After migrating from dd-agent to veneur we have the following problem on DataDog:

screen shot 2017-10-20 at 11 35 54

On the left dd-agent aligns flushes at aligned intervals (0s, 10s, 20s, ... after minute boundaries), and the datapoints are only defined at those x-axis-values, while on the right the 10s buckets veneur (from many different instances) are aligned to 0s, 5s, 10s, 15s, ... intervals.

The alignment should happen equally to how dd-agent is doing this for correct behavior in DataDog.

The way dd-agent works is by

bucket_start = 1508492676 % interval
// bucket_start = 1508492670

while veneur uses a ticker. If every instance starts at a distinct time, and you have a suffienciently large number of instances, the flushes are uniformly distributed over that 10s interval.

Proposed solution would be something like this:
https://stackoverflow.com/questions/33267106/golang-time-ticker-how-to-begin-on-even-timestamps

Upgrade github.com/gogo/protobuf to later version

https://github.com/gogo/protobuf/releases is at 1.1.1 right now and the veneur dependency is pinned to 0.5.0 (https://github.com/stripe/veneur/blob/master/Gopkg.toml#L45)

1.x protobuf introduced some new types and fields in generated protobuf types that aren't available in 0.5.0.

We ran into an issue pulling in an API that pinned the version and dep was unable to resolve a dependency without using an override.

Overriding to 1.1.1 seemed to work, but figured I'd drop a line here to see if we can just roll that dependency forward

Veneur CPU usage based on metric packets processed per second.

Hi Team
We are seeing veneur using 1 logical CPU for about 60k metrics processed per second. We are sending stats to Veneur via java datadog client over Unix Domain Socket. Just wanted to know if such usage is expected or if we configured something bad.

Attaching the signalfx charts that show the stats:

veneur.worker.metrics_processed_total chart:

Screen Shot 2019-11-11 at 2 01 18 PM

Logical CPU's used by Veneur chart:

Screen Shot 2019-11-11 at 2 01 30 PM

Veneur panic

Our veneur's v1.3.1 started crashing sporadically past thurday.
Yesterday, I compiled from master, and updated all, but the crash seams to persist.

The problem is that it's ocasional and I can't seam to be able to debug ingested metrics.

Thank you for any help.

LOG:
time="2018-11-27T13:38:11Z" level=info msg="Completed flush to Datadog" metrics=2403
fatal error: fault
unexpected fault address 0x10f9000
	/go/src/github.com/stripe/veneur/worker.go:259 +0x3ee fp=0xc000ac7fa8 sp=0xc000ac7c28 pc=0x10f854e
github.com/stripe/veneur.(*Worker).Work(0xc000362fc0)
	/go/src/github.com/stripe/veneur/worker.go:311 +0x9fa fp=0xc000ac7c28 sp=0xc000ac7a90 pc=0x10f8ffa
github.com/stripe/veneur.(*Worker).ProcessMetric(0xc000362fc0, 0xc000ac7dc8)
	/usr/local/go/src/runtime/signal_unix.go:397 +0x275 fp=0xc000ac7a90 sp=0xc000ac7a40 pc=0x441aa5
runtime.sigpanic()
	/usr/local/go/src/runtime/panic.go:608 +0x72 fp=0xc000ac7a40 sp=0xc000ac7a10 pc=0x42bf72
runtime.throw(0x13cc01a, 0x5)
goroutine 134 [running]:

[signal SIGSEGV: segmentation violation code=0x2 addr=0x10f9000 pc=0x10f8ffa]
net.(*conn).Read(0xc00000c050, 0xc0004e8000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*netFD).Read(0xc000401180, 0xc0004e8000, 0x2000, 0x2000, 0x408a7b, 0xc00002e000, 0x125a980)
	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x179
internal/poll.(*FD).Read(0xc000401180, 0xc0004e8000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*pollDesc).waitRead(0xc000401198, 0xc0004e8000, 0x2000, 0x2000)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9a
internal/poll.(*pollDesc).wait(0xc000401198, 0x72, 0xffffffffffffff00, 0x15b5280, 0x206e558)
	/usr/local/go/src/runtime/netpoll.go:173 +0x66
internal/poll.runtime_pollWait(0x7f131c6a9950, 0x72, 0xc0006c2870)
goroutine 36 [IO wait]:

	/go/src/github.com/stripe/veneur/cmd/veneur/main.go:94 +0x2eb
main.main()
	/go/src/github.com/stripe/veneur/server.go:1074 +0x6f
github.com/stripe/veneur.(*Server).Serve(0xc0005dc600)
goroutine 1 [chan receive, 244 minutes]:

	/go/src/github.com/stripe/veneur/server.go:273 +0x95c
created by github.com/stripe/veneur.NewFromConfig
	/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc000ac7fd8 sp=0xc000ac7fd0 pc=0x45bab1
runtime.goexit()
	/go/src/github.com/stripe/veneur/server.go:277 +0x51 fp=0xc000ac7fd0 sp=0xc000ac7fa8 pc=0x10fe351
github.com/stripe/veneur.NewFromConfig.func1(0xc0005dc600, 0xc000362fc0)


CONFIG:
statsd_listen_addresses:
 - udp://0.0.0.0:8125
tls_key: ""
tls_certificate: ""
tls_authority_certificate: ""
forward_address: ""
forward_use_grpc: false
interval: "10s"
synchronize_with_interval: false
stats_address: "localhost:8125"
http_address: "0.0.0.0:8127"
grpc_address: "0.0.0.0:8128"
indicator_span_timer_name: ""
percentiles:
  - 0.99
  - 0.95
aggregates:
  - "min"
  - "max"
  - "median"
  - "avg"
  - "count"
  - "sum"
num_workers: 96
num_readers: 4
num_span_workers: 10
span_channel_capacity: 100
metric_max_length: 4096
trace_max_length_bytes: 16384
read_buffer_size_bytes: 4194304
debug: false
debug_ingested_spans: false
debug_flushed_metrics: false
mutex_profile_fraction: 0
block_profile_rate: 0
sentry_dsn: ""
enable_profiling: false
datadog_api_hostname: https://app.datadoghq.com
datadog_api_key: "DATADOG_KEY"
datadog_flush_max_per_body: 25000
datadog_trace_api_address: ""
datadog_span_buffer_size: 16384
aws_access_key_id: ""
aws_secret_access_key: ""
aws_region: ""
aws_s3_bucket: ""
flush_file: ""

ADDITIONAL INFO:
They receive metrics via UDP at a rate no lower than 1000/s.
We healthcheck via TCP/8127 /healthcheck and the servers are replaced on failure. (AWS ASG)

Common tags on sinks: overwrite or not?

One of the issues that came up in #386 was whether it was appropriate for the gRPC sink's "common tags" to supersede any tags on the spans it ingests. This struck @asf-stripe as odd, as it did me, once he mentioned it - but i then checked the datadog sink and noticed that it does, indeed, overwrite tags on the incoming span.

Given that the application of common tags is itself somewhat uneven (#388), and that i didn't see (on quick ctrl-F-based perusal) any direct discussion of it in the README, i thought it worth opening an issue to explicitly clarify what the intended behavior is.

Veneur sinks support for Wavefront

Not sure this is the right way or forum to ask this question, however we would like integrate veneur with Wavefront.
is there already support available for that sink or is there any work going towards it which we can keep track on?

Need to change Sirupsen/logrus to sirupsen/logrus

Problem

It appears that the official library package has been changed to the lowercased version. (https://github.com/sirupsen/logrus).

This causes problems when attempting to use the test hook

import "github.com/Sirupsen/logrus/hooks/test"

because pulling down that package makes references to the downcased package names which results in an error when attempting to run go test:

cannot use logger (type *"github.com/stripe/veneur/vendor/github.com/sirupsen/logrus".Logger) as type *"github.com/stripe/veneur/vendor/github.com/Sirupsen/logrus".Logger in argument

Proposed solution

Update the bundled in package to the downcased version in the vendor directory and change all the imports to the downcased version of the name

Local instance tags on aggregated metrics

When metrics are forwarded from the local instances to a global aggregator, tags that are configured on the local instances aren't included.

What we'd like to do is to have each local instance include a cluster tag, which will get forwarded with the aggregated percentiles/counters, so that we can have one central aggregator but still be able to group metrics by the cluster which it's being sent from (host level isn't important to us, but cluster level is).

Can you explain the reasoning behind making this decision? Is it possible to add this as an option to allow aggregation based on some tags?

Allow listening for local metrics/SSF on a UNIX domain socket

Right now, veneur opens a UDP socket, and optionally a TCP socket to listen for data; these are both fine, but not the best choices for local operation. Some highlights of the downsides of an internet socket for local metrics include:

  • for UDP: MTU and max packet sizes and MTU start to matter, especially on larger events and metrics with long tag values.
  • TCP: Connection overhead; you end up going through the entire (the entire simulated with some shortcuts, but) TCP/IP stack for sending data
  • TCP: makes you wait for an ack.
  • Both: less flexibility when running veneur on a container host - have to make that port accessible to the container with potential network shenanigans.

The solution to this is fairly well-used in the industry: UNIX domain sockets (aka local sockets). This is a path in the file system that clients connect to (e.g. Ruby, go) and then can treat like a normal socket.

Major upsides of UNIX domain sockets are:

  • Reliability: Data sent into the local socket is guaranteed to make it out the other end, in order. Buffer sizes are more forgiving and allow for more data to be sent than e.g. in UDP datagrams.
  • Performance: data sent goes into a kernel buffer directly with no waiting for acks.
  • Safety: We can be sure that only local clients can ever connect to veneur, even if we make config mistakes.
  • More safety / clarity: The file in the file system has an owner and permissions (which makes it easier to grok who can connect).
  • Operability in containers: You can run a single veneur on the container host and mount the socket into a container like any other file.

This will also require a mild re-work of client libraries (most dogstatsd clients speak only UDP, and you'll need to be more careful sending data, as they can and do block if you send with default socket options), but I think implementing UNIX domain sockets and letting client libraries start using them will be a major win for everyone.

Unusual behaviour with aggregation

We’re aggregating metrics from our rails app, and seeing a few strange behaviours when we turn this on for two clusters. We don’t have any clear understanding of what’s going on here, but hoping you might be able to shed some light, or help us debug!

For a brief rundown on our architecture, we have a Veneur local agent running on each box that runs rails, and then a global aggregator collecting everything and shipping it off to Datadog. We’ve also patched the global aggregator consider local tags, as well as global.

The problem we’re seeing is when two clusters with a similar name (example: production-api-thing and production-api-thing-other) are both aggregated through the same global aggregator, we see a strange pattern in the way metrics arrive at Datadog.

dd_veneur_debug_board___datadog

The problem occurs when both clusters are aggregated. If only one or the other is being aggregated, we do not see this behaviour, only when both are. The metrics flip as seen between 14:00 and 14:30, but there doesn’t seem to be a pattern in how long it takes to flip between.

In this time, we don’t see anything untoward in the metrics we’re collecting from Veneur itself, since we’re still emitting the same number of metrics, they’re just being aggregated by the wrong bucket.

Have you any suggestions how we might dig into this further, or what might be the problem?

Does Veneur retry failed datadog flushes?

I'm not that good at Go so I couldn't find my answer from the code.

Does Veneur retry sending failed datadog flushes? I've seen fair amount of flushes fail due to http errors (timeouts) and I'm wondering if those are just warnings or could be dataloss.
...: time="2018-10-10T10:22:44-05:00" level=warning msg="Could not execute request" action=flush error="net/http: request canceled (Client.Timeout exceeded while awaiting headers)" host=app.datadoghq.com path=/api/v1/series

Are these requests retried? And if so, how many retries before the segment is lost?

Error in assertion about maximum required size.

I apologize for the incorrect comment in my original Java code about the maximum size for a merging digest. In fact, the bound is 2 * ceiling(compression), not as in the code.

This changes this line: https://github.com/stripe/veneur/blob/master/tdigest/merging_digest.go#L75

Here is a proof:

In the algorithm, we convert q values to k values using

k = compression * (asin(2 * q - 1) + \pi / 2) / \pi

For q = 1, k_max = compression. This is the largest possible value k can have and the number of centroids is less than or equal to ceiling(2 * compression). We only consider cases where compression >= 1 which implies that we have 2 or more centroids.

Take n as the number of centroids. The sum of the k-sizes for these centroids must be exactly k_max.

If n is even, then suppose that n > 2 ceiling(k_max). Further, to avoid collapse to a smaller number of centroids, we know that all adjacent pairs of centroid weights must be greater than 1. But we can form k_max consecutive pairs which must add up to more than k_max. This implies that the weights on the items after the k_max pairs must be negative which is impossible. Thus, n > 2 ceiling(k_max) is impossible for even n.

If n is odd, then we will have n > 2 ceiling(k_max) + 1 if n > 2 ceiling(k_max). Again, we can form at least k_max pairs, each with k_size greater than 1. There will be at least 3 centroids after these pairs. But this implies that the k-size of the remaining centroids will be less than or equal to one. But the first two of these must have k-size together greater or equal to one so the remaining centroids must have negative k-size. Thus, n > 2 ceiling(k_max) + 1 > 2 ceiling(k_max) is impossible for n odd.

By contradiction, n <= 2 ceiling(compression)

RFC: TCP/TLS support?

Summary: Would you be interested in merging a patch that adds TCP statsd, and requires TLS client certificates?

Details:

For slightly bizarre reasons, we need to listen for stats on a port that is exposed to the Internet (most of our application is in Google App Engine, with other services on Google Compute Engine. They are not on the same network). For our services that do this, we require a TLS client certificate, to ensure that only our trusted clients can connect. As a result, I've hacked veneur to support the following:

  1. Accepting stats from a TCP socket. This has one caveat: messages must be terminated with a \n, which not all clients do (many, including Datadog's only include \n if packing multiple metrics in a single packet, and omit it for the last one).

  2. Requiring a TLS client certificate signed by a given key to accept the connection.

This isn't 100% in production for us (Bluecore) yet, but we are testing it now, with the goal of using this to deliver most of our stats from App Engine to Datadog. It would be convenient for us if you would merge our patch. However, I understand this is a pretty esoteric feature and we might be the only users in the world. I'm happy to maintain a private fork if that makes more sense.

Thanks!

Public Docker image

Feature request:
I see that this has a maintained Dockerfile, which has allowed me to compile veneur with minimal effort. However, I can't seem to find an official docker image for running veneur. While I can certainly fork and tweak this repo to fit into my own build pipeline, I feel like I'm redoing effort that others have already done. Would it be possible to release a docker image with release tags? I'm happy to provide a PR for a Dockerfile that builds a production worthy image, but it would have to fit into Stripe's build pipeline.

[1.8.1] Unable to specify "zero" SSF Listener Addresses via environment configuration

Hi there!

I ran into an issue today upgrading to the latest docker image of Veneur. There appears to be a bug when trying to set up the ssf_listener_addresses field via the environmental configuration suggested here: https://github.com/stripe/veneur/tree/master/public-docker-images#running. As far as I can tell, the initial configuration is provided by the example.yaml file, which specifies "default" values for ssf_listener_addresses. If I wanted to change those I'd have to overwrite them by providing -e VENEUR_SSFLISTENERADDRESSES="" to the docker run command. One small difference is that I'm using my Kubernetes deployment to pass the environment variables rather than directly using docker run

Due to the way envconfig works, I can't specify an empty string and have it translate to an empty slice. If I specify an empty string for this value, envconfig will parse it as a slice of size 1. If I don't specify a value for VENEUR_SSFLISTENERADDRESSES, I'll get the default value from the example.yaml. In the first case, an empty string will be passed to this line which causes a fatal error. In the latter case, I appear to be hanging infinitely when flushing metrics. Some quick debugging on my own end led me to believe it was because I had invalid addresses specified as my ssf_listener_addresses in the example.yaml file. My debug logs just showed workers 1-96 flushing without ever actually doing anything.

It looks like this could be fixed by commenting out the actual values for ssf_listener_addresses in the example.yaml file. I could also have just created my own docker image, but this seemed faster initially. I have confirmed that this issue is not present in release 1.7.0, presumably because the multiple ssf listener address functionality had yet to be added.

I hope this is enough information, but if not let me know. I also understand this may be an edge case you don't want to fix.

Thanks,

Chad

Histograms .avg uneven

We have a infra-structure of servers that directly inject metrics into an infra-structure of veneur's.
These veneur's forward metrics to a single veneur aggregator.

Some of the servers metrics include measures of time to process requests of an API infra-structure. On this API we also directly inject "requests" metrics into a single veneur.

We are comparing the metrics from both sides and we have uneven results, mainly on histogram's .avg metrics.
When we used local dd-agent on the servers the metrics would match. The dd-agent would receive localhost UDP.
We removed local dd-agent due to the need to free CPU.

The servers don't use local's veneur, directly sending metrics via UDP to veneur's, injecting about 56K metrics / second, about 18K distinct metrics into the veneur's.

We use veneur 1.3.1 on the veneur's, and 1.7.0 on the aggregator.
The veneur's flushes in 10s into the aggregator.
The aggregator flushes in 20s intervals.

veneurs.cfg.txt
aggregator.cfg.txt
both.sysctl.conf.txt

Is this scenario feasible or would you advise another?

Logrus configuration

By default, logrus outputs to stderr. When running in GCE with the Stackdriver agent, stderr lines end up with severity 'warning', even when the level in the text-format is 'Info'.

As the logrus readme says this is how it is supposed to be used:

  if Environment == "production" {
    log.SetFormatter(&log.JSONFormatter{})
  } else {
    // The TextFormatter is default, you don't actually have to do this.
    log.SetFormatter(&log.TextFormatter{})
  }

Can we integrate something like that in Veneur? That way I believe stackdriver picks up the log level. I'm not sure though. In either case, stuff like "Completed flush to Datadog" and "Completed forward to upstream Veneur" should go to stdout, right?

Change the README and Website to explain in plain English what this does

Please see the discussion over on https://news.ycombinator.com/item?id=17586185

Someone did finally explain what this package does and this is what they said:

"The use case: you have more than a hundred machines emitting lots of monitoring data, much of which is uninteresting except in aggregate form. Instead of paying to store millions of data points and then computing the aggregates later, Veneur can calculate things like percentiles itself and only forwards those. ... It also has reliability benefits when operating over a network that might drop UDP packets, such as the public internet."

That makes sense to me. But when I first came across this package, I could not figure out what it does even after reading the entire README! I've been a programmer for decades, so I don't think it's me. I've never heard of "observability data" and a search didn't help. The Wikipedia article on observability left me even more confused. And for a while I thought it might have to do with observables, like in RxJS....

Same criticism applies to the website. "Veneur is a distributed, fault-tolerant observability pipeline. It's fast, memory-efficient, and capable of aggregating global metrics"

I have no idea what that means. Global weather metrics? Something to do with the globe I guess...

I did finally figure out what this package does by following the link at the bottom of the website. This article https://stripe.com/blog/introducing-veneur-high-performance-and-global-aggregation-for-datadog

"high performance and global aggregation for Datadog"

Getting closer.

Which led me to DataDog, where I finally get plain English: "Modern monitoring & analytics. See inside any stack, any app, at any scale, anywhere."

What a journey! README -> Veneur Website -> Blog Post -> Datadog website -> "Oh this is about monitoring applications and collecting that monitoring data!

I suggest something at the very top of your README and on your website along the lines of ""If you monitor applications using Datadog, Veneur can save you money and increase reliability".

What's the upside for you? If I come across a package and can't use it now, I'll bookmark it and come back to it when I need it. That won't happen if I can't quickly figure out what category the software belongs in and how to tag it.

Can this be used to aggregate `host` into one?

Hey there!

We have plenty of dd agents which are each adding their own host metric and therefore multiplying the number of metrics we have for datadog.

Would it be possible to use Veneur to aggregate host to be only one value? We don't use the host split the different agents give us.

Unix domain socket support for statsd metrics

Hi Team
Would you accept a PR to add support for UDS listening support for statsd metrics.
If so would you accept for listening metrics on DATAGRAM socket, I see that SSF span traces already support UDS via STREAM sockets.

But for statsd metrics thinking DATAGRAM is more suited to mimic UDP behavior and that the standard datadog clients support sending stats to DATAGRAM socket and we prefer the support for listening on datagram socket instead of stream socket for the same reason.
https://github.com/DataDog/datadog-agent/wiki/Unix-Domain-Sockets-support
https://github.com/DataDog/datadog-agent/wiki/Unix-Domain-Sockets-support#client-libraries-state

Unable to send to SignalFX

We have built binary and its sending StatsD metrics to DD which is one of the sinks.
But the SignalFX metrics are not coming even though it says success.

Then we changed the ingest URL to rubbish string and we got exact same "INFO[0110] Completed flush to SignalFx" message - so we now thing NONE of the SFX config is recognised.

We have tried this in Docker and native go binary.
Again if the DD metrics work from same go app, its just getting the other SignalFX sinc to work.

Here is the SignalFX specific veneur confg and logs ....

log.txt
config.yaml.zip

veneur-prometheus over reports counts and histograms

The Problem

Prometheus instrumented applications report an ever increasing number for counters (and histograms). That is if 10 events come in and you query it you will get 10, if 5 more come in, you will get 15, etc. Until the application is restarted and the count starts over.

Our bridge code is not accounting for this when sending to statsd/dogstatsd. So from statsd perspective it thinks that 10 things happened, then 15 things happened, etc. This means that for long running processes you will get dramatic over reporting of events.

You can verify this behavior by observing this test: https://github.com/stripe/veneur/blob/kklipsch/prometheus/cmd/veneur-prometheus/main_test.go#L22

Proposal

To fix this behavior the most straight forward approach is to take 2 points and subtract them to calculate the counter and histogram values. In the main case this works fine. There are 3 special cases that need to be accounted for:

  • Restart of veneur-prometheus: On restart you will lose your statefulness on the previous value. This means that during restarts you will miss statistics. This can be mitigated somewhat by saving state, but that becomes very complicated (what if the restart takes a really long time, do you want to report to statsd all of the events that happened during that period in a single flush interval to statsd?). Propose to ignore this case and just understand the stats gap that is generated by restarts.
  • Restart of the application being monitored clears counts: This will cause the counter to go smaller than the previous sample. In that case you can't know how many increments have happened on the stat (for instance previous value was 1250, restart happens and then when you poll again the value is 600). You know a minimum of 600 events happened, but it could be any number above that as the counter could have gotten up to an arbitrary X before the restart happened. Propose to report the minimum value for this case.
  • Restart of the application being monitored allows for histogram buckets to change. If the restart was part of a version change, the static buckets in histograms (and summaries I believe) can change. This means that you'd need to keep track of your histogram metrics by bucket and the code needs to be aware of this changing and continue to operate if it happens.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.