Comments (6)
Veneur itself doesn't retry flushes to Datadog (though you could use an HTTP proxy for that, if you wanted). The entire pipeline is assumed to be mildly lossy, given that metrics are themselves received over UDP, which provides no delivery guarantees. Sporadic, occasional metric failures are tolerated.
That said, if you're seeing a lot of timeouts, something is probably up. We ourselves don't see many timeouts running Veneur at scale, so I'm curious what's going on here. Is your outbound network connection spotty? Are you sending a particularly large payload with each flush (a lot of metrics, or a long flush cycle)?
from veneur.
We're seeing a small but sustained number of errors. I've got flush_max_per_body set to 25000, which is the default in the example. I don't know if this is in line with what you're seeing, but we get 10 to 15 errors every 15 minutes across the various DCs I've deployed Veneur to.
(Broken down by DC)
These are servers from all over the world going to AWS us-east-1, so I'm expecting some errors, I'm just not sure how many.
Digging into the native Datadog agent, it does look like it has some retry logic for certain failure cases, namely request errors such as timeouts: https://github.com/DataDog/datadog-agent/blob/d3e74927d78a5982d9978ed8540bd6b2c61ab437/pkg/forwarder/transaction.go#L144
Yeah, that's definitely not in line with what we've experienced. We're not using Datadog ourselves at the moment, so I can't compare against current data, but timeouts in Veneur are quite rare - less than one per day - except during a Datadog outage (and their status page is green right now).
Just to clarify: when you say that this is from servers all around the world going to aws us-east-1, that's from tracing the location of where app.datadoghq.com resolves (us-east-1)?
We do use haproxy for external egress from our network, and haproxy does have built-in retries. So it's possible that we wouldn't have noticed the connection timeouts within Veneur, if haproxy was retrying and the success rate of the retried requests was high enough. As a quick test, I'd recommend running requests through a proxy like haproxy and seeing if that fixes the issue.
Yep. Every DC we have resolved to us-east-1 ELBs. We're talking directly to Datadog without a proxy.
It seems the path forward is for me to add some basic retry logic into the datadog requests. We're not moving away from Datadog anytime soon so retry logic is desired. I've put in a fair amount of effort to get our dogstatsd pipeline reliable so I'm not stopping now.
Opened #561 for an I'm-new-to-Go-and-I-think-this-is-a-valid-fix fix.