
Comments (6)

ChimeraCoder commented on June 28, 2024

Veneur itself doesn't retry flushes to Datadog (though you could use an http proxy for that, if you wanted). The entire pipeline is assumed to be mildly lossy, given that metrics are themselves received over UDP, which provides no delivery guarantees. Sporadic, occasional metric failures are tolerated.

That said, if you're seeing a lot of timeouts, something is probably up. We ourselves don't see many timeouts running Veneur at scale, so I'm curious what's going on here. Is your outbound network connection spotty? Are you sending a particularly large payload with each flush (a lot of metrics, or a long flush cycle)?

from veneur.

volfco commented on June 28, 2024

We're seeing a small number of sustained errors. I've got flush_max_per_body set to 25000, which is the default in the example. I don't know if this is in line with what you're seeing, but we get 10 to 15 errors every 15 minutes across the various DCs I've deployed Veneur to.

[Image: errors, broken down by DC]

These are servers from all over the world going to AWS us-east-1, so I'm expecting some errors; I'm just not sure how many.
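For readers unfamiliar with the setting, here is a minimal sketch of what a per-body cap does, assuming flush_max_per_body limits how many metrics go into each request body, as its name suggests. The `chunk` function and the string-typed metrics are hypothetical stand-ins for illustration, not Veneur's actual code.

```go
package main

import "fmt"

// chunk is a hypothetical helper (not Veneur's implementation). It splits
// metrics into consecutive batches of at most maxPerBody items, so that each
// batch could be sent as one request body.
func chunk(metrics []string, maxPerBody int) [][]string {
	if len(metrics) == 0 {
		return nil
	}
	if maxPerBody <= 0 {
		// Assumption: a non-positive limit means "no limit".
		return [][]string{metrics}
	}
	var batches [][]string
	for start := 0; start < len(metrics); start += maxPerBody {
		end := start + maxPerBody
		if end > len(metrics) {
			end = len(metrics)
		}
		batches = append(batches, metrics[start:end])
	}
	return batches
}

func main() {
	// 60000 placeholder metrics with the example default of 25000 per body.
	metrics := make([]string, 60000)
	for i, b := range chunk(metrics, 25000) {
		fmt.Printf("body %d: %d metrics\n", i, len(b))
	}
}
```

With 60000 metrics and a cap of 25000, this yields three bodies of 25000, 25000, and 10000 metrics. A larger cap means fewer but bigger requests per flush, which is one reason payload size matters when chasing timeouts.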


volfco commented on June 28, 2024

Digging into the native Datadog agent, it does look like it has some retry logic for certain failure cases, namely request errors such as timeouts: https://github.com/DataDog/datadog-agent/blob/d3e74927d78a5982d9978ed8540bd6b2c61ab437/pkg/forwarder/transaction.go#L144


ChimeraCoder commented on June 28, 2024

Yeah, that's definitely not in line with what we've experienced. We're not using Datadog ourselves at the moment, so I can't compare against current data, but timeouts in Veneur are quite rare - less than one per day - except during a Datadog outage (and their status page is green right now).

Just to clarify: when you say that this is from servers all around the world going to aws us-east-1, that's from tracing the location of where app.datadoghq.com resolves (us-east-1)?

We do use haproxy for external egress from our network, and haproxy does have built-in retries. So it's possible that we wouldn't have noticed the connection timeouts within Veneur, if haproxy was retrying and the success rate of the retried requests was high enough. As a quick test, I'd recommend running requests through a proxy like haproxy and seeing if that fixes the issue.


volfco commented on June 28, 2024

Yep. Every DC we have resolves to us-east-1 ELBs. We're talking directly to Datadog without a proxy.

It seems the path forward is for me to add some basic retry logic to the Datadog requests. We're not moving away from Datadog anytime soon, so retry logic is desired. I've put in a fair amount of effort to get our dogstatsd pipeline reliable, so I'm not stopping now.


volfco commented on June 28, 2024

Opened #561 with an I'm-new-to-Go-and-I-think-this-is-a-valid-fix fix.

