
Strange spikes in latency with toxy


Comments (10)

h2non commented on May 22, 2024

Can toxy not handle that rate of requests?

It should. I've noticed that in some scenarios, when replaying requests with a payload under high concurrency (>50 rps), there can be some performance issues or errors.

Is there some default poison applied that could be causing this?

No. There's no poison enabled by default.
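
For reference, a poison only takes effect once it's registered explicitly on the proxy instance. A minimal sketch (the target URL, port, and poison options here are placeholders, not taken from this thread):

    const toxy = require('toxy')
    const proxy = toxy()

    // Plain pass-through: nothing is poisoned unless a poison is registered.
    proxy.forward('http://example.com')
    proxy.all('/*')

    // Enabling a poison is always explicit, e.g. a latency poison:
    // proxy.poison(toxy.poisons.latency({ jitter: 500 }))

    proxy.listen(3000)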

Just a couple of questions that would help me:

  • Are you using any request or response interceptors?
  • What poison are you using, if any?
  • What kind of traffic is toxy handling? (e.g. HTTP with large/small payloads)


shorea commented on May 22, 2024

It should. I've noticed that in some scenarios, when replaying requests with a payload under high concurrency (>50 rps), there can be some performance issues or errors.
I can try dropping the request rate down to 40 per second and see if that makes a difference.

  • Are you using any request or response interceptors?
    Nope.
  • What poison are you using, if any?
    None right now. We're trying to get a baseline before applying poisons, and that's where we noticed the spikes. The code snippet above is verbatim the script I'm running.
  • What kind of traffic is toxy handling? (e.g. HTTP with large/small payloads)
    I'm connecting to toxy via HTTP, and toxy itself is establishing an HTTPS connection to DynamoDB. The requests are just PutItem calls with very small payloads and very small responses (<1 KB both ways); a rough sketch of this setup follows the list.
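
Roughly, the setup described above looks like this, assuming toxy's standard forward API (the DynamoDB region, endpoint, and port are placeholders, not taken from the original script):

    const toxy = require('toxy')
    const proxy = toxy()

    // Upstream: the regional DynamoDB endpoint, reached over HTTPS (placeholder region).
    proxy.forward('https://dynamodb.us-west-2.amazonaws.com')
    proxy.all('/*')

    // The SDK/client is pointed at toxy over plain HTTP, e.g. http://localhost:3000,
    // and every PutItem request is proxied through to DynamoDB.
    proxy.listen(3000)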


h2non commented on May 22, 2024

Thanks! I'll run some benchmark tests based on your scenario and let you know the conclusions.


h2non commented on May 22, 2024

I've just added some benchmark suites.
They cover scenarios similar to the ones in rocky. Everything works fine, with even better performance than I personally expected.

Those tests are not based on your scenario, so I added one specific test closer to yours: forwarding a 2 KB payload, without poisoning, at a concurrency of 60 rps to a remote HTTPS server.
Here are the results:

Requests    [total]             600
Duration    [total, attack, wait]       34.676810741s, 9.982575757s, 24.694234984s
Latencies   [mean, 50, 95, 99, max]     7.351279208s, 7.039588307s, 15.731271949s, 24.86061736s, 24.86061736s
Bytes In    [total, mean]           3326400, 5544.00
Bytes Out   [total, mean]           1033800, 1723.00
Success     [ratio]             100.00%
Status Codes    [code:count]            200:600  

I also ran the same suite without TLS transport. Results:

# Running benchmark suite: forward+payload
Requests    [total]             600
Duration    [total, attack, wait]       24.555032664s, 9.985355351s, 14.569677313s
Latencies   [mean, 50, 95, 99, max]     3.467563026s, 3.184269332s, 10.463356674s, 20.235857461s, 20.235857461s
Bytes In    [total, mean]           3325800, 5543.00
Bytes Out   [total, mean]           1033800, 1723.00
Success     [ratio]             100.00%
Status Codes    [code:count]            200:600  

Here are some conclusions:

  • The TLS handshake seems to be expensive in some cases and could be the main source of the slowdown, so communicating over raw HTTP will (obviously) be faster (see the sketch after this list).
  • RTT delay/jitter plays a role (I'm in Europe and the server is located in the USA).
  • I'm using a wireless connection and there's some network congestion.
  • High concurrency (>50 rps) introduces some performance bottlenecks, but no critical ones (I still need to dig into this).
  • RSS memory doesn't increase much (~50 MB) and it's stable (no evident memory leaks).
  • CPU usage is not high (<20%).
  • As I wrote before, I can confirm that stressing the server for a couple of minutes at high concurrency (60 rps) causes some performance issues. I need to dig into this.
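
On the first point, the usual Node-level way to amortize handshake cost is to reuse upstream connections with a keep-alive agent. Whether toxy/rocky can be wired to such an agent wasn't tested in this thread, so the snippet below only shows the generic https pattern (host and socket limit are placeholders):

    const https = require('https')

    // Reusing sockets means the TLS handshake is paid once per connection,
    // not once per request.
    const keepAliveAgent = new https.Agent({ keepAlive: true, maxSockets: 50 })

    https.get({ host: 'example.com', path: '/', agent: keepAliveAgent }, (res) => {
      res.resume() // drain the response so the socket is released for reuse
    })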


shorea commented on May 22, 2024

Thanks for digging into this @h2non! I reran our simulation at 40 RPS and we only had two 20-second spikes, as opposed to the five or six we normally get at 60 RPS over the same period. It's strange that every spike is about 20 seconds; even your benchmark shows spikes of roughly 20 seconds. I wonder what could be causing such consistent latency. Is there any information I could provide that would help you debug this?


h2non commented on May 22, 2024

I definitely have to dig into this. Note, however, that the benchmark tests forwarding to the loopback interface show no performance degradation of that kind.

When I have time I'll set up multiple testing scenarios to find the potential bottleneck. I'll let you know.


shorea commented on May 22, 2024

@h2non, wondering if you've had time to run some different scenarios. We are picking this work back up and would like to use toxy if possible, as it's the most robust solution we've seen for our use case.


h2non commented on May 22, 2024

Hi @shorea.

I haven't forgotten about this, but it's not simple to pin down the real problem here.
Lately I've been working to push a new product to production and unfortunately I don't have much time, but my availability will increase considerably in about two weeks.

I'll let you know once I have a diagnosis.


breedloj commented on May 22, 2024

Hey @h2non

I work with @shorea and wanted to provide an update on this issue. In short, we were able to alleviate it by capping the maximum number of sockets at 500 (via https.globalAgent.maxSockets), giving Node more memory to work with (--max-old-space-size=8192), and turning off the idle garbage collector (--nouse-idle-notification); the combination is sketched below. I don't have much experience with Node, but I have seen similar issues in Java applications that ended up being heap/GC related.

FWIW, we have been able to run 30-minute slices of traffic without latency spikes using these settings, but in more extended runs of 2+ hours we ultimately do hit a large spike. I'm guessing this is somewhat expected behavior given these settings: we're effectively holding off the smaller GCs while giving Node more memory, which means that once we do exhaust the available memory we hit a larger GC sweep. Regardless, this seems to have unblocked us for the time being.
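
Putting those settings together looks roughly like this (the values are the ones quoted above; the script name is a placeholder):

    // Launch the proxy process with a larger old-space heap and the idle GC
    // notification disabled (flags as quoted above; script name is a placeholder):
    //   node --max-old-space-size=8192 --nouse-idle-notification toxy-proxy.js

    const https = require('https')

    // Cap the maximum number of sockets on the default (global) HTTPS agent at 500.
    https.globalAgent.maxSockets = 500

    // ...then create and start the toxy proxy as usual.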


h2non commented on May 22, 2024

Glad to hear news about this. Honestly, I haven't had time to dig into it, so I appreciate your update. Initially I didn't think it was directly related to GC/memory issues, since RSS memory was stable during my stress testing for more than 15 minutes.

My opinion is that there's a memory leak somewhere; otherwise you shouldn't be forced to increase the V8 heap limit.

I would recommend taking a look at the following utilities (a minimal heap-dump sketch follows the links):
https://github.com/node-inspector/node-inspector
https://github.com/node-inspector/v8-profiler
https://github.com/bnoordhuis/node-heapdump
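
For example, node-heapdump can write V8 heap snapshots from the running proxy, which can then be loaded and diffed in Chrome DevTools (the output path below is a placeholder):

    const heapdump = require('heapdump')

    // Write a V8 heap snapshot; taking one before and one after a latency spike
    // lets you diff the two in Chrome DevTools.
    heapdump.writeSnapshot('/tmp/toxy-' + Date.now() + '.heapsnapshot', (err, filename) => {
      if (err) console.error('heapdump failed:', err)
      else console.log('heap snapshot written to', filename)
    })

    // Alternatively, node-heapdump also writes a snapshot to the working
    // directory when the process receives SIGUSR2:  kill -USR2 <pid>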

These kinds of issues are hard to debug, but I would like to invest time in this soon, since it's a challenging problem to solve.
