
[Bug] /metrics endpoint processes requests one-by-one and seems to queue up waiting requests infinitely, ignoring the request timeout (CLOSED)

lhotari commented on June 22, 2024

Comments (13)

frankjkelly commented on June 22, 2024

Very cool - I was not aware of that setting.
That said, the current description for that parameter isn't helpful in understanding just how useful it can be :-) https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=optional
If true, export buffered metrics
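
For reference, enabling it is a single broker-level setting, either in broker.conf or via the Helm chart's broker.configData (a sketch; verify the property name against your Pulsar version's broker configuration reference):

# broker.conf: buffer the generated metrics output and reuse it for concurrent /metrics scrapes
metricsBufferResponse=true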


dao-jun commented on June 22, 2024

Very cool - I was not aware of that setting. That said the current description for that parameter isn't helpful in understanding just how useful it can be :-) https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=optional If true, export buffered metrics

I'll improve the description of the configuration field.


lhotari commented on June 22, 2024

While trying to reproduce this issue, I ran into another issue where metrics stop working completely.
A GET call to /metrics/ returns an empty response.

I have Pulsar 3.0.4 deployed using the 3.4.0-candidate-1 Helm chart:

helm repo add --force-update apache-pulsar-dist-dev https://dist.apache.org/repos/dist/dev/pulsar/helm-chart/3.4.0-candidate-1/
helm repo update
helm install -n pulsar pulsar apache-pulsar-dist-dev/pulsar --version 3.4.0 --set affinity.anti_affinity=false --set broker.replicaCount=1 --wait --debug
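
Before going further, it's worth confirming the broker pod is actually running (a quick check; component=broker is the label used by the chart):

kubectl get pods -n pulsar -l component=broker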

After this, start port forwarding to the broker pod:

kubectl port-forward -n pulsar $(kubectl get pods -n pulsar -l component=broker -o jsonpath='{.items[0].metadata.name}') 8080:8080
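
With the port-forward in place, a quick sanity check that metrics are reachable (truncating the output, since it can get large):

curl -s http://localhost:8080/metrics/ | head -n 20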

I created a large number of topics with very long tenant, namespace, and topic names:

git clone https://github.com/lhotari/pulsar-playground
cd pulsar-playground
./gradlew shadowJar
# create the topics
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioCreateLongNamedTopics
# ensure that all topics are loaded
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioLoadAll

Sample topic name for a single topic is persistent://aVecxoNHhQeRENVQLqFKGgVhpVHrspZsbNrcZIYgiFgSGAXMgaqrygIxlpWeVJHUVAhJmlABdRJVUYDNrEPnrpIiFkPxiyjjxWwQCsUJTwhxLqIeOpmiZvudXVKwnVemuPxVaukUsOsrsSQeNdzrulFTkYbfKpPgIqEwYqNgpqHFeQFRPbMAXwhcCxpvSGRplDCIMMnxbLnSEOUbvPFnTKhjxPOqZUBqnjdXRIIsqLGsqTlEKoojLexBteBDmXgVhMxaUSjVrGUgWbkfqpgPEhknVrUnmpCoycGhApByzvXmHWZJcJhbnwFamawjHlvDzaeFlQLvyjqxXFSQqGQxjeGflzeSGypsLijZfNnKxOpTJZQkgWPKBQqFDGalBjZBgyhGiCoBNAxXgRpxUNBtKnLaeOZGSqlzWicaqQgBDteQiksNCUarRYSAKBXyoiYsjUaxiaosDXHsYcLLEUpjdFNOHpFPrWdejBHPFDLltdQhlzjwYBrRAkfFCuXgQKtlcPPfPausmeUvrwnQBrNFApLjPTUqufLnJrdCksSCcOVetltZaghNmLTchETGpPhuCGgAFzuZjIFqglZPVCBlRqjLaGYHtxnsANfVzUTavdiKgicKWHRdlofsMgDdIlGhowLunxiFgJjEQTtKtftUDVJrykrkqsLaqjZeEuCTMXTDDkKFvKAApPKxqoOkhiWYhWaHRwGWWRnoEpGPmbRGdRDqqtyfQSXrzlQceTYFeWFSUXIgvjavdjnqFZLjhqKKZGEIEgRQQWwxTKLEsUrWqdvYqVXnIZXfzXmKBoOZtyGvYRTbhZoLHwUImdaqjpNXCJjJgpOQefSMnzLrCxZZoLcjTaIlzWJovIEPVsEzqoQxznxPIcWGYDIeGotegFqGneLDeVvqPpwvhQkdJcEpoNDkNJSkKwQpOWRuxjIFQDBtkqmBDmSxuGxUpgkkOUstlHHGcpjNsuZAjfAHHNkRWHjOzFTQKryFNhQaKGzr005/YGKBcGdELHIcIRdpHllGnDxxFdbMvYENMsnVyaGrzCEQlClLzlbBQTyNtrysvLHboFnQtWrTWdJxMuiycgPInCuGzzITtjdLjCGRuYnPWugOGmnGDSpgKMaKXSlmxDSFZHdAZFLZljytpOVFYmJaCSjQYVhdVTkPPvHglVpyXRiOMpJfZHBqnWWEfHvxKguKTaHkwnWBbcvkwBowhwNhoDhQqvhmDSmLrBIVaCkufPpgBZGFmUjGuHQdPmfXiaLDvexPAuUuoSIChLDBsPNZEuddahamZOmRXBDCBNBTVFdimjHwoQmCbWFTbBjjhqBSHzTZqTbJJAzymaYlfMIgIhkSzrZVCQcDxebYTXIrEkIyQcXuMKAZSMxAJsJvzRDrcxoAObihSaUxRxFHvWVXOObjHlaGOGttfNOIDUcszFVKvQSgcsHBEuxzwRmDpMuGoWDnRMKzjUfYToJDLfKJxgDnKwcTnRmIOWINAqxCZiaeCcvdTToUoSXxPVRDaSYekHgGYLfPYUyGgFLHgpcRHJHZROPnsdxSFNAcYIvrbQswSEbWjADlsAYGMOimuIvcMshBrNNYtcdehCzzlzGuxPXsGcoDHzamPezRvfduoKSFSWCJJFuYxsBnGfEZOSfWFxtqWpSdcNnyPyrwVPNWyQaDWKCYQeIZbxXrPatPCPnlixzxwaxJJJphyKtplRnSpbDfDrASIAFExAyITwSMFRdfhELCcRVARtaRSXjEihdZaZmFUerIkdQDuMZaarWbWeThwEcDHPUatUXWHfTBJaCJgjJiCvwgSrrhqRKuCleJHxGxujqeLTNDvGfvDnxznUNsEPzkudWnovfoudDuavzBOpuRzoSkNJAJMzXfdShYNtukpuxfjKLwQGfxWCUBClcAOjsTMsQkCzNaFSQlukCmoginaKnAdGMFjPzUuFncfYmQWzzhBGhCWIRkCQYPEQiAFeSwlLYssjNcVHcacrxjGudnPDXGCybKhPtk008/MqmGcLFhesqqROTDnTBqolaXFeLDMahTOmpuxcYoLLTRekLSnIHsOdizlxidvuJuoGGzCBIamVoLbsiWJDvrujPXYObyCYsVEMAcswIkUrqTooVIcVCRuHYmtdrHdxPAiLgOWQKEHIHpPeFethszwWYZGSYwhyjcRhtUWfUNtztPEmiTOUzWkaPSAYTnqhfUhUZTCFjRYMsUDYcxHwPFJBwrnDtGZanKJUmowHQHRAFXWSVMTSCSeCABjEbBafSAtWnzyrahNlDWEyeamWWZTPrRczQQOEKyQFHmFStKKeILqliysnUQcWtuofVQZtKdMQJQIqdueNchYpIaSRiWlkyekOZVMOyDGeWVTaAFKxYzHHOPTqNtllnnIqZjPFHkBmFXRkgFUVSwONvuNdIIzITlmzlRBtCucTLdTdPdmDhKVGZEPUSaVmujfFQgSHjmsXmNRqVeoAFBVJvDYuCoctatemUDgkttMZHRCccfWLZAluYOvLyqtxxgMSyjsBWpEYktbgFjZDMxJkrCsbTUlslfzWRhlGkhsDtwNwColLMVZKIdSxdowUcMwFnlmuaHrIBgiITbfUichnESKgnWkfMGwbJTMTMNFoqdElriwJoZfPgqYIHKVlbKRnqXwaACdOGSyjVHYltgacxzqOKfAZQzpWtOfpHbYsuEFVBobuIQXcFYdCxLRVkvrDpZCsjegdppuoRKqdZWGBUUARUAKKXcDiWMvWWCxcSDeLxcKvAIGGeZMolpVPUAqcDBNfPphPkBskEFzqzgDYDOViLsIwylMohMYmWXQjCREBSczgmoIlbkfXEAhmwuWcVRNuKmbUVjHIuQIhdzunSFcBriEdHeujDQyqcEOYOVohupMVloQWlbrFDRUKuTYugyKverkGPKSEeXwbtcbVjqWaijnTFbtHHmQVkHJrkRcwfWfWtlDoKDgCNZGOhokAuTsaZVleulSWpsUICIlUPNGyVLuCBPUKTowVHzISLvctEQ010
This is to ensure that the topics consume a lot of bytes in the metrics output.
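
To get a rough idea of how many bytes the metrics output actually grows to, it can be measured directly (simple sketch):

# response size of /metrics/ in bytes
curl -s http://localhost:8080/metrics/ | wc -c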

Metrics stopped working:

❯ curl -v http://localhost:8080/metrics/
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /metrics/ HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 11 Apr 2024 10:03:10 GMT
< Content-Type: text/plain;charset=utf-8
< Content-Length: 0
< Server: Jetty(9.4.54.v20240208)
<
* Connection #0 to host localhost left intact
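
A simple way to keep watching for this empty-response condition is to poll the endpoint and print the status code and downloaded size (a sketch using curl's --write-out variables):

while true; do
  curl -s -o /dev/null -w '%{http_code} %{size_download}\n' http://localhost:8080/metrics/
  sleep 10
done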


lhotari commented on June 22, 2024

In the above case, I noticed that the default scraping interval for Prometheus in the Apache Pulsar Helm chart was 10 seconds. It's possible that the frequent scraping was part of how the broker got into the bad state where the metrics response is empty. In any case, it confirms that the current solution has multiple issues.


lhotari commented on June 22, 2024

With the topics created with java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioCreateLongNamedTopics, this /metrics/ empty-response issue reproduces consistently.
After a broker restart, the problem starts happening again after triggering the loading of all topics with java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioLoadAll.


lhotari commented on June 22, 2024

The test ran out of heap, so I'm retrying with more heap. I also got java.io.IOException: Packet len 15589885 is out of range!, so I'm increasing jute.maxbuffer:

affinity:
  anti_affinity: false
bookkeeper:
  resources:
    requests:
      memory: 800m
      cpu: 1
  configData:
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms128m -Xmx400m -XX:MaxDirectMemorySize=300m
broker:
  replicaCount: 1
  resources:
    requests:
      memory: 2500m
      cpu: 1
  configData:
    metricsBufferResponse: "true"
    zookeeperSessionExpiredPolicy: "shutdown"
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms128m -Xmx1024m -XX:MaxDirectMemorySize=1024m
proxy:
  replicaCount: 1
zookeeper:
  resources:
    requests:
      memory: 600m
      cpu: 1
  configData:
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms200m -Xmx500m
kube-prometheus-stack:
  prometheus-node-exporter:
    hostRootFsMount: 
      enabled: false
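
Assuming the values above are saved as values.yaml, they can be applied to the same release with something like:

helm upgrade --install -n pulsar pulsar apache-pulsar-dist-dev/pulsar --version 3.4.0 -f values.yaml --wait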


dao-jun commented on June 22, 2024

I think it might be caused by the long duration of scraping metrics? I remember some metrics need to read from BookKeeper.


dao-jun commented on June 22, 2024

#14453 was introduced by me for the purpose of improvement: in some cases the /metrics endpoint might be requested more than once, so I wanted to cache the metrics and return them directly if there are other requests in that period.
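
A rough way to observe the effect is to fire a few concurrent scrapes and compare their timings and sizes; with the buffering enabled, requests that arrive while one result is being generated should be answered from the same buffered output (a sketch, not a proper benchmark):

# fire 5 concurrent scrapes and print per-request total time and response size
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w 'time=%{time_total}s size=%{size_download}\n' http://localhost:8080/metrics/ &
done
wait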


lhotari commented on June 22, 2024

#14453 was introduced by me for the purpose of improvement: in some cases the /metrics endpoint might be requested more than once, so I wanted to cache the metrics and return them directly if there are other requests in that period.

@dao-jun yes, that helps to some extent. I'll be refactoring and improving it further to address the problems that remain (for example, the infinite queuing of requests).


lhotari commented on June 22, 2024

storing this in load_test.js:

import http from 'k6/http';

export let options = {
    vus: 100, // Number of virtual users
    iterations: 10000, // Total number of request iterations
};

export default function () {
    http.get('http://localhost:8080/metrics/');
}

running it with k6:

brew install k6
k6 run load_test.js

It reproduces the queuing issue, at least when metricsBufferResponse isn't enabled. There are multiple problems showing up when testing with Pulsar 3.0.4 in a local k8s environment.


lhotari commented on June 22, 2024

I was able to stretch it further by running Pulsar standalone with more memory:

PULSAR_MEM="-Xms2g -Xmx4g -XX:MaxDirectMemorySize=6g" PULSAR_GC="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+AlwaysPreTouch" PULSAR_EXTRA_OPTS="-Djute.maxbuffer=20000000" PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log

When running the k6 test, it can be seen that response times increase until they hit 30 seconds, which is the request timeout in k6.

The response time (in milliseconds) increases linearly, with consecutive responses roughly 450-500 ms apart:
582 1121 1577 2033 2499 2970 3440 3917 4398 4859 5330 5806 6269 6763 7199 7647 8097 8561 9015 9482 9916 10372 10820 11252 11708 12160 12669 13117 13554 13991 14504 14966 15411 15901 16364 16832 17270 17745 18215 18690 19157 19589 20055 20538 20974 21409 21884 22323 22787 23247 23750 24188 24629 25065 25531 26008 26593 27133 27723 28303 28879 29439

This can be seen in https://gist.github.com/lhotari/137be0ab5d16101f0a88e6628e1d34b3.

It can be seen that even after k6 is closed, the request processing continues. For some of the orphaned requests, the response size is reported to be the full size of the output. That means that the metrics have been re-generated for the queued and orphaned requests. For example, 384122607 bytes (384 MB) is the response size of the full output.

2024-04-11T19:29:13,672+0300 [prometheus-stats-51-1] ERROR org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsServlet - Failed to send metrics, likely the client or this server closed the connection due to a timeout (373 ms elapsed): org.eclipse.jetty.io.EofException
2024-04-11T19:29:13,672+0300 [prometheus-stats-51-1] INFO  org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [11/Apr/2024:19:28:48 +0300] "GET /metrics/ HTTP/1.1" 200 384122607 "-" "k6/0.50.0 (https://k6.io/)" 25073
2024-04-11T19:29:14,077+0300 [prometheus-stats-51-1] ERROR org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsServlet - Failed to send metrics, likely the client or this server closed the connection due to a timeout (405 ms elapsed): org.eclipse.jetty.io.EofException
2024-04-11T19:29:14,077+0300 [prometheus-stats-51-1] INFO  org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [11/Apr/2024:19:28:49 +0300] "GET /metrics/ HTTP/1.1" 200 0 "-" "k6/0.50.0 (https://k6.io/)" 24889


lhotari commented on June 22, 2024

Tested standalone with the metricsBufferResponse=true optimization created in #14453.

PULSAR_PREFIX_metricsBufferResponse=true docker/pulsar/scripts/apply-config-from-env.py conf/standalone.conf
PULSAR_MEM="-Xms2g -Xmx4g -XX:MaxDirectMemorySize=6g" PULSAR_GC="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+AlwaysPreTouch" PULSAR_EXTRA_OPTS="-Djute.maxbuffer=20000000" PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log

It seems that it's able to keep up even with the 100-virtual-user k6 test introduced above.


lhotari commented on June 22, 2024

The complete 10000-request test passed, downloading 3.6 TB of metrics data in under 10 minutes, which works out to roughly 360 MB per response on average, in line with the ~384 MB full output size noted earlier (this was with the local Pulsar standalone on a Mac M3 Max).

❯ k6 run load_test.js

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: load_test.js
        output: -

     scenarios: (100.00%) 1 scenario, 100 max VUs, 10m30s max duration (incl. graceful stop):
              * default: 10000 iterations shared among 100 VUs (maxDuration: 10m0s, gracefulStop: 30s)


     data_received..................: 3.6 TB 6.2 GB/s
     data_sent......................: 880 kB 1.5 kB/s
     http_req_blocked...............: avg=44.33µs min=1µs      med=3µs     max=5.5ms    p(90)=4.1µs   p(95)=6µs
     http_req_connecting............: avg=35.38µs min=0s       med=0s      max=4.19ms   p(90)=0s      p(95)=0s
     http_req_duration..............: avg=5.74s   min=196.38ms med=5.74s   max=6.34s    p(90)=5.99s   p(95)=6.14s
       { expected_response:true }...: avg=5.74s   min=196.38ms med=5.74s   max=6.34s    p(90)=5.99s   p(95)=6.14s
     http_req_failed................: 0.00%  ✓ 0         ✗ 10000
     http_req_receiving.............: avg=66.89ms min=47.39ms  med=67.33ms max=192.11ms p(90)=76.73ms p(95)=82.04ms
     http_req_sending...............: avg=24.05µs min=3µs      med=14µs    max=5.43ms   p(90)=26µs    p(95)=43µs
     http_req_tls_handshaking.......: avg=0s      min=0s       med=0s      max=0s       p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=5.67s   min=4.25ms   med=5.67s   max=6.25s    p(90)=5.92s   p(95)=6.07s
     http_reqs......................: 10000  17.325729/s
     iteration_duration.............: avg=5.74s   min=197.65ms med=5.74s   max=6.34s    p(90)=5.99s   p(95)=6.14s
     iterations.....................: 10000  17.325729/s
     vus............................: 3      min=3       max=100
     vus_max........................: 100    min=100     max=100


running (09m37.2s), 000/100 VUs, 10000 complete and 0 interrupted iterations
default ✓ [======================================] 100 VUs  09m37.2s/10m0s  10000/10000 shared iters


