Comments (13)
Very cool - I was not aware of that setting. That said, the current description for that parameter isn't helpful in understanding just how useful it can be :-) https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=optional

The current description reads only: "If true, export buffered metrics"

I'll improve the description of the configuration field.
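For anyone who wants to try it, this is a minimal sketch of enabling the setting in broker.conf (the Helm values later in this thread show the equivalent broker.configData entry):

# broker.conf
# When enabled, the broker buffers the generated metrics output and serves the
# buffered copy to concurrent /metrics scrapes instead of regenerating it for
# each request.
metricsBufferResponse=true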
While trying to reproduce this issue, I ran into another issue where metrics stop working completely: a GET call to /metrics/ will return an empty response.
I have Pulsar 3.0.4 deployed using Helm chart 3.4.0 RC1:
helm repo add --force-update apache-pulsar-dist-dev https://dist.apache.org/repos/dist/dev/pulsar/helm-chart/3.4.0-candidate-1/
helm repo update
helm install -n pulsar pulsar apache-pulsar-dist-dev/pulsar --version 3.4.0 --set affinity.anti_affinity=false --set broker.replicaCount=1 --wait --debug
After this, start port forwarding to the broker pod:
kubectl port-forward -n pulsar $(kubectl get pods -n pulsar -l component=broker -o jsonpath='{.items[0].metadata.name}') 8080:8080
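With the port-forward running, a quick sanity check of the endpoint (HTTP status code and response size) confirms it responds normally before any topics are created:

curl -s -o /dev/null -w '%{http_code} %{size_download}\n' http://localhost:8080/metrics/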
I create a large number of topics with very long tenant, namespace and topic names:
git clone https://github.com/lhotari/pulsar-playground
cd pulsar-playground
./gradlew shadowJar
# create the topics
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioCreateLongNamedTopics
# ensure that all topics are loaded
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioLoadAll
Sample topic name for a single topic is persistent://aVecxoNHhQeRENVQLqFKGgVhpVHrspZsbNrcZIYgiFgSGAXMgaqrygIxlpWeVJHUVAhJmlABdRJVUYDNrEPnrpIiFkPxiyjjxWwQCsUJTwhxLqIeOpmiZvudXVKwnVemuPxVaukUsOsrsSQeNdzrulFTkYbfKpPgIqEwYqNgpqHFeQFRPbMAXwhcCxpvSGRplDCIMMnxbLnSEOUbvPFnTKhjxPOqZUBqnjdXRIIsqLGsqTlEKoojLexBteBDmXgVhMxaUSjVrGUgWbkfqpgPEhknVrUnmpCoycGhApByzvXmHWZJcJhbnwFamawjHlvDzaeFlQLvyjqxXFSQqGQxjeGflzeSGypsLijZfNnKxOpTJZQkgWPKBQqFDGalBjZBgyhGiCoBNAxXgRpxUNBtKnLaeOZGSqlzWicaqQgBDteQiksNCUarRYSAKBXyoiYsjUaxiaosDXHsYcLLEUpjdFNOHpFPrWdejBHPFDLltdQhlzjwYBrRAkfFCuXgQKtlcPPfPausmeUvrwnQBrNFApLjPTUqufLnJrdCksSCcOVetltZaghNmLTchETGpPhuCGgAFzuZjIFqglZPVCBlRqjLaGYHtxnsANfVzUTavdiKgicKWHRdlofsMgDdIlGhowLunxiFgJjEQTtKtftUDVJrykrkqsLaqjZeEuCTMXTDDkKFvKAApPKxqoOkhiWYhWaHRwGWWRnoEpGPmbRGdRDqqtyfQSXrzlQceTYFeWFSUXIgvjavdjnqFZLjhqKKZGEIEgRQQWwxTKLEsUrWqdvYqVXnIZXfzXmKBoOZtyGvYRTbhZoLHwUImdaqjpNXCJjJgpOQefSMnzLrCxZZoLcjTaIlzWJovIEPVsEzqoQxznxPIcWGYDIeGotegFqGneLDeVvqPpwvhQkdJcEpoNDkNJSkKwQpOWRuxjIFQDBtkqmBDmSxuGxUpgkkOUstlHHGcpjNsuZAjfAHHNkRWHjOzFTQKryFNhQaKGzr005/YGKBcGdELHIcIRdpHllGnDxxFdbMvYENMsnVyaGrzCEQlClLzlbBQTyNtrysvLHboFnQtWrTWdJxMuiycgPInCuGzzITtjdLjCGRuYnPWugOGmnGDSpgKMaKXSlmxDSFZHdAZFLZljytpOVFYmJaCSjQYVhdVTkPPvHglVpyXRiOMpJfZHBqnWWEfHvxKguKTaHkwnWBbcvkwBowhwNhoDhQqvhmDSmLrBIVaCkufPpgBZGFmUjGuHQdPmfXiaLDvexPAuUuoSIChLDBsPNZEuddahamZOmRXBDCBNBTVFdimjHwoQmCbWFTbBjjhqBSHzTZqTbJJAzymaYlfMIgIhkSzrZVCQcDxebYTXIrEkIyQcXuMKAZSMxAJsJvzRDrcxoAObihSaUxRxFHvWVXOObjHlaGOGttfNOIDUcszFVKvQSgcsHBEuxzwRmDpMuGoWDnRMKzjUfYToJDLfKJxgDnKwcTnRmIOWINAqxCZiaeCcvdTToUoSXxPVRDaSYekHgGYLfPYUyGgFLHgpcRHJHZROPnsdxSFNAcYIvrbQswSEbWjADlsAYGMOimuIvcMshBrNNYtcdehCzzlzGuxPXsGcoDHzamPezRvfduoKSFSWCJJFuYxsBnGfEZOSfWFxtqWpSdcNnyPyrwVPNWyQaDWKCYQeIZbxXrPatPCPnlixzxwaxJJJphyKtplRnSpbDfDrASIAFExAyITwSMFRdfhELCcRVARtaRSXjEihdZaZmFUerIkdQDuMZaarWbWeThwEcDHPUatUXWHfTBJaCJgjJiCvwgSrrhqRKuCleJHxGxujqeLTNDvGfvDnxznUNsEPzkudWnovfoudDuavzBOpuRzoSkNJAJMzXfdShYNtukpuxfjKLwQGfxWCUBClcAOjsTMsQkCzNaFSQlukCmoginaKnAdGMFjPzUuFncfYmQWzzhBGhCWIRkCQYPEQiAFeSwlLYssjNcVHcacrxjGudnPDXGCybKhPtk008/MqmGcLFhesqqROTDnTBqolaXFeLDMahTOmpuxcYoLLTRekLSnIHsOdizlxidvuJuoGGzCBIamVoLbsiWJDvrujPXYObyCYsVEMAcswIkUrqTooVIcVCRuHYmtdrHdxPAiLgOWQKEHIHpPeFethszwWYZGSYwhyjcRhtUWfUNtztPEmiTOUzWkaPSAYTnqhfUhUZTCFjRYMsUDYcxHwPFJBwrnDtGZanKJUmowHQHRAFXWSVMTSCSeCABjEbBafSAtWnzyrahNlDWEyeamWWZTPrRczQQOEKyQFHmFStKKeILqliysnUQcWtuofVQZtKdMQJQIqdueNchYpIaSRiWlkyekOZVMOyDGeWVTaAFKxYzHHOPTqNtllnnIqZjPFHkBmFXRkgFUVSwONvuNdIIzITlmzlRBtCucTLdTdPdmDhKVGZEPUSaVmujfFQgSHjmsXmNRqVeoAFBVJvDYuCoctatemUDgkttMZHRCccfWLZAluYOvLyqtxxgMSyjsBWpEYktbgFjZDMxJkrCsbTUlslfzWRhlGkhsDtwNwColLMVZKIdSxdowUcMwFnlmuaHrIBgiITbfUichnESKgnWkfMGwbJTMTMNFoqdElriwJoZfPgqYIHKVlbKRnqXwaACdOGSyjVHYltgacxzqOKfAZQzpWtOfpHbYsuEFVBobuIQXcFYdCxLRVkvrDpZCsjegdppuoRKqdZWGBUUARUAKKXcDiWMvWWCxcSDeLxcKvAIGGeZMolpVPUAqcDBNfPphPkBskEFzqzgDYDOViLsIwylMohMYmWXQjCREBSczgmoIlbkfXEAhmwuWcVRNuKmbUVjHIuQIhdzunSFcBriEdHeujDQyqcEOYOVohupMVloQWlbrFDRUKuTYugyKverkGPKSEeXwbtcbVjqWaijnTFbtHHmQVkHJrkRcwfWfWtlDoKDgCNZGOhokAuTsaZVleulSWpsUICIlUPNGyVLuCBPUKTowVHzISLvctEQ010
This is to ensure that the topics consume a lot of bytes in the output.
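To sanity-check what got created, the generated tenants and topics can be listed with pulsar-admin (here <tenant> and <namespace> are placeholders for the random names the scenario generates):

bin/pulsar-admin tenants list
bin/pulsar-admin topics list <tenant>/<namespace> | wc -l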
Metrics stopped working:
❯ curl -v http://localhost:8080/metrics/
* Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /metrics/ HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 11 Apr 2024 10:03:10 GMT
< Content-Type: text/plain;charset=utf-8
< Content-Length: 0
< Server: Jetty(9.4.54.v20240208)
<
* Connection #0 to host localhost left intact
In the above case, I noticed that the default scraping interval for Prometheus in the Apache Pulsar Helm chart was 10 seconds. It's possible that this scraping was part of how the broker got into the bad state where the metrics response is empty. It confirms that the current solution has multiple issues.
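For illustration, this is roughly what such a scrape configuration looks like on the Prometheus side (a sketch, not the chart's actual config; with a 10s interval, scrapes start piling up as soon as a single /metrics/ response takes longer than 10 seconds):

scrape_configs:
  - job_name: pulsar-broker
    scrape_interval: 10s
    scrape_timeout: 10s
    static_configs:
      - targets: ['pulsar-broker:8080']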
With the topics created by java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioCreateLongNamedTopics, this empty /metrics/ response issue reproduces consistently. After a broker restart, the problem starts happening after triggering the loading of all topics with java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioLoadAll.
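In the Helm deployment above, the broker can be restarted by deleting its pod (reusing the same label selector as the port-forward command earlier):

kubectl delete pod -n pulsar -l component=broker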
Ran out of heap in the test, trying with more heap. Also got java.io.IOException: Packet len 15589885 is out of range! (the packet exceeds ZooKeeper's default jute.maxbuffer limit of about 1 MB), so I'm increasing jute.maxbuffer:
affinity:
  anti_affinity: false
bookkeeper:
  resources:
    requests:
      memory: 800m
      cpu: 1
  configData:
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms128m -Xmx400m -XX:MaxDirectMemorySize=300m
broker:
  replicaCount: 1
  resources:
    requests:
      memory: 2500m
      cpu: 1
  configData:
    metricsBufferResponse: "true"
    zookeeperSessionExpiredPolicy: "shutdown"
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms128m -Xmx1024m -XX:MaxDirectMemorySize=1024m
proxy:
  replicaCount: 1
zookeeper:
  resources:
    requests:
      memory: 600m
      cpu: 1
  configData:
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms200m -Xmx500m
kube-prometheus-stack:
  prometheus-node-exporter:
    hostRootFsMount:
      enabled: false
I think it might be caused by the long duration of metrics scraping? I remember that generating some of the metrics requires reading from BookKeeper.
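The scrape duration is easy to measure directly, for example with curl's timing output:

curl -s -o /dev/null -w 'total: %{time_total}s, size: %{size_download} bytes\n' http://localhost:8080/metrics/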
#14453 was introduced by me. It is for the purpose of improvement: in some cases, the /metrics endpoint might be requested more than once in a short period, so I want to cache the metrics and return them directly if there are other requests in that period.
@dao-jun yes, that helps to some level. I'll be refactoring and improving it further to address the problems that remain (for example, the infinite queuing of requests).
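To make the caching idea concrete, here is a minimal sketch in Java of serving a buffered metrics payload to concurrent scrapers. This is an illustration of the approach, not Pulsar's actual implementation; BufferedMetricsCache and its members are hypothetical names:

import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class BufferedMetricsCache {
    // Buffered payload plus its creation timestamp.
    private record Entry(byte[] payload, long createdAtMillis) {}

    private final long ttlMillis;
    private final Supplier<String> generator; // expensive full metrics generation
    private final AtomicReference<Entry> cached = new AtomicReference<>();

    BufferedMetricsCache(long ttlMillis, Supplier<String> generator) {
        this.ttlMillis = ttlMillis;
        this.generator = generator;
    }

    byte[] get() {
        Entry entry = cached.get();
        if (entry != null && System.currentTimeMillis() - entry.createdAtMillis() < ttlMillis) {
            return entry.payload(); // fresh enough: serve the buffered copy
        }
        synchronized (this) {
            entry = cached.get(); // re-check after acquiring the lock
            if (entry == null || System.currentTimeMillis() - entry.createdAtMillis() >= ttlMillis) {
                byte[] payload = generator.get().getBytes(StandardCharsets.UTF_8);
                entry = new Entry(payload, System.currentTimeMillis());
                cached.set(entry);
            }
            return entry.payload();
        }
    }
}

Only one thread regenerates the output per TTL window; everyone else gets the buffered bytes, which is why the buffered mode keeps up under concurrent scraping while the unbuffered path queues up.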
Storing this in load_test.js:
import http from 'k6/http';
import { sleep } from 'k6';

export let options = {
  vus: 100,          // Number of virtual users
  iterations: 10000, // Total number of request iterations
};

export default function () {
  http.get('http://localhost:8080/metrics/');
}
Running it with k6:
brew install k6
k6 run load_test.js
It reproduces the queuing issue, at least when metricsBufferResponse isn't enabled. There are multiple problems showing up in testing with Pulsar 3.0.4 in a local k8s environment.
I was able to stretch it more by running in standalone with more memory:
PULSAR_MEM="-Xms2g -Xmx4g -XX:MaxDirectMemorySize=6g" PULSAR_GC="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+AlwaysPreTouch" PULSAR_EXTRA_OPTS="-Djute.maxbuffer=20000000" PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log
When running the k6 test, it can be seen that response times increase until they hit 30 seconds, which is the request timeout in k6. The response time increases linearly (successive response times in milliseconds):
582 1121 1577 2033 2499 2970 3440 3917 4398 4859 5330 5806 6269 6763 7199 7647 8097 8561 9015 9482 9916 10372 10820 11252 11708 12160 12669 13117 13554 13991 14504 14966 15411 15901 16364 16832 17270 17745 18215 18690 19157 19589 20055 20538 20974 21409 21884 22323 22787 23247 23750 24188 24629 25065 25531 26008 26593 27133 27723 28303 28879 29439
This can be seen in https://gist.github.com/lhotari/137be0ab5d16101f0a88e6628e1d34b3. Even after k6 is closed, the request processing continues. On some of the orphaned requests, the response size is reported to be the full size of the output. That means that the metrics have been re-generated for the queued and orphaned requests. For example, 384122607 (384 MB) is the response size of the full output.
2024-04-11T19:29:13,672+0300 [prometheus-stats-51-1] ERROR org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsServlet - Failed to send metrics, likely the client or this server closed the connection due to a timeout (373 ms elapsed): org.eclipse.jetty.io.EofException
2024-04-11T19:29:13,672+0300 [prometheus-stats-51-1] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [11/Apr/2024:19:28:48 +0300] "GET /metrics/ HTTP/1.1" 200 384122607 "-" "k6/0.50.0 (https://k6.io/)" 25073
2024-04-11T19:29:14,077+0300 [prometheus-stats-51-1] ERROR org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsServlet - Failed to send metrics, likely the client or this server closed the connection due to a timeout (405 ms elapsed): org.eclipse.jetty.io.EofException
2024-04-11T19:29:14,077+0300 [prometheus-stats-51-1] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [11/Apr/2024:19:28:49 +0300] "GET /metrics/ HTTP/1.1" 200 0 "-" "k6/0.50.0 (https://k6.io/)" 24889
Tested standalone with the metricsBufferResponse=true optimization created in #14453:
PULSAR_PREFIX_metricsBufferResponse=true docker/pulsar/scripts/apply-config-from-env.py conf/standalone.conf
PULSAR_MEM="-Xms2g -Xmx4g -XX:MaxDirectMemorySize=6g" PULSAR_GC="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+AlwaysPreTouch" PULSAR_EXTRA_OPTS="-Djute.maxbuffer=20000000" PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log
It seems that it's able to keep up even with the 100 virtual user k6 test introduced above.
The complete 10000-request test passed, downloading 3.6 TB of metrics data in under 10 minutes (this was with the local Pulsar standalone on a Mac M3 Max). That works out to roughly 360 MB per response, consistent with the ~384 MB full-output size observed earlier.
❯ k6 run load_test.js
execution: local
script: load_test.js
output: -
scenarios: (100.00%) 1 scenario, 100 max VUs, 10m30s max duration (incl. graceful stop):
* default: 10000 iterations shared among 100 VUs (maxDuration: 10m0s, gracefulStop: 30s)
data_received..................: 3.6 TB 6.2 GB/s
data_sent......................: 880 kB 1.5 kB/s
http_req_blocked...............: avg=44.33µs min=1µs med=3µs max=5.5ms p(90)=4.1µs p(95)=6µs
http_req_connecting............: avg=35.38µs min=0s med=0s max=4.19ms p(90)=0s p(95)=0s
http_req_duration..............: avg=5.74s min=196.38ms med=5.74s max=6.34s p(90)=5.99s p(95)=6.14s
{ expected_response:true }...: avg=5.74s min=196.38ms med=5.74s max=6.34s p(90)=5.99s p(95)=6.14s
http_req_failed................: 0.00% ✓ 0 ✗ 10000
http_req_receiving.............: avg=66.89ms min=47.39ms med=67.33ms max=192.11ms p(90)=76.73ms p(95)=82.04ms
http_req_sending...............: avg=24.05µs min=3µs med=14µs max=5.43ms p(90)=26µs p(95)=43µs
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s
http_req_waiting...............: avg=5.67s min=4.25ms med=5.67s max=6.25s p(90)=5.92s p(95)=6.07s
http_reqs......................: 10000 17.325729/s
iteration_duration.............: avg=5.74s min=197.65ms med=5.74s max=6.34s p(90)=5.99s p(95)=6.14s
iterations.....................: 10000 17.325729/s
vus............................: 3 min=3 max=100
vus_max........................: 100 min=100 max=100
running (09m37.2s), 000/100 VUs, 10000 complete and 0 interrupted iterations
default ✓ [======================================] 100 VUs 09m37.2s/10m0s 10000/10000 shared iters