100% CPU Usage on Bookies (about pulsar, CLOSED)

apache commented on July 1, 2024

100% CPU Usage on Bookies

Comments (35)

merlimat avatar merlimat commented on July 1, 2024

Ok, there have been several changes in the Bookie RocksDB index that got picked up in 1.15, but we haven't updated the cluster settings in conf/bookkeeper.conf at the same time.

These are the new config settings with their default values:

dbStorage_rocksDB_writeBufferSizeMB=64
dbStorage_rocksDB_sstSizeInMB=64
dbStorage_rocksDB_blockSize=65536
dbStorage_rocksDB_bloomFilterBitsPerKey=10
dbStorage_rocksDB_blockCacheSize=268435456 # 256 MBytes
dbStorage_rocksDB_numLevels=-1
dbStorage_rocksDB_numFilesInLevel0=4
dbStorage_rocksDB_maxSizeInLevel1MB=256

In production we're using the following values; you can adjust the memory sizes to your machines.

# This is where the data is stored before flushing. This will come from JVM direct memory.
dbStorage_writeCacheMaxSizeMb=8192

# Read-ahead cache to speed up backlog draining. From direct memory
dbStorage_readAheadCacheMaxSizeMb=4096

# Give 4GBytes to RocksDB block cache. Ideally this cache should be big enough
# to contain most of the locations index data/bookkeeper/ledgers/current/locations/
dbStorage_rocksDB_blockCacheSize=4294967296

In particular, I believe the blockCacheSize is what is affecting the performance in your case.

I'll update the conf files to include the new config variables.

sschepens commented on July 1, 2024

Hmm, tried those configs, still seeing 100% CPU usage.

merlimat commented on July 1, 2024

How big is the DB directory data/bookkeeper/ledgers/current/locations/? Are you using SSDs or HDDs?

Can you enable bookie stats?

For example, to report stats to SLF4J logger, add this to bookkeeper.conf :

statsProviderClass=org.apache.bookkeeper.stats.CodahaleMetricsProvider
codahaleStatsSlf4jEndpoint=stats

This should report the stats every 1 minute in the bookie log file.

One other thing you can try is to reduce the read-ahead cache, even try to disable it:

dbStorage_readAheadCacheBatchSize=0

DongbinNie commented on July 1, 2024

I might be hitting the same issue with the latest master branch build, running on SAS disk servers with the default bookkeeper configuration; the data/bookkeeper/ledgers/current/locations/ directory is about 100MB.

And the measured throughput with the perf consumer is very strange:
2016-11-16 10:22:40,138 - INFO - [main:PerformanceConsumer@256] - Throughput received: 199.998 msg/s -- 0.391 Mbit/s
2016-11-16 10:22:50,138 - INFO - [main:PerformanceConsumer@256] - Throughput received: 199.979 msg/s -- 0.391 Mbit/s
2016-11-16 10:23:00,139 - INFO - [main:PerformanceConsumer@256] - Throughput received: 99.995 msg/s -- 0.195 Mbit/s
2016-11-16 10:23:10,139 - INFO - [main:PerformanceConsumer@256] - Throughput received: 199.990 msg/s -- 0.391 Mbit/s
2016-11-16 10:23:20,140 - INFO - [main:PerformanceConsumer@256] - Throughput received: 99.995 msg/s -- 0.195 Mbit/s

Is there any rate limiter in the processing path (bookie, broker, client)?

Some bookies max out a full CPU core with the following stack:
"BookieReadThread-3181-9-1" #20 prio=5 os_prio=0 tid=0x00007f26fd35d800 nid=0x56ef runnable [0x00007f25a3e8e000]
java.lang.Thread.State: RUNNABLE
at org.rocksdb.RocksDB.get(Native Method)
at org.rocksdb.RocksDB.get(RocksDB.java:656)
at org.apache.bookkeeper.bookie.storage.ldb.KeyValueStorageRocksDB.get(KeyValueStorageRocksDB.java:202)
at org.apache.bookkeeper.bookie.storage.ldb.EntryLocationIndex.getLocation(EntryLocationIndex.java:82)
at org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage.fillReadAheadCache(DbLedgerStorage.java:429)
at org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage.getEntry(DbLedgerStorage.java:404)
at org.apache.bookkeeper.bookie.LedgerDescriptorImpl.readEntry(LedgerDescriptorImpl.java:90)
at org.apache.bookkeeper.bookie.Bookie.readEntry(Bookie.java:1080)
at org.apache.bookkeeper.proto.ReadEntryProcessor.processPacket(ReadEntryProcessor.java:72)
at org.apache.bookkeeper.proto.PacketProcessorBase.safeRun(PacketProcessorBase.java:82)
at org.apache.bookkeeper.util.SafeRunnable.run(SafeRunnable.java:31)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)

I looked at BookKeeper's code a little bit, and it seems that org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage.fillReadAheadCache(long, long, long) is not very efficient: it looks up every entry's location from RocksDB individually instead of using RocksDB's iterator interface to retrieve the locations in a batch.
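
For illustration only, here is a minimal sketch of the iterator-based batch lookup being suggested, using the RocksDB Java API. The key/value layout assumed below (ledgerId and entryId packed as two big-endian longs, with an 8-byte location as the value) is an assumption about how the location index stores its data; this is not the actual BookKeeper code:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

public class ReadAheadLocationScanSketch {

    // Assumed key layout: 8-byte ledgerId followed by 8-byte entryId, big-endian.
    static byte[] key(long ledgerId, long entryId) {
        return ByteBuffer.allocate(16).putLong(ledgerId).putLong(entryId).array();
    }

    // Collect up to maxCount (entryId, location) pairs for ledgerId, starting at firstEntryId.
    static List<long[]> scanLocations(RocksDB db, long ledgerId, long firstEntryId, int maxCount) {
        List<long[]> locations = new ArrayList<>();
        try (RocksIterator it = db.newIterator()) {
            // One seek, then sequential next() calls, instead of one point get() per entry.
            it.seek(key(ledgerId, firstEntryId));
            while (it.isValid() && locations.size() < maxCount) {
                ByteBuffer k = ByteBuffer.wrap(it.key());
                long lid = k.getLong();
                long eid = k.getLong();
                if (lid != ledgerId) {
                    break; // keys are sorted, so we have moved past this ledger
                }
                long location = ByteBuffer.wrap(it.value()).getLong();
                locations.add(new long[] { eid, location });
                it.next();
            }
        }
        return locations;
    }
}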

Can I disable rocksdb?

Thanks

DongbinNie commented on July 1, 2024

I modified the perf consumer to print out throughput every second; it seems related to the default dbStorage_readAheadCacheBatchSize=1000:

2016-11-16 17:16:12,415 - INFO - [main:PerformanceConsumer@263] - Throughput received: 999.650 msg/s -- 1.952 Mbit/s
2016-11-16 17:16:13,416 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:14,416 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:15,416 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:16,417 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:17,417 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:18,418 - INFO - [main:PerformanceConsumer@263] - Throughput received: 999.642 msg/s -- 1.952 Mbit/s
2016-11-16 17:16:19,418 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:20,418 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:21,419 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:22,419 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:23,420 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:24,420 - INFO - [main:PerformanceConsumer@263] - Throughput received: 999.599 msg/s -- 1.952 Mbit/s
2016-11-16 17:16:25,420 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:26,421 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:27,421 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:28,421 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:29,422 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:30,422 - INFO - [main:PerformanceConsumer@263] - Throughput received: 999.589 msg/s -- 1.952 Mbit/s
2016-11-16 17:16:31,423 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:32,423 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:33,423 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:34,424 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:35,424 - INFO - [main:PerformanceConsumer@263] - Throughput received: 0.000 msg/s -- 0.000 Mbit/s
2016-11-16 17:16:36,425 - INFO - [main:PerformanceConsumer@263] - Throughput received: 999.622 msg/s -- 1.952 Mbit/s

DongbinNie commented on July 1, 2024

I also tried increasing dbStorage_rocksDB_blockCacheSize to 512M; the throughput is still the same and the CPU usage is still 100%.

After enabling stats, the log shows:
2016-11-16 16:56:37,546 - INFO [metrics-logger-reporter-thread-1:MarkerIgnoringBase@107] - type=TIMER, name=bookie.storage.read-cache-hits, count=21671, min=6.879999999999999E-4, max=0.035417, mean=0.0015175175097276263, stddev=0.0019345367031356107, median=0.001241, p75=0.0014857499999999999, p95=0.002438899999999998, p98=0.0035251799999999945, p99=0.005444550000000007, p999=0.03535641900000001, mean_rate=22.58734472301873, m1=84.39411867058821, m5=48.11240752194036, m15=20.77283687896956, rate_unit=events/second, duration_unit=milliseconds
2016-11-16 16:56:37,546 - INFO [metrics-logger-reporter-thread-1:MarkerIgnoringBase@107] - type=TIMER, name=bookie.storage.read-cache-hits-fail, count=0, min=0.0, max=0.0, mean=0.0, stddev=0.0, median=0.0, p75=0.0, p95=0.0, p98=0.0, p99=0.0, p999=0.0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/second, duration_unit=milliseconds
2016-11-16 16:56:37,546 - INFO [metrics-logger-reporter-thread-1:MarkerIgnoringBase@107] - type=TIMER, name=bookie.storage.read-cache-misses, count=44, min=5377.533544999999, max=8165.4839249999995, mean=6163.4561557045445, stddev=682.8165566878679, median=5878.1123275, p75=6319.618259999999, p95=7783.444525749999, p98=8165.4839249999995, p99=8165.4839249999995, p999=8165.4839249999995, mean_rate=0.04586050502350609, m1=0.16778861957075686, m5=0.09613923674913623, m15=0.041518573863973486, rate_unit=events/second, duration_unit=milliseconds
2016-11-16 16:56:37,546 - INFO [metrics-logger-reporter-thread-1:MarkerIgnoringBase@107] - type=TIMER, name=bookie.storage.read-cache-misses-fail, count=0, min=0.0, max=0.0, mean=0.0, stddev=0.0, median=0.0, p75=0.0, p95=0.0, p98=0.0, p99=0.0, p999=0.0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/second, duration_unit=milliseconds
2016-11-16 16:56:37,546 - INFO [metrics-logger-reporter-thread-1:MarkerIgnoringBase@107] - type=TIMER, name=bookie.storage.read-entry, count=21739, min=0.001761, max=6258.103631, mean=11.848359065175098, stddev=268.50722523997473, median=0.002434, p75=0.00310275, p95=0.004722399999999999, p98=0.0058926999999999955, p99=0.007628020000000005, p999=6248.241737416001, mean_rate=22.658196992965564, m1=84.5629030485838, m5=48.20859873250287, m15=20.814361490027295, rate_unit=events/second, duration_unit=milliseconds
2016-11-16 16:56:37,547 - INFO [metrics-logger-reporter-thread-1:MarkerIgnoringBase@107] - type=TIMER, name=bookie.storage.read-entry-fail, count=1, min=0.324559, max=0.324559, mean=0.324559, stddev=0.0, median=0.324559, p75=0.324559, p95=0.324559, p98=0.324559, p99=0.324559, p999=0.324559, mean_rate=0.0010422832855265462, m1=1.383502708828376E-4, m5=0.0012784533730485269, m15=8.072816662657008E-4, rate_unit=events/second, duration_unit=milliseconds
2016-11-16 16:56:37,547 - INFO [metrics-logger-reporter-thread-1:MarkerIgnoringBase@107] - type=TIMER, name=bookie.storage.readahead-batch-count, count=44, min=1000.0, max=1000.0, mean=1000.0, stddev=0.0, median=1000.0, p75=1000.0, p95=1000.0, p98=1000.0, p99=1000.0, p999=1000.0, mean_rate=0.045860470597819974, m1=0.16778861957075686, m5=0.09613923674913623, m15=0.041518573863973486, rate_unit=events/second, duration_unit=milliseconds
2016-11-16 16:56:37,547 - INFO [metrics-logger-reporter-thread-1:MarkerIgnoringBase@107] - type=TIMER, name=bookie.storage.readahead-batch-count-fail, count=0, min=0.0, max=0.0, mean=0.0, stddev=0.0, median=0.0, p75=0.0, p95=0.0, p98=0.0, p99=0.0, p999=0.0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/second, duration_unit=milliseconds
2016-11-16 16:56:37,547 - INFO [metrics-logger-reporter-thread-1:MarkerIgnoringBase@107] - type=TIMER, name=bookie.storage.readahead-batch-size, count=44, min=332000.0, max=332000.0, mean=332000.0, stddev=0.0, median=332000.0, p75=332000.0, p95=332000.0, p98=332000.0, p99=332000.0, p999=332000.0, mean_rate=0.04586046636091089, m1=0.16778861957075686, m5=0.09613923674913623, m15=0.041518573863973486, rate_unit=events/second, duration_unit=milliseconds
2016-11-16 16:56:37,547 - INFO [metrics-logger-reporter-thread-1:MarkerIgnoringBase@107] - type=TIMER, name=bookie.storage.readahead-batch-size-fail, count=0, min=0.0, max=0.0, mean=0.0, stddev=0.0, median=0.0, p75=0.0, p95=0.0, p98=0.0, p99=0.0, p999=0.0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/second, duration_unit=milliseconds
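
(As a rough cross-check, these stats are consistent with the drain rate seen above: each read-cache miss triggers a read-ahead batch of 1000 entries (bookie.storage.readahead-batch-count), and a miss takes about 6.2 s on average (bookie.storage.read-cache-misses mean ≈ 6163 ms), so the steady-state rate is roughly 1000 / 6.2 ≈ 160 entries/s, matching the 100-200 msg/s bursts reported by the perf consumer.)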

DongbinNie commented on July 1, 2024

I changed ledgerStorageClass=org.apache.bookkeeper.bookie.SortedLedgerStorage in bookkeeper.conf, the throughput is healthy now.

sschepens commented on July 1, 2024

I changed ledgerStorageClass=org.apache.bookkeeper.bookie.SortedLedgerStorage in bookkeeper.conf, the throughput is healthy now.

Seems to work, thanks!

For example, to report stats to SLF4J logger, add this to bookkeeper.conf :

statsProviderClass=org.apache.bookkeeper.stats.CodahaleMetricsProvider
codahaleStatsSlf4jEndpoint=stats

@merlimat couldn't get this to work, Bookkeeper is complaining it can't find that class.

merlimat commented on July 1, 2024

@merlimat couldn't get this to work, Bookkeeper is complaining it can't find that class.

Ouch, I think that's because we are not including the dependency on org.apache.bookkeeper:codahale-metrics-provider in the distribution tgz.

merlimat commented on July 1, 2024

I changed ledgerStorageClass=org.apache.bookkeeper.bookie.SortedLedgerStorage in bookkeeper.conf, the throughput is healthy now.

@sschepens @DongbinNie If you don't have many topics per single bookie (< 10K), SortedLedgerStorage is probably a very good choice.

If you don't want to wipe out the bookie, you can convert the indexes between Sorted/Interleaved storage and RocksDB with these commands:

# RocksDB to interleaved/sorted storage
bin/bookkeeper shell convert-to-interleaved-storage

# Interleaved to RocksDB
bin/bookkeeper shell convert-to-db-storage 

sschepens commented on July 1, 2024

@merlimat what other choice do we have? The default storage is now triggering really bad performance.

merlimat commented on July 1, 2024

The default storage implementation in Apache BookKeeper is SortedLedgerStorage. The "older" implementation was the InterleavedLedgerStorage, which is still available and has the same index format as SortedLedgerStorage. The two are pretty much interchangeable, with the main difference being that SortedLedgerStorage maintains an in-memory write cache of entries and sorts them before flushing to the entry log. This gives better read throughput when draining backlog.

At Yahoo we have developed DbLedgerStorage, which shares some of the goals of SortedLedgerStorage but is designed to scale to 100s of thousands of active ledgers per bookie. Again, we have a write cache / read cache and we keep the offset indexes in RocksDB. This implementation is quite stable and has been used in prod for a long time. There may be some configuration parameters that need to be adjusted for different machine specs.

Can you share the exact steps of your test, so that I can try to reproduce the bad behavior locally?

DongbinNie commented on July 1, 2024

Can you share the exact steps of your test, so that I can try to reproduce the bad behavior locally?

It can be reproduced easily in my env (with SAS disks) with the following steps:

  1. starting perf-consumer: bin/pulsar-perf consume -n 1 -s sub-0 -q 1000 -u foo_url bar_topicUrl
  2. starting perf-producer: bin/pulsar-perf produce -n 20 -r 10000 -s 256 -u foo_url bar_topicUrl
  3. after some messages have been received on the consumer side, stop the consumer
  4. let the producer run for some time, e.g. 30 mins
  5. restart the perf-consumer: bin/pulsar-perf consume -n 1 -s sub-0 -q 1000 -u foo_url bar_topicUrl

There may be some configuration parameters that need to be adjusted for different machine specs.

Regardless of the configuration, I believe there can be some enhancement in org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage.fillReadAheadCache(long, long, long): it would be much more efficient to use RocksDB's iterator to take advantage of the DB's range-scan ability, instead of seeking for every location.

merlimat commented on July 1, 2024

Regardless of the configuration, I believe there can be some enhancement in org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage.fillReadAheadCache(long, long, long): it would be much more efficient to use RocksDB's iterator to take advantage of the DB's range-scan ability, instead of seeking for every location.

The best option there is actually to completely bypass the RocksDB reads and just scan the BK entry log file, stopping when an entry for a different ledger is found (or when the max batch count of entries has been read).
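
A rough sketch of that entry-log-scan idea, with the on-disk layout assumed for illustration only (a 4-byte length prefix per entry, followed by the ledgerId long at the start of the entry); the real entry-log format and the real DbLedgerStorage code may differ:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

class EntryLogReadAheadSketch {

    // Read entries sequentially from startOffset and stop as soon as an entry
    // for a different ledger shows up, or the batch is full.
    static List<ByteBuffer> readAhead(Path entryLog, long startOffset,
                                      long ledgerId, int maxCount) throws IOException {
        List<ByteBuffer> entries = new ArrayList<>();
        try (FileChannel ch = FileChannel.open(entryLog, StandardOpenOption.READ)) {
            long offset = startOffset;
            while (entries.size() < maxCount && offset + 4 <= ch.size()) {
                ByteBuffer sizeBuf = ByteBuffer.allocate(4);
                ch.read(sizeBuf, offset);
                sizeBuf.flip();
                int entrySize = sizeBuf.getInt();

                ByteBuffer entry = ByteBuffer.allocate(entrySize);
                ch.read(entry, offset + 4);
                entry.flip();

                if (entry.getLong(0) != ledgerId) {
                    break; // next entry belongs to another ledger: stop the read-ahead
                }
                entries.add(entry);
                offset += 4 + entrySize;
            }
        }
        return entries;
    }
}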

DongbinNie commented on July 1, 2024

I have created a PR https://github.com/yahoo/bookkeeper/pull/1/files that modifies fillReadAheadCache to use RocksDB's iterator; after applying the patch in my env, the read QPS goes from 300 to 30,000.

The compiled jar is attached for quick evaluation; please rename bookkeeper-server-4.3.1.41-yahoo-iterator.zip to bookkeeper-server-4.3.1.41-yahoo-iterator.jar.

bookkeeper-server-4.3.1.41-yahoo-iterator.zip

merlimat commented on July 1, 2024

@DongbinNie Nice find. This looks like a perf regression introduced in bookkeeper-server-4.3.1.41-yahoo, which was included in v1.15.
The read-ahead code itself was not changed significantly, but previously there was an in-memory cache of the offsets that was skipping the RocksDB get calls.

Your change looks good, though I'd prefer to scan the entry log directly and avoid RocksDB interaction in the read-ahead logic, as the iterators themselves can perform slowly under certain conditions.

I'm testing out the change right now and will update here.

merlimat commented on July 1, 2024

I just ran the tests with the change that avoids doing RocksDB get() calls when filling the read-ahead cache, and I can clearly see the improvement, though I am seeing very different baseline numbers on my hardware/config:

Conditions:

  • 1 producer / 1 consumer
  • 200 GB of backlog
  • 1 KB message sizes
  • 50 K writes /s
  • No batches (batches make the problem too easy for the bookie; the consumer can easily fetch ~300K msg/s without any change)

Backlog draining:

  • Before: 34 K reads/s
  • After: 58 K reads/s

merlimat commented on July 1, 2024

Merged the fix in master. @sschepens @DongbinNie can you try again to check whether it improves backlog draining in your setup?

DongbinNie commented on July 1, 2024

Merged the fix in master. @sschepens @DongbinNie can you try again to check whether it improves backlog draining in your setup?

I tested in my env; the consumer QPS goes up to 30,000 with the latest master branch build.
I'm wondering about another case: if there were many topics being written, would performance suffer because fillReadAheadCache may only fetch a few entries each time? (I'm not familiar with BookKeeper yet.)

merlimat commented on July 1, 2024

Entries are expected to be mostly sequential on the bookie entry logs, and that is why we have introduced the read-ahead cache.

The reason is that when a write reaches the bookie, first it is appended and synced on the journal (for durability) and then inserted into the write cache.

The write cache is in memory in the bookie process and it gets flushed periodically (default 1 min). When flushing, we sort the entries by ledgerId/entryId, so that messages for the same topic are stored sequentially on disk for that 1 min interval.

You can try publishing and consuming on many topics by passing the "-t 100" flag on the perf tools.
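
A toy illustration of the sort-on-flush behaviour described above, with a hypothetical CachedEntry type (this is not the actual DbLedgerStorage code): sorting the write cache by (ledgerId, entryId) before appending means a backlog read for one ledger mostly hits contiguous regions of the entry log.

import java.util.Comparator;
import java.util.List;

class WriteCacheFlushSketch {

    // Hypothetical in-memory representation of a cached entry.
    record CachedEntry(long ledgerId, long entryId, byte[] payload) {}

    // Sort the in-memory write cache by (ledgerId, entryId) before appending to
    // the entry log, so entries of the same ledger/topic land next to each other
    // on disk for this flush interval.
    static void flush(List<CachedEntry> writeCache) {
        writeCache.sort(Comparator
                .comparingLong(CachedEntry::ledgerId)
                .thenComparingLong(CachedEntry::entryId));
        for (CachedEntry e : writeCache) {
            appendToEntryLog(e);
        }
        writeCache.clear();
    }

    static void appendToEntryLog(CachedEntry e) {
        // placeholder: real code would append to the entry log and update the location index
    }
}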

merlimat commented on July 1, 2024

Closing this one. I will open a new PR with documentation for all the DbStorage configuration parameters.

sschepens commented on July 1, 2024

@merlimat still seeing strange performance on nodes when autorecovery is running; some nodes go to 100% CPU utilization and are apparently hung on reading from RocksDB.
What seems strange is that we have configured 4GB of read-ahead cache, but it doesn't even seem to be using it; processes are consuming about 2GB of memory in total. The RocksDB block cache size is also 4GB.

sschepens commented on July 1, 2024

@merlimat Found something curious: the bookies were logging Entry location cache max size: 16 MB.
You recommended increasing dbStorage_rocksDB_blockCacheSize to allow the locations to fit, but it seems there's another configuration for that (dbStorage_entryLocationCacheMaxSizeMb), which defaults to 16MB.
Increasing this value to 4GB immediately caused bookie CPU usage to drop significantly.

Edit: dbStorage_entryLocationCacheMaxSizeMb seems to not be used; our bookies continue to give really low throughput.

sschepens commented on July 1, 2024

CPU is down now; even so, read throughput is still extremely low...
This seemed to happen when we restarted our bookkeepers.

We're seeing a spike in throughput when we restart bookies, but then it slowly degrades. Here's a graph of the read throughput of a consumer on a partitioned topic after restarting the bookies:

[screenshot from 2016-11-22 15-33-24: read throughput graph]

merlimat commented on July 1, 2024

@sschepens dbStorage_entryLocationCacheMaxSizeMb was actually unused in 1.15. I have removed the key from the BK branch. Can you paste all the config settings you are currently using?

If you re-do the initial test of backlog draining, without restarting bookies, does it show higher read throughput?

sschepens commented on July 1, 2024

To give some insight: before we restarted our bookies and started observing this behavior, we were consistently seeing read throughput double the spike shown in the graph.

merlimat commented on July 1, 2024

So, restarting the bookies generally triggers auto-recovery for each restarted bookie.

To avoid that, in a controlled restart you should be disabling auto-recovery (bin/bookkeeper shell autorecovery -disable) and then re-enable when the bookie is back up (bin/bookkeeper shell autorecovery -enable)

sschepens commented on July 1, 2024

This is our current config file.
We're in the process of re-running a test on a new topic.

#
# Copyright 2016 Yahoo Inc.
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#     http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# 

## Bookie settings

# Port that bookie server listen on
bookiePort=3181

# Set the network interface that the bookie should listen on.
# If not set, the bookie will listen on all interfaces.
#listeningInterface=eth0

# Whether the bookie allowed to use a loopback interface as its primary
# interface(i.e. the interface it uses to establish its identity)?
# By default, loopback interfaces are not allowed as the primary
# interface.
# Using a loopback interface as the primary interface usually indicates
# a configuration error. For example, its fairly common in some VPS setups
# to not configure a hostname, or to have the hostname resolve to
# 127.0.0.1. If this is the case, then all bookies in the cluster will
# establish their identities as 127.0.0.1:3181, and only one will be able
# to join the cluster. For VPSs configured like this, you should explicitly
# set the listening interface.
#allowLoopback=false

# Directory Bookkeeper outputs its write ahead log
journalDirectory=data/bookkeeper/journal

# Directory Bookkeeper outputs ledger snapshots
# could define multi directories to store snapshots, separated by ','
# For example:
# ledgerDirectories=/tmp/bk1-data,/tmp/bk2-data
# 
# Ideally ledger dirs and journal dir are each on a different device,
# which reduces the contention between random i/o and sequential writes.
# It is possible to run with a single disk, but performance will be significantly lower.
ledgerDirectories=data/bookkeeper/ledgers
# Directories to store index files. If not specified, will use ledgerDirectories to store.
# indexDirectories=data/bookkeeper/ledgers

# Ledger Manager Class
# What kind of ledger manager is used to manage how ledgers are stored, managed
# and garbage collected. Try to read 'BookKeeper Internals' for detail info.
ledgerManagerType=hierarchical

# Root zookeeper path to store ledger metadata
# This parameter is used by zookeeper-based ledger manager as a root znode to
# store all ledgers.
# zkLedgersRootPath=/ledgers

# Ledger storage implementation class
ledgerStorageClass=org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage

# Enable/Disable entry logger preallocation
# entryLogFilePreallocationEnabled=true

# Max file size of entry logger, in bytes
# A new entry log file will be created when the old one reaches the file size limitation
# logSizeLimit=2147483648

# Threshold of minor compaction
# For those entry log files whose remaining size percentage reaches below
# this threshold will be compacted in a minor compaction.
# If it is set to less than zero, the minor compaction is disabled.
# minorCompactionThreshold=0.2

# Interval to run minor compaction, in seconds
# If it is set to less than zero, the minor compaction is disabled. 
# minorCompactionInterval=3600

# Threshold of major compaction
# For those entry log files whose remaining size percentage reaches below
# this threshold will be compacted in a major compaction.
# Those entry log files whose remaining size percentage is still
# higher than the threshold will never be compacted.
# If it is set to less than zero, the minor compaction is disabled.
majorCompactionThreshold=0.5

# Interval to run major compaction, in seconds
# If it is set to less than zero, the major compaction is disabled. 
# majorCompactionInterval=86400 

# Set the maximum number of entries which can be compacted without flushing.
# When compacting, the entries are written to the entrylog and the new offsets
# are cached in memory. Once the entrylog is flushed the index is updated with
# the new offsets. This parameter controls the number of entries added to the
# entrylog before a flush is forced. A higher value for this parameter means
# more memory will be used for offsets. Each offset consists of 3 longs.
# This parameter should _not_ be modified unless you know what you're doing.
# The default is 100,000.
#compactionMaxOutstandingRequests=100000

# Set the rate at which compaction will re-add entries. The unit is adds per second.
#compactionRate=1000

# Throttle compaction by bytes or by entries. 
#isThrottleByBytes=false

# Set the rate at which compaction will re-add entries. The unit is adds per second.
#compactionRateByEntries=1000

# Set the rate at which compaction will re-add entries. The unit is bytes added per second.
#compactionRateByBytes=1000000

# Max file size of journal file, in mega bytes
# A new journal file will be created when the old one reaches the file size limitation
#
# journalMaxSizeMB=2048

# Max number of old journal files to keep
# Keeping a number of old journal files can help data recovery in special cases
#
# journalMaxBackups=5

# How much space should we pre-allocate at a time in the journal
# journalPreAllocSizeMB=16

# Size of the write buffers used for the journal
# journalWriteBufferSizeKB=64

# Should we remove pages from page cache after force write
journalRemoveFromPageCache=true

# Should we group journal force writes, which optimize group commit
# for higher throughput
# journalAdaptiveGroupWrites=true

# Maximum latency to impose on a journal write to achieve grouping
journalMaxGroupWaitMSec=1

# All the journal writes and commits should be aligned to given size
journalAlignmentSize=4096

# Maximum writes to buffer to achieve grouping
# journalBufferedWritesThreshold=524288

# If we should flush the journal when journal queue is empty
# journalFlushWhenQueueEmpty=false

# The number of threads that should handle journal callbacks
numJournalCallbackThreads=8

# The number of max entries to keep in fragment for re-replication
rereplicationEntryBatchSize=5000

# How long the interval to trigger next garbage collection, in milliseconds
# Since garbage collection is running in background, too frequent gc
# will hurt performance. It is better to use a longer gc
# interval if there is enough disk capacity.
gcWaitTime=900000

# How long the interval to trigger next garbage collection of overreplicated
# ledgers, in milliseconds [Default: 1 day]. This should not be run very frequently since we read
# the metadata for all the ledgers on the bookie from zk
# gcOverreplicatedLedgerWaitTime=86400000

# How long the interval to flush ledger index pages to disk, in milliseconds
# Flushing index files will introduce much random disk I/O.
# If separating journal dir and ledger dirs each on different devices,
# flushing would not affect performance. But if putting journal dir
# and ledger dirs on same device, performance degrade significantly
# on too frequent flushing. You can consider increment flush interval
# to get better performance, but you need to pay more time on bookie
# server restart after failure.
#
flushInterval=60000

# Interval to watch whether bookie is dead or not, in milliseconds
#
# bookieDeathWatchInterval=1000

## zookeeper client settings

# A list of one or more servers on which zookeeper is running.
# The server list can be comma separated values, for example:
# zkServers=zk1:2181,zk2:2181,zk3:2181
zkServers=localhost:2181
# ZooKeeper client session timeout in milliseconds
# Bookie server will exit if it receives SESSION_EXPIRED because it
# was partitioned off from ZooKeeper for more than the session timeout.
# JVM garbage collection or disk I/O can cause SESSION_EXPIRED.
# Incrementing this value could help avoid this issue
zkTimeout=30000

## NIO Server settings

# This setting is used to enable/disable Nagle's algorithm, which is a means of
# improving the efficiency of TCP/IP networks by reducing the number of packets
# that need to be sent over the network.
# If you are sending many small messages, such that more than one can fit in
# a single IP packet, setting server.tcpnodelay to false to enable Nagle algorithm
# can provide better performance.
# Default value is true.
#
# serverTcpNoDelay=true

## ledger cache settings

# Max number of ledger index files could be opened in bookie server
# If the number of ledger index files reaches this limitation, the bookie
# server starts to swap some ledgers from memory to disk.
# Too frequent swapping will affect performance. You can tune this number
# to gain performance according to your requirements.
openFileLimit=0

# Size of a index page in ledger cache, in bytes
# A larger index page can improve performance writing page to disk,
# which is efficient when you have a small number of ledgers and these
# ledgers have similar number of entries.
# If you have large number of ledgers and each ledger has fewer entries,
# smaller index page would improve memory usage.
# pageSize=8192

# How many index pages provided in ledger cache
# If number of index pages reaches this limitation, bookie server
# starts to swap some ledgers from memory to disk. You can increment
# this value when you find swapping becomes more frequent. But make sure
# pageLimit*pageSize should not more than JVM max memory limitation,
# otherwise you would get an OutOfMemoryException.
# In general, incrementing pageLimit and using a smaller index page would
# give better performance in the case of a larger number of ledgers with fewer entries
# If pageLimit is -1, bookie server will use 1/3 of JVM memory to compute
# the limitation of number of index pages.
pageLimit=0

#If all ledger directories configured are full, then support only read requests for clients.
#If "readOnlyModeEnabled=true" then on all ledger disks full, bookie will be converted
#to read-only mode and serve only read requests. Otherwise the bookie will be shutdown.
#By default this will be disabled.
readOnlyModeEnabled=true

#For each ledger dir, maximum disk space which can be used.
#Default is 0.95f. i.e. 95% of disk can be used at most after which nothing will
#be written to that partition. If all ledger dir partitions are full, then bookie
#will turn to readonly mode if 'readOnlyModeEnabled=true' is set, else it will
#shutdown.
#Valid values should be in between 0 and 1 (exclusive). 
#diskUsageThreshold=0.95

#Disk check interval in milliseconds, interval to check the ledger dirs usage.
#Default is 10000
#diskCheckInterval=10000

# Interval at which the auditor will do a check of all ledgers in the cluster.
# By default this runs once a week. The interval is set in seconds.
# To disable the periodic check completely, set this to 0.
# Note that periodic checking will put extra load on the cluster, so it should
# not be run more frequently than once a day.
#auditorPeriodicCheckInterval=604800

# The interval between auditor bookie checks.
# The auditor bookie check, checks ledger metadata to see which bookies should
# contain entries for each ledger. If a bookie which should contain entries is
# unavailable, then the ledger containing that entry is marked for recovery.
# Setting this to 0 disables the periodic check. Bookie checks will still
# run when a bookie fails.
# The interval is specified in seconds.
#auditorPeriodicBookieCheckInterval=86400

# number of threads that should handle write requests. if zero, the writes would
# be handled by netty threads directly.
numAddWorkerThreads=0

# number of threads that should handle read requests. if zero, the reads would
# be handled by netty threads directly.
numReadWorkerThreads=8

# If read workers threads are enabled, limit the number of pending requests, to
# avoid the executor queue to grow indefinitely
maxPendingReadRequestsPerThread=2500

# The number of bytes we should use as capacity for BufferedReadChannel. Default is 512 bytes.
readBufferSizeBytes=4096

# The number of bytes used as capacity for the write buffer. Default is 64KB.
# writeBufferSizeBytes=65536

# Whether the bookie should use its hostname to register with the
# co-ordination service(eg: zookeeper service).
# When false, bookie will use its ipaddress for the registration.
# Defaults to false.
#useHostNameAsBookieID=false

# Stats Provider Class
#statsProviderClass=org.apache.bookkeeper.stats.CodahaleMetricsProvider
codahaleStatsSlf4jEndpoint=stats

dbStorage_rocksDBEnabled=true

# This is where the data is stored before flushing. This will come from JVM direct memory.
dbStorage_writeCacheMaxSizeMb=256
#
# # Read-ahead cache to speed up backlog draining. From direct memory
dbStorage_readAheadCacheMaxSizeMb=4096
dbStorage_readAheadCacheBatchSize=1000
#
# # Give 4GBytes to RocksDB block cache. Ideally this cache should be big enough
# # to contain most of the locations index data/bookkeeper/ledgers/current/locations/
dbStorage_rocksDB_blockCacheSize=4294967296
dbStorage_entryLocationCacheMaxSizeMb=4096

dbStorage_rocksDB_writeBufferSizeMB=64
dbStorage_rocksDB_sstSizeInMB=64
dbStorage_rocksDB_blockSize=65536
dbStorage_rocksDB_bloomFilterBitsPerKey=10
dbStorage_rocksDB_numLevels=-1
dbStorage_rocksDB_numFilesInLevel0=4
dbStorage_rocksDB_maxSizeInLevel1MB=256

sschepens commented on July 1, 2024

So, restarting the bookies generally triggers auto-recovery for each restarted bookie.

To avoid that, in a controlled restart you should be disabling auto-recovery (bin/bookkeeper shell autorecovery -disable) and then re-enable when the bookie is back up (bin/bookkeeper shell autorecovery -enable)

We're not running any autorecovery process, so this should not be the case, but we'll try.

merlimat commented on July 1, 2024
dbStorage_writeCacheMaxSizeMb=256

This should be bigger, to avoid stalling the writes when the flush is slower. As I commented on the other issue, ideally it should be holding 1min worth of write traffic.
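
As a rough, assumption-based sizing example using the test numbers from earlier in this thread (50 K writes/s of 1 KB messages), one minute of that traffic is about 50,000 × 1 KB × 60 s, i.e. roughly 3 GB, so something like

dbStorage_writeCacheMaxSizeMb=3072

would hold about one flush interval of writes. Adjust to your actual write rate and available direct memory.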

sschepens commented on July 1, 2024

This should be bigger, to avoid stalling the writes when the flush is slower. As I commented on the other issue, ideally it should be holding 1min worth of write traffic.

We're seeing read throughput issues, not write issues.

We did experience some issues when setting dbStorage_writeCacheMaxSizeMb too high; our ledger storage seemed to be really stressed out by the huge writes.

sschepens commented on July 1, 2024

Another update: writing and consuming from a new topic (without restarting bookies) seems to work just fine. Gonna try restarting bookies now.

DongbinNie commented on July 1, 2024

@merlimat While I'm evaluating Pulsar, it would be much appreciated if you could share some real data from your production env, e.g. the number of bookie servers, the bookie server hardware, the number of topics and subscriptions, throughput, and message-produce latency.

Thanks.

merlimat commented on July 1, 2024

@DongbinNie

We have shared some numbers on our blog post: https://yahooeng.tumblr.com/post/150078336821/open-sourcing-pulsar-pub-sub-messaging-at-scale

Also, take a look at the slides from a recent talk I gave:
http://www.slideshare.net/merlimat/pulsar-distributed-pubsub-platform

For bookie hardware, there is a brief explanation at https://github.com/yahoo/pulsar/blob/master/docs/ClusterSetup.md#bookkeeper

Let me know if you want more details

DongbinNie commented on July 1, 2024

Thanks @merlimat

Our business relies heavily on MQ (RabbitMQ currently), and produce latency is very critical for us while our throughput is growing rapidly.

The underlying BookKeeper needs to fsync the journal for every message (including batching) to complete the add, which means the bookies must be deployed on SSD servers, increasing the server cost, so I'm very curious about the real data from your production practice. :-)

Page 4 of the slides mentions "Average latency <5ms, 99% 15ms", and page 24 mentions a single-topic measurement focused on tp99 < 5ms. If convenient, could you list some detailed data about those two cases: the cluster size, especially the BookKeeper cluster size and hardware, and the topics and throughput...

Thanks in advance.
