Comments (14)
@merlimat any idea what could be happening? All of our brokers kept connecting to bookies that no longer existed, and they continued doing so for days. We had to restart the brokers before they realized those bookies were gone.
Is this an issue with the BookKeeper client or with the brokers themselves?
from pulsar.
I think this is due to the BK client still having the old metadata, pointing to the retired bookies. I know that in the broker, if we have a read error, we close the ledger to make sure we get the updated metadata.
I'm not sure in this case at which point it might have got stuck. Need to reproduce it in a simple scenario to debug it further.
from pulsar.
I think this is due to the BK client still having the old metadata, pointing to the retired bookies. I know that in the broker, if we have a read error, we close the ledger to make sure we get the updated metadata.
We had lots of failures and the broker didn't seem to be updating the metadata. If I can find the related logs, I'll attach them.
from pulsar.
@merlimat I'm checking the logs. They mostly complain about failing to write to the replaced bookies, which then get quarantined, but eventually the bookies are removed from quarantine and the writes start failing again.
No errors seem to come from reading, maybe that's why metadata doesn't get refreshed?
from pulsar.
Uhm, that sounds strange. If the bookies are down, they should not be picked for new ledger ensembles. Can you post the logs here? Also, can you verify that the bookies are really not registered in ZK anymore?
bin/bookkeeper shell listbookies --readwrite
from pulsar.
Also, can you verify that the bookies are really not registered in ZK anymore?
I did that when we had the issue and they were indeed not registered.
Here I attach some logs for a single broker which is failing against bookie 10.64.103.176 which did not exist at the time. The logs are only those mentioning that bookie.
from pulsar.
@merlimat could you take a look at the logs?
from pulsar.
So, I'm not sure what exactly is happening, though it seems to be related to the RackAware policy and the notification it gets when the z-node with the mapping is changed.
In particular, at the beginning of the log:
2017-02-03 02:39:04,038 - INFO - [zk-cache-executor-11-1:ZkBookieRackAffinityMapping@160] - Bookie rack info updated to {us-east-1={10.64.103.28:3181=com.yahoo.pulsar.zookeeper.BookieInfo@583c6a6e, 10.64.102.115:3181=com.yahoo.pulsar.zookeeper.BookieInfo@410cbbe8, 10.64.102.214:3181=com.yahoo.pulsar.zookeeper.BookieInfo@797f9064, 10.64.102.126:3181=com.yahoo.pulsar.zookeeper.BookieInfo@10f00f34, 10.64.103.156:3181=com.yahoo.pulsar.zookeeper.BookieInfo@2ba4685e, 10.64.102.237:3181=com.yahoo.pulsar.zookeeper.BookieInfo@f535539, 10.64.102.145:3181=com.yahoo.pulsar.zookeeper.BookieInfo@a2a0807, 10.64.103.176:3181=com.yahoo.pulsar.zookeeper.BookieInfo@1ab32fd9, 10.64.103.68:3181=com.yahoo.pulsar.zookeeper.BookieInfo@12dd5249, 10.64.103.79:3181=com.yahoo.pulsar.zookeeper.BookieInfo@73227b6, 10.64.102.65:3181=com.yahoo.pulsar.zookeeper.BookieInfo@5d026d67, 10.64.103.171:3181=com.yahoo.pulsar.zookeeper.BookieInfo@5e4c5cf9}}. Notifying rackaware policy.
2017-02-03 02:39:04,039 - INFO - [zk-cache-executor-11-1:NetworkTopology@463] - Removing a node: /us-east-1e/10.64.103.176:3181
2017-02-03 02:39:04,039 - INFO - [zk-cache-executor-11-1:NetworkTopology@394] - Adding a new node: /us-east-1e/10.64.103.176:3181
So, first bookie 10.64.103.176 gets removed and then immediately added back again. I need to set up a test env to try to reproduce this.
In the meantime, I think you were updating the rack-aware mapping z-node every time a bookie was removed from /ledgers/available, right? Can you try not to touch the mapping and see if it makes any difference?
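For anyone scanning their own broker logs for the same pattern, here is a minimal Python sketch that flags a node removed and then re-added by the very next NetworkTopology event. The regular expression is modelled only on the two NetworkTopology lines quoted above; the function name `find_readded_nodes` is hypothetical, and other log layouts would need a different pattern.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Modelled on the two NetworkTopology lines quoted in this thread, e.g.
# "2017-02-03 02:39:04,039 - INFO - [...:NetworkTopology@463] - Removing a node: /rack/host:port"
_EVENT = re.compile(
    r"^(\S+ \S+) - .*NetworkTopology@\d+\] - "
    r"(Removing a node|Adding a new node): (\S+)"
)


def find_readded_nodes(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (node, removed_at, added_at) for every node that is removed
    and then added back by the very next topology event."""
    hits: List[Tuple[str, str, str]] = []
    last_removed: Optional[Tuple[str, str]] = None  # (node, timestamp)
    for line in lines:
        match = _EVENT.match(line)
        if match is None:
            continue  # not a topology add/remove line
        timestamp, action, node = match.groups()
        if action == "Removing a node":
            last_removed = (node, timestamp)
        else:
            if last_removed is not None and last_removed[0] == node:
                hits.append((node, last_removed[1], timestamp))
            last_removed = None
    return hits
```

Feeding it the lines of a broker log file (for example `find_readded_nodes(open(path))`, where `path` is your own log location) lists each bookie that was dropped from the topology and put straight back.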
from pulsar.
@sschepens Not able to reproduce locally so far.
Can you try turning debug logs on these classes?
org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy
org.apache.bookkeeper.net.NetworkTopology
com.yahoo.pulsar.zookeeper.ZkBookieRackAffinityMapping
Also, can you explain again how you update the rack info?
from pulsar.
If you are still seeing this issue, it would be useful if you could list the nodes under /ledgers/available and send the contents of /bookies.
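Once both pieces of data are in hand, comparing them is mechanical. The sketch below is only an illustration under stated assumptions: it assumes the /bookies content is JSON shaped as group -> bookie address -> info (inferred from the "Bookie rack info updated to {us-east-1={...}}" log line in this thread), and that /ledgers/available may contain a `readonly` child that is not a bookie. The function name `stale_rack_entries` is hypothetical.

```python
import json
from typing import Dict, Iterable, Set


def stale_rack_entries(
    available_children: Iterable[str], rack_mapping_json: str
) -> Dict[str, Set[str]]:
    """Return, per group, the bookies present in the rack mapping but not
    registered under /ledgers/available."""
    # Assumption: "readonly" is a container z-node, not a bookie address.
    registered = {child for child in available_children if child != "readonly"}
    # Assumption: the mapping is JSON of the form {group: {bookie: info}}.
    mapping = json.loads(rack_mapping_json)
    stale: Dict[str, Set[str]] = {}
    for group, bookies in mapping.items():
        missing = set(bookies) - registered
        if missing:
            stale[group] = missing
    return stale
```

A non-empty result would mean the rack mapping still names bookies that are no longer registered, which is the situation described earlier in this thread.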
from pulsar.
@saandrews yes, we're still experiencing this every once in a while.
from pulsar.
This might be related to network stabilization in ZooKeeper.
Basically, it takes time for the network of bookies registered in ZooKeeper to stabilize. You might want to look at the property bkc.networkTopologyStabilizePeriodSeconds.
Ref - https://twitter.github.io/distributedlog/html/implementation/storage.html
Do let us know if this issue is resolved.
from pulsar.
@sschepens did this issue ever get resolved, or do you continue to see it?
from pulsar.
Closing this for now. Please reopen if you see it again.
from pulsar.