Our current setup has a Partitioner that routes data

readOffset is slow/expensive for large, highly partitioned tables (kafka-connect-hdfs, closed)

confluentinc commented on July 24, 2024

readOffset is slow/expensive for large, highly partitioned tables

from kafka-connect-hdfs.

Comments (3)

Ishiihara commented on July 24, 2024

@jingw Thanks for reporting this. We definitely want to improve this behavior. Could you elaborate a bit more on both of your options? We need a good design here so that the scalability of the connector can be improved.


jingw commented on July 24, 2024

Thanks for the reply.

For (1), here's a patch I've been playing with: https://gist.github.com/jingw/5f427f09037c26a21978e299415e7d39 (not complete, not successfully tested)
The idea is that the write-ahead log knows about the offsets, so we should be able to use it instead of recursively scanning HDFS.
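To make the idea concrete, here is a minimal sketch in Python rather than the connector's Java. It is not the patch from the gist. It assumes, hypothetically, that the write-ahead log can be read back as `(temp_path, committed_path)` pairs and that committed files are named `<topic>+<partition>+<startOffset>+<endOffset>.<ext>`. The function names are made up for illustration.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed committed-file naming: <topic>+<partition>+<startOffset>+<endOffset>.<ext>
COMMITTED_NAME = re.compile(
    r"^(?P<topic>.+)\+(?P<partition>\d+)\+(?P<start>\d+)\+(?P<end>\d+)\.\w+$"
)


def end_offset(path: str) -> Optional[int]:
    """Return the end offset encoded in a committed file name, or None if it does not match."""
    name = path.rsplit("/", 1)[-1]
    match = COMMITTED_NAME.match(name)
    return int(match.group("end")) if match else None


def latest_offset_from_wal(entries: Iterable[Tuple[str, str]]) -> Optional[int]:
    """Take the highest end offset among committed paths recorded in the log.

    `entries` is a hypothetical in-memory view of the WAL: (temp_path, committed_path).
    No directory listing is performed, so the cost depends on the log size,
    not on the number of partition directories.
    """
    offsets = [
        offset
        for _, committed in entries
        if (offset := end_offset(committed)) is not None
    ]
    return max(offsets, default=None)
```

For example, a log with committed files ending at offsets 99 and 250 would yield 250, and an empty log would yield `None`, in which case a fallback to the existing scan would still be needed.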

(2) is a quick hack to reduce the scale of the problem. If I make my partitioning format partition/day/hour instead of day/hour, then each partition only needs to scan the day/hour directories under its own partition. Furthermore, if we're willing to assume some amount of ordering (e.g. data isn't out of order by more than a day), then we can assume the latest offset is in one of the most recent day/hour directories. This approach would probably work for me, but I'm guessing it wouldn't be suitable for inclusion in your repo.
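A rough sketch of option (2), again in Python for brevity. The directory naming (`partition=…/day=…/hour=…`), the `list_files` callback, and the 24-hour lookback are all assumptions for illustration. The file names are assumed to end in `+<endOffset>.<ext>`.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Optional


def candidate_dirs(partition: int, now: datetime, lookback_hours: int = 24) -> List[str]:
    """List this partition's hourly directories, newest first, within the lookback window."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    hours = (top_of_hour - timedelta(hours=h) for h in range(lookback_hours + 1))
    # Hypothetical layout: partition first, then day, then hour.
    return [f"partition={partition}/day={t:%Y-%m-%d}/hour={t:%H}" for t in hours]


def latest_offset(
    partition: int,
    now: datetime,
    list_files: Callable[[str], Iterable[str]],
    lookback_hours: int = 24,
) -> Optional[int]:
    """Scan only the recent directories of one partition for the highest end offset."""
    best: Optional[int] = None
    for directory in candidate_dirs(partition, now, lookback_hours):
        for name in list_files(directory):
            # Assumed file name shape: <topic>+<partition>+<start>+<end>.<ext>
            end = int(name.rsplit(".", 1)[0].split("+")[-1])
            best = end if best is None else max(best, end)
    return best
```

With a 24-hour window this lists at most 25 directories per partition, regardless of how much history the table holds. The trade-off is the ordering assumption: data older than the window is never seen by the scan.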


dosvath commented on July 24, 2024

Addressed by #556

