aramperes / kafka-denormalization

Denormalizing Kafka topics using Kafka Streams

License: MIT License

Languages: Java 97.29%, Python 2.71%
Topics: analytics, denormalization, kafka, kafka-streams, spring-kafka, streaming


kafka-denormalization

This is a sample project to denormalize two Kafka topics into one. In other words, it performs a many-to-one join between two topics based on a foreign key, and emits the joined data to a third topic.

The basic use case is when your data is already being produced on existing topics and you need to combine it over time as updates come in, updating previously joined records when either side changes.

Example

This repository contains an example using Hacker News comments and stories. The services directory contains two microservices: one polls for new stories and produces them on a topic, and the other does the same for comments.

hn.comments (left)

{"by":"zinekeller","id":32546427,"parent":32546388,"text":"...","time":1661132891,"type":"comment","story":32545513}

hn.stories (right)

{"by":"thesuperbigfrog","descendants":40,"id":32545513,"score":50,"time":1661124181,"title":"The Google Pixel 6a highlights everything wrong with the U.S. phone market","type":"story","url":"https://www.xda-developers.com/google-pixel-6a-us-market-editorial/"}

Our objective is to join these two topics into one. Each output message will contain the comment object, as well as the inflated story object.

hn.comments-with-story

{
    "comment": {"by":"zinekeller","id":32546427,"parent":32546388,"text":"...","time":1661132891,"type":"comment","story":32545513},
    "story": {"by":"thesuperbigfrog","descendants":40,"id":32545513,"score":50,"time":1661124181,"title":"The Google Pixel 6a highlights everything wrong with the U.S. phone market","type":"story","url":"https://www.xda-developers.com/google-pixel-6a-us-market-editorial/"}
}
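The joined event can be modeled as plain Java records. This is a hypothetical sketch whose field names simply mirror the JSON keys above; the project's actual model classes may differ.

```java
// Hypothetical record shapes mirroring the JSON payloads above; the real
// project's model classes may differ. Field names follow the JSON keys.
record Comment(String by, long id, long parent, String text, long time,
               String type, long story) {}

record Story(String by, int descendants, long id, int score, long time,
             String title, String type, String url) {}

// The joined event simply nests one of each.
record JoinedCommentStoryEvent(Comment comment, Story story) {}
```

The foreign-key relationship is the invariant here: for every joined event, `comment.story()` equals `story.id()`.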

Using the DSL I made for this project, the join can be expressed like this:

@Autowired
public void buildPipeline(StreamsBuilder builder) {
    var indexStore = Stores.inMemoryKeyValueStore("index");

    StreamDenormalize.<String, Comment, String, Story, String, JoinedCommentStoryEvent>builder()
        .keySchema(JoinKeySchemas.Blake2b(8, Serdes.String(), Serdes.String()))
        .indexTopic("hn.index")
            .indexStore(indexStore)
        .leftTopic("hn.comments")
            .leftSerde(Comment.serde)
        .rightTopic("hn.stories")
            .rightSerde(Story.serde)
        .joinOn(comment -> comment.story().toString())
            .joiner((comment, story) -> new JoinedCommentStoryEvent(comment, story))
            .keyMapper((k, joined) -> joined.comment().id().toString())
        .build()
        .innerJoin(builder)
            .to("hn.comments-with-story", Produced.with(Serdes.String(), JoinedCommentStoryEvent.serde));
}


kafka-denormalization's Issues

Bug/performance: Nearby joins can trigger a duplicate output message

Behavior

When the left side and right side of an inner join are added to the index around the same time (within the same batching window), the JoinTransformer will emit two identical joins to the output.

Cause

KTable buffers records in a batch before forwarding them to downstream processors: all records from the batch are written to the StateStore cache before the JoinTransformer processes the first one. Since the JoinTransformer looks up/scans the StateStore to find the matching side, when both updates are processed in the same batch, each will find its sibling record already in the store.

The JoinTransformer was designed under the assumption that records in the StateStore have already been processed. This does not hold when the sibling record arrives in the same batch and sits in the cached store.

https://docs.confluent.io/platform/current/streams/developer-guide/memory-mgmt.html#record-caches-in-the-dsl

Solutions

  • Workaround: Disable or reduce caching. The advantage of KTable caching is that it saves some JoinTransformer cycles when the same record key is sent within the same batch (e.g. two identical left updates close together in the index). Since the index is populated from all partitions of the source topic, batching would not work well at the source level. Also, this setting is global rather than per-StateStore, so disabling it is not ideal. Reducing the batch window merely reduces the chances of this occurring.

  • What we need to figure out is how to skip the second record if the first one has already been processed in the batch. The updates can be in either order in the batch (left-then-right, or right-then-left).

    • The JoinTransformer instance is sticky across batches and its lifecycle (init/close) cannot be used to detect the start of a new batch.
    • The ProcessorContext.currentStreamTimeMs() of a batch is identical for all records within it. This could be a way to dedupe the outputs within the batch: essentially a cache that resets whenever it is given a streamTime different from the last one known to the JoinTransformer instance. Only the JoinKey of the joined record needs to be stored and retrieved.
  • Implement an output cache, identical to KTable's input cache, that happens after the JoinTransformer.
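The streamTime-based dedupe idea above could be sketched as a small helper that the JoinTransformer consults before forwarding a join. This is a hypothetical class, not part of the codebase:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical per-transformer cache that drops duplicate join outputs
// emitted within the same batch, using stream time as the batch marker.
final class BatchDedupe<K> {
    private long lastStreamTime = Long.MIN_VALUE;
    private final Set<K> seen = new HashSet<>();

    /** Returns true if this join key was already emitted at this stream time. */
    boolean isDuplicate(long streamTimeMs, K joinKey) {
        if (streamTimeMs != lastStreamTime) { // new batch: reset the cache
            lastStreamTime = streamTimeMs;
            seen.clear();
        }
        return !seen.add(joinKey); // add() returns false if already present
    }
}
```

Because only the JoinKeys of the current batch are held, the cache stays small and is discarded as soon as stream time advances.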

Feat: allow foreign-key-to-foreign-key join

This is just a DSL enhancement; the selectKey on the right-side topic currently assumes the right-side record key will always be used to compute the join key:

// Right side of the join
// Every time a RIGHT is received, it will forward it to the INDEX topic, but re-keyed to have a NULL primary key,
// and also, it will manually repartition the output based on the foreign key only.
builder.stream(rightTopic, Consumed.with(keySchema.rightSerde(), Serdes.ByteArray()))
.selectKey(keySchema.right())
.to(indexTopic, Produced.with(JoinKey.serde, Serdes.ByteArray()).withStreamPartitioner(partitioner()));

This is limiting because it means that the left-side branch must be able to compute this shared key, while it may be preferable in some cases to join on a different (unique) property within the right-side value. For example if the right-side topic uses some key schema the left-side is not aware of.

Enabling this feature would require ser/de on the right-side topic; otherwise, the pass-through with Serdes.ByteArray() can be kept.
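The enhancement could be sketched as accepting a key extractor for the right side too, analogous to joinOn on the left. Everything here is hypothetical (the name joinOnRight and the simplified Story stand-in are not part of the current DSL):

```java
import java.util.function.Function;

// Sketch of the proposed enhancement: instead of assuming the right-side
// record key IS the join key, derive it from a property of the right-side
// value (which requires deserializing it).
final class RightSideJoin {
    record Story(long id, String title) {} // simplified stand-in

    // Hypothetical "joinOnRight", mirroring the left side's joinOn:
    static Function<Story, String> joinOnRight() {
        return story -> Long.toString(story.id());
    }
}
```

With such an extractor, selectKey would map the deserialized right value through it instead of reusing the incoming record key.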

Optimization: optional `selectKey` value deserialization

Right-side source

For the right-side of the join, the initial processor that re-keys the source topic does not need to deserialize the value:

builder.stream(rightTopic, Consumed.with(keySchema.rightSerde(), rightSerde))
.selectKey(keySchema.right())
.to(indexTopic, Produced.with(JoinKey.serde, rightSerde).withStreamPartitioner(partitioner()));

It effectively performs (byte[], byte[]) -> (KeyType, ValueType) -> (byte[], byte[]), when it could be a (byte[], byte[]) -> (KeyType, byte[]) -> (byte[], byte[]) pass-through.

Could use Serdes.ByteArray(), i.e.

builder.stream(rightTopic, Consumed.with(keySchema.rightSerde(), Serdes.ByteArray()))
        .selectKey(keySchema.right())
        .to(indexTopic, Produced.with(JoinKey.serde, Serdes.ByteArray()).withStreamPartitioner(partitioner()));

Left-side source

The left side currently requires the value to be deserialized when re-keying because it assumes the foreign key is located within the value; however, in some cases it may live in the headers or in the key itself, in which case this serde could be bypassed as well. This would likely require a separate DSL branch (since joinOn takes the value). Maybe joinOnKey/joinOnHeader?
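A hypothetical joinOnKey could be sketched as a pure function over the record key. The composite "storyId:commentId" key format below is an invented example to illustrate the idea, not a format the project uses:

```java
import java.util.function.Function;

// Sketch of a hypothetical joinOnKey: if the left-side record key already
// embeds the foreign key (e.g. an assumed "storyId:commentId" composite
// key), the value serde can be bypassed and the join key derived from the
// record key alone.
final class KeyBasedJoin {
    static Function<String, String> joinOnKey() {
        return key -> key.split(":", 2)[0]; // foreign-key prefix
    }
}
```

A joinOnHeader variant would look the same, except the extractor would receive the record's Kafka headers instead of its key.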
