
Comments (8)

codeaholics commented on July 22, 2024

I'm not convinced you need the fixed record count. Say you're consuming from a relatively slow-moving stream and you get 15 records (numbers 10001-10015). You write these out to a file with the start ID in its name (my-data-10001), then you crash before you can checkpoint. When you restart (or another worker picks up this shard), it will also get a batch starting at 10001. It may get the same 15 records or it may get a few more. Say it now gets 17 records (10001-10017). It (over-)writes these to the my-data-10001 file. That file has all the data it had previously, plus a bit more. We now checkpoint at 10017, and the next batch starts at 10018, ready to go into the next file.

I can't see the need for the fixed record count.
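To make the overwrite argument concrete, here is a minimal sketch. It assumes an in-memory map standing in for the S3 bucket; `StartKeyedStore`, `put`, and `range` are hypothetical names for illustration, not part of the connector library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for an S3 bucket:
// writing to an existing key replaces the previous object.
class StartKeyedStore {
    private final Map<String, List<Long>> objects = new TreeMap<>();

    // Names the object after the first sequence number in the batch, then overwrites.
    String put(List<Long> batch) {
        String key = "my-data-" + batch.get(0);
        objects.put(key, new ArrayList<>(batch));
        return key;
    }

    List<Long> get(String key) {
        return objects.get(key);
    }

    int objectCount() {
        return objects.size();
    }

    // Helper: inclusive range of record numbers.
    static List<Long> range(long first, long last) {
        List<Long> out = new ArrayList<>();
        for (long i = first; i <= last; i++) {
            out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        StartKeyedStore store = new StartKeyedStore();
        store.put(range(10001, 10015)); // first attempt, crash before checkpoint
        store.put(range(10001, 10017)); // retry from the same checkpoint
        store.put(range(10018, 10030)); // next batch after checkpointing at 10017
        System.out.println(store.objectCount());               // prints 2
        System.out.println(store.get("my-data-10001").size()); // prints 17
    }
}
```

The retry lands on the same key as the failed attempt, so the store ends up with one object per starting ID regardless of how many records each attempt happened to contain.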

What I can see the need for, and what isn't exposed by the connector library, is the shard ID. Without writing my own KinesisConnectorRecordProcessorFactory and having it return a subclass of KinesisConnectorRecordProcessor that overrides initialize and captures its own copy of shardId, I can't see any way of accessing the shard ID. And even then, I'm not sure how you'd get it to the emitter.
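One possible shape for this, sketched with simplified stand-in types rather than the real library classes: the factory builds the emitter and the processor as a pair, and the processor forwards the shard ID to its emitter from initialize. Every type and method name below (`ShardProcessor`, `ShardAwareEmitter`, `ShardCapturingProcessor`, `ShardCapturingFactory`) is hypothetical, and whether the real factory and pipeline allow this wiring is an assumption.

```java
// Simplified stand-in for the record processor interface (not the real KCL type).
interface ShardProcessor {
    void initialize(String shardId);
}

// Hypothetical emitter that wants the shard ID in its file names.
class ShardAwareEmitter {
    private String shardId = "unknown";

    void setShardId(String shardId) {
        this.shardId = shardId;
    }

    String fileNameFor(long firstRecordNumber) {
        return shardId + "/my-data-" + firstRecordNumber;
    }
}

// Processor that captures the shard ID in initialize and hands it to its emitter.
class ShardCapturingProcessor implements ShardProcessor {
    private final ShardAwareEmitter emitter;

    ShardCapturingProcessor(ShardAwareEmitter emitter) {
        this.emitter = emitter;
    }

    ShardAwareEmitter emitter() {
        return emitter;
    }

    @Override
    public void initialize(String shardId) {
        emitter.setShardId(shardId);
    }
}

// Factory that creates one emitter per processor, so each shard gets its own pair.
class ShardCapturingFactory {
    ShardCapturingProcessor createProcessor() {
        return new ShardCapturingProcessor(new ShardAwareEmitter());
    }

    public static void main(String[] args) {
        ShardCapturingProcessor processor = new ShardCapturingFactory().createProcessor();
        processor.initialize("shard-0001"); // the worker would make this call
        System.out.println(processor.emitter().fileNameFor(10001));
        // prints shard-0001/my-data-10001
    }
}
```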

from amazon-kinesis-connectors.

ananthrk commented on July 22, 2024

Any thoughts/updates on this?


gauravgh commented on July 22, 2024

Would it be possible for you to configure the time and size thresholds to large enough values where the emit is triggered by counts alone?

Sincerely,
Gaurav


ananthrk commented on July 22, 2024

Hi Gaurav,

Thanks for the response. Even if we configure the time and size thresholds so that only the count matters, the count itself is not exact, and that would still be a problem. That is, configuring a value of "n" for count does not guarantee that only "n" records are buffered for emit; the buffer can always go slightly higher depending on the processRecords call. Hence this strategy does not always help, right?
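The overshoot can be illustrated with a small sketch. It assumes the count limit is only checked after a whole processRecords batch has been buffered, which is how the behaviour is described above; `CountThresholdBuffer` and its methods are hypothetical names, not the library's buffer implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical buffer: the count limit is checked only after a whole batch is added.
class CountThresholdBuffer {
    private final int recordLimit;
    private final List<Long> records = new ArrayList<>();

    CountThresholdBuffer(int recordLimit) {
        this.recordLimit = recordLimit;
    }

    // Mimics one processRecords call: buffers the batch, then returns the
    // flushed records if the limit was reached, or an empty list otherwise.
    List<Long> processBatch(List<Long> batch) {
        records.addAll(batch);
        if (records.size() < recordLimit) {
            return new ArrayList<>();
        }
        List<Long> flushed = new ArrayList<>(records);
        records.clear();
        return flushed;
    }

    // Helper: inclusive range of record numbers.
    static List<Long> range(long first, long last) {
        List<Long> out = new ArrayList<>();
        for (long i = first; i <= last; i++) {
            out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        CountThresholdBuffer buffer = new CountThresholdBuffer(15);
        System.out.println(buffer.processBatch(range(10001, 10010)).size()); // prints 0
        System.out.println(buffer.processBatch(range(10011, 10017)).size()); // prints 17
    }
}
```

With a limit of 15, a batch of 10 followed by a batch of 7 emits 17 records, so the emitted file size depends on how the batches happened to arrive.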


ananthrk commented on July 22, 2024

What happens when the order of these operations is reversed, that is, we get 17 records the first time but only 15 the next time (assuming we set recordCount=15)? In that case, we will have two duplicate records in S3.


codeaholics commented on July 22, 2024

Why would you get fewer records on the subsequent call? And even if you did, you'd replace your 17-record file with a 15-record file, and your next call would presumably include the missing two records, so they'd go into the next file. In any case, you have to cope with duplicates in the S3 files anyway, because if you're using the Kinesis Client Library (which the Connector Library does), it can (briefly) have multiple workers subscribed to the same shards. So this whole scheme is just an exercise in reducing the number of duplicates, not in eliminating them completely.


ananthrk commented on July 22, 2024

> Why would you get fewer records on the subsequent call?

I did not mean the call to processRecords here, but the number of records that are in the buffer when the file is emitted to S3. Since the "recordCount" threshold is not exact, the buffer may hold fewer or more records depending on when the emit was triggered.

> you have to cope with duplicates in the S3 files anyway because if you're using the Kinesis Client Library (which the Connector Library does), it can (briefly) have multiple workers subscribed to the same shards. So this whole scheme is just an exercise trying to reduce the number of duplicates, not completely eliminating them

Does this mean the only way to ensure unique records is through mechanisms available in the consuming system (say an RDBMS)?


codeaholics commented on July 22, 2024

That's my belief, yes.
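As a rough illustration of deduplicating in the consuming system, the sketch below behaves like a table with a unique key. It assumes that shard ID plus sequence number is enough to identify a record; `DedupingSink` and `accept` are hypothetical names, and a real consumer would more likely rely on a unique constraint in its database than on an in-memory set.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical consumer-side sink that behaves like a table with a unique key.
class DedupingSink {
    private final Set<String> seenKeys = new HashSet<>();

    // Returns true if the record was new and stored, false if it was a duplicate.
    boolean accept(String shardId, String sequenceNumber) {
        return seenKeys.add(shardId + ":" + sequenceNumber);
    }

    int size() {
        return seenKeys.size();
    }

    public static void main(String[] args) {
        DedupingSink sink = new DedupingSink();
        System.out.println(sink.accept("shard-0001", "10016")); // prints true
        System.out.println(sink.accept("shard-0001", "10016")); // prints false (duplicate)
        System.out.println(sink.accept("shard-0002", "10016")); // prints true (other shard)
        System.out.println(sink.size());                        // prints 2
    }
}
```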

