Comments (8)
I'm not convinced you need the fixed record count. Let's say you're consuming from a relatively slow-moving stream and you get, say, 15 records (number 10001 - 10015). You write these out into a file with the start ID in its name (my-data-10001) then you fail to checkpoint because you crash. When you restart (or another worker picks up this shard), it will also get a batch starting at 10001. It may get the same 15 records or it may get a few more. Let's say it now gets 17 records (10001 - 10017). It (over-)writes these to the my-data-10001 file. That file has all the data it had previously, plus a bit more. We now checkpoint at 10017, and the next batch starts at 10018 ready to go into the next file.
I can't see the need for the fixed record count.
What I can see the need for that isn't exposed from the connector library, is the shard ID. Without writing my own KinesisConnectorRecordProcessorFactory
and having it return a subclass of KinesisConnectorRecordProcessor
which overrides initialize
and captures its own copy of shardId
, I can't see any way of accessing the shard ID. And even then, I'm not sure how you'd get it to the emitter.
from amazon-kinesis-connectors.
Any thoughts/updates on this?
from amazon-kinesis-connectors.
Would it be possible for you to configure the time and size thresholds to large enough values where the emit is triggered by counts alone?
Sincerely,
Gaurav
from amazon-kinesis-connectors.
Hi Gaurav,
Thanks for the response. Even if we configure the time & size thresholds in such a way that only the counts matter, the fact that the counts are not always exact would still be a problem. That is, configuring a value of "n" for count does not guarantee only "n" records are buffered for emit and can always go slightly higher depending on the processRecords call. Hence this strategy does not always help, right?
from amazon-kinesis-connectors.
What happens when the order of these operations are reversed - that is, we got 17 records the first time, but only 15 the next time (assuming we set recordCount=15)? In that case, we will have two duplicate records in S3.
from amazon-kinesis-connectors.
Why would you get fewer records on the subsequent call? And even if you did, you'd replace your 17 record file with a 15 record file and then your next call would presumably include the missing two records so they'd go into the next file. In any case, you have to cope with duplicates in the S3 files anyway because if you're using the Kinesis Client Library (which the Connector Library does), it can (briefly) have multiple workers subscribed to the same shards. So this whole scheme is just an exercise trying to reduce the number of duplicates, not completely eliminating them.
from amazon-kinesis-connectors.
Why would you get fewer records on the subsequent call?
I did not mean the call to processRecords here, but the number of records that are in the buffer when the file is emitted to S3. Since the "recordCount" variable is not exact, there is a chance for the buffer to hold less or more records depending on when the emit was triggered.
you have to cope with duplicates in the S3 files anyway because if you're using the Kinesis Client Library (which the Connector Library does), it can (briefly) have multiple workers subscribed to the same shards. So this whole scheme is just an exercise trying to reduce the number of duplicates, not completely eliminating them
Does this mean the only way to ensure unique records is through mechanisms available in the consuming system (say an RDBMS)?
from amazon-kinesis-connectors.
That's my belief, yes.
from amazon-kinesis-connectors.
Related Issues (20)
- AWS ES service version conflict HOT 1
- Add a fail callback when the message can not be transformed. HOT 1
- HTTP proxy support HOT 4
- Emitters should be more extensible
- Application logs
- Logic for auto scaling
- ElasticSearch connector does not work against AWS-ES service HOT 1
- Hibernate - Redshift.
- UnmodifiableBuffer equals method does not return false when buffers contain different records
- is Amazon KCL 1.7.5 compatible with elasticSearch 5.1.1? HOT 1
- Script is still using default values even after modifying the properties file HOT 2
- No suitable driver found for JDBC
- Understanding Number of records emitted to S3 HOT 3
- Not able to run the amazon-kinesis-connectors-samples-1.0.0-SNAPSHOT.jar
- finish Application after consuming n records from the stream with graceful shutdown HOT 2
- Setting Kinesis Client Library DynamoDB properties fails
- The input line is too long. The syntax of the command is incorrect.
- How can I migrate to use KCL 2.x and new AWS SDK, what's the future of kinesis-connectors?
- S3 Sample Error - Caught exception when uploading file s3://pfifer-connector-test HOT 3
- S3 and Kinesis access and secret keys can be different.Support multiple access and secret key.
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from amazon-kinesis-connectors.