
Record duplication? about vmware-go-kcl (CLOSED)

vmware commented on August 22, 2024
Record duplication?

from vmware-go-kcl.

Comments (7)

taoj-action commented on August 22, 2024

Make sure you are checkpointing correctly. You might also check whether the record processor is being restarted. Enabling more logging during the stress test can help find the issue.

We stress tested it before and didn't find any data duplication issues.


taoj-action commented on August 22, 2024

Also, see https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html


arl commented on August 22, 2024

I found the issue.

Fixed provisioning was enabled on the DynamoDB table and we had exceeded the provisioned throughput.
Switching the table to on-demand capacity fixed the problem.

The fixed provisioning was triggering shardConsumer.getRecords failures:

Error in refreshing lease on shard: shardId-000000001185 for worker: b3e512c484b4-23a4dbb5-060c-11ea-9728-0242ac110005. Error: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
	status code: 400, request id: xxxEDITEDxxx

immediately followed by:

Error in getRecords: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
	status code: 400, request id: xxxEDITEDxxx

Side note: the same error is logged twice. In that case it might be preferable not to log at the error site. Instead, we could return a new error created with fmt.Errorf that provides additional context, and let the caller log it.

> Make sure you are checkpointing correctly. You might also check whether the record processor is being restarted. Enabling more logging during the stress test can help find the issue.

Still, I'm not sure I understand why losing and restarting a shardProcessor should lead to record duplication. Ideally the restarted processor should pick up where the last one stopped, right?

Also at one point the same cause (ProvisionedThroughputExceededException) triggered another effect, probably when a new shardProcessor was restarting.

 Error: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
	status code: 400, request id: xxxEDITEDXXX

Note the space in front of "Error:". This detail led me to the only place in the code base where " Error:" is used:

err := w.checkpointer.FetchCheckpoint(shard)
if err != nil {
	// checkpoint may not existed yet is not an error condition.
	if err != chk.ErrSequenceIDNotFound {
		log.Errorf(" Error: %+v", err)
		// move on to next shard
		continue
	}
}


arl commented on August 22, 2024

@taojwmware I know from the AWS document you linked that one possible cause of duplicated records is a worker terminating unexpectedly. Still, I'm not sure I understand why that should be the case when checkpointer.GetLease fails, since the current shard owner hasn't moved its checkpoint yet at that point. So even if a new worker gets started, it should continue where the previous owner stopped, right?

Am I missing something here?


taoj-action commented on August 22, 2024

As for logging, we had some internal discussions about it a long time ago. We decided we'd rather log more than less, to make troubleshooting easier. We used to wrap errors and log in only one place, but found it hard to trace back to the exact origin of the error.


taoj-action commented on August 22, 2024

It depends on how you checkpoint. If you checkpoint immediately after getting the records, before processing and storing them, a processor restart will cause data loss. If you checkpoint only after the whole batch of records but have already stored part of the processed records, a processor restart will cause the stored portion to be refetched, because no checkpoint covered those records.


taoj-action commented on August 22, 2024

Here is the link for Kafka Clients (At-Most-Once, At-Least-Once, Exactly-Once, and Avro Client)
https://dzone.com/articles/kafka-clients-at-most-once-at-least-once-exactly-o

It applies to Kinesis as well.
