I'm trying to debug a duplicated records problem. Amount is from 0% (for low traffic s

Record duplication? about vmware-go-kcl HOT 7 CLOSED

vmware commented on August 22, 2024

Record duplication?

from vmware-go-kcl.

Comments (7)

taoj-action commented on August 22, 2024

Make sure you are checkpointing correctly. You might also check whether the record processor is being restarted. Enabling more logging during a stress test can help find the issue.

We stress tested it before and didn't find any data duplication issues.

taoj-action commented on August 22, 2024

Also, https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html

arl commented on August 22, 2024

I found the issue.

Fixed (provisioned) capacity was enabled on the DynamoDB table, and we had exceeded the provisioned throughput.
Switching the table to on-demand capacity fixed the problem.

The fixed provisioning was triggering shardConsumer.getRecords failures:

Error in refreshing lease on shard: shardId-000000001185 for worker: b3e512c484b4-23a4dbb5-060c-11ea-9728-0242ac110005. Error: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
	status code: 400, request id: xxxEDITEDxxx

immediately followed by:

Error in getRecords: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
	status code: 400, request id: xxxEDITEDxxx

Side note: the same error is logged twice. In that case it might be preferable not to log at the error site. Instead, we could return a new error created with fmt.Errorf that provides additional context, and let the caller log it.
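
To illustrate the side note, here is a minimal sketch of wrapping an error with fmt.Errorf and logging it once at the caller. It is not the library's code: refreshLease, getRecords and errThroughput are hypothetical stand-ins, and the %w verb assumes Go 1.13 or later.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errThroughput is a hypothetical stand-in for DynamoDB's
// ProvisionedThroughputExceededException.
var errThroughput = errors.New("ProvisionedThroughputExceededException")

// refreshLease is a hypothetical low-level call. It fails without logging.
func refreshLease(shardID string) error {
	return errThroughput
}

// getRecords adds context with %w and returns the error. It does not log.
func getRecords(shardID string) error {
	if err := refreshLease(shardID); err != nil {
		return fmt.Errorf("getRecords: refreshing lease on shard %s: %w", shardID, err)
	}
	return nil
}

func main() {
	// The caller is the single place where the error gets logged.
	if err := getRecords("shardId-000000001185"); err != nil {
		log.Printf("shard consumer stopped: %v", err)
		// The original cause is still reachable through the wrapping.
		if errors.Is(err, errThroughput) {
			log.Print("cause: provisioned throughput exceeded")
		}
	}
}
```

The wrapped message carries the call path in its text, so a single log line still says where the failure came from.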

> Make sure you are checkpointing correctly. You might also check whether the record processor is being restarted. Enabling more logging during a stress test can help find the issue.

Still, I'm not sure I understand why losing and restarting a shardProcessor should lead to record duplication. Ideally, the restarted processor should pick up where the last one stopped, right?

Also at one point the same cause (ProvisionedThroughputExceededException) triggered another effect, probably when a new shardProcessor was restarting.

 Error: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
	status code: 400, request id: xxxEDITEDXXX

Note the space in front of "Error:". This detail led me to the only place in the code base where " Error:" is used:

vmware-go-kcl/clientlibrary/worker/worker.go

Lines 260 to 268 in d0e9c4c

    
    err := w.checkpointer.FetchCheckpoint(shard)
    if err != nil {
        // checkpoint may not existed yet is not an error condition.
        if err != chk.ErrSequenceIDNotFound {
            log.Errorf(" Error: %+v", err)
            // move on to next shard
            continue
        }
    }

arl commented on August 22, 2024

@taojwmware I know from the AWS document you linked about duplicated records that one possible cause of duplication is a worker terminating unexpectedly. Still, I'm not sure I understand why that should be the case when checkpointer.GetLease fails, since the current shard owner hasn't moved its checkpoint yet when that happens. So even if a new worker is started, it should continue where the previous owner stopped, right?

Am I missing something here?

taoj-action commented on August 22, 2024

As for logging, we had some internal discussions about it a long time ago. We decided we'd rather log more than less, to make troubleshooting easier. We used to wrap errors and log in only one place, but found it hard to trace back the exact origin of an error.

taoj-action commented on August 22, 2024

It depends on how you checkpoint. If you checkpoint immediately after getting the records, before processing and storing them, a processor restart will cause data loss. If you checkpoint only after the whole batch of records has been handled, and only part of the batch was stored before the restart, the stored portion will be refetched and processed again after the restart, because no checkpoint was made for those records.
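
The two cases can be sketched with a small simulation. It is only an illustration under simplifying assumptions (one shard, one crash, an in-memory checkpoint) and does not use the vmware-go-kcl API. The run function and its parameters are made up for this example.

```go
package main

import "fmt"

// run simulates a processor that handles records 0..n-1 in batches of
// batchSize, crashes once after storing crashAfter records, and then
// restarts from the last checkpoint. It returns how many times each
// record was stored.
//
// With checkpointFirst the checkpoint moves before the batch is processed.
// Otherwise it moves only after the whole batch has been stored.
func run(n, batchSize, crashAfter int, checkpointFirst bool) []int {
	stored := make([]int, n)
	checkpoint := 0 // index of the next record to read after a restart
	crashed := false
	total := 0 // number of store operations so far

	for checkpoint < n {
		start := checkpoint
		end := start + batchSize
		if end > n {
			end = n
		}
		if checkpointFirst {
			checkpoint = end
		}

		restart := false
		for i := start; i < end; i++ {
			if !crashed && total == crashAfter {
				crashed = true
				restart = true
				break // the processor dies in the middle of the batch
			}
			stored[i]++
			total++
		}
		if restart {
			continue // a new processor resumes from the saved checkpoint
		}
		checkpoint = end
	}
	return stored
}

func main() {
	// Checkpoint after the batch: records 0 and 1 are stored twice.
	fmt.Println(run(6, 3, 2, false)) // [2 2 1 1 1 1]
	// Checkpoint before the batch: record 2 is never stored.
	fmt.Println(run(6, 3, 2, true)) // [1 1 0 1 1 1]
}
```

In the first call the crash happens after two of three records were stored and before the checkpoint moved, so the restarted processor reads the whole batch again.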

taoj-action commented on August 22, 2024

Here is a link about Kafka clients (At-Most-Once, At-Least-Once, Exactly-Once, and Avro Client):
https://dzone.com/articles/kafka-clients-at-most-once-at-least-once-exactly-o

It applies to Kinesis as well.
