Comments (4)
Essentially, the problem we have to solve is to find a reasonably large time window of consensusTimestamp that does not overlap the streaming buffer. startTime is known, so we have to find endTime.
We have a function F(X) which tells us whether X is in the streaming buffer or not. Basically, F(X) = update table set dedupe = 1 where UNIX_SECONDS(consensusTimestampTruncated) = X; the update fails if X is in the streaming buffer.
However, each call is a bit costly (~10-20s), so we can't afford many of them.
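A minimal sketch of F(X) as a probe, assuming the caller supplies a `run_dml` function that executes the statement and raises a `StreamingBufferConflict` when BigQuery rejects the update. The table name, `run_dml`, and the exception class are hypothetical names for illustration, not part of hedera-etl or the BigQuery client.

```python
from typing import Callable

# Hypothetical table name; the comment above only says "table".
TABLE = "my_dataset.transactions"


class StreamingBufferConflict(Exception):
    """Hypothetical error raised by run_dml when the UPDATE touches
    rows that are still in the streaming buffer."""


def build_probe_sql(x_unix_seconds: int, table: str = TABLE) -> str:
    """Build the probe statement from the comment for one value of X."""
    return (
        f"UPDATE {table} SET dedupe = 1 "
        f"WHERE UNIX_SECONDS(consensusTimestampTruncated) = {int(x_unix_seconds)}"
    )


def is_outside_streaming_buffer(
    x_unix_seconds: int, run_dml: Callable[[str], None]
) -> bool:
    """Return True if the probe UPDATE succeeds, i.e. X is not in the buffer.

    Only StreamingBufferConflict is treated as "X is in the buffer";
    any other error propagates to the caller.
    """
    try:
        run_dml(build_probe_sql(x_unix_seconds))
    except StreamingBufferConflict:
        return False
    return True
```

Since every call costs a full DML statement (~10-20s per the estimate above), the probing strategy matters more than the probe itself.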
To probe for the right X, we have a couple of options:
1. Linear probing:
   X = startTime + T
   Iterate over T = t, 2t, 3t, 4t, ... until it fails, and use the last passing one. t can be something like 10 min.
2. Quadratic probing:
   X = startTime + T
   Iterate over T = t, 2t, 4t, 8t, ... and stop as above. (Strictly speaking the step doubles each time, so this is exponential rather than quadratic.) Better if dedupe has to catch up over large windows.
3. Two-state method:
   We know that our system will be in one of two states: steady state or catching up.
   In steady state, dedupe would be right behind the streaming buffer boundary, and something like T = 10 min would be enough.
   In catch-up, dedupe would have a large amount of data to get through, and we can choose a TT of a couple of hours so that it catches up fast.
   We only check T and TT and use the largest passing one, if any.
I like (2) since it is easy enough to implement. In steady state it will do two probes, like (3), but when catching up it will be much faster.
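The three options can be compared with a rough sketch like the one below. It assumes a `passes(T)` callable that returns True when startTime + T is outside the streaming buffer, and that the answer is monotonic (if a larger T passes, every smaller T passes too). All function names and the `max_steps` cap are made up for illustration.

```python
from typing import Callable, List, Optional, Tuple

Probe = Callable[[int], bool]  # True when startTime + T is safe to dedupe


def linear_probe(passes: Probe, t: int, max_steps: int = 64) -> Optional[int]:
    """Option 1: try T = t, 2t, 3t, ... and return the last passing T."""
    best = None
    for k in range(1, max_steps + 1):
        if not passes(k * t):
            break
        best = k * t
    return best


def doubling_probe(passes: Probe, t: int, max_steps: int = 64) -> Optional[int]:
    """Option 2: try T = t, 2t, 4t, 8t, ... and return the last passing T."""
    best = None
    T = t
    for _ in range(max_steps):
        if not passes(T):
            break
        best = T
        T *= 2
    return best


def two_state_probe(passes: Probe, t: int, tt: int) -> Optional[int]:
    """Option 3: check only the large window TT, then the small window T."""
    if passes(tt):
        return tt  # catching up: one probe is enough
    if passes(t):
        return t   # steady state: two probes
    return None


def make_fake_probe(safe_seconds: int) -> Tuple[Probe, List[int]]:
    """Test helper: everything up to safe_seconds passes; calls are recorded."""
    calls: List[int] = []

    def passes(T: int) -> bool:
        calls.append(T)
        return T <= safe_seconds

    return passes, calls
```

With t = 10 min and a fake boundary 6 hours past startTime, the linear version needs 37 probes to find the full 6-hour window, while the doubling version stops after 7 probes with a window of 5h20m. With the boundary only 15 minutes away, both stop after 2 probes.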
from hedera-etl.
I like option (2).
Another option is to look at the earliest entry time in the streamingBuffer returned by a tables.get request: https://stackoverflow.com/questions/43085896/update-or-delete-tables-with-streaming-buffer-in-bigquery. At least in the BigQuery UI I can see an "Earliest entry time" field in the table Details tab.
I suspect the API should return this field as well. Then either use linear or quadratic probing backwards starting from that timestamp, or simply subtract 5 minutes from it.
The caveat here is that this time is likely the ingestion time, not the consensusTimestampTruncated field from the record.
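A small sketch of the "subtract 5 minutes" idea, assuming the tables.get response has already been fetched and parsed into a dict. In the REST table resource the field is `streamingBuffer.oldestEntryTime`, a string of milliseconds since the epoch; the function name and the sample value are made up for illustration. As noted above, this is likely an ingestion time, so the result only bounds ingestion time, not consensusTimestamp.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def safe_upper_bound(
    table_resource: dict, margin: timedelta = timedelta(minutes=5)
) -> Optional[datetime]:
    """Return oldestEntryTime minus a safety margin, or None if the
    table currently reports no streaming buffer."""
    buf = table_resource.get("streamingBuffer")
    if not buf or "oldestEntryTime" not in buf:
        return None
    oldest_ms = int(buf["oldestEntryTime"])  # the API encodes it as a string
    return EPOCH + timedelta(milliseconds=oldest_ms) - margin


# Made-up sample of the relevant part of a tables.get response.
sample = {"streamingBuffer": {"oldestEntryTime": "1589846400000"}}
bound = safe_upper_bound(sample)
```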
from hedera-etl.
That's one disappointing data point. I had such high hopes 🌈 for it when I was starting to implement deduplication for the first time. Soon it was all 🌧️.
Since it is BigQuery's internal timestamp for the row, it is basically useless in catch-up mode. For example, in your image it's close to 5/19/2020, whereas the actual consensusTimestamp for rows being loaded right now is around 9/15/19.
If only GCP also gave a way to access the row corresponding to it, it would have solved all problems :)
from hedera-etl.
Yeah that's frustrating. Hopefully they will support updating records in the streaming buffer soon.
Another simple approach would be to run an hourly job that deduplicates rows in a fixed range spanning 1 hour. Say the job at 13:40 deduplicates rows in the range 11:00 - 12:00, the job at 14:40 deduplicates rows in the range 12:00 - 13:00, etc. Add to that a daily job that deduplicates rows from the past day, and a one-time job that dedups the rows at the end of the initial load. It's quite straightforward, though it doesn't solve all the possible edge cases. Those can be caught by the daily job that checks for duplicates in the entire dataset, which we discussed before. Since this approach relies heavily on retries and on keeping the state of scheduled jobs, it is perhaps a better fit for Composer. It can likely be done in Spring too, but would be more involved than the approach you described with probing X.
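The fixed hourly range from the example can be derived from the job's run time, roughly as below. This is only a sketch; the function name and the `lag_hours` parameter are made up, with the default of 2 chosen to reproduce the 13:40 → 11:00 - 12:00 example.

```python
from datetime import datetime, timedelta
from typing import Tuple


def hourly_dedupe_window(
    run_time: datetime, lag_hours: int = 2
) -> Tuple[datetime, datetime]:
    """Return the [start, end) hour that a job started at run_time should
    deduplicate: the full hour beginning lag_hours before the top of the
    current hour."""
    top_of_hour = run_time.replace(minute=0, second=0, microsecond=0)
    start = top_of_hour - timedelta(hours=lag_hours)
    return start, start + timedelta(hours=1)
```

Because the window depends only on the run time, a retried job recomputes the same range, as long as the scheduler passes the originally scheduled time rather than the retry time.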
from hedera-etl.