Comments (3)
Thank you for this report.
Fixing this is going to require a large change to our current naive windowing model and is an oversight in our initial design of the internal windowing machinery. Currently we assume that window assignment is static and can never change and is immediately passed to windowing logic for that window for processing. Session windows and watermarks interact so that adding out-of-order data means that it's possible that a window assignment for a previous data point is no longer correct when out-of-order data arrives (even if not "late" and before the watermark).
E.g. You want to calculate 5 min session windows. Assume the watermark is at 12:50 PM and does not advance through this example. A data point arrives with timestamp 1:00 PM; it is assigned window ID 1 and is passed to the ID 1 logic for processing. A data point then arrives with timestamp 1:06 PM; it is assigned window ID 2 because it has a >5 min gap, and is passed to the ID 2 logic for processing. Then a data point arrives with timestamp 1:03 PM; (in the current logic) it is assigned to window ID 1 because it is <5 min gap and then passed to the window ID 1 logic for processing.
This is correct but incomplete and inconsistent: Somehow the 1:06 PM data point should be "re-assigned" to the ID 1 window because the new out-of-order data point at 1:03 PM causes it to in total be in the same session as the initial 1:00 PM point, even though it initially got an ID 2. The current internal windowing API does not have the capability to do this, nor do I think that that is necessarily the correct design. It is also possible to do something like "buffer items until the window is closed1 and then process all at once". It's also possible to require that window logic be always phrased as reductions so windows can be combined. We'll have to spend some time figuring out if there's a unifying way to re-design the internal interfaces to allow all windowers to be implemented in the same way.
I recommend this work be included as part of the "move windowing operators into Python" work as to not tee this up to be done twice.
I'll try to think if there are some more immediate mitigations you can do to still implement a generally-correct session window join, but I'm not sure there are.
Footnotes
-
I'm actually not sure what the definition of "closed" is in this sense and it might require custom logic per-windowing type. E.g. in session windowing you need to take into account the gap parameter to know how far past the watermark you should wait to know all window assignments are final; but that's not the same logic for all windowing types. ↩
from bytewax.
Sorry accidental autoclose. This is not fixed.
from bytewax.
This will be fixed with the next major release due to #433 which added "window merging" features.
from bytewax.
Related Issues (20)
- Separate `epoch_interval` and `snapshot_interval`
- `count_window` only sends the count to the event clock's `dt_getter` function
- [FEATURE] Release a Python 3.12 wheel HOT 1
- Kafka group.id to manage the number of workers Bytewax Docker HOT 3
- Provide worker count and worker index in list_parts of FixedPartitionedSource HOT 2
- Allow intra-file source parallelism HOT 1
- Some mechanism for queuing batch source partition reads
- Backup interval example from docs does not work
- [FEATURE] - Add Auth to RedpandaSchemaRegistry HOT 1
- Bytewax does not scale in case of single process and multiple workers HOT 3
- RTD flyout view page source link is broken for API docs HOT 4
- Calendar windower
- Data missing from windows when `align_to` is very long ago
- [FEATURE] Make CodSpeed work HOT 3
- Inconsistent SessionWindow output HOT 8
- Prometheus monitoring fails because of the corrupted metrics response HOT 2
- [FEATURE] Visualize data flow graph HOT 1
- Hook to allow cached prep for all logic classes
- Operator chaining
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from bytewax.