Comments (3)

davidselassie commented on June 13, 2024

Thank you for this report.

Fixing this is going to require a large change to our current naive windowing model; the problem is an oversight in our initial design of the internal windowing machinery. Currently we assume that window assignment is static: once an item is assigned to a window, that assignment never changes, and the item is immediately passed to that window's logic for processing. Session windows break this assumption. When out-of-order data arrives (even if it is not "late" relative to the watermark), the window assignment for a previous data point may no longer be correct.

For example, suppose you want to calculate session windows with a 5 min gap. Assume the watermark is at 12:50 PM and does not advance through this example. A data point arrives with timestamp 1:00 PM; it is assigned window ID 1 and is passed to the ID 1 logic for processing. A data point then arrives with timestamp 1:06 PM; it is assigned window ID 2 because it is more than 5 min after the previous point, and is passed to the ID 2 logic for processing. Then a data point arrives with timestamp 1:03 PM; in the current logic, it is assigned to window ID 1 because it is less than 5 min from the 1:00 PM point, and is then passed to the ID 1 logic for processing.
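The walk-through above can be reproduced with a small pure-Python sketch. It does not use bytewax's internals; `assign_static`, `at`, and the data layout are hypothetical names chosen for illustration, and the "join the first session within the gap" rule is an assumption about how a static assigner might behave.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=5)


def assign_static(timestamps, gap=GAP):
    """Assign each timestamp a session window ID on arrival, never revising it."""
    sessions = {}  # window ID -> (earliest, latest) timestamp seen in that window
    assignments = []
    for ts in timestamps:
        for window_id, (start, end) in sessions.items():
            if start - gap < ts < end + gap:
                # Within the gap of an existing session: join and extend it.
                sessions[window_id] = (min(start, ts), max(end, ts))
                break
        else:
            # No session is close enough: open a new one.
            window_id = len(sessions) + 1
            sessions[window_id] = (ts, ts)
        assignments.append((ts, window_id))
    return assignments, sessions


def at(hour, minute):
    """Hypothetical helper: a timestamp on an arbitrary fixed day."""
    return datetime(2024, 1, 1, hour, minute)


# Arrival order from the example: 1:00 PM, 1:06 PM, then the out-of-order 1:03 PM.
assignments, sessions = assign_static([at(13, 0), at(13, 6), at(13, 3)])
window_ids = [window_id for _, window_id in assignments]
```

Here `window_ids` ends up as `[1, 2, 1]`, and the two sessions that remain are only 3 min apart, which is less than the gap. The 1:06 PM point keeps its stale ID 2 even though all three points now belong to one session.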

Each individual assignment is correct, but the overall result is incomplete and inconsistent: the 1:06 PM data point should somehow be "re-assigned" to the ID 1 window, because the new out-of-order data point at 1:03 PM bridges the gap and puts it in the same session as the initial 1:00 PM point, even though it initially got ID 2. The current internal windowing API does not have the capability to do this, nor do I think that re-assignment is necessarily the correct design. It is also possible to do something like "buffer items until the window is closed[1] and then process all at once". It's also possible to require that window logic always be phrased as reductions so that windows can be combined. We'll have to spend some time figuring out if there's a unifying way to re-design the internal interfaces to allow all windowers to be implemented in the same way.
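As a rough illustration of the "phrase window logic as reductions so windows can be combined" option, here is a minimal pure-Python sketch. It is not the bytewax API or its eventual design; `add_with_merge`, the `(start, end, acc)` tuple layout, and the requirement that `combine` be associative and commutative are all assumptions made for this example.

```python
import operator
from datetime import datetime, timedelta

GAP = timedelta(minutes=5)


def add_with_merge(sessions, ts, value, combine, gap=GAP):
    """Insert one item and merge every session that lies within `gap` of it.

    `sessions` is a list of (start, end, acc) tuples whose ranges are at
    least `gap` apart. `combine` is assumed associative and commutative,
    so accumulators can be folded together in any order.
    """
    start, end, acc = ts, ts, value
    kept = []
    for s_start, s_end, s_acc in sessions:
        if s_start - gap < ts < s_end + gap:
            # The new item is close enough: fold this session into the result.
            start = min(start, s_start)
            end = max(end, s_end)
            acc = combine(acc, s_acc)
        else:
            kept.append((s_start, s_end, s_acc))
    kept.append((start, end, acc))
    return sorted(kept)


def at(hour, minute):
    """Hypothetical helper: a timestamp on an arbitrary fixed day."""
    return datetime(2024, 1, 1, hour, minute)


# Count items per session, using the arrival order from the example.
sessions = []
for ts in [at(13, 0), at(13, 6), at(13, 3)]:
    sessions = add_with_merge(sessions, ts, 1, operator.add)
```

After the 1:03 PM item arrives, the two earlier sessions collapse into a single session spanning 1:00 PM to 1:06 PM with a count of 3. The trade-off is that only logic expressible as a combinable accumulator fits this shape; logic that needs every raw item would still have to buffer them.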

I recommend this work be included as part of the "move windowing operators into Python" work so that it does not end up being done twice.


I'll try to think of whether there are more immediate mitigations that would still let you implement a generally-correct session window join, but I'm not sure there are.

Footnotes

  1. I'm actually not sure what the definition of "closed" is in this sense and it might require custom logic per-windowing type. E.g. in session windowing you need to take into account the gap parameter to know how far past the watermark you should wait to know all window assignments are final; but that's not the same logic for all windowing types.
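One possible reading of "closed" for session windows, sketched in Python under stated assumptions: items with a timestamp earlier than the watermark are treated as late and dropped, and an item joins a session only if it falls strictly within the gap of it. Under those assumptions, a session can no longer grow once the watermark reaches the session's last timestamp plus the gap. `session_is_final` is a hypothetical helper, not a bytewax function.

```python
from datetime import datetime, timedelta


def session_is_final(session_end, watermark, gap):
    """Return True once no on-time item can still extend or merge the session.

    Assumes any item with timestamp < watermark is dropped as late, and that
    an item joins the session only when its timestamp < session_end + gap.
    """
    return watermark >= session_end + gap


gap = timedelta(minutes=5)
session_end = datetime(2024, 1, 1, 13, 6)  # last item in the session: 1:06 PM

still_open = session_is_final(session_end, datetime(2024, 1, 1, 12, 50), gap)
almost = session_is_final(session_end, datetime(2024, 1, 1, 13, 10), gap)
closed = session_is_final(session_end, datetime(2024, 1, 1, 13, 11), gap)
```

With the watermark at 12:50 PM or 1:10 PM the session is not yet final; at 1:11 PM it is. Other windowing types would need a different rule, which is the point the footnote makes.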

from bytewax.

davidselassie commented on June 13, 2024

Sorry, that was an accidental autoclose. This is not fixed.

davidselassie commented on June 13, 2024

This will be fixed in the next major release thanks to #433, which added "window merging" features.
