Coder Social home page Coder Social logo

EvalFunc Return Type about datasketches-pig HOT 9 CLOSED

apache avatar apache commented on July 28, 2024
EvalFunc Return Type

from datasketches-pig.

Comments (9)

AlexanderSaydakov avatar AlexanderSaydakov commented on July 28, 2024

There is a reason for this. I don't recall exactly what it was. I thought about this and tried it at some point during development. I think it has to do with the way Pig packages the intermediate data (data between the stages).

from datasketches-pig.

kainoa21 avatar kainoa21 commented on July 28, 2024

Agreed that with the Algebraic interface you have to do a type check, but that has to happen no matter what (https://github.com/DataSketches/sketches-pig/blob/master/src/main/java/com/yahoo/sketches/pig/theta/DataToSketch.java#L517)

Or are you referring to something else?

In other similar libraries that compute binary data structures, they generally just return the DataByteArray type directly. This allows a cleaner syntax, especially when used together with the estimate udf like getEstimate(dataToSketch(bag.field))

Would you be open to breaking PR that changes the return types of the sketch functions?

from datasketches-pig.

AlexanderSaydakov avatar AlexanderSaydakov commented on July 28, 2024

The initial stage in the algebraic mode is pass-through by design. So it has to return the original tuple. This is done for performance. Otherwise each input record would be converted into a sketch with one entry.

from datasketches-pig.

kainoa21 avatar kainoa21 commented on July 28, 2024

That makes perfect sense and the return types of the Initial and Intermediate exec functions have to be Tuple. But the return type of the Final exec just needs to match the generic type defined by the UDF.

I think this is a great library but this tripped me up when I tried to use it the first time because I was expecting the sketch udf to return a DataByteArray as that is more the norm for these types of functions.

from datasketches-pig.

AlexanderSaydakov avatar AlexanderSaydakov commented on July 28, 2024

I want to clarify this issue.

Our design decision was to have a pass-through initial stage of the Algebraic interface for performance reasons. The alternative would be to produce a sketch with one entry for each input row, and in the next stages only deal with merging sketches. This would work, but would be very inefficient.

A pass-through initial stage has to return the original tuple, and this dictates the return types of all other stages and the UDF overall because Pig expects the types to agree.

Yes, I agree, it does look a bit unnatural and unnecessary on the first glance, but having to flatten the result is a small price to pay for the efficiency.

from datasketches-pig.

kainoa21 avatar kainoa21 commented on July 28, 2024

I believe this statement is incorrect:
"A pass-through initial stage has to return the original tuple, and this dictates the return types of all other stages and the UDF overall because Pig expects the types to agree."

That was the point I was trying to make in my previous comment, that the return type for the Initial and Intermediate stage is expected to be Tuple, but that the return type from the Final stage is what matters and that is the only one that has to match the generic type of the EvalFunc class.

from datasketches-pig.

AlexanderSaydakov avatar AlexanderSaydakov commented on July 28, 2024

Oh, now I see your point. You might be right. I would like to investigate this. These UDFs were implemented several years ago by somebody else, and I took this requirement of the same output type for granted. Maybe it was required before, I am not sure. If this works as you suggest, we may change the UDFs at some point, but it is an incompatible change, so we will need to find the right opportunity. I would suggest keeping this issue open for now.

from datasketches-pig.

kainoa21 avatar kainoa21 commented on July 28, 2024

Thanks @AlexanderSaydakov. As a POC, see this fork which makes this change for the Theta sketch.

I verified that this works and gives consistent results with the test data and script found here. In fact, it works with the script as is: flatten(dataToSketch(a.id)) as (sketch) AND also works calling it like so: dataToSketch(a.id) as sketch. So in some sense it may be pseudo backward compatible.

I do think it makes the resulting pig syntax a little cleaner, unsure if that is really worth it.

from datasketches-pig.

leerho avatar leerho commented on July 28, 2024

I am labeling this as a possible enhancement, but low priority and closing this issue after lingering for over 3 years. We can always reopen this issue and revisit this in the future.

from datasketches-pig.

Related Issues (5)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.