Comments (9)
There is a reason for this. I don't recall exactly what it was. I thought about this and tried it at some point during development. I think it has to do with the way Pig packages the intermediate data (data between the stages).
from datasketches-pig.
Agreed that with the Algebraic interface you have to do a type check, but that has to happen no matter what (https://github.com/DataSketches/sketches-pig/blob/master/src/main/java/com/yahoo/sketches/pig/theta/DataToSketch.java#L517)
Or are you referring to something else?
In other similar libraries that compute binary data structures, they generally just return the DataByteArray type directly. This allows a cleaner syntax, especially when used together with the estimate udf like getEstimate(dataToSketch(bag.field))
Would you be open to breaking PR that changes the return types of the sketch functions?
from datasketches-pig.
The initial stage in the algebraic mode is pass-through by design. So it has to return the original tuple. This is done for performance. Otherwise each input record would be converted into a sketch with one entry.
from datasketches-pig.
That makes perfect sense and the return types of the Initial and Intermediate exec functions have to be Tuple. But the return type of the Final exec just needs to match the generic type defined by the UDF.
I think this is a great library but this tripped me up when I tried to use it the first time because I was expecting the sketch udf to return a DataByteArray as that is more the norm for these types of functions.
from datasketches-pig.
I want to clarify this issue.
Our design decision was to have a pass-through initial stage of the Algebraic interface for performance reasons. The alternative would be to produce a sketch with one entry for each input row, and in the next stages only deal with merging sketches. This would work, but would be very inefficient.
A pass-through initial stage has to return the original tuple, and this dictates the return types of all other stages and the UDF overall because Pig expects the types to agree.
Yes, I agree, it does look a bit unnatural and unnecessary on the first glance, but having to flatten the result is a small price to pay for the efficiency.
from datasketches-pig.
I believe this statement is incorrect:
"A pass-through initial stage has to return the original tuple, and this dictates the return types of all other stages and the UDF overall because Pig expects the types to agree."
That was the point I was trying to make in my previous comment, that the return type for the Initial and Intermediate stage is expected to be Tuple, but that the return type from the Final stage is what matters and that is the only one that has to match the generic type of the EvalFunc class.
from datasketches-pig.
Oh, now I see your point. You might be right. I would like to investigate this. These UDFs were implemented several years ago by somebody else, and I took this requirement of the same output type for granted. Maybe it was required before, I am not sure. If this works as you suggest, we may change the UDFs at some point, but it is an incompatible change, so we will need to find the right opportunity. I would suggest keeping this issue open for now.
from datasketches-pig.
Thanks @AlexanderSaydakov. As a POC, see this fork which makes this change for the Theta sketch.
I verified that this works and gives consistent results with the test data and script found here. In fact, it works with the script as is: flatten(dataToSketch(a.id)) as (sketch)
AND also works calling it like so: dataToSketch(a.id) as sketch
. So in some sense it may be pseudo backward compatible.
I do think it makes the resulting pig syntax a little cleaner, unsure if that is really worth it.
from datasketches-pig.
I am labeling this as a possible enhancement, but low priority and closing this issue after lingering for over 3 years. We can always reopen this issue and revisit this in the future.
from datasketches-pig.
Related Issues (5)
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from datasketches-pig.