Comments (2)

mbossenbroek commented on July 26, 2024

That has to do with how folds are composed, which is admittedly a little wonky. A little background...

There are 4 phases to a fold operation: pre-processing, reducing, combining, and post-processing. Pre-processing and post-processing compose in a fairly regular manner, but the reducer/combiner parts go in pairs and don't compose nicely.

Let's look at the combinef/reducef for both count and distinct.

fn        seed  reducef  combinef
count     0     count    (reduce +)
distinct  #{}   conj     set/union

For your example, we would compute the distinct elements in each mapper and add them to a set. In the combiner, we would then merge those intermediate sets. We can't count the items in the mappers, because then we would lose track of which items are in each set and couldn't correctly combine the mapper outputs.
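Here is a small illustration of that point in plain Clojure, with no PigPen involved. The two partitions stand in for the inputs of two mappers; the data and the var names are made up for the example.

```clojure
(require '[clojure.set :as set])

;; Two hypothetical mapper inputs that share the value :b.
(def partitions [[:a :b :b] [:b :c]])

;; Each "mapper" reduces its partition into a set with conj.
(def partial-sets (map #(reduce conj #{} %) partitions))

;; Combining the sets keeps element identity, so the shared :b collapses.
(def merged (reduce set/union #{} partial-sets))

;; Counting inside the mappers keeps only the sizes; adding them
;; counts :b once per partition.
(def summed-counts (reduce + (map count partial-sets)))

(println "distinct via union:" (count merged))
(println "sum of per-mapper counts:" summed-counts)
```

The union gives 3 distinct items, while the summed per-mapper counts give 4, because the sizes alone can't tell the combiner that :b appeared in both partitions.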

I still wanted to have fold operations sort of compose, so that you could do stuff like this:

(->> (fold/map f) (fold/filter g) (fold/distinct) (fold/first))

Which ends up doing something like this:

         map  filter  distinct  first  result
pre      f    g                        (comp g f)
reduce                conj             conj
combine               union            union
post                            first  first

But it's not clear which parts apply to which phases. I thought that I would throw an exception if you tried to do what you did, but clearly I do not. What you tried looks something like this because it's using the distributed count:

         distinct  count  result
pre
reduce   conj      count  count
combine  union     +      +
post

Where I'm just taking the last reducer/combiner combo that you supply. What you really want is something like this:

         distinct  count  result
pre
reduce   conj             conj
combine  union            union
post               count  count

Which does the count as a post-processing operation. So now there are two versions of fold/count and that's confusing too...
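To make the difference between those two tables concrete, here is a rough local model in plain Clojure. It is only a sketch: run-fold is a hypothetical helper, not PigPen's implementation, and the counting reducer (fn [acc _] (inc acc)) is an assumption about what a distributed count would look like.

```clojure
(require '[clojure.set :as set])

(defn run-fold
  "Hypothetical local model of the four phases. The seed for both the
  reduce and the combine comes from calling combinef with no args."
  [{:keys [pre combinef reducef post]} partitions]
  (->> partitions
       (map #(reduce reducef (combinef) (pre %)))  ; pre + reduce, per partition
       (reduce combinef (combinef))                ; combine the partial results
       post))                                      ; post-process once at the end

;; Made-up input: two partitions that share the value :b.
(def partitions [[:a :b :b] [:b :c]])

;; First table: only the last reducer/combiner pair (from count) survives.
(def last-pair-wins
  {:pre identity
   :reducef (fn [acc _] (inc acc))  ; assumed counting reducer
   :combinef +
   :post identity})

;; Second table: distinct's reducer/combiner, with count as post-processing.
(def count-as-post
  {:pre identity
   :reducef conj
   :combinef set/union
   :post count})

(println "last pair wins:" (run-fold last-pair-wins partitions))
(println "count as post: " (run-fold count-as-post partitions))
```

In this model the first spec counts every record (5) and the distinct step is lost entirely, while the second counts the merged set (3).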

So, what can you do? The solution today is to define a custom fold-fn that specifies which part is which:

(->> (pig/return records)
     (pig/fold (fold/fold-fn clojure.set/union conj count))
     (pig/dump))

... or if you want to reuse it ...

(defn distinct-count
  []
  (fold/fold-fn clojure.set/union conj count))

(->> (pig/return records)
     (pig/fold (distinct-count))
     (pig/dump))

It's not perfect and it could be better. If you have any ideas for making these compose better, I'm all ears. Take a look at the comp-* fns in pigpen.fold if you're curious. If not, pigpen.fold/fold-fn is probably your friend :)

from pigpen.

theJohnnyBrown commented on July 26, 2024

Thanks, this is very helpful. I was wondering whether fold-fn would be necessary. Looks like it's time for a little brain-stretching :)
