Comments (2)
That has to do with how folds are composed, which is admittedly a little wonky. A little background...
There are 4 phases to a fold operation: pre-processing, reducing, combining, and post-processing. Pre-processing and post-processing compose in a fairly regular manner, but the reducer/combiner parts go in pairs and don't compose nicely.
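To make the four phases concrete, here is a minimal sketch in plain Clojure. It is not PigPen's implementation: `run-fold`, the map keys, and the sample partitions are hypothetical names chosen for illustration, and the seed is assumed to come from calling the combiner with no arguments.

```clojure
(require '[clojure.set :as set])

;; Hypothetical helper, not part of PigPen: runs the four phases over data
;; that has already been split into partitions (one per mapper).
(defn run-fold
  [{:keys [pre reducef combinef post]} partitions]
  (let [seed     (combinef)                                   ; assumed: zero-arg call yields the seed
        partials (map #(reduce reducef seed (pre %)) partitions)]
    (post (reduce combinef seed partials))))

;; Pre-processing composes freely (map, then filter), but there is
;; exactly one reducer/combiner pair for the whole fold.
(def sorted-distinct-odds
  {:pre      #(filter odd? (map inc %)) ; pre-processing on each partition
   :reducef  conj                       ; per-partition reduce into a set
   :combinef set/union                  ; merge the partial sets
   :post     sort})                     ; post-processing on the merged set

(def result (run-fold sorted-distinct-odds [[1 2 2] [2 4 6]]))
;; result is (3 5 7)
```

The pre and post steps are ordinary functions that can be chained with `comp`, while `reducef` and `combinef` have to agree on the accumulator type, which is why they travel as a pair.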
Let's look at the reducef/combinef pair for both count and distinct.
fn | seed | reducef | combinef |
---|---|---|---|
count | 0 | count | (reduce +) |
distinct | #{} | conj | set/union |
For your example, each mapper would compute its distinct elements by adding them to a set. In the combiner, we would then merge those intermediate sets. We can't count the items in the mappers, because a count no longer tells us which items were in the set, so the outputs of the mappers can't be combined correctly.
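A small plain-Clojure sketch of that point, with two made-up mapper inputs (`mapper-inputs` and the other names are hypothetical, not PigPen code):

```clojure
(require '[clojure.set :as set])

;; Two hypothetical mapper inputs; the value 2 appears in both.
(def mapper-inputs [[1 2 2] [2 3]])

;; Reduce phase: each mapper conj's its records into a set, starting from #{}.
(def partial-sets (map #(reduce conj #{} %) mapper-inputs))

;; Combine phase: merge the intermediate sets.
(def merged (reduce set/union partial-sets))

;; Counting after the merge gives the number of distinct elements.
(def distinct-total (count merged))

;; Counting inside each mapper and adding the counts double-counts
;; the element that the two mappers share.
(def summed-counts (reduce + (map count partial-sets)))
```

Here `distinct-total` is 3 while `summed-counts` is 4, because the shared element 2 is counted once per mapper.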
I still wanted to have fold operations sort of compose, so that you could do stuff like this:
(->> (fold/map f) (fold/filter g) (fold/distinct) (fold/first))
Which ends up doing something like this:
phase | map | filter | distinct | first | result |
---|---|---|---|---|---|
pre | f | g | | | (comp g f) |
reduce | | | conj | | conj |
combine | | | union | | union |
post | | | | first | first |
But it's not clear which parts apply to which phases. I thought that I would throw an exception if you tried to do what you did, but clearly I do not. What you tried looks something like this because it's using the distributed count:
phase | distinct | count | result |
---|---|---|---|
pre | | | |
reduce | conj | count | count |
combine | union | + | + |
post | | | |
Where I'm just taking the last reducer/combiner combo that you supply. What you really want is something like this:
phase | distinct | count | result |
---|---|---|---|
pre | | | |
reduce | conj | | conj |
combine | union | | union |
post | | count | count |
Which does the count as a post-processing operation. So now there are two versions of fold/count and that's confusing too...
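The difference between the two tables can be sketched in plain Clojure. This is only an illustration: `fold-partitions` is a hypothetical stand-in for the reduce, combine, and post phases, and the incrementing reducer is an assumed reading of the `count` entry in the reduce row.

```clojure
(require '[clojure.set :as set])

;; Hypothetical input, already split across two mappers.
(def partitions [[:a :b :a] [:b :c]])

;; Hypothetical stand-in for the reduce, combine, and post phases.
(defn fold-partitions
  [seed reducef combinef post partitions]
  (post (reduce combinef (map #(reduce reducef seed %) partitions))))

;; "Last reducer/combiner pair wins": the distinct step is dropped,
;; so every record is counted.
(def plain-count
  (fold-partitions 0 (fn [acc _] (inc acc)) + identity partitions))

;; distinct's reducer/combiner pair, with count as a post-processing step.
(def distinct-then-count
  (fold-partitions #{} conj set/union count partitions))
```

With this input `plain-count` is 5 (all records) and `distinct-then-count` is 3 (the elements :a, :b, and :c).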
So, what can you do? The solution today is to define a custom fold-fn that specifies which part is which:
(->> (pig/return records)
     (pig/fold (fold/fold-fn clojure.set/union conj count))
     (pig/dump))
... or if you want to reuse it ...
(defn distinct-count
  []
  (fold/fold-fn clojure.set/union conj count))

(->> (pig/return records)
     (pig/fold (distinct-count))
     (pig/dump))
It's not perfect and it could be better. If you have any ideas for making these compose better, I'm all ears. Take a look at the comp-* fns in pigpen.fold if you're curious. If not, pigpen.fold/fold-fn is probably your friend :)
from pigpen.
Thanks, this is very helpful. I was wondering whether fold-fn would be necessary. Looks like it's time for a little brain-stretching :)