
pigpen's Introduction

PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig or Cascading, but you don't need to know much about either of them to use it.

Getting Started, Tutorials & Documentation

Getting started with Clojure and PigPen is really easy.

Note: It is strongly recommended to familiarize yourself with Clojure before using PigPen.

Note: PigPen is not a Clojure wrapper for writing Pig scripts you can hand edit. While hand editing is entirely possible, the resulting scripts are not intended for human consumption.
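To give a flavor of the API, here is a minimal word-count-style query, loosely adapted from the PigPen tutorial. The input path `input.tsv` and output path `output.clj` are placeholders:

```clojure
(require '[pigpen.core :as pig]
         '[clojure.string :as str])

;; Load a TSV file, split the first column of each line into words,
;; count occurrences of each word, and store the result as Clojure data.
(defn word-count-query []
  (->> (pig/load-tsv "input.tsv")
       (pig/mapcat #(-> % first str/lower-case (str/split #"\s+")))
       (pig/group-by identity)
       (pig/map (fn [[word occurrences]] [word (count occurrences)]))
       (pig/store-clj "output.clj")))
```

The same query runs locally with `pig/dump` or compiles to a Pig or Cascading job; nothing in it is Pig-specific.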

Questions & Complaints

Artifacts

PigPen is available from Maven:

With Leiningen:

;; core library
[com.netflix.pigpen/pigpen "0.3.3"]

;; pig support
[com.netflix.pigpen/pigpen-pig "0.3.3"]

;; cascading support
[com.netflix.pigpen/pigpen-cascading "0.3.3"]

;; rx support
[com.netflix.pigpen/pigpen-rx "0.3.3"]

The platform libraries all depend on the core library, so you only need to reference the platform-specific artifact you require; the core library is included transitively.

Note: PigPen requires Clojure 1.5.1 or greater

Parquet

To use the parquet loader, add this to your dependencies:

[com.netflix.pigpen/pigpen-parquet-pig "0.3.3"]

Here is an example of how to write Parquet data.

(require '[pigpen.core :as pig])
(require '[pigpen.parquet :as pqt])

;;
;; assuming that `data` is in tuples
;;
;; [["John" "Smith" 28]
;;  ["Jane" "Doe"   21]]

(defn save-to-parquet
  [output-file data]
  (->> data
       ;; turning tuples into a map
       (pig/map (partial zipmap [:firstname :lastname :age]))
       ;; then storing to Parquet files
       (pqt/store-parquet
        output-file
        (pqt/message "test-schema"
                     ;; the field names here MUST match the map's keys
                     (pqt/binary "firstname")
                     (pqt/binary "lastname")
                     (pqt/int64  "age")))))

And how to load the records back:

(defn load-from-parquet
  [input-file]
  ;; the output will be a sequence of maps
  (pqt/load-parquet
   input-file
   (pqt/message "test-schema"
                (pqt/binary "firstname")
                (pqt/binary "lastname")
                (pqt/int64  "age"))))

See the pigpen.parquet namespace for complete usage.

Note: Parquet is currently only supported by Pig

Avro

To use the avro loader (alpha), add this to your dependencies:

[com.netflix.pigpen/pigpen-avro-pig "0.3.3"]

And check out the pigpen.avro namespace for usage.

Note: Avro is currently only supported by Pig

Release Notes

  • 0.3.3 - 5/19/16

    • Explicitly disable *print-length* and *print-level* when generating scripts
    • Add a better error message for storage types that expect a map with keywords
  • 0.3.2 - 1/12/16

    • Allow more types in generated pig scripts
  • 0.3.1 - 10/19/15

    • Update cascading version to 2.7.0
    • Report correct pigpen version to concurrent
    • Update nippy to 2.10.0 & tune performance
  • 0.3.0 - 5/18/15

    • No changes
  • 0.3.0-rc-7 - 4/29/15

    • Fixed bug in local mode where nils weren't handled consistently
  • 0.3.0-rc.6 - 4/14/15

    • Add local mode code eval memoization to avoid thrashing permgen
    • Added pigpen.pig/set-options command to explicitly set pig options in a script. This was previously available (though undocumented) by setting {:pig-options {...}} in any options block, but is now official.
  • 0.3.0-rc.5 - 4/9/15

    • Update core.async version
  • 0.3.0-rc.4 - 4/8/15

    • Memoize code evaluation when run in the cluster
  • 0.3.0-rc.3 - 4/2/15

    • Bugfixes
  • 0.3.0-rc.2 - 3/30/15

    • Parquet refactor. Local parquet loading no longer depends on Pig. Parquet schemas are now defined using Parquet classes.
  • 0.3.0-rc.1 - 3/23/15

    • Added Cascading support
    • Added pigpen.core/keys-fn, a new convenience macro to support named anonymous functions. Like keys destructuring, but less verbose.
    • New function based operators to build more dynamic scripts. These are function versions of all the core pigpen macros, but you have to handle quoting user code manually. These were previously available, but not officially supported. Now they're alpha, but supported and documented. See pigpen.core.fn
    • New lower-level operators to build custom storage and commands. These were previously available, but not officially supported. Now they're alpha, but supported and documented. See pigpen.core.op
    • *** Breaking Changes ***
      • pigpen.core/script is now pigpen.core/store-many
      • pigpen.core/generate-script is now pigpen.pig/generate-script
      • pigpen.core/write-script is now pigpen.pig/write-script
      • pigpen.core/show is now pigpen.viz/show (requires dependency [com.netflix.pigpen/pigpen-viz "..."])
      • pig/dump has changed. The old version was based on rx-java, and still exists as pigpen.rx/dump. The replacement for pigpen.core/dump is now entirely Clojure based. The Clojure version is better for unit tests and small data. All stages are evaluated eagerly, so the stack traces are simpler to read. The rx version is lazy, including the load-* commands. This means that you can load a large file, take a few rows, and process them without loading the entire file into memory. The downside is confusing stack traces and extra dependencies. See here for more details.
      • The interface for building custom loaders and storage has changed. See here for more details. Please email [email protected] with any questions.
  • 0.2.15 - 2/20/15

    • Include sources in jars
  • 0.2.14 - 2/18/15

    • Avro updates
  • 0.2.13 - 1/19/15

    • Added load-avro in the pigpen-avro project: http://avro.apache.org/
    • Fixed the nRepl configuration; use gradlew nRepl to start an nRepl
    • Exclude nested relations from closures
  • 0.2.12 - 12/16/14

    • Added load-csv, which allows for quoting per RFC 4180
  • 0.2.11 - 10/24/14

    • Fixed a bug (feature?) introduced by new rx version. Also upgraded to rc7. This would have only affected local mode where the data being read was faster than the code consuming it.
  • 0.2.10 - 9/21/14

    • Removed load-pig and store-pig. The pig data format is very bad and should not be used. If you used these and want them back, email [email protected] and we'll put it into a separate jar. The jars required for this feature were causing conflicts elsewhere.
    • Upgraded the following dependencies:
      • org.clojure/clojure 1.5.1 -> 1.6.0 - this was also changed to a provided dependency, so you should be able to use any version greater than 1.5.1
      • org.clojure/data.json 0.2.2 -> 0.2.5
      • com.taoensso/nippy 2.6.0-RC1 -> 2.6.3
      • clj-time 0.5.0 - no longer needed
      • joda-time 2.2 -> 2.4 - pig needs this to run locally
      • instaparse 1.2.14 - no longer needed
      • io.reactivex/rxjava 0.9.2 -> 1.0.0-rc.1
    • Fixed the rx limit bug. pigpen.local/*max-load-records* is no longer required.
  • 0.2.9 - 9/16/14

    • Fix a local-mode bug in pigpen.fold/avg where some collections would produce a NPE.
    • Change fake pig delimiter to \n instead of \0. Allows for \0 to exist in input data.
    • Remove 1000 record limit for local-mode. This was originally introduced to mitigate an rx bug. Until #61 is fixed, bind pigpen.local/*max-load-records* to the maximum number of records you want to read locally when reading large files. This now defaults to nil (no limit).
    • Fix a local dispatch bug that would prevent loading folders locally
  • 0.2.8 - 7/31/14

    • Fix a bug in load-tsv and load-lazy
  • 0.2.7 - 7/31/14 *** Don't use ***

  • 0.2.6 - 6/17/14

    • Minor optimization for local mode. The creation of a UDF was occurring for every value processed, causing it to run out of perm-gen space when processing large collections locally.
    • Fix (pig/return [])
    • Fix (pig/dump (pig/reduce + (pig/return [])))
    • Fix Longs in scripts that are larger than an Integer
    • Memoize local UDF instances per use of pig/dump
    • The jar location in the generated script is now configurable. Use the :pigpen-jar-location option with pig/generate-script or pig/write-script.
  • 0.2.5 - 4/9/14

    • Remove dump&show and dump&show+ in favor of pigpen.oven/bake. Call bake once and pass to as many outputs as you want. This is a breaking change, but I didn't increment the version because dump&show was just a tool to be used in the REPL. No scripts should break because of this change.
• Remove dump-async. It appeared to be broken and was a bad idea from the start.
    • Fix self-joins. This was a rare issue as a self join (with the same key) just duplicates data in a very expensive way.
    • Clean up functional tests
    • Fix pigpen.oven/clean. When it was pruning the graph, it was also removing REGISTER commands.
  • 0.2.4 - 4/2/14

    • Fix arity checking bug (affected varargs fns)
    • Fix cases where an Algebraic fold function was falling back to the Accumulator interface, which was not supported. This affected using cogroup with fold over multiple relations.
    • Fix debug mode (broken in 0.1.5)
    • Change UDF initialization to not rely on memoization (caused stale data in REPL)
    • Enable AOT. Improves cluster perf
    • Add :partition-by option to distinct
  • 0.2.3 - 3/27/14

    • Added load-json, store-json, load-string, store-string
    • Added filter-by, and remove-by
  • 0.2.2 - 3/25/14

    • Fixed bug in pigpen.fold/vec. This would also cause fold/map and fold/filter to not work when run in the cluster.
  • 0.2.1 - 3/24/14

    • Fixed bug when using for to generate scripts
    • Fixed local mode bug with map followed by reduce or fold
  • 0.2.0 - 3/3/14

    • Added pigpen.fold - Note: this includes a breaking change in the join and cogroup syntax as follows:
    ; before
    (pig/join (foo on :f)
              (bar on :b optional)
              (fn [f b] ...))
    
    ; after
    (pig/join [(foo :on :f)
               (bar :on :b :type :optional)]
              (fn [f b] ...))

    Each of the select clauses must now be wrapped in a vector - there is no longer a varargs overload to either of these forms. Within each of the select clauses, :on is now a keyword instead of a symbol, but a symbol will still work if used. If optional or required were used, they must be updated to :type :optional and :type :required, respectively.

  • 0.1.5 - 2/17/14

    • Performance improvements
      • Implemented Pig's Accumulator interface
      • Tuned nippy
      • Reduced number of times data is serialized
  • 0.1.4 - 1/31/14

    • Fix sort bug in local mode
  • 0.1.3 - 1/30/14

    • Change Pig & Hadoop to be transitive dependencies
    • Add support for consuming user code via closure
  • 0.1.2 - 1/3/14

    • Upgrade instaparse to 1.2.14
  • 0.1.1 - 1/3/14

    • Initial Release

pigpen's People

Contributors

bmabey, brunobonacci, daveray, fs111, jeremyrsellars, ljank, mapstrchakra, mbossenbroek, pkozikow, ptaoussanis, quidryan, randgalt, rspieldenner, sarnowski, technion, thejohnnybrown


pigpen's Issues

pig/fold inconsistent behaviour

I've noticed that pig/map in conjunction with a pig/fold behaves unexpectedly. The behaviour is also different from what we've seen running it on hadoop (latest AMI on EMR).

Here's a snippet explaining the problem; broken-fold-test returns [5] instead of [2]. This is because every input to the count-true reduce function is wrapped in a vector, so every x evaluates to truthy. If this is expected behaviour, what is wrong in this example?

(defn count-true []
  (let [combinef (fn
                   ([] 0)
                   ([a b] (+ a b)))
        reducef (fn [acc x] (if x (inc acc) acc))]
    (fold/fold-fn identity combinef reducef identity)))

(deftest working-fold-test
  (is (= (pig/dump (->> (pig/return [false false false true true])
                        (pig/fold (count-true))
                        ))
         [2])))

(deftest working-map-test
  (is (= (pig/dump (->> (pig/return [-2 -1 0 1 2])
                        (pig/map #(> % 0))
                        ))
         [false false false true true])))

(deftest broken-fold-test
  (is (= (pig/dump (->> (pig/return [-2 -1 0 1 2])
                        (pig/map #(> % 0))
                        (pig/fold (count-true))
                        ))
         [2]))) ; fails

Coerce ints to longs during serialization

Clojure's value equality states that integers and longs of the same value are equal, but because pigpen serializes them differently, they are treated as not equal in joins.

=> (= (int 42) (long 42))
true

We should convert all ints to longs before serializing, with an option to disable this behavior. Also, look into how floats/doubles are handled.
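A sketch of the proposed normalization (the function name `widen-ints` is hypothetical, not part of PigPen): walk a value before serialization and widen any Integer to a Long.

```clojure
(require '[clojure.walk :as walk])

;; Clojure's = already treats (int 42) and (long 42) as equal; the
;; problem is only that they serialize to different byte sequences.
(defn widen-ints
  "Recursively convert any Integer in x to a Long, so that serialized
  representations of equal values compare equal in joins."
  [x]
  (walk/postwalk
   (fn [v] (if (instance? Integer v) (long v) v))
   x))
```

A similar pass could be considered for floats vs. doubles, as the issue suggests.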

fold/vec should call clojure.core/vec

Hi,

I'm getting an arity exception on fold/vec when using some fold functions that use it (e.g. fold/first). It looks like vec should be calling (clojure.core/vec (concat l r)) rather than itself.

Caused by: clojure.lang.ArityException: Wrong number of args (1) passed to: fold$vec
        at clojure.lang.AFn.throwArity(AFn.java:437)
        at clojure.lang.AFn.invoke(AFn.java:39)
        at pigpen.fold$vec$fn__1521.invoke(fold.clj:170)
        at clojure.core$comp$fn__4154.invoke(core.clj:2332)
        at pigpen.fold$juxt$fn__1560$fn__1566.invoke(fold.clj:342)
        at clojure.core$map$fn__4214.invoke(core.clj:2498)
(defn vec
  []
  (fold-fn ^:seq (fn
                   ([] [])
                   ([l r] (vec (concat l r))))
           ^:seq (fn [acc val] (conj acc val))))
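A sketch of the suggested fix: inside pigpen.fold, a bare `vec` resolves to this very fold function, so the combine arity must call clojure.core/vec explicitly (shown here as `vec-fixed` to avoid the shadowing):

```clojure
(require '[pigpen.fold :as fold])

(defn vec-fixed
  []
  (fold/fold-fn ^:seq (fn
                        ([] [])
                        ;; call clojure.core/vec, not the fold function
                        ([l r] (clojure.core/vec (concat l r))))
                ^:seq (fn [acc val] (conj acc val))))
```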

Split pigpen.core

pigpen.core has both generic and pig specific functions. Split this such that each are in the appropriate jar.

Does PigPen support loading file in customized format?

Hello team,

I am trying to use PigPen in my work. But I have some troubles when I load input data. Sometimes my input data is not a text file. It may be sequence files or files in other customized format. Is there any solution for this?

Add support for a load-fn

Would like to have something like this:

(pig/load-fn (fn ))

Not sure how practical it is, but seems like a nice-to-have...

Mapping over many input files

Given partial input file names that are part of pairs with something like "x_Start.csv", "x_End.csv":

(def inputs ["abc" "def" "ghi" "jkl"])

Each of the files in the pairs have an ID column, and some ids only exist in one file. What I want to do is join on the ids of each pair, and concat the result of all of the joins to input to further processing.

My approach thus far has been something similar to this:

;; load xyz_Start.csv and xyz_End.csv and return a [start end] pair
;; of loaded data from each file
(defn load-by-xyz [xyz] ...) 
;; manipulate the [start end] pair and perform the join
(defn do-stuff [...]) 
(pig/mapcat #(-> (load-by-xyz %) do-stuff) inputs)

I've tried this multiple ways, but I always get:

java.lang.AssertionError: Assert failed: (map? relation)

If I run the load -> do-stuff portion by itself on a single pair it works, so my guess is that it has something to do with inputs not being data that pig can recognize.

Any suggestions as to how I can get something like this to work?

Thanks
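One possible answer, offered as a sketch rather than a verified fix: pig/mapcat operates on a relation's data at cluster run time, whereas `inputs` is a plain Clojure vector known at script-construction time. So iterate over it with ordinary clojure.core/map when building the script, and union the resulting relations with pig/concat (`load-by-xyz` and `do-stuff` are the functions from the question):

```clojure
(require '[pigpen.core :as pig])

(defn process-all [inputs]
  (->> inputs
       (map #(-> (load-by-xyz %) do-stuff)) ; ordinary clojure.core/map
       (apply pig/concat)))                 ; union all the relations
```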

Investigate a read-through storage operator

It would be nice to have a storage operator that would return the data as well. Something that would allow debugging of a script without breaking it into smaller pieces.

(->> (p/load-clj "inp.clj")
        (p/map inc)
        (p/store-and-return-clj "inc.clj")
        (p/map #(* % 2))
        (p/store-and-return-clj "double.clj")
        (p/map #(* % 3)))

Test PigPen on Tez

This is via Pig on Tez, not PigPen on Tez directly. Should work fine, just need to make sure nothing is broken.

clojure.lang.ExceptionInfo: :auto not supported on headerless data. {}

PigPen Version: 0.3.1-20150825.165235-4
java version: 1.7
hadoop version 2.4.0
java version: 1.7
pig versions: 0.12.0
AMI version:3.4.0
Hadoop distribution:Amazon 2.4.0

At the end of our map step we are seeing the following exception:

Caused by: clojure.lang.ExceptionInfo: :auto not supported on headerless data. {}
at clojure.core$ex_info.invoke(core.clj:4403)
at taoensso.nippy$get_auto_compressor.invoke(nippy.clj:558)
at taoensso.nippy$thaw$thaw_data__10665.invoke(nippy.clj:599)
at taoensso.nippy$thaw.doInvoke(nippy.clj:644)
at clojure.lang.RestFn.invoke(RestFn.java:410)
at pigpen.pig.runtime$fn__12071.invoke(runtime.clj:108)
at pigpen.runtime$fn__2746$G__2741__2751.invoke(runtime.clj:186)
at clojure.core$mapv$fn__6311.invoke(core.clj:6353)
at clojure.core.protocols$fn__6074.invoke(protocols.clj:79)
at clojure.core.protocols$fn__6031$G__6026__6044.invoke(protocols.clj:13)
at clojure.core$reduce.invoke(core.clj:6289)
at clojure.core$mapv.invoke(core.clj:6353)
at pigpen.pig.runtime$fn__12069.invoke(runtime.clj:111)
at pigpen.runtime$fn__2746$G__2741__2751.invoke(runtime.clj:186)
at clojure.core$map$fn__4245.invoke(core.clj:2559)
at clojure.lang.LazySeq.sval(LazySeq.java:40)
at clojure.lang.LazySeq.seq(LazySeq.java:49)
at clojure.lang.RT.seq(RT.java:484)
at clojure.core$seq.invoke(core.clj:133)
at clojure.core$apply.invoke(core.clj:624)
at clojure.core$mapcat.doInvoke(core.clj:2586)
at clojure.lang.RestFn.invoke(RestFn.java:423)
at pigpen.pig.runtime$fn__12067.invoke(runtime.clj:115)
at pigpen.runtime$fn__2746$G__2741__2751.invoke(runtime.clj:186)
at pigpen.pig.runtime$exec_initial.invoke(runtime.clj:295)
at pigpen.pig.runtime$udf_algebraic.invoke(runtime.clj:337)
at clojure.lang.Var.invoke(Var.java:388)
at pigpen.PigPenFnAlgebraic$Initial.exec(PigPenFnAlgebraic.java:104)
at pigpen.PigPenFnAlgebraic$Initial.exec(PigPenFnAlgebraic.java:88)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)

The map phase gets to 100% and all the inputs seem to be processed.

Could it be some compression library that's not installed?

Any ideas where to look?

Thanks
-Bob

Rank behaves differently in the cluster

From the pig docs:

If two or more tuples tie on the sorting field values, they will receive the same rank.

The local version currently does not do this. It should match the behavior of the cluster, or have an option that only the local mode can do.

Should locally executed load functions support compression?

I just lost some time hunting down why my locally executed load of "input.tsv.gz"

(load-tsv "input.tsv.gz")

was returning garbage values. Of course, I now realize my error: when run locally, load-tsv isn't instrumented to support compression, unlike when run in the Pig environment.

This feels like a pretty elementary mistake, and one that I won't make again, but it also feels like something that PigPen could have either supported or warned me about ("hey, you have a .gz file extension here and this isn't actually Pig!").
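This isn't an official PigPen feature, but a hedged workaround for local runs is to decompress the file yourself before handing it to load-tsv (the helper name `gunzip-to-temp` is made up for this sketch):

```clojure
(require '[clojure.java.io :as io])
(import '(java.util.zip GZIPInputStream))

(defn gunzip-to-temp
  "Decompress a .gz file to a temp file and return its path, so that a
  locally executed load-tsv sees plain text."
  [gz-path]
  (let [tmp (java.io.File/createTempFile "pigpen-local" ".tsv")]
    (with-open [in  (GZIPInputStream. (io/input-stream gz-path))
                out (io/output-stream tmp)]
      (io/copy in out))
    (.getAbsolutePath tmp)))

;; (load-tsv (gunzip-to-temp "input.tsv.gz"))
```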

Don't require join function to be anonymous

With PigPen 0.2.3, I was using join, but instead of specifying an anonymous function inline with the join call, I defnd a function and just used the name of the function. In other words, instead of:

(join [(xs :on first)
       (ys :on first)]
      (fn [x y] ...))

...I was doing:

(defn foo [x y] ...)
(join [(xs :on first)
       (ys :on first)]
      foo)

The second version produced output with the same structure as the first version, except that there were nils in most places. My guess is that the macros aren't quite evaluating things properly, but I don't know this for sure.

I can't share more details, unfortunately, as they are proprietary. Although, if you're having difficulty reproducing this issue, I can try to reproduce it in a way that I can share.

An issue which may be related is that a print statement in the function in version 1 works, but in version 2 it does not print anything.

Thanks very much!
-Jeff T.
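A hedged workaround, consistent with how PigPen works: the join macro serializes the quoted code of an inline fn, and a bare var reference may not serialize as expected. Wrapping the named function in an anonymous fn at the call site keeps the call visible to the macro (`xs` and `ys` are the relations from the question; `foo` is a stand-in for the real join function):

```clojure
(require '[pigpen.core :as pig])

(defn foo [x y] {:x x :y y})

(pig/join [(xs :on first)
           (ys :on first)]
          ;; wrap the named fn so the macro captures the call
          (fn [x y] (foo x y)))
```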

Use cascading-hadoop2-mr1 by default

Cascading supports multiple backends and we recommend the Hadoop 2 (YARN) platform for everyone these days. It would be great if pigpen-cascading could move to that. Let me know, if you want a PR for that or if something forces you to stay on Hadoop 1.x.

Tutorial error with Pig version 0.12.0-cdh5.4.2; 0.14 works.

Follow https://github.com/Netflix/PigPen/wiki/Tutorial
project.clj:

(defproject pigpen-demo "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [com.netflix.pigpen/pigpen-pig "0.3.0"]]
  :profiles {:dev {:dependencies [[org.apache.pig/pig "0.12.0"]
                                  [org.apache.hadoop/hadoop-core "1.1.2"]]}})

$ pig -i
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Apache Pig version 0.12.0-cdh5.4.2 (rexported)
compiled May 19 2015, 17:03:41

Pig Stack Trace

ERROR 2998: Unhandled internal error. null

java.lang.ExceptionInInitializerError
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:635)
at org.apache.pig.impl.PigContext.getClassForAlias(PigContext.java:769)
at org.apache.pig.parser.LogicalPlanBuilder.buildUDF(LogicalPlanBuilder.java:1491)
at org.apache.pig.parser.LogicalPlanGenerator.func_eval(LogicalPlanGenerator.java:9372)
at org.apache.pig.parser.LogicalPlanGenerator.projectable_expr(LogicalPlanGenerator.java:11051)
at org.apache.pig.parser.LogicalPlanGenerator.var_expr(LogicalPlanGenerator.java:10810)
at org.apache.pig.parser.LogicalPlanGenerator.expr(LogicalPlanGenerator.java:10159)
at org.apache.pig.parser.LogicalPlanGenerator.flatten_generated_item(LogicalPlanGenerator.java:7488)
at org.apache.pig.parser.LogicalPlanGenerator.generate_clause(LogicalPlanGenerator.java:17590)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_plan(LogicalPlanGenerator.java:15982)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_clause(LogicalPlanGenerator.java:15849)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1933)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1688)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1421)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:354)
at org.apache.pig.PigServer.executeBatch(PigServer.java:379)
at org.apache.pig.PigServer.executeBatch(PigServer.java:365)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:478)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.IllegalArgumentException: No matching method: compress, compiling:(taoensso/nippy/compression.clj:21:22)
at clojure.lang.Compiler.analyzeSeq(Compiler.java:6651)
at clojure.lang.Compiler.analyze(Compiler.java:6445)
at clojure.lang.Compiler.analyzeSeq(Compiler.java:6632)
at clojure.lang.Compiler.analyze(Compiler.java:6445)
at clojure.lang.Compiler.analyze(Compiler.java:6406)
at clojure.lang.Compiler$BodyExpr$Parser.parse(Compiler.java:5782)
at clojure.lang.Compiler$NewInstanceMethod.parse(Compiler.java:8008)
at clojure.lang.Compiler$NewInstanceExpr.build(Compiler.java:7544)
at clojure.lang.Compiler$NewInstanceExpr$DeftypeParser.parse(Compiler.java:7425)
at clojure.lang.Compiler.analyzeSeq(Compiler.java:6644)
at clojure.lang.Compiler.analyze(Compiler.java:6445)
at clojure.lang.Compiler.analyze(Compiler.java:6406)
at clojure.lang.Compiler$BodyExpr$Parser.parse(Compiler.java:5782)
at clojure.lang.Compiler$LetExpr$Parser.parse(Compiler.java:6100)
at clojure.lang.Compiler.analyzeSeq(Compiler.java:6644)
at clojure.lang.Compiler.analyze(Compiler.java:6445)
at clojure.lang.Compiler.analyze(Compiler.java:6406)
at clojure.lang.Compiler$BodyExpr$Parser.parse(Compiler.java:5782)
at clojure.lang.Compiler$FnMethod.parse(Compiler.java:5217)
at clojure.lang.Compiler$FnExpr.parse(Compiler.java:3846)
at clojure.lang.Compiler.analyzeSeq(Compiler.java:6642)
at clojure.lang.Compiler.analyze(Compiler.java:6445)
at clojure.lang.Compiler.eval(Compiler.java:6700)
at clojure.lang.Compiler.load(Compiler.java:7130)
at clojure.lang.RT.loadResourceScript(RT.java:370)
at clojure.lang.RT.loadResourceScript(RT.java:361)
at clojure.lang.RT.load(RT.java:440)
at clojure.lang.RT.load(RT.java:411)
at clojure.core$load$fn__5066.invoke(core.clj:5641)
at clojure.core$load.doInvoke(core.clj:5640)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$load_one.invoke(core.clj:5446)
at clojure.core$load_lib$fn__5015.invoke(core.clj:5486)
at clojure.core$load_lib.doInvoke(core.clj:5485)
at clojure.lang.RestFn.applyTo(RestFn.java:142)
at clojure.core$apply.invoke(core.clj:626)
at clojure.core$load_libs.doInvoke(core.clj:5528)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at clojure.core$apply.invoke(core.clj:626)
at clojure.core$require.doInvoke(core.clj:5607)
at clojure.lang.RestFn.invoke(RestFn.java:436)
at taoensso.nippy$eval7064$loading__4958__auto____7065.invoke(nippy.clj:1)
at taoensso.nippy$eval7064.invoke(nippy.clj:1)
at clojure.lang.Compiler.eval(Compiler.java:6703)
at clojure.lang.Compiler.eval(Compiler.java:6692)
at clojure.lang.Compiler.load(Compiler.java:7130)
at clojure.lang.RT.loadResourceScript(RT.java:370)
at clojure.lang.RT.loadResourceScript(RT.java:361)
at clojure.lang.RT.load(RT.java:440)
at clojure.lang.RT.load(RT.java:411)
at clojure.core$load$fn__5066.invoke(core.clj:5641)
at clojure.core$load.doInvoke(core.clj:5640)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$load_one.invoke(core.clj:5446)
at clojure.core$load_lib$fn__5015.invoke(core.clj:5486)
at clojure.core$load_lib.doInvoke(core.clj:5485)
at clojure.lang.RestFn.applyTo(RestFn.java:142)
at clojure.core$apply.invoke(core.clj:626)
at clojure.core$load_libs.doInvoke(core.clj:5524)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at clojure.core$apply.invoke(core.clj:626)
at clojure.core$require.doInvoke(core.clj:5607)
at clojure.lang.RestFn.invoke(RestFn.java:551)
at pigpen.pig.runtime$eval3$loading__4958__auto____4.invoke(runtime.clj:19)
at pigpen.pig.runtime$eval3.invoke(runtime.clj:19)
at clojure.lang.Compiler.eval(Compiler.java:6703)
at clojure.lang.Compiler.eval(Compiler.java:6692)
at clojure.lang.Compiler.load(Compiler.java:7130)
at clojure.lang.RT.loadResourceScript(RT.java:370)
at clojure.lang.RT.loadResourceScript(RT.java:361)
at clojure.lang.RT.load(RT.java:440)
at clojure.lang.RT.load(RT.java:411)
at clojure.core$load$fn__5066.invoke(core.clj:5641)
at clojure.core$load.doInvoke(core.clj:5640)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$load_one.invoke(core.clj:5446)
at clojure.core$load_lib$fn__5015.invoke(core.clj:5486)
at clojure.core$load_lib.doInvoke(core.clj:5485)
at clojure.lang.RestFn.applyTo(RestFn.java:142)
at clojure.core$apply.invoke(core.clj:626)
at clojure.core$load_libs.doInvoke(core.clj:5524)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at clojure.core$apply.invoke(core.clj:626)
at clojure.core$require.doInvoke(core.clj:5607)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.lang.Var.invoke(Var.java:379)
at pigpen.PigPenFn.(PigPenFn.java:45)
... 35 more
Caused by: java.lang.IllegalArgumentException: No matching method: compress
at clojure.lang.Compiler$StaticMethodExpr.(Compiler.java:1590)
at clojure.lang.Compiler$HostExpr$Parser.parse(Compiler.java:959)
at clojure.lang.Compiler.analyzeSeq(Compiler.java:6644)

... 121 more

Release

Would you mind cutting a release build with current master?
939c4c1

Optimize fold/top

The current implementation is to sort and take after every element is added. This is very wasteful.

Instead, we should leverage sorted-set, sorted-map, or just allow the collection to grow a little before shrinking.

cc @andrescorrada

Hadoop Versions lists hadoop-client twice in dependencies.

https://github.com/Netflix/PigPen/wiki/Hadoop-Versions:

git show 7c57e10                                                                                                                                    ⬆
Fri Sep 18 19:15:15 2015 -0500 7c57e10 (HEAD -> master) remove second instance of hadoop-client  [Randall Mason]
diff --git a/Hadoop Versions.md b/Hadoop Versions.md
index ade229c..89404c0 100644
--- a/Hadoop Versions.md
+++ b/Hadoop Versions.md
@@ -13,8 +13,7 @@ PigPen can also be used with Cloudera distributions of Hadoop. All you need to d
                  [com.netflix.pigpen/pigpen "0.3.0"]]
   :profiles {:dev {:dependencies
                    [[org.apache.hadoop/hadoop-client "2.0.0-mr1-cdh4.4.0"]
-                    [org.apache.pig/pig "0.11.0-cdh4.4.0"]
-                    [org.apache.hadoop/hadoop-client "2.0.0-mr1-cdh4.4.0"]]}})
+                    [org.apache.pig/pig "0.11.0-cdh4.4.0"]]}})
 ```

 This will prevent jar problems when the uberjar is submitted to the cluster.

CUBE/ROLLUP in PigPen

This project just blew my mind! I'm super excited to find it, but.. Are there any plans to introduce CUBE/ROLLUP functions? Are these on a roadmap?

I'd be happy to contribute, just that I'm not sure yet whether it won't contradict any future plans (i.e. migration to other backend; or this will be separate project?)

Thank you!

Weird error when used with prismatic plumbing

When I try to use Prismatic's plumbing library, I run into a weird error: java.lang.ClassNotFoundException: pig_graph_test.graph-record3363. graph-recordxxxx is a macro-generated record exposed by the library, but my attempts to get PigPen to play nice with it have all failed.

I suspect it has something to do with the way that code/trap-* operates. I've included a little test repository that you can run with lein run to reproduce; it includes the full stack trace.

https://github.com/aconbere/pigpen-graph-test

raw/load$ should omit the 'AS' clause if fields are empty

The use case is loaders that supply their own schema, such as Parquet. In Pig, any of these work:

RAW_DATA = LOAD 'parquet.gz/' USING parquet.pig.ParquetLoader();
--or
RAW_DATA = LOAD 'parquet.gz/' USING parquet.pig.ParquetLoader('contentHost:chararray');
--or
RAW_DATA = LOAD 'parquet.gz/' USING parquet.pig.ParquetLoader('contentHost:chararray') AS (contentHost:chararray);
--
DESCRIBE RAW_DATA; -- will work properly with any since we have metadata

but passing an empty vector of fields

(raw/load$
  location
  '[]  ;; these are the fields this loader returns
  storage
  opts)
(raw/bind$
...

generates this:

load20 = LOAD '/path/to/data/'
    USING MyComplexStorage('name', 'address', 'phone')
    AS ();

Since the loader handles the schema, there's no need for the AS clause.

EMR Docs

Add docs for how to run end-to-end tests on EMR with S3.

Strange behavior of count distinct

Hi folks, we're getting some unexpected results here:

(def records [:a :a :b :b :c])

(->> (pig/return records)
     (pig/fold (fold/count (fold/distinct)))
     (pig/dump))

This code returns [5]. Removing the count gives [#{:a :b :c}], no surprises. I'd expect the count to return the cardinality of the set returned by distinct, but unfortunately something else is happening.

Any ideas?
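For reference, the semantics the reporter expects, sketched in plain Clojure outside PigPen (`count-distinct` is a hypothetical helper, not part of the pigpen.fold API):

```clojure
;; Expected count-distinct semantics: accumulate values into a set,
;; then return the cardinality of that set.
(defn count-distinct [xs]
  (count (reduce conj #{} xs)))

(count-distinct [:a :a :b :b :c])
;; => 3
```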

Upgrade to newer rx version

The version of RxJava used today has a broken unsubscribe mechanism, which causes load-* operators to continue loading on background threads indefinitely. Update to a newer RxJava version and fix that issue.

pig fails on case-insensitive file system due to old Instaparse version

I get an error when running pig with a pigpen uberjar generated on a case-insensitive file system (OS X in this case).

java.lang.NoClassDefFoundError: instaparse/print$parser__GT_str (wrong name: instaparse/print$Parser__GT_str)

It looks like this is because versions of Instaparse before 1.2.0 define fns called parser->str and Parser->str:
https://github.com/Engelberg/instaparse/blob/v1.0.1/src/instaparse/print.clj

As reported here:
https://groups.google.com/forum/#!topic/clojure/4Dv2MfDybz8

Libraries/Functions in closures

As stated in the docs:

PigPen supports a number of different types of closures. (...) Compiled functions and mutable structures like atoms won't work.

We have times in millis in our data and would like to format them as YYYY-MM-DD, but that's impossible for the aforementioned reasons :( Is there any workaround to make functions work? Otherwise this statement looks far-fetched:

There are no special user defined functions (UDFs). Define Clojure functions, anonymously or named, and use them like you would in any Clojure program.

Thank you!
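One possible workaround, sketched here rather than official PigPen guidance: construct the non-serializable object inside the fn body, so the closure itself captures only data. `millis->date-str` is a hypothetical helper name.

```clojure
;; Hypothetical helper: the SimpleDateFormat is created inside the fn,
;; so the closure captures nothing compiled or mutable that would need
;; to be serialized. (Creating the formatter per call is safe but slow;
;; SimpleDateFormat is not thread-safe, so don't hoist it to a def.)
(defn millis->date-str [millis]
  (.format (java.text.SimpleDateFormat. "yyyy-MM-dd")
           (java.util.Date. (long millis))))

;; In a PigPen query this would be used as:
;; (pig/map millis->date-str data)
```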

Custom loaders can't use functions defined elsewhere

The example on the wiki shows this:

(defn square [x]
  (* x x))

(defn baz [data]
  (->> data
    (pig/map square)))

I'm trying to use a function like square above from inside a custom loader (raw/load$), but that results in an "Unable to resolve symbol" exception. Defining the function in a separate namespace and requiring it doesn't work either.

I'd guess this could be solved the same way as for operations like pig/map in the example.

problem with load-tsv function

Hi,

I am trying to run the following function on a TSV file with more than 100k lines on my laptop.
The function looks like this:

(defn- hashed-data [file-name]
  (->> (pig/load-tsv file-name)
       (pig/map (fn [[& args]]
                  [args]))))

Instead of getting the same number of lines as in the input file, I always get only 1000 items.
Am I missing something obvious?

Upgrade to Clojure 1.6.0

I tried upgrading some libraries, but ran into problems with projects that declare a 1.5.1 dependency.

Tried this:

-    compile 'org.clojure:clojure:1.5.1'
+    provided 'org.clojure:clojure:1.6.0'
     compile 'org.clojure:data.json:0.2.2'
     compile 'org.clojure:core.async:0.1.267.0-0d7780-alpha'
     compile 'clj-time:clj-time:0.5.0'
-    compile 'instaparse:instaparse:1.2.14'
-    compile 'com.taoensso:nippy:2.6.0-RC1'
+    compile 'instaparse:instaparse:1.3.4'
+    compile 'com.taoensso:nippy:2.6.3'
     compile 'rhizome:rhizome:0.1.9'
     compile 'com.netflix.rxjava:rxjava-core:0.9.2'
     compile 'com.netflix.rxjava:rxjava-clojure:0.9.2'

     provided 'org.apache.pig:pig:0.11.1'
     provided 'org.apache.hadoop:hadoop-core:1.1.2'

Got this:

=> (require '[pigpen.core :as pig])
ClassCastException clojure.lang.Var$Unbound cannot be cast to clojure.lang.IFn$LLL  instaparse.auto-flatten-seq.AutoFlattenSeq (auto_flatten_seq.clj:123)
nil
ExceptionInInitializerError 
    java.lang.Class.forName0 (Class.java:-2)
    java.lang.Class.forName (Class.java:270)
    clojure.lang.RT.loadClassForName (RT.java:2098)
    clojure.lang.RT.load (RT.java:430)
    clojure.lang.RT.load (RT.java:411)
    clojure.core/load/fn--5018 (core.clj:5530)
    clojure.core/load (core.clj:5529)
    clojure.core/load-one (core.clj:5336)
    clojure.core/load-lib/fn--4967 (core.clj:5375)
    clojure.core/load-lib (core.clj:5374)
    clojure.core/apply (core.clj:619)
    clojure.core/load-libs (core.clj:5413)
Caused by:
ClassCastException clojure.lang.Var$Unbound cannot be cast to clojure.lang.IFn$LLL
    instaparse.auto-flatten-seq.AutoFlattenSeq (auto_flatten_seq.clj:123)
    instaparse.gll/CatListener/fn--3975 (gll.clj:394)
    instaparse.gll/push-message/f--3905 (gll.clj:173)
    instaparse.gll/step (gll.clj:328)
    instaparse.gll/run (gll.clj:344)
    instaparse.gll/run (gll.clj:332)
    instaparse.gll/parse (gll.clj:758)
    instaparse.cfg/ebnf (cfg.clj:273)
    instaparse.abnf__init.load (:189)
    instaparse.abnf__init.<clinit> (:-1)
    java.lang.Class.forName0 (Class.java:-2)
    java.lang.Class.forName (Class.java:270)
