problem with load-tsv function (pigpen issue #57, closed)

netflix commented on August 31, 2024
problem with load-tsv function

from pigpen.

Comments (6)

mapstrchakra commented on August 31, 2024

You need to change the binding that controls how many records are returned. By
default, a load returns only 1000 items in the REPL:

(def ^:dynamic *max-load-records* 1000)

You can wrap your function body in a *max-load-records* binding as follows:

(defn- hashed-data [file-name]
  (binding [pigpen.local/*max-load-records* 100000]
    (->>
      (pig/load-tsv file-name)
      (pig/map (fn [[& args]]
                 [args])))))

On Sunday, September 7, 2014, Michael Rubanov wrote:

Hi,

I am trying to run the following function on a TSV file with more than 100k
lines on my laptop.
The function looks like this:

(defn- hashed-data [file-name]
  (->>
    (pig/load-tsv file-name)
    (pig/map (fn [[& args]]
               [args]))))

Instead of getting the same number of lines as in the input file, I always
get only 1000 items.
Am I missing something obvious?



mbossenbroek commented on August 31, 2024

Yeah, I put that cap in there because the version of rx I'm using doesn't unsubscribe properly from the observable. It's kind of a hacky fix, but this prevents you from processing potentially large files just to throw the result away. In general, the REPL should only be used for vetting your code & then you'd run at scale on the cluster, but 100k should be well within the limits of what it can handle locally.

At Netflix, we sample large GB files over the network directly into pigpen - without this limit it was just continuing to download the file on a background thread & slowing down the REPL. This was painful when I just wanted the first 10 records.

The longer term fix is to upgrade the version of rx that I'm using, but they tend to break their API frequently so I've been waiting for v1.0 to be released.

-Matt


micrub commented on August 31, 2024

Thanks for the clarification, though the following wrapping didn't solve the issue in the REPL. I'm still getting 1000 items back:

(binding [pigpen.local/*max-load-records* 100000])


mbossenbroek commented on August 31, 2024

Your example shows the closing paren right after the binding vector, so the binding wraps nothing. To use binding, you need to enclose the code that requires the rebinding:

(binding [pigpen.local/*max-load-records* 100000]
  (->> (pig/load-tsv …)
       (pig/dump)))

In this case, you'd want to make sure that the code calling pig/dump is what gets wrapped, not the load command. The load command just builds an expression tree; the records are only read when pig/dump executes it.

(def x (pig/load …))

(binding [pigpen.local/*max-load-records* 100000]
  (pig/dump x))
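
The scoping rule described here can be reproduced without PigPen. The following is a minimal sketch in plain Clojure: *max-records*, load-plan, and run-plan are hypothetical stand-ins invented for illustration, not PigPen functions. It only shows that a dynamic var is read by whatever code runs inside the binding form, so wrapping the step that builds a plan changes nothing.

```clojure
;; Hypothetical stand-in for pigpen.local/*max-load-records*; not PigPen code.
(def ^:dynamic *max-records* 1000)

;; "Building" a query only returns a description of the work (a plain map here).
(defn load-plan [rows]
  {:op :load, :rows rows})

;; "Running" the plan is the only place where the dynamic var is read.
(defn run-plan [plan]
  (doall (take *max-records* (:rows plan))))

(def plan (load-plan (range 5000)))

;; A binding form with an empty body rebinds the var for no code at all.
(binding [*max-records* 100000])
(count (run-plan plan))            ; => 1000

;; Wrapping only the construction step has no effect either.
(def plan-2
  (binding [*max-records* 100000]
    (load-plan (range 5000))))
(count (run-plan plan-2))          ; => 1000

;; Wrapping the execution step is what changes the result.
(binding [*max-records* 100000]
  (count (run-plan plan)))         ; => 5000
```

The third form corresponds to wrapping pig/dump in the reply above; the first two correspond to the empty binding and to wrapping only the load.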

Let me know if that works for you.

-Matt


mbossenbroek commented on August 31, 2024

After thinking about this some more, I'm going to change the default to be unlimited and add this as an option to limit it only if you need it.
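
One way an opt-in limit of this kind could look is sketched below. This is purely an illustration in plain Clojure and makes no claim about how the change was actually implemented in PigPen: *record-limit* and read-records are hypothetical names, and the convention that nil means "unlimited" is an assumption.

```clojure
;; Hypothetical option; nil is assumed to mean "no limit". Not PigPen's API.
(def ^:dynamic *record-limit* nil)

(defn read-records
  "Returns all of rows, or only the first *record-limit* rows when a limit is bound."
  [rows]
  (if *record-limit*
    (doall (take *record-limit* rows))
    (doall rows)))

;; Unlimited by default.
(count (read-records (range 5000)))      ; => 5000

;; Limited only when the caller asks for it.
(binding [*record-limit* 10]
  (count (read-records (range 5000))))   ; => 10
```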


mbossenbroek commented on August 31, 2024

Fixed entirely in #61. This binding is no longer necessary.
