Coder Social home page Coder Social logo

ecdf_ruby_demo's Introduction

-- A SOLUTION FOR AN ECDF PROBLEM --

Copyright © 2011 Timothy James; All rights reserved.

Runs on Ruby 1.9.2. Code at github.com/enthal/ecdf_ruby_demo

Try this to generate random/Gaussian data and process it:

ruby ./driver.rb

Or this with input files:

ruby ./driver.rb file1.csv file2.csv

Read on for the full problem and solution, including parallelization!


PROBLEM

Problem Statement

NOTE: This problem statement is copied verbatim from: https://squareup.com/jobs/oD5LVfw8

Suppose you have multiple large files (on the order of 100GBs) containing tuples of the following form:

{user_id, payment_id, payment_amount, is_card_present, created_at}.

Write a program to compute the empirical cumulative distribution function of the card present ratio for users who processed less than $100, and for users who processed over $100.

The expected output should be of the form:

Users who processed less than $100
percentile    % cp
1             0
2             0
3             5
...           ...
100           100

and similarly for users who processed over $100.

Problem Discussion

The scope of this challenge has two essential facets: parallelization (data “on the order of 100GBs”) and the ECDF computation itself.

Since the two required total-payment-bucketed ECDFs serve to assign users to percentiles, the data must first be aggregated per-user, counting their payments and summing their individual payment_amount and is_card_present count. Then the aggregate users may be partitioned into buckets (at $100 total payment_amount threshold) and from there into percentile by ratio of is_card_present count to total payment count. Thus, there is ample opportunity to parallelize the (expected) bulk of the work: aggregation by user.

The payment_id and created_at tuple fields are notable; a discussion follows in the SOLUTION section.

SOLUTION

I offer a solution written in ruby (1.9.2). The code is written to be understood, and serves as the most thorough description of the algorithms involved.

Following from my discussion of the problem (see above), the solution separates aggregation by user from ECDF computation, allowing parallel processing for aggregation. See “Parallelizing operation”, below, for more details.

The aggregation phase runs in O(payment_count) time and O(user_count) space for the given process input (a subset of the full data set when parallelized). The finish phase runs in O(intermediate_input_line_count) time and O(user_count) space across the entire dataset. As such, this solution is memory-bounded as a linear function of user count, i.e., if you have many many users, you’ll exhaust memory and be sad, but if you have enough memory for your user count, there is no bound on number of payments.

Unit tests, included, helped me get each small part right. I chose not to cover the higher-level operation, preferring manual testing in this case. In a work environment, I might create unit or functional tests with canned data and known correct output, but I lacked that with this challenge problem. Instead, this solution includes a data generator that produces payment data with per-user is_card_present probability occurring in a Gaussian distribution, such that the output ECDFs under correct operation occur as a recognizable S-curve (especially when graphed). In real work, tests could (I imagine) be constructed to mathematically show output ECDF fits input Gaussian data, by, e.g., standard deviation. These could be maintained as part of a functional test suite.

The code is likely most approachable reading these files (and perhaps their associated specs (tests)) in order:

  • lib/cardp.rb - Bare functions to partition users and calculate the ECDFs given per-user aggregate data.

  • lib/card_pres.rb - Classes that aggregate data per-user

    • class CardPres - Aggregate incoming payment tuples for an individual user; line parsing and formatting for raw input and intermediate files

    • class CardPres::Aggregator - Collect incoming tuples across all users as CardPres objects hashed by user_id

  • driver.rb - Invoked from command line, performs all or phased actions to calculate and present ECDFs using logic from the files described above.

The other .rb files provide supporting functionality.

FWIW, a complete procedural/functional implementation of the core algorithm was done at git tag “procedural” (commit 8b09b6c) in file cardp.rb. I deemed it potentially hard for others to understand and opted to bring in more OO, which I leave in the delivered solution.

Also of note: the presence in the input tuples of the payment_id and created_at fields suggest that unique payment (by id) may occur multiple times in the data. That might occur if a payment was changed or corrected, or perhaps when resending a payment. In that case the last instance (by created_at) for a payment_id would supersede all others. The stated problem does not address this issue. I chose in my solution to disregard this possibility and ignore those fields, assuming that each payment line in the data is to be included in the results. While I made this decision for simplicity, I note that significantly more processing would be required to “unique” the payments (aggregating by payment_id) and this may (may) ultimately not yield significant difference in the results (if updated or “corrected” payments are relatively few). Naturally I’m curious what the problem’s author had in mind.

Alternatives

For real work, I might choose a solution in Pig (on Hadoop) instead, since it provides a general solution for them work of the program not specific to the problem domain. This would likely require a UDF to do the ECDF from ordered card_present ratios. Cascading (also on Hadoop), perhaps with cascading.jruby, would make another fine choice.

I wanted to work the problem through in bare Ruby and see what I could learn. In practice, it would be at least worth writing (or attempting to write) a Pig or Cascading solution for comparison before investing further in this solution. By contrast, I imagine having worked through it in ruby ahead of such other approach would bring out important subtleties of work.

It is worth noting the the solution program as implemented could likely be simply modified to run distributed with Hadoop. Given the “demo” “challenge” nature of the problem, I opted for a simple file-based parallelization approach.

Running the Solution Program

- Simple operation

Invoked without options or filenames, the program generates random data (see below), processes it (without parallelization – but see below for that) into an ECDF for each spending bucket, and prints the result in the required (see above) format.

ruby ./driver.rb

Input files may be specified as command line arguments. All other options are given in name=value form.

This processes data in files 1.csv and 2.csv (“all” happens to be the default action):

ruby ./driver.rb action=all 1.csv 2.csv

- Raw input data format

The raw input data format follows the problem statement: a text file of (newline-terminated) lines, each of 5 comma-separated fields. In order, the fields are:

  • user_id : integer or string

  • (ignored)

  • payment_amount : decimal (float) number

  • is_card_present : 0 or 1 (1 denotes card present)

  • (ignored)

- Generating random data

When run without any specified filenames, the program generates random data for a given (defaulted, see below) number of users and payments. Each user is assigned, in a Gaussian distribution with a given (defaulted) standard deviation, a random probability of having card present on any given payment. Each payment amount is scaled so that the total payment amount for each user across all generated payments is roughly as likely to be in either payment bucket, regardless of the number of users or payments.

Thus, the ECDF (for each payment bucket) calculated from the generated data occurs as a recognizable curve (graph=true helps), helping to validate the whole of the program as correct.

Instead of processing the data, the program will dump it to a file given action=dump. An output filename may be specified with outfile=filename; default: outfile=raw.csv. E.g.,

ruby ./driver.rb action=dump outfile=/tmp/generated.csv

Produces a file whose first lines might be

2659, _, 11.0, 1, _
2257, _, 7.0, 0, _
785, _, 2.0, 1, _
2829, _, 7.0, 0, _

This could in turn be processed using:

ruby ./driver.rb /tmp/generated.csv

The amount and “shape” of data generated bear 3 parameters:

  • n_users default:3000; The number of users, each with own random (Gaussian) card present probability

  • n_payments default:50000; Total number of payments spread across all n_users

  • stddev default:0.15; Standard deviation of card present probabilities across all users

E.g.,

ruby ./driver.rb action=dump outfile=dump.csv n_users=30000 n_payments=500000 stddev=0.25

- Parallelizing operation

Input data “on the order of 100GBs” calls for parallel processing. The program works in two phases, which may be run independently to allow parallelization of the first phase. That is, raw data may be aggregated in parallel and independently from the final computation and output of payment-bucketed ECDFs. This may be performed as follows.

First, run the aggregation step in parallel across separate raw input files with action=aggregate. An outfile may be named; default: outfile=aggregate.csv. Then collect the produced files onto a single machine and there run the final computation with action=finish, naming all the intermediate files produced in the aggregate step. E.g.,

ruby ./driver.rb data.1.csv  action=aggregate outfile=ag.1.csv
ruby ./driver.rb data.2.csv  action=aggregate outfile=ag.2.csv
ruby ./driver.rb ag.[12].csv action=finish

Since the parallelizable step is per-user aggregation, the work is sped up only as a function of average payment count per user. So in the degenerate case of 1 payment per unique user, there would be no speedup.

If raw input data is present on (and ideally initially logged to) separate machines, e.g., AWS EC2 nodes, aggregation may be run on each node holding part of the data, with output files sent to a single node for finishing computation. This follows Google’s original model for MapReduce where processing was moved to data, and is distinct from the approach taken by, say, AWS EMR, where data must be moved to processing ad hoc. The tradeoff, of course, is the speed of the former approach, with no need to transfer the raw data to the processing, against the convenience of elasticity afforded by the latter approach: there’s no need to dedicate resources for hosting data and processing ahead of time. An EMR-style approach could still be used with this program as implemented.

Note that this program needs no unique association of user with aggregation process or file, so incoming payment transaction log lines may be arbitrarily spread among machines with, e.g., naïve load balancing. However, greater efficiency (at least as smaller intermediate files) could be achieved with user-to-aggregation-node affinity.

FWIW, multiple intermediate output files (from action=aggregate) may be re-aggregated (“re-reduced”) with action=reaggregate. This may be useful when running several aggregate processes in parallel on each aggregate-node before final transmission to the finish-node, to maximally reduce file size and transmission time.

Finally, even if data is all on one machine, it can be faster to parallelize the aggregate step. For example, informal experimentation on my dual-core laptop (with n_users=30000 n_payments=500000) yields ~16 second aggregation of the whole dataset (and available CPU), and ~10 second aggregation with 2 parallel aggregations (with bound CPU). The finish step for this dataset on this machine takes ~2 seconds. This preliminarily suggests 1 aggregation process per core up to disk IO limits.

Try it yourself:

ruby ./driver.rb quantiles=50 action=dump outfile=dump.csv n_users=30000 n_payments=500000 stddev=0.25
head -250000 dump.csv > dump.1.csv
tail -250000 dump.csv > dump.2.csv
(
  time ruby ./driver.rb dump.1.csv action=aggregate outfile=ag.1.csv )& (
  time ruby ./driver.rb dump.2.csv action=aggregate outfile=ag.2.csv )&
# Wait till both finish...
time ruby ./driver.rb action=finish ag.[12].csv

- Special output options

Two output options offer output that may be more interesting or convenient for testing:

  • graph=true - Rudimentary graph of ECDFs, as “ascii-art”

  • quantiles=number - default:100; Group users into more or fewer ECDF intervals than the default 100. Useful at small values while experimenting with, e.g., the Gaussian parameters, or parallelization timings.


Thanks for reading. If you found the experience worthwhile, let’s talk!

ecdf_ruby_demo's People

Contributors

enthal avatar

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.