-- A SOLUTION FOR AN ECDF PROBLEM --

Runs on Ruby 1.9.2. Code at github.com/enthal/ecdf_ruby_demo

Try this to generate random/Gaussian data and process it:

ruby ./driver.rb

Or this with input files:

ruby ./driver.rb file1.csv file2.csv

Read on for the full problem and solution, including parallelization!

PROBLEM¶ ↑

Problem Statement¶ ↑

NOTE: This problem statement is copied verbatim from: https://squareup.com/jobs/oD5LVfw8

Suppose you have multiple large files (on the order of 100GBs) containing tuples of the following form:

{user_id, payment_id, payment_amount, is_card_present, created_at}.

Write a program to compute the empirical cumulative distribution function of the card present ratio for users who processed less than $100, and for users who processed over $100.

The expected output should be of the form:

Users who processed less than $100
percentile    % cp
1             0
2             0
3             5
...           ...
100           100

and similarly for users who processed over $100.

Problem Discussion¶ ↑

The scope of this challenge has two essential facets: parallelization (data “on the order of 100GBs”) and the ECDF computation itself.

Since the two required total-payment-bucketed ECDFs serve to assign users to percentiles, the data must first be aggregated per-user, counting their payments and summing their individual payment_amount and is_card_present count. Then the aggregate users may be partitioned into buckets (at $100 total payment_amount threshold) and from there into percentile by ratio of is_card_present count to total payment count. Thus, there is ample opportunity to parallelize the (expected) bulk of the work: aggregation by user.

The payment_id and created_at tuple fields are notable; a discussion follows in the SOLUTION section.

SOLUTION¶ ↑

I offer a solution written in ruby (1.9.2). The code is written to be understood, and serves as the most thorough description of the algorithms involved.

Following from my discussion of the problem (see above), the solution separates aggregation by user from ECDF computation, allowing parallel processing for aggregation. See “Parallelizing operation”, below, for more details.

The aggregation phase runs in O(payment_count) time and O(user_count) space for the given process input (a subset of the full data set when parallelized). The finish phase runs in O(intermediate_input_line_count) time and O(user_count) space across the entire dataset. As such, this solution is memory-bounded as a linear function of user count, i.e., if you have many many users, you’ll exhaust memory and be sad, but if you have enough memory for your user count, there is no bound on number of payments.

Unit tests, included, helped me get each small part right. I chose not to cover the higher-level operation, preferring manual testing in this case. In a work environment, I might create unit or functional tests with canned data and known correct output, but I lacked that with this challenge problem. Instead, this solution includes a data generator that produces payment data with per-user is_card_present probability occurring in a Gaussian distribution, such that the output ECDFs under correct operation occur as a recognizable S-curve (especially when graphed). In real work, tests could (I imagine) be constructed to mathematically show output ECDF fits input Gaussian data, by, e.g., standard deviation. These could be maintained as part of a functional test suite.

The code is likely most approachable reading these files (and perhaps their associated specs (tests)) in order:

lib/cardp.rb - Bare functions to partition users and calculate the ECDFs given per-user aggregate data.
lib/card_pres.rb - Classes that aggregate data per-user
- class CardPres - Aggregate incoming payment tuples for an individual user; line parsing and formatting for raw input and intermediate files
- class CardPres::Aggregator - Collect incoming tuples across all users as CardPres objects hashed by user_id
driver.rb - Invoked from command line, performs all or phased actions to calculate and present ECDFs using logic from the files described above.

The other .rb files provide supporting functionality.

FWIW, a complete procedural/functional implementation of the core algorithm was done at git tag “procedural” (commit 8b09b6c) in file cardp.rb. I deemed it potentially hard for others to understand and opted to bring in more OO, which I leave in the delivered solution.

Also of note: the presence in the input tuples of the payment_id and created_at fields suggest that unique payment (by id) may occur multiple times in the data. That might occur if a payment was changed or corrected, or perhaps when resending a payment. In that case the last instance (by created_at) for a payment_id would supersede all others. The stated problem does not address this issue. I chose in my solution to disregard this possibility and ignore those fields, assuming that each payment line in the data is to be included in the results. While I made this decision for simplicity, I note that significantly more processing would be required to “unique” the payments (aggregating by payment_id) and this may (may) ultimately not yield significant difference in the results (if updated or “corrected” payments are relatively few). Naturally I’m curious what the problem’s author had in mind.

Alternatives¶ ↑

For real work, I might choose a solution in Pig (on Hadoop) instead, since it provides a general solution for them work of the program not specific to the problem domain. This would likely require a UDF to do the ECDF from ordered card_present ratios. Cascading (also on Hadoop), perhaps with cascading.jruby, would make another fine choice.

I wanted to work the problem through in bare Ruby and see what I could learn. In practice, it would be at least worth writing (or attempting to write) a Pig or Cascading solution for comparison before investing further in this solution. By contrast, I imagine having worked through it in ruby ahead of such other approach would bring out important subtleties of work.

It is worth noting the the solution program as implemented could likely be simply modified to run distributed with Hadoop. Given the “demo” “challenge” nature of the problem, I opted for a simple file-based parallelization approach.

Running the Solution Program¶ ↑

- Simple operation¶ ↑

Invoked without options or filenames, the program generates random data (see below), processes it (without parallelization – but see below for that) into an ECDF for each spending bucket, and prints the result in the required (see above) format.

ruby ./driver.rb

Input files may be specified as command line arguments. All other options are given in name=value form.

This processes data in files 1.csv and 2.csv (“all” happens to be the default action):

ruby ./driver.rb action=all 1.csv 2.csv

- Raw input data format¶ ↑

The raw input data format follows the problem statement: a text file of (newline-terminated) lines, each of 5 comma-separated fields. In order, the fields are:

user_id : integer or string
(ignored)
payment_amount : decimal (float) number
is_card_present : 0 or 1 (1 denotes card present)
(ignored)

- Generating random data¶ ↑

When run without any specified filenames, the program generates random data for a given (defaulted, see below) number of users and payments. Each user is assigned, in a Gaussian distribution with a given (defaulted) standard deviation, a random probability of having card present on any given payment. Each payment amount is scaled so that the total payment amount for each user across all generated payments is roughly as likely to be in either payment bucket, regardless of the number of users or payments.

Thus, the ECDF (for each payment bucket) calculated from the generated data occurs as a recognizable curve (graph=true helps), helping to validate the whole of the program as correct.

Instead of processing the data, the program will dump it to a file given action=dump. An output filename may be specified with outfile=filename; default: outfile=raw.csv. E.g.,

ruby ./driver.rb action=dump outfile=/tmp/generated.csv

Produces a file whose first lines might be

2659, _, 11.0, 1, _
2257, _, 7.0, 0, _
785, _, 2.0, 1, _
2829, _, 7.0, 0, _

This could in turn be processed using:

ruby ./driver.rb /tmp/generated.csv

The amount and “shape” of data generated bear 3 parameters:

n_users default:3000; The number of users, each with own random (Gaussian) card present probability
n_payments default:50000; Total number of payments spread across all n_users
stddev default:0.15; Standard deviation of card present probabilities across all users

E.g.,

ruby ./driver.rb action=dump outfile=dump.csv n_users=30000 n_payments=500000 stddev=0.25

- Parallelizing operation¶ ↑

Input data “on the order of 100GBs” calls for parallel processing. The program works in two phases, which may be run independently to allow parallelization of the first phase. That is, raw data may be aggregated in parallel and independently from the final computation and output of payment-bucketed ECDFs. This may be performed as follows.

First, run the aggregation step in parallel across separate raw input files with action=aggregate. An outfile may be named; default: outfile=aggregate.csv. Then collect the produced files onto a single machine and there run the final computation with action=finish, naming all the intermediate files produced in the aggregate step. E.g.,

ruby ./driver.rb data.1.csv  action=aggregate outfile=ag.1.csv
ruby ./driver.rb data.2.csv  action=aggregate outfile=ag.2.csv
ruby ./driver.rb ag.[12].csv action=finish

Since the parallelizable step is per-user aggregation, the work is sped up only as a function of average payment count per user. So in the degenerate case of 1 payment per unique user, there would be no speedup.

If raw input data is present on (and ideally initially logged to) separate machines, e.g., AWS EC2 nodes, aggregation may be run on each node holding part of the data, with output files sent to a single node for finishing computation. This follows Google’s original model for MapReduce where processing was moved to data, and is distinct from the approach taken by, say, AWS EMR, where data must be moved to processing ad hoc. The tradeoff, of course, is the speed of the former approach, with no need to transfer the raw data to the processing, against the convenience of elasticity afforded by the latter approach: there’s no need to dedicate resources for hosting data and processing ahead of time. An EMR-style approach could still be used with this program as implemented.

Note that this program needs no unique association of user with aggregation process or file, so incoming payment transaction log lines may be arbitrarily spread among machines with, e.g., naïve load balancing. However, greater efficiency (at least as smaller intermediate files) could be achieved with user-to-aggregation-node affinity.

FWIW, multiple intermediate output files (from action=aggregate) may be re-aggregated (“re-reduced”) with action=reaggregate. This may be useful when running several aggregate processes in parallel on each aggregate-node before final transmission to the finish-node, to maximally reduce file size and transmission time.

Finally, even if data is all on one machine, it can be faster to parallelize the aggregate step. For example, informal experimentation on my dual-core laptop (with n_users=30000 n_payments=500000) yields ~16 second aggregation of the whole dataset (and available CPU), and ~10 second aggregation with 2 parallel aggregations (with bound CPU). The finish step for this dataset on this machine takes ~2 seconds. This preliminarily suggests 1 aggregation process per core up to disk IO limits.

Try it yourself:

ruby ./driver.rb quantiles=50 action=dump outfile=dump.csv n_users=30000 n_payments=500000 stddev=0.25
head -250000 dump.csv > dump.1.csv
tail -250000 dump.csv > dump.2.csv
(
  time ruby ./driver.rb dump.1.csv action=aggregate outfile=ag.1.csv )& (
  time ruby ./driver.rb dump.2.csv action=aggregate outfile=ag.2.csv )&
# Wait till both finish...
time ruby ./driver.rb action=finish ag.[12].csv

- Special output options¶ ↑

Two output options offer output that may be more interesting or convenient for testing:

graph=true - Rudimentary graph of ECDFs, as “ascii-art”
quantiles=number - default:100; Group users into more or fewer ECDF intervals than the default 100. Useful at small values while experimenting with, e.g., the Gaussian parameters, or parallelization timings.

Thanks for reading. If you found the experience worthwhile, let’s talk!

enthal / ecdf_ruby_demo Goto Github PK

ecdf_ruby_demo's Introduction

PROBLEM¶ ↑

Problem Statement¶ ↑

Problem Discussion¶ ↑

SOLUTION¶ ↑

Alternatives¶ ↑

Running the Solution Program¶ ↑

- Simple operation¶ ↑

- Raw input data format¶ ↑

- Generating random data¶ ↑

- Parallelizing operation¶ ↑

- Special output options¶ ↑

ecdf_ruby_demo's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent