
Spree


Spree is a live-updating web UI for Spark built with Meteor and React.

Screencast of a Spark job running and UI updating along with it

Left: Spree pages showing all jobs and stages, updating in real-time; right: a spark-shell running a simple job; see the Screencast gallery in this repo for more examples.

Features!

Spree is a complete rewrite of Spark's web UI, providing several notable benefits…

Real-time Updating

All data on all pages updates in real-time, thanks to Meteor magic.

Persistence, Scalability

Spree offers a unified interface to past- and currently-running Spark applications, combining functionality that is currently spread across Spark's web UI and "history server".

It persists all information about Spark applications to MongoDB, allowing for archival storage that is easily query-able and solves various Spark-history-server issues, e.g. slow load-times, caching problems, etc.

Pagination and sorting are delegated to Mongo, which gracefully handles arbitrarily large stages, RDDs, etc. This makes for a cleaner scalability story than Spark's current use of textual event-log files and in-memory maps on the driver as ad-hoc databases.

Usability

Spree includes several usability improvements, including:

Toggle-able Columns

All tables allow easy customization of displayed columns:

Collapsible Tables

Additionally, whole tables can be collapsed/uncollapsed for easy access to content that would otherwise be "below the fold":

Persistent Preferences/State

Finally, all client-side state is stored in cookies for persistence across refreshes / sessions, including:

  • sort-column and direction,
  • table collapsed/uncollapsed status,
  • table columns' shown/hidden status,
  • pages' displaying one table with "all" records vs. separate tables for "running", "succeeded", "failed" records, etc.

Extensibility, Modularity

Spree is easy to fork/customize without worrying about changing everyone's Spark UI experience, managing custom Spark builds with bespoke UI changes, etc.

It also includes two useful standalone modules for exporting/persisting data from Spark applications:

  • The json-relay module broadcasts all Spark events over a network socket.
  • The slim module aggregates stats about running Spark jobs and persists them to indexed Mongo collections.

These offer potentially-useful alternatives to Spark's EventLoggingListener and event-log files, respectively (Spark's extant tools for exporting and persisting historical data about past and current Spark applications).

Usage

Spree has three components, each in its own subdirectory:

  • ui: a web-app that displays the contents of a Mongo database populated with information about running Spark applications.
  • slim: a Node server that receives events about running Spark applications, aggregates statistics about them, and writes them to Mongo for Spree's ui above to read/display.
  • json-relay: a SparkListener that serializes SparkListenerEvents to JSON and sends them to a listening slim process.

The latter two are linked in this repo as git submodules, so you'll want to have cloned with git clone --recursive (or run git submodule update --init) in order for them to be present.

Following are instructions for configuring/running them:

Start Spree

First, run a Spree app using Meteor:

git clone --recursive https://github.com/hammerlab/spree.git
cd spree/ui   # the Spree Meteor app lives in ui/ in this repo.
meteor        # run it

You can now see your (presumably empty) Spree dashboard at http://localhost:3000:

If you don't have meteor installed, see "Installing Meteor" below.

Start slim

Next, install and run slim:

npm install -g slim.js
slim

If you have an older, unsupported version of npm installed, the above command may fail with error messages containing failed to fetch from registry. If so, upgrade node and npm and try again.

slim is a Node server that receives events from JsonRelay and writes them to the Mongo instance that Spree is watching.

By default, slim listens for events on localhost:8123 and writes to a Mongo at localhost:3001, which is the default Mongo URL for a Spree started as above.

Run Spark with JsonRelay

If using Spark ≥ 1.5.0, simply pass the following flags to spark-{shell,submit}:

--packages org.hammerlab:spark-json-relay:2.0.0
--conf spark.extraListeners=org.apache.spark.JsonRelay
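
For example, a complete spark-shell invocation with JsonRelay enabled might look like the following (assuming spark-shell is on your PATH):

```shell
# Spark >= 1.5.0: fetch JsonRelay from Maven Central and register it as a listener
spark-shell \
  --packages org.hammerlab:spark-json-relay:2.0.0 \
  --conf spark.extraListeners=org.apache.spark.JsonRelay
```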

Otherwise, download a JsonRelay JAR:

wget https://repo1.maven.org/maven2/org/hammerlab/spark-json-relay/2.0.0/spark-json-relay-2.0.0.jar

…then tell Spark to send events to it by passing the following arguments to spark-{shell,submit}:

# Include JsonRelay on the driver's classpath
--driver-class-path /path/to/spark-json-relay-2.0.0.jar
  
# Register your JsonRelay as a SparkListener
--conf spark.extraListeners=org.apache.spark.JsonRelay
  
# Point it at your `slim` instance; default: localhost:8123
--conf spark.slim.host=…
--conf spark.slim.port=…
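
Putting those pieces together, a pre-1.5.0 invocation might look like this (the JAR path is illustrative, and localhost:8123 is slim's default, so the last two flags can be omitted if you haven't changed it):

```shell
# spark-shell with JsonRelay on the driver classpath, sending events to slim
spark-shell \
  --driver-class-path /path/to/spark-json-relay-2.0.0.jar \
  --conf spark.extraListeners=org.apache.spark.JsonRelay \
  --conf spark.slim.host=localhost \
  --conf spark.slim.port=8123
```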

Comparison to Spark UI

Below is a journey through Spark JIRAs past, present, and future, comparing the current state of Spree with Spark's web UI.

~Fixed JIRAs

I believe the following are resolved or worked around by Spree:

Missing Functionality

Functionality known to be present in the existing Spark web UI / history server and missing from Spree:

Future Nice-to-haves

A motley collection of open Spark-UI JIRAs that might be well-suited for fixing in Spree:

  • SPARK-1622: expose input splits
  • SPARK-1832: better use of warning colors
  • SPARK-2533: summary stats about locality-levels
  • SPARK-3682: call out anomalous/concerning/spiking stats, e.g. heavy spilling.
  • SPARK-3957: distinguish/separate RDD- vs. non-RDD-storage.
  • SPARK-4072: better support for streaming blocks.
  • Control spark application / driver from Spree:
  • SPARK-4906: unpersist applications in slim that haven't been heard from in a while.
  • SPARK-7729: display executors' killed/active status.
  • SPARK-8469: page-able viz?
  • Various duration-confusion clarification/bug-fixing:
    • SPARK-8950: "scheduler delay time"-calculation bug
    • SPARK-8778: "scheduler delay" mismatch between event timeline, task list.
  • SPARK-4800: preview/sample RDD elements.

Notes / Implementation Details / FAQ

ECONNREFUSED / MongoError

If you see errors like this when starting slim:

/usr/local/lib/node_modules/slim.js/node_modules/mongodb/lib/server.js:228
        process.nextTick(function() { throw err; })
                                      ^
AssertionError: null == { [MongoError: connect ECONNREFUSED 127.0.0.1:3001]
  name: 'MongoError',
  message: 'connect ECONNREFUSED 127.0.0.1:3001' }

it's likely because you need to start Spree first (by running meteor from the ui subdirectory of this repo).

slim expects to connect to a MongoDB that Spree starts (at localhost:3001 by default).

BYO Mongo

Meteor (hence Spree) spins up its own Mongo instance by default, typically at port 3001.

For a variety of reasons, you may want to point Spree and Slim at a different Mongo instance. The handy ui/start script makes this easy:

$ ui/start -h <mongo host> -p <mongo port> -d <mongo db> --port <meteor port>
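
For instance, to run Spree against a standalone Mongo on its default port, with the Meteor app on port 3000 (the host, database name, and ports here are illustrative):

```shell
ui/start -h localhost -p 27017 -d spree --port 3000
```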

Either way, Meteor will print out the URL of the Mongo instance it's using when it starts up, and display it in the top right of all pages, e.g.:

Screenshot of Spree nav-bar showing Mongo-instance URL

Important: for Spree to update in real-time, your Mongo instance needs to have a "replica set" initialized, per this Meteor forum thread.

Meteor's default Mongo instance comes with this already set up, but otherwise you'll need to configure it yourself. It should be as simple as:

  • adding the --replSet=rs0 flag to your mongod command (where rs0 is a dummy name for the replica set), and
  • running rs.initiate() from a mongo shell connected to that mongod server.
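
The two steps above might look like this from a shell (rs0 is an arbitrary replica-set name, and the dbpath is whatever your mongod normally uses):

```shell
# 1. Start mongod with a replica set configured
mongod --replSet=rs0 --dbpath /path/to/db

# 2. In another shell, initialize the (single-member) replica set
mongo --eval 'rs.initiate()'
```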

Now your Spark jobs will write events to the Mongo instance of your choosing, and Spree will display them to you in real-time!

Installing Meteor

Meteor can be installed, per their docs, by running:

curl https://install.meteor.com/ | sh

Installing Spree and Slim sans sudo

Meteor

By default, Meteor will install itself in ~/.meteor and attempt to put an additional helper script at /usr/local/bin/meteor.

It's ok to skip the latter if/when it prompts you for your root password by ^Cing out of the script.

Slim

npm install -g slim.js may require superuser privileges; if this is a problem, you can either:

  • Install locally with npm, e.g. in your home directory:
    cd ~
    npm install slim.js
    cd ~/node_modules/slim.js
    ./slim
    
  • Run slim from the sources in this repository:
    cd slim  # from the root of this repository; make sure you `git clone --recursive`
    npm install
    ./slim
    

More Screencasts

See the screencast gallery in this repo for more GIFs showing Spree in action!

Spark Version Compatibility

Spree has been tested pretty heavily against Spark 1.4.1. It's been tested less heavily on earlier versions, but should Just Work™ back to Spark 1.3.0, which introduced the spark.extraListeners conf option that JsonRelay uses to register itself with the driver.

Contributing, Reporting Issues

Please file issues if you have any trouble using Spree or its sub-components or have any questions!

See slim's documentation for info about ways to report issues with it.

Contributors

markgrover, ryan-williams

Issues

Add Spear as a submodule

Benefits:

  1. Granular versioning against Spear changes.
  2. Helps for streamlined deployment:
    • start long-running spruit server,
    • build spear jar from within this repo,
    • start spark with built Spear as listener

Avoid plumbing table pagination opts through iron router

Currently, interacting with table pagination "reloads" the page (albeit only client-side, thanks to Iron Router magic), including jumping to the top of the page, which is pretty awkward.

A cleaner solution with better UX would involve tables managing their own pagination opts and data subscriptions.

Per-hostname page

Similar to #41: show executors on a given host, and all stats about what's on them.

Link to this page from hostnames.

Stage page: display number of task indexes with each number of failures

slim already writes stage-level stats about how many tasks have failed 1 attempt, 2 attempts, etc.

This is useful information when there are some failures on a stage; is there one bad task that has failed 3 times and is about to fail the stage (given default cutoff of 4 task failures)?

Paginate all tables (cap at 100 rows?)

I think that this is both a feature (giant 1000-row html tables are not particularly convenient to work with) and a possibly-necessary workaround to problems I've hit implementing live-updating StagePage for #2 (namely that a naive implementation is quite slow on a 1000-row table).

If there is a quick fix (e.g. an unnecessary O(n^2) operation I'm doing that can be removed) that would be great, but either way I'll still want to be able to paginate tables, as Spark Web UI JIRAs have long complained about performance/usability on >1000-task stages.

Make tables collapsible

It's really tedious to e.g. scroll past the per-executor table on the stage page to get to the per-task table.

Label skipped stages accordingly

e.g.:

vs:

Unclear whether I should require that the DB annotate the stages this way or whether I should just figure it out on the frontend…

Add sane secondary sorts to various columns

Most columns would do well to have a default tiebreaker secondary-sort column of the first column in the table, for instance.

The plumbing for secondary sort keys is already done; just need to configure columns to utilize it.

"Running" counts can go negative, be out of date when data is dropped

First of all: why are events being dropped? In the instance above, the normal Spark web UI is in a pretty inconsistent state as well:

This application finished successfully hours ago, and all other Spark listeners (e.g. the event log) also seem to have missed most TaskStart/TaskEnd events for the last several jobs, as well as others, like the JobEnd for jobs 16 and 17.

In any case, maybe I should more forcibly prevent "running" counts from going below 0? Kind of hard to say until I figure out:

  • why the data was dropped
  • whether the same data was dropped for all listeners (my read/write metrics don't match Spark web UI's for some of the stages where the data loss occurred)

I think the action items here are to:

  • run on some more big jobs and see if the data loss occurs again
  • I have an email out to Imran Rashid about the possibility of such data loss
  • I could refactor my "Counts" sub-records to store maps of indexes that have been {started} x {finished}, which would really give me a high-fidelity view into what's happening in situations like these; I should file that against Spear.

Add per-executor page

Could just show the executor-aggregated tables from stage and RDD pages, as well as executor-level statistics.

Scale to large Spark workloads.

The code currently at HEAD is roughly feature-complete but lags for modestly-sized workloads. Here is a brain dump about spruit and friends, recent work and roadmap:

I've rewritten the SparkEvent -> Mongo component:

  • Previously: Spear
    • Scala SparkListener that receives events and updates records in Mongo reflecting the state of various Spark nouns (jobs, stages, tasks, executors, RDDs).
    • [Blocking Mongo queries] in [Spark driver's "listener bus" thread] were too slow, ruining everything under load.
    • foursquare uses Rogue in a way that supports async writing with Twitter Futures, but I haven't come across an example publicly released anywhere.
      • My low-priority thread to follow up with them is currently asleep.
    • In general, I should get that work out of the Spark driver.
    • Type-safety for my Mongo queries was proving more cumbersome than beneficial given shaky tooling foundations.
  • Hence, now:
    • "json-relay-spark-listener"
      • Simple SparkListener in Scala.
      • Serializes Spark events to JSON, sends them out via RPC.
      • Missed the seb-inspired-name boat on this one, will change it if/when inspiration strikes.
    • "Slim"
      • NodeJS server, listens to the json-relay-spark-listener.
      • Computes stats, writes them to Mongo.
      • Mongo queries are async by default, thx node.

So an event's journey is now:

Spark -> JSON RPC -> Mongo -> Meteor Server -> Meteor Client

  • I want to push bursts of 10,000s of events/s down this pipeline.
    • Motivating use case: a 1000-task stage running from start to finish in a few seconds will generate ~2000 events, mostly TaskStarts and TaskEnds.
    • Each task {starting,finishing} means I want to update something (usually some denormalized count or other) on corresponding application, job, stage, stage attempt, task, task attempt, executor, and (optionally) RDD records.
    • That's up to 8 mongo writes per task-end event, for easily 10k writes/s to Mongo.
  • I think at least one phase of the pipeline currently bottlenecks well shy of that, based on small tests.
  • I'm not sure I should expect single-instance {Mongo, Meteor} to handle kqps's no matter what I do.
    • Can I use fewer Mongo queries to get the info I need through the pipe, even during 1000 events/s bursts?
    • To do so, I'd need to stop writing Task records. Instead:
      1. I'd add them to a large tasks array-field on each Stage record.
        • This sounds a little crazy at first:
          • This array could have 1000s of elements.
          • Ideally I'd support 100k's of tasks/stage; the Spark UI performance has been evaluated in the past on Spark users' 100k-task-per-stage workflows.
        • I'd need to trust that the pipeline can push [a Stage object with 1000 tasks added/modified] through with reasonable latency.
          • A hopeful voice in my head says this is actually more aligned with what Mongo is designed for, as a "document-storage" engine.
          • 4sqs oft lamented that we used Mongo with many small records at our peril.
      2. I'd add per-{object type, object ID} rate-limiting to slim, so that I would receive many updates to e.g. a Stage mega-record (e.g. many tasks started or finished), but only write snapshots of that Stage to Mongo N times/s.
        • I have reasonable places to add such functionality in a Mongo "upsert" helper that I already hand-rolled.
        • First thought: I'd want another thread reading from a queue of updated objects, which I don't think I can do in node.
        • Second thought: I think I can (ab)use setTimeout per-object to coordinate only allowing each object to be written below a certain frequency.
  • Otherwise, I could maybe imagine Mongo handling such qps, and I'd have to think about how to batch things in Meteor.
    • This seems harder / less promising.

So my next steps are:

  1. Do some light profiling of the pipeline and see if I can identify whether it's obvious that some parts are performant enough, others not.
  2. Assume that I must live in a world without per-Task records, and build out [the mega-records + rate-limiting scheme described above] in slim.

Allow for long-running Meteor app / switching database on the fly

Currently there's a nontrivial amount of configuration that goes into getting this info from Spark → Spear → Mongo → Browser:

  1. Telling Spark which Mongo to write to.
  2. Pointing Spruit at that Mongo.

There's also the bit about setting up the replica-set on the Mongo to make Spruit's (Meteor's) oplog-tailing work.

The easy/best solution would probably involve using a Mongo that Meteor manages, so that we can (by default) do away with the MONGO_URL/MONGO_OPLOG_URL config, and just start up a long-running Meteor/Spruit process by typing meteor, then point SparkListeners at that Mongo.

This will necessitate the ability to switch the database within the Mongo server that Meteor is reading from; I heard that I might be able to do that by managing Meteor.subscribe/Meteor.publish calls; need to look into that.
