digideskio / emr_spark_automation

This project forked from kcrandall/emr_spark_automation

A repository for deploying an AWS EMR cluster and submitting Spark jobs to it. Bootstrapping includes pysparkling by default, so one can easily use h2o with Python and Spark.

License: Apache License 2.0

Python 68.86% CSS 8.36% Shell 22.78%

emr_spark_automation's Introduction

Spark Kaggle Starter

Summary: This code takes much of Patrick's code and upgrades it with Spark and Pysparkling functionality. Also included is an EMR automation tool for launching clusters and running code as well as a logging tool for logging plots and code from your cluster's environment.

spark_main.py: This file runs a data prep and training example using pysparkling. If you would like to run it with a local installation of Spark and pysparkling, either remove all the lines that use logging or make sure the LoggingController can access an S3 bucket from your local environment.

emr_controler.py: This file helps with spinning up an EC2 cluster, zipping up code, and submitting it to Spark for execution. See the README in the spark_controler directory.
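
The zip-and-submit step described above can be sketched roughly as follows. This is a minimal illustration and not the code in emr_controler.py: the helper names zip_source_dir and build_spark_submit_command are hypothetical, and it only assumes that the packaged code is passed to Spark through spark-submit's --py-files option.

```python
import zipfile
from pathlib import Path


def zip_source_dir(source_dir, zip_path):
    """Zip every .py file under source_dir, keeping paths relative to it.

    Hypothetical helper for illustration; not part of emr_controler.py.
    """
    source_dir = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*.py")):
            # Store "pkg/util.py" rather than the absolute path on disk.
            archive.write(path, path.relative_to(source_dir).as_posix())
    return zip_path


def build_spark_submit_command(main_script, zip_path):
    """Return the argument list for a spark-submit call that ships the zip.

    --py-files puts the zipped modules on the Python path of the job.
    """
    return ["spark-submit", "--py-files", str(zip_path), str(main_script)]
```

The returned list could be handed to subprocess.run on a machine where spark-submit is on the PATH. How the tool itself ships the archive to the cluster is described in the spark_controler README.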

LoggingController.py: This class will log files and plots. See README in logging_lib directory.

MarkdownBuilder.py: This class takes logs and turns them into a clean markdown file. See README in logging_lib directory.

Using the EMR automation tool and logging tool: To use these tools you will need to download the AWS command line interface (aws cli), run aws configure, and give it credentials that can access S3 and EMR. My suggestion is to make a user with administrator permissions to avoid confusion over policies and roles (create a group with that permission, then a user in that group, and download the user's credentials).

To install the aws cli: on Windows, download the .msi installer from AWS (it takes only a moment). On macOS, either go through the terminal installation commands or install Homebrew and type brew install awscli.

After the aws cli is installed, run this in a terminal:

aws configure

Type in the access key and secret key of your IAM user.

For region type us-east-1 (or another region if you want to use it and know what you're doing).

Leave the last field (default output format) blank (just hit enter past it).

Done. You have set up the 'default' profile.
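
To confirm that the 'default' profile was written, one option is to read the files that aws configure produces. The sketch below assumes the usual locations and layout (a [default] section with aws_access_key_id and aws_secret_access_key in ~/.aws/credentials, and region in ~/.aws/config). The function name read_default_profile is hypothetical and not part of this repository.

```python
import configparser


def read_default_profile(credentials_path, config_path):
    """Summarize the 'default' profile without exposing the secret key.

    Hypothetical helper for illustration; it only parses the INI-style
    files written by `aws configure`.
    """
    credentials = configparser.ConfigParser()
    credentials.read(credentials_path)
    config = configparser.ConfigParser()
    config.read(config_path)

    if not credentials.has_section("default"):
        raise ValueError("no [default] profile in " + str(credentials_path))

    return {
        "access_key_id": credentials.get("default", "aws_access_key_id"),
        # Report only whether a secret is present, never its value.
        "has_secret_key": credentials.has_option("default", "aws_secret_access_key"),
        # Region is stored in the config file; None if it was left blank.
        "region": config.get("default", "region", fallback=None),
    }
```

For a standard setup this could be called with pathlib.Path.home() / ".aws" / "credentials" and pathlib.Path.home() / ".aws" / "config". If you followed the steps above, the region should come back as us-east-1.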

Contributors: kcrandall