This project was forked from asthesearises/introtopyspark.

Quick and easy setup of Amazon EMR (Elastic Map Reduce) with PySpark - using persistent Jupyter Notebook including walk-through of basic ETL scenario.


Introduction To PySpark on Amazon EMR

The purpose of these code snippets and scripts is to make it easier to launch PySpark on Amazon EMR, while using a Jupyter Notebook style approach to learning the basics of Spark. Following this approach allows you to keep a permanent, persistent notebook saved in Amazon S3 for use with transient EMR clusters.

The persistent notebook configuration is based on https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-s3.html

NOTE - There are costs associated with launching AWS resources. Refer to the EMR (Elastic MapReduce) documentation for pricing information:

https://aws.amazon.com/emr/pricing/

Alternatively, refer to the AWS Simple Monthly Calculator to estimate costs:

https://calculator.s3.amazonaws.com/index.html

Estimated Cost

A 3-node cluster of m4.large instances (2 vCPU, 8 GB RAM each) running in the London region costs roughly $13.59 per hour.

Prerequisites

  1. An AWS account (sign up at https://aws.amazon.com/)

Steps Required

The following steps launch an EMR cluster with PySpark and JupyterHub installed, along with a pre-created Jupyter notebook to get you started.

Step 1 - Create bucket that will contain your saved notebooks

a) Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.


b) In the Bucket name field, type a unique DNS-compliant name for your new bucket. (The example screenshot uses the bucket name admin-created. You cannot reuse this name, because S3 bucket names must be unique.) Create your own bucket name using the following naming guidelines:

The name must be unique across all existing bucket names in Amazon S3.

After you create the bucket you cannot change the name, so choose wisely.

Choose a bucket name that reflects the objects in the bucket because the bucket name is visible in the URL that points to the objects that you're going to put in your bucket.

For information about naming buckets, see Rules for Bucket Naming in the Amazon Simple Storage Service Developer Guide.

For Region, choose the region where you want the bucket to reside (the AWS walkthrough uses US West (Oregon)).

c) Choose Create.


Copy the notebook into this bucket.
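If you prefer the command line, the bucket can be created and the notebook copied with the AWS CLI instead. This is a sketch only; the bucket name `my-pyspark-notebooks` and the notebook filename are placeholders you should replace with your own values:

```shell
# Create a uniquely named S3 bucket in the London region (eu-west-2).
# Bucket names are global across all of S3, so pick your own.
aws s3 mb s3://my-pyspark-notebooks --region eu-west-2

# Copy the starter notebook from this repository into the bucket.
# (Replace the filename with the actual notebook in your checkout.)
aws s3 cp IntroToPySpark.ipynb s3://my-pyspark-notebooks/
```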

Step 2 - Create bucket that will contain the shell script used during EMR launch

Copy 'findspark.sh' into this bucket.
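Again, this can be done from the AWS CLI. The bucket name below is a placeholder; `findspark.sh` is the script shipped in this repository:

```shell
# Create a separate bucket for the EMR bootstrap script (choose your own name).
aws s3 mb s3://my-emr-bootstrap-scripts --region eu-west-2

# Copy the findspark.sh shell script from this repository into the bucket.
aws s3 cp findspark.sh s3://my-emr-bootstrap-scripts/
```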

Step 3 - Run EMR create cluster from CLI

You can launch the EMR cluster from the AWS CLI, either on your own machine or using AWS Cloud9, a cloud-based IDE that comes with the AWS CLI pre-installed.

The CLI command can be found at IntroToPySpark/CLI_Launch.
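The exact command lives in IntroToPySpark/CLI_Launch; the sketch below only shows the general shape of such a launch, with placeholder bucket names, key pair, and release label. The `jupyter-s3-conf` classification is what enables persistent notebooks in S3, per the AWS documentation linked above:

```shell
# Launch a 3-node EMR cluster with Spark and JupyterHub, a findspark.sh
# bootstrap action, and JupyterHub notebook persistence to S3.
# Replace the key pair, bucket names, and release label with your own.
aws emr create-cluster \
  --name "IntroToPySpark" \
  --release-label emr-5.36.0 \
  --applications Name=Spark Name=JupyterHub \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --region eu-west-2 \
  --bootstrap-actions Path=s3://my-emr-bootstrap-scripts/findspark.sh \
  --configurations '[{"Classification":"jupyter-s3-conf","Properties":{"s3.persistence.enabled":"true","s3.persistence.bucket":"my-pyspark-notebooks"}}]'
```

Remember to terminate the cluster when you are finished to avoid ongoing charges; because the notebooks persist in S3, a new transient cluster can pick them up later.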
