This project is a fork of airscholar/emr-for-data-engineers.

Home Page: https://youtu.be/ZFns7fvBCH4


AWS EMR Data Processing for Data Engineers

Description

This project demonstrates the use of Amazon Elastic Map Reduce (EMR) for processing large datasets using Apache Spark. It includes a Spark script for ETL (Extract, Transform, Load) operations, AWS command line instructions for setting up and managing the EMR cluster, and a dataset for testing and demonstration purposes.

System Architecture

(System architecture diagram: see Architecture.png in the repository.)

Project Structure

  • spark-etl.py: The main Spark script used for ETL operations.
  • commands.py: Scripts for AWS EMR cluster setup and management.
  • data/: Directory containing the dataset used in the ETL process.

Spark Script

spark-etl.py is a Python script that uses Apache Spark to perform ETL operations. It reads data from an input directory, adds a timestamp to each record, and writes the result to an output directory in Parquet format.
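The core of such a script might look like the following sketch. The CSV input format, the column name ingest_ts, and the function name run_etl are assumptions for illustration; the actual spark-etl.py in the repository may differ.

```python
# Hypothetical sketch of the core of spark-etl.py; the real script may
# read a different input format or name the timestamp column differently.
import sys

def run_etl(spark, input_path, output_path):
    """Read raw data, stamp each row with the load time, write Parquet."""
    # Imported lazily so this sketch can be inspected without Spark installed.
    from pyspark.sql import functions as F

    df = spark.read.csv(input_path, header=True, inferSchema=True)  # assumed CSV input
    df = df.withColumn("ingest_ts", F.current_timestamp())          # add a timestamp column
    df.write.mode("overwrite").parquet(output_path)                 # Parquet output

# On the cluster, this would be driven by something like:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("spark-etl").getOrCreate()
#   run_etl(spark, sys.argv[1], sys.argv[2])
```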

Usage

To run the script, use the following command:

spark-submit spark-etl.py [s3-input-folder] [s3-output-folder]

Replace [s3-input-folder] with the S3 path to the input data directory and [s3-output-folder] with the S3 path where the output should be written.

AWS Commands

The commands.py file contains detailed instructions and the commands needed to set up and manage an AWS EMR cluster. This includes steps for creating an EMR cluster, configuring the necessary services, and submitting Spark jobs.
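As an illustration of what such a cluster-creation step involves, here is a minimal sketch of an EMR cluster request assembled as a dictionary in the shape expected by boto3's run_job_flow call. The release label, instance types, bucket name, and role names below are assumptions for illustration, not values taken from the repository.

```python
# Hypothetical EMR cluster request in the shape expected by boto3's
# emr client run_job_flow(); all concrete values here are assumptions.
def emr_cluster_request(name="spark-etl-cluster", log_uri="s3://my-logs-bucket/emr/"):
    return {
        "Name": name,
        "LogUri": log_uri,                       # assumed S3 location for cluster logs
        "ReleaseLabel": "emr-6.15.0",            # assumed EMR release
        "Applications": [{"Name": "Spark"}],     # install Spark on the cluster
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,  # keep cluster up after steps finish
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",    # default EMR instance profile
        "ServiceRole": "EMR_DefaultRole",        # default EMR service role
    }

# Submitting the request would look like:
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   response = emr.run_job_flow(**emr_cluster_request())
```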

Data

The data/ directory contains the dataset used for the ETL process. This dataset is a sample that represents the type of data the Spark script is designed to process.

Requirements

  • Apache Spark
  • AWS CLI
  • An AWS account with necessary permissions to create and manage EMR clusters

Watch the Video Tutorial

For a complete walkthrough and practical demonstration, watch the EMR Masterclass video: https://youtu.be/ZFns7fvBCH4

Contributors

  • airscholar
