
This workshop builds a serverless data lake architecture using Amazon Kinesis Data Firehose for streaming data ingestion, AWS Glue for data integration (ETL and catalog management), Amazon S3 for data lake storage, and Amazon Athena for SQL big data analytics.


AWS-Serverless-Data-Lake

To demonstrate the power of data lake architectures, in this workshop I ingested streaming data from the Kinesis Data Generator (KDG) into Amazon S3, then created a big data processing pipeline, without servers or clusters, that is ready to process huge amounts of data. The dataset is an open dataset from the AWS Open Data Registry called GDELT; it is over 170 GB in size and comprises thousands of uncompressed CSV files. I also created an AWS Glue transform job to perform basic transformations on the Amazon S3 source data. Finally, I queried the larger public dataset, with more tables, using Amazon Athena to observe the various AWS services working together.

  1. Create a CloudFormation stack by uploading the template file (serverlessDataLakeDay.json)
  2. Create a Kinesis Data Firehose delivery stream to ingest data into your data lake


  3. Install the Kinesis Data Generator tool (KDG)

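The KDG sends records rendered from a user-defined template to the delivery stream. As a rough local sketch of what a rendered record looks like (the field names here are illustrative, not the workshop's actual template):

```python
import json
import random

# Hypothetical KDG-style template: field names are illustrative only.
def generate_record():
    """Build one fake sensor-style event, similar to what the
    Kinesis Data Generator would render from its template."""
    return {
        "sensorId": random.randint(1, 100),
        "temperature": round(random.uniform(10.0, 35.0), 2),
        "status": random.choice(["OK", "WARN", "FAIL"]),
    }

record = generate_record()
print(json.dumps(record))
```

In the real tool you paste a template into the KDG web UI and it sends the rendered JSON records to the Firehose delivery stream at the rate you choose.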

Screenshot: monitoring for the Firehose delivery stream.

Amazon Kinesis Data Firehose writes the data to Amazon S3.
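By default, Firehose partitions the objects it delivers to S3 under a UTC `YYYY/MM/dd/HH/` key prefix. A minimal sketch of how that prefix is derived from an arrival timestamp:

```python
from datetime import datetime, timezone

def firehose_s3_prefix(ts: datetime) -> str:
    """Return the default Firehose S3 key prefix (UTC year/month/day/hour)
    for a given record-arrival timestamp."""
    ts = ts.astimezone(timezone.utc)
    return ts.strftime("%Y/%m/%d/%H/")

print(firehose_s3_prefix(datetime(2022, 11, 18, 15, 45, tzinfo=timezone.utc)))
# → 2022/11/18/15/
```

This hour-level layout is what the Glue crawler walks over in the next step when it discovers the schema.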

  4. Catalog your data with AWS Glue
  • Create a crawler to automatically discover the schema of your data in S3


  • Create a database and a table, then edit the metadata schema
  5. Create a transformation job with AWS Glue Studio

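The Glue Studio job applies basic transformations such as renaming fields and casting types. A plain-Python sketch of an equivalent record-level mapping (the field names and casts here are illustrative assumptions, not the workshop's actual schema):

```python
def transform_record(rec: dict) -> dict:
    """Apply an ApplyMapping-style transform: rename fields,
    cast string values to target types, and drop everything else."""
    mapping = {  # source field -> (target field, target type)
        "sensorId": ("sensor_id", int),
        "temperature": ("temp_c", float),
        "status": ("status", str),
    }
    return {new: cast(rec[old]) for old, (new, cast) in mapping.items() if old in rec}

print(transform_record({"sensorId": "7", "temperature": "21.5", "status": "OK"}))
# → {'sensor_id': 7, 'temp_c': 21.5, 'status': 'OK'}
```

In the actual job, Glue Studio builds this mapping visually and runs it at scale on Spark; the sketch only shows the per-record logic.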

  6. SQL analytics on a large-scale open dataset using Amazon Athena
  • Create a database:

    CREATE DATABASE gdelt;

  • Create a metadata table for the GDELT EVENTS data:

    CREATE EXTERNAL TABLE IF NOT EXISTS gdelt.events (
      globaleventid INT,
      day INT,
      monthyear INT,
      year INT,
      fractiondate FLOAT,
      actor1code string,
      actor1name string,
      actor1countrycode string,
      actor1knowngroupcode string,
      actor1ethniccode string,
      actor1religion1code string,
      actor1religion2code string,
      actor1type1code string,
      actor1type2code string,
      actor1type3code string,
      actor2code string,
      actor2name string,
      actor2countrycode string,
      actor2knowngroupcode string,
      actor2ethniccode string,
      actor2religion1code string,
      actor2religion2code string,
      actor2type1code string,
      actor2type2code string,
      actor2type3code string,
      isrootevent BOOLEAN,
      eventcode string,
      eventbasecode string,
      eventrootcode string,
      quadclass INT,
      goldsteinscale FLOAT,
      nummentions INT,
      numsources INT,
      numarticles INT,
      avgtone FLOAT,
      actor1geo_type INT,
      actor1geo_fullname string,
      actor1geo_countrycode string,
      actor1geo_adm1code string,
      actor1geo_lat FLOAT,
      actor1geo_long FLOAT,
      actor1geo_featureid INT,
      actor2geo_type INT,
      actor2geo_fullname string,
      actor2geo_countrycode string,
      actor2geo_adm1code string,
      actor2geo_lat FLOAT,
      actor2geo_long FLOAT,
      actor2geo_featureid INT,
      actiongeo_type INT,
      actiongeo_fullname string,
      actiongeo_countrycode string,
      actiongeo_adm1code string,
      actiongeo_lat FLOAT,
      actiongeo_long FLOAT,
      actiongeo_featureid INT,
      dateadded INT,
      sourceurl string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES (
      'serialization.format' = '\t',
      'field.delim' = '\t'
    )
    LOCATION 's3://gdelt-open-data/events/';
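The SerDe properties above tell Athena to read the raw GDELT files as tab-delimited text in place on S3. A quick local sketch of how one such line splits into the first few declared columns (the sample values are made up for illustration; a real line has all 58 fields):

```python
def parse_gdelt_line(line: str) -> dict:
    """Split one tab-delimited GDELT events line into the first
    five columns declared in the Athena table definition."""
    cols = ["globaleventid", "day", "monthyear", "year", "fractiondate"]
    values = line.rstrip("\n").split("\t")
    return dict(zip(cols, values[: len(cols)]))

sample = "498595295\t20151023\t201510\t2015\t2015.8082"  # fabricated example row
print(parse_gdelt_line(sample))
```

Athena performs this same delimiter-based splitting (via LazySimpleSerDe) at query time, which is why no ETL is needed before the first query.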

  • Create metadata tables for the GDELT lookup tables


  • Example output: screenshots of the Athena query results

This workshop is based on an AWS Workshop Studio lab; the link is below.
https://catalog.us-east-1.prod.workshops.aws/workshops/ea7ddf16-5e0a-4ec7-b54e-5cadf3028b78/en-US

