
This workshop builds a serverless data lake architecture using Amazon Kinesis Data Firehose for streaming data ingestion, AWS Glue for data integration (ETL and catalog management), Amazon S3 for data lake storage, and Amazon Athena for SQL big data analytics.


AWS-Serverless-Data-Lake

To demonstrate the power of data lake architectures, in this workshop I ingested streaming data from the Kinesis Data Generator (KDG) into Amazon S3, then created a big data processing pipeline, without servers or clusters, that is ready to process huge amounts of data. The dataset is an open dataset from the AWS Open Data Registry called GDELT; it is over 170 GB in size and comprises thousands of uncompressed CSV files. I also created an AWS Glue transform job to perform basic transformations on the Amazon S3 source data. Finally, I queried the larger public dataset, with more tables, using Amazon Athena to observe the various AWS services working together.

  1. Create a CloudFormation stack by uploading the template file (serverlessDataLakeDay.json)
  2. Create a Kinesis Data Firehose delivery stream to ingest data into your data lake


  3. Install the Kinesis Data Generator tool (KDG)

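The KDG sends records rendered from a user-defined template to the delivery stream. As a rough local sketch of what a rendered record looks like (the field names here are illustrative, not the workshop's actual template):

```python
import json
import random

# Hypothetical KDG-style template: field names are illustrative only.
def generate_record():
    """Build one fake sensor-style event, similar to what the
    Kinesis Data Generator would render from its template."""
    return {
        "sensorId": random.randint(1, 100),
        "temperature": round(random.uniform(10.0, 35.0), 2),
        "status": random.choice(["OK", "WARN", "FAIL"]),
    }

record = generate_record()
print(json.dumps(record))
```

In the real tool you paste a template into the KDG web UI and it sends the rendered JSON records to the Firehose delivery stream at the rate you choose.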

Screenshot: monitoring for the Firehose delivery stream.

Amazon Kinesis Data Firehose writes the data to Amazon S3.
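By default, Firehose partitions the objects it delivers to S3 under a UTC `YYYY/MM/dd/HH/` key prefix. A minimal sketch of how that prefix is derived from an arrival timestamp:

```python
from datetime import datetime, timezone

def firehose_s3_prefix(ts: datetime) -> str:
    """Return the default Firehose S3 key prefix (UTC year/month/day/hour)
    for a given record-arrival timestamp."""
    ts = ts.astimezone(timezone.utc)
    return ts.strftime("%Y/%m/%d/%H/")

print(firehose_s3_prefix(datetime(2022, 11, 18, 15, 45, tzinfo=timezone.utc)))
# → 2022/11/18/15/
```

This hour-level layout is what the Glue crawler walks over in the next step when it discovers the schema.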

  4. Catalog your data with AWS Glue
  • Create a crawler to automatically discover the schema of your data in S3


  • Create a database and a table, then edit the metadata schema
  5. Create a transformation job with AWS Glue Studio

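The Glue Studio job applies basic transformations such as renaming fields and casting types. A plain-Python sketch of an equivalent record-level mapping (the field names and casts here are illustrative assumptions, not the workshop's actual schema):

```python
def transform_record(rec: dict) -> dict:
    """Apply an ApplyMapping-style transform: rename fields,
    cast string values to target types, and drop everything else."""
    mapping = {  # source field -> (target field, target type)
        "sensorId": ("sensor_id", int),
        "temperature": ("temp_c", float),
        "status": ("status", str),
    }
    return {new: cast(rec[old]) for old, (new, cast) in mapping.items() if old in rec}

print(transform_record({"sensorId": "7", "temperature": "21.5", "status": "OK"}))
# → {'sensor_id': 7, 'temp_c': 21.5, 'status': 'OK'}
```

In the actual job, Glue Studio builds this mapping visually and runs it at scale on Spark; the sketch only shows the per-record logic.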

  6. SQL analytics on a large-scale open dataset using Amazon Athena
  • Create a database:

    CREATE DATABASE gdelt;

  • Create a metadata table for the GDELT EVENTS data:

    CREATE EXTERNAL TABLE IF NOT EXISTS gdelt.events (
      globaleventid INT,
      day INT,
      monthyear INT,
      year INT,
      fractiondate FLOAT,
      actor1code string,
      actor1name string,
      actor1countrycode string,
      actor1knowngroupcode string,
      actor1ethniccode string,
      actor1religion1code string,
      actor1religion2code string,
      actor1type1code string,
      actor1type2code string,
      actor1type3code string,
      actor2code string,
      actor2name string,
      actor2countrycode string,
      actor2knowngroupcode string,
      actor2ethniccode string,
      actor2religion1code string,
      actor2religion2code string,
      actor2type1code string,
      actor2type2code string,
      actor2type3code string,
      isrootevent BOOLEAN,
      eventcode string,
      eventbasecode string,
      eventrootcode string,
      quadclass INT,
      goldsteinscale FLOAT,
      nummentions INT,
      numsources INT,
      numarticles INT,
      avgtone FLOAT,
      actor1geo_type INT,
      actor1geo_fullname string,
      actor1geo_countrycode string,
      actor1geo_adm1code string,
      actor1geo_lat FLOAT,
      actor1geo_long FLOAT,
      actor1geo_featureid INT,
      actor2geo_type INT,
      actor2geo_fullname string,
      actor2geo_countrycode string,
      actor2geo_adm1code string,
      actor2geo_lat FLOAT,
      actor2geo_long FLOAT,
      actor2geo_featureid INT,
      actiongeo_type INT,
      actiongeo_fullname string,
      actiongeo_countrycode string,
      actiongeo_adm1code string,
      actiongeo_lat FLOAT,
      actiongeo_long FLOAT,
      actiongeo_featureid INT,
      dateadded INT,
      sourceurl string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES (
      'serialization.format' = '\t',
      'field.delim' = '\t'
    )
    LOCATION 's3://gdelt-open-data/events/';
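The SerDe properties above tell Athena to read the raw GDELT files as tab-delimited text in place on S3. A quick local sketch of how one such line splits into the first few declared columns (the sample values are made up for illustration; a real line has all 58 fields):

```python
def parse_gdelt_line(line: str) -> dict:
    """Split one tab-delimited GDELT events line into the first
    five columns declared in the Athena table definition."""
    cols = ["globaleventid", "day", "monthyear", "year", "fractiondate"]
    values = line.rstrip("\n").split("\t")
    return dict(zip(cols, values[: len(cols)]))

sample = "498595295\t20151023\t201510\t2015\t2015.8082"  # fabricated example row
print(parse_gdelt_line(sample))
```

Athena performs this same delimiter-based splitting (via LazySimpleSerDe) at query time, which is why no ETL is needed before the first query.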

  • Create metadata tables for the GDELT lookup tables


  • Example output: screenshots of the Athena query results

This workshop is based on an AWS Workshop Studio lab; the link is below.
https://catalog.us-east-1.prod.workshops.aws/workshops/ea7ddf16-5e0a-4ec7-b54e-5cadf3028b78/en-US

