The Analytical Dataset Generation (ADG) cluster converts the latest versions of all records in specified HBase tables into Parquet files stored on S3. It then generates Hive tables that give downstream data processing and analytics tasks convenient SQL access to that data.
- At a defined time, a CloudWatch event will trigger the `EMR Launcher` Lambda function.
- The `EMR Launcher` reads EMR Cluster configuration files from the `Config` S3 bucket, then calls the `RunJobFlow` API of the EMR service, which results in an `Analytical Dataset Generator` (`ADG`) EMR cluster being launched.
- The `ADG Cluster` is configured as a read-replica of the `Ingest HBase` EMR cluster; a PySpark step run on the cluster reads HBase Storefiles from the `Input` S3 bucket and produces Parquet files in the `Output` S3 bucket.
- The PySpark step then creates external Hive tables over those S3 objects, storing the table definitions in a Glue database.
- Once processing is complete, the `ADG Cluster` terminates.
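
The launch sequence above can be sketched as a small Python handler. This is a minimal illustration, not the actual `EMR Launcher` code. It assumes the cluster configuration is a single JSON file (the real files may use another format or be split across several objects), and the function names, config keys, step name, and script URI are all hypothetical. The clients are expected to be boto3 `s3` and `emr` clients. Only `get_object`, `run_job_flow`, and the request fields shown are real AWS API surface.

```python
import json


def build_run_job_flow_request(cluster_config, step_script_uri):
    """Turn a parsed cluster config (hypothetical layout) into RunJobFlow kwargs."""
    return {
        "Name": cluster_config["Name"],
        "ReleaseLabel": cluster_config["ReleaseLabel"],
        "Applications": [{"Name": name} for name in cluster_config["Applications"]],
        "Instances": {
            **cluster_config["Instances"],
            # Without keep-alive, EMR shuts the cluster down after the last step.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "generate-dataset",  # hypothetical step name
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", step_script_uri],
                },
            }
        ],
        "JobFlowRole": cluster_config["JobFlowRole"],
        "ServiceRole": cluster_config["ServiceRole"],
    }


def launch_cluster(emr_client, s3_client, config_bucket, config_key, step_script_uri):
    """Read the cluster config from S3 and start the cluster via RunJobFlow."""
    body = s3_client.get_object(Bucket=config_bucket, Key=config_key)["Body"].read()
    request = build_run_job_flow_request(json.loads(body), step_script_uri)
    response = emr_client.run_job_flow(**request)
    return response["JobFlowId"]
```

Passing the clients in as arguments keeps the sketch testable without AWS credentials. In a Lambda handler they would typically be created once with `boto3.client("emr")` and `boto3.client("s3")`. Setting `KeepJobFlowAliveWhenNoSteps` to `False` is one way to get the terminate-on-completion behaviour described in the last bullet; the real cluster may achieve it differently.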