
Data preparation using Amazon Redshift with AWS Glue DataBrew

This lab is provided as part of AWS Innovate Data Edition and has been adapted from an AWS blog post. Click here to explore the full list of hands-on labs.

ℹ️ You will run this lab in your own AWS account, and running it will incur some costs. Please follow the directions at the end of the lab to remove resources and avoid future costs.

Table of Contents

  • Overview
  • Architecture
  • Step 1 - Create Redshift Cluster
  • Step 2 - Setup Student Dataset
  • Step 3 - Create VPC Endpoints
  • Step 4 - Alter Security Group Rules
  • Step 5 - Create an S3 bucket
  • Step 6 - Create an IAM Role
  • Step 7 - Prepare data using DataBrew
  • Summary
  • Cleanup
  • Survey

Overview

In this lab we will use AWS Glue DataBrew to prepare data from Amazon Redshift. We will explore a student dataset stored in Amazon Redshift containing details such as school ID, name, age, country, gender, number of study hours, and marks. We will use AWS Glue DataBrew to connect to the Redshift cluster and ingest data. This data will then be prepared, cleaned, and made ready for a downstream machine learning process.

With AWS Glue DataBrew, you can evaluate the quality of your data by profiling it to understand data patterns and detect anomalies. You can also visualize, clean, and normalize data directly from your data lake, data warehouse, and databases, including Amazon S3, Amazon Redshift, Amazon Aurora, and Amazon RDS.

Now, with added support for JDBC-accessible databases, DataBrew also supports additional data stores, including PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. In this lab, we use DataBrew to clean data from an Amazon Redshift table, then transform it using different feature engineering techniques to prepare it for building a machine learning (ML) model. Finally, we store the transformed data in an S3 data lake to build the ML model in Amazon SageMaker.

Architecture

In this lab we will set up the following architecture, whose components are explained below. Architecture

Components

  1. VPC - The Amazon Virtual Private Cloud (VPC) will host the private subnet where the Amazon Redshift cluster will be placed. In this lab, we will use the default VPC.
  2. Private Subnet - The private subnet will host the Elastic Network Interface for the Redshift cluster. In this lab we will use the default subnet. Note: the default subnet is attached to an internet gateway and is therefore not truly private; it is used here for lab purposes only.
  3. Elastic Network Interface (ENI) - The network interface through which the Redshift cluster communicates within the subnet.
  4. Security Group - The security group specifies the inbound and outbound rules that secure the network traffic coming in and going out of the ENI. In this lab we will alter the existing security group.
  5. Amazon Redshift - Amazon Redshift will host the student dataset for this lab.
  6. AWS Glue - The connection to Amazon Redshift created by AWS Glue DataBrew is stored in the AWS Glue service.
  7. Amazon S3 - Amazon S3 will store Glue DataBrew intermediate logs and the outputs from the DataBrew recipe jobs.
  8. Glue Interface Endpoint - An interface endpoint makes the AWS Glue service available within the VPC.
  9. S3 Gateway Endpoint - A gateway endpoint makes the Amazon S3 service available within the VPC.
  10. AWS Glue DataBrew - This service will connect to the student dataset in Amazon Redshift. It allows users to prepare the data and create a reusable recipe that refines the data and makes it available to AI/ML services such as Amazon SageMaker.

Step 1 - Create Redshift Cluster

  1. Navigate to the Amazon Redshift service in the AWS Console.
  2. Click on create cluster and name the cluster student-cluster.
  3. You could choose the Free Trial option, which will create a cluster with sample data. For this lab we will choose Production. createRSClusterStudent
  4. Select the instance type as dc2.large and 2 nodes.
  5. Leave the checkbox for Load Sample data unchecked.
  6. Use defaults for VPC and Security group. Note the security group. We will alter the inbound and outbound rules for this security group at a later step. rsadditionalconfig
  7. Select Create Cluster.
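
For reference, the same cluster can also be created with boto3. This is a minimal sketch under the lab's settings; the admin username and password are placeholders you would replace with your own credentials.

```python
import boto3

redshift = boto3.client("redshift")

# Create a 2-node dc2.large cluster named student-cluster in the default VPC.
# MasterUsername/MasterUserPassword are placeholders - supply your own values.
redshift.create_cluster(
    ClusterIdentifier="student-cluster",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=2,
    DBName="dev",
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe1234",
    PubliclyAccessible=False,
)

# Wait until the cluster becomes available before moving on.
redshift.get_waiter("cluster_available").wait(ClusterIdentifier="student-cluster")
```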

Step 2 - Setup Student Dataset

  1. In the Redshift console, open the Query Editor from the side panel. opensqleditor You could also use your preferred SQL client to execute the SQL statements. See here for connecting using SQL Workbench/J.
  2. Connect to the dev database, which is created during the creation of the cluster. connecttodevdb
  3. In the query editor, run DDLSchemaTable.sql.
  4. This creates the student_schema schema and a study_details table within it.
  5. Next, run studentRecordsInsert.sql to insert the sample student dataset for the lab. runsql
  6. View the student data loaded in Redshift. studentdataloadedinrstab
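
The authoritative SQL lives in DDLSchemaTable.sql and studentRecordsInsert.sql in the lab repository. If you prefer to run the setup programmatically, the sketch below uses the Redshift Data API; the column list is only inferred from the dataset description in the Overview and may differ from the real scripts.

```python
import boto3

rsd = boto3.client("redshift-data")

# Columns inferred from the overview (school id, names, age, country, gender,
# health, study hours, marks); check DDLSchemaTable.sql for the actual DDL.
statements = [
    "CREATE SCHEMA IF NOT EXISTS student_schema;",
    """
    CREATE TABLE IF NOT EXISTS student_schema.study_details (
        school_id   INT,
        first_name  VARCHAR(50),
        last_name   VARCHAR(50),
        school_name VARCHAR(100),
        age         INT,
        country     VARCHAR(50),
        gender      VARCHAR(10),
        health      VARCHAR(20),
        study_hours INT,
        marks       INT
    );
    """,
]

for sql in statements:
    rsd.execute_statement(
        ClusterIdentifier="student-cluster",
        Database="dev",
        DbUser="awsuser",   # placeholder admin user from Step 1
        Sql=sql,
    )
```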

Step 3 - Create VPC Endpoints

S3 Gateway Endpoint

  1. Navigate to the VPC service in the AWS Console.
  2. Select the Endpoints option from the left pane.
  3. Create a VPC endpoint by selecting 'Create Endpoint'.
  4. Set the service category as AWS services and search for the S3 service.
  5. Select the service with type Gateway. creategwendpoints3-1
  6. The default VPC should be selected by default; otherwise, select the VPC where the Redshift cluster was created.
  7. Leave other options as is, scroll down, and click on 'Create Endpoint'. creategwendpoints3-2.png

Glue Interface Endpoint

  1. Create a VPC endpoint by selecting 'Create Endpoint'.
  2. Set the service category as AWS services and search for the Glue service.
  3. Select the service with type Interface. glueendpoint-1.png
  4. The default VPC should be selected by default; otherwise, select the VPC where the Redshift cluster was created.
  5. Leave other options as is, scroll down, and click on 'Create Endpoint'. glueendpoint-2.png
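
Both endpoints can also be created with boto3. A sketch assuming the default VPC; the region and the VPC, route table, subnet, and security group IDs are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")
region = "us-east-1"        # placeholder - use your region
vpc_id = "vpc-xxxxxxxx"     # default VPC (placeholder)

# S3 gateway endpoint - attached to the VPC's route table(s).
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId=vpc_id,
    ServiceName=f"com.amazonaws.{region}.s3",
    RouteTableIds=["rtb-xxxxxxxx"],      # placeholder
)

# Glue interface endpoint - placed in the subnet with the cluster's security group.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId=vpc_id,
    ServiceName=f"com.amazonaws.{region}.glue",
    SubnetIds=["subnet-xxxxxxxx"],       # placeholder
    SecurityGroupIds=["sg-xxxxxxxx"],    # placeholder
)
```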

Step 4 - Alter Security Group Rules

  1. Go to the Security group feature in the EC2 Console.
  2. Find the security group noted in Step 1.
  3. Alter the inbound rules as shown. sg-inboundrule.png
  4. Alter the outbound rules as shown. While altering the outbound rules, ensure that the prefix list selected (pl-xxxxx) matches the prefix list created for the S3 VPC endpoint. sg-outboundrule.png
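
The exact rules to apply are shown in the screenshots referenced above. A common configuration for DataBrew-to-Redshift connectivity is a self-referencing inbound rule plus an outbound rule to the S3 prefix list; the boto3 sketch below assumes that configuration, and all IDs are placeholders.

```python
import boto3

ec2 = boto3.client("ec2")
sg_id = "sg-xxxxxxxx"          # security group noted in Step 1 (placeholder)
s3_prefix_list = "pl-xxxxx"    # prefix list of the S3 gateway endpoint (placeholder)

# Inbound: allow all TCP from the security group itself (self-referencing rule),
# so the DataBrew-managed ENIs can reach the Redshift cluster.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 0, "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }],
)

# Outbound: allow HTTPS to the S3 prefix list used by the gateway endpoint.
ec2.authorize_security_group_egress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "PrefixListIds": [{"PrefixListId": s3_prefix_list}],
    }],
)
```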

Step 5 - Create an S3 bucket

  1. Go to the S3 Console and click on create bucket.
  2. S3 bucket names have to be unique. Choose a unique name and create the bucket.
  3. Create two folders named AWSDatasetOutput and recipeJobOutput. s3bucket-dsprefix.png
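
A boto3 sketch of the same step, using a placeholder bucket name. S3 has no real folders, so the two "folders" are created as zero-byte keys ending in "/".

```python
import boto3

s3 = boto3.client("s3")
bucket = "your-unique-bucket-name"   # placeholder - bucket names must be globally unique

s3.create_bucket(
    Bucket=bucket,
    # Omit CreateBucketConfiguration when creating the bucket in us-east-1.
    CreateBucketConfiguration={"LocationConstraint": "ap-southeast-2"},
)

# Zero-byte keys ending in "/" act as folder placeholders in the console.
s3.put_object(Bucket=bucket, Key="AWSDatasetOutput/")
s3.put_object(Bucket=bucket, Key="recipeJobOutput/")
```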

Step 6 - Create an IAM Role

Create new IAM Policy

  1. Go to the IAM Policy console.
  2. Click on Create Policy.
  3. Navigate to the JSON tab and paste the contents of the policy JSON.
  4. Click Next: Tags, then enter AwsGlueDataBrewDataResourcePolicy as the name of the policy and click on Create Policy.

Create new IAM Role

  1. Go to the IAM Role console.
  2. Click on Create Role.
  3. Select Glue as the trusted entity. Click on Next: Permissions
  4. Filter for AwsGlueDataBrewDataResourcePolicy and select the policy.
  5. Click on Next:Tags. Add any tags as appropriate.
  6. Click on Next: Review
  7. Enter AwsGlueDataBrewDataAccessRole as the role name and click on Create Role.
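
The policy and role can also be created with boto3. A sketch assuming the policy JSON provided with the lab has been saved locally as policy.json (a hypothetical file name), with Glue as the trusted service as selected in the console step above.

```python
import json
import boto3

iam = boto3.client("iam")

# Create the policy from the JSON document provided with the lab
# (read here from a local file named policy.json - adjust the path as needed).
with open("policy.json") as f:
    policy = iam.create_policy(
        PolicyName="AwsGlueDataBrewDataResourcePolicy",
        PolicyDocument=f.read(),
    )

# Trust policy for the Glue service, matching the trusted entity chosen above.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AwsGlueDataBrewDataAccessRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

iam.attach_role_policy(
    RoleName="AwsGlueDataBrewDataAccessRole",
    PolicyArn=policy["Policy"]["Arn"],
)
```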

Step 7 - Prepare data using DataBrew

Create new connection

  1. Go to the AWS Glue DataBrew console.
  2. On the left pane select Datasets and navigate to the Connections tab. DBrewnewconnection.png
  3. Create a new Connection.
  4. Enter the Connection name as students-connection.
  5. Select 'Amazon Redshift' as the Connection type.
  6. Select the Redshift cluster, and enter the database name, the AWS user, and the password that were used when creating the cluster. DBrewnewconnection-1.png
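
As noted in the Architecture section, the connection DataBrew creates is stored as an AWS Glue connection. An equivalent boto3 sketch is below; the JDBC URL host, credentials, availability zone, subnet, and security group are placeholders you would replace with the values from your cluster.

```python
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "students-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            # Redshift endpoint of student-cluster (placeholder host).
            "JDBC_CONNECTION_URL": "jdbc:redshift://student-cluster.xxxx.us-east-1.redshift.amazonaws.com:5439/dev",
            "USERNAME": "awsuser",        # placeholder
            "PASSWORD": "ChangeMe1234",   # placeholder
        },
        "PhysicalConnectionRequirements": {
            "AvailabilityZone": "us-east-1a",        # placeholder
            "SubnetId": "subnet-xxxxxxxx",           # placeholder
            "SecurityGroupIdList": ["sg-xxxxxxxx"],  # placeholder
        },
    }
)
```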

Create new dataset

  1. Select the created connection and click on 'Create dataset with this connection'.
  2. Enter the Dataset name as studentrs-dataset.
  3. The connection name should be auto-populated. Select the table, study_details.
  4. Enter the s3 destination as s3://bucketname/AWSDatasetOutput/
  5. Click on Create dataset. studentrs-datasetcreate.png
  6. At this point, if you open the dataset and navigate to the Data profile overview subtab or the Column statistics subtab, you will see no information. This is because a data profiling job has not yet been run; we will do this in a later step. noColumnStatistics.png
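
A boto3 sketch of the same dataset creation; the bucket name is a placeholder, and the TempDirectory points at the AWSDatasetOutput folder created in Step 5.

```python
import boto3

databrew = boto3.client("databrew")

databrew.create_dataset(
    Name="studentrs-dataset",
    Input={
        "DatabaseInputDefinition": {
            "GlueConnectionName": "students-connection",
            "DatabaseTableName": "student_schema.study_details",
            # Intermediate location DataBrew uses when reading from the database.
            "TempDirectory": {
                "Bucket": "your-unique-bucket-name",   # placeholder
                "Key": "AWSDatasetOutput/",
            },
        }
    },
)
```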

Create new project

  1. Select the created dataset and click on 'Create project with this dataset'. createnewprojectfromds.png
  2. Enter the project name as studentrs-project. The recipe name, the dataset name, and the table name should be auto-populated. createdbrewproject-1.png
  3. Select AwsGlueDataBrewDataAccessRole, created in Step 6, as the Role name.
  4. Click on Create Project. Once created, the project will run and provide a sample dataset view.
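
A boto3 sketch of the project creation. The recipe name below is an assumption (DataBrew suggests a name based on the project), and the role ARN uses a placeholder account ID.

```python
import boto3

databrew = boto3.client("databrew")

databrew.create_project(
    Name="studentrs-project",
    DatasetName="studentrs-dataset",
    RecipeName="studentrs-project-recipe",   # assumed auto-suggested recipe name
    RoleArn="arn:aws:iam::123456789012:role/AwsGlueDataBrewDataAccessRole",  # placeholder account id
)
```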

Data profiling

  1. Navigate to Jobs in the left pane and go to the Profile jobs tab. createJobsProfiles.png
  2. Click on 'Create Job' and enter the job name as student-profile-job.
  3. Select the 'Create a profile job' option for Job Type.
  4. Enter studentrs-dataset as Job input. createProfileJobSetting.png
  5. For the job output settings, enter the S3 location created in Step 5, i.e. s3://bucketname/. createProfilejob.png
  6. Click on 'Create and run job'.
  7. Select the created job and monitor progress to make sure the job has completed. This might take a few minutes depending on the size of the dataset. jobcompletion.png
  8. Navigate to the Data lineage sub tab for the selected dataset to view a graphical representation of the data flow. datalineageview.png
  9. Navigate to the dataset and view column statistics. The data profiling job populates this data. columnstatistics.png
  10. The data profile provides insight into the data, e.g. missing data, outliers, etc. In this dataset, three records have no age value populated. columnstatistics-2.png
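
A boto3 sketch of the profile job creation and run; the bucket name and the account ID in the role ARN are placeholders.

```python
import boto3

databrew = boto3.client("databrew")

databrew.create_profile_job(
    Name="student-profile-job",
    DatasetName="studentrs-dataset",
    OutputLocation={"Bucket": "your-unique-bucket-name"},   # placeholder bucket from Step 5
    RoleArn="arn:aws:iam::123456789012:role/AwsGlueDataBrewDataAccessRole",  # placeholder account id
)

# Kick off the profile job and note the run ID so progress can be polled.
run = databrew.start_job_run(Name="student-profile-job")
print(run["RunId"])
```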

Data refining

  1. DataBrew allows the refining of data through a number of tools that we will explore below. Using these tools, we can create DataBrew recipes to refine and prepare the data to be ingested by Amazon SageMaker to drive inferences.
  2. As part of the refining process, let's delete the first name, last name, and school name columns. refineinputtosage.png refineinputtosage-2.png
  3. We know from the profiling report that the age value is missing in three records. Let’s fill in the missing value with the median age of other records. Choose Missing and choose Fill with numeric aggregate. fillMissingwithAggregate.png fillMissingwithAggregate-2.png
  4. The next step is to convert the categorical value to a numerical value for the gender column.
    • Choose Mapping and choose Categorical mapping. CategoryMapping.png
    • For Source column, choose gender.
    • For Mapping options, select Map top 2 values.
    • For Map values, select Map values to numeric values.
    • For F, choose 1.
    • For M, choose 2. categoricalmapping.png
  5. ML algorithms often can’t work on label data directly, requiring the input variables to be numeric. One-hot encoding is one technique that converts categorical data that doesn’t have an ordinal relationship with each other to numeric data. To apply one-hot encoding, complete the following steps:
    • Choose Encode and choose One-hot encode column.
      onehotencode.png
    • For Column select health. onehotencode-2.png
    • Click Apply
    • This step splits the health column into a number of columns.
  6. A number of similar changes can be made, e.g. deleting the original gender column and renaming the new gender_mapped column to gender.
  7. After all the desired refinements, a recipe containing all the changes to be applied has been created. It can be viewed in the right-hand pane of the screen. viewrecipe.png
  8. The recipe can now be published so that it can be applied to the entire dataset. Select the publish recipe option and leave the version description as is. PublishRecipe.png
  9. The published recipes can be viewed by selecting the Recipe option on the left pane. viewpublshedrecipe.png
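
Recipes can also be managed programmatically once the steps are in place. The sketch below publishes the working recipe and lists its versions; the recipe name is an assumption (DataBrew auto-generates it from the project name), so check the Recipes page for the real one.

```python
import boto3

databrew = boto3.client("databrew")

# "studentrs-project-recipe" is an assumed auto-generated name - verify in the console.
recipe_name = "studentrs-project-recipe"

databrew.publish_recipe(
    Name=recipe_name,
    Description="Cleaning and feature-engineering steps for the student dataset",
)

# List the published versions; the latest one is used by the recipe job below.
for version in databrew.list_recipe_versions(Name=recipe_name)["Recipes"]:
    print(version["Name"], version["RecipeVersion"])
```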

Create Recipe Job on entire dataset

Now that the recipe is created, it can be run against the entire student dataset.

  1. On the recipe page, choose Create job with this Recipe. createjobwithrecipe.png
  2. For Job name, enter student-performance.
  3. Leave the job type as Create a recipe job
  4. Dataset input as studentrs-dataset.
  5. Select the output to Amazon S3 and select the S3 bucket we created in Step 5. createrecipejob-s3output.png
  6. For the IAM Role name select AwsGlueDataBrewDataAccessRole
  7. Click on Create and run job.
  8. Navigate to the Job and wait for the job to finish. This should take a few minutes. outputfromrecipejob-1.png
  9. Navigate to the output to view the results of the recipe job created in the selected Amazon S3 bucket. viewpublshedrecipe.png
  10. This CSV file can now be fed into AI/ML services for further analysis as required.
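
A boto3 sketch of the recipe job creation and run. The recipe name is the assumed auto-generated name used earlier, and the bucket name and role ARN account ID are placeholders.

```python
import boto3

databrew = boto3.client("databrew")

databrew.create_recipe_job(
    Name="student-performance",
    DatasetName="studentrs-dataset",
    RecipeReference={
        "Name": "studentrs-project-recipe",   # assumed auto-generated recipe name
        "RecipeVersion": "1.0",               # the version published above
    },
    Outputs=[{
        "Format": "CSV",
        "Location": {
            "Bucket": "your-unique-bucket-name",   # placeholder bucket from Step 5
            "Key": "recipeJobOutput/",
        },
    }],
    RoleArn="arn:aws:iam::123456789012:role/AwsGlueDataBrewDataAccessRole",  # placeholder
)

databrew.start_job_run(Name="student-performance")
```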

Summary

In this lab, we created an Amazon Redshift cluster data warehouse and loaded a student dataset. We used a JDBC connection to create a DataBrew dataset for an Amazon Redshift table. We then performed data profiling followed by data transformation using DataBrew, preparing the data to be ingested by an ML model-building exercise.

Cleanup

Follow the steps below to clean up your account and prevent any additional charges:

  • Navigate to the Jobs and delete the recipe job and the profiling job. cleanup1.png
  • Navigate to the Recipes and delete the recipe and versions. cleanup2.png
  • Navigate to the Projects and delete the project created in the lab. cleanup3.png
  • Navigate to the Datasets and delete the dataset created in the lab. cleanup4.png
  • Navigate to the AWS Glue service and open Connections.
  • Delete the connection created in the lab. cleanup5.png
  • Navigate to S3 console
  • Empty and then delete the bucket created in Step 5.
  • You can choose to remove the VPC Endpoints, IAM Policy and role and any security group alterations done.
  • Navigate to Redshift console
  • Open the student-cluster and delete the cluster. Uncheck the option to take a snapshot before deleting the cluster. cleanup6.png
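
If you prefer to script the cleanup, the sketch below covers the DataBrew resources, the Glue connection, and the Redshift cluster. The recipe name is the assumed name used earlier; empty and delete the S3 bucket, and remove the VPC endpoints and IAM resources, separately.

```python
import boto3

databrew = boto3.client("databrew")
glue = boto3.client("glue")
redshift = boto3.client("redshift")

# DataBrew resources: jobs first, then the project, recipe version, and dataset.
databrew.delete_job(Name="student-performance")
databrew.delete_job(Name="student-profile-job")
databrew.delete_project(Name="studentrs-project")
databrew.delete_recipe_version(Name="studentrs-project-recipe", RecipeVersion="1.0")  # assumed name
databrew.delete_dataset(Name="studentrs-dataset")

# Glue connection and the Redshift cluster (no final snapshot, as in the console step).
glue.delete_connection(ConnectionName="students-connection")
redshift.delete_cluster(ClusterIdentifier="student-cluster", SkipFinalClusterSnapshot=True)
```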

Survey

Please help us by providing your feedback here. Participants who complete the surveys from AWS Innovate Online Conference - Data Edition will receive a gift code for USD 25 in AWS credits. AWS credits will be sent via email by 30 September, 2021.

