
AWS Redshift - Getting Started

The Mystique Unicorn app is generating a lot of data. The data science team is looking to get started with Amazon Redshift as their data warehouse platform of choice.

Can you help them with a proof of concept that shows how to use Amazon Redshift?

🎯 Solutions

We will build a simple 1-node Redshift cluster, ingest sample data from S3, and run some simple queries.

(Architecture diagram: Miztiik Automation - AWS Redshift - Getting Started)

In this article, we will build an architecture similar to the one shown above.

  1. 🧰 Prerequisites

    This demo, its instructions, scripts, and CloudFormation template are designed to be run in us-east-1. With a few modifications you can try it out in other regions as well (not covered here).

    • 🛠 AWS CLI Installed & Configured - Get help here
    • 🛠 AWS CDK Installed & Configured - Get help here
    • 🛠 Python packages. Change the commands below to suit your OS; the following is written for Amazon Linux 2
      • Python3 - yum install -y python3
      • Python Pip - yum install -y python-pip
      • Virtualenv - pip3 install virtualenv
  2. ⚙️ Setting up the environment

    • Get the application code

      git clone https://github.com/miztiik/redshift-demo
      cd redshift-demo
  3. 🚀 Prepare the dev environment to run AWS CDK

    We will use the CDK to make our deployments easier. Let's go ahead and install the necessary components.

    # You should have npm pre-installed
    # If you DON'T have CDK installed
    npm install -g aws-cdk
    
    # Make sure you are in the project root directory
    python3 -m venv .venv
    source .venv/bin/activate
    pip3 install -r requirements.txt

    The very first time you deploy an AWS CDK app into an environment (account/region), you'll need to install a bootstrap stack. Otherwise, just go ahead and deploy using cdk deploy.

    cdk bootstrap
    cdk ls
    # Follow on screen prompts

    You should see an output listing the available stacks:

    redshift-demo-vpc-stack
    redshift-demo-stack
  4. 🚀 Deploying the application

    Let us walk through each of the stacks,

    • Stack: redshift-demo-stack

      To host the Redshift cluster, we need a VPC; this will be created for us as a dependent stack. The cluster database is named comments_cluster, with dwh_user as the superuser. The password for the cluster is generated by AWS Secrets Manager. To interact with other AWS services like S3, we need an IAM role; this is created along with the cluster. A rough sketch of what this stack can look like in CDK is shown after the deploy step below.

      Initiate the deployment with the following command,

      cdk deploy redshift-demo-stack

      Check the Outputs section of the stack for the cluster endpoint and password.
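
      For reference, here is a hand-written sketch of what such a stack can look like with CDK v2 in Python. It is not the repo's exact code: construct IDs, the node type, and the use of a public subnet are assumptions for illustration.

      # A minimal sketch of a 1-node Redshift cluster stack (CDK v2, Python).
      # Not the repo's exact implementation; IDs, node type and networking
      # choices below are illustrative assumptions.
      from aws_cdk import (
          Stack,
          aws_ec2 as ec2,
          aws_iam as iam,
          aws_redshift as redshift,
          aws_secretsmanager as sm,
      )
      from constructs import Construct

      class RedshiftDemoStack(Stack):
          def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
              super().__init__(scope, construct_id, **kwargs)

              # VPC to host the cluster (the repo builds this in a separate vpc stack)
              vpc = ec2.Vpc(self, "redshiftDemoVpc", max_azs=2)

              # Cluster password generated and stored in AWS Secrets Manager
              cluster_secret = sm.Secret(
                  self,
                  "redshiftClusterSecret",
                  generate_secret_string=sm.SecretStringGenerator(exclude_punctuation=True),
              )

              # IAM role so the cluster can read the sample data from S3
              redshift_role = iam.Role(
                  self,
                  "redshiftS3Role",
                  assumed_by=iam.ServicePrincipal("redshift.amazonaws.com"),
                  managed_policies=[
                      iam.ManagedPolicy.from_aws_managed_policy_name("AmazonS3ReadOnlyAccess")
                  ],
              )

              subnet_group = redshift.CfnClusterSubnetGroup(
                  self,
                  "redshiftSubnetGroup",
                  description="Subnets for the demo cluster",
                  subnet_ids=vpc.select_subnets(subnet_type=ec2.SubnetType.PUBLIC).subnet_ids,
              )

              # The 1-node cluster: database comments_cluster, superuser dwh_user
              redshift.CfnCluster(
                  self,
                  "commentsCluster",
                  cluster_type="single-node",
                  node_type="dc2.large",
                  db_name="comments_cluster",
                  master_username="dwh_user",
                  master_user_password=cluster_secret.secret_value.unsafe_unwrap(),
                  iam_roles=[redshift_role.role_arn],
                  cluster_subnet_group_name=subnet_group.ref,
                  publicly_accessible=True,
              )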

  5. 🔬 Working with Redshift Cluster

    If you have a PostgreSQL client, you can use it to connect to the cluster. Here, however, we will use the Redshift query editor in the AWS Console for admin functions. In this repo, you will find data_samples/src/amzn_reviews_en.json. Copy it to your S3 bucket; we will ingest this file into our cluster. You can find more sample data here [2].
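
    If you prefer a script over the AWS CLI (aws s3 cp) for the upload, here is a small boto3 sketch; the bucket name is a placeholder.

    # A small sketch (boto3) that copies the sample file to your own bucket.
    # "YOUR-BUCKET-NAME" is a placeholder; replace it with a bucket you own.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(
        Filename="data_samples/src/amzn_reviews_en.json",
        Bucket="YOUR-BUCKET-NAME",
        Key="amzn_reviews_en.json",
    )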

    If you have Python installed locally, here is a dataframe view of the amzn_reviews_en.json file. Based on this, we can create a table with the following schema (a short sketch of how to reproduce this inspection follows the schema output).

    >>> df
       language product_category          product_id  ...                                  review_title          reviewer_id stars
    0          en     baby_product  product_en_0878845  ...   Not worth the price and very bad cap design  reviewer_en_0987470   1.0
    1          en            shoes  product_en_0004522  ...                                      Garbage!  reviewer_en_0731158   1.0
    2          en   office_product  product_en_0060687  ...               I do not recommend this printer  reviewer_en_0793876   1.0
    3          en   office_product  product_en_0311791  ...  Don't purchase these refurbished cartridges!  reviewer_en_0837288   1.0
    4          en     baby_product  product_en_0472877  ...                                     Not worth  reviewer_en_0878169   1.0
    ...       ...              ...                 ...  ...                                           ...                  ...   ...
    4995       en             home  product_en_0068982  ...                                      So Cute!  reviewer_en_0061521   5.0
    4996       en          kitchen  product_en_0563046  ...                                       Perfect  reviewer_en_0871798   5.0
    4997       en     pet_products  product_en_0848682  ...                        My Doberman Loves Them  reviewer_en_0474236   5.0
    4998       en            other  product_en_0536493  ...                                    Five Stars  reviewer_en_0546192   5.0
    4999       en          apparel  product_en_0964853  ...                                   SUPER CUTE!  reviewer_en_0850986   5.0
    
    [5000 rows x 8 columns]
    
    >>> print("Schema: {}".format(pfile.schema))
    Schema: language: string
    -- field metadata --
    PARQUET:field_id: '1'
    product_category: string
    -- field metadata --
    PARQUET:field_id: '2'
    product_id: string
    -- field metadata --
    PARQUET:field_id: '3'
    review_body: string
    -- field metadata --
    PARQUET:field_id: '4'
    review_id: string
    -- field metadata --
    PARQUET:field_id: '5'
    review_title: string
    -- field metadata --
    PARQUET:field_id: '6'
    reviewer_id: string
    -- field metadata --
    PARQUET:field_id: '7'
    stars: string
    -- field metadata --
    PARQUET:field_id: '8'
    -- schema metadata --
    org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 545
    >>> 
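
    The inspection above can be reproduced locally with something like the sketch below, assuming pandas and pyarrow are installed and that the JSON file contains one review object per line.

    # A rough sketch of the inspection shown above; assumes pandas and pyarrow
    # are installed and that the JSON file is newline-delimited.
    import pandas as pd
    import pyarrow.parquet as pq

    df = pd.read_json("data_samples/src/amzn_reviews_en.json", lines=True)
    print(df)

    pfile = pq.ParquetFile("data_samples/src/amzn_reviews_en_part00000.parquet.snappy")
    print("Schema: {}".format(pfile.schema))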

    CREATE TABLE:

    We will name our table product_reviews_json and create it with the following SQL command.

    CREATE TABLE IF NOT EXISTS product_reviews_json(
    review_id           varchar(100)        not null    distkey sortkey,
    product_id          varchar(100)        not null,
    stars               varchar(10)         not null,
    review_body         varchar(10000)      not null,
    review_title        varchar(1000)       not null,
    reviewer_id         varchar(100)        not null,
    language            varchar(2)          not null,
    product_category    varchar(100)        not null,
    primary key(review_id)
    );

    To make data ingestion faster, split your files, ideally matching the number of slices in your cluster. Here is a list of best practices to improve load performance [1]. One of the recommendations is to use a compressed columnar format such as Parquet with Snappy compression. You can also find the reviews data in Parquet format at data_samples/src/amzn_reviews_en_part00000.parquet.snappy. In S3 you can compare the sizes of these two files:

    (Image: Miztiik Automation - AWS Redshift - Getting Started)

    In case you made an error, you can drop the table using DROP TABLE public.product_reviews_json; and try again.

    INGEST DATA INTO Redshift:

    We will use the COPY command [3] to ingest our JSON file.

    COPY product_reviews_json
    FROM 's3://YOUR-BUCKET-NAME/amzn_reviews_en.json' 
    IAM_ROLE 'YOUR-REDSHIFT-IAM-ROLE-ARN'
    json 'auto ignorecase';
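
    If you prefer to run the COPY from a script instead of the query editor, here is a minimal sketch using the Redshift Data API via boto3; the cluster identifier is a placeholder, and the bucket and role ARN are the same placeholders as above.

    # A minimal sketch that runs the COPY above through the Redshift Data API.
    # ClusterIdentifier is a placeholder; adjust it to match your deployment.
    import boto3

    client = boto3.client("redshift-data", region_name="us-east-1")

    copy_sql = """
        COPY product_reviews_json
        FROM 's3://YOUR-BUCKET-NAME/amzn_reviews_en.json'
        IAM_ROLE 'YOUR-REDSHIFT-IAM-ROLE-ARN'
        json 'auto ignorecase';
    """

    resp = client.execute_statement(
        ClusterIdentifier="redshift-demo-cluster",  # placeholder
        Database="comments_cluster",
        DbUser="dwh_user",
        Sql=copy_sql,
    )
    # Poll describe_statement() with this Id to check whether the load finished
    print(resp["Id"])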

    This file has 5000 records; let us confirm we have ingested all of them.

    /* COUNT FOR ALL RECORDS */
    SELECT COUNT(*) FROM product_reviews_json;

    INGEST PERFORMANCE: If you use split and compressed files, your ingest performance can improve. Since our file is barely 2 MB, this query shows only a small improvement. Do try it with larger files, on the order of a few GBs or TBs:

    select datediff(s,starttime,endtime) as duration,
    * 
    from 
    stl_query 
    where query in (
        57991, /*  query id of json copy */
        58689  /*  query id of parquet copy */
    );

    (Image: Miztiik Automation - AWS Redshift - Getting Started)

    TROUBLESHOOTING LOAD ERRORS: If you get errors while loading data, you can troubleshoot them with the following queries. The first query joins STL_LOAD_ERRORS to STL_LOADERROR_DETAIL to view the detailed errors that occurred during the most recent load.

    select d.query, substring(d.filename,14,20), 
    d.line_number as line, 
    substring(d.value,1,16) as value,
    substring(le.err_reason,1,48) as err_reason
    from stl_loaderror_detail d, stl_load_errors le
    where d.query = le.query
    and d.query = pg_last_copy_id();
    query |    substring      | line |  value   |              err_reason
    -------+-------------------+------+----------+----------------------------
        558| allusers_pipe.txt |  251 | 251      | String contains invalid or unsupported UTF8 code
        558| allusers_pipe.txt |  251 | ZRU29FGR | String contains invalid or unsupported UTF8 code
        558| allusers_pipe.txt |  251 | Kaitlin  | String contains invalid or unsupported UTF8 code
        558| allusers_pipe.txt |  251 | Walter   | String contains invalid or unsupported UTF8 code
    

    The following example uses STL_LOAD_ERRORS with STV_TBL_PERM to create a new view, and then uses that view to determine what errors occurred while loading data into the EVENT table:

    create view loadview as
    (select distinct tbl, trim(name) as table_name, query, starttime,
    trim(filename) as input, line_number, colname, err_code,
    trim(err_reason) as reason
    from stl_load_errors sl, stv_tbl_perm sp
    where sl.tbl = sp.id);

    Next, the following query actually returns the last error that occurred while loading the EVENT table:

    select table_name, query, line_number, colname, starttime, 
    trim(reason) as error
    from loadview
    where table_name ='event'
    order by line_number limit 1;

    The query returns the last load error that occurred for the EVENT table. If no load errors occurred, the query returns zero rows. In this example, the query returns a single error:

    table_name | query | line_number | colname | error | starttime
    ------+-----+----+----+--------------------------------------------------------+----------------------
    event | 309 |  0 |  5 | Error in Timestamp value or format [%Y-%m-%d %H:%M:%S] | 2014-04-22 15:12:44
    
    (1 row)
    

    RUN QUERIES:

    Let us say we want to answer this question: "Do people who buy kitchen and grocery items leave higher ratings?"

    SELECT stars, COUNT(stars) AS total_ratings
    FROM product_reviews_json
    WHERE product_category = 'kitchen' OR product_category = 'grocery'
    GROUP BY stars;

    (Image: Miztiik Automation - AWS Redshift - Getting Started)

    VISUALIZING in Amazon QuickSight: To allow visualizing this data in QuickSight, the Redshift cluster has a security group that allows access from the QuickSight IP range in us-east-1. (Image: Miztiik Automation - AWS Redshift - Getting Started)
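
    If your own cluster's security group does not yet allow QuickSight, an ingress rule like the sketch below can be added. The security group ID is a placeholder, and the QuickSight CIDR shown for us-east-1 should be verified against the current QuickSight documentation.

    # A hedged sketch (boto3) that lets QuickSight reach the cluster on the
    # default Redshift port. GroupId is a placeholder; verify the QuickSight
    # IP range for us-east-1 in the QuickSight documentation before using it.
    import boto3

    ec2_client = boto3.client("ec2", region_name="us-east-1")
    ec2_client.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # placeholder: the cluster's security group
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 5439,
            "ToPort": 5439,
            "IpRanges": [{
                "CidrIp": "52.23.63.224/27",  # assumed QuickSight us-east-1 range
                "Description": "Amazon QuickSight us-east-1",
            }],
        }],
    )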

    Housekeeping Redshift: Over time, tables will fill up with data and you may have to prune or remove them. This query will give you the size of your tables. (Image: Miztiik Automation - AWS Redshift - Getting Started)

    SELECT "table", size, tbl_rows FROM SVV_TABLE_INFO;
  6. 📒 Conclusion

    Here we have demonstrated how to create a Redshift cluster, ingest data, and query and visualize that data. You can extend this idea to process the data with AWS Glue before ingesting it into the cluster.

  7. 🧹 CleanUp

    If you want to destroy all the resources created by the stack, execute the command below to delete it, or delete the stack from the console as well:

    • Resources created during Deploying The Application
    • Delete CloudWatch Lambda LogGroups
    • Any other custom resources you have created for this demo
    # Delete from cdk
    cdk destroy
    
    # Follow any on-screen prompts
    
    # Delete the CloudFormation stack, if you used CloudFormation to deploy it.
    aws cloudformation delete-stack \
      --stack-name "MiztiikAutomationStack" \
      --region "${AWS_REGION}"

    This is not an exhaustive list; please carry out any other steps that may be applicable to your setup.

📌 Who is using this

This repository aims to show new developers, solution architects, and ops engineers how to set up a Redshift data warehouse in AWS. Building on that knowledge, these Udemy courses (course #1, course #2) help you build complete architectures in AWS.

💡 Help/Suggestions or 🐛 Bugs

Thank you for your interest in contributing to our project. Whether it is a bug report, new feature, correction, or additional documentation or solutions, we greatly value feedback and contributions from our community. Start here

👋 Buy me a coffee

Buy me a coffee on ko-fi ☕.

📚 References

  1. Amazon Redshift best practices for loading data

  2. Sample Data

  3. Using a COPY command to load data

  4. Registry of Open Data on AWS

🏷️ Metadata

miztiik-success-green

Level: 300
