Working with the Data Lake

Demo Video

Before you Begin

Capture the following parameters from the launched CloudFormation template, as you will use them throughout the demo.

https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks?filteringText=&filteringStatus=active&viewNested=true&hideStacks=false
  • GlueExternalDatabaseName
  • TaskDataBucketName

Client Tool

The screenshots in this demo use SQLWorkbench/J as the SQL client tool. Be sure to complete the SQL Client Tool setup before proceeding. Other SQL clients that can connect to Amazon Redshift may be used as alternatives to SQLWorkbench/J.

Challenge

Miguel has recently been tasked with solving a few different business problems using data related to their website traffic and purchasing behavior. He asked Marie how he could access the product review data. Marie recently added automated ingestion that loads product review and web analytics data into their new data lake in S3, and the data is available in the Glue Data Catalog. To solve this challenge Miguel will do the following:

  1. Sample the Data
  2. Connect the Data to Redshift
  3. Analyze the Data
  4. Extend the Data Warehouse with the Data Lake

Sample the Data

Say Do Show
First, let’s take a look at the raw data in S3. You’ll notice that the data is partitioned by product category. Navigate to: https://s3.console.aws.amazon.com/s3/buckets/amazon-reviews-pds/parquet/?region=us-east-1
Next, let’s dive deeper into the Apparel category and look at the files. Navigate to: https://s3.console.aws.amazon.com/s3/buckets/amazon-reviews-pds/parquet/product_category=Apparel?region=us-east-1 or just click on the product_category=Apparel folder.
Prior to starting the demo today, this data was cataloged using an AWS Glue Crawler. Next, let’s take a look at the Glue Data Catalog and see some metadata about this table. Navigate to the Glue table list: https://console.aws.amazon.com/glue/home?region=us-east-1#catalog:tab=tables. Click on the table associated with the <GlueExternalDatabaseName> database.
Inspecting the table metadata, we see some interesting things. Highlight the Parquet SerDe and the recordCount of over 160 million.
Inspecting the table schema, we see the column data types and the partition columns. Highlight that product_category is set as the partition key.
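If you would like to sample a few rows with SQL at this point, the cataloged table can also be queried directly from the Amazon Athena query editor. Athena is not part of this demo's setup; this is only an optional sketch, and it assumes the crawler registered the table under the name parquet.

-- Optional: sample the cataloged data from Athena (table name assumed to be "parquet")
SELECT marketplace, product_title, star_rating, verified_purchase, review_date
FROM "<GlueExternalDatabaseName>"."parquet"
WHERE product_category = 'Apparel'   -- partition column, so only the Apparel files are scanned
LIMIT 10;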

Connect the Data to Redshift

Say Do Show
Let’s now connect our Redshift Cluster to this External Database for querying.

Replace the value for <GlueExternalDatabaseName> with the value previously determined.

Amazon Redshift supports attaching a default IAM role to the cluster. If you have enabled the default IAM role on your cluster, you can reference it with the IAM_ROLE default keyword, as shown below. Follow the blog for details.

Execute the following command:

DROP SCHEMA IF EXISTS demo;
CREATE EXTERNAL SCHEMA IF NOT EXISTS demo
FROM DATA CATALOG DATABASE '<GlueExternalDatabaseName>'
IAM_ROLE default;
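
If the default IAM role has not been enabled on your cluster, the same statement can reference an explicit role ARN instead. A minimal sketch, where the ARN below is a placeholder for the Spectrum-capable role attached to your cluster:

CREATE EXTERNAL SCHEMA IF NOT EXISTS demo
FROM DATA CATALOG DATABASE '<GlueExternalDatabaseName>'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<your-redshift-spectrum-role>';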

Let’s do a quick check to confirm we can access the data. Note that the 160+ million record count matches the number reported by the Glue Crawler.

Execute the following:

select count(1) from demo.parquet;
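
Optionally, you can also confirm the mapping from Redshift's system views. A quick sketch, assuming the crawler registered the table as parquet:

-- External tables visible through the new demo schema
select schemaname, tablename, location
from svv_external_tables
where schemaname = 'demo';

-- Partition values that Redshift Spectrum can prune on
select "values", location
from svv_external_partitions
where schemaname = 'demo' and tablename = 'parquet';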

Analyze the Data

Say Do Show
Looks interesting! Now Miguel wants to aggregate the review data by product, computing the number of reviews and the average rating. So this data can be reused in future queries, we’ll create a view named product_reviews_agg. Execute the following:

drop view if exists public.product_reviews_agg;
create view public.product_reviews_agg AS
select
  marketplace, product_category, product_parent, product_title, year,
  COUNT(1) as review_rating_total,
  AVG(star_rating) as review_rating_avg,
  (SUM(CASE WHEN verified_purchase='Y' THEN 1 ELSE 0 END) * 100.0 / COUNT(1)) as verified_pct
FROM  demo.parquet
GROUP BY 1,2,3,4,5
with no schema binding;

Now, let’s find the most reviewed products in the Home and Grocery categories using the view we just created. Execute the following:

select
  marketplace, product_category, product_parent, product_title, year,
  review_rating_total, review_rating_avg, verified_pct
FROM public.product_reviews_agg
WHERE product_category in ('Home' , 'Grocery')
and marketplace = 'US'
and review_rating_total >4000
and verified_pct >99
order BY product_category, product_parent, product_title, year;

Miguel quickly finds that “San Francisco Bay One Cup Coffee” is the most reviewed product. Let’s dig deeper to understand how this product is perceived over time by the customers who actually purchased it, using 30- and 90-review rolling averages to look for trends. Execute the following:

select
  A.marketplace, A.product_category, A.product_parent,
  A.product_title, A.review_date,
  date_part(epoch, A.review_date) as ts_review,
  COUNT(1) over (order by A.review_date rows 10 preceding) as total,
  AVG(CAST(A.star_rating as decimal(3,2))) over (order by A.review_date, A.star_rating rows 30 preceding) as avg30,
  AVG(CAST(A.star_rating as decimal(3,2))) over (order by A.review_date, A.star_rating rows 90 preceding) as avg90
  FROM demo.parquet A
  WHERE A.product_parent = 267956568
  ORDER BY A.review_date
  limit 1000;

Extend the Data Warehouse with the Data Lake

Say Do Show
After looking at the review trends for the SF Bay Coffee product, Miguel wants to gain further insight into the customers across all the product categories. He also wants to leverage the existing customer demographic information in Redshift to augment the data. To support his reporting audience, Miguel will place the combined data set into a new Redshift table.

Execute the following which will combine the loaded customer data with the reviews into a new table:

drop table if exists public.customer_trends;
create table public.customer_trends as
select
  A.product_category,A.product_parent,
  A.review_date as Review_Date,
  C.ca_state as Customer_State,
  AVG(CAST(A.star_rating as decimal(3,2)))
    over (partition by product_parent
          order by A.review_date, A.star_rating rows 30 preceding)
    as review_rating_30rolling_avg,
  AVG(CAST(A.star_rating as decimal(3,2)))
    over (partition by product_parent
          order by A.review_date, A.star_rating rows 90 preceding)
    as review_rating_90rolling_avg
from
  demo.parquet  A,
  public.customer B,
  public.customer_address C
where A.customer_id = B.c_customer_sk
  and B.c_current_addr_sk = C.ca_address_sk
  and marketplace = 'US'
  and A.review_date > '2015-01-01'
  order by A.product_parent, A.review_date;
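
As an optional sanity check before sharing the table, you can verify the row count and the date range it covers; a minimal sketch:

-- Quick validation of the new table
select count(*) as total_rows,
       min(review_date) as first_review,
       max(review_date) as last_review
from public.customer_trends;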


The results are very useful! Miguel wants to export them back to the data lake to share with other teams. He can easily unload data from Redshift to the data lake so that other tools can process it when needed. Note: UNLOAD supports dynamic partitioning and the Parquet format; by selecting a partition column, we can optimize the output for query performance.

Replace the value for <TaskDataBucketName> with the value previously determined.

Amazon Redshift supports attaching a default IAM role to the cluster. If you have enabled the default IAM role on your cluster, you can reference it with the IAM_ROLE default keyword, as shown below. Follow the blog for details.

Execute the following which will export the combined table to a new S3 location:

unload ('select * from public.customer_trends')
to 's3://<TaskDataBucketName>/rll-datalake/report/'
iam_role default
FORMAT PARQUET
ALLOWOVERWRITE
PARTITION BY (product_category);

Let’s check the exported data in S3.

Replace the value for <TaskDataBucketName> with the value previously determined.

Navigate to the exported location in the S3 console:

https://s3.console.aws.amazon.com/s3/buckets/<TaskDataBucketName>/rll-datalake/report/
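Other teams can now query the exported files in place. One option is to run a Glue crawler over this prefix; another is to register the export manually as a partitioned external table. The sketch below is only illustrative: the column types are assumptions and should be verified against the Parquet files written by UNLOAD, and the partition value shown is just an example.

-- Hedged sketch: register the export so it can be queried via Redshift Spectrum.
-- Column types below are assumptions; verify them against the actual Parquet schema.
CREATE EXTERNAL TABLE demo.customer_trends_report (
  product_parent               int,
  review_date                  date,
  customer_state               varchar(2),
  review_rating_30rolling_avg  decimal(10,6),
  review_rating_90rolling_avg  decimal(10,6)
)
PARTITIONED BY (product_category varchar(64))
STORED AS PARQUET
LOCATION 's3://<TaskDataBucketName>/rll-datalake/report/';

-- Partitions written by UNLOAD must be added explicitly, one per category value, e.g.:
ALTER TABLE demo.customer_trends_report
ADD IF NOT EXISTS PARTITION (product_category='Grocery')
LOCATION 's3://<TaskDataBucketName>/rll-datalake/report/product_category=Grocery/';

If you create this table, drop it (DROP TABLE demo.customer_trends_report;) before the cleanup below so nothing lingers in the Glue Data Catalog.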

Before you Leave

Execute the following to clear out any changes you made during the demo.

DROP VIEW   public.product_reviews_agg;
DROP TABLE  public.customer_trends;
DROP SCHEMA demo;

Navigate to S3 and empty your <TaskDataBucketName> bucket.

If you are done using your cluster, consider deleting the CFN stack. Otherwise, to avoid paying for unused resources, do the following:

  • pause your Redshift Cluster
  • stop the Oracle database
  • stop the DMS task