Loading & querying semi-structured data

Demo Video

Download the PowerPoint presentation

Before You Begin

No parameters are required for this demo.

Client Tool

For this demo we will use Query Editor v2 to connect to the Redshift cluster. Be sure to complete the SQL Client Tool setup before proceeding.

Challenge

Miguel has recently been tasked with analyzing a stream of data landing in S3 that contains Customer, Order, and Lineitem information for a new POS system. The data arrives as JSON documents in a real-time stream every few minutes. Most reporting needs to be accurate as of yesterday, and that reporting needs to be fast and efficient. In addition, some users may be interested in the data that arrives hourly, and a small population will require real-time updates. Miguel passes these requirements to Marie to build an ingestion pipeline. To meet the requirements, Marie will have to complete these tasks:

Load data using COPY

First, Marie will create a simple table to load the JSON data into a single field with the SUPER type. Using the SUPER type, Marie does not need to worry about the schema or underlying structure of the data. Notice the noshred option, which causes the entire JSON document to be loaded into one field.

Amazon Redshift supports attaching a default IAM role to your cluster. If you have enabled the default IAM role, you can reference it with the iam_role default keyword, as follows.

DROP TABLE IF EXISTS public.transaction_raw cascade;
CREATE TABLE public.transaction_raw (
	data_json super
);
copy public.transaction_raw
FROM 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem/customer_order_lineitem.json'
iam_role default region 'us-east-1'
json 'noshred';
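
If the COPY were to fail, a quick way to troubleshoot (sketched here as an optional check, not part of the original demo flow) is to inspect the STL_LOAD_ERRORS system table for the file, line, and reason of recent load errors.

-- Optional troubleshooting step: show the most recent load errors
select query, filename, line_number, err_reason
from stl_load_errors
order by starttime desc
limit 5;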

Marie can now inspect the data to see what the schema looks like. A simple select will reveal the JSON string. Marie can see that the top level of the JSON document contains the customer attributes c_custkey, c_phone, and c_acctbal. It also contains a nested array of structs for the customer's orders (c_orders), each of which contains another array of structs for the order's lineitems (o_lineitems).

SELECT * FROM public.transaction_raw LIMIT 1;

Now Marie will use the PartiQL language to parse the top-level attributes from the SUPER type field. Marie notices that certain fields have double-quote wrappers. This is because Redshift does not enforce types when querying from a SUPER field, which helps with schema evolution: if a data type changes over time, the load process will not fail. Also note that this query returns 5 rows. While only 5 customers are represented, there are multiple orders and lineitems, as we'll see when we unnest the data.

SELECT data_json.c_custkey, data_json.c_phone, data_json.c_acctbal FROM public.transaction_raw;

Marie can modify her query to explicitly type-cast the extracted fields so they display correctly.

SELECT data_json.c_custkey::int, data_json.c_phone::varchar(20), data_json.c_acctbal::decimal(18,2) FROM public.transaction_raw;
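
PartiQL navigation is not limited to top-level attributes. As an illustrative example, Marie could index into the nested c_orders array to pull the key and date of each customer's first order element, again casting the results explicitly:

-- Illustrative only: navigate into the nested array with an index
SELECT data_json.c_custkey::int,
       data_json.c_orders[0].o_orderkey::int,
       data_json.c_orders[0].o_orderdate::date
FROM public.transaction_raw;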

Load data using COPY with shredding

Marie decides the first level of the JSON document is not likely to change, and she would like the COPY command to parse and load those attributes into separate fields. Marie will replace the noshred option with auto, which causes the data to be loaded into fields that match the column names in the JSON objects. The column mapping uses a case-insensitive match because of the ignorecase keyword.

Amazon Redshift supports attaching a default IAM role to your cluster. If you have enabled the default IAM role, you can reference it with the iam_role default keyword, as follows.

drop table if exists public.transaction cascade;
create table public.transaction(
  c_custkey bigint,
  c_phone varchar(20),
  c_acctbal decimal(18,2),
  c_orders super  
);
copy public.transaction from 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem/customer_order_lineitem.json'
iam_role default region 'us-east-1'
json 'auto ignorecase';

Marie can now query the top-level attributes of this table without having to un-nest the data.

select t.c_custkey, t.c_phone, t.c_acctbal from public.transaction t;
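
Because c_orders is still a SUPER column, PartiQL navigation applies to it directly. As a quick illustration, Marie can peek at the first order element of each customer without un-nesting the full array:

-- Illustrative only: index into the remaining SUPER column
select t.c_custkey, t.c_orders[0].o_orderdate::date as first_order_date
from public.transaction t;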

Un-nest data using PartiQL

Marie now needs to un-nest c_orders, which is a complex data structure. Furthermore, the c_orders structure contains another nested field, o_lineitems. Marie will write a query that un-nests the data and determines the counts of customers, orders, and lineitems. Notice how the main table transaction is aliased as t and, to un-nest the data, t.c_orders is referenced in the FROM clause and aliased as o. Finally, to un-nest the lineitems, o.o_lineitems is also referenced in the FROM clause. Notice how many lineitems are represented in these 5 customer records.

SELECT count(distinct t.c_custkey) cust_count, count(distinct(o.o_orderkey)) order_count, count(1) lineitem_count
FROM public.transaction t, t.c_orders o, o.o_lineitems l;

To get a fully un-nested and flattened data set, Marie can reference different fields from the c_orders and o_lineitems structs, taking care to type-cast them as appropriate.

SELECT t.c_custkey::int, t.c_phone::varchar(20), t.c_acctbal::decimal(18,2), o.o_orderdate::date,
  o.o_orderstatus::varchar(10), l.l_quantity::int, l.l_shipmode::varchar(10), l.l_extendedprice::decimal(18,2)
FROM public.transaction t, t.c_orders o, o.o_lineitems l;
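
With the data flattened, standard SQL aggregation works as expected. For example, a report Miguel might plausibly run (shown here as an illustrative sketch): lineitem count and total extended price by ship mode.

-- Illustrative aggregation over the un-nested data
SELECT l.l_shipmode::varchar(10) AS ship_mode,
       count(1) AS lineitem_count,
       sum(l.l_extendedprice::decimal(18,2)) AS total_extended_price
FROM public.transaction t, t.c_orders o, o.o_lineitems l
GROUP BY 1
ORDER BY 3 DESC;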

Flatten data using MVs

Marie knows that Miguel and the other analysts will want the best performance when running reports against this data. Rather than expose the transaction table to Miguel and require him to run PartiQL queries, Marie decides to build a process that un-nests the data for him. She decides to use a materialized view with incremental refresh. Using this strategy, as new data is inserted into the base table, it can be added incrementally to the materialized view.

DROP MATERIALIZED VIEW IF EXISTS public.transaction_denorm;
CREATE MATERIALIZED VIEW public.transaction_denorm
BACKUP NO as
  select t.c_custkey::int, t.c_phone::varchar(20), t.c_acctbal::decimal(18,2), o.o_orderdate::date, o.o_orderstatus::varchar(10), l.l_quantity::int, l.l_shipmode::varchar(10), l.l_extendedprice::decimal(18,2)
  from public.transaction t, t.c_orders o, o.o_lineitems l;

Let’s get a quick count of the records which are now in the materialized view.

select count(1) from public.transaction_denorm;
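
To confirm that the materialized view is eligible for incremental refresh, Marie can consult the SVV_MV_INFO system view (a quick check, sketched here): the state column indicates whether a refresh will be incremental or a full recompute, and is_stale flags pending base-table changes.

-- Optional check: inspect the materialized view's refresh state
select name, is_stale, state
from svv_mv_info
where name = 'transaction_denorm';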

Marie wants to test the incremental refresh of the materialized view. To do this, she can run an INSERT statement to load data into the base transaction table. She’ll use the JSON_PARSE function to convert JSON in string form to the SUPER type.

INSERT INTO public.transaction VALUES (
  1234,
  '800-867-5309',
  441989.88,
  JSON_PARSE('[{
    "o_lineitems":[{
      "l_shipmode":"TRUCK",
      "l_quantity":4,
      "l_extendedprice":28007.64,
      "l_linenumber":1}],
    "o_orderdate":"2014-06-01",
    "o_orderstatus":"O",
    "o_orderkey":"1234"}]'));
REFRESH MATERIALIZED VIEW public.transaction_denorm;

Let’s see if the record we just added to the transaction table was propagated to transaction_denorm by the REFRESH command.

SELECT * FROM public.transaction_denorm where c_custkey = 1234;
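
To double-check that the REFRESH was applied incrementally rather than as a full recompute, Marie can look at the refresh history in the SVL_MV_REFRESH_STATUS system view (sketched here as an optional check); the status text describes how each refresh was performed.

-- Optional check: review recent refreshes of the materialized view
select mv_name, status, starttime, endtime
from svl_mv_refresh_status
where mv_name = 'transaction_denorm'
order by starttime desc;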

Query data using Spectrum

Marie plans to execute the ingestion and the materialized view refresh on an hourly basis to fulfill most reporting requirements. However, some users also want to see the data as it arrives. For this use case, Marie will expose the JSON data using Redshift Spectrum. She will first create the external schema.

Amazon Redshift supports attaching a default IAM role to your cluster. If you have enabled the default IAM role, you can reference it with the IAM_ROLE default keyword, as follows.

drop schema if exists pos;
CREATE external SCHEMA pos
FROM data catalog DATABASE 'pos'
IAM_ROLE default
CREATE external DATABASE if not exists;

Now, Marie can create the external table. Notice that when creating an external table, the schema for the struct data type must be defined. Compare this to the SUPER data type, which does not require a schema to be defined.

drop table if exists pos.transaction;
create external table pos.transaction (
  c_custkey int,
  c_phone varchar,
  c_acctbal decimal(18,2),   
  c_orders array<struct<
    o_orderstatus:varchar,
    o_clerk:varchar,
    o_lineitems:array<struct<
      l_returnflag:varchar,
      l_receiptdate:varchar,
      l_tax:varchar,
      l_shipmode:varchar,
      l_suppkey:varchar,
      l_shipdate:varchar,
      l_commitdate:varchar,
      l_partkey:varchar,
      l_quantity:varchar,
      l_linestatus:varchar,
      l_comment:varchar,
      l_extendedprice:varchar,
      l_linenumber:varchar,
      l_discount:varchar,
      l_shipinstruct:varchar>>,
    o_orderdate:varchar,
    o_shippriority:varchar,
    o_totalprice:varchar,
    o_orderkey:varchar,
    o_comment:varchar,
    o_orderpriority:varchar>>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties ( 'paths'='json_data' )
location 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem/';
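
As a quick sanity check (an optional step, sketched here), Marie can confirm that the external table was registered by querying the SVV_EXTERNAL_TABLES system view.

-- Optional check: verify the external table registration
select schemaname, tablename, location
from svv_external_tables
where schemaname = 'pos';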

Marie can create a view which matches the transaction_denorm table, but which references the external table pointing to the data in S3. Because the view references an external table, it must be created with no schema binding (a late-binding view).

create or replace view public.transaction_new as
select t.c_custkey::int, t.c_phone::varchar(20), t.c_acctbal::decimal(18,2), o.o_orderdate::date, o.o_orderstatus::varchar(10), l.l_quantity::int, l.l_shipmode::varchar(10), l.l_extendedprice::decimal(18,2)
from pos.transaction t, t.c_orders o, o.o_lineitems l
with no schema binding;

Finally, let’s inspect the view.

select * from public.transaction_new;
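
As a final illustrative comparison, Marie can put the two paths side by side. Note that the counts may differ: the materialized view also contains the test row inserted directly into the base table, which is not present in the S3 data.

-- Illustrative only: compare the hourly (MV) and real-time (Spectrum) paths
select (select count(1) from public.transaction_denorm) as mv_count,
       (select count(1) from public.transaction_new) as spectrum_count;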

Before You Leave

Execute the following to clear out any changes you made during the demo.

DROP VIEW IF EXISTS public.transaction_new;
DROP TABLE IF EXISTS pos.transaction;
DROP SCHEMA IF EXISTS pos;
DROP MATERIALIZED VIEW IF EXISTS public.transaction_denorm;
DROP TABLE IF EXISTS public.transaction cascade;
DROP TABLE IF EXISTS public.transaction_raw cascade;

If you are done using your cluster, please consider decommissioning it to avoid paying for unused resources. For more documentation on ingesting and querying semi-structured data in Amazon Redshift, refer to the documentation here: https://docs.aws.amazon.com/redshift/latest/dg/super-overview.html