Loading & querying semi-structured data

Demo Video

Download the PowerPoint presentation

Before You Begin

No parameters are required for this demo.

Client Tool

For this demo we will use Query Editor v2 to connect to the Redshift cluster. Be sure to complete the SQL Client Tool setup before proceeding.

Challenge

Miguel has recently been tasked with analyzing a stream of data landing in S3 that contains Customer, Order, and Lineitem information for a new POS system. The data arrives as JSON documents in a real-time stream every few minutes. Most reporting needs to be accurate as of yesterday, and that reporting needs to be fast and efficient. In addition, some users may be interested in the data that arrives hourly, and a small population will require real-time updates. Miguel passes these requirements to Marie to build an ingestion pipeline. To meet the requirements, Marie will have to complete these tasks:

Load data using COPY

First, Marie will create a simple table to load the JSON data into a single field with the SUPER type. Using the SUPER type, Marie does not need to worry about the schema or underlying structure of the data. Notice the noshred option, which causes the entire JSON document to be loaded into one field.

Amazon Redshift supports attaching a default IAM role to your cluster. If you have enabled the default IAM role, you can reference it with the iam_role default keyword, as follows.

DROP TABLE IF EXISTS public.transaction_raw cascade;
CREATE TABLE public.transaction_raw (
	data_json super
);
copy public.transaction_raw
FROM 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem/customer_order_lineitem.json'
iam_role default region 'us-east-1'
json 'noshred';
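
If the COPY were to fail, a quick way to troubleshoot (sketched here as an optional check, not part of the original demo flow) is to inspect the STL_LOAD_ERRORS system table for the file, line, and reason of recent load errors.

-- Optional troubleshooting step: show the most recent load errors
select query, filename, line_number, err_reason
from stl_load_errors
order by starttime desc
limit 5;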

Marie can now inspect the data to see what the schema looks like. A simple select will reveal the JSON string. Marie can see that the top level of the JSON document contains the customer attributes c_custkey, c_phone, and c_acctbal. It also contains a nested array of structs for the customer's orders (c_orders), each of which contains another array of structs for the order's lineitems (o_lineitems).

SELECT * FROM public.transaction_raw LIMIT 1;

Now Marie will use the PartiQL language to parse the top-level attributes from the SUPER type field. Marie notices that certain fields have double-quote wrappers. This is because Redshift does not enforce types when querying from a SUPER field, which helps with schema evolution: if a data type changes over time, the load process will not fail. Also note that this query returns 5 rows. While only 5 customers are represented, there are multiple orders and lineitems, as we'll see when we unnest the data.

SELECT data_json.c_custkey, data_json.c_phone, data_json.c_acctbal FROM public.transaction_raw;

Marie can modify her query to explicitly type-cast the extracted fields so they display correctly.

SELECT data_json.c_custkey::int, data_json.c_phone::varchar(20), data_json.c_acctbal::decimal(18,2) FROM public.transaction_raw;
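
PartiQL navigation is not limited to top-level attributes. As an illustrative example, Marie could index into the nested c_orders array to pull the key and date of each customer's first order element, again casting the results explicitly:

-- Illustrative only: navigate into the nested array with an index
SELECT data_json.c_custkey::int,
       data_json.c_orders[0].o_orderkey::int,
       data_json.c_orders[0].o_orderdate::date
FROM public.transaction_raw;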

Load data using COPY with shredding

Marie decides the first level of the JSON document is not likely to change, and she would like the COPY command to parse and load those attributes into separate fields. Marie will replace the noshred option with auto, which causes the data to be loaded into fields that match the column names in the JSON objects. The column mapping uses a case-insensitive match because of the ignorecase keyword.

Amazon Redshift supports attaching a default IAM role to your cluster. If you have enabled the default IAM role, you can reference it with the iam_role default keyword, as follows.

drop table if exists public.transaction cascade;
create table public.transaction(
  c_custkey bigint,
  c_phone varchar(20),
  c_acctbal decimal(18,2),
  c_orders super  
);
copy public.transaction from 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem/customer_order_lineitem.json'
iam_role default region 'us-east-1'
json 'auto ignorecase';

Marie can now query the top-level attributes of this table without having to un-nest the data.

select t.c_custkey, t.c_phone, t.c_acctbal from public.transaction t;
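
Because c_orders is still a SUPER column, PartiQL navigation applies to it directly. As a quick illustration, Marie can peek at the first order element of each customer without un-nesting the full array:

-- Illustrative only: index into the remaining SUPER column
select t.c_custkey, t.c_orders[0].o_orderdate::date as first_order_date
from public.transaction t;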

Un-nest data using PartiQL

Marie now needs to un-nest c_orders, which is a complex data structure. Furthermore, the c_orders structure contains another nested field, o_lineitems. Marie will write a query that un-nests the data and determines the counts of customers, orders, and lineitems. Notice how the main table transaction is aliased as t and, to un-nest the data, t.c_orders is referenced in the FROM clause and aliased as o. Finally, to un-nest the lineitems, o.o_lineitems is also referenced in the FROM clause. Notice how many lineitems are represented in these 5 customer records.

SELECT count(distinct t.c_custkey) cust_count, count(distinct(o.o_orderkey)) order_count, count(1) lineitem_count
FROM public.transaction t, t.c_orders o, o.o_lineitems l;

To get a fully un-nested and flattened data set, Marie can reference different fields from the c_orders and o_lineitems structs, taking care to type-cast them as appropriate.

SELECT t.c_custkey::int, t.c_phone::varchar(20), t.c_acctbal::decimal(18,2), o.o_orderdate::date,
  o.o_orderstatus::varchar(10), l.l_quantity::int, l.l_shipmode::varchar(10), l.l_extendedprice::decimal(18,2)
FROM public.transaction t, t.c_orders o, o.o_lineitems l;
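
With the data flattened, standard SQL aggregation works as expected. For example, a report Miguel might plausibly run (shown here as an illustrative sketch): lineitem count and total extended price by ship mode.

-- Illustrative aggregation over the un-nested data
SELECT l.l_shipmode::varchar(10) AS ship_mode,
       count(1) AS lineitem_count,
       sum(l.l_extendedprice::decimal(18,2)) AS total_extended_price
FROM public.transaction t, t.c_orders o, o.o_lineitems l
GROUP BY 1
ORDER BY 3 DESC;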

Flatten data using MVs

Marie knows that Miguel and the other analysts will want the best performance when running reports against this data. Rather than expose the transaction table to Miguel and require him to run PartiQL queries, Marie decides to build a process that un-nests the data for him. She decides to use a materialized view with incremental refresh. Using this strategy, as new data is inserted into the base table, it can be added incrementally to the materialized view.

DROP MATERIALIZED VIEW IF EXISTS public.transaction_denorm;
CREATE MATERIALIZED VIEW public.transaction_denorm
BACKUP NO as
  select t.c_custkey::int, t.c_phone::varchar(20), t.c_acctbal::decimal(18,2), o.o_orderdate::date, o.o_orderstatus::varchar(10), l.l_quantity::int, l.l_shipmode::varchar(10), l.l_extendedprice::decimal(18,2)
  from public.transaction t, t.c_orders o, o.o_lineitems l;

Let’s get a quick count of the records which are now in the materialized view.

select count(1) from public.transaction_denorm;
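
To confirm that the materialized view is eligible for incremental refresh, Marie can consult the SVV_MV_INFO system view (a quick check, sketched here): the state column indicates whether a refresh will be incremental or a full recompute, and is_stale flags pending base-table changes.

-- Optional check: inspect the materialized view's refresh state
select name, is_stale, state
from svv_mv_info
where name = 'transaction_denorm';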

Marie wants to test the incremental refresh of the materialized view. To do this, she can run an INSERT statement to load data into the base transaction table. She’ll use the JSON_PARSE function to convert JSON in string form to the SUPER type.

INSERT INTO public.transaction VALUES (
  1234,
  '800-867-5309',
  441989.88,
  JSON_PARSE('[{
    "o_lineitems":[{
      "l_shipmode":"TRUCK",
      "l_quantity":4,
      "l_extendedprice":28007.64,
      "l_linenumber":1}],
    "o_orderdate":"2014-06-01",
    "o_orderstatus":"O",
    "o_orderkey":"1234"}]'));
REFRESH MATERIALIZED VIEW public.transaction_denorm;

Let’s see if the record we just added to the transaction table was propagated to transaction_denorm by the REFRESH command.

SELECT * FROM public.transaction_denorm where c_custkey = 1234;
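
To double-check that the REFRESH was applied incrementally rather than as a full recompute, Marie can look at the refresh history in the SVL_MV_REFRESH_STATUS system view (sketched here as an optional check); the status text describes how each refresh was performed.

-- Optional check: review recent refreshes of the materialized view
select mv_name, status, starttime, endtime
from svl_mv_refresh_status
where mv_name = 'transaction_denorm'
order by starttime desc;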

Query data using Spectrum

Marie plans to execute the ingestion and the materialized view refresh on an hourly basis to fulfill most reporting requirements. However, some users also want to see the data as it arrives. For this use case, Marie will expose the JSON data using Redshift Spectrum. She will first create the external schema.

Amazon Redshift supports attaching a default IAM role to your cluster. If you have enabled the default IAM role, you can reference it with the IAM_ROLE default keyword, as follows.

drop schema if exists pos;
CREATE external SCHEMA pos
FROM data catalog DATABASE 'pos'
IAM_ROLE default
CREATE external DATABASE if not exists;

Now, Marie can create the external table. Notice that when creating an external table, the schema for the struct data type must be defined. Compare this to the SUPER data type, which does not require a schema to be defined.

drop table if exists pos.transaction;
create external table pos.transaction (
  c_custkey int,
  c_phone varchar,
  c_acctbal decimal(18,2),   
  c_orders array<struct<
    o_orderstatus:varchar,
    o_clerk:varchar,
    o_lineitems:array<struct<
      l_returnflag:varchar,
      l_receiptdate:varchar,
      l_tax:varchar,
      l_shipmode:varchar,
      l_suppkey:varchar,
      l_shipdate:varchar,
      l_commitdate:varchar,
      l_partkey:varchar,
      l_quantity:varchar,
      l_linestatus:varchar,
      l_comment:varchar,
      l_extendedprice:varchar,
      l_linenumber:varchar,
      l_discount:varchar,
      l_shipinstruct:varchar>>,
    o_orderdate:varchar,
    o_shippriority:varchar,
    o_totalprice:varchar,
    o_orderkey:varchar,
    o_comment:varchar,
    o_orderpriority:varchar>>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
with serdeproperties ( 'paths'='json_data' )
location 's3://redshift-downloads/semistructured/tpch-nested/data/json/customer_orders_lineitem/';
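
As a quick sanity check (an optional step, sketched here), Marie can confirm that the external table was registered by querying the SVV_EXTERNAL_TABLES system view.

-- Optional check: verify the external table registration
select schemaname, tablename, location
from svv_external_tables
where schemaname = 'pos';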

Marie can create a view which matches the transaction_denorm table, but which references the external table pointing to the data in S3. Because the view references an external table, it must be created with no schema binding (a late-binding view).

create or replace view public.transaction_new as
select t.c_custkey::int, t.c_phone::varchar(20), t.c_acctbal::decimal(18,2), o.o_orderdate::date, o.o_orderstatus::varchar(10), l.l_quantity::int, l.l_shipmode::varchar(10), l.l_extendedprice::decimal(18,2)
from pos.transaction t, t.c_orders o, o.o_lineitems l
with no schema binding;

Finally, let’s inspect the view.

select * from public.transaction_new;
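
As a final illustrative comparison, Marie can put the two paths side by side. Note that the counts may differ: the materialized view also contains the test row inserted directly into the base table, which is not present in the S3 data.

-- Illustrative only: compare the hourly (MV) and real-time (Spectrum) paths
select (select count(1) from public.transaction_denorm) as mv_count,
       (select count(1) from public.transaction_new) as spectrum_count;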

Before You Leave

Execute the following to clear out any changes you made during the demo.

DROP VIEW IF EXISTS public.transaction_new;
DROP TABLE IF EXISTS pos.transaction;
DROP SCHEMA IF EXISTS pos;
DROP MATERIALIZED VIEW IF EXISTS public.transaction_denorm;
DROP TABLE IF EXISTS public.transaction cascade;
DROP TABLE IF EXISTS public.transaction_raw cascade;

If you are done using your cluster, please consider decommissioning it to avoid paying for unused resources. For more documentation on ingesting and querying semi-structured data in Amazon Redshift, refer to the documentation here: https://docs.aws.amazon.com/redshift/latest/dg/super-overview.html