Sales forecasting using Python

Demo Video

Client Tool

This demo uses a SageMaker Jupyter notebook. The screenshots in this demo use QuickSight as the BI tool. Be sure to complete the BI Tool setup before proceeding.
Please go through Demo preparation first, then open the notebook sales_forecasting.ipynb.

Challenge

Kim (Data Scientist) has been tasked with tracking whether the sales team is on track to hit its quarterly sales target, and determining whether there is enough workforce to keep up with demand. To achieve this, Kim plans to build a machine learning model that forecasts sales for the next 30 days, which will help drive decisions such as hiring, resource management, planning, and budgeting. To accomplish this, Kim will follow the steps below.

Python - ML WorkFlow

Data Preparation

SayDo Show
We have a year of historical sales transaction data stored on S3. First, Kim creates an external table pointing to it so she can examine the dataset quickly without loading it into the data warehouse.
Go to the sales_forecasting notebook and execute steps 2 and 3.
Kim found that the data is not clean. For example, in some transactions CustomerID is null or the sale item quantity is null, and some records are for damaged goods, wrong orders, or samples. These should be excluded from the forecast. She therefore decided to load the data into the Redshift data warehouse and leverage its power to clean up and transform the data efficiently. Go to the sales_forecasting notebook and execute step 4.
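The cleanup logic can be sketched in pandas as well. This is a minimal illustration only; the column names (customerid, quantity, description) and the tiny in-memory dataset are assumptions standing in for the real transaction table, and the actual demo performs this filtering with SQL in Redshift.

```python
import pandas as pd

# Hypothetical raw transactions resembling the demo dataset
raw = pd.DataFrame({
    "customerid": [17850, None, 13047, 12583],
    "quantity": [6, 2, None, -1],
    "unitprice": [2.55, 3.39, 2.75, 1.65],
    "description": ["WHITE MUG", "DAMAGED", "RED LAMP", "SAMPLE"],
})

# Keep only rows usable for forecasting: known customer, positive quantity,
# and no damaged/sample/wrong-order records
clean = raw[
    raw["customerid"].notna()
    & raw["quantity"].notna()
    & (raw["quantity"] > 0)
    & ~raw["description"].str.contains("DAMAGED|SAMPLE|WRONG", case=False)
]
print(len(clean))  # 1 row survives
```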

Model Training and Evaluation

SayDo Show
Kim uses Redshift to aggregate the full year of sales transaction data into a daily sales time series and loads it into the notebook, then splits the data into train and test sets for model training and evaluation.

For standard machine learning tasks such as classification and regression, the split is typically obtained by randomly separating examples into train and test sets. In forecasting, however, it is important to split based on time: train on earlier observations and evaluate on later ones, never by random sampling.
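A time-based split can be sketched as follows. The dates and synthetic values are assumptions for illustration; the notebook uses the real daily series loaded from Redshift.

```python
import pandas as pd
import numpy as np

# Hypothetical daily sales series (one value per day for a year)
dates = pd.date_range("2010-12-01", periods=365, freq="D")
daily = pd.Series(np.random.default_rng(0).uniform(1000, 5000, 365), index=dates)

# Time-based split: train on everything up to the cutoff,
# hold out the final 30 days for evaluation -- never a random shuffle
cutoff = daily.index.max() - pd.Timedelta(days=30)
train = daily[daily.index <= cutoff]
test = daily[daily.index > cutoff]
print(len(train), len(test))  # 335 30
```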

Go to the sales_forecasting notebook and execute step 5.
After trying various models, including the autoregressive (AR) model, the moving average (MA) model, and the AutoRegressive Integrated Moving Average (ARIMA) model, Kim determined that SARIMA is the best fit: it models seasonal correlation, so it can capture the seasonality of the time series and provide more accurate predictions.

To find the right parameters more easily and quickly, Kim uses auto_arima to tune the model automatically.
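What auto_arima does under the hood is search over candidate model orders and keep the one with the lowest information criterion (AIC). The sketch below illustrates that idea in a deliberately simplified form, using plain numpy least-squares AR(p) fits instead of the full SARIMA search; the synthetic series and the ar_aic helper are assumptions for illustration, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical series with AR(2)-like structure
y = np.zeros(300)
for t in range(2, 300):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

def ar_aic(y, p):
    """Fit AR(p) by least squares and return its AIC."""
    n = len(y) - p
    # Column i holds the lag-(i+1) values aligned with the targets y[p:]
    X = np.column_stack([y[p - i - 1:len(y) - i - 1] for i in range(p)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    rss = np.sum((y[p:] - X @ coef) ** 2)
    return n * np.log(rss / n) + 2 * (p + 1)

# Pick the lag order with the lowest AIC, analogous to auto_arima's search
best_p = min(range(1, 6), key=lambda p: ar_aic(y, p))
print(best_p)
```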

Go to the sales_forecasting notebook and execute step 6.

Depending on your notebook instance type, this step can take 2-5 minutes.

In the output of step 6, scroll down to find and show the model summary.

Forecasting

SayDo Show
Now Kim can forecast the next 30 days using the trained SARIMA model. We use Python's matplotlib for visualization; it integrates with Pandas DataFrames, so we can easily plot the forecast result in the notebook. Go to the sales_forecasting notebook and execute step 7.
To analyze the forecast result further and build a dashboard on it, Kim stores the result back into the data warehouse in the table ml.sales_forecast. If the result set is small, this can be done directly with a Pandas DataFrame, which can insert the result into the target database in batches. Go to the sales_forecasting notebook and execute step 8.
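The shape of the forecast result can be sketched as a DataFrame with a date index, a point forecast, and confidence bounds. The column names, dates, and synthetic values here are assumptions for illustration; in the notebook they come from the fitted SARIMA model's prediction output.

```python
import pandas as pd
import numpy as np

# Hypothetical model output: point forecasts plus confidence bounds
# (in the notebook these come from the fitted SARIMA model)
rng = np.random.default_rng(2)
point = rng.uniform(2000, 4000, 30)
forecast = pd.DataFrame({
    "total_sales": point,
    "lower": point * 0.8,
    "upper": point * 1.2,
}, index=pd.date_range("2011-07-19", periods=30, freq="D", name="sales_date"))

# Because the index is a DatetimeIndex, forecast.plot() in the notebook
# draws all three series against the date axis with matplotlib
print(forecast.shape)  # (30, 3)
```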
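The batch write-back can be sketched with DataFrame.to_sql. In the demo the target is the Redshift table ml.sales_forecast; an in-memory SQLite database stands in here so the sketch is runnable, and the frame's contents are assumed values for illustration.

```python
import sqlite3
import pandas as pd

# Hypothetical 30-day forecast frame (mirrors the notebook's result)
forecast = pd.DataFrame({
    "sales_date": pd.date_range("2011-07-19", periods=30, freq="D"),
    "total_sales": 3000.0,
    "lower": 2400.0,
    "upper": 3600.0,
})

# Stand-in target: in the real demo this is a Redshift connection
# and the table ml.sales_forecast
conn = sqlite3.connect(":memory:")
forecast.to_sql("sales_forecast", conn, index=False, if_exists="replace")

rows = conn.execute("select count(*) from sales_forecast").fetchone()[0]
print(rows)  # 30
```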
Kim also uses the SARIMA model to analyze popular product trends and forecast sales for those products.

Note: It’s safe to ignore the warning here.

Go to sales_forecasting notebook and execute step 9, then select a product and click the Run Interact button.

Creating Dashboards

SayDo Show
Now Kim can visualize the forecast for the next 30 days using QuickSight dashboards and share the results with her team.

The first step in building a QuickSight dashboard is to set up the dataset.

Navigate to QuickSight and create a new dataset using the source created earlier.
The actual sales data is stored in the “sales_clean” table in the “ml” schema. We will visualize it first to compare with the forecast data. Select the “ml” schema from the dropdown list and select the table “sales_clean” under the Tables section. Click the “Use custom SQL” button.
We will use “Custom SQL” to select the required data and add weekly and monthly buckets. Name the query “DS Sales History SQL”, enter the SQL code below into the “Custom SQL” section, and click “Confirm query”. On the next screen, click the Visualize button.

select 'S' as sflag,
       trunc(invoicedate) as salesdate,
       DATE_PART(y, invoicedate) || '/' || DATE_PART(w, invoicedate) as week,
       to_date(DATE_PART(y, invoicedate) || '/' || DATE_PART(mon, invoicedate), 'YYYY/MM') as month,
       DATE_PART(y, invoicedate) || '/' || DATE_PART(qtr, invoicedate) as quarter,
       sum(quantity*unitprice) as totalsales
from ml.sales_clean
where invoicedate < '2011-07-18'
group by salesdate, week, month, quarter
order by salesdate

We will use a line chart to show historical sales over time.

For the “Sales by day” visual, choose the following parameters:

  • Dataset: “DS Sales History SQL”
  • x-axis = salesdate
  • value = totalsales

We will now add a second dataset and visual for the “Forecast for next 30 days” and create a new visualization.

Enter the SQL code below into the “Custom SQL” section, click Finish, then click “Save” and name the dataset “DS Forecast SQL”.

select 'F' as sflag,
       trunc(sales_date) as salesdate,
       DATE_PART(y, sales_date) || '/' || DATE_PART(w, sales_date) as week,
       to_date(DATE_PART(y, sales_date) || '/' || DATE_PART(mon, sales_date), 'YYYY/MM') as month,
       DATE_PART(y, sales_date) || '/' || DATE_PART(qtr, sales_date) as quarter,
       sum(total_sales) as totalsales,
       sum(lower) as lower,
       sum(upper) as upper
from ml.sales_forecast
where sales_date >= '2011-07-01'
group by salesdate, week, month, quarter
order by salesdate

Let’s format the visuals so the y-axes align.

  1. Click on Format Visual on each chart
  2. Set Range to a custom range of 80000 for both the “Sales by Day” and “Forecast for next 30 days” charts
  3. Set Axis Step to 20000 for both the “Sales by Day” and “Forecast for next 30 days” charts

If Kim wants to show Sales and Forecast in one view, she needs to combine the sales and forecast data. For this we can create a view in the Redshift data warehouse. In the Redshift Query Editor, create a view using the SQL below.

create view ml.v_sales_and_forecast as
select 'S' as sflag,
       trunc(invoicedate) as salesdate,
       DATE_PART(y, invoicedate) || '/' || DATE_PART(w, invoicedate) as week,
       to_date(DATE_PART(y, invoicedate) || '/' || DATE_PART(mon, invoicedate), 'YYYY/MM') as month,
       DATE_PART(y, invoicedate) || '/' || DATE_PART(qtr, invoicedate) as quarter,
       sum(quantity*unitprice) as totalsales
from ml.sales_clean
group by salesdate, week, month, quarter
UNION
select 'F' as sflag,
       trunc(sales_date) as salesdate,
       DATE_PART(y, sales_date) || '/' || DATE_PART(w, sales_date) as week,
       to_date(DATE_PART(y, sales_date) || '/' || DATE_PART(mon, sales_date), 'YYYY/MM') as month,
       DATE_PART(y, sales_date) || '/' || DATE_PART(qtr, sales_date) as quarter,
       sum(total_sales) as totalsales
from ml.sales_forecast
group by salesdate, week, month, quarter
order by salesdate;

We will now add a third dataset, “v_sales_and_forecast”, and create a visualization for “Sales and Forecast by Month”.

Click on “Add”, then click on “Add data set”. From the list, choose “v_sales_and_forecast” and click Select.

Select the dataset “v_sales_and_forecast” and click the “+Add” button to add a visual for the new dataset. Name the third visualization “Sales and Forecast by Month” and choose the following parameters: Dataset: v_sales_and_forecast; x-axis = month; value = totalsales. For Visual Type, choose Bar Chart.

  1. Click on Format Visual for “Sales and Forecast by Month” chart
  2. Set Range to custom range to 800000 for “Sales and Forecast by Month”
  3. Set Axis Step to 200000 for “Sales and Forecast by Month”

Before you Leave

To clear out any QuickSight changes you made during the demo, delete the Dashboard, Analysis, and Dataset you created.

Please execute the Cleanup step to clear out any changes you made during the demo.

If you are done using your cluster, consider deleting the CFN stack. Otherwise, to avoid paying for unused resources, do the following tasks:

  • pause your Redshift Cluster
  • stop the Oracle database
  • stop the DMS task