Prepare Data For Machine Learning Using Amazon SageMaker Data Wrangler

Demo Video

Coming soon!

Before You Begin

You must run the main CloudFormation template and capture the following outputs from its Outputs tab.

  1. Parameters for launching the SageMaker Studio CloudFormation template.

    • SageMakerRoleArn
    • SubnetAPublicId
    • VpcId
    • TaskDataBucketName
  2. Parameters for connecting to Redshift Cluster

    • Cluster Identifier
    • Database name
    • Database User
    • Redshift Cluster Role

Launch the SageMaker Studio CloudFormation template by clicking the button below, and provide the parameter values listed in section 1 above (captured from the main template's output).

Launch

Client Tool

This demo uses Amazon SageMaker Data Wrangler, which can be accessed from Amazon SageMaker Studio.

Challenge

Kim (a data scientist) has been tasked with predicting which clients are most likely to sign up for long-term deposits. However, the banking data set Kim has received is not machine-learning ready, and she must analyze and cleanse the data before a machine learning model can be trained.

The banking data set relates to direct marketing campaigns (phone calls). Campaigns are run to encourage clients to sign up for long-term deposits. The data set contains client data, campaign data, and socio-economic data. Before a machine learning model is built, Kim is going to analyze the data set and apply various data cleansing techniques such as addressing missing values, encoding categorical variables, and standardizing numeric variables.

The banking data set is stored in an Amazon Redshift cluster. Kim plans to use Data Wrangler to connect to the Redshift data warehouse, pull the data to understand it, and then work through the data preparation process. Kim will perform the following steps.

  1. Create new Data Flow
  2. Import Data set
  3. Analyze and Visualize
  4. Drop Unused columns
  5. Create Dummy Variables
  6. Standardizing Numeric Variables
  7. Export Processed Data

Data Wrangler Data Preparation - WorkFlow

Create new Data Flow

SayDo Show

We have the banking data set stored in an Amazon Redshift data warehouse. Amazon Redshift makes it easy to gain new insights from all your data. Kim queries the banking data set and makes an initial observation of it. From these initial observations, Kim identifies that the data set is not machine-learning ready; for example, categorical variables such as the job and marital fields do not have the reference variables required for machine learning.

Kim decides to use the Amazon SageMaker Data Wrangler client tool to analyze, visualize, and finally prepare the banking data set so that a machine learning model can be trained on it.

Once the Studio stack is created, navigate to SageMaker Studio.

  • Find the rs-demo-scientist User Profile, and click on the Open Studio link.

This will launch the Studio's JupyterLab environment in a new browser tab (this might take 2-3 minutes).

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface.

Kim begins by creating a new data flow so she can perform the data analysis and processing steps.

Go to SageMaker Studio.

  • Select the + sign on the new data flow card under ML tasks and components.

This creates a new untitled.flow file.

Kim decides to name this flow Bank_details.flow.

Import Data set

SayDo Show

As a first step, Kim makes a connection to her cloud data warehouse, Amazon Redshift. Amazon SageMaker Data Wrangler's built-in, out-of-the-box connector makes it easy to connect to Amazon Redshift.

Go to the Import tab in Bank_details.flow.

  1. Click on the Add data source button on the right-hand side.
  2. Choose Amazon Redshift.

A new Add Amazon Redshift Cluster window opens; choose the following:

  1. Choose Temporary credentials (IAM) for Type.
  2. Enter a Connection Name. This is a name used by Data Wrangler to identify this connection.
  3. Enter the Cluster Identifier to specify to which cluster you want to connect.
  4. Enter the Database Name of the database to which you want to connect.
  5. Enter a Database User to identify the user you want to use to connect to the database.
  6. For UNLOAD IAM Role, enter the IAM role ARN of the role that the Amazon Redshift cluster should assume to move and write data to Amazon S3.

After a successful connection, Kim wants to import her Bank_details data set into Data Wrangler.

After a successful connection to the Redshift cluster, the "Query Amazon Redshift" page is displayed. Go to Schema and select the public schema from the drop-down list.

  1. Type the following query in the query editor:

       Select * from Bank_details;
    
  2. Hit the Run button, which is found right above the query editor.
  3. After the result set returns successfully, the Import button at the top is activated.
  4. Click on Import button.
  5. Give this data set the name "bank_details" and click on the Add button.

Kim now wants to know more about her data set. For example, she wants to understand the data types of the data set she has imported.

Data Wrangler automatically infers the data type of each column. Kim wants to make sure the data types are correctly set for each column.

Under Data flow tab

  1. Choose the + next to the Data types step and select Edit data types.
  2. On the right-hand side, observe the data types that were automatically inferred by Data Wrangler.
  3. For this lab, you can keep the default data types.
  4. On the top-left side, click Back to data flow to exit.
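
The same check can also be done in code. The sketch below assumes, as with the df.info() step later in this demo, that a Python (Pandas) custom transform exposes the current data set as a DataFrame named df:

    # Hypothetical custom Pandas transform: list the type inferred for each column
    print(df.dtypes)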

Analyze and Visualize

SayDo Show

As part of the data preparation for machine learning, Kim starts analyzing the data. First she wants to generate summary statistics for the banking data set to understand its demographics, for example, the average age of the clients.

Under Data flow tab

  1. Choose the + next to the Data type step in your data flow and select Add analysis.

  2. Under Create analysis, select Table summary from the Analysis type drop-down list.

  3. Give the table summary a Name. For Example, “Summary”.

  4. Select Preview to preview the data.

  5. Choose Save to save it. It appears under All Analyses.
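
If Kim prefers a code-based check, a custom Python (Pandas) transform can produce roughly the same statistics as the Table summary analysis. This is a minimal sketch; it assumes the data set is available as df and that the age column is named age in the imported table:

    # Hypothetical custom Pandas transform: summary statistics for the numeric columns
    print(df.describe())
    # e.g. the average client age Kim is interested in
    print(df['age'].mean())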

After looking at the summary stats, Kim wants to create a histogram to understand how the education levels of her clients are distributed against long-term deposit sign-ups. For this she decides to use the built-in histogram feature of Data Wrangler.

Kim uses the histogram chart to visually inspect whether education levels and long-term opt-ins have an uneven distribution.

Under Data flow tab.

  1. While still in the Analysis tab, click on the Create new analysis button on the right-hand side.
  2. Under Create analysis, select Histogram from the Analysis type drop-down list.
  3. Give the analysis a Name. For example, "Education_Levels_vs_term_signups".
  4. For the X axis, select "Education" from the drop-down list.
  5. Color by – leave empty.
  6. Facet by – select "Y" from the drop-down list.
  7. Select Preview to preview the histogram that will be created.
  8. Click on Save to save this analysis.
  9. On the top-left side, click Back to data flow to exit the Analysis tab.
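
The same comparison can be sketched with a custom Python (Pandas) transform. The snippet below is only a sketch, and it assumes the education level and sign-up columns are named education and y (lowercase) in the imported table:

    # Hypothetical custom Pandas transform: count sign-ups (y) per education level,
    # roughly the same view as the histogram faceted by y
    import pandas as pd
    print(pd.crosstab(df['education'], df['y']))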

Kim observes that clients with a university degree have the most opt-ins, while clients with only basic education have a low number of opt-ins.

Next, Kim wants to know if there are any missing values in her banking data set. For this she decides to use the custom transform feature available in Data Wrangler and apply the pandas df.info() method to the data set.

Under Data flow tab

  1. Choose the + next to the Data type step in your data flow and select Add Transform.

  2. In the Transforms section, click on Add Step

  3. Select Custom transform.

  4. Select Python (Pandas) from drop down list.

  5. In the code editor, type:

    df.info()

  6. Select Preview to preview the output, which is displayed below the Python code editor.

  7. Click on Add to save this transformation.

  8. On the top right-hand side, click Back to data flow to return to the data flow.
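
A more direct way to count missing values than reading the df.info() output is sketched below; as before, it assumes the custom transform exposes the data set as df:

    # Hypothetical custom Pandas transform: number of missing values per column
    # (expected to be zero everywhere if the data set is complete)
    print(df.isnull().sum())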

Kim observes that the total number of rows in this data set is 4,119. She also notices that each column has the same number of non-null values, so she does not need to handle any missing values.

Drop Unused Columns

SayDo Show
As Kim goes through the banking data set, she observes that some fields, especially the socio-economic data, can be dropped from the data set in her first iteration of data modeling, so she decides to drop the socio-economic fields and other campaign-related fields. The fields Kim wants to drop are: day_of_week, duration, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed.

Under Data flow tab.

  1. Choose the + next to the Custom Pandas step in your data flow and select Add Transform.
  2. In the Transforms section, select add step.
  3. Select Custom transform
  4. Select Python (Pandas) from drop down list.
  5. Enter the following in the code box. The columns below will be dropped from the data set.

    # Drop socio-economic and campaign-related columns not needed for the first model iteration
    cols = ['day_of_week', 'duration', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed']
    df = df.drop(cols, axis=1)

  6. Click on Preview and then click on Add to save it.

A new transformation step is now added to the flow.
Under the data preview, you can see that these fields are no longer displayed.
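
A quick way to double-check the result in a custom Python (Pandas) transform, again assuming df refers to the current data set:

    # Hypothetical check: the dropped columns should no longer appear in the column list
    print(list(df.columns))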

Tip: Data Wrangler has a built-in transform to drop columns. To use it, choose Manage columns from the right panel. For Input column, choose the column that you want to drop, and choose Preview. Verify that the column has been dropped, then choose Add.

  1. Click on Back to data flow to exit.

Create Dummy Variables

SayDo Show
Kim observes that there are many categorical variables in the banking data set, and she decides to use Data Wrangler's built-in encoding feature to create reference (dummy) variables. Kim identifies the following categorical variables that she wants to create dummy variables for: job, marital, education, default, housing, loan, contact, month.

Under Data Flow.

  1. Choose the + next to the Steps(n) step in your data flow and select Add Transform.
  2. In the Transforms section, Click on Add Step
  3. Now select Encode Categorical.
  4. Under Transform, select One-hot encode from the drop-down list.
  5. Under Input column, select marital from the drop-down list.
  6. Under Invalid handling strategy, select Skip.
  7. Under Output style, select Columns from the drop-down list.
  8. Click on Preview; notice that new reference variables are created for the marital field. Click on Add to save this transform.
  9. For this demo, skip creating dummy variables for the other fields and move on to the next step.
  10. Click on Back to data flow to exit.
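
For reference, the one-hot encoding applied to the marital column could also be expressed as a custom Python (Pandas) transform. This is only a sketch of the idea, not the exact transform Data Wrangler generates:

    # Hypothetical custom Pandas transform: one-hot encode the marital column
    import pandas as pd
    df = pd.get_dummies(df, columns=['marital'], prefix='marital')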

Standardizing Numeric Variables

SayDo Show
Kim now decides to tackle the numeric features. Since she is planning to use a gradient descent based algorithm, she knows she needs to work on feature scaling. She observes that the age, campaign, previous, and pdays columns are numeric and need to go through a standardization process. She decides to use Data Wrangler's built-in "Process Numeric" transform to tackle this challenge.

Under Data Flow.

  1. Choose the + next to the Steps(n) step in your data flow and select Add Transform.
  2. In the Transforms section, Click on Add Step
  3. Create a new transformation by using the built-in Process numeric transformation.
  4. Under Transform, select Scale values from the drop-down.
  5. Under Scaler, select Standard scaler from the drop-down.
  6. Under Input column, select age from the drop-down, and select Center as the standardization option.
  7. Click on Preview; notice that the age field is now scaled. Click on Add to save it.
  8. For this lab you can skip standardizing the other numeric variables and move on to the next step.
  9. Click on Back to data flow to exit.

Notice that the age values are now standardized.
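
For reference, the scaler applied to age is essentially a z-score transformation, which could be sketched as a custom Python (Pandas) transform (centering subtracts the column mean; full standardization also divides by the standard deviation):

    # Hypothetical custom Pandas transform: standardize the age column
    df['age'] = (df['age'] - df['age'].mean()) / df['age'].std()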

Export Processed Data

SayDo Show
Kim has now completed her challenge: she has finished the analysis, feature encoding, and feature scaling. Next she decides to load this data back into the data warehouse, where she can use Redshift ML to train a classification model.

Under Data Flow.

  1. Select Steps(n) in the flow and select the last step; in this lab it is Scale values. The last step is selected so that all the transformations added to the flow are applied to the data when the export runs.
  2. Under the Data tab, click on the Export data option.
  3. Provide an S3 bucket that the Redshift cluster role has access to.
  4. Set the file type to CSV, the delimiter to Comma, and compression to None.
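
As an alternative to the UI export, the processed data could be written to S3 programmatically. This is a hedged sketch only: it assumes the s3fs package is available in the environment and that the placeholder bucket name is replaced with the TaskDataBucketName captured from the CloudFormation outputs:

    # Hypothetical programmatic export of the processed data set to S3 as CSV
    df.to_csv('s3://<TaskDataBucketName>/bank_details_processed.csv', index=False)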

Next Steps

SayDo Show
Kim now has machine-learning-ready data on S3.

Kim can use Redshift ML or Amazon SageMaker Studio to train a machine learning model.

Before you Leave

Since only one SageMaker domain is allowed per account per region, you cannot launch multiple Studio demo stacks in the same account and region.

Step Visual
You need to manually delete all the non-default apps under both the rs-demo-scientist and rs-demo-engineer user profiles before you can delete this stack, as shown below.

If you are done using your cluster, please consider deleting the CFN stack, or do the following tasks to avoid paying for unused resources:

  • pause your Redshift Cluster
  • stop the Oracle database
  • stop the DMS task