Data Scientist Persona Demo
Use Cases
The Data Scientist demos are divided into multiple use cases. Choose the ones that are most suitable for the customer and the personas you are targeting:
- Machine Learning within Redshift using SQL (Feature Launched 2021)
- Machine Learning in Redshift SQL - Bring your own model
- Machine Learning in Redshift SQL using XGBoost with SageMaker Notebook
- Data Preparation using Data Wrangler
- Sales forecasting using SageMaker Notebook - Python SARIMA Model
- Sales forecasting using SageMaker Studio - Python SARIMA Model
- Energy consumption forecasting using SageMaker - DeepAR Algorithm
A note on typical challenges faced by Data Scientists
“Data scientists are responsible for discovering insights from massive amounts of structured and unstructured data to help shape or meet specific business needs and goals” - CIO Magazine
Most data scientists spend 80 percent of their time finding, cleaning, and reorganizing huge amounts of data.
Typical use cases and requirements for Data Scientists are:
- Analyze large volumes of data from both Redshift and the S3 data lake
- Easy access to raw data
- Flexibility to create data models, ingest data, and run algorithms
- Data models tend to be temporary and iterative in nature
Amazon Redshift can simplify your data wrangling process and help you clean up and prepare your data quickly and cost-efficiently.
A note on Machine Learning (ML) workflows
- Data preparation, ETL pipeline:
  - Clean data: fill missing values and remove invalid rows
  - Transform data, which may require joining multiple tables
  - Split the data into training and testing sets
- Training models
  - Run inference on training data for model evaluation
- Model evaluation: evaluate the model against test data and A/B test different model versions against each other
  - Run model inference on test data and compare the results with the expected results
  - Set up A/B testing for different models
- Build an ML pipeline for inference by using a batch transform job
  - Run inference in real time or in daily/weekly batches
    - Real-time results serve use cases such as recommendations
    - Batch results are used for sales prediction and retail pricing
  - Store inference results for dashboard visualization and further analysis
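The workflow steps above can be sketched end to end in a few lines of plain Python. This is only an illustrative outline, not the Redshift ML or SageMaker API: the toy dataset, the single-feature linear model, and every function name (`clean`, `split`, `train`, `predict`, `mean_absolute_error`) are assumptions made up for this example. In the demos, the same stages would be carried out with SQL in Redshift or with SageMaker tooling.

```python
import random

def clean(rows):
    """Remove rows without a target and fill missing features with the mean."""
    valid = [r for r in rows if r["y"] is not None]
    known = [r["x"] for r in valid if r["x"] is not None]
    mean_x = sum(known) / len(known)
    return [
        {"x": r["x"] if r["x"] is not None else mean_x, "y": r["y"]}
        for r in valid
    ]

def split(rows, test_fraction=0.25, seed=0):
    """Shuffle deterministically and return (training_rows, test_rows)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train(rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(r["x"] for r in rows) / n
    mean_y = sum(r["y"] for r in rows) / n
    sxx = sum((r["x"] - mean_x) ** 2 for r in rows)
    sxy = sum((r["x"] - mean_x) * (r["y"] - mean_y) for r in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

def mean_absolute_error(model, rows):
    """Average absolute gap between predictions and expected values."""
    return sum(abs(predict(model, r["x"]) - r["y"]) for r in rows) / len(rows)

# Hypothetical toy data following y = 2x + 1.
raw = [{"x": float(i), "y": 2.0 * i + 1.0} for i in range(21)]
raw[10]["x"] = None                  # a missing feature value to fill
raw.append({"x": 3.0, "y": None})    # an invalid row to remove

# Data preparation: clean, then split into training and testing sets.
cleaned = clean(raw)
train_rows, test_rows = split(cleaned)

# Training, then evaluation on both the training and the held-out test data.
model = train(train_rows)
train_error = mean_absolute_error(model, train_rows)
test_error = mean_absolute_error(model, test_rows)

# Batch inference: score a batch of new inputs and keep the results
# so they can be stored for dashboards or further analysis.
new_inputs = [21.0, 22.0]
batch_results = [{"x": x, "prediction": predict(model, x)} for x in new_inputs]

print("train error:", train_error, "test error:", test_error)
print(batch_results)
```

Comparing two candidate models in an A/B fashion would follow the same pattern: train each one on `train_rows`, compute `mean_absolute_error` for each on `test_rows`, and keep the one with the lower error.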