16. ML – Amazon SageMaker

Machine Learning

  • The process of training computers, using math and statistical processes, to find and recognize patterns in data
  • After patterns are found, ML generates and updates training models to make increasingly accurate predictions and inferences about future outcomes based on historical and new data
  • Process involved
    • Data collection and preparation: Gathering relevant data and preprocessing it (cleaning, formatting, feature engineering) to make it suitable for model training.
    • Model training: Feeding the prepared data into machine learning algorithms, which learn patterns and relationships within the data to build a model.
    • Model evaluation: Assessing the trained model’s performance using evaluation metrics and techniques like cross-validation.
    • Model deployment: Integrating the trained and evaluated model into applications or systems to make predictions or decisions on new data.
  • Simple and complex ML models differ when balancing a model’s accuracy (number of correctly predicted data points) and a model’s explainability (how much of the ML system can be explained in “human terms”).
    • The output of a simple ML model (decision trees or logistic regression, for example) may be explainable and produce faster results, but the results may be less accurate.
    • The output of a complex ML model may be accurate, but the results may be difficult to communicate.
  • Key words
    • churn rate, sometimes known as attrition rate, is the rate at which customers stop doing business with a company over a given period of time. Churn may also apply to the number of subscribers who cancel or don’t renew a subscription. The higher your churn rate, the more customers stop buying from your business.
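
A one-line illustration of the churn-rate arithmetic, with made-up numbers:

    # Churn rate = customers lost during the period / customers at the start of the period
    customers_at_start = 1_000
    customers_lost = 50          # cancelled or did not renew this period

    churn_rate = customers_lost / customers_at_start
    print(f"Churn rate: {churn_rate:.1%}")   # -> Churn rate: 5.0%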


SageMaker Notebooks

  • Notebooks direct the overall ML process
  • (Jupyter) Notebook Instances on EC2 are spun up from the console
    • S3 data access
    • Using scikit-learn, Spark, TensorFlow
    • Wide variety of built-in models
    • Ability to spin up training instances
    • Ability to deploy trained models for making predictions at scale
  • Notebook instances come with a pre-installed R kernel, which includes the reticulate library. This library provides an R-to-Python interface, enabling you to use the features of the SageMaker Python SDK directly within an R script. But it's not designed for the production deployment of models.
    • To fully run R models in production, use a custom R Docker container with a SageMaker endpoint.
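
A minimal sketch of that notebook workflow with the SageMaker Python SDK: load data from S3, train a built-in model on a training instance, and deploy it. The role ARN and bucket names are hypothetical.

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical execution role

    # Built-in XGBoost container image for the current region
    container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")

    estimator = Estimator(
        image_uri=container,
        role=role,
        instance_count=1,                       # spins up a training instance
        instance_type="ml.m5.xlarge",
        output_path="s3://my-bucket/output/",   # hypothetical bucket
        sagemaker_session=session,
    )
    estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)
    estimator.fit({"train": "s3://my-bucket/train/"})   # S3 data access

    # Deploy the trained model behind an endpoint for predictions at scale
    predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")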

SageMaker Studio

  • Visual IDE for machine learning
  • SageMaker Notebooks – Jupyter notebooks
  • SageMaker Experiments – Organize, capture, compare, and search your ML jobs
    • allows the engineer to automatically track each model’s run, hyperparameters, and results, making it easier to evaluate multiple algorithms and choose the best-performing model
  • SageMaker Debugger
    • used to monitor and debug machine learning training jobs in real-time
      • automatically generate alerts when anomalies or specific conditions, such as class imbalances or overfitting, occur during the training
      • detecting data imbalances, vanishing gradients, or divergence
    • Saves internal model state at periodic intervals
      • Gradients / tensors over time as a model is trained
      • Define rules for detecting unwanted conditions while training
      • A debug job is run for each rule you configure
      • Logs & fires a CloudWatch event when the rule is hit
    • Auto-generated training reports
    • Built-in rules:
      • Monitor system bottlenecks
      • Profile model framework operations
      • Debug model parameters
    • Supported Frameworks & Algorithms:
      • TensorFlow
      • PyTorch
      • MXNet
      • XGBoost
      • SageMaker generic estimator (for use with custom training containers)
    • Debugger APIs available on GitHub
      • Construct hooks & rules for the CreateTrainingJob and DescribeTrainingJob APIs
      • SMDebug client library lets you register hooks for accessing training data
    • Debugger ProfilerRule
      • ProfilerReport
      • Hardware system metrics (CPUBottleneck, GPUMemoryIncrease, etc.)
      • Framework Metrics (MaxInitializationTime, OverallFrameworkMetrics, StepOutlier)
    • Built-in actions to receive notifications or stop training (see the sketch after this list)
      • StopTraining(), Email(), or SMS()
      • In response to Debugger Rules
      • Sends notifications via SNS
    • Profiling system resource usage and training
  • SageMaker Edge Manager (reached end of life in 2024)
    • Software agent for edge devices
    • Model optimized with SageMaker Neo
    • Collects and samples data for monitoring, labeling, retraining
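
A hedged sketch of the Debugger rule/action setup mentioned above, using the SageMaker Python SDK; the container image, role, email address, and S3 path are assumptions:

    from sagemaker.debugger import Rule, rule_configs
    from sagemaker.estimator import Estimator

    # Built-in actions fired when a rule triggers: stop the job and notify via SNS email
    actions = rule_configs.ActionList(
        rule_configs.StopTraining(),
        rule_configs.Email("ml-team@example.com"),   # hypothetical address
    )

    rules = [
        Rule.sagemaker(rule_configs.loss_not_decreasing(), actions=actions),
        Rule.sagemaker(rule_configs.overfit()),   # a debug job runs per configured rule
    ]

    estimator = Estimator(
        image_uri=training_image,        # assumption: your training container URI
        role=role,                       # assumption: an existing execution role
        instance_count=1,
        instance_type="ml.m5.xlarge",
        rules=rules,
    )
    estimator.fit({"train": "s3://my-bucket/train/"})   # hypothetical S3 path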

Debugger vs Monitor

SageMaker AI training job for TensorBoard 

While TensorBoard is useful for visualizing training metrics such as accuracy, loss, and gradients, it is primarily a tool for manual inspection rather than real-time, automated monitoring.
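
A sketch of streaming TensorBoard logs from a SageMaker training job to S3 so they can be inspected manually; the script name, versions, and bucket are assumptions:

    from sagemaker.debugger import TensorBoardOutputConfig
    from sagemaker.tensorflow import TensorFlow

    # Copy TensorBoard event files from the training container to S3 during training
    tb_config = TensorBoardOutputConfig(
        s3_output_path="s3://my-bucket/tensorboard-logs/",        # hypothetical bucket
        container_local_output_path="/opt/ml/output/tensorboard",
    )

    estimator = TensorFlow(
        entry_point="train.py",          # assumption: script writes tf.summary logs
        role=role,                       # assumption: an existing execution role
        instance_count=1,
        instance_type="ml.m5.xlarge",
        framework_version="2.13",
        py_version="py310",
        tensorboard_output_config=tb_config,
    )
    estimator.fit("s3://my-bucket/train/")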

SageMaker and Spark

  • Pre-process data as normal with Spark
    • Generate DataFrames
  • Use sagemaker-spark library
  • SageMakerEstimator
    • KMeans, PCA, XGBoost
  • SageMakerModel
  • Notebooks can use the SparkMagic (PySpark) kernel
  • Connect notebook to a remote EMR cluster running Spark (or use Zeppelin)
  • Training dataframe should have:
    • A features column that is a vector of Doubles
    • An optional labels column of Doubles
  • Call fit on your SageMakerEstimator to get a SageMakerModel
  • Call transform on the SageMakerModel to make inferences
  • Works with Spark Pipelines as well.
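
A PySpark sketch of that flow using the sagemaker-spark library; it assumes a Spark session with the sagemaker-spark JARs on the classpath, a hypothetical role ARN, and a DataFrame df with a features vector column of Doubles:

    from sagemaker_pyspark import IAMRole
    from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

    estimator = KMeansSageMakerEstimator(
        sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),
        trainingInstanceType="ml.m5.xlarge",
        trainingInstanceCount=1,
        endpointInstanceType="ml.m5.xlarge",
        endpointInitialInstanceCount=1,
    )
    estimator.setK(10)            # number of clusters
    estimator.setFeatureDim(100)  # assumption: dimensionality of the features vector

    model = estimator.fit(df)          # trains on SageMaker, returns a SageMakerModel
    predictions = model.transform(df)  # inference via the hosted SageMaker endpoint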

SageMaker on the Edge

  • SageMaker Neo
    • Train once, run anywhere
      • i.e., it enhances a model’s performance after training
    • used to optimize machine learning models for inference on different hardware platforms, including edge devices
    • Edge devices
      • ARM, Intel, Nvidia processors
    • Optimizes code for specific devices
      • TensorFlow, MXNet, PyTorch, ONNX, XGBoost, DarkNet, Keras
    • Consists of a compiler and a runtime
    • Models are compiled into an optimized binary, allowing them to run with significantly lower latency and reduced compute resources, making Neo ideal for applications requiring fast decision-making, such as object detection (see the sketch after this list)
  • Neo + AWS IoT Greengrass
    • Neo-compiled models can be deployed to an HTTPS endpoint
      • Hosted on C5, M5, M4, P3, or P2 instances
      • Must be same instance type used for compilation
    • OR! You can deploy to IoT Greengrass
      • This is how you get the model to an actual edge device
      • Inference at the edge with local data, using model trained in the cloud
      • Uses Lambda inference applications
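
A hedged sketch of compiling a trained model with Neo through the SDK's compile_model and deploying to a matching instance family; the input shape, paths, and versions are assumptions:

    # `estimator` is assumed to be an already-trained MXNet/TensorFlow/etc. estimator
    compiled_model = estimator.compile_model(
        target_instance_family="ml_c5",             # or an edge device target
        input_shape={"data": [1, 3, 224, 224]},     # assumption: model's input tensor shape
        output_path="s3://my-bucket/neo-output/",   # hypothetical bucket
        framework="mxnet",
        framework_version="1.8",
    )

    # Deploy on the same instance family used for compilation (ml_c5 -> ml.c5.*)
    predictor = compiled_model.deploy(
        initial_instance_count=1,
        instance_type="ml.c5.xlarge",
    )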

Automate key machine learning tasks and use no-code or low-code solutions

  • SageMaker Canvas
    • No-code machine learning for business analysts
    • Capabilities for tasks such as data preparation, feature engineering, algorithm selection, training and tuning, inference, and more
    • Upload CSV data (CSV only for now), select a column to predict, build it, and make predictions
    • Can also join datasets
    • Classification or regression
    • Automatic data cleaning
      • Missing values
      • Outliers
      • Duplicates
    • Share models & datasets with SageMaker Studio
    • The Finer Points
      • Local file uploading must be configured “by your IT administrator.”
        • Set up an S3 bucket with appropriate CORS permissions (see the boto3 sketch after this list)
      • Can integrate with Okta SSO
      • Canvas lives within a SageMaker Domain that must be manually updated
      • Import from Redshift can be set up
      • Time series forecasting must be enabled via IAM
      • Can run within a VPC
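
As a sketch of that IT-administrator step, the upload bucket's CORS can be set with boto3; the bucket name is hypothetical and the exact rules should be checked against the current Canvas documentation:

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_cors(
        Bucket="my-canvas-upload-bucket",   # hypothetical bucket
        CORSConfiguration={
            "CORSRules": [{
                "AllowedHeaders": ["*"],
                "AllowedMethods": ["POST"],   # Canvas local file upload uses POST
                "AllowedOrigins": ["*"],
                "ExposeHeaders": [],
            }]
        },
    )
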
  • SageMaker Autopilot (has been integrated into Canvas)
    • Automates:
      • Algorithm selection
      • Data preprocessing
      • Model tuning
      • All infrastructure
    • It does all the trial & error for you
    • More broadly this is called AutoML
    • Workflow
      • Load data from S3 for training
      • Select your target column for prediction
      • Automatic model creation
      • Model notebook is available for visibility & control
      • Model leaderboard
        • Ranked list of recommended models
        • You can pick one
      • Deploy & monitor the model, refine via notebook if needed
    • Can add in human guidance
    • With or without code in SageMaker Studio or AWS SDKs (see the AutoML sketch at the end of this section)
    • Problem types:
      • Binary classification
      • Multiclass classification
      • Regression
    • Algorithm Types:
      • Linear Learner
      • XGBoost
      • Deep Learning (MLPs)
      • Ensemble mode
    • Data must be tabular CSV or Parquet
  • SageMaker JumpStart
    • One-click models and algorithms from model zoos
    • provides pre-built models and end-to-end solutions for common machine learning use cases
    • Over 150 open-source models in NLP, object detection, image classification, etc.
    • also provides solution templates that set up infrastructure for common use cases, and executable example notebooks for machine learning with SageMaker AI
    • Does NOT allow testing different algorithms or fully custom training
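
A hedged sketch of the Autopilot workflow through the SDK's AutoML class (role, bucket, and target column are assumptions):

    from sagemaker.automl.automl import AutoML

    automl = AutoML(
        role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
        target_attribute_name="churn",           # the column to predict
        problem_type="BinaryClassification",     # optional; Autopilot can infer it
        job_objective={"MetricName": "F1"},      # required when problem_type is set
        max_candidates=10,
    )
    automl.fit(inputs="s3://my-bucket/train.csv")   # tabular CSV (or Parquet) in S3

    best = automl.best_candidate()   # top of the model leaderboard
    print(best["CandidateName"])

    # Deploy the best candidate, then monitor and refine via the generated notebook
    predictor = automl.deploy(initial_instance_count=1, instance_type="ml.m5.large")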

Amazon SageMaker Model Cards

  • document machine learning models
  • capture detailed information about each model, including its background, intended use cases, performance metrics, and business context
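
A sketch with the boto3 CreateModelCard API; the card name is hypothetical, and the Content fields should be checked against the current model card JSON schema:

    import boto3, json

    sm = boto3.client("sagemaker")
    sm.create_model_card(
        ModelCardName="churn-model-card",   # hypothetical name
        ModelCardStatus="Draft",
        Content=json.dumps({
            # Assumed fields from the model card schema: overview and intended uses
            "model_overview": {"model_description": "Predicts customer churn."},
            "intended_uses": {"intended_uses": "Retention campaign targeting."},
        }),
    )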

Amazon SageMaker Model Registry

  • primarily used for model versioning and deployment management
  • focuses more on model governance and deployment aspects than documentation
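
A sketch of registering a model version via the SDK; `model` is assumed to be an existing sagemaker Model object, and the group name is hypothetical:

    # Register the model as a new version in a model package group
    model_package = model.register(
        model_package_group_name="churn-models",    # hypothetical group
        content_types=["text/csv"],
        response_types=["text/csv"],
        inference_instances=["ml.m5.large"],
        transform_instances=["ml.m5.large"],
        approval_status="PendingManualApproval",    # gate deployment behind approval
    )
    print(model_package.model_package_arn)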

Amazon SageMaker Model Dashboard

  • a tool primarily used for visualizing and monitoring the performance of machine learning models deployed on SageMaker endpoints.

Amazon SageMaker Inference Recommender

  • helps determine the optimal instance type and configuration for deploying a machine learning model based on performance requirements and cost considerations
  • How it works:
    • Register your model to the model registry
    • Benchmark different endpoint configurations
    • Collect & visualize metrics to decide on instance types
    • Existing models from zoos may have benchmarks already
  • Instance Recommendations
    • Runs load tests on recommended instance types
    • Takes about 45 minutes
  • Endpoint Recommendations
    • Custom load test
    • You specify instances, traffic patterns, latency requirements, and throughput requirements
    • Takes about 2 hours
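
A sketch of kicking off a recommendation job with boto3 against a registered model version (names and ARNs are hypothetical):

    import boto3

    sm = boto3.client("sagemaker")
    sm.create_inference_recommendations_job(
        JobName="churn-model-recommendation",   # hypothetical name
        JobType="Default",    # "Default" = instance recommendations (~45 min);
                              # "Advanced" = custom load test (~2 hours)
        RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",   # hypothetical role
        InputConfig={
            "ModelPackageVersionArn": model_package_arn,  # from the Model Registry
        },
    )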

Amazon SageMaker Automatic Scaling

  • automatically adjusts the number of instances provisioned for a SageMaker endpoint based on the incoming traffic
  • set up a scaling policy to define target metrics, min/max capacity, cooldown periods
  • Works with CloudWatch
  • Also supports scheduled actions to perform scaling activities at specific times; these can be set to scale either once or on a recurring schedule
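
A sketch of the Application Auto Scaling setup for an endpoint variant: register a scalable target, then attach a target-tracking policy (the endpoint name and numbers are assumptions):

    import boto3

    aas = boto3.client("application-autoscaling")
    resource_id = "endpoint/my-endpoint/variant/AllTraffic"   # hypothetical endpoint/variant

    # Register the endpoint variant as a scalable target with min/max capacity
    aas.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=1,
        MaxCapacity=4,
    )

    # Target-tracking policy on a CloudWatch-backed predefined metric,
    # with scale-in/scale-out cooldowns
    aas.put_scaling_policy(
        PolicyName="invocations-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 70.0,   # target invocations per instance (assumption)
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 60,
        },
    )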

Amazon SageMaker Multi-Model Endpoints 

  • deploy multiple models behind a single endpoint.
  • multiple models that need to be served simultaneously
    • Utilize the same set of resources and a shared serving container, which helps reduce hosting costs and deployment overhead. Amazon SageMaker manages the loading of models in memory and scales them based on the traffic patterns to your endpoint.
  • or perform A/B testing between different models.
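
A sketch of invoking one specific model behind a multi-model endpoint; the endpoint name and model path are hypothetical:

    import boto3

    runtime = boto3.client("sagemaker-runtime")
    resp = runtime.invoke_endpoint(
        EndpointName="my-multi-model-endpoint",   # hypothetical endpoint
        TargetModel="models/model-a.tar.gz",      # relative path under the shared S3 prefix
        ContentType="text/csv",
        Body=b"1.0,2.0,3.0",
    )
    print(resp["Body"].read())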

Amazon SageMaker Automatic Model Tuning

  • hyperparameter optimization of the model
  • “HyperParameter Tuning Job” that trains as many combinations as you’ll allow
  • It learns as it goes
  • Best Practices
    • Don’t optimize too many hyperparameters at once
    • Limit your ranges to as small a range as possible
    • Use logarithmic scales when appropriate
    • Don’t run too many training jobs concurrently
      • This limits how well the process can learn as it goes
    • Make sure training jobs running on multiple instances report the correct objective metric in the end
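
A hedged sketch reflecting those practices with the SDK's HyperparameterTuner; the estimator, metric, and ranges are assumptions:

    from sagemaker.tuner import (
        HyperparameterTuner, ContinuousParameter, IntegerParameter,
    )

    # `estimator` is assumed to be a configured XGBoost (or other) Estimator
    tuner = HyperparameterTuner(
        estimator=estimator,
        objective_metric_name="validation:auc",
        hyperparameter_ranges={
            # Few hyperparameters, small ranges, log scale where appropriate
            "eta": ContinuousParameter(0.01, 0.3, scaling_type="Logarithmic"),
            "max_depth": IntegerParameter(3, 8),
        },
        max_jobs=20,
        max_parallel_jobs=2,   # low concurrency so the search can learn as it goes
    )
    tuner.fit({"train": "s3://my-bucket/train/",
               "validation": "s3://my-bucket/validation/"})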