17. ML – SageMaker Implementation

== OPERATIONS ==

— Prepare Data —

Data prep on SageMaker

  • Data Sources
    • usually comes from S3
      • Ideal format varies with algorithm – often it is RecordIO / Protobuf
        • LibSVM is a simple, text-based, sparse data format, primarily used with support vector machines (SVMs).
        • RecordIO/Protobuf is a binary data storage format that allows for efficient storage and retrieval of large datasets, commonly used in deep learning and data processing.
      • The SageMaker AI Processing job requires specific permissions to access objects in an S3 bucket.
        • If the execution role assigned to the job does not have the s3:GetObject permission for the file, it will encounter a 403 Forbidden error when trying to access the file.
        • Not “SageMaker Studio” permissions, which are primarily used for tasks within the SageMaker Studio environment.
    • also ingest from Athena, EMR, Redshift, and Amazon Keyspaces DB
  • Apache Spark integrates with SageMaker
  • More on this later…
  • Scikit-learn, NumPy, and pandas are all at your disposal within a notebook
  • Processing (example)
    • Copy data from S3
    • Spin up a processing container
      • SageMaker built-in or user provided
    • Output processed data to S3
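
A minimal sketch of the processing flow above, using the SageMaker Python SDK's SKLearnProcessor (the bucket paths, role ARN, and preprocess.py script are placeholders, not from these notes):

    from sagemaker.sklearn.processing import SKLearnProcessor
    from sagemaker.processing import ProcessingInput, ProcessingOutput

    processor = SKLearnProcessor(
        framework_version="1.2-1",
        role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder role
        instance_type="ml.m5.xlarge",
        instance_count=1,
    )

    processor.run(
        code="preprocess.py",  # your preprocessing script (hypothetical)
        inputs=[ProcessingInput(source="s3://my-bucket/raw/",            # copied from S3 into the container
                                destination="/opt/ml/processing/input")],
        outputs=[ProcessingOutput(source="/opt/ml/processing/output",    # uploaded back to S3 when done
                                  destination="s3://my-bucket/processed/")],
    )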

SageMaker Data Wrangler

  • Visual interface (in SageMaker Studio) to prepare data for machine learning
    • Import data
    • Visualize data (Explore and Analysis)
    • Transform data (300+ transformations to choose from)
      • Or integrate your own custom transforms with pandas, PySpark, or PySpark SQL
      • Balance Data – create better models for binary classification
        • Random oversampling
        • Random undersampling
        • Synthetic Minority Oversampling Technique (SMOTE)
      • Reduce Dimensionality within a Dataset
        • Principal Component Analysis (PCA)
      • Encode Categorical – creating a numerical representation for categories
      • Handle Outliers
        • Robust Standard Deviation Numeric Outliers – (Q1/Q3 + n x Standard Deviation)
        • Standard Deviation Numeric Outliers – only n x Standard Deviation
        • Quantile Numeric Outliers – only Q1/Q3
        • Min-Max Numeric Outliers – set upper and lower threshold
        • Replace Rare – set a single threshold
      • Handle Missing Values
        • Fill missing (predefined value)
        • Impute missing (mean or median, or most frequent value for categorical data)
        • Drop missing (row)
      • Process Numeric (see the scaler sketch at the end of this subsection)
        • Standard Scaler
        • Robust Scaler
          • removes the median and scales the data using the interquartile range (IQR), making it less vulnerable to outliers than the Standard Scaler or Min-Max Scaler
        • Min Max Scaler
        • Max Absolute Scaler
Aspect           | StandardScaler                                        | Normalizer
Operation basis  | Feature-wise (across columns)                         | Sample-wise (across rows)
Purpose          | Standardizes features to zero mean and unit variance  | Scales samples to unit norm (L2 by default)
Impact on data   | Alters the mean and variance of each feature          | Adjusts the magnitude of each sample vector
Common use cases | Regression, PCA, algorithms sensitive to variance     | Text classification, k-NN, direction-focused tasks
  • preprocessing large text datasets (well suited to NLP tasks)
    • handle various data formats
    • supports common NLP preprocessing techniques
      • tokenization
      • stemming
      • stop word removal
  • “Quick Model” to train your model with your data and measure its results
  • Troubleshooting
    • Make sure your Studio user has appropriate IAM roles
    • Make sure permissions on your data sources allow Data Wrangler access
      • Add AmazonSageMakerFullAccess policy
    • EC2 instance limit
      • If you get “The following instance type is not available…” errors
      • May need to request a quota increase
      • Service Quotas / Amazon SageMaker / Studio KernelGateway Apps running on ml.m5.4xlarge instance
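
A small scikit-learn sketch of the numeric-processing options listed under "Process Numeric" above (Data Wrangler exposes equivalent transforms); the sample data is made up to show the effect of the column-2 outlier:

    import numpy as np
    from sklearn.preprocessing import (StandardScaler, RobustScaler, MinMaxScaler,
                                       MaxAbsScaler, Normalizer)

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])  # note the outlier in column 2

    print(StandardScaler().fit_transform(X))   # feature-wise: zero mean, unit variance per column
    print(RobustScaler().fit_transform(X))     # feature-wise: median/IQR, less sensitive to the outlier
    print(MinMaxScaler().fit_transform(X))     # feature-wise: rescales each column to [0, 1]
    print(MaxAbsScaler().fit_transform(X))     # feature-wise: divides by each column's max absolute value
    print(Normalizer().fit_transform(X))       # sample-wise: scales each ROW to unit L2 norm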

— Processing Job —

Amazon SageMaker Processing

  • a fully managed service that simplifies the process of running data processing and model evaluation jobs
  • run data pre and post processing, feature engineering, and model evaluation tasks on SageMaker
  • built-in support for TensorFlow and Hugging Face’s Transformers and can directly interact with data stored in Amazon S3

— Feature Creation & Store —

SageMaker Feature Store

  • A “feature” is just a property used to train a machine learning model.
    • For example, you might predict someone’s political party based on “features” such as their address, income, age, etc.
  • Machine learning models require fast, secure access to feature data for training.
  • keep it organized and share features across different models
  • Find, discover, and share features in Studio
  • also allows for cross-account access
  • Store modes
    • Online stores are typically used for low-latency access during inference
    • Offline stores used for training or batch inference
  • Features organized into Feature Groups
    • the primary resource within Feature Store that holds both the data and metadata used for training or making predictions with an ML model
    • A feature group represents a logical collection of features that describe individual records
  • Data Ingestion
    • STREAMING access via PutRecord / GetRecord APIs
    • BATCH access via the offline S3 store (use with anything that hits S3, like Athena, Data Wrangler. Automatically creates a Glue Data Catalog for you.)
  • Security
    • Encrypted at rest and in transit
    • Works with KMS customer master keys
    • Fine-grained access control with IAM
    • May also be secured with AWS PrivateLink
  • Steps
    • Setup new feature store
      • Define a new feature group
        • organizes the features and metadata for model training and inference
      • Connect an offline store to the feature group
        • the offline store keeps historical data in Amazon S3 and supports batch processing
      • Load data into the offline store using batch processing
    • Add/Update new feature into existing groups
      • Utilize the UpdateFeatureGroup command to incorporate the new feature into the feature group. Specify the attribute name and type.
      • Apply the PutRecord command to ingest or overwrite records, ensuring that the newly added feature has data populated across both historical and new records
        • update the records that are missing data for the new attribute.
  • Creating an endpoint for feature queries is for querying during inference, not for setting up a new store
  • Scheduling regular feature updates is likewise for ongoing maintenance, not for establishing a new store
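
A minimal sketch of the Feature Store steps above: define a feature group with an online store and an offline (S3) store, then ingest records (the group name, S3 URI, and role ARN are placeholders):

    import time
    import pandas as pd
    import sagemaker
    from sagemaker.feature_store.feature_group import FeatureGroup

    session = sagemaker.Session()
    fg = FeatureGroup(name="customers-feature-group", sagemaker_session=session)

    df = pd.DataFrame({"customer_id": ["c1"], "age": [42], "event_time": [time.time()]})
    fg.load_feature_definitions(data_frame=df)        # infer feature names/types from the DataFrame

    fg.create(
        s3_uri="s3://my-bucket/feature-store/",        # offline store location (placeholder)
        record_identifier_name="customer_id",
        event_time_feature_name="event_time",
        role_arn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
        enable_online_store=True,                      # online store for low-latency reads at inference
    )

    # In practice, wait until the feature group status is "Created" before ingesting.
    fg.ingest(data_frame=df, max_workers=1, wait=True)  # calls PutRecord under the hood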

— Model Training —

Training/Deployment on SageMaker

  • Create a training job (see the training/deployment sketch after this list)
    • URL of S3 bucket with training data
    • ML compute resources
    • URL of S3 bucket for output
    • ECR path to training code
  • Training options
    • Built-in training algorithms
    • Spark MLlib
    • Custom Python TensorFlow / MXNet code
    • PyTorch, Scikit-Learn, RLEstimator
    • XGBoost, Hugging Face, Chainer
  • Save your trained model to S3
  • Can deploy two ways:
    • Persistent endpoint for making individual predictions on demand
    • SageMaker Batch Transform to get predictions for an entire dataset
  • Lots of cool options
    • Inference Pipelines for more complex processing
    • SageMaker Neo for deploying to edge devices
    • Elastic Inference for accelerating deep learning models
    • Automatic scaling (increase # of endpoints as needed)
    • Shadow Testing evaluates new models against live production traffic before promoting them
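
A minimal training-and-deploy sketch following the list above, using the generic Estimator with a training image from ECR (the image URI, S3 paths, and role ARN are placeholders):

    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",  # ECR path to training code
        role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
        instance_count=1,
        instance_type="ml.m5.xlarge",                   # ML compute resources
        output_path="s3://my-bucket/model-artifacts/",  # S3 bucket for output (model.tar.gz)
    )

    estimator.fit({"train": "s3://my-bucket/training-data/"})  # S3 bucket with training data

    # Deploy option 1: a persistent endpoint for individual on-demand predictions
    predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")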

SageMaker Training Compiler

  • Integrated into AWS Deep Learning Containers (DLCs)
    • Can’t bring your own container
  • Compile & optimize training jobs on GPU instances
  • Can accelerate training up to 50%
  • Converts models into hardware-optimized instructions
  • Tested with Hugging Face transformers library, or bring your own model
  • Incompatible with SageMaker distributed training libraries
  • Best practices:
    • Ensure GPU instances are used (ml.p3, ml.p4)
    • PyTorch models must use PyTorch/XLA’s model save function
    • Enable the debug flag in the compiler_config parameter for debugging

— Deploy Inference Model —

Inference Model Deployment

  • getting predictions, or inferences, from your trained machine learning models
  • approaches
    • Deploy a machine learning model in a low-code or no-code environment
      • deploy pre-trained models using Amazon SageMaker JumpStart through the Amazon SageMaker Studio interface
    • Use code to deploy machine learning models with more flexibility and control
      • deploy their own models with customized settings for their application needs using the ModelBuilder class in the SageMaker AI Python SDK, which provides fine-grained control over various settings, such as instance types, network isolation, and resource allocation.
    • Deploy machine learning models at scale
      • use the AWS SDK for Python (Boto3) and AWS CloudFormation along with your desired Infrastructure as Code (IaC) and CI/CD tools
    • Deploying a model to an endpoint
      • Real-time inference
        • ideal for inference workloads where you have interactive, low latency requirements.
      • Deploy models with Amazon SageMaker Serverless Inference
        • without configuring or managing any of the underlying infrastructure
        • ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts.
      • Asynchronous inference
        • queues incoming requests and processes them asynchronously
        • ideal for requests with large payload sizes (up to 1 GB), long processing times (up to one hour), and near real-time latency requirements
    • Cost optimization
      • Model performance optimization with SageMaker Neo.
        • automatically optimizing models to run in environments like AWS Inferentia chips.
      • Automatic scaling of Amazon SageMaker AI models.
        • Use autoscaling to dynamically adjust the compute resources for your endpoints based on incoming traffic patterns, which helps you optimize costs by only paying for the resources you’re using at a given time.
  • Inference Modes
    • Real-time inference
      • inference workloads where you have real-time, interactive, low latency requirements
    • (On-demand) Serverless inference
      • workloads which have idle periods between traffic spurts and can tolerate cold starts.
    • Asynchronous inference
      • queues incoming requests and processes them asynchronously
      • requests with large payload sizes (up to 1GB), long processing times (up to one hour), and near real-time latency requirements
    • Batch transform (see the transform sketch after the use-case breakdown below)
      • Preprocess datasets to remove noise or bias that interferes with training or inference from your dataset.
      • Get inferences from large datasets.
      • Run inference when you don’t need a persistent endpoint.
      • Associate input records with inferences to help with the interpretation of results.
      • suitable for long-term monitoring and trend analysis
  • Deployment Safeguards
    • Deployment Guardrails (see the rollout sketch after this list)
      • For asynchronous or real-time inference endpoints
      • Controls shifting traffic to new models
      • Blue/Green Deployments
        • All at once: shift everything, monitor, terminate blue fleet
      • Canary
        • allows you to deploy new versions of machine learning models or applications to a small subset of users or traffic
      • Linear
        • Shift traffic in linearly spaced steps
        • does not provide the initial small-scale rollout and evaluation phase that Canary deployment offers
      • Auto-rollbacks
    • Shadow Tests / Shadow deployment
      • just runs the new version alongside the old version for testing without affecting live traffic
      • Compare performance of shadow variant to production
      • You monitor in SageMaker console and decide when to promote it
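
A hedged sketch of a canary blue/green rollout with auto-rollback, using the low-level UpdateEndpoint API (endpoint, config, and alarm names are placeholders):

    import boto3

    sm = boto3.client("sagemaker")
    sm.update_endpoint(
        EndpointName="my-endpoint",
        EndpointConfigName="my-endpoint-config-v2",      # config pointing at the new (green) model
        DeploymentConfig={
            "BlueGreenUpdatePolicy": {
                "TrafficRoutingConfiguration": {
                    "Type": "CANARY",                    # or LINEAR / ALL_AT_ONCE
                    "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                    "WaitIntervalInSeconds": 300,        # bake time before shifting the rest
                },
                "TerminationWaitInSeconds": 600,         # keep the blue fleet around briefly
            },
            "AutoRollbackConfiguration": {
                "Alarms": [{"AlarmName": "my-endpoint-error-alarm"}]  # CloudWatch alarm triggers rollback
            },
        },
    )
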
  • Use case 1 – JumpStart in Studio
    • SageMaker AI feature: Use JumpStart in Studio to accelerate your foundation model deployment.
    • Description: Use the Studio UI to deploy pre-trained models from a catalog to pre-configured inference endpoints. Ideal for citizen data scientists, or for anyone who wants to deploy a model without configuring complex settings.
    • Optimized for: Fast and streamlined deployments of popular open source models
    • Considerations: Lack of customization for container settings and specific application needs
    • Recommended environment: A SageMaker AI domain
  • Use case 2 – ModelBuilder (SageMaker Python SDK)
    • SageMaker AI feature: Deploy models using ModelBuilder from the SageMaker Python SDK.
    • Description: Use the ModelBuilder class from the Amazon SageMaker AI Python SDK to deploy your own model and configure deployment settings. Ideal for experienced data scientists, or for anyone who has their own model to deploy and requires fine-grained control.
    • Optimized for: Deploying your own models
    • Considerations: No UI; requires that you’re comfortable developing and maintaining Python code
    • Recommended environment: A Python development environment configured with your AWS credentials and the SageMaker Python SDK installed, or a SageMaker AI IDE such as SageMaker JupyterLab
  • Use case 3 – AWS CloudFormation / IaC
    • SageMaker AI feature: Deploy and manage models at scale with AWS CloudFormation.
    • Description: Use AWS CloudFormation and Infrastructure as Code (IaC) for programmatic control and automation when deploying and managing SageMaker AI models. Ideal for advanced users who require consistent and repeatable deployments.
    • Optimized for: Ongoing management of models in production
    • Considerations: Requires infrastructure management and organizational resources, plus familiarity with the AWS SDK for Python (Boto3) or AWS CloudFormation templates.
    • Recommended environment: The AWS CLI, a local development environment, and Infrastructure as Code (IaC) and CI/CD tools
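
A minimal Batch Transform sketch (see "Batch transform" above): run inference over a whole dataset in S3 without keeping a persistent endpoint (image, model artifact, and paths are placeholders):

    from sagemaker.model import Model

    model = Model(
        image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-inference-image:latest",
        model_data="s3://my-bucket/model-artifacts/model.tar.gz",
        role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    )

    transformer = model.transformer(
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://my-bucket/batch-predictions/",
    )
    transformer.transform(
        data="s3://my-bucket/batch-input/",
        content_type="text/csv",
        split_type="Line",     # one record per line; outputs can be associated back with inputs
    )
    transformer.wait()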

Inference Pipelines

  • Linear sequence of 2-15 containers
  • Any combination of pre-trained built-in algorithms or your own algorithms in Docker containers
  • Combine pre-processing, predictions, post-processing
  • Spark ML and scikit-learn containers OK
    • Spark ML can be run with Glue or EMR
    • Serialized into MLeap format
  • Can handle both real-time inference and batch transforms

— Quality Monitor —

SageMaker Model Monitor

  • Data Drift is the phenomenon where the distribution of the input data changes over time, potentially leading to degraded model performance
  • Monitors
    • Data quality – Monitor drift in data quality.
      • the statistical properties of input data change over time, resulting in a decrease in model accuracy.
        • for example, changes in patient demographics and conditions reflect a shift in data distribution
      • Solution: use Amazon SageMaker Model Monitor, which continuously monitors the data quality of deployed models. This allows data scientists and machine learning engineers to detect drifts in real-time and take corrective actions.
      • Use the ModelDataQualityMonitor class to establish a baseline for input data quality. Deploy the baseline to SageMaker Model Monitor to track changes in data patterns, and set up CloudWatch notifications for any detected deviations. This monitor tracks the quality and distribution of input data; it does not monitor the importance of features in model predictions. (A baseline + schedule sketch appears at the end of this section.)
    • Model quality – Monitor drift in model quality metrics, such as accuracy.
      • Create a performance baseline using the ModelPerformanceMonitor class. Deploy this baseline to SageMaker Model Monitor and configure it to track the model’s prediction accuracy and trigger alerts if any major variations occur. This monitor tracks only the overall performance of the model (such as accuracy or precision).
    • Bias drift for models in production – Monitor bias in your model’s predictions.
      • a predictive model starts to favor specific groups over others because of changes in data distribution or model parameters. This can lead to imbalanced outcomes.
        • for example, Disproportionate flagging of certain patient groups indicates bias
      • Solution: use Amazon SageMaker Clarify to detect and explain biases in machine learning models. It offers tools to assess and understand bias in the data and model predictions, helping the model maintain fairness across demographic groups.
      • Create a baseline for monitoring bias using the ModelBiasMonitor class. Deploy this baseline to SageMaker Model Monitor, periodically check for bias drift, and set up Amazon CloudWatch alerts when violations are detected. This monitor is mainly for identifying bias in predictions related to specific attributes such as gender or race, not changes in feature attribution between inputs.
    • Feature attribution drift for models in production – Monitor drift in feature attribution.
      • the significance of various features in a predictive model changes over time. This can happen when the relationships between features and outcomes evolve.
        • for example, changing impacts of various health factors
      • Explainability includes the ability to detail which features contribute the most to a model prediction for a specific input, or feature attribution. These details can help determine if a particular model input has more influence than expected on overall model behavior.
      • Solution: use Amazon SageMaker Clarify to assist in understanding feature attribution by offering feature importance scores. This helps data scientists monitor and handle changes in feature importance over time, ensuring that the model continues to generate accurate and reliable predictions.
      • The ModelExplainabilityMonitor leverages SHAP (Shapley Additive Explanations) values to track the contribution of each feature. This feature attribution monitoring ensures that if a model begins to emphasize one input (such as driving history) more than another (like age), it can be detected and trigger alerts in Amazon CloudWatch
    • (NOT) Virtual Drift , typically refers to a situation where the model’s performance degrades due to operational issues, such as infrastructure or deployment changes.
  • Get alerts on quality deviations on your deployed models (via CloudWatch)
  • Visualize data drifts
  • Detect anomalies & outliers
  • Detect new features
  • No code needed
  • Could “analyze the training data and generate a baseline of statistics and constraints from processed data”
  • Integrates with SageMaker Clarify for bias detection and data drift monitoring
  • Data is stored in S3 and secured
  • Monitoring jobs are scheduled via a Monitoring Schedule
  • Metrics are emitted to CloudWatch
  • CloudWatch notifications can be used to trigger alarms
  • You’d then take corrective action (retrain the model, audit the data)
  • Integrates with TensorBoard, QuickSight, Tableau
  • Or just visualize within SageMaker Studio
Drift type                | Meaning                                                                                               | Activity
Data quality drift        | production data distribution differs from the data used for training                                 | look for missing values or errors in the data
Model quality drift       | predictions the model makes differ from the actual Ground Truth labels the model attempts to predict | monitor model performance by comparing predictions with the actual ground truth labels
Bias drift                | introduction of bias due to a change in production data distribution or application                  | watch for statistical changes in the data distribution, even if the data has high quality
Feature attribution drift | ranking of individual features changed from training data to live data                               | detect drift by comparing how the ranking of individual features changed from training data to live data
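
A hedged sketch of the data-quality monitoring flow above: suggest a baseline from training data, then attach an hourly monitoring schedule to a deployed endpoint (this uses the SDK's DefaultModelMonitor class; paths, names, and the role ARN are placeholders):

    from sagemaker.model_monitor import (DefaultModelMonitor, DatasetFormat,
                                         CronExpressionGenerator)

    monitor = DefaultModelMonitor(
        role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
        instance_count=1,
        instance_type="ml.m5.xlarge",
    )

    # 1) Analyze the training data and generate a baseline of statistics and constraints
    monitor.suggest_baseline(
        baseline_dataset="s3://my-bucket/training-data/train.csv",
        dataset_format=DatasetFormat.csv(header=True),
        output_s3_uri="s3://my-bucket/monitoring/baseline/",
    )

    # 2) Schedule recurring comparisons of captured endpoint traffic against the baseline
    monitor.create_monitoring_schedule(
        monitor_schedule_name="my-endpoint-data-quality",
        endpoint_input="my-endpoint",
        output_s3_uri="s3://my-bucket/monitoring/reports/",
        statistics=monitor.baseline_statistics(),
        constraints=monitor.suggested_constraints(),
        schedule_cron_expression=CronExpressionGenerator.hourly(),  # metrics/violations surface in CloudWatch
    )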

SageMaker Clarify

  • detects potential bias, i.e., imbalances across different groups / ages / income brackets
    • “Biases in the training data can lead to inaccurate predictions”
    • works as a “diagnostic” tool, so you can directly address the root cause of inaccuracies in the training data or model
    • help you understand why your ML model made a specific prediction
    • and whether bias impacted this prediction during training or inference
    • Also provides tools to help build less biased and more understandable models
    • and can generate model governance reports for risk and compliance teams and external regulators.
  • To set up a SageMaker Clarify processing job (configuring the processing container, job inputs, outputs, resources, and other parameters), there are two options (see the processing sketch after the bias metrics below):
    • OPTION ONE: use the SageMaker CreateProcessingJob API
    • OPTION TWO: the SageMaker Python SDK API, SageMaker ClarifyProcessor
  • With Model Monitor, you can monitor for bias and be alerted to new potential bias via CloudWatch
  • SageMaker Clarify also helps explain model behavior
    • Understand which features contribute the most to your predictions
    • analyzes the model’s predictions on inference data and explains how input features contributed to them
  • Class Imbalance (CI)
    • Description: Measures the imbalance in the number of members between different facet values.
    • Example question: Could there be age-based biases due to not having enough data for the demographic outside a middle-aged facet?
    • Further example: If one gender is significantly underrepresented, CI will reveal this imbalance, which could lead to biased predictions that favor the more represented class. Addressing such imbalances is important to ensure that the model learns fairly from all data groups.
  • Difference in Proportions of Labels (DPL)
    • Description: Measures the imbalance of positive outcomes between different facet values.
    • Example question: Could there be age-based biases in ML predictions due to biased labeling of facet values in the data?
    • Further example: If a model predicts higher success rates for one racial group compared to another, DPL will highlight this discrepancy, indicating a potential bias in outcomes. Analyzing DPL lets you adjust the model or dataset to reduce such disparities and promote fairness.
  • Kullback-Leibler Divergence (KL)
    • Description: Measures how much the outcome distributions of different facets diverge from each other entropically; i.e., the divergence between two probability distributions, often used to compare the distribution of outcomes across different groups.
    • Example question: How different are the distributions for loan application outcomes for different demographic groups?
    • Further example: Typically used in broader distribution-comparison contexts and might not provide a direct assessment of bias related to demographic attributes.
  • Jensen-Shannon Divergence (JS)
    • Description: Like KL, measures how much the outcome distributions of different facets diverge from each other entropically (a divergence between two probability distributions).
    • Example question: How different are the distributions for loan application outcomes for different demographic groups?
    • Further example: Typically used in broader distribution-comparison contexts and might not provide a direct assessment of bias related to demographic attributes.
  • Lp-norm (LP)
    • Description: Measures a p-norm difference between distinct demographic distributions of the outcomes associated with different facets in a dataset.
    • Example question: How different are the distributions for loan application outcomes for different demographics?
    • Further example: A key metric for identifying how much the predicted outcomes for one group differ from another, helping to assess bias in the model’s predictions.
  • Total Variation Distance (TVD)
    • Description: Measures half of the L1-norm difference between distinct demographic distributions of the outcomes associated with different facets in a dataset; quantifies the divergence in outcome distributions between demographic groups.
    • Example question: How different are the distributions for loan application outcomes for different demographics?
    • Further example: A key metric for identifying how much the predicted outcomes for one group differ from another, helping to assess bias in the model’s predictions.
  • Kolmogorov-Smirnov (KS)
    • Description: Measures the maximum divergence between outcomes in distributions for different facets in a dataset.
    • Example question: Which college application outcomes manifest the greatest disparities by demographic group?
    • Further example: A key metric for identifying how much the predicted outcomes for one group differ from another, helping to assess bias in the model’s predictions.
  • Conditional Demographic Disparity (CDD)
    • Description: Measures the disparity of outcomes between different facets as a whole, but also by subgroups.
    • Example question: Do some groups have a larger proportion of rejections for college admission outcomes than their proportion of acceptances?
    • Further example: A key metric for identifying how much the predicted outcomes for one group differ from another, helping to assess bias in the model’s predictions.
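
A hedged sketch of OPTION TWO above: running a pre-training bias analysis with SageMakerClarifyProcessor from the SageMaker Python SDK (column names, paths, and the role ARN are placeholders):

    from sagemaker import clarify

    processor = clarify.SageMakerClarifyProcessor(
        role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
        instance_count=1,
        instance_type="ml.m5.xlarge",
    )

    data_config = clarify.DataConfig(
        s3_data_input_path="s3://my-bucket/training-data/train.csv",
        s3_output_path="s3://my-bucket/clarify/bias-report/",
        label="approved",                               # target column (hypothetical)
        headers=["age", "income", "gender", "approved"],
        dataset_type="text/csv",
    )

    bias_config = clarify.BiasConfig(
        label_values_or_threshold=[1],                  # the positive outcome
        facet_name="gender",                            # sensitive attribute to analyze (hypothetical)
    )

    # Computes pre-training metrics such as Class Imbalance (CI) and DPL on the raw dataset
    processor.run_pre_training_bias(data_config=data_config, data_bias_config=bias_config)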


== IMPLEMENTATIONS ==

— Docker Container —

SageMaker and Docker Containers

  • All models in SageMaker are hosted in Docker containers
  • Docker containers are created from images
  • Images are built from a Dockerfile
  • Images are saved in a repository
    • Amazon Elastic Container Registry (ECR)
  • Recommended to wrap custom code into a custom Docker container, push it to the Amazon Elastic Container Registry (Amazon ECR), and use it as a processing container in SageMaker AI
  • Amazon SageMaker Containers
    • Library for making containers compatible with SageMaker
    • RUN pip install sagemaker-containers in your Dockerfile
  • Environment variables
    • SAGEMAKER_PROGRAM
      • Run a script inside /opt/ml/code
    • SAGEMAKER_TRAINING_MODULE
    • SAGEMAKER_SERVICE_MODULE
    • SM_MODEL_DIR
    • SM_CHANNELS / SM_CHANNEL_*
    • SM_HPS / SM_HP_*
    • SM_USER_ARGS
  • Production Variants (see the sketch after this list)
    • test out multiple models on live traffic using Production Variants
      • Variant Weights tell SageMaker how to distribute traffic among them
      • So, you could roll out a new iteration of your model at say 10% variant weight
      • Once you’re confident in its performance, ramp it up to 100%
    • do A/B tests, and to validate performance in real-world settings
      • Offline validation isn’t always useful
    • Shadow Variants
    • Deployment Guardrails
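
A hedged boto3 sketch of Production Variants above: serve two models from one endpoint and split traffic roughly 90/10 via variant weights (model and endpoint names are placeholders):

    import boto3

    sm = boto3.client("sagemaker")
    sm.create_endpoint_config(
        EndpointConfigName="my-endpoint-config-ab",
        ProductionVariants=[
            {"VariantName": "current-model",
             "ModelName": "model-v1",
             "InstanceType": "ml.m5.large",
             "InitialInstanceCount": 1,
             "InitialVariantWeight": 0.9},   # ~90% of traffic
            {"VariantName": "challenger-model",
             "ModelName": "model-v2",
             "InstanceType": "ml.m5.large",
             "InitialInstanceCount": 1,
             "InitialVariantWeight": 0.1},   # ~10% of traffic; ramp up once confident
        ],
    )
    sm.create_endpoint(EndpointName="my-ab-endpoint", EndpointConfigName="my-endpoint-config-ab")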

— Machine Learning Operations (MLOps) —

SageMaker Projects

  • set up dependency management, code repository management, build reproducibility, and artifact sharing
  • SageMaker Studio’s native MLOps solution with CI/CD
    • Build images
    • Prep data, feature engineering
    • Train models
    • Evaluate models
    • Deploy models
    • Monitor & update models
  • Uses code repositories for building & deploying ML solutions
  • Uses SageMaker Pipelines for defining steps

SageMaker ML Lineage Tracking (EOL on 2024/25)

  • Model lineage graphs
    • Creates & stores your ML workflow (MLOps)
    • a visualization of your entire ML workflow from data preparation to deployment
    • use entities to represent individual steps in your workflow
  • Keep a running history of your models
  • Tracking for auditing and compliance
  • Automatically or manually-created tracking entities
  • Integrates with AWS Resource Access Manager for cross-account lineage
  • Entities
    • Trial component (processing jobs, training jobs, transform jobs)
    • Trial (a model composed of trial components)
    • Experiment (a group of Trials for a given use case)
    • Context (logical grouping of entities)
    • Action (workflow step, model deployment)
    • Artifact (Object or data, such as an S3 bucket or an image in ECR)
    • Association (connects entities together) – has optional AssociationType:
      • ContributedTo
      • AssociatedWith
      • DerivedFrom
      • Produced
      • SameAs
  • Querying
    • Use the LineageQuery API from Python
      • Part of the Amazon SageMaker SDK for Python
    • Do things like find all models / endpoints / etc. that use a given artifact
    • Produce a visualization
      • Requires external Visualizer helper class

Amazon SageMaker (ML) Pipeline

  • allows you to create, automate, and manage end-to-end ML workflows
  • consists of multiple steps, each representing a specific task or operation in the ML workflow.
    • data processing
    • model training
    • model evaluation
    • model deployment
  • series of interconnected steps in directed acyclic graph (DAG)
    • defined via the UI or with the pipeline definition JSON schema
  • composed of
    • name
    • parameters
    • steps
      • A step property is an attribute of a step that represents the output values from a step execution
      • For a step that references a SageMaker job, the step property matches the attributes of that SageMaker job
        • ProcessingStep
        • TrainingStep
        • TransformStep
        • TuningStep
        • AutoMLStep
        • ModelStep
        • CallbackStep
        • LambdaStep
        • ClarifyCheckStep
        • QualityCheckStep
        • EMRStep
  • @step decorator : custom ML job
    • define custom logic within a pipeline step and ensure it works seamlessly within the workflow
    • example, create a simple python function to do data validation
  • @remote decorator : remote function calls
    • It simplifies the execution of machine learning workflows by allowing you to run code remotely in a SageMaker environment, making it easier to scale your workloads, distribute computations, and leverage SageMaker infrastructure efficiently.
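
A hedged sketch of the @step decorator above (requires a recent SageMaker Python SDK): a plain Python function becomes a pipeline step that runs as its own SageMaker job (the S3 path and role ARN are placeholders):

    from sagemaker.workflow.function_step import step
    from sagemaker.workflow.pipeline import Pipeline

    @step(instance_type="ml.m5.xlarge")
    def validate_data(s3_uri: str) -> str:
        # simple custom validation logic (e.g., schema or null checks) would go here
        print(f"validating {s3_uri}")
        return s3_uri

    step_output = validate_data("s3://my-bucket/raw/data.csv")  # a delayed reference, not run locally

    pipeline = Pipeline(name="DataValidationPipeline", steps=[step_output])
    pipeline.upsert(role_arn="arn:aws:iam::123456789012:role/SageMakerExecutionRole")
    execution = pipeline.start()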

AWS Step Functions

  • create workflows, also called State machines, to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning pipelines
  • based on state machines and tasks. In Step Functions, state machines are called workflows, which are a series of event-driven steps. Each step in a workflow is called a state. For example, a Task state represents a unit of work that another AWS service performs, such as calling another AWS service or API. Instances of running workflows performing tasks are called executions in Step Functions.
  • The work in your state machine tasks can also be done using Activities which are workers that exist outside of Step Functions.
Integrated service         | Request Response   | Run a Job – .sync | Wait for Callback – .waitForTaskToken
Amazon API Gateway         | Standard & Express | Not supported     | Standard
Amazon Athena              | Standard & Express | Standard          | Not supported
AWS Batch                  | Standard & Express | Standard          | Not supported
Amazon Bedrock             | Standard & Express | Standard          | Standard
AWS CodeBuild              | Standard & Express | Standard          | Not supported
Amazon DynamoDB            | Standard & Express | Not supported     | Not supported
Amazon ECS/Fargate         | Standard & Express | Standard          | Standard
Amazon EKS                 | Standard & Express | Standard          | Standard
Amazon EMR                 | Standard & Express | Standard          | Not supported
Amazon EMR on EKS          | Standard & Express | Standard          | Not supported
Amazon EMR Serverless      | Standard & Express | Standard          | Not supported
Amazon EventBridge         | Standard & Express | Not supported     | Standard
AWS Glue                   | Standard & Express | Standard          | Not supported
AWS Glue DataBrew          | Standard & Express | Standard          | Not supported
AWS Lambda                 | Standard & Express | Not supported     | Standard
AWS Elemental MediaConvert | Standard & Express | Standard          | Not supported
Amazon SageMaker AI        | Standard & Express | Standard          | Not supported
Amazon SNS                 | Standard & Express | Not supported     | Standard
Amazon SQS                 | Standard & Express | Not supported     | Standard
AWS Step Functions         | Standard & Express | Standard          | Standard
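
A hedged boto3 sketch of a one-state Standard workflow that runs a SageMaker training job with the .sync ("Run a Job") integration, so the state waits for the job to finish (ARNs, names, and S3 paths are placeholders):

    import json
    import boto3

    definition = {
        "StartAt": "TrainModel",
        "States": {
            "TrainModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                "Parameters": {
                    "TrainingJobName.$": "$.TrainingJobName",
                    "AlgorithmSpecification": {
                        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
                        "TrainingInputMode": "File",
                    },
                    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
                    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/model-artifacts/"},
                    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.m5.xlarge", "VolumeSizeInGB": 10},
                    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
                },
                "End": True,
            }
        },
    }

    sfn = boto3.client("stepfunctions")
    sfn.create_state_machine(
        name="train-model-workflow",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",
        type="STANDARD",    # per the table above, the SageMaker .sync integration needs a Standard workflow
    )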

— Kubernetes —

Amazon SageMaker Operators for Kubernetes

  • Integrates SageMaker with Kubernetes-based ML infrastructure
  • Components for Kubeflow Pipelines
  • Enables hybrid ML workflows (on-prem + cloud)
  • Enables integration of existing ML platforms built on Kubernetes / Kubeflow

— MISC —

Managing SageMaker Resources

  • In general, algorithms that rely on deep learning will benefit from GPU instances (P3, g4dn) for training
  • Inference is usually less demanding and you can often get away with compute instances there (C5)
  • Can use EC2 Spot instances for training
    • Save up to 90% over on-demand instances
  • Spot instances can be interrupted!
    • Use checkpoints to S3 so training can resume
  • Can increase training time as you need to wait for spot instances to become available
  • Elastic Inference (EI) / Amazon SageMaker Inference
    • attach just the right amount of GPU-powered acceleration to any Amazon EC2 and Amazon SageMaker instance
    • Accelerates deep learning inference
      • At fraction of cost of using a GPU instance for inference
    • EI accelerators may be added alongside a CPU instance
      • ml.eia1.medium / large / xlarge
    • EI accelerators may also be applied to notebooks
    • Works with Tensorflow, PyTorch, and MXNet pre-built containers
      • ONNX may be used to export models to MXNet
    • Works with custom containers built with EI-enabled Tensorflow, PyTorch, or MXNet
    • Works with Image Classification and Object Detection built-in algorithms
  • Automatic Scaling
    • “Real-time endpoints” typically keep instances running even when there is no traffic; so it’s good to solve the peak traffic, but not the best cost-saving
  • Serverless Inference (see the deployment sketch after this list)
    • Specify your container, memory requirement, concurrency requirements
    • Underlying capacity is automatically provisioned and scaled
    • Good for infrequent or unpredictable traffic; will scale down to zero when there are no requests
      • a cost-effective option for workloads with intermittent or unpredictable traffic
      • “Provisioned Concurrency” ensures that the model is always available, with extra cost for the reserved capacity even unused; but this can prevent the “cold-start” issue
    • Charged based on usage
    • Monitor via CloudWatch
      • ModelSetupTime, Invocations, MemoryUtilization
  • Amazon SageMaker Inference Recommender – runs load tests and recommends instance types and configurations for deploying your model
  • Availability Zones
    • automatically attempts to distribute instances across availability zones
    • Deploy multiple instances for each production endpoint
    • Configure VPCs with at least two subnets, each in a different AZ
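
A hedged sketch of Serverless Inference above: deploy an existing model with memory and concurrency settings, letting capacity scale to zero between requests (image, artifact path, and role ARN are placeholders):

    from sagemaker.model import Model
    from sagemaker.serverless import ServerlessInferenceConfig

    model = Model(
        image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-inference-image:latest",
        model_data="s3://my-bucket/model-artifacts/model.tar.gz",
        role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    )

    predictor = model.deploy(
        serverless_inference_config=ServerlessInferenceConfig(
            memory_size_in_mb=2048,   # memory requirement
            max_concurrency=5,        # concurrency requirement
        )
    )
    # ModelSetupTime, Invocations, and MemoryUtilization then appear in CloudWatch.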


== SECURITY ==

SageMaker Security

  • Use Identity and Access Management (IAM)
    • User permissions for:
      • CreateTrainingJob
      • CreateModel
      • CreateEndpointConfig
      • CreateTransformJob
      • CreateHyperParameterTuningJob
      • CreateNotebookInstance
      • UpdateNotebookInstance
    • Predefined policies:
      • AmazonSageMakerReadOnly
      • AmazonSageMakerFullAccess
      • AdministratorAccess
      • DataScientist
  • Set up user accounts with only the permissions they need
  • Use MFA
  • Use SSL/TLS when connecting to anything
  • Use CloudTrail to log API and user activity
  • Use encryption (see the secure training-job sketch at the end of this section)
  • AWS Key Management Service (KMS)
    • Accepted by notebooks and all SageMaker jobs
      • Training, tuning, batch transform, endpoints
      • Notebooks and everything under /opt/ml/ and /tmp can be encrypted with a KMS key
  • S3
    • Can use encrypted S3 buckets for training data and hosting models
    • S3 can also use KMS
  • Protecting Data in Transit
    • All traffic supports TLS / SSL
    • IAM roles are assigned to SageMaker to give it permissions to access resources
    • Inter-node training communication may be optionally encrypted
    • Can increase training time and cost with deep learning
    • AKA inter-container traffic encryption
    • Enabled via console or API when setting up a training or tuning job
  • VPC
    • Training jobs run in a Virtual Private Cloud (VPC)
    • You can use a private VPC for even more security
      • You’ll need to set up S3 VPC endpoints
      • Custom endpoint policies and S3 bucket policies can keep this secure
    • Notebooks are Internet-enabled by default
      • If disabled, your VPC needs
        • an interface endpoint (PrivateLink) and allow outbound connections (to other AWS services, like S3, AWS Comprehend), for training and hosting to work
        • or NAT Gateway, and allow outbound connections (to Internet), for training and hosting to work
    • Training and Inference Containers are also Internet-enabled by default
      • Network isolation is an option, but this also prevents S3 access
        • Enable the SageMaker parameter EnableNetworkIsolation for the notebook instances; so the instances wouldn’t be accessible from the Internet
  • Logging and Monitoring
    • CloudWatch can log, monitor and alarm on:
      • Invocations and latency of endpoints
      • Health of instance nodes (CPU, memory, etc)
      • Ground Truth (active workers, how much they are doing)
    • CloudTrail records actions from users, roles, and services within SageMaker
      • Log files delivered to S3 for auditing
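
A hedged sketch tying together the controls above for a single training job: KMS encryption of outputs and volumes, inter-container traffic encryption, private VPC subnets, and network isolation (key ARNs, subnet/security-group IDs, and other names are placeholders):

    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
        role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
        instance_count=2,
        instance_type="ml.p3.2xlarge",
        output_path="s3://my-encrypted-bucket/model-artifacts/",
        output_kms_key="arn:aws:kms:us-east-1:123456789012:key/placeholder-key-id",  # encrypt model artifacts
        volume_kms_key="arn:aws:kms:us-east-1:123456789012:key/placeholder-key-id",  # encrypt ML storage volumes
        encrypt_inter_container_traffic=True,   # inter-node training communication (adds time/cost)
        subnets=["subnet-aaaa1111", "subnet-bbbb2222"],       # private VPC subnets (need S3 VPC endpoints)
        security_group_ids=["sg-cccc3333"],
        enable_network_isolation=True,          # container gets no outbound network access
    )
    estimator.fit({"train": "s3://my-encrypted-bucket/training-data/"})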


== COST OPTIMISATION ==

AWS Cost Management

  • Operations for Cost Allocation Tracking/Analysis
    • STEP ONE: create user-defined tags with key-value pairs that reflect attributes such as project names or departments to ensure proper categorization of resources
    • STEP TWO: apply these tags to the relevant resources to enable tracking
    • STEP THREE:  enable the cost allocation tags in the Billing console
    • (AFTER) STEP FOUR: Configure tag-based cost and usage reports (AWS Cost Allocation Reports) for detailed analysis in Cost Explorer
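
A hedged sketch of the steps above: tag a SageMaker resource, then activate the tag keys as cost allocation tags so they appear in Cost Explorer (the ARN and tag values are placeholders; activation can also be done in the Billing console):

    import boto3

    sm = boto3.client("sagemaker")
    sm.add_tags(                                   # STEP ONE/TWO: define and apply user-defined tags
        ResourceArn="arn:aws:sagemaker:us-east-1:123456789012:endpoint/my-endpoint",
        Tags=[{"Key": "project", "Value": "churn-model"},
              {"Key": "department", "Value": "data-science"}],
    )

    ce = boto3.client("ce")
    ce.update_cost_allocation_tags_status(         # STEP THREE: activate the tags for cost allocation
        CostAllocationTagsStatus=[
            {"TagKey": "project", "Status": "Active"},
            {"TagKey": "department", "Status": "Active"},
        ]
    )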