17. ML – SageMaker Related

== IMPLEMENTATIONS ==

Model Customization

  • Steps
    • Prepare a labeled dataset and, if needed, a validation dataset. Ensure the training data is in the required format, such as JSON Lines (JSONL), for structured input and output pairs.
    • Configure IAM permissions to access the S3 buckets containing your data. You can either use an existing IAM role or let the console create a new one with the necessary permissions.
    • Optionally set up KMS keys and/or a VPC for additional security to protect your data and secure communication.
    • Start a training job by either fine-tuning a model on your dataset or continuing pre-training with additional data. Adjust hyperparameters to optimize performance.
    • Buy Provisioned Throughput for the fine-tuned model to support high-throughput deployment and handle the expected load.
    • Deploy the customized model and use it for inference tasks in Amazon Bedrock. The model will now have enhanced capabilities tailored to your specific needs.
  • In summary
    • Prepare a labeled dataset in JSONL format with fraud examples
    • Adjust hyperparameters and create a Fine-tuning job
    • Analyze the results by reviewing training or validation metrics
    • Purchase provisioned throughput for the fine-tuned model
    • Use the customized model in Amazon Bedrock tasks
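  • Example: a minimal, hedged sketch of the flow with boto3 (bucket, role, and model names are illustrative; hyperparameter keys vary by base model):

      import json
      import boto3

      # 1. Each training record is a JSON Lines entry: a prompt/completion pair
      record = {"prompt": "Transaction: $9,400 wire transfer at 3am from a new device",
                "completion": "fraud"}
      with open("train.jsonl", "w") as f:
          f.write(json.dumps(record) + "\n")
      # ... upload train.jsonl to S3, e.g. s3://my-bucket/train.jsonl ...

      # 2. Start the fine-tuning job in Bedrock
      bedrock = boto3.client("bedrock")
      bedrock.create_model_customization_job(
          jobName="fraud-ft-job",
          customModelName="fraud-detector-v1",
          roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
          baseModelIdentifier="amazon.titan-text-express-v1",
          customizationType="FINE_TUNING",
          trainingDataConfig={"s3Uri": "s3://my-bucket/train.jsonl"},
          outputDataConfig={"s3Uri": "s3://my-bucket/output/"},
          hyperParameters={"epochCount": "2", "learningRate": "0.00001"},
      )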

— Docker Container —

SageMaker and Docker Containers

  • All models in SageMaker are hosted in Docker containers
  • Docker containers are created from images
  • Images are built from a Dockerfile
  • Images are saved in a repository
    • Amazon Elastic Container Registry (ECR)
  • Recommended: wrap custom code in a custom Docker container, push it to the Amazon Elastic Container Registry (Amazon ECR), and use it as a processing container in SageMaker AI
  • Amazon SageMaker Containers
    • Library for making containers compatible with SageMaker
    • RUN pip install sagemaker-containers in your Dockerfile
  • Environment variables
    • SAGEMAKER_PROGRAM
      • Run a script inside /opt/ml/code
    • SAGEMAKER_TRAINING_MODULE / SAGEMAKER_SERVICE_MODULE
      • Framework entry points for training / serving
    • SM_MODEL_DIR
      • Directory where model artifacts must be saved (/opt/ml/model)
    • SM_CHANNELS / SM_CHANNEL_*
      • Locations of the input data channels (e.g. train, test)
    • SM_HPS / SM_HP_*
      • Hyperparameters passed to the job
    • SM_USER_ARGS
      • Additional command-line arguments
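    • Example: a minimal training-script sketch reading these variables inside the container (the channel name "train" and the paths are SageMaker defaults; the training logic is a placeholder):

        import json
        import os

        # Directory where SageMaker expects the trained model artifacts
        model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")

        # Input data mounted for the "train" channel
        train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")

        # All hyperparameters for the job, as a JSON dictionary
        hps = json.loads(os.environ.get("SM_HPS", "{}"))

        print(f"Training on {train_dir} with {hps}; saving to {model_dir}")
        # ... train and write artifacts under model_dir ...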
  • Production Variants
    • Test out multiple models on live traffic using Production Variants
      • Variant Weights tell SageMaker how to distribute traffic among them
      • So, you could roll out a new iteration of your model at say 10% variant weight
      • Once you’re confident in its performance, ramp it up to 100%
    • Do A/B tests and validate performance in real-world settings
      • Offline validation isn’t always useful
    • Shadow Variants: mirror a copy of live traffic to a new model version and log its responses for comparison, without returning them to callers
    • Deployment Guardrails: controlled blue/green deployments with all-at-once, canary, or linear traffic shifting and automatic rollback on alarms
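    • Example: a hedged boto3 sketch of a 90/10 traffic split between two model versions (endpoint-config and model names are illustrative):

        import boto3

        sm = boto3.client("sagemaker")

        # Two variants behind one endpoint; weights control the traffic split
        sm.create_endpoint_config(
            EndpointConfigName="my-endpoint-config",
            ProductionVariants=[
                {"VariantName": "current-model",
                 "ModelName": "model-v1",        # existing SageMaker model
                 "InstanceType": "ml.c5.large",
                 "InitialInstanceCount": 1,
                 "InitialVariantWeight": 0.9},   # ~90% of traffic
                {"VariantName": "new-model",
                 "ModelName": "model-v2",
                 "InstanceType": "ml.c5.large",
                 "InitialInstanceCount": 1,
                 "InitialVariantWeight": 0.1},   # ~10% of traffic
            ],
        )
        # Later, shift traffic without redeploying:
        # sm.update_endpoint_weights_and_capacities(...)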

— Machine Learning Operations (MLOps) —

SageMaker Projects

  • set up dependency management, code repository management, build reproducibility, and artifact sharing
  • SageMaker Studio’s native MLOps solution with CI/CD
    • Build images
    • Prep data, feature engineering
    • Train models
    • Evaluate models
    • Deploy models
    • Monitor & update models
  • Uses code repositories for building & deploying ML solutions
  • Uses SageMaker Pipelines to define the workflow steps

SageMaker ML Lineage Tracking (EOL on 2024/25)

  • Model lineage graphs
    • Creates & stores your ML workflow (MLOps)
    • a visualization of your entire ML workflow from data preparation to deployment
    • use entities to represent individual steps in your workflow
  • Keep a running history of your models
  • Tracking for auditing and compliance
  • Automatically or manually-created tracking entities
  • Integrates with AWS Resource Access Manager for cross-account lineage
  • Entities
    • Trial component (processing jobs, training jobs, transform jobs)
    • Trial (a model composed of trial components)
    • Experiment (a group of Trials for a given use case)
    • Context (logical grouping of entities)
    • Action (workflow step, model deployment)
    • Artifact (Object or data, such as an S3 bucket or an image in ECR)
    • Association (connects entities together) – has optional AssociationType:
      • ContributedTo
      • AssociatedWith
      • DerivedFrom
      • Produced
      • SameAs
  • Querying
    • Use the LineageQuery API from Python
      • Part of the Amazon SageMaker SDK for Python
    • Do things like find all models / endpoints / etc. that use a given artifact
    • Produce a visualization
      • Requires external Visualizer helper class
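    • Example: a hedged sketch with the SageMaker Python SDK’s lineage module (the artifact ARN is illustrative):

        import sagemaker
        from sagemaker.lineage.query import (
            LineageEntityEnum,
            LineageFilter,
            LineageQuery,
            LineageQueryDirectionEnum,
            LineageSourceEnum,
        )

        query = LineageQuery(sagemaker.Session())

        # Find all model artifacts downstream of a given dataset artifact
        response = query.query(
            start_arns=["arn:aws:sagemaker:us-east-1:123456789012:artifact/abc123"],
            query_filter=LineageFilter(
                entities=[LineageEntityEnum.ARTIFACT],
                sources=[LineageSourceEnum.MODEL],
            ),
            direction=LineageQueryDirectionEnum.DESCENDANTS,
        )
        for vertex in response.vertices:
            print(vertex.arn, vertex.lineage_source)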

Amazon SageMaker (ML) Pipeline

  • allows you to create, automate, and manage end-to-end ML workflows
  • consists of multiple steps, each representing a specific task or operation in the ML workflow.
    • data processing
    • model training
    • model evaluation
    • model deployment
  • series of interconnected steps in directed acyclic graph (DAG)
    • Defined via the UI or with the pipeline definition JSON schema
  • composed of
    • name
    • parameters
    • steps
      • A step property is an attribute of a step that represents the output values from a step execution
      • For a step that references a SageMaker job, the step property matches the attributes of that SageMaker job
        • ProcessingStep
        • TrainingStep
        • TransformStep
        • TuningStep
        • AutoMLStep
        • ModelStep
        • CallbackStep
        • LambdaStep
        • ClarifyCheckStep
        • QualityCheckStep
        • EMRStep
  • @step decorator : custom ML job
    • define custom logic within a pipeline step and ensure it works seamlessly within the workflow
    • For example, create a simple Python function to do data validation (see the sketch below)
  • @remote decorator : remote function calls
    • Simplifies ML workflows by letting you run code remotely in a SageMaker environment, making it easier to scale workloads, distribute computation, and use SageMaker infrastructure efficiently
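  • Example: a hedged sketch of the @step decorator turning a custom Python function into a pipeline step (pipeline and bucket names are illustrative):

      from sagemaker.workflow.function_step import step
      from sagemaker.workflow.pipeline import Pipeline

      # Custom logic becomes a pipeline step via the @step decorator
      @step(name="validate-data", instance_type="ml.m5.large")
      def validate(s3_uri: str) -> bool:
          # ... simple data validation, e.g. check row counts / schema ...
          return True

      # The (delayed) return value of the decorated call is used as a step
      pipeline = Pipeline(
          name="my-pipeline",
          steps=[validate("s3://my-bucket/input/")],
      )
      # pipeline.upsert(role_arn="arn:aws:iam::123456789012:role/SageMakerRole")
      # pipeline.start()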

AWS Step Functions

  • create workflows, also called State machines, to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning pipelines
  • Based on state machines and tasks
    • In Step Functions, state machines are called workflows: a series of event-driven steps
    • Each step in a workflow is called a state
    • A Task state represents a unit of work that another AWS service performs, such as calling another AWS service or API
    • Instances of running workflows performing tasks are called executions
  • The work in your state machine tasks can also be done using Activities which are workers that exist outside of Step Functions.
  • Service integration patterns (cells show which workflow types support each pattern; Express workflows do not support .sync or .waitForTaskToken):

    Integrated service          | Request Response   | Run a Job (.sync) | Wait for Callback (.waitForTaskToken)
    Amazon API Gateway          | Standard & Express | Not supported     | Standard
    Amazon Athena               | Standard & Express | Standard          | Not supported
    AWS Batch                   | Standard & Express | Standard          | Not supported
    Amazon Bedrock              | Standard & Express | Standard          | Standard
    AWS CodeBuild               | Standard & Express | Standard          | Not supported
    Amazon DynamoDB             | Standard & Express | Not supported     | Not supported
    Amazon ECS/Fargate          | Standard & Express | Standard          | Standard
    Amazon EKS                  | Standard & Express | Standard          | Standard
    Amazon EMR                  | Standard & Express | Standard          | Not supported
    Amazon EMR on EKS           | Standard & Express | Standard          | Not supported
    Amazon EMR Serverless       | Standard & Express | Standard          | Not supported
    Amazon EventBridge          | Standard & Express | Not supported     | Standard
    AWS Glue                    | Standard & Express | Standard          | Not supported
    AWS Glue DataBrew           | Standard & Express | Standard          | Not supported
    AWS Lambda                  | Standard & Express | Not supported     | Standard
    AWS Elemental MediaConvert  | Standard & Express | Standard          | Not supported
    Amazon SageMaker AI         | Standard & Express | Standard          | Not supported
    Amazon SNS                  | Standard & Express | Not supported     | Standard
    Amazon SQS                  | Standard & Express | Not supported     | Standard
    AWS Step Functions          | Standard & Express | Standard          | Standard
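  • Example: a hedged sketch of a state machine (definition written as a Python dict) that uses the Run a Job (.sync) pattern to run a SageMaker training job and wait for it to finish (ARNs and names are illustrative):

      import json
      import boto3

      definition = {
          "StartAt": "TrainModel",
          "States": {
              "TrainModel": {
                  "Type": "Task",
                  # ".sync" = Run a Job: the state waits for the training job to complete
                  "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                  "Parameters": {
                      "TrainingJobName.$": "$.jobName",
                      "AlgorithmSpecification": {
                          "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
                          "TrainingInputMode": "File",
                      },
                      "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
                      "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},
                      "ResourceConfig": {
                          "InstanceType": "ml.m5.large",
                          "InstanceCount": 1,
                          "VolumeSizeInGB": 10,
                      },
                      "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
                  },
                  "End": True,
              }
          },
      }

      boto3.client("stepfunctions").create_state_machine(
          name="train-model-workflow",
          definition=json.dumps(definition),
          roleArn="arn:aws:iam::123456789012:role/StepFunctionsRole",
      )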

— Kubernetes —

Amazon SageMaker Operators for Kubernetes

  • Integrates SageMaker with Kubernetes-based ML infrastructure
  • Components for Kubeflow Pipelines
  • Enables hybrid ML workflows (on-prem + cloud)
  • Enables integration of existing ML platforms built on Kubernetes / Kubeflow


== MISC ==

Managing SageMaker Resources

  • In general, algorithms that rely on deep learning will benefit from GPU instances (P3, g4dn) for training
  • Inference is usually less demanding, and you can often get away with compute-optimized instances there (C5)
  • Can use EC2 Spot instances for training
    • Save up to 90% over on-demand instances
  • Spot instances can be interrupted!
    • Use checkpoints to S3 so training can resume
  • Can increase training time as you need to wait for spot instances to become available
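  • Example: a hedged SageMaker Python SDK sketch of Managed Spot Training with S3 checkpointing (image, role, and bucket are illustrative):

      from sagemaker.estimator import Estimator

      estimator = Estimator(
          image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
          role="arn:aws:iam::123456789012:role/SageMakerRole",
          instance_type="ml.p3.2xlarge",
          instance_count=1,
          use_spot_instances=True,          # request Spot capacity
          max_run=3600,                     # max training time (seconds)
          max_wait=7200,                    # training + Spot-waiting time; must be >= max_run
          checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # resume point after interruption
          output_path="s3://my-bucket/output/",
      )
      # estimator.fit({"train": "s3://my-bucket/train/"})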
  • Elastic Inference (EI) / Amazon SageMaker Inference
    • Attach just the right amount of GPU-powered acceleration to any Amazon EC2 or Amazon SageMaker instance
    • Accelerates deep learning inference
      • At fraction of cost of using a GPU instance for inference
    • EI accelerators may be added alongside a CPU instance
      • ml.eia1.medium / large / xlarge
    • EI accelerators may also be applied to notebooks
    • Works with TensorFlow, PyTorch, and MXNet pre-built containers
      • ONNX may be used to export models to MXNet
    • Works with custom containers built with EI-enabled TensorFlow, PyTorch, or MXNet
    • Works with Image Classification and Object Detection built-in algorithms
  • Automatic Scaling
    • “Real-time endpoints” typically keep instances running even when there is no traffic; good for handling peak load, but not the most cost-effective option
  • Serverless Inference
    • Specify your container, memory requirement, concurrency requirements
    • Underlying capacity is automatically provisioned and scaled
    • Good for infrequent or unpredictable traffic; scales down to zero when there are no requests
      • A cost-effective option for workloads with intermittent or unpredictable traffic
      • “Provisioned Concurrency” keeps the model always available, at extra cost for reserved capacity even when unused, but prevents the “cold start” issue
    • Charged based on usage
    • Monitor via CloudWatch
      • ModelSetupTime, Invocations, MemoryUtilization
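    • Example: a hedged boto3 sketch of a serverless endpoint config (names are illustrative):

        import boto3

        sm = boto3.client("sagemaker")

        sm.create_endpoint_config(
            EndpointConfigName="my-serverless-config",
            ProductionVariants=[{
                "VariantName": "AllTraffic",
                "ModelName": "model-v1",
                "ServerlessConfig": {
                    "MemorySizeInMB": 2048,   # memory for the container
                    "MaxConcurrency": 5,      # max concurrent invocations
                    # "ProvisionedConcurrency": 1,  # optional: keep capacity warm to avoid cold starts
                },
            }],
        )
        sm.create_endpoint(EndpointName="my-serverless-endpoint",
                           EndpointConfigName="my-serverless-config")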
  • Amazon SageMaker Inference Recommender
    • Benchmarks your model on different instance types and recommends the best instance type and configuration for deployment
  • Availability Zones
    • SageMaker automatically attempts to distribute instances across Availability Zones
    • Deploy multiple instances for each production endpoint
    • Configure VPCs with at least two subnets, each in a different AZ


== SECURITY ==

SageMaker Security

  • Use Identity and Access Management (IAM)
    • User permissions for:
      • CreateTrainingJob
      • CreateModel
      • CreateEndpointConfig
      • CreateTransformJob
      • CreateHyperParameterTuningJob
      • CreateNotebookInstance
      • UpdateNotebookInstance
    • Predefined policies:
      • AmazonSageMakerReadOnly
      • AmazonSageMakerFullAccess
      • AdministratorAccess
      • DataScientist
  • Set up user accounts with only the permissions they need
  • Use MFA
  • Use SSL/TLS when connecting to anything
  • Use CloudTrail to log API and user activity
  • Use encryption
  • AWS Key Management Service (KMS)
    • Accepted by notebooks and all SageMaker jobs
      • Training, tuning, batch transform, endpoints
      • Notebooks and everything under /opt/ml/ and /tmp can be encrypted with a KMS key
  • S3
    • Can use encrypted S3 buckets for training data and hosting models
    • S3 can also use KMS
  • Protecting Data in Transit
    • All traffic supports TLS / SSL
    • IAM roles are assigned to SageMaker to give it permissions to access resources
    • Inter-node training communication may be optionally encrypted
    • Can increase training time and cost with deep learning
    • AKA inter-container traffic encryption
    • Enabled via console or API when setting up a training or tuning job
  • VPC
    • Training jobs run in a Virtual Private Cloud (VPC)
    • You can use a private VPC for even more security
      • You’ll need to set up S3 VPC endpoints
      • Custom endpoint policies and S3 bucket policies can keep this secure
    • Notebooks are Internet-enabled by default
      • If disabled, your VPC needs
        • an interface endpoint (PrivateLink) and allow outbound connections (to other AWS services, like S3, AWS Comprehend), for training and hosting to work
        • or NAT Gateway, and allow outbound connections (to Internet), for training and hosting to work
    • Training and Inference Containers are also Internet-enabled by default
      • Network isolation is an option, but this also prevents S3 access
        • Set the EnableNetworkIsolation parameter when creating training jobs or models; the containers then cannot make any outbound network calls (see the sketch at the end of this section)
  • Logging and Monitoring
    • CloudWatch can log, monitor and alarm on:
      • Invocations and latency of endpoints
      • Health of instance nodes (CPU, memory, etc)
      • Ground Truth (active workers, how much they are doing)
    • CloudTrail records actions from users, roles, and services within SageMaker
      • Log files delivered to S3 for auditing
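  • Example: a hedged boto3 sketch combining the options above in one training job (KMS key, VPC, and role ARNs are illustrative):

      import boto3

      boto3.client("sagemaker").create_training_job(
          TrainingJobName="secure-training-job",
          AlgorithmSpecification={
              "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
              "TrainingInputMode": "File",
          },
          RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
          OutputDataConfig={
              "S3OutputPath": "s3://my-bucket/output/",
              "KmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/abcd-1234",   # encrypt artifacts
          },
          ResourceConfig={
              "InstanceType": "ml.m5.large",
              "InstanceCount": 2,
              "VolumeSizeInGB": 10,
              "VolumeKmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/abcd-1234",  # encrypt volumes
          },
          VpcConfig={   # run inside a private VPC
              "SecurityGroupIds": ["sg-0123456789abcdef0"],
              "Subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],  # two subnets, different AZs
          },
          EnableInterContainerTrafficEncryption=True,  # encrypt inter-node training traffic
          EnableNetworkIsolation=True,                 # no outbound calls from the container
          StoppingCondition={"MaxRuntimeInSeconds": 3600},
      )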


== COST OPTIMISATION ==

AWS Cost Management

  • Operations for Cost Allocation Tracking/Analysis
    • STEP ONE: create user-defined tags with key-value pairs that reflect attributes such as project names or departments to ensure proper categorization of resources
    • STEP TWO: apply these tags to the relevant resources to enable tracking
    • STEP THREE: enable the cost allocation tags in the Billing console
    • STEP FOUR (afterwards): configure tag-based cost and usage reports (AWS Cost Allocation Reports) for detailed analysis in Cost Explorer
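  • Example: a hedged boto3 sketch of steps one and two for a SageMaker endpoint (the ARN and tag values are illustrative); steps three and four happen in the Billing console / Cost Explorer:

      import boto3

      boto3.client("sagemaker").add_tags(
          ResourceArn="arn:aws:sagemaker:us-east-1:123456789012:endpoint/my-endpoint",
          Tags=[
              {"Key": "project", "Value": "fraud-detection"},
              {"Key": "department", "Value": "risk"},
          ],
      )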