16. ML – Amazon SageMaker


SageMaker Notebooks

  • Notebooks direct the process
  • (Jupyter) Notebook Instances on EC2 are spun up from the console
    • S3 data access (see the pandas snippet after this list)
    • Using scikit-learn, Spark, TensorFlow
    • Wide variety of built-in models
    • Ability to spin up training instances
    • Ability to deploy trained models for making predictions at scale
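
As a quick illustration of the S3 data access bullet above, here is a minimal sketch of loading a dataset into pandas from inside a notebook instance. The bucket and key are hypothetical, and it assumes the notebook's IAM role can read the object and that s3fs is installed (typical on SageMaker kernels).

```python
import pandas as pd

# Hypothetical bucket/key; the notebook instance's IAM role needs s3:GetObject.
# pandas resolves s3:// URLs through s3fs, which SageMaker kernels usually include.
df = pd.read_csv("s3://my-example-bucket/raw/churn.csv")
print(df.shape)  # quick sanity check before feature engineering
```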


Data prep on SageMaker

  • Data Sources
    • usually comes from S3
      • Ideal format varies with algorithm – often it is RecordIO / Protobuf (conversion sketched after this list)
        • LibSVM is a format for representing data in a simple, text-based, and sparse format, primarily used for support vector machines (SVM).
        • RecordIO/Protobuf is a binary data storage format that allows for efficient storage and retrieval of large datasets, commonly used in deep learning and data processing.
    • also ingest from Athena, EMR, Redshift, and Amazon Keyspaces DB
  • Apache Spark integrates with SageMaker
  • More on this later…
  • scikit-learn, numpy, and pandas are all at your disposal within a notebook
  • Processing (example; sketched in code after this list)
    • Copy data from S3
    • Spin up a processing container
      • SageMaker built-in or user provided
    • Output processed data to S3
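
A minimal sketch of converting a dense numpy dataset to RecordIO/Protobuf and staging it in S3, using the SageMaker SDK helper write_numpy_to_dense_tensor; the bucket name and key are placeholders.

```python
import io

import boto3
import numpy as np
from sagemaker.amazon.common import write_numpy_to_dense_tensor

# Toy dense features and labels standing in for real prepped data.
X = np.random.rand(1000, 10).astype("float32")
y = np.random.randint(0, 2, size=1000).astype("float32")

# Serialize to RecordIO-protobuf in memory, then upload to S3 (placeholder bucket/key).
buf = io.BytesIO()
write_numpy_to_dense_tensor(buf, X, y)
buf.seek(0)
boto3.resource("s3").Bucket("my-example-bucket").Object(
    "train/data.recordio"
).upload_fileobj(buf)
```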
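
The copy-in / process / copy-out flow above maps directly onto SageMaker Processing. Below is a minimal sketch using the SDK's SKLearnProcessor (a SageMaker-built scikit-learn container); the role ARN, bucket paths, and preprocess.py script are placeholders.

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    framework_version="1.2-1",  # one of the published scikit-learn container versions
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# Copies raw data in from S3, runs preprocess.py inside the container,
# and writes whatever the script leaves in the output dir back to S3.
processor.run(
    code="preprocess.py",  # hypothetical script holding your pandas/scikit-learn logic
    inputs=[ProcessingInput(source="s3://my-example-bucket/raw",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-example-bucket/processed")],
)
```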

Training/Deployment on SageMaker

  • Create a training job (Estimator sketch at the end of this section)
    • URL of S3 bucket with training data
    • ML compute resources
    • URL of S3 bucket for output
    • ECR path to training code
  • Training options
    • Built-in training algorithms
    • Spark MLlib
    • Custom Python TensorFlow / MXNet code
    • PyTorch, scikit-learn, RLEstimator
    • XGBoost, Hugging Face, Chainer
  • Save your trained model to S3
  • Can deploy two ways (both sketched at the end of this section):
    • Persistent endpoint for making individual predictions on demand
    • SageMaker Batch Transform to get predictions for an entire dataset
  • Lots of cool options
    • Inference Pipelines for more complex processing
    • SageMaker Neo for deploying to edge devices
    • Elastic Inference for accelerating deep learning models
    • Automatic scaling (increases the # of instances behind an endpoint as needed)
    • Shadow Testing evaluates new models against the currently deployed model (a copy of live traffic goes to the shadow variant) to catch errors before promoting to production
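
To make the training-job inputs and the endpoint option concrete, here is a minimal sketch using the SageMaker Python SDK's generic Estimator. The ECR image URI, role ARN, and S3 paths are placeholders mirroring the bullets above; the payload format depends entirely on your container.

```python
from sagemaker.estimator import Estimator

# The four training-job inputs from the list above: data URL, compute, output URL, ECR path.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training:latest",  # placeholder ECR path
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",                         # ML compute resources
    output_path="s3://my-example-bucket/model-output",    # trained model artifacts land here
)
estimator.fit({"train": "s3://my-example-bucket/train"})  # launches the training job

# Deployment option 1: a persistent endpoint for individual on-demand predictions.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
result = predictor.predict(b"1.0,2.0,3.0")  # serialization depends on your container
predictor.delete_endpoint()                 # endpoints bill until deleted
```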
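
For the second deployment option, Batch Transform scores an entire S3 dataset without keeping an endpoint alive. This continues the sketch above (reusing the trained estimator) and assumes line-delimited CSV input; paths remain placeholders.

```python
# Deployment option 2: Batch Transform over a whole dataset in S3
# (reuses `estimator` from the previous sketch).
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-example-bucket/batch-predictions",
)
transformer.transform(
    "s3://my-example-bucket/batch-input",
    content_type="text/csv",  # assumes one CSV record per line
    split_type="Line",
)
transformer.wait()  # predictions appear as *.out files under output_path
```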