(Jupyter) Notebook Instances on EC2 are spun up from the console
S3 data access
using scikit-learn, Spark, TensorFlow
Wide variety of built-in models
Ability to spin up training instances
Ability to deploy trained models for making predictions at scale
Data prep on SageMaker
Data Sources
usually comes from S3
Ideal format varies with algorithm – often it is RecordIO / Protobuf
LibSVM is a simple, text-based, sparse format for representing data, primarily used with support vector machines (SVMs).
RecordIO/Protobuf is a binary format that allows efficient storage and retrieval of large datasets, commonly used in deep learning and data processing.
Can also ingest from Athena, EMR, Redshift, and Amazon Keyspaces DB
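For concreteness, a minimal sketch of converting a NumPy dataset to the RecordIO/Protobuf format described above and uploading it to S3 with the sagemaker Python SDK; the bucket name and key are placeholders, not values from the course.

```python
import io
import boto3
import numpy as np
import sagemaker.amazon.common as smac

# Example in-memory dataset; real data would come from your own prep step
features = np.random.rand(1000, 10).astype("float32")
labels = np.random.randint(0, 2, size=1000).astype("float32")

# Serialize to RecordIO/protobuf in memory
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, features, labels)
buf.seek(0)

# Upload to S3 (bucket and key are hypothetical)
boto3.resource("s3").Bucket("my-bucket").Object(
    "train/data.recordio-protobuf"
).upload_fileobj(buf)
```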
Apache Spark integrates with SageMaker
More on this later…
scikit-learn, NumPy, and pandas are all at your disposal within a notebook
Processing (example)
Copy data from S3
Spin up a processing container
SageMaker built-in or user provided
Output processed data to S3
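A minimal sketch of that processing flow using the sagemaker Python SDK's SKLearnProcessor; the IAM role, S3 paths, framework version, and preprocess.py script are assumptions, not values from the course.

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

# Spin up a processing container (a SageMaker-built scikit-learn image here)
processor = SKLearnProcessor(
    framework_version="1.2-1",                                     # assumed version
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # hypothetical role
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# Copy data in from S3, run the (hypothetical) preprocess.py script,
# and write the processed output back to S3
processor.run(
    code="preprocess.py",
    inputs=[ProcessingInput(
        source="s3://my-bucket/raw-data/",
        destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(
        source="/opt/ml/processing/output",
        destination="s3://my-bucket/processed-data/")],
)
```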
Training/Deployment on SageMaker
Create a training job
URL of S3 bucket with training data
ML compute resources
URL of S3 bucket for output
ECR path to training code
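A minimal sketch of creating a training job with the sagemaker Python SDK's generic Estimator, covering the four pieces above; the role ARN, ECR image URI, and S3 paths are hypothetical.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",  # ECR path to training code (hypothetical)
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # hypothetical role
    instance_count=1,                      # ML compute resources
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output/",  # S3 location for output artifacts (hypothetical)
    sagemaker_session=session,
)

# Kick off the training job, pointing at the S3 bucket with training data
estimator.fit({"train": "s3://my-bucket/training-data/"})
```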
Training options
Built-in training algorithms
Spark MLlib
Custom Python TensorFlow / MXNet code
PyTorch, Scikit-Learn, RLEstimator
XGBoost, Hugging Face, Chainer
…
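If you use a built-in algorithm instead of your own container, the SDK can look up the algorithm's image URI for you; a small sketch assuming XGBoost in us-east-1 (region and version are assumptions).

```python
from sagemaker import image_uris

# Look up the ECR image URI for SageMaker's built-in XGBoost algorithm
xgboost_image = image_uris.retrieve(
    framework="xgboost",
    region="us-east-1",
    version="1.7-1",
)
# Pass this URI as image_uri to the Estimator sketched earlier
```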
Save your trained model to S3
Can deploy two ways:
Persistent endpoint for making individual predictions on demand
SageMaker Batch Transform to get predictions for an entire dataset
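A minimal sketch of both deployment styles, continuing the hypothetical estimator from the training sketch above; instance types, the payload, and the S3 paths are placeholders.

```python
# 1) Persistent endpoint for individual predictions on demand
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)
payload = b"1.0,2.0,3.0"             # hypothetical request body in whatever format the model expects
result = predictor.predict(payload)

# 2) Batch Transform for predictions over an entire dataset in S3
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-bucket/batch-output/",       # hypothetical output location
)
transformer.transform("s3://my-bucket/batch-input/")  # hypothetical input prefix
transformer.wait()
```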
Lots of cool options
Inference Pipelines for more complex processing
SageMaker Neo for deploying to edge devices
Elastic Inference for accelerating deep learning models
Automatic scaling (increases the number of instances behind an endpoint as needed; see the sketch below)
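A minimal sketch of wiring up that automatic scaling for an endpoint's production variant with boto3's Application Auto Scaling client; the endpoint name, variant name, capacity limits, and target value are assumptions.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the endpoint's production variant as a scalable target
# (endpoint and variant names are hypothetical)
resource_id = "endpoint/my-endpoint/variant/AllTraffic"
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target-tracking policy: add or remove instances to keep
# invocations per instance near the target value
autoscaling.put_scaling_policy(
    PolicyName="my-invocations-policy",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```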