14. ML – Algorithms

== GENERAL / TABULAR ==

Linear Learner

  • Linear regression produces numeric predictions
  • Logistic regression produces a binary (or multiclass) classification output
  • Main Use Cases
    • regression (numeric) predictions
    • classification predictions
  • Inputs
    • RecordIO-wrapped protobuf
      • Float32 data only!
    • CSV
      • First column assumed to be the label
    • File or Pipe mode both supported
  • Processes
    • Preprocessing
      • Training data must be normalized (so all features are weighted the same)
      • Input data should be shuffled
    • Training
      • Uses stochastic gradient descent
      • Choose an optimization algorithm (Adam, AdaGrad, SGD, etc)
      • Multiple models are optimized in parallel
      • Tune L1, L2 regularization
    • Validation
      • Most optimal model is selected
  • Hyperparameters
    • Balance_multiclass_weights
      • Gives each class equal importance in loss functions
    • Learning_rate, mini_batch_size
    • L1 : Regularization
    • Wd : Weight decay (L2 regularization)
    • target_precision
      • Use with binary_classifier_model_selection_criteria set to
        recall_at_target_precision
      • Holds precision at this value while maximizing recall
    • target_recall
      • Use with binary_classifier_model_selection_criteria set to
        precision_at_target_recall
      • Holds recall at this value while maximizing precision
  • Instance Types
    • Single or multi-machine CPU or GPU instances; multi-GPU instances don't provide a benefit
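
A minimal sketch of launching a Linear Learner training job with the SageMaker Python SDK (v2). The IAM role, S3 paths, and hyperparameter values are placeholders, not prescriptions:

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    session = sagemaker.Session()
    container = image_uris.retrieve("linear-learner", session.boto_region_name)

    ll = Estimator(
        image_uri=container,
        role="arn:aws:iam::123456789012:role/MySageMakerRole",   # placeholder role
        instance_count=1,
        instance_type="ml.c5.xlarge",      # CPU is fine; multi-GPU doesn't help
        output_path="s3://my-bucket/linear-learner/output",      # placeholder path
        sagemaker_session=session,
    )
    ll.set_hyperparameters(
        predictor_type="binary_classifier",  # or "regressor"
        mini_batch_size=1000,
        learning_rate=0.01,
        l1=0.0,                              # L1 regularization
        wd=0.0001,                           # weight decay (L2 regularization)
    )
    # CSV input: first column is assumed to be the label
    ll.fit({"train": TrainingInput("s3://my-bucket/linear-learner/train.csv",
                                   content_type="text/csv")})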

DeepAR Forecast

  • Forecasting one-dimensional time series data
  • Uses RNN’s
  • Allows you to train the same model over several related time series
  • Main Use Cases
    • Finds frequencies and seasonality
  • Input
    • JSON Lines format (optionally gzipped), or Parquet
    • Each record must contain:
      • Start: the starting time stamp
      • Target: the time series values
    • Optional
      • Dynamic_feat: dynamic features (such as, was a promotion applied to a product in a time series of product purchases)
      • Cat: categorical features
  • Hyperparameters
    • Context_length
      • Number of time points the model sees before making a prediction
      • Can be smaller than the seasonality period; the model automatically feeds in lagged values, so it still picks up (e.g., yearly) seasonality
    • Epochs
    • mini_batch_size
    • Learning_rate
    • Num_cells
  • Instance Types
    • CPU, GPU, single or multiple, all good
    • GPU better for larger models, or with large mini-batch sizes (>512)
    • CPU-only instances are fine for inference
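
A sketch of DeepAR's JSON Lines training format described above: one JSON object per time series with "start" and "target", plus optional "cat" and "dynamic_feat". The file name and values are illustrative:

    import json

    series = [
        {
            "start": "2024-01-01 00:00:00",
            "target": [12.0, 15.0, 14.0, 19.0],   # the time series values
            "cat": [0],                            # optional categorical features
            "dynamic_feat": [[1, 0, 0, 1]],        # optional, e.g. promotion applied?
        },
        {
            "start": "2024-01-01 00:00:00",
            "target": [3.0, 4.0, 2.0, 5.0],
            "cat": [1],
            "dynamic_feat": [[0, 0, 1, 0]],
        },
    ]

    with open("train.json", "w") as f:             # gzip the file if desired
        for record in series:
            f.write(json.dumps(record) + "\n")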

Random (Cut) Forest

  • Anomaly detection
  • Unsupervised
  • Detect unexpected spikes in time series data
  • Breaks in periodicity
  • Unclassifiable data points
  • Assigns an anomaly score to each data point
  • Input
    • RecordIO-protobuf or CSV
    • Can use File or Pipe mode on either
    • Optional test channel for computing accuracy, precision, recall, and F1 on labeled data (anomaly or not)
  • Processing
    • Creates a forest of trees where each tree is a partition of the training data; looks at expected change in complexity of the tree as a result of adding a point into it
    • Data is sampled randomly
    • Then trained
    • RCF shows up in Kinesis Analytics as well; it can work on streaming data too.
  • Hyperparameters
    • Num_trees
      • Increasing reduces noise
    • Num_samples_per_tree
      • Should be chosen such that 1/num_samples_per_tree approximates the ratio of anomalous to normal data
  • Instance Types
    • Does not take advantage of GPUs
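
A hedged sketch using the SageMaker SDK's built-in RandomCutForest estimator class (SDK v2 parameter names assumed; the role and data are placeholders):

    import numpy as np
    import sagemaker
    from sagemaker import RandomCutForest

    session = sagemaker.Session()
    rcf = RandomCutForest(
        role="arn:aws:iam::123456789012:role/MySageMakerRole",   # placeholder role
        instance_count=1,
        instance_type="ml.m5.xlarge",        # CPU; RCF doesn't benefit from GPUs
        num_trees=50,                        # more trees -> less noise in scores
        num_samples_per_tree=256,            # ~1/256 expected anomaly-to-normal ratio
        sagemaker_session=session,
    )

    train = np.random.rand(10000, 1).astype("float32")    # illustrative data only
    rcf.fit(rcf.record_set(train))           # serialized to RecordIO-protobuf for us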

IP Insights

  • Unsupervised learning of IP address usage patterns
  • Main Use Cases
    • Identifies suspicious behavior from IP addresses
      • Identify logins from anomalous IP’s
      • Identify accounts creating resources from anomalous IP’s
  • Input
    • User names, account ID’s can be fed in directly; no need to pre-process
    • Training channel, optional validation (computes AUC score)
    • CSV only
      • Entity, IP
  • Processing
    • Uses a neural network to learn latent vector representations of entities and IP addresses.
    • Entities are hashed and embedded
      • Need sufficiently large hash size
    • Automatically generates negative samples during training by randomly pairing entities and IP’s
  • Hyperparameters
    • Num_entity_vectors
      • Hash size
      • Set to twice the number of unique entity identifiers
    • Vector_dim
      • Size of embedding vectors
      • Scales model size
      • Too large results in overfitting
    • Epochs, learning rate, batch size, etc.
  • Instance Types
    • CPU or GPU
    • GPU recommended
    • Can use multiple GPU’s
    • Size of CPU instance depends on vector_dim and num_entity_vectors

K-Nearest-Neighbors (KNN)

  • Classification
    • Find the K closest points to a sample point and return the most frequent label
  • Regression
    • Find the K closest points to a sample point and return the average value
  • Input
    • Train channel contains your data
    • Test channel emits accuracy or MSE
    • recordIO-protobuf or CSV training
      • First column is label
    • File or pipe mode on either
  • Processing
    • Data is first sampled
    • SageMaker includes a dimensionality reduction stage
      • Avoid sparse data (“curse of dimensionality”)
      • At cost of noise / accuracy
      • “sign” or “fjlt” methods
    • Build an index for looking up neighbors
    • Serialize the model
    • Query the model for a given K
  • Hyperparameters
    • K!
    • Sample_size
  • Instance Types
    • Training on CPU or GPU
    • Inference
      • CPU for lower latency
      • GPU for higher throughput on large batches
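
A concept sketch with scikit-learn (not the SageMaker container) showing the two modes: the classifier returns the most frequent label among the K nearest points, the regressor returns their average value:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

    X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
    y_class = np.array([0, 0, 0, 1, 1, 1])
    y_value = np.array([1.0, 1.2, 0.9, 8.0, 8.5, 7.9])

    clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
    reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)

    print(clf.predict([[1.5]]))    # -> [0], most frequent label of the 3 neighbors
    print(reg.predict([[10.5]]))   # -> ~[8.13], average of the 3 nearest values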

K-Means

  • Unsupervised clustering
  • Divide data into K groups, where members of a group are as similar as possible to each other
    • You define what “similar” means
    • Measured by Euclidean distance
  • Input
    • Train channel, optional test
      • Train ShardedByS3Key, test FullyReplicated
    • recordIO-protobuf or CSV
    • File or Pipe on either
  • Processing
    • Every observation mapped to n-dimensional space (n = number of features)
    • Works to optimize the center of K clusters
      • “extra cluster centers” may be specified to improve accuracy (which end up getting reduced to k)
      • K = k * extra_center_factor
    • Algorithm:
      • Determine initial cluster centers
        • Random or k-means++ approach
        • K-means++ tries to make initial clusters far apart
      • Iterate over training data and calculate cluster centers
      • Reduce clusters from K to k
        • Using Lloyd’s method with kmeans++
  • Hyperparameters
    • K!
      • Choosing K is tricky
      • Plot within-cluster sum of squares as function of K
      • Use “elbow method”
      • Basically optimize for tightness of clusters
    • Mini_batch_size
    • Extra_center_factor
    • Init_method
  • Instance Types
    • CPU or GPU, but CPU recommended
      • Only one GPU per instance used on GPU
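
A concept sketch with scikit-learn illustrating the elbow method mentioned above: compute the within-cluster sum of squares (inertia) for a range of k and look for the point where improvement levels off:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # three well-separated blobs, so the "elbow" should appear at k = 3
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])

    for k in range(1, 7):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 1))   # inertia drops sharply until k=3, then flattens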

Principal Component Analysis (PCA)

  • Dimensionality reduction
    • Project higher-dimensional data (lots of features) into lower-dimensional (like a 2D plot) while minimizing loss of information
    • The reduced dimensions are called components
      • First component has the largest possible variability
      • Second component has the next largest…
  • commonly used for feature extraction or visualization
  • Unsupervised
  • Input
    • recordIO-protobuf or CSV
    • File or Pipe on either
  • Processing
    • Covariance matrix is created, then singular value decomposition (SVD)
    • Two modes
      • Regular, for sparse data and moderate number of observations and features
      • Randomized, for large number of observations and features
        • Uses approximation algorithm
  • Hyperparameters
    • Algorithm_mode
    • Subtract_mean
      • Unbias data
  • Instance Types
    • GPU or CPU
      • It depends “on the specifics of the input data”
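
A concept sketch with scikit-learn (not the SageMaker container): project correlated 4-dimensional data down to two components and confirm the first component captures the largest share of variance:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    X[:, 1] = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)   # make two features correlated

    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)               # the mean is subtracted automatically
    print(pca.explained_variance_ratio_)      # first component >> second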

Factorization Machines

  • Dealing with sparse data
    • Click prediction
    • Item recommendations
    • Since an individual user doesn’t interact with most pages / products the data is sparse
  • Supervised
    • Classification or regression
  • Limited to pair-wise interactions
    • User -> item for example
  • Finds factors we can use to predict a classification (click or not? Purchase or not?) or value (predicted rating?) given a matrix representing some pair of things (users & items?)
  • Usually used in the context of recommender systems
  • Input
    • recordIO-protobuf with Float32
      • Sparse data means CSV isn’t practical
  • Hyperparameters
    • Initialization methods for bias, factors, and linear terms
      • Uniform, normal, or constant
      • Can tune properties of each method
  • Instance Types
    • CPU or GPU
      • CPU recommended
      • GPU only works with dense data
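
A hedged sketch of preparing the Float32 RecordIO-protobuf input: the SageMaker SDK helper assumed here (sagemaker.amazon.common.write_spmatrix_to_sparse_tensor) serializes a sparse user-item matrix plus labels into that format:

    import io
    import numpy as np
    import scipy.sparse as sp
    import sagemaker.amazon.common as smac

    # toy interactions: each row one-hot encodes a (user, item) pair, label = clicked or not
    X = sp.lil_matrix((4, 6), dtype="float32")
    X[0, 0] = X[0, 3] = 1.0
    X[1, 0] = X[1, 4] = 1.0
    X[2, 1] = X[2, 3] = 1.0
    X[3, 2] = X[3, 5] = 1.0
    y = np.array([1, 0, 1, 0], dtype="float32")

    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X.tocsr(), y)
    buf.seek(0)    # upload this buffer to S3 as the "train" channel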

AutoGluon-Tabular

  • an open-source AutoML framework that succeeds by ensembling models and stacking them in multiple layers
  • Main Use Cases
    • regression, classification (binary and multiclass), and ranking problems
  • Inputs
    • CSV
      • the rows representing observations, one column representing the target variable or label, and the remaining columns representing features.
  • Processes
    • automatically recognizes the data type in each column for robust data preprocessing, including special handling of text fields
    • models are stacked in multiple layers and trained in a layer-wise manner that guarantees raw data can be translated into high-quality predictions within a given time constraint
  • Hyperparameters
    • xxxxxxx
  • Instance Types
    • only train on a single machine (CPU or GPU, no multi-GPU)
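
A minimal sketch with the open-source autogluon library (the same stacking/ensembling technique as the built-in algorithm). The file name and label column are placeholders:

    from autogluon.tabular import TabularDataset, TabularPredictor

    train = TabularDataset("train.csv")      # one column is the target/label
    predictor = TabularPredictor(label="target").fit(train, time_limit=600)
    print(predictor.leaderboard())           # stacked/ensembled models, ranked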

TabTransformer

  • built on self-attention-based Transformers
  • Main Use Cases
    • xxxxxxx
  • Inputs
    • CSV 
  • Processes
    • xxxxxx
  • Hyperparameters
    • xxxxxxx
  • Instance Types
    • only train on a single machine (CPU or GPU, no multi-GPU)

XGBoost

  • use Gradient Boosting Decision Tree (GBDT) algorithm
  • eXtreme Gradient Boosting
    • Boosted group of decision trees
    • New trees made to correct the errors of previous trees
    • Uses gradient descent to minimize loss as new trees are added
  • Main Use Cases
    • classification
    • regression, using regression trees
    • including financial forecasting, credit scoring, and customer churn prediction
  • Inputs
    • RecordIO-protobuf
    • CSV
    • libsvm
    • Parquet
  • Hyperparameters
    • Subsample
      • Prevents overfitting
    • Eta
      • Step size shrinkage, prevents overfitting
    • Gamma
      • Minimum loss reduction to create a partition; larger = more conservative
    • Alpha
      • L1 regularization term; larger = more conservative
    • Lambda
      • L2 regularization term; larger = more conservative
    • eval_metric
      • Optimize on AUC, error, rmse…
      • For example, if you care about false positives more than accuracy, you might use AUC here
    • scale_pos_weight
      • Adjusts balance of positive and negative weights
      • Helpful for unbalanced classes
      • Might set to sum(negative cases) / sum(positive cases)
    • max_depth
      • Max depth of the tree
      • Too high and you may overfit
  • Instance Types
    • Is memory-bound, not compute-bound
    • So, M5 is a good choice
    • XGBoost 1.2
      • single-instance GPU training is available
      • Must set tree_method hyperparameter to gpu_hist
    • XGBoost 1.5+: Distributed GPU training
      • Must set use_dask_gpu_training to true
      • Set distribution to fully_replicated in TrainingInput
      • Only works with csv or parquet input
  • Tuning Metrics
    • Regression
      • Root Mean Square Error (RMSE) "validation:rmse"
      • Mean Absolute Error (MAE) "validation:mae"
    • (Binary) Classification
      • F1 is best, since it combines precision and recall; especially useful for imbalanced datasets "validation:f1"
      • Then Error "validation:error" or Accuracy "validation:accuracy"
    • Ranking
      • Normalized Discounted Cumulative Gain (NDCG) "validation:ndcg"
      • Mean Average Precision (MAP) "validation:map"
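
A concept sketch with the open-source xgboost package showing the hyperparameters above (the SageMaker built-in container accepts the same names); the data here is random and purely illustrative:

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(500, 10)
    y = (X[:, 0] > 0.5).astype(int)
    dtrain = xgb.DMatrix(X, label=y)

    params = {
        "objective": "binary:logistic",
        "eta": 0.2,                 # step-size shrinkage, prevents overfitting
        "max_depth": 5,             # too high and you may overfit
        "subsample": 0.8,           # row subsampling, prevents overfitting
        "gamma": 1.0,               # min loss reduction to split; larger = more conservative
        "alpha": 0.5,               # L1 regularization
        "lambda": 1.0,              # L2 regularization
        "scale_pos_weight": 1.0,    # ~ sum(negative cases) / sum(positive cases)
        "eval_metric": "auc",
    }
    model = xgb.train(params, dtrain, num_boost_round=100)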

CatBoost

  • use Gradient Boosting Decision Tree (GBDT) algorithm
  • extra two techniques
    • The implementation of ordered boosting, a permutation-driven alternative to the classic algorithm
    • An innovative algorithm for processing categorical features
  • Main Use Cases
    • Good at categorical features, like ecommerce (like product recommendations) and customer behavior analysis
  • Inputs
    • CSV 
  • Processes
    • xxxxxx
  • Hyperparameters
    • xxxxxxx
  • Instance Types
    • CPU only (memory-bound, not compute-bound)

LightGBM

  • use Gradient Boosting Decision Tree (GBDT) algorithm
    • GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models
  • extra two techniques
    • Gradient-based One-Side Sampling (GOSS)
    • Exclusive Feature Bundling (EFB)
  • Main Use Cases
    • ideal for scenarios with large-scale datasets and high-dimensional features, such as in real-time bidding systems, recommendation engines, and large-scale classification problems.
  • Inputs
    • CSV 
  • Processes
    • xxxxxx
  • Hyperparameters
    • xxxxxxx
  • Instance Types
    • CPU only (memory-bound, not compute-bound)

  GBDT variant   Strength
  XGBoost        general use with structured data
  CatBoost       categorical features
  LightGBM       handles large datasets efficiently

== TEXT ==

Seq2Seq

  • Input is a sequence of tokens, output is a sequence of tokens
  • transform a sequence of elements (such as words in a sentence) into another sequence
  • Main Use Cases
    • Machine (language) Translation
    • Text summarization
    • Speech to text
  • Implemented with RNN’s and CNN’s with attention
  • Inputs
    • RecordIO-protobuf
      • Tokens must be integers (unlike most other algorithms, which expect floating point)
      • Packs into integer tensors with vocabulary files
      • A lot like the TF/IDF lab we did earlier.
    • Start with tokenized text files
    • Must provide training data, validation data, and vocabulary files.
  • Hyperparameters
    • Batch_size
    • Optimizer_type (adam, sgd, rmsprop)
    • Learning_rate
    • Num_layers_encoder
    • Num_layers_decoder
    • Can optimize on:
      • Accuracy
        • Vs. provided validation dataset
      • BLEU score
        • Compares against multiple reference translations
      • Perplexity
        • Cross-entropy
  • Instance Types
    • only use GPU instance types
    • only use a single machine for training (but can use multiple GPUs)

BlazingText

  • Main Use Cases
    • Predict labels for a sentence; useful in web searches, information retrieval <- Text Classification
    • Create word embeddings (vector representations of words) <- Word2Vec
  • Modes
    • (supervised) Text classification
      • Used to perform web searches, information retrieval, ranking, and document classification
    • (unsupervised) Word2Vec
      • Used for sentiment analysis, named entity recognition, machine translation
      • Creates a vector representation of words
      • Semantically similar words are represented by vectors close to each other
      • This is called a word embedding
      • It is useful for NLP, but is not an NLP algorithm in itself!
        • Used in machine translation, sentiment analysis
      • Remember it only works on individual words, not sentences or documents
      • modes
        • Cbow (Continuous Bag of Words)
        • Skip-gram
        • Batch skip-gram
          • Distributed computation over many CPU nodes
  • Input
    • For supervised mode (text classification):
      • One sentence per line
      • First “word” in the sentence is the string “__label__” followed by the label
    • Also, “augmented manifest text format”
    • Word2vec just wants a text file with one training sentence per line.
  • Hyperparameters
    • Text classification:
      • Epochs
      • Learning_rate
      • Word_ngrams
      • Vector_dim
    • Word2vec:
      • Mode (batch_skipgram, skipgram, cbow)
      • Learning_rate
      • Window_size
      • Vector_dim
      • Negative_samples
  • Instance Types
    • For cbow and skipgram, recommend a single ml.p3.2xlarge
      • Any single CPU or single GPU instance will work
    • For batch_skipgram, can use single or multiple CPU instances
    • For text classification, C5 recommended if less than 2GB training data. For larger data sets, use a single GPU instance (ml.p2.xlarge or ml.p3.2xlarge)
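
A sketch of writing the supervised (text classification) input format described above: one sentence per line, prefixed with "__label__<label>". The file name and sentences are placeholders:

    examples = [
        ("positive", "this product works great and arrived on time"),
        ("negative", "the battery died after two days"),
    ]

    with open("blazingtext.train", "w") as f:
        for label, sentence in examples:
            f.write(f"__label__{label} {sentence}\n")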

Object2Vec

  • primarily used for learning vector representations of objects, which can then be used for tasks like similarity search, recommendation, or clustering
  • creates low-dimensional dense embeddings of high-dimensional objects
  • It is basically word2vec, generalized to handle things other than words.
  • Compute nearest neighbors of objects
  • Main Use Cases
    • Visualize clusters
    • Genre prediction
    • Recommendations (similar items or users)
  • Input
    • Data must be tokenized into integers
    • Training data consists of pairs of tokens and/or sequences of tokens
      • Sentence – sentence
      • Labels-sequence (genre to description?)
      • Customer-customer
      • Product-product
      • User-item
  • Processes
    • Process data into JSON Lines and shuffle it
    • Train with two input channels, two encoders, and a comparator
    • Encoder choices:
      • Average-pooled embeddings
      • CNN’s
      • Bidirectional LSTM
    • Comparator is followed by a feed-forward neural network
  • Hyperparameters
    • The usual deep learning ones
      • Dropout, early stopping, epochs, learning rate, batch size, layers, activation function, optimizer, weight decay
    • Enc1_network, enc2_network
      • Choose hcnn, bilstm, pooled_embedding
  • Instance Types
    • only train on a single machine (CPU or GPU, multi-GPU OK)
    • Inference: use ml.p3.2xlarge
      • Use INFERENCE_PREFERRED_MODE environment variable to optimize for encoder embeddings rather than classification or regression.

Neural Topic Model (NTM)

  • Organize documents into topics
  • Main Use Cases
    • Classify or summarize documents based on topics
  • Unsupervised
  • uses the "Neural Variational Inference" topic modeling algorithm
  • You define how many topics you want
  • These topics are a latent representation based on top ranking words
  • Input
    • Four data channels
      • “train” is required
      • “validation”, “test”, and “auxiliary” optional
    • recordIO-protobuf or CSV
    • Words must be tokenized into integers
      • Every document must contain a count for every word in the vocabulary in CSV
      • The “auxiliary” channel is for the vocabulary
    • File or pipe mode
  • Hyperparameters
    • Lowering mini_batch_size and learning_rate can reduce validation loss
      • At expense of training time
    • Num_topics
  • Instance Types
    • CPU and GPU are all good
    • GPU recommended for training
    • CPU OK for inference

Latent Dirichlet Allocation (LDA)

  • Another topic modeling algorithm
    • to identify a specified number of topics within a set of text documents
    • each document is considered an observation
    • the words within the documents are the features
    • the topics are the categories
    • LDA learns the topics as a probability distribution over the words in the documents, and each document is characterized as a mixture of these topics
  • Unsupervised
  • The topics themselves are unlabeled; they are just groupings of documents with a shared subset of words
  • Main Use Cases
    • Can be used for things other than words
      • Cluster customers based on purchases
      • Harmonic analysis in music
  • Optional test channel can be used for scoring results
  • Per-word log likelihood
  • Functionally similar to NTM, but CPU-based
    • Therefore maybe cheaper / more efficient
  • Input
    • Train channel, optional test channel
    • recordIO-protobuf or CSV
    • Each document has counts for every word in vocabulary (in CSV format)
    • Pipe mode only supported with recordIO
  • Hyperparameters
    • Num_topics
    • Alpha0
      • Initial guess for concentration parameter
      • Smaller values generate sparse topic mixtures
      • Larger values (>1.0) produce uniform mixtures
  • Instance Types
    • Single-instance CPU training
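
A concept sketch with scikit-learn's LatentDirichletAllocation (not the SageMaker container): documents become word-count vectors, and LDA learns each topic as a probability distribution over the vocabulary:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "dog cat pet food vet",
        "cat pet vet vaccine",
        "stocks market trading price",
        "market price invest trading",
    ]
    vectorizer = CountVectorizer().fit(docs)
    X = vectorizer.transform(docs)                  # per-document word counts

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    vocab = vectorizer.get_feature_names_out()
    for topic in lda.components_:                   # each row: word weights for one topic
        print([vocab[i] for i in topic.argsort()[-3:]])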

== VISION ==

Object Detection

  • Takes an image as input, outputs all instances of objects in the image with categories and confidence scores
  • Main Use Cases
    • Image Object Detection / Identifications
  • MXNet
    • Uses a CNN with the Single Shot MultiBox Detector (SSD) algorithm
      • The base CNN can be VGG-16 or ResNet-50
    • Transfer learning mode / incremental training
    • Use a pre-trained model for the base network weights, instead of random initial weights
    • Uses flip, rescale, and jitter internally to avoid overfitting
  • Tensorflow
    • Uses ResNet, EfficientNet, MobileNet models from the TensorFlow Model Garden
  • Input
    • MXNet
      • RecordIO or image format (jpg or png)
      • With image format, supply a JSON file for annotation data for each image
  • Hyperparameters
    • Mini_batch_size
    • Learning_rate
    • Optimizer
      • sgd, adam, rmsprop, adadelta
  • Instance Types
    • Use GPU instances for training (multi-GPU and multi-machine OK)
    • Use CPU or GPU for inference

  MXNet vs. TensorFlow:
    • Community Support
      • MXNet: Growing community, but smaller compared to TensorFlow
      • TensorFlow: Large and vibrant community with extensive resources
    • Ease of Learning
      • MXNet: Slightly steeper learning curve
      • TensorFlow: User-friendly, especially with the Keras API
    • Performance
      • MXNet: Efficient and scalable, capable of handling large-scale projects
      • TensorFlow: Slightly slower compared to MXNet for specific tasks
    • Flexibility
      • MXNet: Highly flexible in terms of supported languages and hardware
      • TensorFlow: Offers a wide range of tools and supports various hardware options
    • Project Complexity
      • MXNet: May require some experience to handle larger projects effectively
      • TensorFlow: Provides best practices and design patterns to manage complexity

Image Classification

  • assigns labels to an entire image, categorizing it based on the predominant features.
    • works well for tasks like sorting images into broad categories but cannot identify or count multiple objects within a single image
  • Main Use Cases
    • Assign one or more labels to an image
  • Doesn’t tell you where objects are, just what objects are in the image
  • MXNet
    • Full training mode
      • Network initialized with random weights
    • Transfer learning mode
      • Initialized with pre-trained weights
      • The top fully-connected layer is initialized with random weights
      • Network is fine-tuned with new training data
    • Default image size is 3-channel 224×224 (ImageNet’s dataset)
  • Tensorflow
    • Uses various Tensorflow Hub models (MobileNet, Inception, ResNet, EfficientNet)
    • Top classification layer is available for fine tuning or further training
  • Hyperparameters
    • The usual suspects for deep learning
      • Batch size, learning rate, optimizer
    • Optimizer-specific parameters
      • Weight decay, beta 1, beta 2, eps, gamma
      • Slightly different between MXNet and Tensorflow versions
  • Instance Types
    • Use GPU instances for training (multi-GPU and multi-machine OK)
    • Use CPU or GPU for inference

Semantic Segmentation

  • Pixel-level object classification
    • classifies each pixel in an image into different categories, providing a detailed map of where different objects or materials are located within the image
  • Different from image classification – that assigns labels to whole images
  • Different from object detection – that assigns labels to bounding boxes
  • Main Use Cases
    • self-driving vehicles
    • medical imaging diagnostics
    • robot sensing
  • Produces a segmentation mask
  • Built on MXNet Gluon and Gluon CV
  • Choice of 3 algorithms:
    • Fully-Convolutional Network (FCN)
    • Pyramid Scene Parsing (PSP)
    • DeepLabV3
  • Choice of backbones:
    • ResNet50
    • ResNet101
    • Both trained on ImageNet
  • Incremental training, or training from scratch, supported too
  • Input
    • JPG Images and PNG annotations
    • For both training and validation
    • Label maps to describe annotations
    • Augmented manifest image format supported for Pipe mode.
    • JPG images accepted for inference
  • Hyperparameters
    • Epochs, learning rate, batch size, optimizer, etc
    • Algorithm
    • Backbone
  • Instance Types
    • Use GPU instances for training (multi-GPU and multi-machine OK)
    • Use CPU or GPU for inference

Algorithm selection by problem type and data type

  • Binary / multiclass classification (predict if an item belongs to a category: an email spam filter)
    • Tabular: Linear Learner, K-Nearest Neighbors (k-NN), Factorization Machines, AutoGluon-Tabular, TabTransformer, XGBoost, CatBoost, LightGBM
  • Regression (predict a numeric/continuous value: estimate the value of a house)
    • Tabular: Linear Learner, K-Nearest Neighbors (k-NN), Factorization Machines, AutoGluon-Tabular, TabTransformer, XGBoost, CatBoost, LightGBM
  • Time-series forecasting (based on historical data for a behavior, predict future behavior: predict sales of a new product based on previous sales data)
    • Tabular: DeepAR
  • Feature engineering: dimensionality reduction (drop columns that have a weak relation with the label/target variable: the color of a car when predicting its mileage)
    • Tabular: Principal Component Analysis (PCA)
  • Anomaly detection (detect abnormal behavior in an application: spot when an IoT sensor is sending abnormal readings)
    • Tabular: Random Cut Forest (RCF)
  • Clustering or grouping (group similar objects/data together: find high-, medium-, and low-spending customers from their transaction histories)
    • Tabular: K-Means
  • IP address pattern / IP anomaly detection (protect your application from suspicious users: detect if an IP address accessing a service might be from a bad actor)
    • Tabular: IP Insights
  • Language translation (convert text from one language to another: Spanish to English)
    • Text: Seq2Seq
  • Text summarization (summarize a long text corpus: an abstract for a research paper)
    • Text: Seq2Seq
  • Speech-to-text (convert audio files to text: transcribe call center conversations for further analysis)
    • Text: Seq2Seq
  • Text classification (assign pre-defined categories to documents in a corpus: categorize books in a library into academic disciplines)
    • Text: BlazingText, Text Classification – TensorFlow
  • Topic modeling/discovery (organize a set of documents into topics not known in advance: tag a document as belonging to a medical category based on the terms used in the document)
    • Text: Latent Dirichlet Allocation (LDA), Neural Topic Model (NTM)
  • Dense embeddings / feature engineering (improve the embeddings of high-dimensional objects: identify duplicate support tickets or find the correct routing based on similarity of text in the tickets)
    • Text: Object2Vec
  • Image and multi-label classification (label/tag an image based on its content: alert about adult content in an image)
    • Image: Image Classification – MXNet
  • Image classification (classify something in an image using transfer learning)
    • Image: Image Classification – TensorFlow
  • Computer vision / semantic segmentation (tag every pixel of an image individually with a category: self-driving cars identify objects in their way)
    • Image: Semantic Segmentation
  • Object detection and classification (detect people and objects in an image: police review a large photo gallery for a missing person)
    • Image: Object Detection – MXNet, Object Detection – TensorFlow

== REINFORCEMENT ==

Reinforcement Learning

  • You have some sort of agent that “explores” some space
  • As it goes, it learns the value of different state changes in different conditions
  • Those values inform subsequent behavior of the agent
  • Examples: Pac-Man, Cat & Mouse game (game AI)
    • Supply chain management
    • HVAC systems
    • Industrial robotics
    • Dialog systems
    • Autonomous vehicles
  • Yields fast on-line performance once the space has been explored
  • Q-Learning
    • A set of environmental states s
    • A set of possible actions in those states a
    • A value of each state/action Q
    • Start off with Q values of 0
    • Explore the space
    • As bad things happen after a given state/action, reduce its Q
    • As rewards happen after a given state/action, increase its Q
    • can “look ahead” more than one step by using a discount factor when computing Q (here s is previous state, s’ is current state); see the sketch at the end of this section
      • Q(s,a) += learning_rate * (reward(s,a) + discount * max(Q(s’)) – Q(s,a))
  • The exploration problem
    • efficiently explore all of the possible states
    • Simple approach: always choose the action for a given state with the highest Q. If there’s a tie, choose at random
      • But that’s really inefficient, and you might miss a lot of paths that way
    • Better way: introduce an epsilon term
      • If a random number is less than epsilon, don’t follow the highest Q, but choose at random
      • That way, exploration never totally stops
      • Choosing epsilon can be tricky
  • Markov Decision Process (MDP)
    • modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker
      • States are still described as s and s’
      • State transition functions are described as Pa(s, s’)
      • Our “Q” values are described as a reward function Ra(s, s’)
    • a discrete time stochastic control process.
  • RL in SageMaker
    • Uses a deep learning framework with Tensorflow and MXNet
    • Supports Intel Coach and Ray RLlib toolkits.
    • MATLAB, Simulink
    • EnergyPlus, RoboSchool, PyBullet
    • Amazon Sumerian, AWS RoboMaker
  • Distributed Training with SageMaker RL
    • Can distribute training and/or environment rollout
    • Multi-core and multi-instance
  • Key Terms
    • Environment
      • The layout of the board / maze / etc
    • State
      • Where the player / pieces are
    • Action
      • Move in a given direction, etc
    • Reward
      • Value associated with the action from that state
    • Observation
      • i.e., surroundings in a maze, state of chess board
  • Hyperparameters
    • Parameters of your choosing may be abstracted
    • Hyperparameter tuning in SageMaker can then optimize them
  • Instance Types
    • deep learning – so GPU’s are helpful
    • supports multiple instances and cores
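
A concept sketch in plain NumPy of the tabular Q-learning update and epsilon-greedy exploration described earlier; the toy environment (a short corridor with a reward at the right end) is invented for illustration:

    import numpy as np

    n_states, n_actions = 5, 2              # actions: 0 = left, 1 = right
    Q = np.zeros((n_states, n_actions))     # start all Q values at 0
    alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
    rng = np.random.default_rng(0)

    def step(state, action):
        nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
        reward = 1.0 if nxt == n_states - 1 else 0.0
        return nxt, reward

    for episode in range(200):
        s = 0
        while s != n_states - 1:
            if rng.random() < epsilon:                     # explore at random
                a = int(rng.integers(n_actions))
            else:                                          # exploit (ties broken randomly)
                best = np.flatnonzero(Q[s] == Q[s].max())
                a = int(rng.choice(best))
            s2, r = step(s, a)
            # Q-learning update: look one step ahead using the discount factor
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2

    print(np.argmax(Q[:-1], axis=1))        # learned policy for non-terminal states: move right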