Linear Learner
- Linear regression
- Can handle both regression (numeric prediction) and classification
- Inputs
- RecordIO-wrapped protobuf
- Float32 data only!
- CSV
- First column is assumed to be the label (see the CSV sketch after this list)
- File or Pipe mode both supported
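A minimal sketch of preparing CSV input in the layout Linear Learner expects (label first, no header); the arrays here are hypothetical stand-ins for real data:

```python
import numpy as np
import pandas as pd

# Hypothetical features and binary labels standing in for real data.
X = np.random.rand(100, 4).astype(np.float32)
y = np.random.randint(0, 2, size=100).astype(np.float32)

# Linear Learner CSV convention: label in the first column, no header row.
pd.DataFrame(np.column_stack([y, X])).to_csv("train.csv", header=False, index=False)
```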
- Processes
- Preprocessing
- Training data must be normalized (so all features are weighted the same)
- Input data should be shuffled (see the preprocessing sketch after this list)
- Training
- Uses stochastic gradient descent
- Choose an optimization algorithm (Adam, AdaGrad, SGD, etc.)
- Multiple models are optimized in parallel
- Tune L1, L2 regularization
- Validation
- The best-performing model is selected
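A minimal sketch of both preprocessing steps using scikit-learn; the data here is hypothetical:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

# Hypothetical training data.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Normalize so every feature starts out weighted the same.
X = StandardScaler().fit_transform(X)

# Shuffle so SGD doesn't see examples in a correlated order.
X, y = shuffle(X, y, random_state=42)
```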
- Hyperparameters
- balance_multiclass_weights
- Gives each class equal importance in loss functions
- learning_rate, mini_batch_size
- l1: L1 regularization
- wd: weight decay (L2 regularization)
- target_precision
- Use with binary_classifier_model_selection_criteria set to recall_at_target_precision
- Holds precision at this value while maximizing recall
- target_recall
- Use with binary_classifier_model_selection_criteria set to precision_at_target_recall
- Holds recall at this value while maximizing precision
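A sketch of wiring these hyperparameters into a training job with the SageMaker Python SDK; the role ARN, instance type, S3 path, and values are illustrative assumptions, not recommendations:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
container = image_uris.retrieve("linear-learner", session.boto_region_name)

ll = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)
ll.set_hyperparameters(
    predictor_type="binary_classifier",
    binary_classifier_model_selection_criteria="recall_at_target_precision",
    target_precision=0.9,   # hold precision at 0.9 while maximizing recall
    learning_rate=0.01,
    mini_batch_size=1000,
    l1=0.0,                 # L1 regularization
    wd=0.01,                # weight decay (L2 regularization)
)
# ll.fit({"train": "s3://my-bucket/linear-learner/train/"})  # hypothetical path
```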
- Instance Types
- Training supports single- or multi-machine CPU and GPU
- Multi-GPU instances don't help

XGBoost
- eXtreme Gradient Boosting
- Boosted group of decision trees
- New trees made to correct the errors of previous trees
- Uses gradient descent to minimize loss as new trees are added
- Can be used for classification
- And also for regression, using regression trees
- Inputs
- RecordIO-protobuf
- CSV
- libsvm
- Parquet
- Hyperparameters
- Subsample
- Fraction of training rows sampled per tree; values below 1 help prevent overfitting
- Eta
- Step size shrinkage, prevents overfitting
- Gamma
- Minimum loss reduction to create a partition; larger = more conservative
- Alpha
- L1 regularization term; larger = more conservative
- Lambda
- L2 regularization term; larger = more conservative
- eval_metric
- Optimize on AUC, error, rmse…
- For example, if you care about false positives more than accuracy, you might use AUC here
- scale_pos_weight
- Adjusts balance of positive and negative weights
- Helpful for unbalanced classes
- Might set to sum(negative cases) / sum(positive cases)
- max_depth
- Max depth of the tree
- Too high and you may overfit
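A sketch of these knobs using the open-source xgboost library (the SageMaker built-in accepts the same hyperparameter names, passed as strings); the data and values are illustrative only:

```python
import numpy as np
import xgboost as xgb

# Hypothetical, imbalanced binary data (~10% positives).
rng = np.random.default_rng(0)
X = rng.random((1000, 10))
y = (rng.random(1000) < 0.1).astype(int)

params = {
    "objective": "binary:logistic",
    "max_depth": 6,        # deeper trees fit more but risk overfitting
    "eta": 0.2,            # step size shrinkage
    "gamma": 4,            # minimum loss reduction to make a split
    "subsample": 0.8,      # fraction of rows sampled per tree
    "alpha": 0.1,          # L1 regularization
    "lambda": 1.0,         # L2 regularization
    "eval_metric": "auc",  # optimize ranking quality rather than raw accuracy
    # Up-weight positives by the class ratio to handle the imbalance.
    "scale_pos_weight": float((y == 0).sum()) / max(int((y == 1).sum()), 1),
}
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=100)
```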
- Instance Types
- Is memory-bound, not compute-bound
- So, M5 is a good choice
- XGBoost 1.2+
- Single-instance GPU training is available
- Must set the tree_method hyperparameter to gpu_hist
- XGBoost 1.5+: Distributed GPU training
- Must set use_dask_gpu_training to true
- Set distribution to fully_replicated in TrainingInput
- Only works with CSV or Parquet input
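A sketch of the distributed GPU setup above with the SageMaker Python SDK; the region, role, bucket, and instance choices are hypothetical:

```python
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

region = "us-east-1"  # hypothetical
container = image_uris.retrieve("xgboost", region, version="1.5-1")

est = Estimator(
    image_uri=container,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=2,                # more than one instance = distributed
    instance_type="ml.g4dn.xlarge",  # GPU instance
)
est.set_hyperparameters(
    tree_method="gpu_hist",          # required for GPU training
    use_dask_gpu_training="true",    # distributed GPU training (1.5+)
    objective="binary:logistic",
    num_round=100,
)
train = TrainingInput(
    "s3://my-bucket/xgboost/train/",  # hypothetical path
    content_type="text/csv",          # CSV or Parquet only in this mode
    distribution="FullyReplicated",
)
# est.fit({"train": train})
```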



Sequence to Sequence (Seq2Seq)
- Input is a sequence of tokens, output is a sequence of tokens
- Machine Translation
- Text summarization
- Speech to text
- Implemented with RNNs and CNNs with attention


Random Forest

