19. ML – Generative AI

AI inference vs. training

  • Training is the first phase for an AI model. Training may involve a process of trial and error, or a process of showing the model examples of the desired inputs and outputs, or both.
  • Inference is the process that follows AI training. The better trained a model is, and the more fine-tuned it is, the better its inferences will be — although they are never guaranteed to be perfect.
    • The core of reasoning/inference is to apply a trained AI model to new data and make decisions based on the model’s predictions.
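
A minimal sketch of the two phases, using scikit-learn only for brevity; the toy data and labels below are made up for illustration, and the same train-then-infer split applies to generative models at a much larger scale:

```python
from sklearn.linear_model import LogisticRegression

# --- Training phase: show the model example inputs and the desired outputs ---
X_train = [[0.0], [1.0], [2.0], [3.0]]   # hypothetical feature values
y_train = [0, 0, 1, 1]                   # desired labels for those inputs
model = LogisticRegression()
model.fit(X_train, y_train)              # parameters fitted by iterative optimization

# --- Inference phase: apply the trained model to new, unseen data ---
X_new = [[2.5]]
print(model.predict(X_new))              # e.g. [1]; predictions are not guaranteed correct
```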

Transformer Architecture

  • used in Natural Language Processing (NLP)
  • Meaning is a result of relationships between things, and self-attention is a general way of learning relationships
  • A transformer is a type of artificial intelligence model that learns to understand and generate human-like text by analyzing patterns in large amounts of text data.
  • They are specifically designed to comprehend context and meaning by analyzing the relationship between different elements, and they rely almost entirely on a mathematical technique called attention to do so.
  • Adopts mechanism of “self-attention”
    • Weighs significance of each part of the input data
    • Processes sequential data (such as words), as an RNN does, but processes the entire input all at once.
    • The attention mechanism provides context, so no need to process one word at a time.
    • Combined with feed-forward neural network (FFNN) layers, this allows the whole sequence to be processed in parallel
  • Model zoos such as Hugging Face offer pre-trained models to start from
    • Integrated with SageMaker via Hugging Face Deep Learning Containers (DLCs)
  • Components
    • Encoder
      • to transform the input tokens into contextualized representations. Unlike earlier models that processed tokens independently, the Transformer encoder captures the context of each token with respect to the entire sequence.
      • Input Embeddings
        • converting input tokens – words or subwords – into vectors using embedding layers. These embeddings capture the semantic meaning of the tokens and convert them into numerical vectors.
      • Positional Encoding
        • use positional encodings added to the input embeddings to provide information about the position of each token in the sequence.
      • Stack of Encoder Layers
        • The encoder layer serves to transform all input sequences into a continuous, abstract representation that encapsulates the learned information from the entire sequence.
          • Multi-headed attention mechanism.
            • enables the models to relate each word in the input with other words
          • Normalization and Residual Connections
          • Feed-Forward Neural Network
          • Output of the Encoder
        • Additionally, it incorporates residual connections around each sublayer, which are then followed by layer normalization.
    • Decoder
      • Output Embeddings (mimic the Input Embeddings in Encoder)
      • Positional Encoding
      • Masked Self-Attention Mechanism
        • Similar to the self-attention mechanism in the encoder
        • But, prevents positions from attending to subsequent positions, which means that each word in the sequence isn’t influenced by future tokens.
        • the predictions for a particular position can only depend on known outputs at positions before it.
      • Encoder-Decoder Multi-Head Attention or Cross Attention
      • Feed-Forward Neural Network
      • Linear Classifier and Softmax for Generating Output Probabilities
        • The journey of data through the transformer model culminates in its passage through a final linear layer, which functions as a classifier.
        • The size of this classifier corresponds to the total number of classes involved (number of words contained in the vocabulary). For instance, in a scenario with 1000 distinct classes representing 1000 different words, the classifier’s output will be an array with 1000 elements.
        • This output is then passed to a softmax layer, which transforms it into probability scores, each lying between 0 and 1. The highest score is the key result: its index points directly to the word that the model predicts as the next in the sequence.
  • BERT: Bi-directional Encoder Representations from Transformers
    • reads context on both sides of each token, enabling more context-informed predictions than a purely left-to-right model
    • DistilBERT: uses knowledge distillation to reduce model size by 40%
    • fine-tune BERT (or DistilBERT etc) with your own additional training data through transfer learning
  • GPT: Generative Pre-trained Transformer; contrasted with BERT in the table below (a short runnable comparison follows this list)
| Feature | GPT (Generative Pre-trained Transformer) | BERT (Bidirectional Encoder Representations from Transformers) |
| --- | --- | --- |
| Core Architecture | Autoregressive, generative | Bidirectional, context-based |
| Training Approach | Predicts the next word in a sequence | Uses masked language modeling to predict words from context |
| Direction of Context | Unidirectional (forward) | Bidirectional (both forward and backward) |
| Primary Usage | Text generation | Text analysis and understanding |
| Generative Capabilities | Yes, designed to generate coherent text | No, focuses on understanding text, not generating |
| Pre-training | Trained on large text corpora | Trained on large text corpora with masked words |
| Fine-tuning | Necessary for specific tasks | Necessary, but effective with fewer training examples |
| Output | Generates new text sequences | Provides contextual embeddings for various NLP tasks |
  • Transfer Learning
    • Continue training a pre-trained model (fine-tuning); see the sketch after this list
      • Add new trainable layers on top of a frozen model; the new layers learn to turn the old features into predictions on new data
      • Can do both: add new layers first, then fine-tune the whole model as well
    • Retrain from scratch
      • If you have large amounts of training data, and it’s fundamentally different from what the model was pre-trained with
    • Use it as-is
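
To make the GPT vs. BERT contrast above concrete, here is a small sketch using the Hugging Face transformers library; the checkpoint names (gpt2, distilbert-base-uncased) are common hub models chosen for illustration and are downloaded on first use:

```python
from transformers import pipeline

# BERT-style model: bidirectional, trained with masked language modeling,
# so it fills in a blank using context on both sides of the mask.
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")
print(fill_mask("The capital of France is [MASK]."))

# GPT-style model: autoregressive, predicts the next token left to right,
# so it generates a continuation of the prompt.
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=10))
```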
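
A minimal sketch of the "add new trainable layers on top of a frozen model" option, using PyTorch with a pre-trained DistilBERT encoder from the Hugging Face hub; the 2-class head, the example sentence, and the single training step are assumptions made for illustration (a real job would loop over a labeled dataset with an optimizer):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained encoder and its tokenizer from the model zoo.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
base = AutoModel.from_pretrained("distilbert-base-uncased")

# Freeze the pre-trained weights: only the new head will learn.
for param in base.parameters():
    param.requires_grad = False

# New trainable layer: turns the frozen features into predictions on the new task.
head = torch.nn.Linear(base.config.hidden_size, 2)  # hypothetical 2-class task

# One illustrative step on a made-up labeled example.
inputs = tokenizer("great product, would buy again", return_tensors="pt")
features = base(**inputs).last_hidden_state[:, 0, :]   # embedding at the [CLS] position
loss = torch.nn.functional.cross_entropy(head(features), torch.tensor([1]))
loss.backward()                                        # gradients reach only the new head
```

To "do both", skip (or later undo) the freezing loop so the pre-trained layers are fine-tuned together with the new head.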

Self-Attention

  • Each encoder or decoder operates on a list of embeddings (vectors), one for each token
  • Self-attention produces a weighted average of all token embeddings. The magic is in computing the attention weights.
  • This ties each token to the other tokens that are most important for interpreting it
  • Three matrices of weights are learned through back-propagation
    • Query (Wq)
    • Key (Wk)
    • Value (Wv)
  • Every token gets a query (q), key (k), and value (v) vector by multiplying its embedding against these matrices
  • Compute a score for each token by multiplying (dot product) its query with each key
  • This is “scaled dot-product attention”: the dot-product scores are divided by the square root of the key dimension before the softmax, which keeps the softmax from saturating as the dimension grows.
  • Dot product is just one similarity function we can use.
  • In practice, softmax is then applied to the scores so they become attention weights that sum to 1 (see the sketch below).
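
A minimal numpy sketch of the computation just described; the sequence length, dimensions, and random weight matrices are placeholders (in a real transformer Wq, Wk, and Wv are learned through back-propagation, and multiple heads run in parallel):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8            # 4 tokens, toy embedding sizes

X = rng.normal(size=(seq_len, d_model))    # one embedding vector per token
Wq = rng.normal(size=(d_model, d_k))       # learned weight matrices (random here)
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

Q, K, V = X @ Wq, X @ Wk, X @ Wv           # query, key, value vector for every token

scores = Q @ K.T / np.sqrt(d_k)            # dot-product similarity, scaled by sqrt(d_k)
weights = softmax(scores, axis=-1)         # each row sums to 1: the attention weights
output = weights @ V                       # weighted average of the value vectors
print(output.shape)                        # (4, 8): one context-aware vector per token
```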

AWS Foundation Models

Amazon SageMaker JumpStart

Amazon Bedrock and Foundation Models

Amazon Q Developer (previously CodeWhisperer)

AWS HealthScribe