Foundation Model (FM)
- GPT-n (OpenAI)
- Claude (Anthropic)
- DALL-E (OpenAI)
- Llama (Meta)
- DeepSeek
- Nova (Amazon)
- AWS Foundation Models (base models available on Amazon Bedrock)
  - Jurassic-2 (AI21 Labs)
  - Claude (Anthropic)
  - Stable Diffusion (Stability AI)
  - Llama (Meta)
  - Amazon Titan
  - Amazon Nova Reel
Large Language Models (LLM)
- interact with the LLM by giving a prompt
- Non-deterministic: the generated text may be different for every user that uses the same prompt
- Generative AI for Images
- Training: Forward diffusion process
- from Picture to Noise
- Generating: Reverse diffusion process
- from Noise to Picture
Tokenization
- converting raw text into a sequence of tokens
- Word-based tokenization: text is split into individual words
- Subword tokenization: some words can be split too (helpful for long words…)
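A minimal sketch of subword tokenization, assuming the Hugging Face transformers library and the bert-base-uncased vocabulary (neither is prescribed by these notes):

```python
# Sketch: subword tokenization with a pre-trained tokenizer.
# The library and vocabulary ("transformers", "bert-base-uncased") are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("Tokenization"))
# Long/rare words get split into subwords, e.g. something like ['token', '##ization']
print(tokenizer.tokenize("The cat sat"))
# Common words usually stay whole, e.g. ['the', 'cat', 'sat']
```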
Context Window
- The number of tokens an LLM can consider when generating text
- The larger the context window, the more information and coherence
- Large context windows require more memory and processing power
- First factor to look at when considering a model
Embeddings
- Create vectors (array of numerical values) out of text, images or audio
- Vectors have a high dimensionality to capture many features for one input token, such as semantic meaning, syntactic role, sentiment
- Embedding models can power search applications
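A hedged sketch of turning text into embeddings with Bedrock and comparing two inputs by cosine similarity; the Titan model ID and the response field names are assumptions to verify against the model's documentation:

```python
# Sketch: create embeddings with Amazon Bedrock and compare them via cosine similarity.
# The model ID ("amazon.titan-embed-text-v2:0") and response shape are assumptions.
import json
import boto3
import numpy as np

bedrock_runtime = boto3.client("bedrock-runtime")

def embed(text: str) -> np.ndarray:
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])

a = embed("How do I reset my password?")
b = embed("I forgot my login credentials")
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"semantic similarity: {cosine:.3f}")  # higher = closer in meaning
```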


Retrieval-Augmented Generation (RAG)
- Combine the model’s capability with external data sources to generate a more informed and contextually rich response
- The initial prompt is then augmented with the external information
AWS Bedrock
- Build Generative AI (Gen-AI) applications on AWS
- Fully-managed service
- Pay-per-use pricing model
- Unified APIs
- bedrock: Manage, deploy, train models
- bedrock-runtime: Perform inference (execute prompts, generate embeddings) against these models
- Converse, ConverseStream, InvokeModel, InvokeModelWithResponseStream
- bedrock-agent: Manage, deploy, train LLM agents and knowledge bases
- bedrock-agent-runtime: Perform inference against agents and knowledge bases
- InvokeAgent, Retrieve, RetrieveAndGenerate
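A minimal sketch of a bedrock-runtime Converse call with boto3; the model ID is only an example and the call requires model access plus the IAM permissions listed below:

```python
# Sketch: a single Converse call against a Bedrock text model.
# The model ID is an assumption; use any model you have been granted access to.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[
        {"role": "user", "content": [{"text": "Summarize what Amazon Bedrock is in one sentence."}]}
    ],
    inferenceConfig={"maxTokens": 200, "temperature": 0.5},
)

print(response["output"]["message"]["content"][0]["text"])
```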
- IAM permissions
- Must use with an IAM user (not root)
- User must have relevant Bedrock permissions
- AmazonBedrockFullAccess
- AmazonBedrockReadOnly
- Amazon Bedrock makes a copy of the FM, available only to you, which you can further fine-tune with your own data
- None of your data is used to train the FM
- Fine-Tuning a Model
- Adapt a copy of a foundation model with your own data
- Fine-tuning will change the weights of the base foundation model
- Training data must:
- Adhere to a specific format
- Be stored in Amazon S3
- You must use “Provisioned Throughput” to use a fine-tuned model
- Instruction-based
- Improves the performance of a pre-trained FM on domain-specific tasks
- = further trained on a particular field or area of knowledge
- Instruction-based fine-tuning uses labeled examples that are prompt-response pairs
- Single-Turn Messaging
- system (optional): context for the conversation
- messages: an array of message objects, each containing:
  - role: either “user” or “assistant”
  - content: the text content of the message
- Multi-Turn Messaging
- To provide instruction-based fine-tuning data for a conversation (vs. Single-Turn Messaging) – see the example below
- Chatbots = multi-turn environment
- You must alternate between “user” and “assistant” roles
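A sketch of what a single multi-turn training record could look like in the system/messages format described above, written out as one JSON Lines entry; the field names follow these notes, but the exact schema should be verified against the Bedrock fine-tuning documentation for the chosen base model:

```python
# Sketch: one multi-turn training record in the "system"/"messages" format described above,
# written as a JSON Lines file ready to upload to Amazon S3. Verify the exact schema
# against the Bedrock fine-tuning docs for your base model.
import json

record = {
    "system": "You are a helpful internal IT support assistant.",
    "messages": [
        {"role": "user", "content": "My VPN keeps disconnecting."},
        {"role": "assistant", "content": "Let's check your client version first. Which OS are you on?"},
        {"role": "user", "content": "Windows 11."},
        {"role": "assistant", "content": "Please update to VPN client 5.2 or later, then retry."},
    ],  # roles must alternate between "user" and "assistant"
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")  # one JSON object per line; store the file in S3
```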
- Continued Pre-training
- Also called domain-adaptation fine-tuning, to make a model expert in a specific domain
- Provide unlabeled data to continue the training of an FM
- Good to feed industry-specific terminology into a model (acronyms, etc…)
- Can continue to train the model as more data becomes available
- Low-Rank Adaptation (LoRA)
- We don’t update the entire model, just slap on some “low-rank matrices” to the attention weights (usually), and train those.
- “Low-rank” refers to the complexity of the underlying matrices in the model
- At inference, these fine-tuned weights get added into the base model
- Base model remains unchanged
- Very efficient for storage, training, and inference
- This is different from an “adapter layer”
- Working Method
- Freezes Base Model: Keeps the original massive model weights untouched.
- Injects Adapters: Adds tiny, low-rank matrices (A and B) alongside key layers (often in attention mechanisms).
- Low-Rank Decomposition: the weight update is factored into two small matrices (ΔW ≈ B·A), so far fewer parameters are trained than in the full weight matrix
- Trains Only Adapters: gradient updates touch only the small A and B matrices
- Merges Weights (Optional): For inference, these adapter weights can be merged back into the base model to eliminate extra latency.
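A toy NumPy sketch of the LoRA idea (not a training loop): the base weight stays frozen, only the small factors A and B would be trained, and they can optionally be merged back for inference:

```python
# Toy sketch of the LoRA idea with NumPy: the frozen base weight W is left untouched;
# only the small low-rank factors A and B would receive gradient updates.
import numpy as np

d, r = 1024, 8                       # hidden size and LoRA rank (r << d)
W = np.random.randn(d, d)            # frozen base attention weight (never updated)
A = np.random.randn(r, d) * 0.01     # trainable low-rank factor A (r x d)
B = np.zeros((d, r))                 # trainable low-rank factor B (d x r), zero-initialized

x = np.random.randn(d)

# With adapters kept separate (extra small matmul at inference):
y = W @ x + B @ (A @ x)

# Optional merge for inference, eliminating the extra latency:
W_merged = W + B @ A
assert np.allclose(W_merged @ x, y)

# Storage/training efficiency: 2*d*r trainable parameters vs d*d in the full matrix
print(f"trainable: {2 * d * r:,} vs full: {d * d:,}")
```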

- Automatic Evaluation
- Evaluate a model for quality control
- Built-in task types:
- Text summarization
- Question and answer
- Text classification
- Open-ended text generation…
- Bring your own prompt dataset or use built-in curated prompt datasets as “Benchmark Datasets”
- Curated collections of data designed specifically for evaluating the performance of language models
- Wide range of topics, complexities, linguistic phenomena
- Helpful to measure: accuracy, speed and efficiency, scalability
- Some benchmark datasets allow you to very quickly detect any kind of bias and potential discrimination against a group of people
- Scores are calculated automatically
- Model scores are calculated using various statistical methods (e.g. BERTScore, F1…)
- Human Evaluation
- Choose from Built-in task types (same as Automatic) or add a custom task
- Automated Metrics
- ROUGE: Recall-Oriented Understudy for Gisting Evaluation
- Evaluating automatic summarization and machine translation systems
- ROUGE focuses on recall: how much the words (and/or n-grams) in the human references appear in the candidate model outputs.
- ROUGE uses the F1-score as its default metric because it balances the trade-off between recall and precision. This is especially useful in summarization tasks, where capturing all key points (recall) and avoiding verbosity or irrelevant details (precision) are equally important.
- ROUGE-N – measure the number of matching n-grams between reference and generated text
- ROUGE-L – longest common subsequence between reference and generated text
- BLEU: Bilingual Evaluation Understudy
- Evaluate the quality of generated text, especially for translations
- BLEU focuses on precision: how much the words (and/or n-grams) in the candidate model outputs appear in the human reference.
- Considers precision and applies a brevity penalty to outputs that are too short
- Looks at a combination of n-grams (1, 2, 3, 4)
- BERTScore
- Semantic similarity between generated text and reference text
- Uses pre-trained BERT models (Bidirectional Encoder Representations from Transformers) to compare the contextualized embeddings of both texts and computes the cosine similarity between them.
- Capable of capturing more nuance between the texts
- Perplexity: how well the model predicts the next token (lower is better)
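A minimal hand-rolled ROUGE-1 sketch to make the recall/precision/F1 trade-off concrete; real evaluations would use a dedicated library rather than this toy function:

```python
# Minimal sketch of ROUGE-1 (unigram overlap) computed by hand, just to illustrate
# recall vs. precision; production evaluations use a dedicated library (e.g. rouge-score).
from collections import Counter

def rouge_1(reference: str, candidate: str):
    ref, cand = Counter(reference.lower().split()), Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())       # matching unigrams
    recall = overlap / sum(ref.values())       # how much of the reference is covered
    precision = overlap / sum(cand.values())   # how much of the candidate is relevant
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

print(rouge_1("the cat sat on the mat", "the cat lay on the mat"))
```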

- Business Metrics
- User Satisfaction – gather user feedback and assess satisfaction with the model responses
- Average Revenue Per User (ARPU) – average revenue per user attributed to the Gen-AI app
- Cross-Domain Performance – measure the model’s ability to perform tasks across different domains
- Conversion Rate – generate recommended desired outcomes such as purchases
- Efficiency – evaluate the model’s efficiency in computation, resource utilization…
- Guardrails
- Control the interaction between users and Foundation Models (FMs)
- Filter undesirable and harmful content
- Remove Personally Identifiable Information (PII)
- Enhanced privacy
- Reduce hallucinations
- Ability to create multiple Guardrails and monitor and analyze user inputs that can violate the Guardrails
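A hedged sketch of attaching an existing Guardrail to a Converse call; the guardrail identifier, version, and model ID are placeholders:

```python
# Sketch: attaching an existing Guardrail to a Converse call.
# The guardrail ID/version and the model ID are placeholders; create the Guardrail first.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumption: any accessible text model
    messages=[{"role": "user", "content": [{"text": "What is my colleague's home address?"}]}],
    guardrailConfig={
        "guardrailIdentifier": "gr-xxxxxxxx",  # placeholder Guardrail ID
        "guardrailVersion": "1",
    },
)

# If the input or output violates the Guardrail, the stop reason indicates the intervention
print(response.get("stopReason"))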
- Agents
- Manage and carry out various multi-step tasks related to infrastructure provisioning, application deployment, and operational activities
- Task coordination: perform tasks in the correct order and ensure information is passed correctly between tasks
- Agents are configured to perform specific pre-defined action groups
- Integrate with other systems, services, databases and APIs to exchange data or initiate actions
- Leverage RAG to retrieve information when necessary

- Bedrock & CloudWatch
- Model Invocation Logging
- Send logs of all invocations to Amazon CloudWatch and S3
- Can include text, images and embeddings
- Analyze further and build alerting thanks to CloudWatch Logs Insights
- CloudWatch Metrics
- Bedrock publishes metrics to CloudWatch
- Including ContentFilteredCount, which helps to see if Guardrails are functioning
- Can build CloudWatch Alarms on top of Metrics
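A hedged sketch of a CloudWatch alarm on a Bedrock metric; the namespace, metric name, and SNS topic ARN are assumptions to verify in the CloudWatch console:

```python
# Sketch: CloudWatch alarm on a Bedrock metric. Namespace, metric, and the SNS topic ARN
# are assumptions/placeholders; check the exact metric names published in your account.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="bedrock-guardrail-filtered-content",
    Namespace="AWS/Bedrock",               # assumed Bedrock namespace
    MetricName="ContentFilteredCount",     # metric mentioned in these notes
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
)
```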
- Pricing
- On-Demand
- Pay-as-you-go (no commitment)
- Text Models – charged for every input/output token processed
- Embedding Models – charged for every input token processed
- Image Models – charged for every image generated
- Works with Base Models only
- Batch:
- Multiple predictions at a time (output is a single file in Amazon S3)
- Can provide discounts of up to 50%
- Provisioned Throughput
- Purchase Model units for a certain time (1 month, 6 months…)
- Throughput – max. number of input/output tokens processed per minute
- Works with Base, Fine-tuned, and Custom Models
- Model Improvement Techniques Cost Order
- Prompt Engineering
- No model training needed (no additional computation or fine-tuning)
- Retrieval Augmented Generation (RAG)
- Uses external knowledge (FM doesn’t need to ”know everything”, less complex)
- No FM changes (no additional computation or fine-tuning)
- Instruction-based Fine-tuning
- FM is fine-tuned with specific instructions (requires additional computation)
- Domain Adaptation Fine-tuning
- Model is trained on a domain-specific dataset (requires intensive computation)
- Cost savings
- On-Demand – great for unpredictable workloads, no long-term commitment
- Batch – provides up to 50% discounts
- Provisioned Throughput – (usually) not a cost-saving measure, great to “reserve” capacity
- Temperature, Top K, Top P – no impact on pricing
- Model size – usually a smaller model will be cheaper (varies based on providers)
- Number of Input and Output Tokens – main driver of cost
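A toy cost estimator illustrating why token counts are the main On-Demand cost driver; the per-1,000-token prices are made-up placeholders, not real Bedrock prices:

```python
# Toy On-Demand cost estimator: input/output token counts drive the bill.
# The per-1,000-token prices are made-up placeholders, NOT real Bedrock pricing.
PRICE_PER_1K_INPUT = 0.0008    # placeholder USD per 1,000 input tokens
PRICE_PER_1K_OUTPUT = 0.0040   # placeholder USD per 1,000 output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# 1M requests/day with a 500-token prompt and a 200-token answer
daily = estimate_cost(input_tokens=500, output_tokens=200) * 1_000_000
print(f"~${daily:,.2f} per day")  # shorter prompts/outputs or a smaller model cut this directly
```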
Retrieval Augmented Generation (RAG)
- Allows a Foundation Model to reference a data source outside of its training data
- Bedrock takes care of creating Vector Embeddings in the database of your choice based on your data
- Use when real-time or frequently changing data needs to be fed into the Foundation Model (see the RetrieveAndGenerate sketch after the data store list below)
- PROs
- Faster & cheaper way to incorporate new or proprietary information into “GenAI” vs. fine-tuning
- Updating info is just a matter of updating a database
- Can leverage “semantic search” via vector stores
- Can prevent “hallucinations” when you ask the model about something it wasn’t trained on
- If your boss wants “AI search”, this is an easy way to deliver it.
- Technically you aren’t “training” a model with this data
- Cons
- You have made the world’s most overcomplicated search engine
- Very sensitive to the prompt templates you use to incorporate your data
- Non-deterministic
- It can still hallucinate
- Very sensitive to the relevancy of the information you retrieve
- RAG Knowledge Base Data Store
- Vector Databases
- Amazon Aurora PostgreSQL – relational database, proprietary on AWS
- Amazon S3 Vectors – cost-effective and durable storage with sub-second query performance
- Graph databases, such as Neo4j & Amazon Neptune Analytics
- Amazon Neptune Analytics – graph database that enables high performance graph analytics and graph-based RAG (GraphRAG) solutions
- OpenSearch for traditional text search (TF-IDF)
- Amazon OpenSearch Service (Serverless & Managed Cluster) – search & analytics database: real-time similarity queries, stores millions of vector embeddings, scalable index management, and fast nearest-neighbor (kNN) search
- Elasticsearch/OpenSearch can function as a vector DB
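A hedged sketch of querying a Bedrock Knowledge Base through bedrock-agent-runtime's RetrieveAndGenerate; the knowledge base ID and model ARN are placeholders:

```python
# Sketch: querying a Bedrock Knowledge Base with RetrieveAndGenerate (bedrock-agent-runtime).
# The knowledge base ID and model ARN are placeholders; create the Knowledge Base first.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve_and_generate(
    input={"text": "What is our refund policy for enterprise customers?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KBXXXXXXXX",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)

print(response["output"]["text"])                 # grounded answer
for citation in response.get("citations", []):    # retrieved chunks used to augment the prompt
    print(citation)
```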




- R in RAG
- Pre-Retrieval
- Indexing
- Granularity / chunking (the process of splitting up data prior to storage) – a simple chunking sketch appears after this list
- Semantic Chunking
- Ensure each chunk contains semantically independent information
- Embedding-based (LlamaIndex / Langchain)
- Model-based (BERT)
- LLM-based (Basically tell it to do semantic chunking)
- Data extraction
- Query Rewriting
- Retrieval
- Post-Retrieval
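A minimal fixed-size chunking sketch with overlap, shown for contrast with the semantic chunking strategies above; the file name is hypothetical:

```python
# Minimal fixed-size chunking with overlap: the simplest splitting strategy,
# shown for contrast with the semantic chunking approaches listed above.
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # overlap preserves context that straddles a boundary
    return chunks

document = open("policy.txt").read()  # hypothetical source document
for chunk in chunk_text(document):
    pass  # embed each chunk and store the vector + chunk text in the vector database
```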

Transfer Learning – the broader concept of re-using a pre-trained model to adapt it to a new related task
- Widely used for image classification
- And for NLP (models like BERT and GPT)
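A brief sketch of transfer learning for NLP, assuming the Hugging Face transformers library: re-use a pre-trained BERT encoder and train only a new classification head on the target task:

```python
# Sketch of transfer learning for NLP: re-use a pre-trained BERT body and fine-tune a new
# classification head on your task. The library and model name ("transformers",
# "bert-base-uncased") are assumptions; the notes do not prescribe a specific stack.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # pre-trained encoder + freshly initialized 2-class head
)

# From here, fine-tune on the labeled task data (e.g. with the Trainer API);
# the pre-trained weights already encode general language knowledge, so far less
# task-specific data is needed than training from scratch.
```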