Foundation Model (FM)
- GPT-n (OpenAI)
- Claude (Anthropic)
- DALL-E (OpenAI)
- Llama (Meta)
- DeepSeek
- Nova (Amazon)
- AWS Foundation Models (base models available on Amazon Bedrock)
  - Jurassic-2 (AI21 Labs)
  - Claude (Anthropic)
  - Stable Diffusion (Stability AI)
  - Llama (Meta)
  - Amazon Titan
  - Amazon Nova Reel
Large Language Models (LLM)
- interact with the LLM by giving a prompt
- Non-deterministic: the generated text may be different for every user that uses the same prompt
- Generative AI for Images
- Training: Forward diffusion process
- from Picture to Noise
- Generating: Reverse diffusion process
- from Noise to Picture
Tokenization
- converting raw text into a sequence of tokens
- Word-based tokenization: text is split into individual words
- Subword tokenization: some words can be split too (helpful for long words…)
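A minimal sketch of subword tokenization, assuming the Hugging Face transformers library and the bert-base-uncased vocabulary (neither is prescribed by these notes):

```python
# Sketch: subword tokenization with a pre-trained tokenizer.
# The library and vocabulary ("transformers", "bert-base-uncased") are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("Tokenization"))
# Long/rare words get split into subwords, e.g. something like ['token', '##ization']
print(tokenizer.tokenize("The cat sat"))
# Common words usually stay whole, e.g. ['the', 'cat', 'sat']
```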
Context Window
- The number of tokens an LLM can consider when generating text
- The larger the context window, the more information and coherence
- Large context windows require more memory and processing power
- First factor to look at when considering a model
Embeddings
- Create vectors (array of numerical values) out of text, images or audio
- Vectors have a high dimensionality to capture many features for one input token, such as semantic meaning, syntactic role, sentiment
- Embedding models can power search applications
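A hedged sketch of turning text into embeddings with Bedrock and comparing two inputs by cosine similarity; the Titan model ID and the response field names are assumptions to verify against the model's documentation:

```python
# Sketch: create embeddings with Amazon Bedrock and compare them via cosine similarity.
# The model ID ("amazon.titan-embed-text-v2:0") and response shape are assumptions.
import json
import boto3
import numpy as np

bedrock_runtime = boto3.client("bedrock-runtime")

def embed(text: str) -> np.ndarray:
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])

a = embed("How do I reset my password?")
b = embed("I forgot my login credentials")
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"semantic similarity: {cosine:.3f}")  # higher = closer in meaning
```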


Retrieval-Augmented Generation (RAG)
- Combine the model’s capability with external data sources to generate a more informed and contextually rich response
- The initial prompt is then augmented with the external information
AWS Bedrock
- Build Generative AI (Gen-AI) applications on AWS
- Fully-managed service
- Pay-per-use pricing model
- Unified APIs
- bedrock: Manage, deploy, train models
- bedrock-runtime: Perform inference (execute prompts, generate embeddings) against these models
- Converse, ConverseStream, InvokeModel, InvokeModelWithResponseStream
- bedrock-agent: Manage, deploy, train LLM agents and knowledge bases
- bedrock-agent-runtime: Perform inference against agents and knowledge bases
- InvokeAgent, Retrieve, RetrieveAndGenerate
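A minimal sketch of a bedrock-runtime Converse call with boto3; the model ID is only an example and the call requires model access plus the IAM permissions listed below:

```python
# Sketch: a single Converse call against a Bedrock text model.
# The model ID is an assumption; use any model you have been granted access to.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[
        {"role": "user", "content": [{"text": "Summarize what Amazon Bedrock is in one sentence."}]}
    ],
    inferenceConfig={"maxTokens": 200, "temperature": 0.5},
)

print(response["output"]["message"]["content"][0]["text"])
```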
- IAM permissions
- Must use with an IAM user (not root)
- User must have relevant Bedrock permissions
- AmazonBedrockFullAccess
- AmazonBedrockReadOnly
- Amazon Bedrock makes a copy of the FM, available only to you, which you can further fine-tune with your own data
- None of your data is used to train the FM
- Fine-Tuning a Model
- Adapt a copy of a foundation model with your own data
- Fine-tuning will change the weights of the base foundation model
- Training data must:
- Adhere to a specific format
- Be stored in Amazon S3
- You must use “Provisioned Throughput” to use a fine-tuned model
- Instruction-based
- Improves the performance of a pre-trained FM on domain-specific tasks
- = further trained on a particular field or area of knowledge
- Instruction-based fine-tuning uses labeled examples that are prompt-response pairs
- Single-Turn Messaging
- system (optional): context for the conversation
- messages: an array of message objects, each containing:
  - role: either “user” or “assistant”
  - content: the text content of the message
- Multi-Turn Messaging
- To provide instruction-based fine-tuning data for a conversation (vs. Single-Turn Messaging) – see the example below
- Chatbots = multi-turn environment
- You must alternate between “user” and “assistant” roles
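A sketch of what a single multi-turn training record could look like in the system/messages format described above, written out as one JSON Lines entry; the field names follow these notes, but the exact schema should be verified against the Bedrock fine-tuning documentation for the chosen base model:

```python
# Sketch: one multi-turn training record in the "system"/"messages" format described above,
# written as a JSON Lines file ready to upload to Amazon S3. Verify the exact schema
# against the Bedrock fine-tuning docs for your base model.
import json

record = {
    "system": "You are a helpful internal IT support assistant.",
    "messages": [
        {"role": "user", "content": "My VPN keeps disconnecting."},
        {"role": "assistant", "content": "Let's check your client version first. Which OS are you on?"},
        {"role": "user", "content": "Windows 11."},
        {"role": "assistant", "content": "Please update to VPN client 5.2 or later, then retry."},
    ],  # roles must alternate between "user" and "assistant"
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")  # one JSON object per line; store the file in S3
```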
- Continued Pre-training
- Also called domain-adaptation fine-tuning, to make a model expert in a specific domain
- Provide unlabeled data to continue the training of an FM
- Good to feed industry-specific terminology into a model (acronyms, etc…)
- Can continue to train the model as more data becomes available
- Low-Rank Adaptation (LoRA)
- We don’t update the entire model, just slap on some “low-rank matrices” to the attention weights (usually), and train those.
- “Low-rank” refers to the complexity of the underlying matrices in the model
- At inference, these fine-tuned weights get added into the base model
- Base model remains unchanged
- Very efficient for storage, training, and inference
- This is different from an “adapter layer”
- Working Method
- Freezes Base Model: Keeps the original massive model weights untouched.
- Injects Adapters: Adds tiny, low-rank matrices (A and B) alongside key layers (often in attention mechanisms).
- Low-Rank Decomposition: the weight update is factored into two small matrices (ΔW ≈ B·A), so far fewer parameters are trained than in the full weight matrix
- Trains Only Adapters: gradient updates touch only the small A and B matrices
- Merges Weights (Optional): For inference, these adapter weights can be merged back into the base model to eliminate extra latency.
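A toy NumPy sketch of the LoRA idea (not a training loop): the base weight stays frozen, only the small factors A and B would be trained, and they can optionally be merged back for inference:

```python
# Toy sketch of the LoRA idea with NumPy: the frozen base weight W is left untouched;
# only the small low-rank factors A and B would receive gradient updates.
import numpy as np

d, r = 1024, 8                       # hidden size and LoRA rank (r << d)
W = np.random.randn(d, d)            # frozen base attention weight (never updated)
A = np.random.randn(r, d) * 0.01     # trainable low-rank factor A (r x d)
B = np.zeros((d, r))                 # trainable low-rank factor B (d x r), zero-initialized

x = np.random.randn(d)

# With adapters kept separate (extra small matmul at inference):
y = W @ x + B @ (A @ x)

# Optional merge for inference, eliminating the extra latency:
W_merged = W + B @ A
assert np.allclose(W_merged @ x, y)

# Storage/training efficiency: 2*d*r trainable parameters vs d*d in the full matrix
print(f"trainable: {2 * d * r:,} vs full: {d * d:,}")
```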

- Automatic Evaluation
- Evaluate a model for quality control
- Built-in task types:
- Text summarization
- Question and answer
- Text classification
- Open-ended text generation…
- Bring your own prompt dataset or use built-in curated prompt datasets as “Benchmark Datasets”
- Curated collections of data designed specifically for evaluating the performance of language models
- Wide range of topics, complexities, linguistic phenomena
- Helpful to measure: accuracy, speed and efficiency, scalability
- Some benchmark datasets allow you to very quickly detect any kind of bias and potential discrimination against a group of people
- Scores are calculated automatically
- Model scores are calculated using various statistical methods (e.g. BERTScore, F1…)
- Human Evaluation
- Choose from Built-in task types (same as Automatic) or add a custom task
- Automated Metrics
- ROUGE: Recall-Oriented Understudy for Gisting Evaluation
- Evaluating automatic summarization and machine translation systems
- ROUGE focuses on recall: how much the words (and/or n-grams) in the human references appear in the candidate model outputs.
- ROUGE uses the F1-score as its default metric because it balances the trade-off between recall and precision. This is especially useful in summarization tasks, where capturing all key points (recall) and avoiding verbosity or irrelevant details (precision) are equally important.
- ROUGE-N – measure the number of matching n-grams between reference and generated text
- ROUGE-L – longest common subsequence between reference and generated text
- BLEU: Bilingual Evaluation Understudy
- Evaluate the quality of generated text, especially for translations
- BLEU focuses on precision: how much the words (and/or n-grams) in the candidate model outputs appear in the human reference.
- Considers precision and applies a brevity penalty to outputs that are too short
- Looks at a combination of n-grams (1, 2, 3, 4)
- BERTScore
- Semantic similarity between generated text and reference text
- Uses pre-trained BERT models (Bidirectional Encoder Representations from Transformers) to compare the contextualized embeddings of both texts and computes the cosine similarity between them.
- Capable of capturing more nuance between the texts
- Perplexity: how well the model predicts the next token (lower is better)
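A minimal hand-rolled ROUGE-1 sketch to make the recall/precision/F1 trade-off concrete; real evaluations would use a dedicated library rather than this toy function:

```python
# Minimal sketch of ROUGE-1 (unigram overlap) computed by hand, just to illustrate
# recall vs. precision; production evaluations use a dedicated library (e.g. rouge-score).
from collections import Counter

def rouge_1(reference: str, candidate: str):
    ref, cand = Counter(reference.lower().split()), Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())       # matching unigrams
    recall = overlap / sum(ref.values())       # how much of the reference is covered
    precision = overlap / sum(cand.values())   # how much of the candidate is relevant
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

print(rouge_1("the cat sat on the mat", "the cat lay on the mat"))
```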

- Business Metrics
- User Satisfaction – gather user feedback and assess satisfaction with the model responses
- Average Revenue Per User (ARPU) – average revenue per user attributed to the Gen-AI app
- Cross-Domain Performance – measure the model’s ability to perform tasks across different domains
- Conversion Rate – generate recommended desired outcomes such as purchases
- Efficiency – evaluate the model’s efficiency in computation, resource utilization…
- Guardrails
- Control the interaction between users and Foundation Models (FMs)
- Filter undesirable and harmful content
- Remove Personally Identifiable Information (PII)
- Enhanced privacy
- Reduce hallucinations
- Ability to create multiple Guardrails and monitor and analyze user inputs that can violate the Guardrails
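A hedged sketch of attaching an existing Guardrail to a Converse call; the guardrail identifier, version, and model ID are placeholders:

```python
# Sketch: attaching an existing Guardrail to a Converse call.
# The guardrail ID/version and the model ID are placeholders; create the Guardrail first.
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumption: any accessible text model
    messages=[{"role": "user", "content": [{"text": "What is my colleague's home address?"}]}],
    guardrailConfig={
        "guardrailIdentifier": "gr-xxxxxxxx",  # placeholder Guardrail ID
        "guardrailVersion": "1",
    },
)

# If the input or output violates the Guardrail, the stop reason indicates the intervention
print(response.get("stopReason"))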
- Agents
- Manage and carry out various multi-step tasks related to infrastructure provisioning, application deployment, and operational activities
- Task coordination: perform tasks in the correct order and ensure information is passed correctly between tasks
- Agents are configured to perform specific pre-defined action groups
- Integrate with other systems, services, databases and APIs to exchange data or initiate actions
- Leverage RAG to retrieve information when necessary

- Bedrock & CloudWatch
- Model Invocation Logging
- Send logs of all invocations to Amazon CloudWatch and S3
- Can include text, images and embeddings
- Analyze further and build alerting thanks to CloudWatch Logs Insights
- CloudWatch Metrics
- Bedrock publishes metrics to CloudWatch
- Including ContentFilteredCount, which helps to see if Guardrails are functioning
- Can build CloudWatch Alarms on top of Metrics
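A hedged sketch of a CloudWatch alarm on a Bedrock metric; the namespace, metric name, and SNS topic ARN are assumptions to verify in the CloudWatch console:

```python
# Sketch: CloudWatch alarm on a Bedrock metric. Namespace, metric, and the SNS topic ARN
# are assumptions/placeholders; check the exact metric names published in your account.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="bedrock-guardrail-filtered-content",
    Namespace="AWS/Bedrock",               # assumed Bedrock namespace
    MetricName="ContentFilteredCount",     # metric mentioned in these notes
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
)
```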
- Pricing
- On-Demand
- Pay-as-you-go (no commitment)
- Text Models – charged for every input/output token processed
- Embedding Models – charged for every input token processed
- Image Models – charged for every image generated
- Works with Base Models only
- Batch:
- Multiple predictions at a time (output is a single file in Amazon S3)
- Can provide discounts of up to 50%
- Provisioned Throughput
- Purchase Model units for a certain time (1 month, 6 months…)
- Throughput – max. number of input/output tokens processed per minute
- Works with Base, Fine-tuned, and Custom Models
- Model Improvement Techniques Cost Order
- Prompt Engineering
- No model training needed (no additional computation or fine-tuning)
- Retrieval Augmented Generation (RAG)
- Uses external knowledge (FM doesn’t need to ”know everything”, less complex)
- No FM changes (no additional computation or fine-tuning)
- Instruction-based Fine-tuning
- FM is fine-tuned with specific instructions (requires additional computation)
- Domain Adaptation Fine-tuning
- Model is trained on a domain-specific dataset (requires intensive computation)
- Cost savings
- On-Demand – great for unpredictable workloads, no long-term commitment
- Batch – provides up to 50% discounts
- Provisioned Throughput – (usually) not a cost-saving measure, great to “reserve” capacity
- Temperature, Top K, Top P – no impact on pricing
- Model size – usually a smaller model will be cheaper (varies based on providers)
- Number of Input and Output Tokens – main driver of cost
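A toy cost estimator illustrating why token counts are the main On-Demand cost driver; the per-1,000-token prices are made-up placeholders, not real Bedrock prices:

```python
# Toy On-Demand cost estimator: input/output token counts drive the bill.
# The per-1,000-token prices are made-up placeholders, NOT real Bedrock pricing.
PRICE_PER_1K_INPUT = 0.0008    # placeholder USD per 1,000 input tokens
PRICE_PER_1K_OUTPUT = 0.0040   # placeholder USD per 1,000 output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# 1M requests/day with a 500-token prompt and a 200-token answer
daily = estimate_cost(input_tokens=500, output_tokens=200) * 1_000_000
print(f"~${daily:,.2f} per day")  # shorter prompts/outputs or a smaller model cut this directly
```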
Retrieval Augmented Generation (RAG)
- Allows a Foundation Model to reference a data source outside of its training data
- Bedrock takes care of creating Vector Embeddings in the database of your choice based on your data
- Use when real-time or frequently changing data needs to be fed into the Foundation Model (see the RetrieveAndGenerate sketch after the data store list below)
- PROs
- Faster & cheaper way to incorporate new or proprietary information into “GenAI” vs. fine-tuning
- Updating info is just a matter of updating a database
- Can leverage “semantic search” via vector stores
- Can prevent “hallucinations” when you ask the model about something it wasn’t trained on
- If your boss wants “AI search”, this is an easy way to deliver it.
- Technically you aren’t “training” a model with this data
- Cons
- You have made the world’s most overcomplicated search engine
- Very sensitive to the prompt templates you use to incorporate your data
- Non-deterministic
- It can still hallucinate
- Very sensitive to the relevancy of the information you retrieve
- RAG Knowledge Base Data Store
- Vector Databases
- Amazon Aurora PostgreSQL – relational database, proprietary on AWS
- Amazon S3 Vectors – cost-effective and durable storage with sub-second query performance
- Graph databases, such as Neo4j & Amazon Neptune Analytics
- Amazon Neptune Analytics – graph database that enables high performance graph analytics and graph-based RAG (GraphRAG) solutions
- OpenSearch for traditional text search (TF-IDF)
- Amazon OpenSearch Service (Serverless & Managed Cluster) – search & analytics database: real-time similarity queries, stores millions of vector embeddings, scalable index management, and fast nearest-neighbor (kNN) search
- Elasticsearch/OpenSearch can function as a vector DB
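A hedged sketch of querying a Bedrock Knowledge Base through bedrock-agent-runtime's RetrieveAndGenerate; the knowledge base ID and model ARN are placeholders:

```python
# Sketch: querying a Bedrock Knowledge Base with RetrieveAndGenerate (bedrock-agent-runtime).
# The knowledge base ID and model ARN are placeholders; create the Knowledge Base first.
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve_and_generate(
    input={"text": "What is our refund policy for enterprise customers?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KBXXXXXXXX",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)

print(response["output"]["text"])                 # grounded answer
for citation in response.get("citations", []):    # retrieved chunks used to augment the prompt
    print(citation)
```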




- R in RAG
- Pre-Retrieval
- Indexing
- Granularity / chunking (the process of splitting up data prior to storage) – a simple chunking sketch appears after this list
- Semantic Chunking
- Ensure each chunk contains semantically independent information
- Embedding-based (LlamaIndex / Langchain)
- Model-based (BERT)
- LLM-based (Basically tell it to do semantic chunking)
- Data extraction
- Query Rewriting
- Retrieval
- Post-Retrieval
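A minimal fixed-size chunking sketch with overlap, shown for contrast with the semantic chunking strategies above; the file name is hypothetical:

```python
# Minimal fixed-size chunking with overlap: the simplest splitting strategy,
# shown for contrast with the semantic chunking approaches listed above.
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # overlap preserves context that straddles a boundary
    return chunks

document = open("policy.txt").read()  # hypothetical source document
for chunk in chunk_text(document):
    pass  # embed each chunk and store the vector + chunk text in the vector database
```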

Transfer Learning – the broader concept of re-using a pre-trained model to adapt it to a new related task
- Widely used for image classification
- And for NLP (models like BERT and GPT)
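A brief sketch of transfer learning for NLP, assuming the Hugging Face transformers library: re-use a pre-trained BERT encoder and train only a new classification head on the target task:

```python
# Sketch of transfer learning for NLP: re-use a pre-trained BERT body and fine-tune a new
# classification head on your task. The library and model name ("transformers",
# "bert-base-uncased") are assumptions; the notes do not prescribe a specific stack.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # pre-trained encoder + freshly initialized 2-class head
)

# From here, fine-tune on the labeled task data (e.g. with the Trainer API);
# the pre-trained weights already encode general language knowledge, so far less
# task-specific data is needed than training from scratch.
```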