21. AI – Fundamentals and Bedrock

Foundation Model (FM)

  • GPT-n (OpenAI)
  • Claude – Anthropic
  • DALL-E (OpenAI)
  • Llama (Meta)
  • DeepSeek
  • Nova (Amazon)
  • Base Foundation Models available on Amazon Bedrock
    • Jurassic-2 (AI21 Labs)
    • Claude (Anthropic)
    • Stable Diffusion (Stability AI)
    • Llama (Meta)
    • Amazon Titan
    • Amazon Nova Reel

Large Language Models (LLM)

  • interact with the LLM by giving a prompt
  • Non-deterministic: the generated text may differ for every user, even with the same prompt
  • Generative AI for Images
    • Training: Forward diffusion process
      • from Picture to Noise
    • Generating: Reverse diffusion process
      • from Noise to Picture

Tokenization

  • converting raw text into a sequence of tokens
  • Word-based tokenization: text is split into individual words
  • Subword tokenization: some words can be split too (helpful for long words…)
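
A quick, hedged illustration of subword tokenization using the open-source tiktoken library (an assumption for illustration only; each model ships its own tokenizer, so real token counts differ per model):

```python
# Subword tokenization sketch (assumes: pip install tiktoken)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Tokenization splits text into subwords")
print(tokens)                              # token IDs, e.g. [3404, ...]
print(len(tokens))                         # count drives context-window usage
print([enc.decode([t]) for t in tokens])   # the individual subword pieces
```
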

Context Window

  • The number of tokens an LLM can consider when generating text
  • The larger the context window, the more information the model can consider and the more coherent its output
  • Large context windows require more memory and processing power
  • First factor to look at when considering a model

Embeddings

  • Create vectors (array of numerical values) out of text, images or audio
  • Vectors have a high dimensionality to capture many features for one input token, such as semantic meaning, syntactic role, sentiment
  • Embedding models can power search applications
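
A minimal sketch of generating embeddings through bedrock-runtime and comparing them with cosine similarity (the model ID and region are assumptions; use whichever embedding model is enabled in your account):

```python
import json
import math

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> list[float]:
    # Titan-style request/response shape (assumed model ID)
    resp = client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Semantic search in miniature: closer meaning => higher similarity
print(cosine(embed("a cute dog"), embed("a friendly puppy")))
```
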

Retrieval-Augmented Generation (RAG)

  • Combine the model’s capability with external data sources to generate a more informed and contextually rich response
  • The initial prompt is then augmented with the external information

AWS Bedrock

  • Build Generative AI (Gen-AI) applications on AWS
  • Fully-managed service
  • Pay-per-use pricing model
  • Unified APIs
    • bedrock: Manage, deploy, train models
    • bedrock-runtime: Perform inference (execute prompts, generate embeddings) against these models
      • Converse, ConverseStream, InvokeModel, InvokeModelWithResponseStream (see the Converse sketch after this list)
    • bedrock-agent: Manage, deploy, train LLM agents and knowledge bases
    • bedrock-agent-runtime: Perform inference against agents and knowledge bases
      • InvokeAgent, Retrieve, RetrieveAndGenerate
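
A minimal Converse call through bedrock-runtime, as a hedged sketch (the model ID is an assumption; any text model you have access to follows the same shape):

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model ID
    messages=[{"role": "user", "content": [{"text": "What is RAG?"}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.5},
)
print(response["output"]["message"]["content"][0]["text"])
```
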
  • IAM permissions
    • Must use with an IAM user (not root)
    • User must have relevant Bedrock permissions
      • AmazonBedrockFullAccess
      • AmazonBedrockReadOnly
  • Amazon Bedrock makes a copy of the FM, available only to you, which you can further fine-tune with your own data
  • None of your data is used to train the FM
  • Fine-Tuning a Model
    • Adapt a copy of a foundation model with your own data
    • Fine-tuning will change the weights of the base foundation model
    • Training data must:
      • Adhere to a specific format
      • Be stored in Amazon S3
    • You must use “Provisioned Throughput” to use a fine-tuned model
    • Instruction-based
      • Improves the performance of a pre-trained FM on domain-specific tasks
      • = further trained on a particular field or area of knowledge
      • Instruction-based fine-tuning uses labeled examples that are prompt-response pairs (see the example record after this section)
      • Single-Turn Messaging
        • system (optional): context for the conversation
        • messages: an array of message objects, each containing:
          • role: either “user” or “assistant”
          • content: the text content of the message
      • Multi-Turn Messaging
        • To provide instruction-based fine-tuning for a conversation (vs. Single-Turn Messaging)
        • Chatbots = multi-turn environment
        • You must alternate between “user” and “assistant” roles
    • Continued Pre-training
      • Also called domain-adaptation fine-tuning, to make a model expert in a specific domain
      • Provide unlabeled data to continue the training of an FM
      • Good to feed industry-specific terminology into a model (acronyms, etc…)
      • Can continue to train the model as more data becomes available
    • Low-Rank Adaptation (LoRA)
      • We don’t update the entire model; we just bolt small “low-rank matrices” onto the attention weights (usually) and train those
        • “Low-rank” refers to the rank of these added matrices, which is far smaller than that of the full weight matrices
      • At inference, these fine-tuned weights get added into the base model
      • Base model remains unchanged
      • Very efficient for storage, training, and inference
      • This is different from an “adapter layer”, which inserts new modules into the network and adds inference latency
      • Working Method (see the LoRA sketch after this section)
        • Freezes Base Model: keeps the original massive model weights untouched
        • Injects Adapters: adds tiny, low-rank matrices (A and B) alongside key layers (often in attention mechanisms)
        • Low-Rank Decomposition: the weight update ΔW is factored as B·A with a small rank r, so only a fraction of the parameters are needed
        • Trains Only Adapters: only A and B receive gradient updates; the rest of the network stays frozen
        • Merges Weights (Optional): for inference, these adapter weights can be merged back into the base model to eliminate extra latency
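
The shape of one instruction-based training record, written as a Python dict for illustration (training files are JSON Lines in S3; the exact schema varies per model family, so treat this as a sketch):

```python
import json

# One labeled prompt-response pair (single-turn)
single_turn = {
    "system": "You are a helpful AWS assistant.",      # optional context
    "messages": [
        {"role": "user", "content": "What does S3 stand for?"},
        {"role": "assistant", "content": "Simple Storage Service."},
    ],
}
# Multi-turn records keep alternating "user" / "assistant" roles
print(json.dumps(single_turn))   # one line = one record in the JSONL file
```
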
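A conceptual LoRA sketch in PyTorch (an illustration of the technique, not Bedrock's internal implementation): the base weights stay frozen while only the small A and B matrices train:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                         # freeze base weights
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # low-rank factor
        self.B = nn.Parameter(torch.zeros(d_out, r))        # starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + trainable low-rank update: y = xW^T + (x A^T) B^T * scale
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 8,192 trainable vs. 262,656 parameters in the full layer
```
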
  • Automatic Evaluation
    • Evaluate a model for quality control
    • Built-in task types:
      • Text summarization
      • Question and answer
      • Text classification
      • Open-ended text generation…
    • Bring your own prompt dataset or use built-in curated prompt datasets as “Benchmark Datasets”
      • Curated collections of data designed specifically for evaluating the performance of language models
      • Wide range of topics, complexities, linguistic phenomena
      • Helpful to measure: accuracy, speed and efficiency, scalability
      • Some benchmark datasets allow you to very quickly detect any kind of bias and potential discrimination against a group of people
    • Scores are calculated automatically
    • Model scores are calculated using various statistical methods (e.g. BERTScore, F1…)
  • Human Evaluation
    • Choose from Built-in task types (same as Automatic) or add a custom task
  • Automated Metrics
    • ROUGE: Recall-Oriented Understudy for Gisting Evaluation
      • Evaluating automatic summarization and machine translation systems
      • ROUGE focuses on recall: how much the words (and/or n-grams) in the human references appear in the candidate model outputs.
      • ROUGE uses the F1-score as its default metric because it balances the trade-off between recall and precision. This is especially useful in summarization tasks, where capturing all key points (recall) and avoiding verbosity or irrelevant details (precision) are equally important.
      • ROUGE-N – measures the number of matching n-grams between reference and generated text (see the hand-rolled ROUGE-1 sketch after this metrics list)
      • ROUGE-L – longest common subsequence between reference and generated text
    • BLEU: Bilingual Evaluation Understudy
      • Evaluate the quality of generated text, especially for translations
      • BLEU focuses on precision: how much the words (and/or n-grams) in the candidate model outputs appear in the human reference.
      • Considers both precision and penalizes too much brevity
      • Looks at a combination of n-grams (1, 2, 3, 4)
    • BERTScore
      • Semantic similarity between generated and reference text
      • Uses pre-trained BERT models (Bidirectional Encoder Representations from Transformers) to compare the contextualized embeddings of both texts and computes the cosine similarity between them.
      • Capable of capturing more nuance between the texts
    • Perplexity: how well the model predicts the next token (lower is better)
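
To make the recall/precision trade-off concrete, a hand-rolled ROUGE-1 computation from unigram overlap (illustration only; real evaluations use packaged implementations):

```python
from collections import Counter

def rouge_1(reference: str, candidate: str) -> dict:
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())             # matching unigrams
    recall = overlap / max(sum(ref.values()), 1)     # vs. reference length
    precision = overlap / max(sum(cand.values()), 1) # vs. candidate length
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_1("the cat sat on the mat", "the cat lay on the mat"))
# {'precision': 0.833..., 'recall': 0.833..., 'f1': 0.833...}
```
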
  • Business Metrics
    • User Satisfaction – gather user feedback and assess satisfaction with the model responses
    • Average Revenue Per User (ARPU) – average revenue per user attributed to the Gen-AI app
    • Cross-Domain Performance – measure the model’s ability to perform tasks across different domains
    • Conversion Rate – measure how often the model drives desired outcomes, such as purchases from recommendations
    • Efficiency – evaluate the model’s efficiency in computation, resource utilization…
  • Guardrails
    • Control the interaction between users and Foundation Models (FMs)
    • Filter undesirable and harmful content
    • Remove Personally Identifiable Information (PII)
    • Enhanced privacy
    • Reduce hallucinations
    • Ability to create multiple Guardrails and to monitor and analyze user inputs that violate them
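
A hedged sketch of attaching an existing Guardrail to a Converse call (the guardrail ID/version and model ID below are placeholders):

```python
import boto3

client = boto3.client("bedrock-runtime")

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",   # assumed model ID
    messages=[{"role": "user", "content": [{"text": "Summarize this ticket"}]}],
    guardrailConfig={
        "guardrailIdentifier": "gr-EXAMPLE123",          # placeholder ID
        "guardrailVersion": "1",
    },
)
# Content that violates the guardrail is replaced by its configured message
print(response["output"]["message"]["content"][0]["text"])
```
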
  • Agents
    • Manage and carry out various multi-step tasks related to infrastructure provisioning, application deployment, and operational activities
    • Task coordination: perform tasks in the correct order and ensure information is passed correctly between tasks
    • Agents are configured to perform specific pre-defined action groups
    • Integrate with other systems, services, databases and APIs to exchange data or initiate actions
    • Leverage RAG to retrieve information when necessary
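
A minimal InvokeAgent call via bedrock-agent-runtime (agent and alias IDs are placeholders); the response arrives as an event stream of chunks:

```python
import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.invoke_agent(
    agentId="AGENT_ID_PLACEHOLDER",
    agentAliasId="ALIAS_ID_PLACEHOLDER",
    sessionId="session-001",             # ties multi-step turns together
    inputText="Create a dev environment for the payments team",
)
for event in response["completion"]:     # streamed events
    if "chunk" in event:
        print(event["chunk"]["bytes"].decode(), end="")
```
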
  • Bedrock & CloudWatch
    • Model Invocation Logging
      • Send logs of all invocations to Amazon CloudWatch and S3
      • Can include text, images and embeddings
      • Analyze further and build alerting with CloudWatch Logs Insights
    • CloudWatch Metrics
      • Bedrock publishes metrics to CloudWatch
      • Including ContentFilteredCount, which helps to see if Guardrails are functioning
      • Can build CloudWatch Alarms on top of Metrics
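
A hedged sketch of enabling Model Invocation Logging through the bedrock control-plane API (log group, role ARN, and bucket name are placeholders; field names follow the boto3 shape as I understand it):

```python
import boto3

bedrock = boto3.client("bedrock")

bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/bedrock/invocations",          # placeholder
            "roleArn": "arn:aws:iam::123456789012:role/BedrockLogs",
        },
        "s3Config": {"bucketName": "my-bedrock-logs"},       # placeholder
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": True,
        "embeddingDataDeliveryEnabled": True,
    }
)
```
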
  • Pricing
    • On-Demand
      • Pay-as-you-go (no commitment)
      • Text Models – charged for every input/output token processed
      • Embedding Models – charged for every input token processed
      • Image Models – charged for every image generated
      • Works with Base Models only
    • Batch:
      • Multiple predictions at a time (output is a single file in Amazon S3)
      • Can provide discounts of up to 50%
    • Provisioned Throughput
      • Purchase Model units for a certain time (1 month, 6 months…)
      • Throughput – max. number of input/output tokens processed per minute
      • Works with Base, Fine-tuned, and Custom Models
  • Model Improvement Techniques Cost Order
    • Prompt Engineering
      • No model training needed (no additional computation or fine-tuning)
    • Retrieval Augmented Generation (RAG)
      • Uses external knowledge (FM doesn’t need to ”know everything”, less complex)
      • No FM changes (no additional computation or fine-tuning)
    • Instruction-based Fine-tuning
      • FM is fine-tuned with specific instructions (requires additional computation)
    • Domain Adaptation Fine-tuning
      • Model is trained on a domain-specific dataset (requires intensive computation)
  • Cost savings
    • On-Demand – great for unpredictable workloads, no long-term commitment
    • Batch – provides up to 50% discounts
    • Provisioned Throughput – (usually) not a cost-saving measure, great to “reserve” capacity
    • Temperature, Top K, Top P – no impact on pricing
    • Model size – usually a smaller model will be cheaper (varies based on providers)
    • Number of Input and Output Tokens – main driver of cost (see the worked example below)
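
A back-of-the-envelope on-demand cost calculation (all per-token rates below are invented for illustration; real rates vary by model and are on the Bedrock pricing page):

```python
# Hypothetical rates (assumptions): $0.003 / 1K input, $0.015 / 1K output
input_tokens, output_tokens = 2_000, 500
price_in, price_out = 0.003 / 1000, 0.015 / 1000     # $ per token

cost = input_tokens * price_in + output_tokens * price_out
print(f"${cost:.4f} per request")                    # $0.0135
```
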


Retrieval Augmented Generation (RAG)

  • Allows a Foundation Model to reference a data source outside of its training data
  • Bedrock takes care of creating Vector Embeddings in the database of your choice based on your data
  • Use when real-time or up-to-date data needs to be fed into the Foundation Model
  • PROs
    • Faster & cheaper way to incorporate new or proprietary information into “GenAI” vs. fine-tuning
    • Updating info is just a matter of updating a database
    • Can leverage “semantic search” via vector stores
    • Can help prevent “hallucinations” when you ask the model about something it wasn’t trained on
    • If your boss wants “AI search”, this is an easy way to deliver it.
    • Technically you aren’t “training” a model with this data
  • Cons
    • You have made the world’s most overcomplicated search engine
    • Very sensitive to the prompt templates you use to incorporate your data
    • Non-deterministic
    • It can still hallucinate
    • Very sensitive to the relevancy of the information you retrieve
  • RAG Knowledge Base Data Store
    • Vector Databases
      • Amazon Aurora PostgreSQL – AWS’s proprietary relational database (vector search via the pgvector extension)
      • Amazon S3 Vectors – cost-effective and durable storage with sub-second query performance
    • Graph databases, such as Neo4j & Amazon Neptune Analytics
      • Amazon Neptune Analytics – graph database that enables high performance graph analytics and graph-based RAG (GraphRAG) solutions
    • OpenSearch for traditional text search (TF-IDF)
      • Amazon OpenSearch Service (Serverless & Managed Cluster) – search & analytics database; supports real-time similarity queries, stores millions of vector embeddings, and offers scalable index management and fast nearest-neighbor (kNN) search
      • Elasticsearch/OpenSearch can also function as a vector DB
  • R in RAG
    • Pre-Retrieval
      • Indexing
        • Granularity / chunking (the process of splitting up data prior to storage)
          • Semantic Chunking
            • Ensure each chunk contains semantically independent information
            • Embedding-based (LlamaIndex / LangChain)
            • Model-based (BERT)
            • LLM-based (prompt an LLM to perform the chunking)
        • Data extraction
      • Query Rewriting
    • Retrieval
    • Post-Retrieval
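
A hedged sketch of end-to-end RAG against a Bedrock Knowledge Base using RetrieveAndGenerate (the knowledge base ID and model ARN are placeholders):

```python
import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "What is our refund policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID_PLACEHOLDER",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/"
                        "anthropic.claude-3-haiku-20240307-v1:0",  # assumed
        },
    },
)
print(response["output"]["text"])   # grounded answer; citations also returned
```
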


Transfer Learning – the broader concept of re-using a pre-trained model to adapt it to a new related task

  • Widely used for image classification
  • And for NLP (models like BERT and GPT)
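
A minimal transfer-learning sketch with torchvision (illustrates the concept, not any AWS service): freeze a pre-trained backbone and train only a new classification head:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # pre-trained backbone
for p in model.parameters():
    p.requires_grad = False                        # freeze learned features

model.fc = nn.Linear(model.fc.in_features, 10)     # new head for 10 classes
# Only model.fc receives gradients when training on the new, related task
```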