08. Monitoring

AWS CloudWatch

  • Metrics: a metric is a variable to monitor (CPUUtilization, NetworkIn…); collect and track key metrics for every AWS service
    • namespace (specify a namespace for each data point, as new metric)
    • a dimension is an attribute (instance id, environment, …)
      • Up to 30 dimensions per metric
    • timestamps
    • for EC2 memory
      • CloudWatch does not monitor the memory, swap, and disk space utilization of your instances. If you need to track these metrics, you can install a CloudWatch agent in your EC2 instances.
      • (EC2) Memory usage is a custom metric, using API PutMetricData
    • for Lambda function
      • The ConcurrentExecutions metric in Amazon CloudWatch explicitly measures the number of instances of a Lambda function that are running at the same time.
    • detailed monitoring only shortens the period to 1 minute; it adds no extra fields
    • Metric Streams
      • near-real-time delivery
      • Option to filter metrics to only stream a subset of them
    • Custom Metrics
      • Use API call PutMetricData
      • Ability to use dimensions (attributes) to segment metrics
        • Instance.id
        • Environment.name
      • Metric resolution (StorageResolution API parameter – two possible values):
        • Standard: 1 minute (60 seconds)
        • High Resolution: 1/5/10/30 second(s) – Higher cost
      • Important: Accepts metric data points two weeks in the past and two hours in the future
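The custom metric rules above can be sketched as a small helper that builds one PutMetricData entry. This is only an illustration: the function name, the metric name, and the dimension values are made up, and nothing is sent to AWS. The dict keys follow the MetricDatum fields of the PutMetricData API.

```python
from datetime import datetime, timedelta, timezone

# Window stated in the notes: two weeks in the past, two hours in the future.
MAX_PAST = timedelta(weeks=2)
MAX_FUTURE = timedelta(hours=2)


def build_metric_datum(name, value, dimensions, timestamp, now, high_resolution=False):
    """Return a dict shaped like one PutMetricData MetricDatum entry."""
    if not (now - MAX_PAST <= timestamp <= now + MAX_FUTURE):
        raise ValueError("timestamp outside the accepted window")
    if len(dimensions) > 30:
        raise ValueError("at most 30 dimensions per metric")
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Timestamp": timestamp,
        "Value": value,
        # StorageResolution: 1 = high resolution, 60 = standard.
        "StorageResolution": 1 if high_resolution else 60,
    }


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
datum = build_metric_datum(
    "MemoryUtilization",  # hypothetical custom metric name
    42.5,
    {"InstanceId": "i-0123456789abcdef0", "Environment": "dev"},  # made-up values
    timestamp=now - timedelta(minutes=5),
    now=now,
)
# With boto3, this dict could be passed as
# put_metric_data(Namespace="MyApp", MetricData=[datum]); "MyApp" is an assumption.
```

Checking the window on the client side is a design choice for the sketch; the service applies its own validation regardless.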
    • Anomaly Detection
      • Continuously analyze metrics to determine normal baselines and surface anomalies using ML algorithms
      • It creates a model of the metric’s expected values (based on metric’s past data)
      • Shows you which values in the graph are out of the normal range
      • Allows you to create Alarms based on metric’s expected value (instead of Static Threshold)
      • Ability to exclude specified time periods or events from model training
    • Amazon Lookout for Metrics (discontinued on SEP 2025)
      • Automatically detect anomalies within metrics and identify their root causes using Machine Learning
      • It detects and diagnoses errors within your data with no manual intervention
      • Integrates with different AWS Services and 3rd party SaaS apps through AppFlow
      • Send alerts to SNS, Lambda, Slack, Webhooks…
  • Logs: Collect, monitor, analyze and store log files
    • group – arbitrary name, usually representing an application (to encrypt with custom KMS keys, you need to use the CloudWatch Logs API)
    • stream – instances within application / log files / containers
    • export
      • Amazon S3, may take up to 12 hours, with API CreateExportTask
      • Not near-real time or real-time
      • Using Logs Subscriptions to export real-time events
        • to Kinesis Data Streams, Kinesis Data Firehose, AWS Lambda
        • with Subscription Filter
        • Cross-Account Subscription – send log events to resources in a different AWS account (KDS, KDF), via Subscription Filter -> Subscription Destination
    • log expiration policies (never expire, 1 day to 10 years…)
    • Live Tail – for realtime tail watch
    • By default, no logs from EC2 machine to CloudWatch
      • CloudWatch Logs Agent – only push logs
      • CloudWatch Unified Agent – push logs + collect metrics (extra RAM, Process, Swap) + centralized by SSM Parameter Store
        • Metrics
          • Collected directly on your Linux server / EC2 instance
            • CPU (active, guest, idle, system, user, steal)
            • Disk metrics (free, used, total), Disk IO (writes, reads, bytes, iops)
            • RAM (free, inactive, used, total, cached)
            • Netstat (number of TCP and UDP connections, net packets, bytes)
            • Processes (total, dead, blocked, idle, running, sleep)
            • Swap Space (free, used, used %)
          • Reminder: out-of-the box metrics for EC2 – disk, CPU, network (high level)
    • Metric Filters
      • Filter expressions
      • Metric filters can be used to trigger alarms
      • Filters do not retroactively filter data. Filters only publish the metric data points for events that happen after the filter was created.
      • Ability to specify up to 3 Dimensions for the Metric Filter (optional)
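The non-retroactive behaviour of metric filters can be illustrated with a toy model. It is a sketch only: a plain substring test stands in for the real filter pattern syntax, and the event tuples, timestamps, and function name are invented for the example.

```python
def count_matches(events, term, filter_created_at):
    """Count events that contain `term` and arrived at or after filter creation.

    `events` is a list of (timestamp, message) tuples; older events are
    ignored, mirroring the fact that filters do not look backwards.
    """
    return sum(
        1
        for timestamp, message in events
        if timestamp >= filter_created_at and term in message
    )


events = [
    (100, "ERROR disk full"),   # arrived before the filter existed
    (200, "INFO started"),
    (300, "ERROR timeout"),     # arrived after the filter was created
]
error_count = count_matches(events, "ERROR", filter_created_at=250)
```

Only the event at timestamp 300 contributes a data point, even though an earlier event also matches the term.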
    • With "aws logs associate-kms-key", you can enable AWS KMS encryption for an existing log group, eliminating the need to recreate the log group or manually encrypt logs before submission
    • Logs Insights
      • facilitate in-depth analysis of log data
      • enables users to run queries on log data collected from various AWS services and applications
      • Provides a purpose-built query language
        • Automatically discovers fields from AWS services and JSON log events
        • Fetch desired event fields, filter based on conditions, calculate aggregate statistics, sort events, limit number of events…
        • Can save queries and add them to CloudWatch Dashboards
      • It’s a query engine, not a real-time engine
  • Alarms: React in real time to metrics / events
    • based on a single metric; Composite Alarms monitor the states of multiple other alarms
    • Targets
      • EC2
      • EC2 ASG action
      • Amazon SNS
    • Alarm States:
      • OK
      • INSUFFICIENT_DATA
      • ALARM
    • Settings
      • Period is the length of time to evaluate the metric or expression to create each individual data point for an alarm. It is expressed in seconds. If you choose one minute as the period, there is one datapoint every minute.
        • High resolution custom metrics: 10 sec, 30 sec or multiples of 60 sec
      • Evaluation Period is the number of the most recent periods, or data points, to evaluate when determining alarm state.
      • Datapoints to Alarm is the number of data points within the evaluation period that must be breaching to cause the alarm to go to the ALARM state. The breaching data points do not have to be consecutive, they just must all be within the last number of data points equal to Evaluation Period.
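The interplay of Evaluation Period and Datapoints to Alarm ("M out of N") can be sketched as follows. This is a simplification under stated assumptions: it uses a greater-than comparison, ignores the missing-data treatment options, and the function name and sample values are illustrative.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return OK, ALARM, or INSUFFICIENT_DATA for a simplified M-out-of-N alarm."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # Breaching points need not be consecutive; they only have to fall
    # inside the most recent `evaluation_periods` data points.
    breaching = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


cpu = [50, 95, 60, 91, 40]  # one data point per period, made-up values
state_5_periods = alarm_state(cpu, threshold=90, evaluation_periods=5, datapoints_to_alarm=2)
state_3_periods = alarm_state(cpu, threshold=90, evaluation_periods=3, datapoints_to_alarm=2)
```

With five periods both breaching points (95 and 91) are in view, so the result is ALARM; with three periods only 91 is in view, so the result is OK.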
    • Composite Alarms
      • Composite Alarms are monitoring the states of multiple other alarms
      • AND and OR conditions
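A composite alarm combines child alarm states with AND / OR. The sketch below builds a rule expression in the ALARM("name") form and evaluates the same logic locally; the alarm names and both helper functions are hypothetical, and only the ALARM state of each child is considered.

```python
def alarm_rule(operator, alarm_names):
    """Build a rule string such as: ALARM("a") AND ALARM("b")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_state(operator, child_states):
    """Evaluate the composite locally from a dict of child alarm states."""
    in_alarm = [state == "ALARM" for state in child_states.values()]
    fired = all(in_alarm) if operator == "AND" else any(in_alarm)
    return "ALARM" if fired else "OK"


rule = alarm_rule("AND", ["high-cpu", "high-latency"])  # made-up alarm names
children = {"high-cpu": "ALARM", "high-latency": "OK"}
and_state = composite_state("AND", children)
or_state = composite_state("OR", children)
```

Using AND is a common way to reduce alarm noise, since a single noisy child alarm cannot trigger the composite on its own.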
    • To test alarms and notifications, set the alarm state to ALARM using the CLI
      aws cloudwatch set-alarm-state --alarm-name "myalarm" --state-value ALARM --state-reason "testing purposes"
  • Synthetics Canary: monitor your APIs, URLs, Websites, …
    • Configurable script that monitors your APIs, URLs, Websites…
    • Reproduce what your customers do programmatically to find issues before customers are impacted
    • Checks the availability and latency of your endpoints and can store load time data and screenshots of the UI
    • Integration with CloudWatch Alarms
    • Scripts written in Node.js or Python
    • Programmatic access to a headless Google Chrome browser
    • Can run once or on a regular schedule
    • Blueprints
      • Heartbeat Monitor – load URL, store screenshot and an HTTP archive file
      • API Canary – test basic read and write functions of REST APIs
      • Broken Link Checker – check all links inside the URL that you are testing
      • Visual Monitoring – compare a screenshot taken during a canary run with a baseline screenshot
      • Canary Recorder – used with CloudWatch Synthetics Recorder (records your actions on a website and automatically generates a script for them)
      • GUI Workflow Builder – verifies that actions can be taken on your webpage (e.g., test a webpage with a login form)
  • Events, now called Amazon EventBridge
    • Schedule – cron job
    • Event Pattern – rules to react/trigger services
    • Event Bus, a router that receives events and delivers them to zero or more destinations, or targets.
      • (AWS) default, Partner, Custom
    • Schema – the structure template for event (json)
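How an event pattern selects events can be sketched with a deliberately simplified matcher. It handles only exact-value lists and nested objects; real event patterns support more operators (prefix, numeric, and so on) that are not modelled here. The function name and the instance id are assumptions.

```python
def matches(pattern, event):
    """Return True if every field in `pattern` is satisfied by `event`.

    Leaf values in the pattern are lists of accepted values; nested dicts
    are matched recursively. Fields absent from the pattern are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}
event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},  # made-up id
}
is_match = matches(pattern, event)
```

A rule with this pattern would forward the sample event to its targets, while an event whose state is "running" would be dropped.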
  • CloudWatch Evidently
    • validate/serve new features to specified % of users only
    • Launches (= feature flags) and Experiments (= A/B testing), and Overrides (specific variants assigned to specific user-id)
    • evaluation events stored in CloudWatch Logs or S3

AWS Health Dashboard – Service History

  • Service History
    • Shows all regions, all services health
    • Shows historical information for each day
    • Has an RSS feed you can subscribe to
    • Previously called AWS Service Health Dashboard
  • Your Account
    • Previously called AWS Personal Health Dashboard (PHD)
    • AWS Account Health Dashboard provides alerts and remediation guidance when AWS is experiencing events that may impact you.
    • While the Service Health Dashboard displays the general status of AWS services, Account Health Dashboard gives you a personalized view into the performance and availability of the AWS services underlying your AWS resources.
    • The dashboard displays relevant and timely information to help you manage events in progress and provides proactive notification to help you plan for scheduled activities.
    • Can aggregate data from an entire AWS Organization
    • Global service
    • Shows how AWS outages directly impact you & your AWS resources
    • Alert, remediation, proactive, scheduled activities
  • Health Event Notifications
    • Use EventBridge to react to changes for AWS Health events in your AWS account
      • Example: receive email notifications when EC2 instances in your AWS account are scheduled for updates
    • This is possible for Account events (resources that are affected in your account) and Public Events (Regional availability of a service)
      • Use cases: send notifications, capture event information, take corrective action…
  • Status Checks
    • Automated checks to identify hardware and software issues
    • System Status Checks
      • Monitors problems with AWS systems (software/hardware issues on the physical host, loss of system power, …)
      • Check Personal Health Dashboard for any scheduled critical maintenance by AWS to your instance’s host
      • Resolution: stop and start the instance (instance migrated to a new host)
    • Instance Status Checks
      • Monitors software/network configuration of your instance (invalid network configuration, exhausted memory, …)
      • Resolution: reboot the instance or change instance configuration
    • Attached EBS Status Checks
      • Monitors EBS volumes attached to your instance (reachable & complete I/O Operations)
      • Resolution: reboot the instance or replace affected EBS volumes
    • CW Metrics & Recovery
      • CloudWatch Metrics (1 minute interval)
        • StatusCheckFailed_System
        • StatusCheckFailed_Instance
        • StatusCheckFailed_AttachedEBS
        • StatusCheckFailed (for any)
      • Option 1: CloudWatch Alarm
        • Recover EC2 instance with the same private/public IP, EIP, metadata, and Placement Group
        • Send notifications using SNS
      • Option 2: Auto Scaling Group
        • Set min/max/desired 1 to recover an instance but won’t keep the same private and elastic IP
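Option 1 can be sketched as the parameter set for a CloudWatch alarm whose action recovers the instance. The helper name, the alarm naming scheme, the instance id, and the evaluation settings are assumptions; the keys follow the PutMetricAlarm parameters, and nothing is sent to AWS here.

```python
def recover_alarm_params(instance_id, region):
    """Return keyword arguments shaped for a PutMetricAlarm call."""
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,                 # status check metrics arrive every minute
        "EvaluationPeriods": 2,       # assumed value for the sketch
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # EC2 recover action; an SNS topic ARN could be added for notifications.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


params = recover_alarm_params("i-0123456789abcdef0", "us-east-1")  # made-up id
# With boto3, these could be passed as put_metric_alarm(**params).
```

The alarm watches the system status check specifically, since recovery addresses problems on the underlying host rather than inside the instance.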

AWS X-Ray

  • Troubleshooting (not monitoring) application performance and errors as “centralized service map visualization”
  • Request tracking across distributed systems
  • Focus on Latency, Errors and Fault analysis
  • Compatible
    • AWS Lambda
    • Elastic Beanstalk
    • ECS
    • ELB
    • API Gateway
    • EC2 Instances or any application server (even on premises)
      • But X-Ray cannot track the memory and swap usage of the instance; only the CloudWatch agent can.
  • Enable by
    • AWS X-Ray SDK (on applications)
    • Install X-Ray daemon (low-level UDP packet interceptor on OS) (on EC2 or ECS)
      • a software application that listens for traffic on UDP port 2000, gathers raw segment data, and relays it to the AWS X-Ray API.
      • for EC2, X-Ray daemon can be installed via user-data script
      • for ECS, create a Docker image that runs the X-Ray daemon, upload it to a Docker image repository, and then deploy it to your Amazon ECS cluster
      • Lambda runs the daemon automatically any time a function is invoked for a sampled request
    • Enable X-Ray AWS Integration (IAM Role with proper permission) (on AWS services)
      • for ElasticBeanstalk: to enable the X-Ray daemon
        • setting an option in the Elastic Beanstalk console
        • by including the xray-daemon.config configuration file in the .ebextensions directory of your source code.
        • The X-Ray daemon is not provided for Multicontainer Docker
  • Instrumentation means measuring the product’s performance, diagnosing errors, and writing trace information
    • AWS X-Ray receives data from services as segments. X-Ray then groups segments that have a common request into traces. X-Ray processes the traces to generate a service graph that provides a visual representation of your application.
      • segments/subsegments -> traces -> service graph
    • Segments: each application / service will send them
    • Subsegments: if you need more details in your segment, especially for DynamoDB.
    • Trace: segments collected together to form an end-to-end trace
      • A trace segment is just a JSON representation of a request that your application serves.
    • Sampling: decrease the amount of requests sent to X-Ray, reduce cost
      • (default) 1st request each second (aka reservoir: 1), and then 5% of additional requests (aka rate: 0.05)
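The default rule (reservoir: 1, rate: 0.05) can be sketched as a small local sampler. This is an illustrative model only, not the SDK's implementation; the class name is made up and the random source is injectable so the behaviour can be checked deterministically.

```python
import random


class Sampler:
    """Sample the first `reservoir` requests each second, then a fixed rate."""

    def __init__(self, reservoir=1, rate=0.05, rng=random.random):
        self.reservoir = reservoir
        self.rate = rate
        self.rng = rng
        self._second = None   # the wall-clock second currently being counted
        self._used = 0        # reservoir slots consumed in that second

    def should_sample(self, now_seconds):
        second = int(now_seconds)
        if second != self._second:
            # A new second started: refill the reservoir.
            self._second = second
            self._used = 0
        if self._used < self.reservoir:
            self._used += 1
            return True
        # Reservoir exhausted: fall back to the percentage rate.
        return self.rng() < self.rate


sampler = Sampler(rng=lambda: 0.99)  # fixed "random" value that never passes 5%
decisions = [sampler.should_sample(t) for t in (10.0, 10.2, 10.9, 11.1)]
```

The reservoir guarantees at least one trace per second under light traffic, while the rate keeps cost roughly proportional to load under heavy traffic.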
    • Annotations: Key Value pairs used to index traces (for search) and use with filters
    • Metadata: “EXTRA” Key Value pairs, not indexed, not used for searching
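A segment, with its annotations and metadata, can be sketched as the JSON document an application would hand to the X-Ray daemon. The helper names, the service name, and the annotation and metadata values are assumptions; the packet is only built, not sent, and a real segment may carry more fields.

```python
import json
import os


def new_trace_id(epoch_seconds):
    """Trace ID: version 1, 8 hex digits of epoch time, 24 random hex digits."""
    return f"1-{int(epoch_seconds):08x}-{os.urandom(12).hex()}"


def build_segment(name, start, end, annotations=None, metadata=None):
    """Return a minimal segment document as a dict."""
    return {
        "name": name,
        "id": os.urandom(8).hex(),          # 16 hex digits
        "trace_id": new_trace_id(start),
        "start_time": start,
        "end_time": end,
        "annotations": annotations or {},   # indexed, usable in filter expressions
        "metadata": metadata or {},         # not indexed, extra detail only
    }


def daemon_packet(segment):
    """Bytes for the daemon: a JSON header line, then the segment document."""
    header = json.dumps({"format": "json", "version": 1})
    return (header + "\n" + json.dumps(segment)).encode("utf-8")


segment = build_segment(
    "checkout",                              # hypothetical service name
    start=1700000000.0,
    end=1700000000.25,
    annotations={"customer_tier": "gold"},   # made-up searchable key
    metadata={"debug": {"cart_items": 3}},   # made-up non-indexed detail
)
packet = daemon_packet(segment)
# The packet could be sent over UDP to the daemon on 127.0.0.1:2000.
```

Values that need to be searchable belong in annotations; anything bulky or purely diagnostic fits better in metadata.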
  • A subset of segment fields are indexed by X-Ray for use with filter expressions. You can search for segments associated with specific information in the X-Ray console or by using the GetTraceSummaries API.
  • X-Ray APIs Policy
    • AWSXrayWriteOnlyAccess
      • PutTraceSegments
      • PutTelemetryRecords
      • GetSamplingRules
      • GetSamplingTargets
      • GetSamplingStatisticSummaries
    • AWSXrayReadOnlyAccess – grant console access
      • GetServiceGraph
      • BatchGetTraces
      • GetTraceSummaries
      • GetTraceGraph
    • AWSXRayDaemonWriteAccess 
    • AWSXrayFullAccess – Read + Write + configure encryption key settings and sampling rules
  • APIs
    • GetTraceSummaries – trace summaries, as a list of trace IDs of the application (also with annotations)
    • BatchGetTraces – full traces, retrieve the list of traces (ie activity events)
    • GetGroup – retrieves the group resource details.
    • GetServiceGraph – shows which services process the incoming requests, including the downstream services that they call as a result.
  • If a load balancer or other intermediary forwards a request to your application, X-Ray takes the client IP from the X-Forwarded-For header in the request instead of from the source IP in the IP packet.

Amazon Managed Grafana

  • a fully managed and secure data visualization service that you can use to instantly query, correlate, and visualize operational metrics, logs, and traces
  • Use case / What is it optimized for? / Monitoring and observability services
    • Monitoring and alerting
      • Optimized for: These services are optimized to provide real-time visibility, proactive issue detection, resource optimization, and efficient incident response, contributing to overall application and infrastructure health.
      • Services: Amazon CloudWatch, Amazon CloudWatch Logs, Amazon EventBridge
    • Application performance monitoring
      • Optimized for: These services provide comprehensive insights into application behavior, offer tools for identifying and resolving performance bottlenecks, aid in efficient troubleshooting, and contribute to delivering modern user experiences across distributed and web applications.
      • Services: Amazon CloudWatch Application Signals, Amazon Managed Service for Prometheus, AWS X-Ray, Amazon CloudWatch Synthetics
    • Infrastructure observability
      • Optimized for: These services provide a holistic view of your cloud resources, helping you make more informed decisions about resource utilization, performance optimization, and cost-efficiency.
      • Services: Amazon CloudWatch Metrics, Amazon CloudWatch Container Insights
    • Logging and analysis
      • Optimized for: These services help you efficiently manage and analyze log data, troubleshoot, detect anomalies, support security, meeting compliance requirements, and get actionable insights into your applications and infrastructure.
      • Services: Amazon CloudWatch Logs Insights, Amazon CloudWatch Logs Anomaly Detection, Amazon Managed Grafana, Amazon OpenSearch Service, Amazon Kinesis Data Streams
    • Security and compliance monitoring
      • Optimized for: Optimized to provide a robust security framework, enabling proactive threat detection, continuous monitoring, compliance tracking, and audit capabilities to help safeguard your AWS resources and maintain a secure and compliant environment.
      • Services: Amazon GuardDuty, AWS Config, AWS CloudTrail
    • Network monitoring
      • Optimized for: These services provide visibility into network traffic, enhance security by detecting and preventing threats, enable efficient network traffic management, and support incident response activities.
      • Services: Amazon CloudWatch – Network Monitor, Amazon CloudWatch Internet Monitor, Amazon VPC Flow Logs, AWS Network Firewall
    • Distributed tracing
      • Optimized for: These services provide a comprehensive view of the interactions and dependencies within your distributed applications. They enable you to diagnose performance bottlenecks, optimize application performance, and support the smooth functioning of complex systems by offering insights into how different parts of your application communicate and interact.
      • Services: AWS Distro for OpenTelemetry, AWS X-Ray, Amazon CloudWatch Application Signals (Preview)
    • Hybrid and multicloud observability
      • Optimized for: Maintain reliable operations, provide modern digital experiences for your customers, and get help to meet service level objectives and performance commitments.
      • Services: Amazon CloudWatch (hybrid and multicloud support)

AWS Distro for OpenTelemetry

  • (opensource) provides a single set of APIs, libraries, agents, and collector services
    • collects distributed traces and metrics from applications
    • also collects metadata from AWS resources and services
    • Auto-instrumentation Agents
    • Send traces and metrics to multiple AWS services and partner solutions
      • X-Ray, CloudWatch, Prometheus…
    • Instrument your apps running on AWS (e.g., EC2, ECS, EKS, Fargate, Lambda) as well as on-premises
    • Migrate from X-Ray to AWS Distro for OpenTelemetry if you want to standardize on the open-source OpenTelemetry APIs or send traces to multiple destinations (AWS and partners) simultaneously

Amazon Athena

  • Serverless query service to analyze data stored in Amazon S3
  • Uses standard SQL language to query the files (built on Presto)
  • Supports CSV, JSON, ORC, Avro, and Parquet
  • Commonly used with Amazon QuickSight for reporting/dashboards
    • Use cases: Business intelligence / analytics / reporting, analyze & query VPC Flow Logs, ELB Logs, CloudTrail trails, etc…
  • Performance Improvement
    • Use columnar data for cost-savings (less scan)
      • Apache Parquet or ORC is recommended
    • Use Glue to convert your data to Parquet or ORC
    • Compress data for smaller retrievals (bzip2, gzip, lz4, snappy, zlib, zstd…)
    • Partition datasets in S3 for easy querying on virtual columns
      • s3://yourBucket/pathToTable
        • /<PARTITION_COLUMN_NAME>=<VALUE>
        • /<PARTITION_COLUMN_NAME>=<VALUE>
        • /<PARTITION_COLUMN_NAME>=<VALUE>
        • /etc…
      • Example: s3://athena-examples/flight/parquet/year=1991/month=1/day=1/
    • Use larger files (> 128 MB) to minimize overhead
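The partition layout above can be sketched as a helper that turns partition columns into a Hive-style S3 prefix. The function name is made up; the example reuses the path from the notes.

```python
def partition_prefix(table_location, partitions):
    """Join column=value pairs onto the table location, in the given order."""
    parts = "/".join(f"{column}={value}" for column, value in partitions.items())
    return f"{table_location.rstrip('/')}/{parts}/"


prefix = partition_prefix(
    "s3://athena-examples/flight/parquet",
    {"year": 1991, "month": 1, "day": 1},
)
```

A query that filters on year, month, and day then only needs to scan objects under the matching prefix, which is where the cost saving comes from.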
  • Federated Query
    • Allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources (AWS or on-premises)
    • Uses Data Source Connectors that run on AWS Lambda to run Federated Queries (e.g., CloudWatch Logs, DynamoDB, RDS, …)
    • Store the results back in Amazon S3

Logs used in AWS

  • Application Logs
    • Logs that are produced by your application code
    • Contains custom log messages, stack traces, and so on
    • Written to a local file on the filesystem
    • Usually streamed to CloudWatch Logs using a CloudWatch Agent on EC2
    • If using Lambda, direct integration with CloudWatch Logs
    • If using ECS or Fargate, direct integration with CloudWatch Logs
    • If using Elastic Beanstalk, direct integration with CloudWatch Logs
  • Operating System Logs (Event Logs, System Logs)
    • Logs that are generated by your operating system (EC2 or on-premises instance)
    • Informing you of system behavior (ex: /var/log/messages or /var/log/auth.log)
    • Usually streamed to CloudWatch Logs using a CloudWatch Agent
  • Access Logs
    • list of all the requests for individual files that people have requested from a website
    • Example for httpd: /var/log/apache/access.log
    • Usually for load balancers, proxies, web servers, etc…
    • AWS provides some access logs
  • AWS Managed Logs
    • Load Balancer Access Logs (ALB, NLB, CLB) => to S3
      • Access logs for your Load Balancers
    • CloudTrail Logs => to S3 and CloudWatch Logs
      • Logs for API calls made within your account
    • VPC Flow Logs => to S3 and CloudWatch Logs
      • Information about IP traffic going to and from network interfaces in your VPC
    • Route 53 Access Logs => to CloudWatch Logs
      • Log information about the queries that Route 53 receives
    • S3 Access Logs => to S3
      • Server access logging provides detailed records for the requests that are made to a bucket
    • CloudFront Access Logs => to S3
      • Detailed information about every user request that CloudFront receives
  • CloudWatch Logs for EC2
    • By default, no logs from your EC2 machine will go to CloudWatch
    • You need to run a CloudWatch agent on EC2 to push the log files you want
    • Make sure IAM permissions are correct
    • The CloudWatch log agent can be set up on-premises too