08. Monitoring

AWS CloudWatch

  • Metrics: Collect and track key metrics for every AWS service
    • namespace (specify a namespace for each data point, as new metric)
    • dimension is an attribute (instance id, environment, …)
    • timestamps
    • for EC2 memory
      • By default, CloudWatch does not monitor the memory, swap, and disk space utilization of your instances. If you need to track these metrics, you can install the CloudWatch agent on your EC2 instances.
      • (EC2) Memory usage is a custom metric, using API PutMetricData
    • for Lambda function
      • The ConcurrentExecutions metric in Amazon CloudWatch explicitly measures the number of instances of a Lambda function that are running at the same time.
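    • StorageResolution can be 60 (Standard, 1-minute) or 1 (High Resolution, 1-second); high-resolution metrics can be read back with a period of 1/5/10/30 sec or any multiple of 60 sec
    • Timestamps of custom metric data points can be up to 2 weeks in the past and up to 2 hours in the future
    • detailed monitoring only shortens the period to 1 minute (basic monitoring is 5 minutes); it adds no extra metrics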
  • Logs: Collect, monitor, analyze and store log files
    • Group – application (to encrypt with KMS keys, need to use the CloudWatch Logs API)
    • stream – instances / log files / containers
    • export
      • Amazon S3, may take up to 12 hours, with API CreateExportTask
      • Use a Logs Subscription (with a Subscription Filter) to stream events in real time to Kinesis Data Streams, Kinesis Data Firehose, or AWS Lambda
        • Cross-Account Subscription (Subscription Filter -> Subscription Destination)
    • Live Tail – view incoming log events in near real time
    • By default, EC2 instances do not send any logs to CloudWatch
      • CloudWatch Logs Agent – only pushes logs
      • CloudWatch Unified Agent – pushes logs + collects metrics (extra RAM, Process, Swap) + centralized configuration via SSM Parameter Store
    • Metric Filters turn matching log events into metrics that can trigger alarms; they are not applied retroactively to existing log data
    • With “aws logs associate-kms-key”, you can enable AWS KMS encryption for an existing log group, with no need to recreate the log group or manually encrypt logs before submission
    • Logs Insights
      • facilitate in-depth analysis of log data
      • enables users to run queries on log data collected from various AWS services and applications in real-time
  • Alarms: React in real time to metrics / events
    • based on a single metric; Composite Alarms monitor the states of multiple other alarms
    • Targets
      • EC2
      • EC2 ASG
      • Amazon SNS
    • Settings
      • Period is the length of time to evaluate the metric or expression to create each individual data point for an alarm. It is expressed in seconds. If you choose one minute as the period, there is one datapoint every minute.
      • Evaluation Periods is the number of the most recent periods, or data points, to evaluate when determining alarm state.
      • Datapoints to Alarm is the number of data points within the Evaluation Periods that must be breaching to cause the alarm to go to the ALARM state. The breaching data points do not have to be consecutive; they just must all be within the last number of data points equal to Evaluation Periods.
  • Synthetics Canary: monitor your APIs, URLs, Websites, …
  • Events, now called Amazon EventBridge
    • Schedule – cron job
    • Event Pattern – rules to react/trigger services
    • Event Bus, a router that receives events and delivers them to zero or more destinations, or targets.
      • (AWS) default, Partner, Custom
    • Schema – the structure template for event (json)
  • CloudWatch Evidently
    • validate/serve new features to specified % of users only
    • Launches (= feature flags) and Experiments (= A/B testing), and Overrides (specific variants assigned to specific user-id)
    • evaluation events stored in CloudWatch Logs or S3
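
The alarm settings above (Period, Evaluation Periods, Datapoints to Alarm) can be sketched as a small pure-Python function. This is only a simplified model, not how CloudWatch is actually implemented: it assumes a greater-than-threshold comparison and ignores missing data and the INSUFFICIENT_DATA state. The function name alarm_state and the sample CPU values are made up for illustration.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float,
                evaluation_periods: int, datapoints_to_alarm: int) -> str:
    """Return "ALARM" or "OK" for a greater-than-threshold alarm (toy model).

    Only the most recent `evaluation_periods` data points are considered.
    The alarm fires when at least `datapoints_to_alarm` of them breach the
    threshold; the breaching points do not have to be consecutive.
    """
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("datapoints_to_alarm must be between 1 and evaluation_periods")
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# One data point per period (for example CPU % per minute), oldest first.
cpu = [40, 85, 50, 90, 95]

# "3 out of 5": 85, 90 and 95 breach, so the alarm fires.
print(alarm_state(cpu, threshold=80, evaluation_periods=5, datapoints_to_alarm=3))
# "3 out of 3": only 90 and 95 breach in the last three points, so it stays OK.
print(alarm_state(cpu, threshold=80, evaluation_periods=3, datapoints_to_alarm=3))
```

Setting Datapoints to Alarm lower than Evaluation Periods gives an "M out of N" alarm, which tolerates a few non-breaching points inside the window.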

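The custom-metric notes above (namespace, dimensions, timestamp, PutMetricData, StorageResolution) can be tied together with a sketch that builds one MetricData entry for EC2 memory usage. It only builds the payload and checks the timestamp window locally, so it runs without AWS credentials. The metric name, namespace, instance id, and the helper name build_memory_datum are assumptions for illustration, not values from these notes.

```python
from datetime import datetime, timedelta, timezone


def build_memory_datum(instance_id: str, used_percent: float,
                       timestamp: datetime, now: datetime,
                       high_resolution: bool = False) -> dict:
    """Build one MetricData entry for a custom memory metric."""
    # PutMetricData accepts timestamps from two weeks in the past
    # up to two hours in the future; reject anything else early.
    if not (now - timedelta(weeks=2) <= timestamp <= now + timedelta(hours=2)):
        raise ValueError("timestamp outside the accepted window")
    return {
        "MetricName": "MemoryUtilization",  # hypothetical metric name
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Timestamp": timestamp,
        "Value": used_percent,
        "Unit": "Percent",
        # 1 = high resolution (1-second), 60 = standard (1-minute)
        "StorageResolution": 1 if high_resolution else 60,
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
datum = build_memory_datum("i-0123456789abcdef0", 72.5, timestamp=now, now=now)
print(datum["StorageResolution"])

# With credentials configured, the payload could be sent with boto3:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="Custom/EC2",  # hypothetical namespace
#       MetricData=[datum],
#   )
```

In practice the CloudWatch agent publishes memory metrics for you; a hand-built PutMetricData call like this is mainly useful for application-specific values.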

AWS Health Dashboard – Service History

AWS X-Ray

  • Troubleshooting (not monitoring) application performance and errors as “centralized service map visualization”
  • Request tracking across distributed systems
  • Focus on Latency, Errors and Fault analysis
  • Compatible
    • AWS Lambda
    • Elastic Beanstalk
    • ECS
    • ELB
    • API Gateway
    • EC2 Instances or any application server (even on premise)
      • But X-Ray cannot track the memory and swap usage of the instance; only the CloudWatch agent can do that.
  • Enable by
    • AWS X-Ray SDK (on applications)
    • Install the X-Ray daemon (low-level UDP packet interceptor on the OS) (on EC2 or ECS)
      • a software application that listens for traffic on UDP port 2000, gathers raw segment data, and relays it to the AWS X-Ray API.
      • for EC2, X-Ray daemon can be installed via user-data script
      • for ECS, create a Docker image that runs the X-Ray daemon, upload it to a Docker image repository, and then deploy it to your Amazon ECS cluster
      • Lambda runs the daemon automatically any time a function is invoked for a sampled request
    • Enable X-Ray AWS Integration (IAM Role with proper permission) (on AWS services)
      • for Elastic Beanstalk: enable the X-Ray daemon by including the xray-daemon.config configuration file in the .ebextensions directory of your source code.
  • Instrumentation means adding code to measure the application’s performance, diagnose errors, and write trace information
    • AWS X-Ray receives data from services as segments. X-Ray then groups segments that have a common request into traces. X-Ray processes the traces to generate a service graph that provides a visual representation of your application.
      • segments/subsegments -> traces -> service graph
    • Segments: each application / service will send them
    • Subsegments: if you need more details in your segment, especially for downstream calls such as DynamoDB.
    • Trace: segments collected together to form an end-to-end trace
      • A trace segment is just a JSON representation of a request that your application serves.
    • Sampling: decrease the amount of requests sent to X-Ray, reduce cost
      • (default) 1st request each second (aka reservoir: 1), and then 5% of additional requests (aka rate: 0.05)
    • Annotations: Key Value pairs used to index traces (for search) and use with filters
    • Metadata: “EXTRA” Key Value pairs, not indexed, not used for searching
  • A subset of segment fields are indexed by X-Ray for use with filter expressions. You can search for segments associated with specific information in the X-Ray console or by using the GetTraceSummaries API.
  • X-Ray APIs Policy
    • AWSXrayWriteOnlyAccess
      • PutTraceSegments
      • PutTelemetryRecords
      • GetSamplingRules
      • GetSamplingTargets
      • GetSamplingStatisticSummaries
    • AWSXrayReadOnlyAccess – grant console access
      • GetServiceGraph
      • BatchGetTraces
      • GetTraceSummaries
      • GetTraceGraph
    • AWSXRayDaemonWriteAccess 
    • AWSXrayFullAccess – Read + Write + configure encryption key settings and sampling rules
  • APIs
    • GetTraceSummaries – trace summaries, as a list of trace IDs of the application (also with annotations)
    • BatchGetTraces – full traces; retrieves the complete traces (all of their segment documents) for a given list of trace IDs
    • GetGroup – retrieves the group resource details.
    • GetServiceGraph – shows which services process the incoming requests, including the downstream services that they call as a result.
  • If a load balancer or other intermediary forwards a request to your application, X-Ray takes the client IP from the X-Forwarded-For header in the request instead of from the source IP in the IP packet.
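
The default sampling rule noted above (reservoir: 1, rate: 0.05) can be sketched as a toy sampler: the first request in each second is always traced, and each further request in that second is traced with 5% probability. This is an illustration of the idea only, not the X-Ray SDK's implementation; the class name DefaultSampler and the injectable rng parameter are assumptions made so the demo is repeatable.

```python
import random
from typing import Callable, Optional


class DefaultSampler:
    """Toy model of the default rule: reservoir of 1 per second, then 5%."""

    def __init__(self, reservoir: int = 1, rate: float = 0.05,
                 rng: Callable[[], float] = random.random) -> None:
        self.reservoir = reservoir
        self.rate = rate
        self.rng = rng
        self._second: Optional[int] = None  # one-second window being tracked
        self._taken = 0                     # reservoir slots used in that window

    def should_sample(self, now: float) -> bool:
        second = int(now)
        if second != self._second:
            # A new second started: refill the reservoir.
            self._second, self._taken = second, 0
        if self._taken < self.reservoir:
            self._taken += 1
            return True
        # Reservoir exhausted: fall back to the fixed sampling rate.
        return self.rng() < self.rate


# A fixed "random" value of 0.5 never falls below the 5% rate,
# so only the first request of each second is sampled here.
sampler = DefaultSampler(rng=lambda: 0.5)
decisions = [sampler.should_sample(t) for t in (10.0, 10.2, 10.9, 11.1)]
print(decisions)  # [True, False, False, True]
```

Raising the reservoir guarantees more traces per second under low traffic, while the rate controls how cost scales under high traffic.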

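Since a trace segment is a JSON document, the difference between annotations (indexed) and metadata (not indexed) can be shown with a minimal segment built by hand. This is a sketch under assumptions: the service name, annotation keys, and the helper names new_trace_id and build_segment are invented for illustration, and in a real application the X-Ray SDK would normally generate these documents for you.

```python
import json
import os
import time


def new_trace_id(now: float) -> str:
    # Version "1", epoch seconds as 8 hex digits, then a 96-bit random id (24 hex digits).
    return f"1-{int(now):08x}-{os.urandom(12).hex()}"


def build_segment(name: str, start: float, end: float,
                  annotations: dict, metadata: dict) -> dict:
    """Build a minimal segment document as a plain dict."""
    return {
        "name": name,
        "id": os.urandom(8).hex(),       # 64-bit segment id, 16 hex digits
        "trace_id": new_trace_id(start),
        "start_time": start,
        "end_time": end,
        "annotations": annotations,      # indexed: searchable with filter expressions
        "metadata": metadata,            # not indexed: extra detail only
    }


start = time.time()
segment = build_segment(
    "checkout-service",                  # hypothetical service name
    start, start + 0.25,
    annotations={"customer_tier": "gold"},
    metadata={"debug": {"cart_items": 3}},
)
document = json.dumps(segment)
print(document)

# With credentials configured, the document could be sent with boto3:
#   import boto3
#   boto3.client("xray").put_trace_segments(TraceSegmentDocuments=[document])
```

Keep annotations to the few values you expect to search on, and put everything else in metadata.
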
Amazon Managed Grafana

  • a fully managed and secure data visualization service that you can use to instantly query, correlate, and visualize operational metrics, logs, and traces

  • Use case: Monitoring and alerting
    • What is it optimized for? These services are optimized to provide real-time visibility, proactive issue detection, resource optimization, and efficient incident response, contributing to overall application and infrastructure health.
    • Monitoring and observability services
      • Amazon CloudWatch
      • Amazon CloudWatch Logs
      • Amazon EventBridge
  • Use case: Application performance monitoring
    • What is it optimized for? These services provide comprehensive insights into application behavior, offer tools for identifying and resolving performance bottlenecks, aid in efficient troubleshooting, and contribute to delivering modern user experiences across distributed and web applications.
    • Monitoring and observability services
      • Amazon CloudWatch Application Signals
      • Amazon Managed Service for Prometheus
      • AWS X-Ray
      • Amazon CloudWatch Synthetics
  • Use case: Infrastructure observability
    • What is it optimized for? These services provide a holistic view of your cloud resources, helping you make more informed decisions about resource utilization, performance optimization, and cost-efficiency.
    • Monitoring and observability services
      • Amazon CloudWatch Metrics
      • Amazon CloudWatch Container Insights
  • Use case: Logging and analysis
    • What is it optimized for? These services help you efficiently manage and analyze log data, troubleshoot, detect anomalies, support security, meeting compliance requirements, and get actionable insights into your applications and infrastructure.
    • Monitoring and observability services
      • Amazon CloudWatch Logs Insights
      • Amazon CloudWatch Logs Anomaly Detection
      • Amazon Managed Grafana
      • Amazon OpenSearch Service
      • Amazon Kinesis Data Streams
  • Use case: Security and compliance monitoring
    • What is it optimized for? Optimized to provide a robust security framework, enabling proactive threat detection, continuous monitoring, compliance tracking, and audit capabilities to help safeguard your AWS resources and maintain a secure and compliant environment.
    • Monitoring and observability services
      • Amazon GuardDuty
      • AWS Config
      • AWS CloudTrail
  • Use case: Network monitoring
    • What is it optimized for? These services provide visibility into network traffic, enhance security by detecting and preventing threats, enable efficient network traffic management, and support incident response activities.
    • Monitoring and observability services
      • Amazon CloudWatch – Network Monitor
      • Amazon CloudWatch Internet Monitor
      • Amazon VPC Flow Logs
      • AWS Network Firewall
  • Use case: Distributed tracing
    • What is it optimized for? These services provide a comprehensive view of the interactions and dependencies within your distributed applications. They enable you to diagnose performance bottlenecks, optimize application performance, and support the smooth functioning of complex systems by offering insights into how different parts of your application communicate and interact.
    • Monitoring and observability services
      • AWS Distro for OpenTelemetry
      • AWS X-Ray
      • Amazon CloudWatch Application Signals (Preview)
  • Use case: Hybrid and multicloud observability
    • What is it optimized for? Maintain reliable operations, provide modern digital experiences for your customers, and get help to meet service level objectives and performance commitments.
    • Monitoring and observability services
      • Amazon CloudWatch (hybrid and multicloud support)