AWS Infrastructure Monitors

Monitor AWS infrastructure health with Sprinto by tracking EC2, EBS, ECS, DynamoDB, ALB, SQS, and ElastiCache metrics through CloudWatch alarms and compliance checks.

Sprinto integrates with AWS services to monitor critical infrastructure metrics across compute, storage, database, and networking layers. These monitors help verify that your AWS environment operates within safe thresholds and meets security and compliance expectations.

This article covers the AWS-specific infrastructure monitors in Sprinto, what they check, and how to resolve failing monitors.


Monitored AWS Services

Sprinto evaluates infrastructure compliance and performance across the following AWS services:

  1. EC2 – CPU utilisation

  2. EBS – Volume health and backup

  3. ECS – CPU and memory utilisation

  4. DynamoDB – Write capacity, latency, encryption, point-in-time recovery

  5. SQS – Visible message backlog

  6. ALB / CLB – Latency and 5xx error metrics

  7. ElastiCache – CPU utilisation, connection count

  8. CloudWatch – Alarm configuration and active metric collection


Detailed Monitors and Resolution Steps

1. EC2: CPU Utilisation Should Be Monitored

  • What it checks: CloudWatch is configured to track CPU usage for EC2 instances

  • How to resolve:

    • Go to CloudWatch > Alarms > Create Alarm

    • Select EC2 → Choose CPUUtilization metric

    • Define threshold (e.g., >80% for 5 minutes)

    • Set action (e.g., SNS notification)
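The console steps above can also be scripted. The following is a minimal sketch, assuming boto3 is used to call CloudWatch's `put_metric_alarm`; the alarm naming scheme, instance ID, SNS topic ARN, and region are placeholders, and the 80% over 5 minutes values only mirror the example threshold above.

```python
def ec2_cpu_alarm_params(instance_id, sns_topic_arn, threshold_percent=80.0):
    """Build keyword arguments for CloudWatch put_metric_alarm (EC2 CPU)."""
    return {
        "AlarmName": f"ec2-cpu-high-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,             # one 5-minute window, in seconds
        "EvaluationPeriods": 1,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # e.g., an SNS topic to notify
    }


def create_alarm(params, region_name="us-east-1"):
    """Send the alarm definition to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3
    boto3.client("cloudwatch", region_name=region_name).put_metric_alarm(**params)


params = ec2_cpu_alarm_params(
    "i-0123456789abcdef0",                            # placeholder instance ID
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder topic ARN
)
print(params["AlarmName"])
```

Calling `create_alarm(params)` would apply the definition; it is kept separate so the parameters can be reviewed before anything is created in the account.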


2. EBS: Volume Health and Backup

  • What it checks:

    • Volumes are healthy (no degraded status)

    • Snapshots or backup policies are active

  • How to resolve:

    • Use AWS Backup or Amazon Data Lifecycle Manager policies to take regular snapshots

    • Go to EC2 > Volumes and check the status of each volume

    • Enable snapshot creation with tags or scheduled jobs
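To review volume health and snapshot coverage outside the console, the two checks can be combined in a small script. This is a sketch only: it assumes the inputs are the `VolumeStatuses` list returned by the EC2 `describe_volume_status` call and the `Snapshots` list returned by `describe_snapshots`, and the one-day freshness window is an illustrative assumption, not a Sprinto requirement.

```python
from datetime import datetime, timedelta, timezone


def ebs_findings(volume_statuses, snapshots, max_age_days=1, now=None):
    """Return {volume_id: [problems]} for volumes that need attention.

    volume_statuses: "VolumeStatuses" list from describe_volume_status
    snapshots:       "Snapshots" list from describe_snapshots
    max_age_days:    assumed backup freshness window (hypothetical)
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)

    # Find the most recent completed snapshot for each volume.
    latest = {}
    for snap in snapshots:
        if snap["State"] != "completed":
            continue
        vol = snap["VolumeId"]
        if vol not in latest or snap["StartTime"] > latest[vol]:
            latest[vol] = snap["StartTime"]

    findings = {}
    for entry in volume_statuses:
        vol = entry["VolumeId"]
        problems = []
        status = entry["VolumeStatus"]["Status"]
        if status != "ok":  # any other status is flagged for manual review
            problems.append(f"status is {status}")
        if vol not in latest:
            problems.append("no completed snapshot")
        elif latest[vol] < cutoff:
            problems.append("latest snapshot is older than the cutoff")
        if problems:
            findings[vol] = problems
    return findings
```

Volumes that pass both checks are omitted from the result, so an empty dictionary means nothing needs attention.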


3. ECS: CPU and Memory Metrics

  • What it checks:

    • CloudWatch metrics for ECS services are enabled

    • Thresholds are defined for resource usage

  • How to resolve:

    • Navigate to CloudWatch > Metrics > ECS

    • Set alarms for CPUUtilization and MemoryUtilization
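Because each ECS service needs one alarm per metric, generating the definitions in a loop keeps them consistent. The sketch below assumes the alarms are later passed to boto3's `put_metric_alarm`; the cluster and service names, the alarm naming scheme, and the 80% defaults are placeholders to replace with your own values.

```python
def ecs_service_alarms(cluster, service, sns_topic_arn, thresholds=None):
    """Build put_metric_alarm keyword arguments for one ECS service."""
    # Assumed default thresholds (percent); tune them to your own baseline.
    thresholds = thresholds or {"CPUUtilization": 80.0, "MemoryUtilization": 80.0}
    alarms = []
    for metric, limit in thresholds.items():
        alarms.append({
            "AlarmName": f"ecs-{cluster}-{service}-{metric}",  # hypothetical name
            "Namespace": "AWS/ECS",
            "MetricName": metric,
            "Dimensions": [
                {"Name": "ClusterName", "Value": cluster},
                {"Name": "ServiceName", "Value": service},
            ],
            "Statistic": "Average",
            "Period": 60,            # seconds
            "EvaluationPeriods": 5,  # breach must last five consecutive periods
            "Threshold": limit,
            "ComparisonOperator": "GreaterThanThreshold",
            "AlarmActions": [sns_topic_arn],
        })
    return alarms
```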


4. DynamoDB: Write Capacity, Latency, and Backup

  • What it checks:

    • Write capacity units (WCU) and latency metrics are monitored

    • Point-in-time recovery (PITR) is enabled

    • Table encryption status

  • How to resolve:

    • Go to DynamoDB > Tables

    • Enable Auto Scaling for WCU (applies to tables in provisioned capacity mode)

    • Turn on PITR under the Backups tab

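    • Confirm the encryption-at-rest settings under Additional settings; DynamoDB encrypts tables at rest by default, so verify that the key type (AWS owned, AWS managed, or customer managed AWS KMS key) matches your policy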
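The PITR and encryption settings can also be read from the DynamoDB API. This sketch assumes the inputs are the `Table` dictionary from `describe_table` and the `ContinuousBackupsDescription` dictionary from `describe_continuous_backups`; the result keys are illustrative names, and treating a missing `SSEDescription` as the default AWS owned key is an assumption to confirm against your own tables.

```python
def dynamodb_table_findings(table_description, continuous_backups):
    """Summarise PITR and encryption settings for one table.

    table_description:  "Table" dict from describe_table
    continuous_backups: "ContinuousBackupsDescription" dict from
                        describe_continuous_backups
    """
    pitr = continuous_backups.get("PointInTimeRecoveryDescription", {})
    sse = table_description.get("SSEDescription", {})
    return {
        "table": table_description["TableName"],
        "pitr_enabled": pitr.get("PointInTimeRecoveryStatus") == "ENABLED",
        # Assumption: SSEDescription is absent when the default AWS owned key is used.
        "kms_key_configured": sse.get("Status") == "ENABLED",
        "kms_key_arn": sse.get("KMSMasterKeyArn"),
    }
```

A table with `pitr_enabled` set to `False` is the case that the Backups tab step above addresses.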


5. SQS: Visible Messages Should Be Monitored

  • What it checks:

    • CloudWatch alarm is configured for message backlog

  • How to resolve:

    • Go to CloudWatch > Alarms > Create Alarm

    • Select SQS → Choose ApproximateNumberOfMessagesVisible

    • Set a threshold (e.g., >100 messages)

    • Attach notification or auto-scaling rule
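A backlog alarm of this kind might look like the following when expressed as `put_metric_alarm` parameters. This is a sketch under stated assumptions: the queue name, action ARNs, and alarm name are placeholders, and requiring two consecutive 5-minute periods is an illustrative choice to avoid alerting on short spikes.

```python
def sqs_backlog_alarm_params(queue_name, action_arns, max_visible=100):
    """Build put_metric_alarm keyword arguments for an SQS backlog alarm."""
    return {
        "AlarmName": f"sqs-backlog-{queue_name}",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",   # worst backlog seen within each period
        "Period": 300,
        "EvaluationPeriods": 2,   # assumed: backlog must persist for two periods
        "Threshold": float(max_visible),
        "ComparisonOperator": "GreaterThanThreshold",
        # Action ARNs may point to an SNS topic or an Auto Scaling policy.
        "AlarmActions": list(action_arns),
    }
```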


6. ALB / CLB: Latency Should Be Monitored

  • What it checks:

    • CloudWatch alarms are configured for high latency or 5xx errors

  • How to resolve:

    • Go to CloudWatch > Metrics > ApplicationELB (or ELB for Classic Load Balancers)

    • For ALB, track TargetResponseTime or HTTPCode_ELB_5XX_Count; for CLB, track Latency or HTTPCode_ELB_5XX

    • Set alarm thresholds
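Application and Classic Load Balancers publish their metrics under different namespaces, dimensions, and metric names, so it helps to keep that mapping in one place. The sketch below builds `put_metric_alarm` parameters for both types; the alarm names, the 1-second latency limit, and the limit of ten 5xx responses per minute are illustrative assumptions.

```python
# CloudWatch names differ between Application and Classic Load Balancers.
LB_METRICS = {
    "application": {
        "namespace": "AWS/ApplicationELB",
        "dimension": "LoadBalancer",  # value looks like "app/<name>/<id>"
        "latency": "TargetResponseTime",
        "errors_5xx": "HTTPCode_ELB_5XX_Count",
    },
    "classic": {
        "namespace": "AWS/ELB",
        "dimension": "LoadBalancerName",
        "latency": "Latency",
        "errors_5xx": "HTTPCode_ELB_5XX",
    },
}


def lb_alarm_params(lb_type, lb_value, name, sns_topic_arn,
                    latency_seconds=1.0, max_5xx=10):
    """Return [latency_alarm, error_alarm] parameter dicts for one load balancer."""
    spec = LB_METRICS[lb_type]
    common = {
        "Namespace": spec["namespace"],
        "Dimensions": [{"Name": spec["dimension"], "Value": lb_value}],
        "Period": 60,
        "EvaluationPeriods": 5,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
    latency = dict(common, AlarmName=f"{name}-latency",  # hypothetical names
                   MetricName=spec["latency"], Statistic="Average",
                   Threshold=float(latency_seconds))
    errors = dict(common, AlarmName=f"{name}-5xx",
                  MetricName=spec["errors_5xx"], Statistic="Sum",
                  Threshold=float(max_5xx))
    return [latency, errors]
```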


7. ElastiCache: CPU and Connection Metrics

  • What it checks:

    • CPU utilisation and current connection count via CloudWatch

  • How to resolve:

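    • Confirm that metrics for your clusters appear in CloudWatch under the ElastiCache namespace; ElastiCache publishes them automatically

    • Create alarms for the CPUUtilization (or EngineCPUUtilization) and CurrConnections metrics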
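To confirm that both alarms exist for a given cluster, the existing alarm definitions can be inspected. This sketch assumes the input is the `MetricAlarms` list returned by CloudWatch `describe_alarms` and that alarms are scoped with the `CacheClusterId` dimension; the list of required metrics is an assumption based on the checks described above, not Sprinto's exact logic.

```python
REQUIRED_METRICS = ("CPUUtilization", "CurrConnections")  # assumed requirement


def missing_elasticache_alarms(metric_alarms, cache_cluster_id):
    """Return the required metrics that have no alarm for the given cluster.

    metric_alarms: "MetricAlarms" list from CloudWatch describe_alarms
    """
    covered = set()
    for alarm in metric_alarms:
        if alarm.get("Namespace") != "AWS/ElastiCache":
            continue
        dims = {d["Name"]: d["Value"] for d in alarm.get("Dimensions", [])}
        if dims.get("CacheClusterId") == cache_cluster_id:
            covered.add(alarm.get("MetricName"))
    return [metric for metric in REQUIRED_METRICS if metric not in covered]
```

An empty list means every required metric already has at least one alarm for that cluster.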


8. CloudWatch: Alarm Configuration

  • What it checks:

    • Monitoring is active across key services

    • Alarms are not in INSUFFICIENT_DATA state

  • How to resolve:

    • Periodically audit alarms for gaps or inactive services

    • Ensure metrics are collected at a granularity that matches each alarm's period (e.g., enable detailed monitoring on EC2 for 1-minute metrics)
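A periodic audit can be partly automated by scanning the alarm list for common gaps. The sketch below assumes the input is the `MetricAlarms` list from CloudWatch `describe_alarms`; the three report categories are illustrative and may not match the exact conditions Sprinto evaluates.

```python
def audit_alarms(metric_alarms):
    """Group alarm names by the kind of gap found."""
    report = {"insufficient_data": [], "no_actions": [], "actions_disabled": []}
    for alarm in metric_alarms:
        name = alarm["AlarmName"]
        # No recent datapoints: the metric may be inactive or misnamed.
        if alarm.get("StateValue") == "INSUFFICIENT_DATA":
            report["insufficient_data"].append(name)
        # An alarm with no actions (or disabled actions) notifies nobody.
        if not alarm.get("AlarmActions"):
            report["no_actions"].append(name)
        elif not alarm.get("ActionsEnabled", True):
            report["actions_disabled"].append(name)
    return report
```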


Remediating in Sprinto

  • For automated checks, Sprinto syncs alarm status through the AWS integration

  • For manual checks:

    • Upload screenshots of CloudWatch alarms or service configurations

    • Attach backup policy summaries if applicable

  • Use Mark as Resolved once the required action is complete


Best Practices

  • Tag critical resources and apply monitoring only where needed

  • Automate alarm creation using Infrastructure as Code (e.g., Terraform, CloudFormation)

  • Enable notifications via SNS or integrate with incident response tools

  • Set thresholds based on baselined performance, not fixed values
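One simple way to derive a threshold from a baseline is to take recent metric values and set the limit a few standard deviations above their mean. This is a sketch of that one approach, not a recommended formula: the choice of three standard deviations and the optional floor and ceiling are assumptions, and percentile-based baselines are an equally reasonable alternative.

```python
import statistics


def baseline_threshold(samples, sigmas=3.0, floor=None, ceiling=None):
    """Suggest an alarm threshold as mean + sigmas * sample standard deviation."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    threshold = statistics.fmean(samples) + sigmas * statistics.stdev(samples)
    if floor is not None:
        threshold = max(threshold, floor)    # never alert below this value
    if ceiling is not None:
        threshold = min(threshold, ceiling)  # e.g., cap a percentage at 100
    return threshold


# Example: hourly average CPU percentages from a quiet week (made-up data).
cpu_samples = [40, 42, 38, 41, 39]
print(round(baseline_threshold(cpu_samples, ceiling=100), 2))
```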
