Overview
Infrastructure Monitoring is the continuous observation of system performance, availability, and resource utilization, with alerting that lets teams detect and respond to operational issues. This process supports SOC 2 CC7.1 by ensuring that potential system failures and security-impacting events are identified and addressed in a timely manner.
Step-by-Step Process
Define monitoring scope
The Engineering Lead identifies all in-scope infrastructure components, including servers, databases, containers, and cloud services that support SOC 2 scoped systems. The output is a documented monitoring scope list aligned to the system inventory.
Role: Engineering Lead
Configure core metrics
Engineering configures baseline metrics such as CPU utilization, memory usage, disk space, network throughput, and service uptime for each in-scope component. The output is an active set of metrics visible in the monitoring tool.
Role: Engineering Lead
Set alert thresholds
Engineering defines alert thresholds and severity levels based on operational risk (e.g., CPU > 85% for 5 minutes). The output is a documented and enabled alert configuration tied to escalation rules.
Role: Engineering Lead
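A threshold such as "CPU > 85% for 5 minutes" only fires when the breach is sustained, not on a single spike. The following is a minimal sketch of that evaluation logic, assuming samples arrive sorted by time; the `Sample` type and `breaches_threshold` function are hypothetical names for illustration and do not reflect how any particular monitoring tool implements it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Sample:
    timestamp: datetime
    value: float  # e.g. CPU utilization in percent


def breaches_threshold(samples: List[Sample], threshold: float,
                       window: timedelta) -> bool:
    """Return True if every sample in the trailing window exceeds threshold.

    Assumes `samples` is sorted ascending by timestamp.
    """
    if not samples:
        return False
    cutoff = samples[-1].timestamp - window
    # Without data covering the full window we cannot claim a sustained breach.
    if samples[0].timestamp > cutoff:
        return False
    recent = [s for s in samples if s.timestamp >= cutoff]
    return all(s.value > threshold for s in recent)


start = datetime(2024, 1, 1, 12, 0)
# Ten one-minute samples: low for four minutes, then high for six.
values = [50, 50, 50, 50, 90, 90, 90, 90, 90, 90]
history = [Sample(start + timedelta(minutes=i), v) for i, v in enumerate(values)]
print(breaches_threshold(history, threshold=85, window=timedelta(minutes=5)))
```

Requiring the whole window to exceed the threshold reduces noise from short spikes, at the cost of a slower alert; the right trade-off depends on the operational risk of the component.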
Enable alert notifications
Engineering configures alert notifications to route to approved channels such as email, Slack, or PagerDuty. The output is verified alert delivery to on-call personnel.
Role: Engineering Lead
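Routing rules are easier to review when they are written down as data. The sketch below is one possible way to express a severity-to-channel mapping; the severity names and the specific mapping are assumptions for illustration, and only the channel types (email, Slack, PagerDuty) come from this procedure.

```python
from typing import Dict, List

# Channels approved for alert delivery in this procedure.
APPROVED_CHANNELS = {"email", "slack", "pagerduty"}

# Hypothetical routing table: adjust to your own escalation rules.
ROUTES: Dict[str, List[str]] = {
    "critical": ["pagerduty", "slack"],
    "warning": ["slack"],
    "info": ["email"],
}


def channels_for(severity: str) -> List[str]:
    """Return the channels for a severity.

    Unknown severities fall back to the critical route so that a
    mislabeled alert is never silently dropped.
    """
    return ROUTES.get(severity.strip().lower(), ROUTES["critical"])


def unapproved_routes() -> List[str]:
    """List any configured channel that is not on the approved list."""
    return sorted({c for chans in ROUTES.values() for c in chans}
                  - APPROVED_CHANNELS)


print(channels_for("critical"))
print(unapproved_routes())
```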
Review monitoring dashboards
Engineering reviews infrastructure dashboards on an ongoing basis to identify anomalies or degradation trends. The output is operational awareness and early identification of issues.
Role: Engineering Lead
Respond to alerts
On-call engineers investigate triggered alerts, remediate underlying issues, and document actions taken. The output is resolved alerts with timestamps and resolution notes.
Role: Engineering Lead
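The output of this step is only useful as evidence if each alert carries both timestamps and resolution notes. As a rough sketch, a resolved-alert record could be modeled as below; the `AlertRecord` fields and helper names are hypothetical and would normally map onto whatever your alerting or ticketing tool exports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertRecord:
    alert_id: str
    triggered_at: datetime
    resolved_at: Optional[datetime] = None
    resolution_notes: str = ""


def time_to_resolve(record: AlertRecord) -> Optional[timedelta]:
    """Return the elapsed time from trigger to resolution, or None if open."""
    if record.resolved_at is None:
        return None
    return record.resolved_at - record.triggered_at


def is_audit_ready(record: AlertRecord) -> bool:
    """A record is usable as evidence only if it is resolved, the
    timestamps are in order, and resolution notes were written."""
    return (
        record.resolved_at is not None
        and record.resolved_at >= record.triggered_at
        and bool(record.resolution_notes.strip())
    )


example = AlertRecord(
    alert_id="ALERT-EXAMPLE-1",  # hypothetical identifier
    triggered_at=datetime(2024, 1, 1, 10, 0),
    resolved_at=datetime(2024, 1, 1, 10, 45),
    resolution_notes="Restarted stuck worker; disk usage back under threshold.",
)
print(time_to_resolve(example), is_audit_ready(example))
```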
Periodically validate monitoring
Engineering performs periodic checks to confirm all in-scope systems are still monitored and alerts are functioning as intended. The output is an updated monitoring validation record.
Role: Engineering Lead
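The core of the validation check is a comparison between the approved system inventory and the set of hosts the monitoring tool actually reports. A minimal sketch follows, assuming both lists have already been exported as hostnames; the function name and sample hostnames are hypothetical.

```python
from typing import Dict, Iterable, List


def coverage_gaps(inventory: Iterable[str],
                  monitored: Iterable[str]) -> Dict[str, List[str]]:
    """Compare inventory hostnames against monitored hostnames.

    Names are normalized (trimmed, lowercased) so formatting differences
    between exports do not produce false gaps.
    """
    inv = {h.strip().lower() for h in inventory}
    mon = {h.strip().lower() for h in monitored}
    return {
        # In scope but not reporting: a monitoring coverage gap.
        "unmonitored": sorted(inv - mon),
        # Reporting but not in the inventory: the inventory may be stale.
        "not_in_inventory": sorted(mon - inv),
    }


result = coverage_gaps(
    inventory=["web-01", "DB-01 ", "cache-01"],
    monitored=["web-01", "db-01", "old-host"],
)
print(result)
```

Saving the dated result of each run alongside the monitoring scope document is one way to produce the validation record this step calls for.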
What You Need Before Starting
- Approved system inventory identifying SOC 2 in-scope infrastructure
- Administrative access to monitoring tools (Datadog, New Relic, or CloudWatch)
- On-call rotation and escalation contact list
- Documented alerting and incident response expectations
Evidence Your Auditor Expects
- Dated screenshot of active infrastructure dashboard showing in-scope systems (with timestamp visible)
- Alert configuration export or screenshot showing thresholds and notification channels with last modified date
- Sample alert log demonstrating a triggered alert and resolution timestamp within the audit period
- Monitoring scope document mapped to system inventory with last review date
How This Looks In Your Tools
Datadog
Log in to Datadog and navigate to Infrastructure > Host Map to verify all production hosts are reporting metrics. Go to Metrics > Summary to confirm CPU, memory, disk, and network metrics are actively collecting.
Navigate to Monitors > New Monitor and select the monitor type (e.g., Infrastructure or Metric). Configure thresholds (such as system.cpu.user > 85%) and set alert conditions, then assign notification channels under Notify your team.
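The same monitor can also be kept as a reviewable definition rather than configured only through the UI. The sketch below builds a monitor definition as a plain dictionary using Datadog's metric monitor query syntax; it does not call the Datadog API. The `env:production` scope, the monitor name, and the `@slack-ops-alerts` and `@pagerduty-infra-oncall` handles are assumptions for illustration and must be replaced with values from your own account.

```python
import json


def build_cpu_monitor(threshold: float = 85.0,
                      window: str = "last_5m",
                      scope: str = "env:production") -> dict:
    """Build a metric monitor definition for sustained high user CPU."""
    # Doubled braces emit literal { } in the f-string.
    query = (
        f"avg({window}):avg:system.cpu.user{{{scope}}} by {{host}} "
        f"> {threshold:g}"
    )
    return {
        "name": "High CPU on production hosts",  # hypothetical name
        "type": "metric alert",
        "query": query,
        # Notification handles below are placeholders, not real channels.
        "message": "CPU above threshold. @slack-ops-alerts @pagerduty-infra-oncall",
        "tags": ["scope:soc2", "team:engineering"],
        "options": {
            "thresholds": {"critical": threshold},
            "notify_no_data": True,   # alert if a host stops reporting
            "no_data_timeframe": 10,  # minutes
        },
    }


monitor = build_cpu_monitor()
print(json.dumps(monitor, indent=2))
```

Keeping definitions like this in version control gives a last-modified history for each threshold, which lines up with the alert configuration evidence auditors ask for.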
Access Dashboards > Dashboard List to review or create an infrastructure dashboard. Confirm dashboards are updated in real time and save any changes with a clear name indicating production scope.
New Relic
Log in to New Relic and go to Infrastructure > Hosts or Infrastructure > Kubernetes to confirm all in-scope entities are reporting data. Review the entity list to ensure no critical systems are missing.
Navigate to Alerts & AI > Alert conditions and create or review alert conditions for key metrics such as CPU, memory, and disk utilization. Assign alert policies and verify notification channels under Alerts & AI > Notification channels.
Go to Dashboards and open the Infrastructure dashboard to review real-time performance trends. Save any updates and confirm dashboards reflect current production infrastructure.
CloudWatch
Log in to the AWS Console and navigate to CloudWatch > Metrics. Review namespaces such as AWS/EC2, AWS/RDS, and AWS/ECS to confirm metrics are being collected for all in-scope resources.
Go to CloudWatch > Alarms and create or review alarms for critical metrics (e.g., CPUUtilization, FreeStorageSpace). Configure alarm thresholds and set notifications using an SNS topic tied to on-call contacts.
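For teams that script alarm creation, the alarm settings above correspond to the parameters of the CloudWatch `PutMetricAlarm` operation. The sketch below only builds the parameter dictionary and makes no AWS calls; the instance ID, SNS topic ARN, alarm naming scheme, and the choice to treat missing data as breaching are all assumptions for illustration.

```python
def build_cpu_alarm(instance_id: str, sns_topic_arn: str,
                    threshold: float = 85.0) -> dict:
    """Build PutMetricAlarm parameters for sustained high EC2 CPU."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds: one 5-minute datapoint
        "EvaluationPeriods": 1,   # alarm after one breaching period
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic tied to on-call contacts
        # Assumption: a silent instance is treated as a problem.
        "TreatMissingData": "breaching",
    }


params = build_cpu_alarm(
    instance_id="i-0123456789abcdef0",  # placeholder instance ID
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:oncall-alerts",  # placeholder
)
print(params["AlarmName"], params["Threshold"])
```

With `boto3` installed and AWS credentials configured, these parameters could be applied with `boto3.client("cloudwatch").put_metric_alarm(**params)`.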
Navigate to CloudWatch > Dashboards to review or create dashboards that display infrastructure health. Save dashboards and ensure widgets display current data with correct time ranges.
Common Audit Findings
- Incomplete monitoring coverage: This occurs when new infrastructure is deployed without being added to monitoring. Prevent this by linking monitoring configuration checks to deployment or provisioning workflows.
- Alert thresholds not defined: Auditors often find metrics collected without actionable alert thresholds. Establish documented standards for thresholds and review them periodically.
- Alerts not routed to on-call staff: Alerts may exist but notify inactive or incorrect channels. Regularly test alert notifications and validate the on-call contact list.
- Lack of evidence for ongoing review: Teams may monitor systems but fail to retain proof of review. Preserve dashboard screenshots and alert logs with timestamps to demonstrate ongoing monitoring.