SOC 2 Infrastructure Monitoring Process

Learn how to implement infrastructure monitoring for SOC 2 compliance under CC7.1, including alerts, dashboards, and auditor-ready evidence.

Overview

Infrastructure monitoring is the continuous observation of system performance, availability, and resource utilization, paired with alerting so that operational issues are detected and acted on. This process supports SOC 2 CC7.1 by ensuring that potential system failures and security-impacting events are identified and addressed promptly.

Step-by-Step Process

  1. Define monitoring scope

    The Engineering Lead identifies all in-scope infrastructure components, including servers, databases, containers, and cloud services that support SOC 2 scoped systems. The output is a documented monitoring scope list aligned to the system inventory.

    Role: Engineering Lead

  2. Configure core metrics

    Engineering configures baseline metrics such as CPU utilization, memory usage, disk space, network throughput, and service uptime for each in-scope component. The output is an active set of metrics visible in the monitoring tool.

    Role: Engineering Lead

  3. Set alert thresholds

    Engineering defines alert thresholds and severity levels based on operational risk (e.g., CPU > 85% for 5 minutes). The output is a documented and enabled alert configuration tied to escalation rules.

    Role: Engineering Lead

  4. Enable alert notifications

    Engineering configures alert notifications to route to approved channels such as email, Slack, or PagerDuty. The output is verified alert delivery to on-call personnel.

    Role: Engineering Lead

  5. Review monitoring dashboards

    Engineering reviews infrastructure dashboards on a defined cadence to identify anomalies or degradation trends. The output is early identification of issues, with a dated record of each review retained as evidence.

    Role: Engineering Lead

  6. Respond to alerts

    On-call engineers investigate triggered alerts, remediate underlying issues, and document actions taken. The output is resolved alerts with timestamps and resolution notes.

    Role: Engineering Lead

  7. Periodically validate monitoring

    Engineering performs periodic checks to confirm all in-scope systems are still monitored and alerts are functioning as intended. The output is an updated monitoring validation record.

    Role: Engineering Lead
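
The example threshold in step 3 (CPU > 85% for 5 minutes) is a sustained condition rather than a single spike. The sketch below shows one way such a rule could be evaluated against raw samples. It is illustrative only: the function name, the sample format, and the defaults are assumptions, and real monitoring tools apply their own evaluation logic.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# Assumed sample format: (timestamp, CPU utilization in percent).
Sample = Tuple[datetime, float]


def breaches_sustained(
    samples: Iterable[Sample],
    threshold: float = 85.0,
    window: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True if samples stay above `threshold` for at least `window`."""
    breach_start: Optional[datetime] = None
    for ts, value in sorted(samples):
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach run
            if ts - breach_start >= window:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the run
    return False
```

A short dip below the threshold resets the run, so a series of brief spikes does not trigger the rule. That is the behaviour the "for 5 minutes" qualifier is meant to capture.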

What You Need Before Starting

  • Approved system inventory identifying SOC 2 in-scope infrastructure
  • Administrative access to monitoring tools (Datadog, New Relic, or CloudWatch)
  • On-call rotation and escalation contact list
  • Documented alerting and incident response expectations

Evidence Your Auditor Expects

  • Dated screenshot of active infrastructure dashboard showing in-scope systems (with timestamp visible)
  • Alert configuration export or screenshot showing thresholds and notification channels with last modified date
  • Sample alert log demonstrating a triggered alert and resolution timestamp within the audit period
  • Monitoring scope document mapped to system inventory with last review date
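
To pick a sample alert for the third item above, it can help to filter an exported alert log down to alerts that were triggered inside the audit period and carry a resolution timestamp. The sketch below assumes a hypothetical record shape (`AlertRecord`) and would need adapting to the fields your tool actually exports.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass
class AlertRecord:
    """Hypothetical shape of one exported alert log entry."""
    alert_id: str
    triggered_at: datetime
    resolved_at: Optional[datetime]  # None while the alert is still open


def audit_samples(
    records: Iterable[AlertRecord],
    period_start: datetime,
    period_end: datetime,
) -> List[AlertRecord]:
    """Keep alerts triggered within the audit period that have been resolved."""
    return [
        r
        for r in records
        if period_start <= r.triggered_at <= period_end
        and r.resolved_at is not None
    ]
```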

How This Looks In Your Tools

Datadog

Log in to Datadog and navigate to Infrastructure > Host Map to verify all production hosts are reporting metrics. Go to Metrics > Summary to confirm CPU, memory, disk, and network metrics are actively collecting.

Navigate to Monitors > New Monitor and select the monitor type (e.g., Infrastructure or Metric). Configure thresholds (such as system.cpu.user > 85%) and set alert conditions, then assign notification channels under Notify your team.

Access Dashboards > Dashboard List to review or create an infrastructure dashboard. Confirm dashboards are updated in real time and save any changes with a clear name indicating production scope.
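
Teams that manage monitors as code could express the CPU monitor described above as a definition like the one sketched here. The field names follow the general shape of a Datadog metric monitor definition. The scope tag `env:production` and the notification handle `@pagerduty-infra-oncall` are placeholders, not values from this process. The function only builds the definition and does not call Datadog.

```python
from typing import Any, Dict


def build_cpu_monitor(
    threshold: int = 85,
    window_minutes: int = 5,
    scope: str = "env:production",             # placeholder tag scope
    notify: str = "@pagerduty-infra-oncall",   # placeholder notification handle
) -> Dict[str, Any]:
    """Build a metric monitor definition for sustained high CPU per host."""
    query = (
        f"avg(last_{window_minutes}m):avg:system.cpu.user{{{scope}}} "
        f"by {{host}} > {threshold}"
    )
    return {
        "name": f"[Production] CPU above {threshold}% for {window_minutes} minutes",
        "type": "metric alert",
        "query": query,
        "message": f"CPU has stayed above {threshold}% on a production host. {notify}",
        "tags": ["soc2:cc7.1", "scope:production"],
        "options": {"thresholds": {"critical": threshold}},
    }
```

Keeping the definition in version control also gives a change history that can support the "last modified date" evidence item.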

New Relic

Log in to New Relic and go to Infrastructure > Hosts or Infrastructure > Kubernetes to confirm all in-scope entities are reporting data. Review the entity list to ensure no critical systems are missing.

Navigate to Alerts & AI > Alert conditions and create or review alert conditions for key metrics such as CPU, memory, and disk utilization. Assign alert policies and verify notification channels under Alerts & AI > Notification channels.

Go to Dashboards and open the Infrastructure dashboard to review real-time performance trends. Save any updates and confirm dashboards reflect current production infrastructure.

CloudWatch

Log in to the AWS Console and navigate to CloudWatch > Metrics. Review namespaces such as EC2, RDS, and ECS (AWS/EC2, AWS/RDS, and AWS/ECS in the API and CLI) to confirm metrics are being collected for all in-scope resources.

Go to CloudWatch > Alarms and create or review alarms for critical metrics (e.g., CPUUtilization, FreeStorageSpace). Configure alarm thresholds and set notifications using an SNS topic tied to on-call contacts.

Navigate to CloudWatch > Dashboards to review or create dashboards that display infrastructure health. Save dashboards and ensure widgets display current data with correct time ranges.
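
The CPUUtilization alarm described above can also be created programmatically. The sketch below builds the parameters for CloudWatch's `put_metric_alarm` call and applies them with boto3. The alarm name format, the 85% threshold, and the choice to treat missing data as breaching are assumptions for illustration, and the instance ID and SNS topic ARN must come from your own environment.

```python
from typing import Any, Dict


def build_cpu_alarm(
    instance_id: str, sns_topic_arn: str, threshold: float = 85.0
) -> Dict[str, Any]:
    """Build put_metric_alarm parameters for a 5-minute average CPU alarm."""
    return {
        "AlarmName": f"prod-cpu-high-{instance_id}",  # assumed naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,               # seconds per evaluation period
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic tied to on-call contacts
        "TreatMissingData": "breaching",  # assumption: a silent host is a problem
    }


def apply_alarm(params: Dict[str, Any]) -> None:
    """Create or update the alarm; requires boto3 and AWS credentials."""
    import boto3  # third-party SDK, imported lazily so the builder has no dependencies

    boto3.client("cloudwatch").put_metric_alarm(**params)
```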

Common Audit Findings

Incomplete monitoring coverage

This occurs when new infrastructure is deployed without being added to monitoring. Prevent this by linking monitoring configuration checks to deployment or provisioning workflows.

Alert thresholds not defined

Auditors often find metrics collected without actionable alert thresholds. Establish documented standards for thresholds and review them periodically.

Alerts not routed to on-call staff

Alerts may exist but notify inactive or incorrect channels. Regularly test alert notifications and validate the on-call contact list.

Lack of evidence for ongoing review

Teams may monitor systems but fail to retain proof of review. Preserve dashboard screenshots and alert logs with timestamps to demonstrate ongoing monitoring.
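
The incomplete-coverage finding in particular lends itself to an automated check. As a minimal sketch, assuming the system inventory and the monitoring tool can each be exported as a list of host or resource names, a set comparison shows what is missing from monitoring. The function and key names are illustrative.

```python
from typing import Dict, Iterable, List


def coverage_gaps(
    inventory: Iterable[str], monitored: Iterable[str]
) -> Dict[str, List[str]]:
    """Compare in-scope inventory names against names reporting to monitoring."""
    inv = {name.strip().lower() for name in inventory}
    mon = {name.strip().lower() for name in monitored}
    return {
        # In scope but not reporting: the gap auditors flag.
        "unmonitored": sorted(inv - mon),
        # Reporting but not in the inventory: the inventory may be stale.
        "unlisted": sorted(mon - inv),
    }
```

Running a check like this from a deployment or provisioning workflow, and keeping its dated output, would also contribute to the monitoring validation record.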

Key Roles

Engineering Lead