SOC 2 Disaster Recovery Testing Process

Learn how to perform Disaster Recovery Testing under SOC 2 CC9.1, including steps, evidence, and tools to meet compliance requirements.

Overview

Disaster Recovery Testing is the formal process of simulating system outages to verify that critical services can be restored within defined recovery objectives. It demonstrates that disaster recovery plans and the supporting technical controls work as intended, as required under SOC 2 CC9.1 and, where the Availability category is in scope, A1.3.

Step-by-Step Process

  1. Define recovery objectives

    The Engineering Lead reviews and confirms Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for all in-scope systems. These objectives are documented and approved to set clear success criteria for the test.

    Role: Engineering Lead

  2. Identify systems and dependencies

    The Engineering Lead creates an inventory of critical systems, data stores, and third-party dependencies included in the disaster recovery test. The output is a scoped system list aligned to SOC 2 in-scope services.

    Role: Engineering Lead

  3. Select disaster scenario

    The Engineering Lead selects a realistic disaster scenario such as region failure, data corruption, or infrastructure loss. The selected scenario and assumptions are documented in the test plan.

    Role: Engineering Lead

  4. Prepare recovery procedures

    Engineering prepares and reviews recovery runbooks, infrastructure-as-code, and access permissions required to execute the test. Any gaps or outdated steps are corrected before testing begins.

    Role: Engineering Lead

  5. Execute disaster recovery test

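    Engineering executes the recovery steps according to the approved plan, restoring systems in a non-production or approved test environment. Start and end times are recorded to measure actual recovery time against the RTO, and the timestamp of the most recent data present after the restore is recorded to measure data loss against the RPO.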

    Role: Engineering Lead

  6. Validate system restoration

    Engineering verifies that applications, data integrity, monitoring, and security controls are fully functional after recovery. Validation results are documented with screenshots or logs.

    Role: Engineering Lead

  7. Document results and gaps

    The Engineering Lead documents test results, deviations from objectives, and root causes of any failures. A formal test report is created and stored in the compliance repository.

    Role: Engineering Lead

  8. Track remediation actions

    Engineering creates remediation tickets for identified gaps and tracks them to closure. Evidence of fixes is linked back to the disaster recovery test report.

    Role: Engineering Lead
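
The comparison between the objectives from step 1 and the timestamps from step 5 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the class names, field names, and the example system "orders-db" are hypothetical, and the dates are made-up values. It assumes actual recovery time is measured from the simulated outage to validated restoration, and actual data loss from the newest recovered record to the simulated outage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RecoveryObjectives:
    """Approved targets for one in-scope system (step 1)."""
    system: str
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost data


@dataclass(frozen=True)
class MeasuredTimes:
    """Timestamps captured while executing the test (steps 5 and 6)."""
    outage_declared: datetime      # simulated disaster begins
    service_restored: datetime     # restoration validated
    last_recovered_data: datetime  # newest record present after the restore


def evaluate(objectives: RecoveryObjectives, times: MeasuredTimes) -> dict:
    """Compare measured recovery time and data loss with the objectives."""
    actual_rto = times.service_restored - times.outage_declared
    actual_rpo = times.outage_declared - times.last_recovered_data
    return {
        "system": objectives.system,
        "actual_rto_minutes": actual_rto.total_seconds() / 60,
        "actual_rpo_minutes": actual_rpo.total_seconds() / 60,
        "rto_met": actual_rto <= objectives.rto,
        "rpo_met": actual_rpo <= objectives.rpo,
    }


if __name__ == "__main__":
    utc = timezone.utc
    # Illustrative values only, not results from a real test.
    objectives = RecoveryObjectives("orders-db", timedelta(hours=4), timedelta(minutes=15))
    times = MeasuredTimes(
        outage_declared=datetime(2024, 5, 14, 9, 0, tzinfo=utc),
        service_restored=datetime(2024, 5, 14, 11, 30, tzinfo=utc),
        last_recovered_data=datetime(2024, 5, 14, 8, 40, tzinfo=utc),
    )
    print(evaluate(objectives, times))
```

With these illustrative values the RTO check passes (150 minutes against a 240-minute objective) and the RPO check fails (20 minutes against a 15-minute objective), which is the kind of deviation step 7 would document.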

What You Need Before Starting

  • Approved disaster recovery policy and plan
  • List of SOC 2 in-scope systems and services
  • Access to AWS accounts and recovery environments
  • Current recovery runbooks and Terraform code repositories

Evidence Your Auditor Expects

  • Dated and approved annual disaster recovery test plan (PDF or doc with approval timestamp)
  • Disaster recovery test execution report with start/end times and results (dated)
  • Screenshots or logs showing system restoration timestamps from the test date
  • Remediation tickets with creation and closure dates linked to test findings

How This Looks In Your Tools

AWS

Log in to the AWS Management Console and navigate to the affected service (e.g., EC2 > Instances or RDS > Databases). Initiate recovery actions such as launching instances from AMIs or restoring databases from snapshots, noting the time each action is started and completed.
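
If the same recovery actions are scripted rather than clicked through, the start and completion times can be captured automatically. The sketch below is one possible approach, assuming boto3 is installed and credentials are configured. The helper names timed_action and restore_rds_from_snapshot are hypothetical, as are any identifiers passed to them. The boto3 calls used are restore_db_instance_from_db_snapshot and the db_instance_available waiter.

```python
from datetime import datetime, timezone
from typing import Any, Callable, List


def timed_action(name: str, action: Callable[[], Any], log: List[dict]) -> Any:
    """Run one recovery action and append its start and finish times to log."""
    started = datetime.now(timezone.utc)
    status = "failed"
    try:
        result = action()
        status = "completed"
        return result
    finally:
        # Recorded even when the action raises, so failures are timestamped too.
        log.append({
            "action": name,
            "started_utc": started.isoformat(),
            "finished_utc": datetime.now(timezone.utc).isoformat(),
            "status": status,
        })


def restore_rds_from_snapshot(instance_id: str, snapshot_id: str) -> None:
    """Restore a database from a snapshot and block until it is available."""
    import boto3  # imported lazily so timed_action has no AWS dependency

    rds = boto3.client("rds")
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )
    # The restore call returns immediately; wait so the finish time is meaningful.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=instance_id)


# Example call (hypothetical identifiers):
# log = []
# timed_action(
#     "restore orders-db",
#     lambda: restore_rds_from_snapshot("orders-db-dr-test", "orders-db-snapshot"),
#     log,
# )
```

The resulting log list can be written to a file and attached to the test report alongside any console screenshots.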

For region-level testing, go to Route 53 > Hosted Zones and update failover routing records to point to the recovery region. Capture screenshots of the routing change and health check status.
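
The routing change can also be expressed as a Route 53 change batch, which doubles as a reviewable record of exactly what was changed. The following is a sketch under stated assumptions: it assumes the failover record is a non-alias CNAME with a SetIdentifier, and every argument value is supplied by the caller. The function names are hypothetical; the boto3 call is change_resource_record_sets. An existing setup may use alias records or require a health check on the primary record, so treat the record shape as something to adapt.

```python
import json
from typing import Optional


def failover_change_batch(
    record_name: str,
    recovery_endpoint: str,
    set_identifier: str,
    health_check_id: Optional[str] = None,
    ttl: int = 60,
) -> dict:
    """Build a change batch that points the primary failover record at the recovery region."""
    record = {
        "Name": record_name,
        "Type": "CNAME",
        "SetIdentifier": set_identifier,
        "Failover": "PRIMARY",
        "TTL": ttl,
        "ResourceRecords": [{"Value": recovery_endpoint}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {
        "Comment": "DR test: route primary traffic to the recovery region",
        "Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}],
    }


def evidence_json(change_batch: dict) -> str:
    """Serialize the change batch so it can be stored with the test evidence."""
    return json.dumps(change_batch, indent=2, sort_keys=True)


def submit_change(hosted_zone_id: str, change_batch: dict) -> dict:
    """Submit the change and return Route 53's ChangeInfo (Id, Status, SubmittedAt)."""
    import boto3  # imported lazily; building the batch needs no AWS access

    response = boto3.client("route53").change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=change_batch,
    )
    return response["ChangeInfo"]
```

Saving the output of evidence_json together with the returned ChangeInfo gives a timestamped record of the routing change to accompany the screenshots.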

After restoration, navigate to CloudWatch > Metrics and CloudWatch > Logs to verify system health and application logs. Export relevant logs with timestamps as evidence of successful recovery.
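
Log export for the test window can be scripted as well. This sketch assumes boto3 and uses the CloudWatch Logs filter_log_events paginator, which takes its time bounds as milliseconds since the Unix epoch. The function names and the output format (one JSON object per line) are illustrative choices, not a required evidence format.

```python
import json
from datetime import datetime, timezone


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to epoch milliseconds."""
    if moment.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid ambiguous evidence")
    return int(moment.timestamp() * 1000)


def event_to_evidence_line(event: dict) -> str:
    """Render one log event as a JSON line with a readable UTC timestamp."""
    stamp = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)
    return json.dumps({
        "time_utc": stamp.isoformat(),
        "stream": event.get("logStreamName"),
        "message": event["message"].rstrip(),
    })


def export_logs(log_group: str, start: datetime, end: datetime, out_path: str) -> int:
    """Write all events in [start, end] from one log group to out_path; return the count."""
    import boto3  # imported lazily; the helpers above need no AWS access

    paginator = boto3.client("logs").get_paginator("filter_log_events")
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        pages = paginator.paginate(
            logGroupName=log_group,
            startTime=to_epoch_ms(start),
            endTime=to_epoch_ms(end),
        )
        for page in pages:
            for event in page["events"]:
                out.write(event_to_evidence_line(event) + "\n")
                count += 1
    return count
```

Passing the outage and restoration timestamps from the test as start and end keeps the exported evidence scoped to the test date.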

Terraform

Open the Terraform repository containing disaster recovery or multi-region configurations. Review the recovery-related modules and variables to ensure they reflect the current environment.

From a secured terminal, run terraform init followed by terraform plan using the disaster recovery workspace or variables. Save the plan output with a timestamp.

Execute terraform apply to provision or restore infrastructure as part of the test, then save the apply output logs. Commit any required changes or tag the repository to reference the test execution date.
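
The init, plan, and apply sequence above can be wrapped so that every command's output is saved under a timestamped name. The sketch below assumes the terraform CLI is on the PATH and that a workspace for disaster recovery already exists. The workspace name, directory paths, and function names are hypothetical. It applies the saved plan file rather than re-planning, so the applied changes match the plan output kept as evidence.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def dr_commands(workspace: str, plan_file: str) -> List[List[str]]:
    """Terraform commands for the test, in execution order."""
    return [
        ["terraform", "init", "-input=false"],
        ["terraform", "workspace", "select", workspace],
        ["terraform", "plan", "-input=false", "-no-color", f"-out={plan_file}"],
        # Applying the saved plan file runs exactly what the plan step showed.
        ["terraform", "apply", "-input=false", "-no-color", plan_file],
    ]


def run_and_save(cmd: List[str], repo_dir: Path, evidence_dir: Path) -> Path:
    """Run one command in repo_dir and save its output under a UTC-stamped name."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    log_path = evidence_dir / f"{stamp}-terraform-{cmd[1]}.log"
    result = subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)
    log_path.write_text(result.stdout + result.stderr, encoding="utf-8")
    result.check_returncode()  # stop the run on the first failing command
    return log_path


def run_dr_terraform(repo_dir: Path, evidence_dir: Path, workspace: str) -> List[Path]:
    """Execute the full sequence and return the paths of the saved logs."""
    evidence_dir.mkdir(parents=True, exist_ok=True)
    # Absolute path, because the commands run with repo_dir as working directory.
    plan_file = str(evidence_dir.resolve() / "dr-test.tfplan")
    return [
        run_and_save(cmd, repo_dir, evidence_dir)
        for cmd in dr_commands(workspace, plan_file)
    ]
```

Saved plan files and logs can contain sensitive values, so the evidence directory should have the same access restrictions as the rest of the compliance repository.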

Runbook

Open the approved disaster recovery runbook in the documentation system (e.g., Confluence or Google Docs). Confirm the version number and last updated date before starting the test.

Follow each recovery step in sequence, checking off completed actions and recording actual completion times directly in the runbook or an attached test log.

After the test, update the runbook with any corrected steps or clarifications identified during execution, and record the revision date and approver.

Common Audit Findings

  • No evidence of annual DR testing

    This occurs when tests are performed but not formally documented or retained. Prevent this by producing a dated test report and storing it in a centralized compliance repository.

  • RTO and RPO not measured

    Teams often execute recovery without tracking actual recovery times or data loss. Record start and completion timestamps, along with the timestamp of the most recent recovered data, to demonstrate alignment with defined objectives.

  • Runbooks outdated or unapproved

    Recovery documentation may not reflect current systems due to environment changes. Require runbook review and approval as part of the testing process.

  • Identified gaps not remediated

    Test reports list findings, but no follow-up actions are recorded against them. Create remediation tickets and track them to closure with documented evidence.
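
The last finding is easy to check for before the audit. As a rough sketch, assuming findings and tickets can be exported from the test report and the ticketing system into simple records, the check below lists findings with no ticket, tickets still open, and tickets closed without linked evidence. The Ticket fields and the function name are hypothetical and would need mapping to the real ticket export.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Ticket:
    ticket_id: str
    finding_id: str  # finding in the DR test report that this ticket addresses
    created: date
    closed: Optional[date] = None
    evidence_link: Optional[str] = None  # link to proof of the fix


def remediation_gaps(finding_ids: List[str], tickets: List[Ticket]) -> Dict[str, List[str]]:
    """Return the items an auditor would flag as missing follow-up."""
    covered = {t.finding_id for t in tickets}
    return {
        "findings_without_ticket": [f for f in finding_ids if f not in covered],
        "open_tickets": [t.ticket_id for t in tickets if t.closed is None],
        "closed_without_evidence": [
            t.ticket_id
            for t in tickets
            if t.closed is not None and not t.evidence_link
        ],
    }
```

An empty result in all three lists means every finding has a ticket, every ticket is closed, and every closure is linked to evidence.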

Key Roles

Engineering Lead