Overview
Disaster Recovery Testing is the formal process of simulating system outages to verify that critical services can be restored within defined recovery objectives. It provides evidence that disaster recovery plans and technical controls operate effectively, as required under SOC 2 CC9.1.
Step-by-Step Process
Define recovery objectives
The Engineering Lead reviews and confirms Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for all in-scope systems. These objectives are documented and approved to set clear success criteria for the test.
Role: Engineering Lead
Identify systems and dependencies
The Engineering Lead creates an inventory of critical systems, data stores, and third-party dependencies included in the disaster recovery test. The output is a scoped system list aligned to SOC 2 in-scope services.
Role: Engineering Lead
Select disaster scenario
The Engineering Lead selects a realistic disaster scenario such as region failure, data corruption, or infrastructure loss. The selected scenario and assumptions are documented in the test plan.
Role: Engineering Lead
Prepare recovery procedures
Engineering prepares and reviews recovery runbooks, infrastructure-as-code, and access permissions required to execute the test. Any gaps or outdated steps are corrected before testing begins.
Role: Engineering Lead
Execute disaster recovery test
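Engineering executes the recovery steps according to the approved plan, restoring systems in a non-production or approved test environment. Start and end times are recorded to measure actual recovery time against the RTO, and the timestamp of the most recent data present after restoration is recorded to measure actual data loss against the RPO.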
Role: Engineering Lead
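The comparison of measured results against the approved objectives can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the class names, the system name "orders-db", and the target values are hypothetical. It treats actual RTO as the time from simulated outage to confirmed restoration, and actual RPO as the gap between the outage and the newest data that survived the restore.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RecoveryObjective:
    """Approved targets for one in-scope system (values are illustrative)."""
    system: str
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost data


@dataclass
class RecoveryMeasurement:
    """Timestamps captured during the test, all in UTC."""
    outage_start: datetime           # when the simulated disaster began
    service_restored: datetime       # when validation confirmed the service worked
    last_recoverable_data: datetime  # newest data point present after restore


def evaluate(objective: RecoveryObjective, measured: RecoveryMeasurement) -> dict:
    """Compare measured recovery time and data loss with the objectives."""
    actual_rto = measured.service_restored - measured.outage_start
    actual_rpo = measured.outage_start - measured.last_recoverable_data
    return {
        "system": objective.system,
        "actual_rto_minutes": actual_rto.total_seconds() / 60,
        "actual_rpo_minutes": actual_rpo.total_seconds() / 60,
        "rto_met": actual_rto <= objective.rto,
        "rpo_met": actual_rpo <= objective.rpo,
    }


# Hypothetical test run: 4-hour RTO and 15-minute RPO for one system.
objective = RecoveryObjective("orders-db", rto=timedelta(hours=4), rpo=timedelta(minutes=15))
measured = RecoveryMeasurement(
    outage_start=datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc),
    service_restored=datetime(2024, 1, 10, 11, 30, tzinfo=timezone.utc),
    last_recoverable_data=datetime(2024, 1, 10, 8, 50, tzinfo=timezone.utc),
)
result = evaluate(objective, measured)
print(result)
```

The resulting dictionary can be pasted into the test report so that each objective has a measured value and a pass/fail flag next to it.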
Validate system restoration
Engineering verifies that applications, data integrity, monitoring, and security controls are fully functional after recovery. Validation results are documented with screenshots or logs.
Role: Engineering Lead
Document results and gaps
The Engineering Lead documents test results, deviations from objectives, and root causes of any failures. A formal test report is created and stored in the compliance repository.
Role: Engineering Lead
Track remediation actions
Engineering creates remediation tickets for identified gaps and tracks them to closure. Evidence of fixes is linked back to the disaster recovery test report.
Role: Engineering Lead
What You Need Before Starting
- Approved disaster recovery policy and plan
- List of SOC 2 in-scope systems and services
- Access to AWS accounts and recovery environments
- Current recovery runbooks and Terraform code repositories
Evidence Your Auditor Expects
- Annual disaster recovery test plan dated and approved (PDF or doc with approval timestamp)
- Disaster recovery test execution report with start/end times and results (dated)
- Screenshots or logs showing system restoration timestamps from the test date
- Remediation tickets with creation and closure dates linked to test findings
How This Looks In Your Tools
AWS
Log in to the AWS Management Console and navigate to the affected service (e.g., EC2 > Instances or RDS > Databases). Initiate recovery actions such as launching instances from AMIs or restoring databases from snapshots, noting the time each action is started and completed.
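The same restore can be scripted so that start and completion times are captured automatically rather than noted by hand. The sketch below assumes the boto3 SDK and an RDS snapshot restore. The function name and the identifiers in the usage note are hypothetical, and the client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
from datetime import datetime, timezone


def restore_rds_from_snapshot(rds_client, snapshot_id, target_instance_id):
    """Restore an RDS instance from a snapshot and record timing evidence.

    rds_client is expected to behave like boto3.client("rds").
    """
    started_at = datetime.now(timezone.utc)

    # Kick off the restore into a new instance used only for the test.
    rds_client.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=target_instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )

    # Block until the restored instance reports the "available" status.
    waiter = rds_client.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=target_instance_id)

    completed_at = datetime.now(timezone.utc)
    return {
        "action": "rds_restore_from_snapshot",
        "snapshot_id": snapshot_id,
        "target_instance_id": target_instance_id,
        "started_at": started_at.isoformat(),
        "completed_at": completed_at.isoformat(),
        "duration_seconds": (completed_at - started_at).total_seconds(),
    }
```

A call might look like restore_rds_from_snapshot(boto3.client("rds", region_name="us-west-2"), "dr-test-snapshot", "orders-db-dr-test"), with the region and both identifiers replaced by values from the test plan. The returned record covers only the database restore, so application-level validation time still needs to be added before comparing against the RTO.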
For region-level testing, go to Route 53 > Hosted Zones and update failover routing records to point to the recovery region. Capture screenshots of the routing change and health check status.
After restoration, navigate to CloudWatch > Metrics and CloudWatch > Logs to verify system health and application logs. Export relevant logs with timestamps as evidence of successful recovery.
Terraform
Open the Terraform repository containing disaster recovery or multi-region configurations. Review the recovery-related modules and variables to ensure they reflect the current environment.
From a secured terminal, run terraform init followed by terraform plan using the disaster recovery workspace or variables. Save the plan output with a timestamp.
Execute terraform apply to provision or restore infrastructure as part of the test, then save the apply output logs. Commit any required changes or tag the repository to reference the test execution date.
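The init, plan, and apply sequence above can be wrapped in a small script that writes every command and its output to a timestamped log file. This is a sketch under stated assumptions: the workspace name, plan file name, and log directory are hypothetical, and the function names are illustrative. It relies only on standard Terraform CLI options (-input=false, -no-color, -out) and on applying a saved plan file.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def build_dr_commands(workspace, plan_file):
    """Return the Terraform commands for a DR test run, in execution order."""
    return [
        ["terraform", "init", "-input=false"],
        ["terraform", "workspace", "select", workspace],
        ["terraform", "plan", "-input=false", "-no-color", f"-out={plan_file}"],
        # Applying a saved plan file runs exactly what was planned.
        ["terraform", "apply", "-input=false", "-no-color", plan_file],
    ]


def run_and_log(commands, repo_dir, log_dir):
    """Run each command in repo_dir and save the output to a timestamped log."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    log_path = log_dir / f"dr-test-{stamp}.log"

    with log_path.open("w", encoding="utf-8") as log:
        for cmd in commands:
            now = datetime.now(timezone.utc).isoformat()
            log.write(f"[{now}] $ {' '.join(cmd)}\n")
            result = subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)
            log.write(result.stdout)
            log.write(result.stderr)
            if result.returncode != 0:
                # Stop at the first failure so later steps do not run blindly.
                log.write(f"Command failed with exit code {result.returncode}\n")
                break
    return log_path
```

For example, run_and_log(build_dr_commands("dr-test", "dr-test.tfplan"), "path/to/repo", "evidence/terraform") would produce a single log file suitable for attaching to the test report. If the plan must be reviewed before it is applied, split the command list and run the apply step separately after approval.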
Runbook
Open the approved disaster recovery runbook in the documentation system (e.g., Confluence or Google Docs). Confirm the version number and last updated date before starting the test.
Follow each recovery step in sequence, checking off completed actions and recording actual completion times directly in the runbook or an attached test log.
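If the runbook tool has no built-in timestamps, an attached test log can be kept with a short script. The following is a hypothetical sketch: the class name, field names, and step descriptions are illustrative and not part of any runbook product. It records who completed each step and when, then exports the log as CSV for the evidence folder.

```python
import csv
import io
from datetime import datetime, timezone


class RunbookLog:
    """Collects completion records for runbook steps during a DR test."""

    FIELDS = ["runbook_version", "step", "description", "operator", "completed_at", "notes"]

    def __init__(self, runbook_version):
        self.runbook_version = runbook_version
        self.entries = []

    def complete(self, step, description, operator, notes=""):
        """Record that a step finished now (UTC)."""
        self.entries.append({
            "runbook_version": self.runbook_version,
            "step": step,
            "description": description,
            "operator": operator,
            "completed_at": datetime.now(timezone.utc).isoformat(),
            "notes": notes,
        })

    def to_csv(self):
        """Return all recorded steps as CSV text with a header row."""
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=self.FIELDS)
        writer.writeheader()
        writer.writerows(self.entries)
        return buffer.getvalue()


# Hypothetical usage during a test.
log = RunbookLog("v3.2")
log.complete(1, "Restore database from snapshot", "alice")
log.complete(2, "Update DNS failover record", "bob", notes="Health check green")
print(log.to_csv())
```

Recording the runbook version on every row ties the log to the exact document revision that was confirmed before the test started.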
After the test, update the runbook with any corrected steps or clarifications identified during execution, and record the revision date and approver.
Common Audit Findings
- No evidence of annual DR testing
  - This occurs when tests are performed but not formally documented or retained. Prevent this by producing a dated test report and storing it in a centralized compliance repository.
- RTO and RPO not measured
  - Teams often execute recovery without tracking actual recovery time or data loss. Record start and completion timestamps, plus the timestamp of the most recent data present after restoration, to demonstrate alignment with defined objectives.
- Runbooks outdated or unapproved
  - Recovery documentation may not reflect current systems due to environment changes. Require runbook review and approval as part of the testing process.
- Identified gaps not remediated
  - Auditors find test reports that list gaps with no follow-up actions. Create remediation tickets and track them to closure with documented evidence.