Overview
Rollback and recovery procedures are the defined steps for safely reverting system changes and restoring services when a deployment causes errors or instability. Following them helps maintain system availability and integrity and supports SOC 2 CC8 change management requirements.
Step-by-Step Process
Detect failed or risky change
Engineering or on-call staff monitor alerts, error rates, and user reports to identify failed or degraded changes. Once identified, the incident and suspected change are logged in the incident tracking system. The output is a documented trigger for rollback consideration.
Role: Engineering Lead
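The detection step is easier to apply consistently when the trigger is an explicit threshold rather than a judgment call. The following is a minimal sketch, assuming an error-rate metric sampled before and after the deployment. The names `ErrorRateSample` and `should_flag_for_rollback`, the 5% ceiling, and the 3x ratio are illustrative assumptions, not values defined by this procedure.

```python
from dataclasses import dataclass


@dataclass
class ErrorRateSample:
    """Requests and errors observed in one monitoring window (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def rate(self) -> float:
        # Treat an empty window as 0% errors rather than dividing by zero.
        return self.errors / self.requests if self.requests else 0.0


def should_flag_for_rollback(
    before: ErrorRateSample,
    after: ErrorRateSample,
    absolute_ceiling: float = 0.05,  # assumed: flag if more than 5% of requests fail
    relative_increase: float = 3.0,  # assumed: flag if the error rate at least triples
) -> bool:
    """Return True when the post-deployment window breaches either threshold."""
    if after.rate > absolute_ceiling:
        return True
    # Apply the ratio test only when there is a non-zero baseline to compare against.
    return before.rate > 0 and after.rate >= before.rate * relative_increase


baseline = ErrorRateSample(requests=10_000, errors=20)     # 0.2% before the change
post_deploy = ErrorRateSample(requests=10_000, errors=90)  # 0.9% after the change
flagged = should_flag_for_rollback(baseline, post_deploy)
```

A flagged result is only the trigger for rollback consideration; the go/no-go decision still belongs to the assessment step.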
Assess rollback impact
The Engineering Lead evaluates the scope of the change, affected systems, and customer impact to confirm rollback is the appropriate response. Dependencies, data implications, and downtime risks are reviewed. The output is a go/no-go decision for rollback.
Role: Engineering Lead
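The assessment above can be captured as a small checklist so that the go/no-go decision and its reasons land in the ticket in a consistent form. This is a sketch only: the field names and the three blocking conditions are assumptions chosen to mirror the dependencies, data implications, and downtime risks named in the step, and a real assessment may weigh more factors.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RollbackAssessment:
    """Inputs the Engineering Lead reviews; all field names are hypothetical."""
    data_changes_reversible: bool    # can schema or data changes be undone safely?
    dependencies_compatible: bool    # do dependent systems work with the prior version?
    rollback_downtime_minutes: int   # downtime the rollback itself is expected to cause
    tolerable_downtime_minutes: int  # downtime the business can accept


def rollback_decision(a: RollbackAssessment) -> Tuple[str, List[str]]:
    """Return ("go" or "no-go", reasons) so the reasoning can be copied into the ticket."""
    blockers = []
    if not a.data_changes_reversible:
        blockers.append("data changes cannot be reverted safely")
    if not a.dependencies_compatible:
        blockers.append("dependent systems are incompatible with the prior version")
    if a.rollback_downtime_minutes > a.tolerable_downtime_minutes:
        blockers.append("rollback downtime exceeds the tolerable window")
    if blockers:
        return "no-go", blockers
    return "go", ["no blocking conditions identified"]


decision, reasons = rollback_decision(
    RollbackAssessment(
        data_changes_reversible=True,
        dependencies_compatible=True,
        rollback_downtime_minutes=5,
        tolerable_downtime_minutes=15,
    )
)
```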
Initiate rollback execution
An authorized engineer performs the rollback using approved deployment or configuration tools. The rollback targets the last known stable version or configuration. The output is a system state reverted to a stable baseline.
Role: DevOps Engineer
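The "last known stable version" is not always the release immediately before the failed one. A sketch of selecting the target from deployment history follows; the `Release` shape and the `healthy` flag are assumptions about what the deployment record contains, and the versions and dates are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Release:
    version: str
    deployed_at: str  # ISO 8601 timestamp; all entries must use the same format and offset
    healthy: bool     # True if the release passed post-deployment validation


def last_known_stable(history: List[Release], failed_version: str) -> Optional[Release]:
    """Return the most recent healthy release deployed before the failed one."""
    # Same-format ISO 8601 strings sort chronologically as plain text.
    ordered = sorted(history, key=lambda r: r.deployed_at)
    versions = [r.version for r in ordered]
    if failed_version not in versions:
        return None
    failed_index = versions.index(failed_version)
    # Walk backwards from the failed release until a healthy one is found.
    for release in reversed(ordered[:failed_index]):
        if release.healthy:
            return release
    return None


history = [
    Release("1.4.0", "2024-03-01T09:00:00+00:00", healthy=True),
    Release("1.4.1", "2024-03-08T09:00:00+00:00", healthy=False),
    Release("1.5.0", "2024-03-15T09:00:00+00:00", healthy=False),
]
target = last_known_stable(history, failed_version="1.5.0")
```

In this example the target is 1.4.0, because 1.4.1 was itself never validated as healthy.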
Validate system recovery
Post-rollback, engineering validates application health using monitoring dashboards, smoke tests, and key business transactions. Any residual issues are addressed immediately. The output is confirmation that services are operating normally.
Role: Engineering Lead
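Recovery validation produces better evidence when the checks are named and their results are timestamped. The sketch below runs a set of caller-supplied checks and returns a small report; the check names and the stub lambdas are placeholders for real health probes, smoke tests, and business transactions.

```python
from datetime import datetime, timezone
from typing import Callable, Dict


def run_smoke_checks(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each named check and return a timestamped pass/fail report."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is recorded as a failure, not allowed to abort the run.
            results[name] = False
    return {
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
        # An empty check list must not count as a confirmed recovery.
        "recovered": bool(results) and all(results.values()),
    }


report = run_smoke_checks(
    {
        "health_endpoint": lambda: True,       # placeholder for an HTTP health probe
        "login_flow": lambda: True,            # placeholder for a scripted smoke test
        "checkout_transaction": lambda: True,  # placeholder for a key business transaction
    }
)
```

The returned report can be attached to the incident record alongside dashboard screenshots.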
Document rollback activity
Details of the rollback, including timestamps, versions, root cause (or suspected cause if the investigation is still open), and approver, are recorded in the change or incident record. Supporting logs and screenshots are attached. The output is a complete audit trail for the rollback event.
Role: DevOps Engineer
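A structured record makes it harder to close an incident with a field missing. The following sketch models the details listed above as a dataclass and reports any empty fields; the field names, the incident ID, and the sample values are hypothetical, and the real record would live in the change or incident system.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RollbackRecord:
    incident_id: str
    detected_at: str      # ISO 8601 timestamps
    rolled_back_at: str
    from_version: str     # version that was rolled back
    to_version: str       # stable version restored
    root_cause: str
    approver: str
    executed_by: str
    attachments: List[str] = field(default_factory=list)  # logs and screenshots

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if value in ("", [], None)]


record = RollbackRecord(
    incident_id="INC-0000",  # placeholder ID
    detected_at="2024-03-15T10:02:00+00:00",
    rolled_back_at="2024-03-15T10:12:00+00:00",
    from_version="1.5.0",
    to_version="1.4.0",
    root_cause="Suspected connection pool misconfiguration",
    approver="engineering.lead@example.com",
    executed_by="devops.engineer@example.com",
    attachments=["rollback.log", "dashboard-after.png"],
)
missing = record.missing_fields()
record_json = json.dumps(asdict(record), indent=2)  # ready to paste into the ticket
```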
Review and improve controls
Engineering reviews the rollback event to identify preventive improvements such as testing gaps or deployment safeguards. Action items are tracked to completion. The output is updated change management or deployment practices.
Role: Engineering Lead
What You Need Before Starting
- Approved change or deployment record with version details
- Access to production systems and deployment tools
- Monitoring and alerting dashboards (e.g., logs, metrics)
- Incident or ticketing system access
Evidence Your Auditor Expects
- Incident or change ticket showing rollback decision with date and time
- Deployment tool logs showing rollback execution timestamp and version ID
- Monitoring dashboard screenshots confirming recovery with date/time visible
- Post-incident review document referencing the rollback event and date
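Because each evidence item above carries a date and time, a quick consistency check before the audit can catch gaps or timestamps that contradict each other. This is a sketch under assumptions: the four keys are hypothetical labels for the evidence items listed, and the timestamps are illustrative.

```python
from datetime import datetime
from typing import Dict, List

# Expected chronological order of the evidence items (labels are hypothetical).
EVIDENCE_ORDER = [
    "rollback_decision",
    "rollback_execution",
    "recovery_confirmed",
    "post_incident_review",
]


def evidence_gaps(timestamps: Dict[str, str]) -> List[str]:
    """Report missing evidence items and any timestamps that are out of sequence."""
    problems = [f"missing: {key}" for key in EVIDENCE_ORDER if key not in timestamps]
    # Timestamps must all include a UTC offset so they can be compared with each other.
    present = [
        (key, datetime.fromisoformat(timestamps[key]))
        for key in EVIDENCE_ORDER
        if key in timestamps
    ]
    for (earlier, t1), (later, t2) in zip(present, present[1:]):
        if t2 < t1:
            problems.append(f"out of order: {later} precedes {earlier}")
    return problems


gaps = evidence_gaps(
    {
        "rollback_decision": "2024-03-15T10:05:00+00:00",
        "rollback_execution": "2024-03-15T10:12:00+00:00",
        "recovery_confirmed": "2024-03-15T10:30:00+00:00",
        "post_incident_review": "2024-03-18T15:00:00+00:00",
    }
)
```

An empty list means every item is present and the sequence is plausible; it does not replace review of the evidence itself.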
How This Looks In Your Tools
Kubernetes
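Access the Kubernetes cluster using kubectl with appropriate credentials. Run kubectl get deployments -n <namespace> to identify the affected deployment, and kubectl rollout history deployment/<deployment-name> -n <namespace> to confirm which revision is the last known stable one. Then execute kubectl rollout undo deployment/<deployment-name> -n <namespace> to revert to the previous revision (ReplicaSet), or add --to-revision=<revision> if the stable revision is further back.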
Verify the rollback by running kubectl rollout status deployment/<deployment-name> -n <namespace> and checking pod health using kubectl get pods -n <namespace>. Capture terminal output and cluster dashboard screenshots (e.g., Kubernetes Dashboard > Workloads > Deployments) showing the successful rollback.
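The Kubernetes commands can be wrapped so that they run in a fixed order and their output is saved as evidence. The sketch below only builds the command lists and provides an optional runner; the deployment name `checkout-api`, the namespace `prod`, the revision number, and the 120-second timeout are placeholder assumptions.

```python
import shlex
import subprocess
from typing import List, Optional


def rollback_commands(
    deployment: str, namespace: str, to_revision: Optional[int] = None
) -> List[List[str]]:
    """Build the kubectl commands for inspecting, rolling back, and verifying a deployment."""
    target = f"deployment/{deployment}"
    undo = ["kubectl", "rollout", "undo", target, "-n", namespace]
    if to_revision is not None:
        # Without this flag kubectl reverts to the immediately previous revision.
        undo.append(f"--to-revision={to_revision}")
    return [
        ["kubectl", "rollout", "history", target, "-n", namespace],
        undo,
        ["kubectl", "rollout", "status", target, "-n", namespace, "--timeout=120s"],
        ["kubectl", "get", "pods", "-n", namespace],
    ]


def run_and_log(commands: List[List[str]], log_path: str) -> None:
    """Run each command in order, appending the command line and its output to a log file."""
    with open(log_path, "a", encoding="utf-8") as log:
        for cmd in commands:
            log.write(f"$ {shlex.join(cmd)}\n")
            done = subprocess.run(cmd, capture_output=True, text=True, check=False)
            log.write(done.stdout + done.stderr)
            if done.returncode != 0:
                raise RuntimeError(f"command failed: {shlex.join(cmd)}")


# Dry run: list the commands for review before anything is executed.
dry_run = [shlex.join(cmd) for cmd in rollback_commands("checkout-api", "prod", to_revision=4)]
```

Calling `run_and_log(rollback_commands(...), "rollback.log")` would execute the commands against whichever cluster the current kubectl context points to, so the context should be confirmed first.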
AWS CodeDeploy
Log in to the AWS Management Console and navigate to CodeDeploy > Applications > select the application > Deployments. Identify the failed or problematic deployment and select it.
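If the deployment is still in progress, choose “Stop and roll back deployment.” If it has already completed, redeploy the last known good revision as a new deployment instead, because CodeDeploy performs rollbacks by redeploying a previous revision. In either case, monitor the resulting rollback deployment until its status is “Succeeded,” and capture deployment history screenshots with timestamps.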
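The same CodeDeploy rollback can be scripted through the AWS SDK. The sketch below assumes a boto3 CodeDeploy client is passed in and uses its `stop_deployment` and `get_deployment` operations; the deployment ID, polling interval, and helper names are placeholders, and the `rollbackInfo` lookup reflects my understanding of the response shape and should be checked against the current API reference.

```python
import time

# Deployment statuses after which polling can stop.
TERMINAL_STATES = {"Succeeded", "Failed", "Stopped"}


def stop_and_roll_back(client, deployment_id):
    """Stop an in-progress deployment and request a rollback of updated instances."""
    return client.stop_deployment(deploymentId=deployment_id, autoRollbackEnabled=True)


def rollback_deployment_id(client, deployment_id):
    """Return the ID of the rollback deployment, if CodeDeploy reports one."""
    info = client.get_deployment(deploymentId=deployment_id)["deploymentInfo"]
    return info.get("rollbackInfo", {}).get("rollbackDeploymentId")


def wait_for_terminal_status(client, deployment_id, poll_seconds=15, max_polls=40):
    """Poll a deployment until it reaches a terminal status and return that status."""
    for _ in range(max_polls):
        info = client.get_deployment(deploymentId=deployment_id)["deploymentInfo"]
        if info["status"] in TERMINAL_STATES:
            return info["status"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"deployment {deployment_id} did not finish within the polling window")


# Usage (requires boto3 and AWS credentials; "d-EXAMPLE" is a placeholder ID):
#   import boto3
#   client = boto3.client("codedeploy")
#   stop_and_roll_back(client, "d-EXAMPLE")
#   rollback_id = rollback_deployment_id(client, "d-EXAMPLE")
#   print(wait_for_terminal_status(client, rollback_id))
```

Passing the client in as a parameter keeps the helpers testable without AWS access.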
Feature flags
Log in to the feature flag management tool (e.g., LaunchDarkly) and navigate to the Flags dashboard. Select the impacted feature flag and toggle it to the off state or reduce rollout percentage to 0%.
Confirm application behavior via monitoring tools and user testing. Export or screenshot the flag change audit log showing who made the change and the exact timestamp.
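Turning a flag off can also be done through the flag tool's API, which leaves the same audit log entry as the dashboard. The sketch below builds, but does not send, a request in what I understand to be LaunchDarkly's semantic patch format (`turnFlagOff` instruction and the `domain-model=launchdarkly.semanticpatch` content type); treat the endpoint and body shape as assumptions to verify against the current LaunchDarkly API reference. The token, project, flag, and environment keys are placeholders.

```python
import json
import urllib.request


def build_flag_off_request(api_token, project_key, flag_key, environment_key, comment):
    """Build a PATCH request that turns a flag off in one environment (not sent here)."""
    body = {
        "environmentKey": environment_key,
        "comment": comment,  # intended to appear in the flag's audit log
        "instructions": [{"kind": "turnFlagOff"}],
    }
    return urllib.request.Request(
        url=f"https://app.launchdarkly.com/api/v2/flags/{project_key}/{flag_key}",
        data=json.dumps(body).encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": api_token,
            "Content-Type": "application/json; domain-model=launchdarkly.semanticpatch",
        },
    )


request = build_flag_off_request(
    api_token="api-token-placeholder",
    project_key="default",
    flag_key="new-checkout",
    environment_key="production",
    comment="Rollback for incident INC-0000",
)
# To send it: urllib.request.urlopen(request)
```

Including the incident ID in the comment ties the flag change to the incident record.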
Common Audit Findings
- Rollback actions not documented
  - Teams perform rollbacks quickly but fail to record them in change or incident systems. Prevent this by requiring rollback documentation as a mandatory incident closure step.
- Unauthorized personnel performing rollbacks
  - Access controls are too broad, allowing unapproved users to execute rollbacks. Limit deployment permissions and review access quarterly.
- No evidence of recovery validation
  - Rollbacks occur without proof that systems fully recovered. Require screenshots or logs from monitoring tools showing post-rollback health.
- Inconsistent rollback methods
  - Different engineers use ad hoc rollback approaches. Standardize rollback procedures per tool and train staff annually.
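The quarterly access review mentioned under unauthorized rollbacks can be reduced to a comparison between who currently holds deployment permissions and who is approved to. This is a sketch: the user names are made up, the 92-day limit is an assumed reading of "quarterly", and the current-access list would in practice be exported from the deployment tool or identity provider.

```python
from datetime import date
from typing import Dict, Set


def access_review(
    current: Set[str],
    approved: Set[str],
    last_review: date,
    today: date,
    max_age_days: int = 92,  # assumed interpretation of "quarterly"
) -> Dict[str, object]:
    """Compare current rollback permissions against the approved list."""
    return {
        "unapproved_users": sorted(current - approved),  # access to remove or approve
        "unused_approvals": sorted(approved - current),  # approvals with no matching access
        "review_overdue": (today - last_review).days > max_age_days,
    }


findings = access_review(
    current={"alice", "bob", "contractor-1"},
    approved={"alice", "bob", "carol"},
    last_review=date(2024, 1, 10),
    today=date(2024, 5, 1),
)
```

Here `contractor-1` holds access without approval and the review is overdue, which are the conditions an auditor would flag.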