1. Risk Assessment During Incidents:
- Monitor and evaluate incident severity.
- Choose mitigation paths commensurate with risk.
- Make informed decisions when systems are broken.
2. Practice Makes Perfect:
- Regularly practice recovery mechanisms.
- Verify effectiveness through testing.
- Double down on testing for improved reliability.
3. Canary Deployments for Global Changes:
- Implement global changes incrementally.
- Utilize progressive rollout strategies.
- Prevent unintended consequences with canary deployments.
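The incremental approach above can be sketched as a staged rollout loop. This is a minimal illustration, not a prescribed implementation: `progressive_rollout`, the stage fractions, and the `apply_to_fraction`, `is_healthy`, and `rollback` callbacks are all hypothetical names standing in for whatever deployment tooling is actually in use.

```python
from typing import Callable, List, Sequence


def progressive_rollout(
    stages: Sequence[float],
    apply_to_fraction: Callable[[float], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Apply a change stage by stage; roll back at the first unhealthy check.

    Returns True if every stage passed, False if the change was rolled back.
    """
    for fraction in stages:
        apply_to_fraction(fraction)   # widen the change to this share of targets
        if not is_healthy():          # canary check before widening further
            rollback()
            return False
    return True


# Simulated run: the (hypothetical) health check starts failing at 50%.
applied: List[float] = []
ok = progressive_rollout(
    stages=[0.01, 0.10, 0.50, 1.0],
    apply_to_fraction=applied.append,
    is_healthy=lambda: applied[-1] < 0.50,
    rollback=applied.clear,
)
```

In this simulated run the change never reaches 100% of targets: the failing check at the 50% stage triggers the rollback before the final stage is applied.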
4. Backup Communication Channels:
- Establish backup channels that do not depend on the primary ones.
- Test backup communication channels thoroughly.
- Backup channels are vital when an unexpected incident affects the primary channels.
5. Graceful Degradation for Continuous Functionality:
- Design systems for continuous minimum functionality.
- Provide a consistent user experience during degraded modes.
- Construct degraded performance modes intentionally and carefully.
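One way to picture a deliberately designed degraded mode is a fallback path that returns a reduced but consistent result when a dependency fails. The sketch below assumes a hypothetical recommendations feature; `get_recommendations`, `broken_backend`, and the static fallback list are illustrative names, not part of any real system.

```python
from typing import Callable, List, Tuple


def get_recommendations(
    fetch_personalized: Callable[[str], List[str]],
    static_fallback: List[str],
    user_id: str,
) -> Tuple[List[str], bool]:
    """Return (items, degraded).

    Tries the full-featured path first. If it fails, serves a pre-built
    static list so the user still sees a working page.
    """
    try:
        return fetch_personalized(user_id), False
    except Exception:
        # Degraded mode: same response shape, reduced content.
        return list(static_fallback), True


def broken_backend(user_id: str) -> List[str]:
    """Hypothetical backend that is currently unavailable."""
    raise TimeoutError("backend unavailable")


items, degraded = get_recommendations(broken_backend, ["top-1", "top-2"], "user-42")
```

Returning the `degraded` flag alongside the items lets the caller log or display the reduced mode explicitly, which keeps the degraded path an intentional design choice and not an accident of error handling.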
6. Disaster Resilience and Recovery Testing:
- Beyond unit and integration testing, prioritize resilience and recovery testing.
- Simulate extreme scenarios through tabletop exercises.
- Prepare for natural disasters or cyber-attacks.
7. Automated Mitigations for Faster Resolution:
- Automate mitigations for faster resolution.
- Respond swiftly to clear signals of failure.
- Minimize user impact through automated actions.
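A rough sketch of acting on a clear failure signal: trigger a mitigation only after an error rate stays above a threshold for several consecutive samples. The function name `auto_mitigate`, the threshold, the sample values, and the "drain traffic" action are assumptions chosen for illustration.

```python
from typing import Callable, Iterable, List


def auto_mitigate(
    error_rates: Iterable[float],
    threshold: float,
    consecutive: int,
    mitigate: Callable[[], None],
) -> bool:
    """Run `mitigate` once the error rate exceeds `threshold` for
    `consecutive` samples in a row. Returns True if mitigation ran."""
    breaches = 0
    for rate in error_rates:
        # A single healthy sample resets the streak, which filters out blips.
        breaches = breaches + 1 if rate > threshold else 0
        if breaches >= consecutive:
            mitigate()
            return True
    return False


# Simulated samples: three consecutive breaches of a 10% threshold.
actions: List[str] = []
triggered = auto_mitigate(
    error_rates=[0.01, 0.20, 0.30, 0.25],
    threshold=0.10,
    consecutive=3,
    mitigate=lambda: actions.append("drain traffic"),
)
```

Requiring consecutive breaches is one simple way to make the signal "clear" before automation acts; a real system would tune both the threshold and the window to its own traffic.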
8. Frequent Rollouts for Safety:
- Conduct frequent rollouts with thorough testing.
- Reduce surprises through consistent testing.
- Ensure safety in complex, multi-component systems.
9. Diverse Infrastructure for Resilience:
- Maintain diverse infrastructure to prevent total outages.
- Mitigate latent bugs with diverse network backbones.
- Diversity can be the difference between a troublesome outage and a total one.
10. Reduce MTTR with Automated Measures:
- Automate mitigating measures during network failures.
- Clear signals trigger automated mitigation.
- Save root-cause analysis for after user impact has been avoided.
11. Timely Rollouts for Critical Functions:
- Avoid long delays between rollouts, especially in complex systems.
- Frequent rollouts with proper testing prevent critical function failures.
- Maintain diverse infrastructure to identify latent bugs.