Friday 19 January 2024

SRE Insights - 11 Lessons for Tech Resilience



Google SRE Key Takeaways

1. Risk Assessment During Incidents:

- Monitor and evaluate incident severity.
- Choose mitigation paths commensurate with risk.
- Make informed decisions even when parts of the system are broken.

2. Practice Makes Perfect:

- Regularly practice recovery mechanisms.
- Verify their effectiveness through testing (see the restore-drill sketch below).
- Double down on testing for improved reliability.
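
One way to make recovery practice routine is a restore drill: restore the latest backup into a scratch directory and verify it against a checksum manifest. The backup path and manifest name below are assumptions for illustration; a minimal sketch in bash:

#!/bin/bash
# Hypothetical restore drill: unpack the latest backup into a scratch
# directory and verify file checksums against a stored manifest.
set -euo pipefail

backup="/backups/latest.tar.gz"       # assumed backup archive
manifest="/backups/latest.sha256"     # assumed sha256sum-format manifest
scratch="$(mktemp -d)"

# Restore into the scratch directory, never over live data
tar -xzf "$backup" -C "$scratch"

# Verify every file listed in the manifest; fail loudly if anything differs
if (cd "$scratch" && sha256sum --check --quiet "$manifest"); then
    echo "Restore drill passed"
else
    echo "Restore drill FAILED" >&2
    exit 1
fi

rm -rf "$scratch"

Scheduling a drill like this from cron turns "practice recovery" into something that happens every week rather than only during an emergency.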

3. Canary Deployments for Global Changes:

- Implement global changes incrementally.
- Utilize progressive rollout strategies (a batch-rollout sketch follows this list).
- Prevent unintended consequences with canary deployments.
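
For a fleet managed over SSH, a progressive rollout can be as simple as deploying to a small batch of hosts, checking a health endpoint, and only then moving on. The host names, deploy command, and health URL below are hypothetical; a minimal sketch:

#!/bin/bash
# Hypothetical batch rollout: deploy to a few hosts at a time, verify
# health, and abort on the first unhealthy canary.
set -euo pipefail

hosts=(web01 web02 web03 web04 web05 web06)   # assumed host names
batch_size=2
deploy_cmd="/opt/app/deploy.sh v2.3.1"        # assumed deploy command

for ((i = 0; i < ${#hosts[@]}; i += batch_size)); do
    batch=("${hosts[@]:i:batch_size}")
    echo "Deploying to batch: ${batch[*]}"

    for host in "${batch[@]}"; do
        ssh "$host" "$deploy_cmd"
    done

    # Give the new version time to start, then check each host's health endpoint
    sleep 30
    for host in "${batch[@]}"; do
        if ! curl -fsS "http://$host:8080/healthz" > /dev/null; then
            echo "Canary failed on $host, aborting rollout" >&2
            exit 1
        fi
    done
done
echo "Rollout complete"

The first batch acts as the canary: if it misbehaves, the remaining hosts never receive the change.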

4. Backup Communication Channels:

- Establish backup channels that do not depend on the primary infrastructure.
- Ensure thorough testing of backup communication.
- Vital during unexpected incidents affecting primary channels.

5. Graceful Degradation for Continuous Functionality:

- Design systems to keep providing a minimum level of functionality when components fail.
- Provide a consistent user experience during degraded modes.
- Construct degraded performance modes intentionally and carefully (see the sketch below).
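
A small illustration of a degraded mode at the client-facing edge: serve the last known-good cached response when the upstream service is down, instead of failing outright. The endpoint and cache path are assumptions for illustration:

#!/bin/bash
# Hypothetical degraded-mode fetch: prefer the live service, but fall back
# to the last known-good cached copy if the service is unreachable.
set -euo pipefail

url="http://api.example.internal/catalog"   # assumed upstream endpoint
cache="/var/cache/app/catalog.json"         # assumed last known-good copy

if curl -fsS --max-time 5 "$url" -o /tmp/catalog.json; then
    # Live response succeeded: refresh the cache and serve the fresh copy
    cp /tmp/catalog.json "$cache"
    cat /tmp/catalog.json
else
    # Degraded mode: stale data is usually better than no data
    echo "Upstream unavailable, serving cached copy" >&2
    cat "$cache"
fi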

6. Disaster Resilience and Recovery Testing:

- Beyond unit and integration testing, prioritize resilience and recovery.
- Simulate extreme scenarios through tabletop exercises.
- Prepare for natural disasters or cyber-attacks.

7. Automated Mitigations for Faster Resolution:

- Automate mitigations for faster resolution.
- Respond swiftly to clear signals of failure.
- Minimize user impact through automated actions (a watchdog sketch follows this list).
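
A watchdog that acts only on a repeated, unambiguous failure signal is one simple automated mitigation. The service name and health URL below are assumptions; a minimal sketch using systemd:

#!/bin/bash
# Hypothetical watchdog: restart a service only after several consecutive
# failed health checks, so a single blip does not trigger the mitigation.
set -euo pipefail

service="myapp.service"                      # assumed systemd unit
health_url="http://localhost:8080/healthz"   # assumed health endpoint
failures=0

while true; do
    if curl -fsS --max-time 5 "$health_url" > /dev/null; then
        failures=0
    else
        failures=$((failures + 1))
        echo "$(date -Is) health check failed ($failures in a row)" >&2
    fi

    # Three consecutive failures is treated as a clear signal of failure
    if (( failures >= 3 )); then
        systemctl restart "$service"
        failures=0
    fi

    sleep 10
done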

8. Frequent Rollouts for Safety:

- Conduct frequent rollouts with thorough testing.
- Reduce surprises through consistent testing.
- Ensure safety in complex, multi-component systems.

9. Diverse Infrastructure for Resilience:

- Maintain diverse infrastructure to prevent total outages.
- Mitigate latent bugs with diverse network backbones.
- Diversity can be the difference between a troublesome outage and a total one.

10. Reduce MTTR (Mean Time to Recovery) with Automated Measures:

- Automate mitigating measures during network failures, as in the sketch below.
- Clear signals trigger automated mitigation.
- Defer root-cause analysis until user impact has been addressed.
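
As one example of automating a network mitigation, the sketch below switches the default route to a backup gateway when the primary stops answering pings. The gateway addresses are placeholders, and a real setup would also need logic to fail back:

#!/bin/bash
# Hypothetical network mitigation: if the primary gateway stops responding,
# replace the default route with a backup gateway (requires root).
set -euo pipefail

primary_gw="192.0.2.1"    # assumed primary gateway
backup_gw="192.0.2.254"   # assumed backup gateway

# Three failed pings is treated as the clear signal to act on
if ! ping -c 3 -W 2 "$primary_gw" > /dev/null; then
    echo "$(date -Is) primary gateway unreachable, failing over" >&2
    ip route replace default via "$backup_gw"
fi

Run from cron or a systemd timer, this keeps the mean time to recovery close to the check interval rather than the time it takes to page a human.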

11. Timely Rollouts for Critical Functions:

- Avoid long delays between rollouts, especially in complex systems.
- Frequent rollouts with proper testing prevent critical function failures.
- Maintain diverse infrastructure to identify latent bugs.

Monitor a Unix server without any monitoring tool | Linux server monitoring from the command line | Script for server monitoring in performance testing

#!/bin/bash

# Time window during which metrics should be collected (24-hour HH:MM:SS)
start_time="10:00:00"
end_time="11:00:00"

# Sampling interval in seconds
interval=30

# Output file
output_file="metrics.txt"

# Wait until the start time is reached
while [[ "$(date +%H:%M:%S)" < "$start_time" ]]; do
    sleep 5
done

# Take one live sar sample per interval until the end time is reached.
# -u = CPU, -r = memory, -d = disk, -n DEV = network interfaces;
# "$interval" 1 tells sar to sample once over that many seconds.
while [[ "$(date +%H:%M:%S)" < "$end_time" ]]; do
    sar -u -r -d -n DEV "$interval" 1 >> "$output_file"
done

Save this script to a file, make it executable (chmod +x script_name.sh), and run it before the start time. Within the specified time window it takes one live sar sample every 30 seconds, covering CPU, memory, disk, and network usage, and appends the output to the metrics.txt file.

After the script finishes, you can open the metrics.txt file in Excel or another tool for analysis. Keep in mind that the disk (-d) and network (-n DEV) sections produce one line per device for every sample, so make sure you have enough disk space for the output file.
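
If you would rather have output that loads straight into Excel, sar can also write its samples to a binary file, and sadf can export them in a delimited format. A minimal sketch, assuming the sysstat package is installed (120 samples of 30 seconds covers one hour):

# Record one hour of live samples into a binary file, then export
# the same activities as semicolon-separated values.
sar -u -r -d -n DEV -o /tmp/metrics.bin 30 120 > /dev/null
sadf -d /tmp/metrics.bin -- -u -r -d -n DEV > metrics.csv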