Bash for SRE: Implementing Google's Reliability Engineering Principles in Shell Scripts
Summary
After our third 3 AM alert in a week, I knew something had to change. Our monitoring scripts were generating false positives, our remediation was manual and error-prone, and our on-call engineers were burning out. That’s when I turned to Google’s Site Reliability Engineering (SRE) books and discovered a revolutionary approach to operations that transformed how we used Bash scripts to maintain our systems.
As Google’s SRE book states, “Hope is not a strategy.” Real reliability comes from systematic approaches to automation, monitoring, and incident response. In this article, I’ll show you how to implement Google’s battle-tested SRE principles using Bash — creating scripts that not only detect issues but implement the same reliability practices that keep Google’s massive infrastructure running.
Understanding SRE Principles for Bash Automation
Bash scripts are often the first line of defense in system reliability, but as Robert Love points out in “Linux System Programming: Talking Directly to the Kernel and C Library”, they’re frequently written as afterthoughts rather than engineered solutions. Google’s SRE approach changes that perspective entirely.
“SRE is what happens when you ask a software engineer to design an operations team.” — Google’s Site Reliability Engineering Book
This philosophy transforms how we should approach Bash scripting. Instead of writing scripts to solve immediate problems, we should be engineering reliable systems through automation. Brian Ward’s “How Linux Works” provides the foundation for understanding how Linux systems operate, but Google’s SRE books teach us how to make them reliable at scale.
The core SRE principles that should guide your Bash scripting include:
- Eliminating toil through automation
- Implementing SLOs (Service Level Objectives) and error budgets
- Embracing risk with calculated trade-offs
- Monitoring the right signals
- Automating away repetitive work
- Implementing graceful degradation
- Designing for failure
“Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” — Google SRE Book
Let’s explore how to implement these principles in Bash scripts that transform operations from reactive firefighting to proactive engineering.
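Before diving into the full examples, here is a minimal sketch, under illustrative names, of the kind of defensive preamble worth wrapping around any script that aspires to be an engineered solution rather than an afterthought: strict error handling, a cleanup trap, and timestamped logging.
#!/bin/bash
# Minimal SRE-minded script skeleton: fail fast, clean up, and log with context
set -Eeuo pipefail   # abort on errors, unset variables, and failed pipeline stages

readonly SCRIPT_NAME="${0##*/}"

log() {
    # Timestamped, parseable log lines so automation output can feed a log pipeline
    printf '%s %s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$SCRIPT_NAME" "$1" "$2"
}

cleanup() {
    local exit_code=$?
    # Release locks, remove temp files, etc. before the script exits
    log INFO "exiting with status $exit_code"
}
trap cleanup EXIT

log INFO "starting reliability check"
# ... the actual automation goes here ...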
Expand your knowledge with Advanced Bash Scripting Techniques for Automation: A Comprehensive Guide
Implementing SLOs and Error Budgets in Bash
One of the most powerful concepts from Google’s SRE practice is the implementation of Service Level Objectives (SLOs) and error budgets. As the SRE book explains:
“An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.” — Google SRE Book
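Stripped to its essentials, an availability SLI is just good events divided by total events, and the SLO is the target that ratio must meet. A minimal sketch with illustrative numbers:
# Availability SLI in its simplest form: good events / total events
good_events=99820
total_events=100000
sli=$(echo "scale=6; $good_events / $total_events" | bc)   # 0.998200
slo_target=0.995

if [ "$(echo "$sli >= $slo_target" | bc)" -eq 1 ]; then
    echo "SLI $sli meets the SLO of $slo_target"
else
    echo "SLI $sli violates the SLO of $slo_target"
fi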
Here’s how we can implement SLO tracking and error budget calculation in Bash:
#!/bin/bash
# SLO Monitoring and Error Budget Calculation
# Based on Google SRE principles for reliability measurement
# Configuration
SERVICE_NAME="api-gateway"
SLO_AVAILABILITY_TARGET=0.995 # 99.5% availability target
LOOKBACK_DAYS=30
ERROR_LOG="/var/log/services/$SERVICE_NAME/errors.log"
REQUEST_LOG="/var/log/services/$SERVICE_NAME/requests.log"
OUTPUT_DIR="/var/lib/monitoring/slo"
function calculate_error_budget {
local service="$1"
local slo_target="$2"
local lookback_days="$3"
local error_log="$4"
local request_log="$5"
echo "Calculating error budget for $service (SLO: ${slo_target}% over $lookback_days days)"
# Get the start time for our lookback period
local start_time
start_time=$(date -d "$lookback_days days ago" "+%Y-%m-%d %H:%M:%S")
# Count total requests (assumes log rotation limits this file to the lookback period)
local total_requests
total_requests=$(grep -c "request_id" "$request_log")
# Count error requests (same assumption about log rotation)
local error_requests
error_requests=$(grep -c "ERROR" "$error_log")
# Calculate current availability
local current_availability
current_availability=$(echo "scale=6; 1 - $error_requests / $total_requests" | bc)
# Calculate error budget
local error_budget_total
error_budget_total=$(echo "scale=6; (1 - $slo_target) * $total_requests" | bc)
local error_budget_used
error_budget_used=$error_requests
local error_budget_remaining
error_budget_remaining=$(echo "scale=6; $error_budget_total - $error_budget_used" | bc)
# Intentionally not declared local: the release-control check after this function reads it
error_budget_percent=$(echo "scale=2; 100 * $error_budget_remaining / $error_budget_total" | bc)
# Prepare output directory
mkdir -p "$OUTPUT_DIR"
# Write results to file
cat > "$OUTPUT_DIR/${service}_slo_report.txt" << EOF
SLO Report for $service
Generated: $(date "+%Y-%m-%d %H:%M:%S")
Lookback Period: $lookback_days days (since $start_time)
Total Requests: $total_requests
Error Requests: $error_requests
Current Availability: $(echo "scale=4; 100 * $current_availability" | bc)%
SLO Target: $(echo "scale=4; 100 * $slo_target" | bc)%
Error Budget Total: $error_budget_total
Error Budget Used: $error_budget_used
Error Budget Remaining: $error_budget_remaining
Error Budget Remaining (percent of budget): $error_budget_percent%
SLO Status: $([ "$(echo "$current_availability >= $slo_target" | bc)" -eq 1 ] && echo "MEETING SLO ✓" || echo "VIOLATING SLO ✗")
EOF
# Output summary to console
echo "SLO Status for $service: $([ "$(echo "$current_availability >= $slo_target" | bc)" -eq 1 ] && echo "MEETING SLO ✓" || echo "VIOLATING SLO ✗")"
echo "Current Availability: $(echo "scale=4; 100 * $current_availability" | bc)%"
echo "Error Budget Remaining: $error_budget_percent%"
# Return status (0 if meeting SLO, 1 if violating)
[ "$(echo "$current_availability >= $slo_target" | bc)" -eq 1 ]
}
# Execute the function with configured parameters
calculate_error_budget "$SERVICE_NAME" "$SLO_AVAILABILITY_TARGET" "$LOOKBACK_DAYS" "$ERROR_LOG" "$REQUEST_LOG"
# Implement release control based on error budget
if [ $? -eq 0 ] && [ "$(echo "$error_budget_percent > 50" | bc)" -eq 1 ]; then
echo "Sufficient error budget remains for deploying new features."
# Here you could automatically approve deployments in your CD system
# Example: curl -X POST https://jenkins.example.com/job/deploy-new-features/build
else
echo "Error budget too low for feature deployments. Focus on reliability improvements."
# Here you could pause feature deployments and prioritize reliability work
# Example: curl -X POST https://jira.example.com/api/create -d '{"type":"reliability","priority":"high"}'
fi
As the “Shell Scripting” guides recommend, this function encapsulates the complex logic of calculating error budgets while providing clear outputs. The SRE book emphasizes that:
“Error budgets are the primary construct we use to balance reliability with the pace of innovation.” — Google SRE Book
This Bash function implements that principle by tracking your service’s reliability and automatically adjusting the pace of change based on the remaining error budget.
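To make this useful day to day, the check needs to run on a schedule and feed release decisions. A minimal sketch, assuming the script above is saved as /usr/local/bin/check-slo.sh (a hypothetical path) and that your CI pipeline can run a gating step:
# Hypothetical crontab entry: recalculate the error budget every hour
#   0 * * * * /usr/local/bin/check-slo.sh >> /var/log/slo-check.log 2>&1

# Gating step a CI job could run before promoting a release
if /usr/local/bin/check-slo.sh; then
    echo "Error budget check passed; allowing deployment"
else
    echo "Error budget check failed; blocking deployment until reliability improves"
    exit 1
fi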
Deepen your understanding in Bulletproof Bash Scripts: Mastering Error Handling for Reliable Automation
Eliminating Toil with Self-Healing Bash Functions
A core tenet of Google’s SRE practice is eliminating toil—manual, repetitive work that scales linearly with service growth. The SRE book defines it clearly:
“If a human operator needs to touch your system during normal operations, you have a bug.” — Google SRE Book
Using principles from “How Linux Works” by Brian Ward and Google’s SRE books, here’s a self-healing Bash function that eliminates common sources of toil:
#!/bin/bash
# Self-Healing System Verifier
# Implements Google SRE principles for toil reduction
function verify_and_heal_system {
local service="$1"
local config_path="${2:-/etc/$service}"
local log_path="${3:-/var/log/$service}"
local executable="${4:-/usr/bin/$service}"
echo "Performing self-healing verification for $service"
# Track actions taken
local healing_actions=()
# 1. Verify configuration files
if [ ! -d "$config_path" ]; then
echo "Configuration directory missing, creating: $config_path"
mkdir -p "$config_path"
healing_actions+=("Created missing config directory: $config_path")
fi
# 2. Check for config file corruption
if [ -f "$config_path/config.yaml" ]; then
if ! yaml-validator "$config_path/config.yaml" &>/dev/null; then
echo "Corrupted configuration detected, restoring from backup"
if [ -f "$config_path/config.yaml.bak" ]; then
cp "$config_path/config.yaml.bak" "$config_path/config.yaml"
healing_actions+=("Restored corrupted config from backup")
else
cp "/etc/defaults/$service/config.yaml" "$config_path/config.yaml"
healing_actions+=("Restored corrupted config from defaults")
fi
fi
fi
# 3. Ensure log directory exists with correct permissions
if [ ! -d "$log_path" ]; then
echo "Log directory missing, creating: $log_path"
mkdir -p "$log_path"
chown "$service:$service" "$log_path"
chmod 755 "$log_path"
healing_actions+=("Created missing log directory: $log_path")
fi
# 4. Check for disk space issues
root_usage=$(df -h / | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$root_usage" -gt 85 ]; then
echo "Disk space critically low (${root_usage}%), cleaning up old logs"
find "$log_path" -name "*.log" -mtime +7 -exec gzip {} \;
find "$log_path" -name "*.gz" -mtime +30 -delete
healing_actions+=("Compressed log files older than 7 days")
healing_actions+=("Removed compressed logs older than 30 days")
fi
# 5. Verify service is running correctly
if ! systemctl is-active "$service" &>/dev/null; then
echo "Service $service is not running, attempting to start"
systemctl start "$service"
sleep 5
if systemctl is-active "$service" &>/dev/null; then
healing_actions+=("Started stopped service: $service")
else
echo "Failed to start service, checking for common issues"
# 5a. Check for port conflicts
port=$(grep -oP 'port:\s*\K\d+' "$config_path/config.yaml" || echo "8080")
if netstat -tuln | grep ":$port " &>/dev/null; then
echo "Port conflict detected on $port"
# Try to find an available port
for new_port in $(seq $((port+1)) $((port+20))); do
if ! netstat -tuln | grep ":$new_port " &>/dev/null; then
echo "Updating configuration to use port $new_port"
sed -i "s/port: $port/port: $new_port/" "$config_path/config.yaml"
systemctl start "$service"
healing_actions+=("Resolved port conflict: changed port from $port to $new_port")
break
fi
done
fi
# 5b. Check for corrupted service file
if [ ! -f "/etc/systemd/system/$service.service" ] || ! systemd-analyze verify "$service" &>/dev/null; then
echo "Service unit file is missing or corrupted, reinstalling"
systemctl daemon-reload
if [ -f "/usr/lib/systemd/system/$service.service.bak" ]; then
cp "/usr/lib/systemd/system/$service.service.bak" "/etc/systemd/system/$service.service"
systemctl daemon-reload
systemctl start "$service"
healing_actions+=("Restored corrupted service unit file")
fi
fi
fi
fi
# 6. Verify dependencies
for dep in $(grep -oP 'depends-on:\s*\K.*' "$config_path/config.yaml" | tr ',' ' ' || echo ""); do
if ! systemctl is-active "$dep" &>/dev/null; then
echo "Dependency $dep is not running, starting it"
systemctl start "$dep"
healing_actions+=("Started required dependency: $dep")
fi
done
# 7. Report healing actions
if [ ${#healing_actions[@]} -eq 0 ]; then
echo "No healing actions needed, system is healthy"
return 0
else
echo "Self-healing completed with ${#healing_actions[@]} actions taken:"
for action in "${healing_actions[@]}"; do
echo " - $action"
done
# Log healing actions for postmortem analysis
echo "[$(date "+%Y-%m-%d %H:%M:%S")] Self-healing actions for $service:" >> "/var/log/self-healing.log"
for action in "${healing_actions[@]}"; do
echo " - $action" >> "/var/log/self-healing.log"
done
return 1
fi
}
# Example usage
verify_and_heal_system "nginx" "/etc/nginx" "/var/log/nginx" "/usr/sbin/nginx"
This function embodies the SRE philosophy:
“Automation is a force multiplier, not a panacea.” — Google SRE Book
By focusing on common failure modes and implementing automated remediation, this script reduces the manual operations work that the SRE book identifies as toil. The function doesn’t just check for problems—it actively fixes them, following the principle from “Unix Power Tools” that good scripts should be autonomous problem solvers.
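Self-healing only removes toil if nobody has to remember to run it. Here is a minimal sketch of a periodic runner that uses flock(1) to prevent overlapping runs; the sourced library path and lock file location are illustrative assumptions:
#!/bin/bash
# Periodic, non-overlapping runner for the self-healing check (call from cron or a systemd timer)
LOCK_FILE="/var/lock/self-heal-nginx.lock"
HEAL_LOG="/var/log/self-healing-runner.log"

# Source the file that defines verify_and_heal_system (hypothetical location)
source /usr/local/lib/self-heal.sh

(
    # Skip this run if a previous one still holds the lock
    flock -n 9 || { echo "Self-healing already running, skipping"; exit 0; }
    verify_and_heal_system "nginx" "/etc/nginx" "/var/log/nginx" "/usr/sbin/nginx" \
        >> "$HEAL_LOG" 2>&1
) 9>"$LOCK_FILE"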
Explore this further in Advanced String Operations in Bash: Building Custom Functions
Building SRE-grade Monitoring with Bash
Google’s SRE practice emphasizes the “four golden signals” of monitoring:
- Latency
- Traffic
- Errors
- Saturation
“If you can only measure four metrics of your user-facing system, focus on these four.” — Google SRE Book
Here’s a Bash implementation that collects these critical metrics:
#!/bin/bash
# Four Golden Signals Monitoring
# Based on Google SRE monitoring principles
# Configuration
SERVICE_NAME="web-api"
METRICS_DIR="/var/lib/metrics/$SERVICE_NAME"
LOG_FILE="/var/log/$SERVICE_NAME/access.log"
ERROR_LOG="/var/log/$SERVICE_NAME/error.log"
INTERVAL=60 # seconds
RETENTION_DAYS=7
ALERT_THRESHOLD_LATENCY=500 # ms
ALERT_THRESHOLD_ERROR_RATE=0.01 # 1%
ALERT_THRESHOLD_SATURATION=0.8 # 80%
function collect_golden_signals {
echo "Collecting the four golden signals for $SERVICE_NAME"
# Ensure metrics directory exists
mkdir -p "$METRICS_DIR"
# Get current timestamp
local timestamp
timestamp=$(date +%s)
# 1. Measure Latency
# Extract response times from logs, calculate 50th, 95th, and 99th percentiles
if [ -f "$LOG_FILE" ]; then
echo "Measuring latency from logs..."
# For this example, assuming log format includes response_time=XXXms
local all_latencies
all_latencies=$(grep -oP 'response_time=\K\d+(?=ms)' "$LOG_FILE" | sort -n)
local count=0
local total=0
local latency_avg=0
local latency_50p=0
local latency_95p=0
local latency_99p=0
if [ -n "$all_latencies" ]; then
count=$(echo "$all_latencies" | wc -l)
if [ "$count" -gt 0 ]; then
total=$(echo "$all_latencies" | paste -sd+ | bc)
latency_avg=$(echo "scale=2; $total / $count" | bc)
# Calculate percentile line numbers (clamped to at least 1 so small samples still return a value)
local line_50p=$((count * 50 / 100)); [ "$line_50p" -lt 1 ] && line_50p=1
local line_95p=$((count * 95 / 100)); [ "$line_95p" -lt 1 ] && line_95p=1
local line_99p=$((count * 99 / 100)); [ "$line_99p" -lt 1 ] && line_99p=1
latency_50p=$(echo "$all_latencies" | sed -n "${line_50p}p")
latency_95p=$(echo "$all_latencies" | sed -n "${line_95p}p")
latency_99p=$(echo "$all_latencies" | sed -n "${line_99p}p")
fi
fi
echo "$timestamp latency_avg $latency_avg" >> "$METRICS_DIR/latency.metrics"
echo "$timestamp latency_50p $latency_50p" >> "$METRICS_DIR/latency.metrics"
echo "$timestamp latency_95p $latency_95p" >> "$METRICS_DIR/latency.metrics"
echo "$timestamp latency_99p $latency_99p" >> "$METRICS_DIR/latency.metrics"
# Check for latency SLO violation
if [ "$latency_95p" -gt "$ALERT_THRESHOLD_LATENCY" ]; then
echo "ALERT: Latency (95th percentile) is ${latency_95p}ms, exceeding threshold of ${ALERT_THRESHOLD_LATENCY}ms"
# Here you would trigger your alerting system
# send_alert "latency" "$SERVICE_NAME" "$latency_95p" "$ALERT_THRESHOLD_LATENCY"
fi
fi
# 2. Measure Traffic
if [ -f "$LOG_FILE" ]; then
echo "Measuring traffic from logs..."
# Count requests in the interval
local start_time=$((timestamp - INTERVAL))
local request_count
# Assumes log timestamps look like [YYYY-MM-DD HH:MM:SS]; this simplified grep only matches
# the second at the start of the interval, so adapt the filter to your real log format
local start_time_fmt
start_time_fmt=$(date -d "@$start_time" "+%Y-%m-%d %H:%M:%S")
request_count=$(grep -c "\[$start_time_fmt" "$LOG_FILE")
echo "$timestamp requests_per_minute $request_count" >> "$METRICS_DIR/traffic.metrics"
# Calculate requests per second
local rps
rps=$(echo "scale=2; $request_count / $INTERVAL" | bc)
echo "$timestamp requests_per_second $rps" >> "$METRICS_DIR/traffic.metrics"
fi
# 3. Measure Errors
if [ -f "$LOG_FILE" ] && [ -f "$ERROR_LOG" ]; then
echo "Measuring error rate..."
local start_time=$((timestamp - INTERVAL))
local start_time_fmt
start_time_fmt=$(date -d "@$start_time" "+%Y-%m-%d %H:%M:%S")
local request_count
request_count=$(grep -c "\[$start_time_fmt" "$LOG_FILE")
local error_count
error_count=$(grep -c "\[$start_time_fmt" "$ERROR_LOG")
local error_rate=0
if [ "$request_count" -gt 0 ]; then
error_rate=$(echo "scale=4; $error_count / $request_count" | bc)
fi
echo "$timestamp error_count $error_count" >> "$METRICS_DIR/errors.metrics"
echo "$timestamp error_rate $error_rate" >> "$METRICS_DIR/errors.metrics"
# Check for error rate SLO violation
if [ "$(echo "$error_rate > $ALERT_THRESHOLD_ERROR_RATE" | bc)" -eq 1 ]; then
error_percent=$(echo "scale=2; 100 * $error_rate" | bc)
echo "ALERT: Error rate is ${error_percent}%, exceeding threshold of $((ALERT_THRESHOLD_ERROR_RATE * 100))%"
# Here you would trigger your alerting system
# send_alert "error_rate" "$SERVICE_NAME" "$error_rate" "$ALERT_THRESHOLD_ERROR_RATE"
fi
fi
# 4. Measure Saturation
echo "Measuring system saturation..."
# CPU saturation
local cpu_usage
cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2 + $4}')
cpu_saturation=$(echo "scale=4; $cpu_usage / 100" | bc)
# Memory saturation
local mem_usage
mem_usage=$(free | grep Mem | awk '{print $3/$2 * 100}')
mem_saturation=$(echo "scale=4; $mem_usage / 100" | bc)
# Disk I/O saturation (using iostat if available)
local disk_saturation=0
if command -v iostat &>/dev/null; then
# Average %util across devices, expressed as a fraction (column layout varies between sysstat versions)
disk_saturation=$(iostat -dx | awk 'NR>3 && NF>1 {sum+=$NF; n++} END {if (n>0) printf "%.4f", sum/(n*100); else print 0}')
fi
echo "$timestamp cpu_saturation $cpu_saturation" >> "$METRICS_DIR/saturation.metrics"
echo "$timestamp mem_saturation $mem_saturation" >> "$METRICS_DIR/saturation.metrics"
echo "$timestamp disk_saturation $disk_saturation" >> "$METRICS_DIR/saturation.metrics"
# Check for saturation SLO violations
if [ "$(echo "$cpu_saturation > $ALERT_THRESHOLD_SATURATION" | bc)" -eq 1 ]; then
cpu_percent=$(echo "scale=0; 100 * $cpu_saturation" | bc)
echo "ALERT: CPU saturation is at ${cpu_percent}%, exceeding threshold of $((ALERT_THRESHOLD_SATURATION * 100))%"
# Here you would trigger your alerting system
# send_alert "cpu_saturation" "$SERVICE_NAME" "$cpu_saturation" "$ALERT_THRESHOLD_SATURATION"
fi
if [ "$(echo "$mem_saturation > $ALERT_THRESHOLD_SATURATION" | bc)" -eq 1 ]; then
mem_percent=$(echo "scale=0; 100 * $mem_saturation" | bc)
echo "ALERT: Memory saturation is at ${mem_percent}%, exceeding threshold of $((ALERT_THRESHOLD_SATURATION * 100))%"
# Here you would trigger your alerting system
# send_alert "mem_saturation" "$SERVICE_NAME" "$mem_saturation" "$ALERT_THRESHOLD_SATURATION"
fi
# Clean up old metrics files
find "$METRICS_DIR" -name "*.metrics" -mtime +"$RETENTION_DAYS" -delete
echo "Completed collection of golden signals for $SERVICE_NAME"
}
# Example usage
collect_golden_signals
This script embodies the monitoring philosophy described in the SRE book:
“Monitoring is one of the primary means by which service owners keep track of a system’s health and availability.” — Google SRE Book
By focusing on the four golden signals, this Bash function helps identify problems that directly impact users, rather than collecting metrics that might be interesting but aren’t actionable. As Robert Love notes in “Linux System Programming”, effective monitoring focuses on symptoms, not causes.
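The script above leaves send_alert commented out because every team wires alerts differently. Here is a minimal sketch of what such a helper could look like; the endpoint URL and payload shape are placeholders for your real alerting API, not an existing one:
function send_alert {
    local signal="$1" service="$2" value="$3" threshold="$4"
    # Placeholder endpoint; point this at your real alerting or events API
    local webhook_url="${ALERT_WEBHOOK_URL:-http://monitoring.internal:9090/v1/events}"
    curl -s -m 5 -X POST "$webhook_url" \
        -H "Content-Type: application/json" \
        -d "{\"service\": \"$service\", \"signal\": \"$signal\", \"value\": \"$value\", \"threshold\": \"$threshold\"}" \
        || echo "WARNING: failed to deliver $signal alert for $service" >&2
}
With a helper like this in place, uncommenting the send_alert calls in collect_golden_signals turns each threshold breach into a delivered alert rather than just a console message.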
Discover related concepts in Advanced String Operations in Bash: Building Custom Functions
Implementing Graceful Degradation in Bash Scripts
A key SRE principle is designing systems to degrade gracefully under load or when dependencies fail. As the SRE book states:
“The worst case scenario for a system under load is cascading failure.” — Google SRE Book
Here’s a Bash function that implements cascading dependency checks with graceful degradation:
#!/bin/bash
# Graceful Degradation Handler
# Based on Google SRE principles for managing system dependencies
# Configuration
CONFIG_FILE="/etc/services-config.json"
LOG_FILE="/var/log/service-health.log"
function check_dependency_health {
local service="$1"
local endpoint="${2:-http://localhost:8080/health}"
local timeout="${3:-5}"
# Try to get health status with timeout
if curl -s -f -m "$timeout" "$endpoint" &>/dev/null; then
echo "UP"
return 0
else
echo "DOWN"
return 1
fi
}
function graceful_degradation {
local service_name="$1"
local config_file="${2:-$CONFIG_FILE}"
echo "Checking dependencies for $service_name with graceful degradation"
# Define degradation levels
local degradation_level="NORMAL"
local feature_flags=()
# Load service configuration
if [ ! -f "$config_file" ]; then
echo "ERROR: Config file not found: $config_file"
return 1
fi
# Extract dependencies from config
local dependencies
dependencies=$(jq -r ".services.\"$service_name\".dependencies[]" "$config_file" 2>/dev/null)
if [ -z "$dependencies" ]; then
echo "No dependencies found for $service_name, operating normally"
return 0
fi
echo "Checking critical dependencies..."
# Check critical dependencies
local critical_deps
critical_deps=$(jq -r ".services.\"$service_name\".dependencies[] | select(.critical == true) | .name" "$config_file")
local critical_failures=0
for dep in $critical_deps; do
local endpoint
endpoint=$(jq -r ".services.\"$dep\".healthEndpoint" "$config_file")
echo -n "Checking critical dependency $dep: "
local status
status=$(check_dependency_health "$dep" "$endpoint")
echo "$status"
if [ "$status" == "DOWN" ]; then
((critical_failures++))
echo "WARNING: Critical dependency $dep is DOWN" | tee -a "$LOG_FILE"
fi
done
# Check non-critical dependencies
echo "Checking non-critical dependencies..."
local noncritical_deps
noncritical_deps=$(jq -r ".services.\"$service_name\".dependencies[] | select(.critical == false) | .name" "$config_file")
local noncritical_failures=0
for dep in $noncritical_deps; do
local endpoint
endpoint=$(jq -r ".services.\"$dep\".healthEndpoint" "$config_file")
echo -n "Checking non-critical dependency $dep: "
local status
status=$(check_dependency_health "$dep" "$endpoint")
echo "$status"
if [ "$status" == "DOWN" ]; then
((noncritical_failures++))
# Add appropriate feature flags to disable
local flags
flags=$(jq -r ".services.\"$service_name\".dependencies[] | select(.name == \"$dep\") | .features[]" "$config_file")
for flag in $flags; do
feature_flags+=("$flag")
echo "Disabling feature: $flag due to dependency $dep failure" | tee -a "$LOG_FILE"
done
fi
done
# Determine degradation level
if [ "$critical_failures" -gt 0 ]; then
degradation_level="SEVERE"
echo "ALERT: Service $service_name in SEVERE degradation mode due to critical dependency failures" | tee -a "$LOG_FILE"
elif [ "$noncritical_failures" -gt 0 ]; then
degradation_level="DEGRADED"
echo "WARNING: Service $service_name in DEGRADED mode, disabling ${#feature_flags[@]} features" | tee -a "$LOG_FILE"
else
echo "Service $service_name operating in NORMAL mode, all dependencies healthy" | tee -a "$LOG_FILE"
fi
# Update service configuration with degradation status
if [ "$degradation_level" != "NORMAL" ]; then
echo "Updating service configuration for degraded operation..."
# Write degradation state file
cat > "/var/run/$service_name.degraded" << EOF
{
"degradationLevel": "$degradation_level",
"timestamp": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
"disabledFeatures": [$(printf '"%s",' "${feature_flags[@]}" | sed 's/,$//')]
}
EOF
# Reload service to apply degradation changes
if [ -x "/usr/bin/systemctl" ]; then
echo "Reloading service to apply degraded mode..."
systemctl reload "$service_name" || systemctl restart "$service_name"
fi
# Report degradation to monitoring system
echo "Reporting degradation status to monitoring..."
if [ -x "/usr/bin/curl" ]; then
curl -s -XPOST "http://monitoring.internal:9090/v1/events" -d "{
\"service\": \"$service_name\",
\"degradation_level\": \"$degradation_level\",
\"disabled_features\": [$(printf '"%s",' "${feature_flags[@]}" | sed 's/,$//')]
}" || true
fi
else
# If previously degraded, recover
if [ -f "/var/run/$service_name.degraded" ]; then
echo "Recovering from previous degradation..."
rm "/var/run/$service_name.degraded"
# Reload service to apply normal operation
if [ -x "/usr/bin/systemctl" ]; then
systemctl reload "$service_name" || true
fi
# Report recovery to monitoring system
if [ -x "/usr/bin/curl" ]; then
curl -s -XPOST "http://monitoring.internal:9090/v1/events" -d "{
\"service\": \"$service_name\",
\"degradation_level\": \"NORMAL\",
\"recovery\": true
}" || true
fi
fi
fi
# Return status based on degradation level
case "$degradation_level" in
"NORMAL")
return 0
;;
"DEGRADED")
return 10
;;
"SEVERE")
return 20
;;
esac
}
# Example usage
graceful_degradation "payment-service"
This function implements the SRE principle:
“The user experience is almost always improved by a quick failure rather than a slow response.” — Google SRE Book
By detecting which dependencies are down and adjusting the service behavior accordingly, this script enables graceful degradation rather than complete failure. According to principles in “Shell Scripting” guides, this kind of defensive programming is critical for systems that need to maintain availability even when components fail.
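On the application side, the degradation state file written above is only useful if something reads it. Here is a minimal sketch of a helper that checks whether a feature is currently enabled, assuming jq is available and the state file format shown above; the feature name is illustrative:
function feature_enabled {
    local service="$1" feature="$2"
    local state_file="/var/run/$service.degraded"
    # No state file means the service is running in NORMAL mode
    [ -f "$state_file" ] || return 0
    # Enabled only if the feature is not listed under disabledFeatures
    ! jq -e --arg f "$feature" '.disabledFeatures | index($f)' "$state_file" >/dev/null
}

# Example: skip an optional step when its backing dependency is down
if feature_enabled "payment-service" "fraud-scoring"; then
    echo "Running fraud scoring"
else
    echo "Fraud scoring disabled while degraded; using cached risk score"
fi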
Uncover more details in Bulletproof Bash Scripts: Mastering Error Handling for Reliable Automation
Blameless Postmortem Automation with Bash
Google’s SRE practice emphasizes learning from failures through blameless postmortems. As the SRE book notes:
“A blamelessly written postmortem assumes that everyone involved in an incident had good intentions.” — Google SRE Book
Here’s a Bash function that helps automate the collection of data for postmortems:
#!/bin/bash
# Incident Data Collection for Blameless Postmortems
# Based on Google SRE principles for learning from failures
function collect_postmortem_data {
local incident_id="$1"
local service="$2"
local start_time="$3" # Format: YYYY-MM-DD HH:MM:SS
local end_time="${4:-$(date '+%Y-%m-%d %H:%M:%S')}"
local output_dir="${5:-/var/log/incidents/$incident_id}"
echo "Collecting data for incident #$incident_id ($service) from $start_time to $end_time"
# Create output directory
mkdir -p "$output_dir"
# Convert times to Unix timestamps
local start_timestamp
start_timestamp=$(date -d "$start_time" +%s)
local end_timestamp
end_timestamp=$(date -d "$end_time" +%s)
# Record basic incident information
cat > "$output_dir/incident_info.txt" << EOF
Incident ID: $incident_id
Service: $service
Start Time: $start_time
End Time: $end_time
Duration: $(( (end_timestamp - start_timestamp) / 60 )) minutes
Data Collection Time: $(date '+%Y-%m-%d %H:%M:%S')
EOF
# Collect system information
echo "Collecting system information..."
{
echo "=== System Information ==="
echo "Hostname: $(hostname)"
echo "Kernel Version: $(uname -r)"
echo "OS Info: $(cat /etc/os-release | grep PRETTY_NAME | cut -d= -f2)"
echo ""
echo "=== Service Status ==="
systemctl status "$service" 2>/dev/null || echo "Service status not available"
echo ""
echo "=== CPU Info ==="
lscpu
echo ""
echo "=== Memory Information ==="
free -h
echo ""
echo "=== Disk Usage ==="
df -h
echo ""
} > "$output_dir/system_info.txt"
# Collect service logs
echo "Collecting service logs..."
{
echo "=== Service Logs ($start_time to $end_time) ==="
journalctl -u "$service" --since "$start_time" --until "$end_time" 2>/dev/null ||
cat "/var/log/$service/"*".log" 2>/dev/null |
grep -A1000000 -B1000000 "$(date -d "$start_time" '+%b %d %H:%M:%S')" |
grep -A1000000 -B1000000 "$(date -d "$end_time" '+%b %d %H:%M:%S')" ||
echo "No logs found for the service during the incident timeframe"
} > "$output_dir/service_logs.txt"
# Collect metrics data
echo "Collecting metrics data..."
if [ -d "/var/lib/metrics/$service" ]; then
cp -r "/var/lib/metrics/$service" "$output_dir/metrics"
# Generate metric summaries for the incident period
for metric_file in "$output_dir/metrics"/*.metrics; do
if [ -f "$metric_file" ]; then
metric_name=$(basename "$metric_file" .metrics)
echo "Summarizing $metric_name data..."
{
echo "=== $metric_name Summary ==="
grep -v '^#' "$metric_file" |
awk -v start="$start_timestamp" -v end="$end_timestamp" '$1 >= start && $1 <= end' |
awk '{sum+=$3; count++} END {if(count>0) print "Average: " sum/count; print "Count: " count}'
# Extract min/max values
echo "Min: $(grep -v '^#' "$metric_file" |
awk -v start="$start_timestamp" -v end="$end_timestamp" '$1 >= start && $1 <= end' |
sort -k3n | head -1 | awk '{print $3}')"
echo "Max: $(grep -v '^#' "$metric_file" |
awk -v start="$start_timestamp" -v end="$end_timestamp" '$1 >= start && $1 <= end' |
sort -k3n | tail -1 | awk '{print $3}')"
} > "$output_dir/metrics_summary_$metric_name.txt"
fi
done
else
echo "No metrics data found for $service" > "$output_dir/metrics_summary.txt"
fi
# Collect deployment information
echo "Collecting deployment information..."
{
echo "=== Recent Deployments ==="
if [ -f "/var/log/deployments/$service.log" ]; then
grep -B 10 -A 2 "$(date -d "$start_time" '+%Y-%m-%d')" "/var/log/deployments/$service.log" ||
echo "No recent deployments found in logs"
else
echo "No deployment logs found"
fi
# Check git history if applicable
if [ -d "/opt/$service/.git" ]; then
echo ""
echo "=== Recent Code Changes ==="
cd "/opt/$service" || echo "Could not change to directory"
git log --since="$(date -d "$start_time -7 days" '+%Y-%m-%d')" --until="$(date -d "$end_time" '+%Y-%m-%d')" --pretty=format:"%h - %an, %ar : %s"
fi
} > "$output_dir/deployment_info.txt"
# Collect configuration
echo "Collecting configuration files..."
{
echo "=== Configuration Files ==="
if [ -d "/etc/$service" ]; then
echo "Configuration directory contains:"
find "/etc/$service" -type f -name "*.conf" -o -name "*.yaml" -o -name "*.json" -o -name "*.ini" | sort
echo ""
for conf_file in $(find "/etc/$service" -type f -name "*.conf" -o -name "*.yaml" -o -name "*.json" -o -name "*.ini" | sort); do
echo "=== Contents of $(basename "$conf_file") ==="
# Redact potential secrets
cat "$conf_file" | sed -E 's/(password|secret|key|token|credential)([":= ]+)([^",:= ]+)/\1\2[REDACTED]/gi'
echo ""
done
else
echo "No configuration directory found at /etc/$service"
fi
} > "$output_dir/configuration.txt"
# Collect network information
echo "Collecting network information..."
{
echo "=== Network Information ==="
echo "Open ports:"
netstat -tulpn | grep -i "$service" || echo "No ports found for service"
echo ""
echo "Network connections:"
netstat -tan | grep -v LISTEN | head -20
echo ""
echo "DNS resolution:"
host "$(hostname)" || echo "Hostname resolution failed"
} > "$output_dir/network_info.txt"
# Create a timeline of events
echo "Creating incident timeline..."
{
echo "=== Incident Timeline ==="
echo "Reconstructed from logs and metrics. Times are approximate."
echo ""
# Extract significant events from logs
grep -i "error\|warn\|fail\|exception\|timeout" "$output_dir/service_logs.txt" |
grep -v "DEBUG" |
head -50 |
while read -r line; do
# Extract timestamp and message
timestamp=$(echo "$line" | grep -oE '^[A-Z][a-z]{2} [0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2}' ||
echo "$line" | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}')
if [ -n "$timestamp" ]; then
message=$(echo "$line" | sed "s/$timestamp//")
echo "[$timestamp] $message"
fi
done | sort
echo ""
echo "=== Metric Anomalies ==="
# Look for metric spikes or drops
for metric_file in "$output_dir/metrics"/*.metrics; do
if [ -f "$metric_file" ]; then
metric_name=$(basename "$metric_file" .metrics)
# Calculate average and standard deviation
avg=$(grep -v '^#' "$metric_file" |
awk -v start="$((start_timestamp-3600))" -v end="$((start_timestamp-60))" '$1 >= start && $1 <= end' |
awk '{sum+=$3; count++} END {if(count>0) print sum/count; else print 0}')
# Find values that deviate significantly during incident
grep -v '^#' "$metric_file" |
awk -v start="$start_timestamp" -v end="$end_timestamp" -v avg="$avg" '$1 >= start && $1 <= end && ($3 > avg*2 || $3 < avg*0.5)' |
while read -r ts value rest; do
event_time=$(date -d "@$ts" '+%Y-%m-%d %H:%M:%S')
echo "[$event_time] Anomaly in $metric_name: value=$value (baseline≈$avg)"
done
fi
done | sort | head -20
} > "$output_dir/timeline.txt"
# Create a postmortem template
echo "Creating postmortem template..."
cat > "$output_dir/postmortem_template.md" << EOF
# Incident Postmortem: #$incident_id - $service
## Summary
*Brief summary of the incident. What happened? What was the impact?*
## Timeline
*Key events during the incident. Use the timeline.txt file as a reference.*
## Root Cause
*What was the underlying cause of the incident?*
## Impact
*What was the impact to users, systems, and the business?*
## Detection
*How was the incident detected? Was monitoring sufficient?*
## Resolution
*How was the incident resolved?*
## What Went Well
*What aspects of the incident response worked effectively?*
## What Went Wrong
*What could have been improved in our response?*
## Action Items
*Specific, actionable items to prevent recurrence or improve response*
1. [ ] *Action item 1*
2. [ ] *Action item 2*
3. [ ] *Action item 3*
## Lessons Learned
*What broader lessons can we take from this incident?*
EOF
echo "Postmortem data collection complete. All data saved to $output_dir"
echo "To start the postmortem process, edit the template at $output_dir/postmortem_template.md"
return 0
}
# Example usage
collect_postmortem_data "INC-2023-42" "payment-api" "2023-08-25 14:30:00" "2023-08-25 15:45:00"
This function implements the SRE principle:
“Postmortems should be written for all major incidents. A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.” — Google SRE Book
By automating the collection of incident data, this script makes it easier to focus on learning from failures rather than blaming individuals. As Brian Ward emphasizes in “How Linux Works”, understanding system behavior during failures is critical for building more reliable systems.
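Collection becomes even more valuable when it is triggered automatically rather than from memory in the middle of an incident. Here is a minimal sketch of an alert hook that could call the function above; the payload fields, paths, and incident ID scheme are all assumptions:
#!/bin/bash
# Hypothetical alert hook: turn a firing alert into a postmortem data bundle
# Expects a JSON payload on stdin with "service" and "startsAt" fields (assumed format)
source /usr/local/lib/postmortem.sh   # provides collect_postmortem_data (hypothetical path)

payload=$(cat)
service=$(echo "$payload" | jq -r '.service')
start_time=$(date -d "$(echo "$payload" | jq -r '.startsAt')" '+%Y-%m-%d %H:%M:%S')
incident_id="INC-$(date +%Y%m%d-%H%M%S)"

collect_postmortem_data "$incident_id" "$service" "$start_time"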
Journey deeper into this topic with Advanced Bash Scripting Techniques for Automation: A Comprehensive Guide
SRE-Aligned Capacity Planning with Bash
Google’s SRE practice includes principles for capacity planning and load testing. As the SRE book notes:
“Efficient capacity planning is a key component in maintaining system reliability while controlling costs.” — Google SRE Book
Here’s a Bash function that implements capacity forecasting:
#!/bin/bash
# Capacity Planning Helper
# Based on Google SRE principles for forecasting resource needs
function forecast_capacity {
local service="$1"
local metrics_dir="${2:-/var/lib/metrics/$service}"
local forecast_days="${3:-30}"
local output_file="${4:-/var/lib/capacity/$service/forecast.json}"
echo "Generating capacity forecast for $service for the next $forecast_days days"
# Ensure output directory exists
mkdir -p "$(dirname "$output_file")"
# Check that we have enough historical data
if [ ! -d "$metrics_dir" ]; then
echo "ERROR: Metrics directory not found: $metrics_dir"
return 1
fi
# Check if we have required metrics files
required_metrics=("cpu_usage.metrics" "memory_usage.metrics" "disk_usage.metrics" "requests.metrics")
for metric in "${required_metrics[@]}"; do
if [ ! -f "$metrics_dir/$metric" ]; then
echo "WARNING: Missing required metric file: $metrics_dir/$metric"
fi
done
echo "Analyzing historical usage patterns..."
# Calculate growth rates and forecast future usage
declare -A growth_rates
declare -A current_usage
declare -A max_usage
declare -A forecasted_usage
# Process each metric file
for metric_file in "$metrics_dir"/*.metrics; do
if [ -f "$metric_file" ]; then
metric_name=$(basename "$metric_file" .metrics)
echo "Processing $metric_name metrics..."
# Get current usage (average of last day)
current=$(grep -v '^#' "$metric_file" |
tail -n 288 | # Assuming 5-minute intervals = 288 datapoints per day
awk '{sum+=$3; count++} END {print sum/count}')
# Get historical max
max=$(grep -v '^#' "$metric_file" |
awk '{print $3}' |
sort -rn |
head -1)
# Calculate the daily growth rate across the retention window
# Baseline: average of the oldest day in the file (roughly 30 days ago with 30-day retention)
past=$(grep -v '^#' "$metric_file" |
head -n 288 |
awk '{sum+=$3; count++} END {print sum/count}')
# Avoid division by zero (and taking the log of a non-positive number)
if (( $(echo "$past > 0 && $current > 0" | bc -l) )); then
# Daily compound growth rate: (current/past)^(1/30) - 1, computed via e() and l()
# because bc's ^ operator only handles integer exponents
growth=$(echo "scale=6; e(l($current / $past) / 30) - 1" | bc -l)
else
growth="0"
fi
# Store values
current_usage["$metric_name"]=$(printf "%.2f" "$current")
max_usage["$metric_name"]=$(printf "%.2f" "$max")
growth_rates["$metric_name"]=$(printf "%.4f" "$growth")
# Calculate forecasted value
forecast=$(echo "scale=2; $current * (1 + $growth) ^ $forecast_days" | bc -l)
forecasted_usage["$metric_name"]=$forecast
fi
done
# Generate recommendations based on forecasts
recommendations=()
# Check CPU forecast
if [ -n "${forecasted_usage[cpu_usage]}" ]; then
current_cpu=${current_usage[cpu_usage]}
forecast_cpu=${forecasted_usage[cpu_usage]}
if (( $(echo "$forecast_cpu > 85" | bc -l) )); then
recommendations+=("CPU capacity will reach ${forecast_cpu}% in $forecast_days days, which exceeds the 85% threshold. Consider scaling up CPU resources.")
elif (( $(echo "$forecast_cpu > 70" | bc -l) )); then
recommendations+=("CPU capacity will reach ${forecast_cpu}% in $forecast_days days. Monitor closely and prepare for potential scaling.")
fi
fi
# Check Memory forecast
if [ -n "${forecasted_usage[memory_usage]}" ]; then
current_mem=${current_usage[memory_usage]}
forecast_mem=${forecasted_usage[memory_usage]}
if (( $(echo "$forecast_mem > 90" | bc -l) )); then
recommendations+=("Memory usage will reach ${forecast_mem}% in $forecast_days days, which exceeds the 90% threshold. Increase memory capacity.")
elif (( $(echo "$forecast_mem > 75" | bc -l) )); then
recommendations+=("Memory usage will reach ${forecast_mem}% in $forecast_days days. Plan for memory expansion soon.")
fi
fi
# Check Disk forecast
if [ -n "${forecasted_usage[disk_usage]}" ]; then
current_disk=${current_usage[disk_usage]}
forecast_disk=${forecasted_usage[disk_usage]}
if (( $(echo "$forecast_disk > 85" | bc -l) )); then
recommendations+=("Disk usage will reach ${forecast_disk}% in $forecast_days days, which exceeds the 85% threshold. Increase storage capacity.")
elif (( $(echo "$forecast_disk > 70" | bc -l) )); then
recommendations+=("Disk usage will reach ${forecast_disk}% in $forecast_days days. Plan for storage expansion.")
fi
fi
# Check Request forecast
if [ -n "${forecasted_usage[requests]}" ]; then
current_req=${current_usage[requests]}
forecast_req=${forecasted_usage[requests]}
growth_pct=$(echo "scale=2; 100 * ${growth_rates[requests]}" | bc -l)
requests_per_day=$(grep -v '^#' "$metrics_dir/requests.metrics" |
tail -n 288 |
awk '{sum+=$3} END {print sum}')
forecasted_daily_requests=$(echo "scale=0; $requests_per_day * (1 + ${growth_rates[requests]}) ^ $forecast_days" | bc -l)
if (( $(echo "$growth_pct > 5" | bc -l) )); then
recommendations+=("Request volume is growing at ${growth_pct}% daily. Forecast of ${forecasted_daily_requests} daily requests in $forecast_days days. Review scaling strategy.")
fi
fi
# Generate the forecast report
echo "Writing capacity forecast to $output_file..."
# Format JSON with recommendations as array
rec_json=""
for i in "${!recommendations[@]}"; do
if [ $i -gt 0 ]; then
rec_json+=", "
fi
rec_json+="\"${recommendations[$i]}\""
done
# Create JSON output
cat > "$output_file" << EOF
{
"service": "$service",
"forecast_date": "$(date '+%Y-%m-%d')",
"forecast_period_days": $forecast_days,
"current_usage": {
$(for metric in "${!current_usage[@]}"; do echo " \"$metric\": ${current_usage[$metric]},"; done | sed '$s/,$//')
},
"growth_rates_daily": {
$(for metric in "${!growth_rates[@]}"; do echo " \"$metric\": ${growth_rates[$metric]},"; done | sed '$s/,$//')
},
"forecasted_usage": {
$(for metric in "${!forecasted_usage[@]}"; do echo " \"$metric\": ${forecasted_usage[$metric]},"; done | sed '$s/,$//')
},
"recommendations": [
$rec_json
]
}
EOF
# Generate a human-readable summary
echo "Capacity Forecast Summary for $service:"
echo "======================================"
echo "Forecast period: $forecast_days days"
echo ""
echo "Current Usage:"
for metric in "${!current_usage[@]}"; do
echo " $metric: ${current_usage[$metric]}"
done
echo ""
echo "Daily Growth Rates:"
for metric in "${!growth_rates[@]}"; do
growth_pct=$(echo "scale=4; 100 * ${growth_rates[$metric]}" | bc -l)
echo " $metric: ${growth_pct}% per day"
done
echo ""
echo "Forecasted Usage (in $forecast_days days):"
for metric in "${!forecasted_usage[@]}"; do
echo " $metric: ${forecasted_usage[$metric]}"
done
echo ""
echo "Recommendations:"
for rec in "${recommendations[@]}"; do
echo " * $rec"
done
return 0
}
# Example usage
forecast_capacity "web-service" "/var/lib/metrics/web-service" 60
This function implements the capacity planning principles from the SRE book:
“Capacity planning should take future usage into account, not just current usage.” — Google SRE Book
By analyzing growth trends and making forecasts, this script helps you anticipate capacity needs before they become critical. Robert Love’s “Linux System Programming” emphasizes that proactive capacity management is essential for maintaining system reliability under increasing load.
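The forecasting math itself is simple compound growth: forecast = current * (1 + g)^d, where g is the daily growth rate and d the number of days. A quick sanity check with bc, using illustrative numbers:
# Worked example: 40% CPU today, growing 1% per day, forecast 60 days out
current=40
growth=0.01
days=60
# bc's ^ operator is safe here because the exponent ($days) is an integer
forecast=$(echo "$current * (1 + $growth) ^ $days" | bc -l)
printf 'Forecasted CPU usage in %s days: %.2f%%\n' "$days" "$forecast"   # ≈ 72.67%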
Enrich your learning with Bash Scripting Meets AWS: A DevOps Love Story
Conclusion: Embracing the SRE Mindset with Bash
Throughout this article, we’ve explored how to implement Google’s SRE principles using Bash scripts. From monitoring the four golden signals to implementing error budgets, these practices can transform your operations from reactive firefighting to proactive engineering.
As the Google SRE book concludes:
“The SRE approach is about applying engineering to operations, creating scalable and reliable software systems.” — Google SRE Book
The functions we’ve explored demonstrate that Bash scripts, when written with SRE principles in mind, can be powerful tools for implementing reliable automation. By combining the insights from “Linux System Programming” by Robert Love, “Unix Power Tools”, “How Linux Works” by Brian Ward, and Google’s SRE books, we can create scripts that don’t just automate tasks but build more reliable systems.
I encourage you to integrate these patterns into your own Bash scripts. Start small—perhaps with implementing the four golden signals monitoring—and gradually incorporate more SRE principles as you become comfortable with them. Remember that as “Shell Scripting” guides emphasize, the goal isn’t perfection on day one but continuous improvement toward more reliable systems.
What SRE principles have you implemented in your Bash scripts? I’d love to hear your experiences and additional techniques in the comments below!