Bash for SRE: Implementing Google's Reliability Engineering Principles in Shell Scripts

Karandeep Singh
26 minutes read


Learn to integrate Google’s SRE principles into your Bash scripts with practical functions for implementing SLOs, reducing toil, and building more reliable systems through automation.

After our third 3 AM alert in a week, I knew something had to change. Our monitoring scripts were generating false positives, our remediation was manual and error-prone, and our on-call engineers were burning out. That’s when I turned to Google’s Site Reliability Engineering (SRE) books and discovered a revolutionary approach to operations that transformed how we used Bash scripts to maintain our systems.

As Google’s SRE book states, “Hope is not a strategy.” Real reliability comes from systematic approaches to automation, monitoring, and incident response. In this article, I’ll show you how to implement Google’s battle-tested SRE principles using Bash — creating scripts that not only detect issues but implement the same reliability practices that keep Google’s massive infrastructure running.

Understanding SRE Principles for Bash Automation

Bash scripts are often the first line of defense in system reliability, but as Robert Love points out in “Linux System Programming: Talking Directly to the Kernel and C Library”, they’re frequently written as afterthoughts rather than engineered solutions. Google’s SRE approach changes that perspective entirely.

“SRE is what happens when you ask a software engineer to design an operations team.” — Google’s Site Reliability Engineering Book

This philosophy transforms how we should approach Bash scripting. Instead of writing scripts to solve immediate problems, we should be engineering reliable systems through automation. Brian Ward’s “How Linux Works” provides the foundation for understanding how Linux systems operate, but Google’s SRE books teach us how to make them reliable at scale.

The core SRE principles that should guide your Bash scripting include:

  1. Eliminating toil by automating repetitive work
  2. Implementing SLOs (Service Level Objectives) and error budgets
  3. Embracing risk with calculated trade-offs
  4. Monitoring the right signals
  5. Implementing graceful degradation
  6. Designing for failure

“Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” — Google SRE Book

Let’s explore how to implement these principles in Bash scripts that transform operations from reactive firefighting to proactive engineering.
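Before the larger scripts, here is a minimal sketch of what "designing for failure" can look like inside a single script. It is my own illustration rather than something taken from the SRE book: `run_with_cleanup` and the `WORKDIR` variable are hypothetical names, and the sketch assumes Bash with `mktemp` available.

```bash
#!/bin/bash
# Sketch only: run_with_cleanup and WORKDIR are hypothetical names.
set -u -o pipefail   # treat unset variables and failed pipeline stages as errors

# Run a command with a private scratch directory that is removed when the
# command finishes, whether it succeeded or failed.
run_with_cleanup() {
    (
        workdir=$(mktemp -d) || exit 1
        trap 'rm -rf "$workdir"' EXIT   # fires when this subshell exits
        WORKDIR="$workdir" "$@"
    )
}

# Example: the command fails, but its scratch directory is still cleaned up.
run_with_cleanup sh -c 'echo "working in $WORKDIR"; exit 3' || echo "command failed with status $?"
```

The subshell keeps the trap local to one unit of work, so a failure in that unit cannot leave temporary state behind for the next run.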

Implementing SLOs and Error Budgets in Bash

One of the most powerful concepts from Google’s SRE practice is the implementation of Service Level Objectives (SLOs) and error budgets. As the SRE book explains:

“An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.” — Google SRE Book

Here’s how we can implement SLO tracking and error budget calculation in Bash:


#!/bin/bash
# SLO Monitoring and Error Budget Calculation
# Based on Google SRE principles for reliability measurement

# Configuration (names and paths are example placeholders; adjust for your environment)
SLO_AVAILABILITY_TARGET=0.995  # 99.5% availability target, expressed as a fraction
SERVICE_NAME="${SERVICE_NAME:-api-service}"
LOOKBACK_DAYS="${LOOKBACK_DAYS:-30}"
ERROR_LOG="${ERROR_LOG:-/var/log/$SERVICE_NAME/error.log}"
REQUEST_LOG="${REQUEST_LOG:-/var/log/$SERVICE_NAME/access.log}"
OUTPUT_DIR="${OUTPUT_DIR:-/var/lib/slo-reports}"

function calculate_error_budget {
    local service="$1"
    local slo_target="$2"      # fraction between 0 and 1, e.g. 0.995
    local lookback_days="$3"
    local error_log="$4"
    local request_log="$5"
    local slo_percent
    slo_percent=$(echo "scale=4; 100 * $slo_target" | bc)
    echo "Calculating error budget for $service (SLO: ${slo_percent}% over $lookback_days days)"
    # Get the start time for our lookback period (GNU date syntax)
    local start_time
    start_time=$(date -d "$lookback_days days ago" "+%Y-%m-%d %H:%M:%S")
    # Count total requests. grep -c counts every matching line in the file, so this
    # assumes the logs are rotated to cover only the lookback period.
    local total_requests
    total_requests=$(grep -c "request_id" "$request_log")
    # Count error requests (same assumption about log rotation)
    local error_requests
    error_requests=$(grep -c "ERROR" "$error_log")
    error_requests="${error_requests:-0}"
    # Without traffic there is nothing to measure, and dividing by zero would fail
    if [ "${total_requests:-0}" -eq 0 ]; then
        echo "No requests found in $request_log, cannot calculate availability" >&2
        return 2
    fi
    # Calculate current availability
    local current_availability
    current_availability=$(echo "scale=6; 1 - $error_requests / $total_requests" | bc)
    # Calculate error budget: the number of requests that are allowed to fail
    local error_budget_total
    error_budget_total=$(echo "scale=6; (1 - $slo_target) * $total_requests" | bc)
    local error_budget_used="$error_requests"
    local error_budget_remaining
    error_budget_remaining=$(echo "scale=6; $error_budget_total - $error_budget_used" | bc)
    local error_budget_percent
    error_budget_percent=$(echo "scale=2; 100 * $error_budget_remaining / $error_budget_total" | bc)
    # Expose the remaining budget to the caller through a global, because the
    # function's exit status is already used for the SLO result
    ERROR_BUDGET_PERCENT="$error_budget_percent"
    # Evaluate the SLO once and reuse the result
    local slo_met=0
    local slo_status="VIOLATING SLO ✗"
    if [ "$(echo "$current_availability >= $slo_target" | bc)" -eq 1 ]; then
        slo_met=1
        slo_status="MEETING SLO ✓"
    fi
    # Prepare output directory
    mkdir -p "$OUTPUT_DIR"
    # Write results to file
    cat > "$OUTPUT_DIR/${service}_slo_report.txt" << EOF
SLO Report for $service
Generated: $(date "+%Y-%m-%d %H:%M:%S")
Lookback Period: $lookback_days days (since $start_time)

Total Requests: $total_requests
Error Requests: $error_requests
Current Availability: $(echo "scale=4; 100 * $current_availability" | bc)%
SLO Target: ${slo_percent}%

Error Budget Total: $error_budget_total
Error Budget Used: $error_budget_used
Error Budget Remaining: $error_budget_remaining
Error Budget Remaining (%): $error_budget_percent%

SLO Status: $slo_status
EOF

    # Output summary to console
    echo "SLO Status for $service: $slo_status"
    echo "Current Availability: $(echo "scale=4; 100 * $current_availability" | bc)%"
    echo "Error Budget Remaining: $error_budget_percent%"
    # Return status (0 if meeting SLO, 1 if violating)
    [ "$slo_met" -eq 1 ]
}

# Execute the function with configured parameters
calculate_error_budget "$SERVICE_NAME" "$SLO_AVAILABILITY_TARGET" "$LOOKBACK_DAYS" "$ERROR_LOG" "$REQUEST_LOG"
slo_result=$?

# Implement release control based on error budget
if [ "$slo_result" -eq 0 ] && [ "$(echo "${ERROR_BUDGET_PERCENT:-0} > 50" | bc)" -eq 1 ]; then
    echo "Sufficient error budget remains for deploying new features."
    # Here you could automatically approve deployments in your CD system
    # Example: curl -X POST
else
    echo "Error budget too low for feature deployments. Focus on reliability improvements."
    # Here you could pause feature deployments and prioritize reliability work
    # Example: curl -X POST -d '{"type":"reliability","priority":"high"}'
fi

As the “Shell Scripting” guides recommend, this function encapsulates the complex logic of calculating error budgets while providing clear outputs. The SRE book emphasizes that:

“Error budgets are the primary construct we use to balance reliability with the pace of innovation.” — Google SRE Book

This Bash function implements that principle by tracking your service’s reliability and automatically adjusting the pace of change based on the remaining error budget.
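An error budget is often easier to reason about as time than as a request count. As a rough sketch, the helper below converts an availability target into allowed downtime for a given window. `allowed_downtime_minutes` is a hypothetical name of my own, and it uses `awk` instead of `bc` only to keep the example free of extra dependencies.

```bash
#!/bin/bash
# Convert an availability SLO into an allowed-downtime budget.
# allowed_downtime_minutes is a hypothetical helper for illustration.
allowed_downtime_minutes() {
    local slo_target="$1"   # fraction, e.g. 0.995
    local window_days="$2"  # length of the SLO window in days
    awk -v slo="$slo_target" -v days="$window_days" \
        'BEGIN { printf "%.1f\n", (1 - slo) * days * 24 * 60 }'
}

allowed_downtime_minutes 0.995 30   # 99.5% over 30 days -> 216.0 minutes
allowed_downtime_minutes 0.999 30   # 99.9% over 30 days -> 43.2 minutes
```

Moving from 99.5% to 99.9% cuts the monthly budget from about three and a half hours to about 43 minutes, which is why the target itself deserves as much scrutiny as the tooling around it.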

Eliminating Toil with Self-Healing Bash Functions

A core tenet of Google’s SRE practice is eliminating toil: manual, repetitive work that scales linearly with service growth. The SRE book puts the goal bluntly:

“If a human operator needs to touch your system during normal operations, you have a bug.” — Google SRE Book

Using principles from “How Linux Works” by Brian Ward and Google’s SRE books, here’s a self-healing Bash function that eliminates common sources of toil:


#!/bin/bash
# Self-Healing System Verifier
# Implements Google SRE principles for toil reduction

function verify_and_heal_system {
    local service="$1"
    local config_path="${2:-/etc/$service}"
    local log_path="${3:-/var/log/$service}"
    local executable="${4:-/usr/bin/$service}"
    echo "Performing self-healing verification for $service"
    # Track actions taken
    local healing_actions=()
    # 1. Verify configuration files
    if [ ! -d "$config_path" ]; then
        echo "Configuration directory missing, creating: $config_path"
        mkdir -p "$config_path"
        healing_actions+=("Created missing config directory: $config_path")
    fi
    # 2. Check for config file corruption
    # (yaml-validator stands in for whichever YAML linter is installed on the host)
    if [ -f "$config_path/config.yaml" ]; then
        if ! yaml-validator "$config_path/config.yaml" &>/dev/null; then
            echo "Corrupted configuration detected, restoring from backup"
            if [ -f "$config_path/config.yaml.bak" ]; then
                cp "$config_path/config.yaml.bak" "$config_path/config.yaml"
                healing_actions+=("Restored corrupted config from backup")
            else
                cp "/etc/defaults/$service/config.yaml" "$config_path/config.yaml"
                healing_actions+=("Restored corrupted config from defaults")
            fi
        fi
    fi
    # 3. Ensure log directory exists with correct permissions
    if [ ! -d "$log_path" ]; then
        echo "Log directory missing, creating: $log_path"
        mkdir -p "$log_path"
        chown "$service:$service" "$log_path"
        chmod 755 "$log_path"
        healing_actions+=("Created missing log directory: $log_path")
    fi
    # 4. Check for disk space issues
    # (df -P keeps each filesystem on one line, so the awk field index is stable)
    local root_usage
    root_usage=$(df -P / | awk 'NR==2 {print $5}' | tr -d '%')
    if [ "$root_usage" -gt 85 ]; then
        echo "Disk space critically low (${root_usage}%), cleaning up old logs"
        find "$log_path" -name "*.log" -mtime +7 -exec gzip {} \;
        find "$log_path" -name "*.gz" -mtime +30 -delete
        healing_actions+=("Compressed log files older than 7 days")
        healing_actions+=("Removed compressed logs older than 30 days")
    fi
    # 5. Verify service is running correctly
    if ! systemctl is-active "$service" &>/dev/null; then
        echo "Service $service is not running, attempting to start"
        systemctl start "$service"
        sleep 5
        if systemctl is-active "$service" &>/dev/null; then
            healing_actions+=("Started stopped service: $service")
        else
            echo "Failed to start service, checking for common issues"
            # 5a. Check for port conflicts
            local port new_port
            port=$(grep -oP 'port:\s*\K\d+' "$config_path/config.yaml" 2>/dev/null || echo "8080")
            if netstat -tuln | grep ":$port " &>/dev/null; then
                echo "Port conflict detected on $port"
                # Try to find an available port
                for new_port in $(seq $((port+1)) $((port+20))); do
                    if ! netstat -tuln | grep ":$new_port " &>/dev/null; then
                        echo "Updating configuration to use port $new_port"
                        sed -i "s/port: $port/port: $new_port/" "$config_path/config.yaml"
                        systemctl start "$service"
                        healing_actions+=("Resolved port conflict: changed port from $port to $new_port")
                        break  # stop at the first free port
                    fi
                done
            fi
            # 5b. Check for corrupted service file
            if [ ! -f "/etc/systemd/system/$service.service" ] || ! systemd-analyze verify "$service.service" &>/dev/null; then
                echo "Service unit file is missing or corrupted, reinstalling"
                systemctl daemon-reload
                if [ -f "/usr/lib/systemd/system/$service.service.bak" ]; then
                    cp "/usr/lib/systemd/system/$service.service.bak" "/etc/systemd/system/$service.service"
                    systemctl daemon-reload
                    systemctl start "$service"
                    healing_actions+=("Restored corrupted service unit file")
                fi
            fi
        fi
    fi
    # 6. Verify dependencies
    local dep action
    for dep in $(grep -oP 'depends-on:\s*\K.*' "$config_path/config.yaml" 2>/dev/null | tr ',' ' '); do
        if ! systemctl is-active "$dep" &>/dev/null; then
            echo "Dependency $dep is not running, starting it"
            systemctl start "$dep"
            healing_actions+=("Started required dependency: $dep")
        fi
    done
    # 7. Report healing actions
    if [ ${#healing_actions[@]} -eq 0 ]; then
        echo "No healing actions needed, system is healthy"
        return 0
    else
        echo "Self-healing completed with ${#healing_actions[@]} actions taken:"
        for action in "${healing_actions[@]}"; do
            echo " - $action"
        done
        # Log healing actions for postmortem analysis
        echo "[$(date "+%Y-%m-%d %H:%M:%S")] Self-healing actions for $service:" >> "/var/log/self-healing.log"
        for action in "${healing_actions[@]}"; do
            echo " - $action" >> "/var/log/self-healing.log"
        done
        # A non-zero status tells the caller that something had to be repaired
        return 1
    fi
}
# Example usage
verify_and_heal_system "nginx" "/etc/nginx" "/var/log/nginx" "/usr/sbin/nginx"

This function embodies the SRE philosophy:

“Automation is a force multiplier, not a panacea.” — Google SRE Book

By focusing on common failure modes and implementing automated remediation, this script reduces the manual operations work that the SRE book identifies as toil. The function doesn’t just check for problems—it actively fixes them, following the principle from “Unix Power Tools” that good scripts should be autonomous problem solvers.
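Because automated remediation changes production state, it can help to rehearse it first. The following is a hedged sketch of a dry-run guard, not part of the function above: `heal` and `DRY_RUN` are hypothetical names, and the default is the safe mode that only reports what would run.

```bash
#!/bin/bash
# Dry-run guard for healing actions (sketch; heal and DRY_RUN are hypothetical names).
DRY_RUN="${DRY_RUN:-1}"   # default to the safe mode: only report what would run

heal() {
    local description="$1"
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "[dry-run] $description: $*"
        return 0
    fi
    echo "[heal] $description"
    "$@"
}

# Example: with DRY_RUN=1 this only prints the command instead of running it.
heal "Create missing log directory" mkdir -p /tmp/example-service/logs
```

Routing every remediation step through one wrapper also gives a single place to add logging or rate limiting later.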

Building SRE-grade Monitoring with Bash

Google’s SRE practice emphasizes the “four golden signals” of monitoring:

  1. Latency
  2. Traffic
  3. Errors
  4. Saturation

“If you can only measure four metrics of your user-facing system, focus on these four.” — Google SRE Book

Here’s a Bash implementation that collects these critical metrics:


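#!/bin/bash
# Four Golden Signals Monitoring
# Based on Google SRE monitoring principles

# Configuration (example placeholder values; adjust for your environment)
SERVICE_NAME="${SERVICE_NAME:-api-service}"
LOG_FILE="${LOG_FILE:-/var/log/$SERVICE_NAME/access.log}"
ERROR_LOG="${ERROR_LOG:-/var/log/$SERVICE_NAME/error.log}"
METRICS_DIR="${METRICS_DIR:-/var/lib/metrics/$SERVICE_NAME}"
INTERVAL=60  # seconds
ALERT_THRESHOLD_LATENCY=500      # milliseconds, compared against the 95th percentile
ALERT_THRESHOLD_ERROR_RATE=0.01  # fraction of requests (1%)
ALERT_THRESHOLD_SATURATION=0.8   # fraction of capacity (80%)
RETENTION_DAYS=7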

function collect_golden_signals {
    echo "Collecting the four golden signals for $SERVICE_NAME"
    # Ensure metrics directory exists
    mkdir -p "$METRICS_DIR"
    # Get current timestamp
    local timestamp
    timestamp=$(date +%s)
    # 1. Measure Latency
    # Extract response times from logs, calculate 50th, 95th, and 99th percentiles
    if [ -f "$LOG_FILE" ]; then
        echo "Measuring latency from logs..."
        # For this example, assuming log format includes response_time=XXXms
        local all_latencies
        all_latencies=$(grep -oP 'response_time=\K\d+(?=ms)' "$LOG_FILE" | sort -n)
        local count=0
        local total=0
        local latency_avg=0
        local latency_50p=0
        local latency_95p=0
        local latency_99p=0
        if [ -n "$all_latencies" ]; then
            count=$(echo "$all_latencies" | wc -l)
            if [ "$count" -gt 0 ]; then
                total=$(echo "$all_latencies" | paste -sd+ | bc)
                latency_avg=$(echo "scale=2; $total / $count" | bc)
                # Calculate percentiles with the nearest-rank method. The rank is
                # rounded up so that small samples never select line 0.
                local line_50p=$(( (count * 50 + 99) / 100 ))
                local line_95p=$(( (count * 95 + 99) / 100 ))
                local line_99p=$(( (count * 99 + 99) / 100 ))
                latency_50p=$(echo "$all_latencies" | sed -n "${line_50p}p")
                latency_95p=$(echo "$all_latencies" | sed -n "${line_95p}p")
                latency_99p=$(echo "$all_latencies" | sed -n "${line_99p}p")
            fi
        fi
        echo "$timestamp latency_avg $latency_avg" >> "$METRICS_DIR/latency.metrics"
        echo "$timestamp latency_50p $latency_50p" >> "$METRICS_DIR/latency.metrics"
        echo "$timestamp latency_95p $latency_95p" >> "$METRICS_DIR/latency.metrics"
        echo "$timestamp latency_99p $latency_99p" >> "$METRICS_DIR/latency.metrics"
        # Check for latency SLO violation
        if [ "$latency_95p" -gt "$ALERT_THRESHOLD_LATENCY" ]; then
            echo "ALERT: Latency (95th percentile) is ${latency_95p}ms, exceeding threshold of ${ALERT_THRESHOLD_LATENCY}ms"
            # Here you would trigger your alerting system
            # send_alert "latency" "$SERVICE_NAME" "$latency_95p" "$ALERT_THRESHOLD_LATENCY"
        fi
    fi
    # 2. Measure Traffic
    if [ -f "$LOG_FILE" ]; then
        echo "Measuring traffic from logs..."
        # Count requests in the interval
        local start_time=$((timestamp - INTERVAL))
        local request_count
        # Assuming timestamp is in logs as [YYYY-MM-DD HH:MM:SS]. Matching on the
        # minute prefix counts every request logged in that minute instead of a
        # single second (this relies on INTERVAL being 60 seconds).
        local start_time_fmt
        start_time_fmt=$(date -d "@$start_time" "+%Y-%m-%d %H:%M")
        request_count=$(grep -c "\[$start_time_fmt" "$LOG_FILE")
        echo "$timestamp requests_per_minute $request_count" >> "$METRICS_DIR/traffic.metrics"
        # Calculate requests per second
        local rps
        rps=$(echo "scale=2; $request_count / $INTERVAL" | bc)
        echo "$timestamp requests_per_second $rps" >> "$METRICS_DIR/traffic.metrics"
    fi
    # 3. Measure Errors
    if [ -f "$LOG_FILE" ] && [ -f "$ERROR_LOG" ]; then
        echo "Measuring error rate..."
        local start_time=$((timestamp - INTERVAL))
        # Match on the minute prefix so the whole previous minute is counted
        local start_time_fmt
        start_time_fmt=$(date -d "@$start_time" "+%Y-%m-%d %H:%M")
        local request_count
        request_count=$(grep -c "\[$start_time_fmt" "$LOG_FILE")
        local error_count
        error_count=$(grep -c "\[$start_time_fmt" "$ERROR_LOG")
        local error_rate=0
        if [ "$request_count" -gt 0 ]; then
            error_rate=$(echo "scale=4; $error_count / $request_count" | bc)
        fi
        echo "$timestamp error_count $error_count" >> "$METRICS_DIR/errors.metrics"
        echo "$timestamp error_rate $error_rate" >> "$METRICS_DIR/errors.metrics"
        # Check for error rate SLO violation
        if [ "$(echo "$error_rate > $ALERT_THRESHOLD_ERROR_RATE" | bc)" -eq 1 ]; then
            local error_percent threshold_percent
            error_percent=$(echo "scale=2; 100 * $error_rate" | bc)
            # The threshold is fractional, so use bc rather than shell integer arithmetic
            threshold_percent=$(echo "100 * $ALERT_THRESHOLD_ERROR_RATE" | bc)
            echo "ALERT: Error rate is ${error_percent}%, exceeding threshold of ${threshold_percent}%"
            # Here you would trigger your alerting system
            # send_alert "error_rate" "$SERVICE_NAME" "$error_rate" "$ALERT_THRESHOLD_ERROR_RATE"
        fi
    fi
    # 4. Measure Saturation
    echo "Measuring system saturation..."
    # CPU saturation
    local cpu_usage cpu_saturation
    cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2 + $4}')
    cpu_saturation=$(echo "scale=4; $cpu_usage / 100" | bc)
    # Memory saturation
    local mem_usage mem_saturation
    mem_usage=$(free | grep Mem | awk '{print $3/$2 * 100}')
    mem_saturation=$(echo "scale=4; $mem_usage / 100" | bc)
    # Disk I/O saturation (using iostat if available). The line after the avg-cpu
    # header holds %user %nice %system %iowait %steal %idle, so field 4 (%iowait)
    # is used here as a rough proxy for I/O pressure.
    local disk_saturation=0
    if command -v iostat &>/dev/null; then
        disk_saturation=$(iostat -x | grep -A 1 avg-cpu | tail -1 | awk '{print $4}')
        disk_saturation=$(echo "scale=4; $disk_saturation / 100" | bc)
    fi
    echo "$timestamp cpu_saturation $cpu_saturation" >> "$METRICS_DIR/saturation.metrics"
    echo "$timestamp mem_saturation $mem_saturation" >> "$METRICS_DIR/saturation.metrics"
    echo "$timestamp disk_saturation $disk_saturation" >> "$METRICS_DIR/saturation.metrics"
    # Check for saturation SLO violations
    # The threshold is fractional, so convert it with bc rather than $(( ))
    local saturation_threshold_percent
    saturation_threshold_percent=$(echo "100 * $ALERT_THRESHOLD_SATURATION" | bc)
    if [ "$(echo "$cpu_saturation > $ALERT_THRESHOLD_SATURATION" | bc)" -eq 1 ]; then
        local cpu_percent
        cpu_percent=$(echo "scale=0; (100 * $cpu_saturation) / 1" | bc)
        echo "ALERT: CPU saturation is at ${cpu_percent}%, exceeding threshold of ${saturation_threshold_percent}%"
        # Here you would trigger your alerting system
        # send_alert "cpu_saturation" "$SERVICE_NAME" "$cpu_saturation" "$ALERT_THRESHOLD_SATURATION"
    fi
    if [ "$(echo "$mem_saturation > $ALERT_THRESHOLD_SATURATION" | bc)" -eq 1 ]; then
        local mem_percent
        mem_percent=$(echo "scale=0; (100 * $mem_saturation) / 1" | bc)
        echo "ALERT: Memory saturation is at ${mem_percent}%, exceeding threshold of ${saturation_threshold_percent}%"
        # Here you would trigger your alerting system
        # send_alert "mem_saturation" "$SERVICE_NAME" "$mem_saturation" "$ALERT_THRESHOLD_SATURATION"
    fi
    # Clean up old metrics files
    find "$METRICS_DIR" -name "*.metrics" -mtime +"$RETENTION_DAYS" -delete
    echo "Completed collection of golden signals for $SERVICE_NAME"
}

# Example usage (run once per interval, for example from cron)
collect_golden_signals

This script embodies the monitoring philosophy described in the SRE book:

“Monitoring is one of the primary means by which service owners keep track of a system’s health and availability.” — Google SRE Book

By focusing on the four golden signals, this Bash function helps identify problems that directly impact users, rather than collecting metrics that might be interesting but aren’t actionable. As Robert Love notes in “Linux System Programming”, effective monitoring focuses on symptoms, not causes.
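Percentile math in integer shell arithmetic is easy to get subtly wrong for small samples. As a hedged sketch, the nearest-rank method can be isolated in a small helper that reads one number per line from standard input. `percentile` is a hypothetical name of my own and the sample values are made up for illustration.

```bash
#!/bin/bash
# Nearest-rank percentile over numbers read from stdin (one per line).
# percentile is a hypothetical helper for illustration.
percentile() {
    local p="$1"   # integer percentile in the range 1-100
    sort -n | awk -v p="$p" '
        { values[NR] = $1 }
        END {
            if (NR == 0) { print 0; exit }
            rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
            if (rank < 1) rank = 1
            print values[rank]
        }'
}

# Five sample latencies in milliseconds; the 95th percentile is the slowest one.
printf '%s\n' 120 80 95 300 110 | percentile 95   # prints 300
```

With only five samples, the 95th percentile is simply the maximum, which is a useful reminder that tail percentiles need a reasonable sample size before they are worth alerting on.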

Implementing Graceful Degradation in Bash Scripts

A key SRE principle is designing systems to degrade gracefully under load or when dependencies fail. As the SRE book states:

“The worst case scenario for a system under load is cascading failure.” — Google SRE Book

Here’s a Bash function that implements cascading dependency checks with graceful degradation:


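#!/bin/bash
# Graceful Degradation Handler
# Based on Google SRE principles for managing system dependencies

# Configuration (example placeholder paths; adjust for your environment)
CONFIG_FILE="${CONFIG_FILE:-/etc/services/dependencies.json}"
LOG_FILE="${LOG_FILE:-/var/log/degradation.log}"

function check_dependency_health {
    local service="$1"
    local endpoint="${2:-http://localhost:8080/health}"
    local timeout="${3:-5}"
    # Try to get health status with timeout
    if curl -s -f -m "$timeout" "$endpoint" &>/dev/null; then
        echo "UP"
        return 0
    else
        echo "DOWN"
        return 1
    fi
}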

function graceful_degradation {
    local service_name="$1"
    local config_file="${2:-$CONFIG_FILE}"
    echo "Checking dependencies for $service_name with graceful degradation"
    # Define degradation levels
    local degradation_level="NORMAL"
    local feature_flags=()
    # Load service configuration
    if [ ! -f "$config_file" ]; then
        echo "ERROR: Config file not found: $config_file"
        return 1
    fi
    # Extract dependencies from config
    local dependencies
    dependencies=$(jq -r ".services.\"$service_name\".dependencies[]" "$config_file" 2>/dev/null)
    if [ -z "$dependencies" ]; then
        echo "No dependencies found for $service_name, operating normally"
        return 0
    fi
    echo "Checking critical dependencies..."
    # Check critical dependencies
    local critical_deps
    critical_deps=$(jq -r ".services.\"$service_name\".dependencies[] | select(.critical == true) | .name" "$config_file")
    local critical_failures=0
    for dep in $critical_deps; do
        local endpoint
        endpoint=$(jq -r ".services.\"$dep\".healthEndpoint" "$config_file")
        echo -n "Checking critical dependency $dep: "
        local status
        status=$(check_dependency_health "$dep" "$endpoint")
        echo "$status"
        if [ "$status" == "DOWN" ]; then
            critical_failures=$((critical_failures + 1))
            echo "WARNING: Critical dependency $dep is DOWN" | tee -a "$LOG_FILE"
        fi
    done
    # Check non-critical dependencies
    echo "Checking non-critical dependencies..."
    local noncritical_deps
    noncritical_deps=$(jq -r ".services.\"$service_name\".dependencies[] | select(.critical == false) | .name" "$config_file")
    local noncritical_failures=0
    for dep in $noncritical_deps; do
        local endpoint
        endpoint=$(jq -r ".services.\"$dep\".healthEndpoint" "$config_file")
        echo -n "Checking non-critical dependency $dep: "
        local status
        status=$(check_dependency_health "$dep" "$endpoint")
        echo "$status"
        if [ "$status" == "DOWN" ]; then
            noncritical_failures=$((noncritical_failures + 1))
            # Add appropriate feature flags to disable
            local flags
            flags=$(jq -r ".services.\"$service_name\".dependencies[] | select(.name == \"$dep\") | .features[]" "$config_file")
            for flag in $flags; do
                feature_flags+=("$flag")
                echo "Disabling feature: $flag due to dependency $dep failure" | tee -a "$LOG_FILE"
            done
        fi
    done
    # Determine degradation level
    if [ "$critical_failures" -gt 0 ]; then
        degradation_level="SEVERE"
        echo "ALERT: Service $service_name in SEVERE degradation mode due to critical dependency failures" | tee -a "$LOG_FILE"
    elif [ "$noncritical_failures" -gt 0 ]; then
        degradation_level="DEGRADED"
        echo "WARNING: Service $service_name in DEGRADED mode, disabling ${#feature_flags[@]} features" | tee -a "$LOG_FILE"
    else
        echo "Service $service_name operating in NORMAL mode, all dependencies healthy" | tee -a "$LOG_FILE"
    fi
    # Build the JSON list of disabled features once (stays empty when no flags were set)
    local disabled_json=""
    if [ "${#feature_flags[@]}" -gt 0 ]; then
        disabled_json=$(printf '"%s",' "${feature_flags[@]}" | sed 's/,$//')
    fi
    # Update service configuration with degradation status
    if [ "$degradation_level" != "NORMAL" ]; then
        echo "Updating service configuration for degraded operation..."
        # Write degradation state file
        cat > "/var/run/$service_name.degraded" << EOF
{
  "degradationLevel": "$degradation_level",
  "timestamp": "$(date -u +"%Y-%m-%dT%H:%M:%SZ")",
  "disabledFeatures": [$disabled_json]
}
EOF

        # Reload service to apply degradation changes
        if [ -x "/usr/bin/systemctl" ]; then
            echo "Reloading service to apply degraded mode..."
            systemctl reload "$service_name" || systemctl restart "$service_name"
        fi
        # Report degradation to monitoring system
        echo "Reporting degradation status to monitoring..."
        if [ -x "/usr/bin/curl" ]; then
            curl -s -XPOST "http://monitoring.internal:9090/v1/events" -d "{
              \"service\": \"$service_name\",
              \"degradation_level\": \"$degradation_level\",
              \"disabled_features\": [$disabled_json]
            }" || true
        fi
    else
        # If previously degraded, recover
        if [ -f "/var/run/$service_name.degraded" ]; then
            echo "Recovering from previous degradation..."
            rm "/var/run/$service_name.degraded"
            # Reload service to apply normal operation
            if [ -x "/usr/bin/systemctl" ]; then
                systemctl reload "$service_name" || true
            fi
            # Report recovery to monitoring system
            if [ -x "/usr/bin/curl" ]; then
                curl -s -XPOST "http://monitoring.internal:9090/v1/events" -d "{
                  \"service\": \"$service_name\",
                  \"degradation_level\": \"NORMAL\",
                  \"recovery\": true
                }" || true
            fi
        fi
    fi
    # Return status based on degradation level
    case "$degradation_level" in
        NORMAL)
            return 0
            ;;
        DEGRADED)
            return 10
            ;;
        SEVERE)
            return 20
            ;;
    esac
}
# Example usage
graceful_degradation "payment-service"

This function implements the SRE principle:

“The user experience is almost always improved by a quick failure rather than a slow response.” — Google SRE Book

By detecting which dependencies are down and adjusting the service behavior accordingly, this script enables graceful degradation rather than complete failure. According to principles in “Shell Scripting” guides, this kind of defensive programming is critical for systems that need to maintain availability even when components fail.
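The "quick failure" idea from the quote can also be applied to individual calls. The sketch below is an assumption-laden illustration, not part of the function above: `with_fallback` is a hypothetical helper, the fallback string is invented, and it relies on `timeout` from GNU coreutils.

```bash
#!/bin/bash
# Fail fast with a fallback value (sketch; with_fallback is a hypothetical helper).
# Relies on timeout(1) from GNU coreutils.
with_fallback() {
    local max_seconds="$1"
    local fallback="$2"
    shift 2
    local output
    if output=$(timeout "$max_seconds" "$@" 2>/dev/null); then
        echo "$output"
    else
        # The command failed or exceeded the time limit: degrade instead of hanging.
        echo "$fallback"
        return 1
    fi
}

# Example: a slow dependency (simulated with sleep) is cut off after 1 second.
with_fallback 1 "recommendations-unavailable" sleep 5 || echo "Served degraded response"
```

The non-zero return status lets the caller record that a degraded answer was served, while the user still gets a response within the time limit.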

Blameless Postmortem Automation with Bash

Google’s SRE practice emphasizes learning from failures through blameless postmortems. As the SRE book notes:

“A blamelessly written postmortem assumes that everyone involved in an incident had good intentions.” — Google SRE Book

Here’s a Bash function that helps automate the collection of data for postmortems:


#!/bin/bash
# Incident Data Collection for Blameless Postmortems
# Based on Google SRE principles for learning from failures

function collect_postmortem_data {
    local incident_id="$1"
    local service="$2"
    local start_time="$3"  # Format: YYYY-MM-DD HH:MM:SS
    local end_time="${4:-$(date '+%Y-%m-%d %H:%M:%S')}"
    local output_dir="${5:-/var/log/incidents/$incident_id}"
    echo "Collecting data for incident #$incident_id ($service) from $start_time to $end_time"
    # Create output directory
    mkdir -p "$output_dir"
    # Convert times to Unix timestamps (GNU date syntax)
    local start_timestamp
    start_timestamp=$(date -d "$start_time" +%s)
    local end_timestamp
    end_timestamp=$(date -d "$end_time" +%s)
    # Record basic incident information
    cat > "$output_dir/incident_info.txt" << EOF
Incident ID: $incident_id
Service: $service
Start Time: $start_time
End Time: $end_time
Duration: $(( (end_timestamp - start_timestamp) / 60 )) minutes
Data Collection Time: $(date '+%Y-%m-%d %H:%M:%S')
EOF

    # Collect system information
    echo "Collecting system information..."
    {
        echo "=== System Information ==="
        echo "Hostname: $(hostname)"
        echo "Kernel Version: $(uname -r)"
        echo "OS Info: $(grep PRETTY_NAME /etc/os-release | cut -d= -f2)"
        echo ""
        echo "=== Service Status ==="
        systemctl status "$service" 2>/dev/null || echo "Service status not available"
        echo ""
        echo "=== CPU Info ==="
        lscpu 2>/dev/null || echo "CPU information not available"
        echo ""
        echo "=== Memory Information ==="
        free -h
        echo ""
        echo "=== Disk Usage ==="
        df -h
        echo ""
    } > "$output_dir/system_info.txt"
    # Collect service logs
    echo "Collecting service logs..."
    {
        echo "=== Service Logs ($start_time to $end_time) ==="
        # Prefer journald. The fallback for plain log files is a coarse filter: it
        # keeps the logs only if both boundary timestamps appear in them, and it
        # does not extract an exact time range.
        journalctl -u "$service" --since "$start_time" --until "$end_time" 2>/dev/null ||
            cat "/var/log/$service/"*".log" 2>/dev/null |
            grep -A1000000 -B1000000 "$(date -d "$start_time" '+%b %d %H:%M:%S')" |
            grep -A1000000 -B1000000 "$(date -d "$end_time" '+%b %d %H:%M:%S')" ||
            echo "No logs found for the service during the incident timeframe"
    } > "$output_dir/service_logs.txt"
    # Collect metrics data
    echo "Collecting metrics data..."
    if [ -d "/var/lib/metrics/$service" ]; then
        cp -r "/var/lib/metrics/$service" "$output_dir/metrics"
        # Generate metric summaries for the incident period
        # (each metrics line is expected to be: <unix timestamp> <name> <value>)
        for metric_file in "$output_dir/metrics"/*.metrics; do
            if [ -f "$metric_file" ]; then
                metric_name=$(basename "$metric_file" .metrics)
                echo "Summarizing $metric_name data..."
                {
                    echo "=== $metric_name Summary ==="
                    grep -v '^#' "$metric_file" |
                        awk -v start="$start_timestamp" -v end="$end_timestamp" '$1 >= start && $1 <= end' |
                        awk '{sum+=$3; count++} END {if(count>0) print "Average: " sum/count; print "Count: " count+0}'
                    # Extract min/max values
                    echo "Min: $(grep -v '^#' "$metric_file" |
                        awk -v start="$start_timestamp" -v end="$end_timestamp" '$1 >= start && $1 <= end' |
                        sort -k3,3n | head -1 | awk '{print $3}')"
                    echo "Max: $(grep -v '^#' "$metric_file" |
                        awk -v start="$start_timestamp" -v end="$end_timestamp" '$1 >= start && $1 <= end' |
                        sort -k3,3n | tail -1 | awk '{print $3}')"
                } > "$output_dir/metrics_summary_$metric_name.txt"
            fi
        done
    else
        echo "No metrics data found for $service" > "$output_dir/metrics_summary.txt"
    fi
    # Collect deployment information
    echo "Collecting deployment information..."
    {
        echo "=== Recent Deployments ==="
        if [ -f "/var/log/deployments/$service.log" ]; then
            grep -B 10 -A 2 "$(date -d "$start_time" '+%Y-%m-%d')" "/var/log/deployments/$service.log" ||
                echo "No recent deployments found in logs"
        else
            echo "No deployment logs found"
        fi
        # Check git history if applicable
        if [ -d "/opt/$service/.git" ]; then
            echo ""
            echo "=== Recent Code Changes ==="
            # git -C avoids changing the working directory for the rest of the function
            git -C "/opt/$service" log --since="$(date -d "$start_time 7 days ago" '+%Y-%m-%d')" --until="$(date -d "$end_time" '+%Y-%m-%d')" --pretty=format:"%h - %an, %ar : %s"
        fi
    } > "$output_dir/deployment_info.txt"
    # Collect configuration
    echo "Collecting configuration files..."
    {
        echo "=== Configuration Files ==="
        if [ -d "/etc/$service" ]; then
            echo "Configuration directory contains:"
            # The parentheses make -type f apply to every -name alternative
            find "/etc/$service" -type f \( -name "*.conf" -o -name "*.yaml" -o -name "*.json" -o -name "*.ini" \) | sort
            echo ""
            find "/etc/$service" -type f \( -name "*.conf" -o -name "*.yaml" -o -name "*.json" -o -name "*.ini" \) | sort |
            while IFS= read -r conf_file; do
                echo "=== Contents of $(basename "$conf_file") ==="
                # Redact potential secrets
                sed -E 's/(password|secret|key|token|credential)([":= ]+)([^",:= ]+)/\1\2[REDACTED]/gi' "$conf_file"
                echo ""
            done
        else
            echo "No configuration directory found at /etc/$service"
        fi
    } > "$output_dir/configuration.txt"
    # Collect network information
    echo "Collecting network information..."
    {
        echo "=== Network Information ==="
        echo "Open ports:"
        netstat -tulpn | grep -i "$service" || echo "No ports found for service"
        echo ""
        echo "Network connections:"
        netstat -tan | grep -v LISTEN | head -20
        echo ""
        echo "DNS resolution:"
        host "$(hostname)" || echo "Hostname resolution failed"
    } > "$output_dir/network_info.txt"
    # Create a timeline of events
    echo "Creating incident timeline..."
    {
        echo "=== Incident Timeline ==="
        echo "Reconstructed from logs and metrics. Times are approximate."
        echo ""
        # Extract significant events from logs
        grep -i "error\|warn\|fail\|exception\|timeout" "$output_dir/service_logs.txt" |
        grep -v "DEBUG" |
        head -50 |
        while IFS= read -r line; do
            # Extract timestamp and message (syslog-style first, then ISO-style)
            timestamp=$(echo "$line" | grep -oE '^[A-Z][a-z]{2} [0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2}' ||
                       echo "$line" | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' | head -n 1)
            if [ -n "$timestamp" ]; then
                # Remove the timestamp literally; no regex metacharacters to escape
                message=${line/"$timestamp"/}
                echo "[$timestamp] $message"
            fi
        done | sort
        echo ""
        echo "=== Metric Anomalies ==="
        # Look for metric spikes or drops
        for metric_file in "$output_dir/metrics"/*.metrics; do
            if [ -f "$metric_file" ]; then
                metric_name=$(basename "$metric_file" .metrics)
                # Calculate the baseline average over the hour before the incident
                avg=$(grep -v '^#' "$metric_file" |
                     awk -v start="$((start_timestamp-3600))" -v end="$((start_timestamp-60))" '$1 >= start && $1 <= end' |
                     awk '{sum+=$3; count++} END {if(count>0) print sum/count; else print 0}')
                # Find values that deviate significantly during incident
                grep -v '^#' "$metric_file" |
                awk -v start="$start_timestamp" -v end="$end_timestamp" -v avg="$avg" '$1 >= start && $1 <= end && ($3 > avg*2 || $3 < avg*0.5)' |
                # The value is the third field, matching the awk filters above
                while read -r ts _ value _; do
                    event_time=$(date -d "@$ts" '+%Y-%m-%d %H:%M:%S')
                    echo "[$event_time] Anomaly in $metric_name: value=$value (baseline≈$avg)"
                done
            fi
        done | sort | head -20
    } > "$output_dir/timeline.txt"
    # Create a postmortem template
    echo "Creating postmortem template..."
    cat > "$output_dir/" << EOF
# Incident Postmortem: #$incident_id - $service

## Summary
*Brief summary of the incident. What happened? What was the impact?*

## Timeline
*Key events during the incident. Use the timeline.txt file as a reference.*

## Root Cause
*What was the underlying cause of the incident?*

## Impact
*What was the impact to users, systems, and the business?*

## Detection
*How was the incident detected? Was monitoring sufficient?*

## Resolution
*How was the incident resolved?*

## What Went Well
*What aspects of the incident response worked effectively?*

## What Went Wrong
*What could have been improved in our response?*

## Action Items
*Specific, actionable items to prevent recurrence or improve response*

1. [ ] *Action item 1*
2. [ ] *Action item 2*
3. [ ] *Action item 3*

## Lessons Learned
*What broader lessons can we take from this incident?*

EOF
    echo "Postmortem data collection complete. All data saved to $output_dir"
    echo "To start the postmortem process, edit the template at $output_dir/"
    return 0
}

# Example usage
collect_postmortem_data "INC-2023-42" "payment-api" "2023-08-25 14:30:00" "2023-08-25 15:45:00"

This function implements the SRE principle:

“Postmortems should be written for all major incidents. A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.” — Google SRE Book

By automating the collection of incident data, this script makes it easier to focus on learning from failures rather than blaming individuals. As Brian Ward emphasizes in “How Linux Works”, understanding system behavior during failures is critical for building more reliable systems.
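
Because collected configuration files often end up attached to a postmortem document, the redaction step is worth checking in isolation before you rely on it. The following is a minimal sketch that wraps the same `sed` expression in a function. The name `redact_secrets` and the sample input are hypothetical, and the sketch assumes GNU sed, since the case-insensitive flag on the `s` command is a GNU extension.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original script): mask the value
# that follows any secret-looking key. Assumes GNU sed for the "I" flag.
redact_secrets() {
    sed -E 's/(password|secret|key|token|credential)([":= ]+)([^",:= ]+)/\1\2[REDACTED]/gI'
}

# Made-up sample input: two secrets and one harmless setting.
printf '%s\n' \
    'db_password=hunter2' \
    'api_token: abc123' \
    'port=8080' | redact_secrets
```

The pattern is deliberately broad, so it will also mask harmless values whose key merely contains a word such as `key`. For postmortem data, over-redacting is usually the safer failure mode.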

SRE-Aligned Capacity Planning with Bash

Google’s SRE practice includes principles for capacity planning and load testing. As the SRE book notes:

“Efficient capacity planning is a key component in maintaining system reliability while controlling costs.” — Google SRE Book

Here’s a Bash function that implements capacity forecasting:


# Capacity Planning Helper
# Based on Google SRE principles for forecasting resource needs

function forecast_capacity {
    local service="$1"
    local metrics_dir="${2:-/var/lib/metrics/$service}"
    local forecast_days="${3:-30}"
    local output_file="${4:-/var/lib/capacity/$service/forecast.json}"
    echo "Generating capacity forecast for $service for the next $forecast_days days"
    # Ensure output directory exists
    mkdir -p "$(dirname "$output_file")"
    # Check that we have enough historical data
    if [ ! -d "$metrics_dir" ]; then
        echo "ERROR: Metrics directory not found: $metrics_dir"
        return 1
    fi
    # Check if we have required metrics files
    required_metrics=("cpu_usage.metrics" "memory_usage.metrics" "disk_usage.metrics" "requests.metrics")
    for metric in "${required_metrics[@]}"; do
        if [ ! -f "$metrics_dir/$metric" ]; then
            echo "WARNING: Missing required metric file: $metrics_dir/$metric"
        fi
    done
    echo "Analyzing historical usage patterns..."
    # Calculate growth rates and forecast future usage
    declare -A growth_rates
    declare -A current_usage
    declare -A max_usage
    declare -A forecasted_usage
    # Process each metric file
    for metric_file in "$metrics_dir"/*.metrics; do
        if [ -f "$metric_file" ]; then
            metric_name=$(basename "$metric_file" .metrics)
            echo "Processing $metric_name metrics..."
            # Get current usage (average of last day)
            current=$(grep -v '^#' "$metric_file" |
                      tail -n 288 | # Assuming 5-minute intervals = 288 datapoints per day
                      awk '{sum+=$3; count++} END {if (count > 0) print sum/count; else print 0}')
            # Get historical max
            max=$(grep -v '^#' "$metric_file" |
                  awk '{print $3}' |
                  sort -rn |
                  head -1)
            # Calculate growth rate, assuming the file covers roughly the last 30 days
            # Get the average of the oldest day in the file
            past=$(grep -v '^#' "$metric_file" |
                   head -n 288 |
                   awk '{sum+=$3; count++} END {if (count > 0) print sum/count; else print 0}')
            # Avoid division by zero and the logarithm of zero
            if (( $(echo "$past > 0 && $current > 0" | bc -l) )); then
                # Calculate daily growth rate as the 30th root of current/past.
                # bc's ^ operator only accepts integer exponents, so use e() and l().
                growth=$(echo "scale=6; e(l($current / $past) / 30) - 1" | bc -l)
            else
                growth=0
            fi
            # Store values
            current_usage["$metric_name"]=$(printf "%.2f" "$current")
            max_usage["$metric_name"]=$(printf "%.2f" "$max")
            growth_rates["$metric_name"]=$(printf "%.4f" "$growth")
            # Calculate forecasted value
            forecast=$(echo "scale=2; $current * (1 + $growth) ^ $forecast_days" | bc -l)
            forecasted_usage["$metric_name"]=$(printf "%.2f" "$forecast")
        fi
    done
    # Generate recommendations based on forecasts
    recommendations=()
    # Check CPU forecast
    if [ -n "${forecasted_usage[cpu_usage]}" ]; then
        forecast_cpu="${forecasted_usage[cpu_usage]}"
        if (( $(echo "$forecast_cpu > 85" | bc -l) )); then
            recommendations+=("CPU capacity will reach ${forecast_cpu}% in $forecast_days days, which exceeds the 85% threshold. Consider scaling up CPU resources.")
        elif (( $(echo "$forecast_cpu > 70" | bc -l) )); then
            recommendations+=("CPU capacity will reach ${forecast_cpu}% in $forecast_days days. Monitor closely and prepare for potential scaling.")
        fi
    fi
    # Check Memory forecast
    if [ -n "${forecasted_usage[memory_usage]}" ]; then
        forecast_mem="${forecasted_usage[memory_usage]}"
        if (( $(echo "$forecast_mem > 90" | bc -l) )); then
            recommendations+=("Memory usage will reach ${forecast_mem}% in $forecast_days days, which exceeds the 90% threshold. Increase memory capacity.")
        elif (( $(echo "$forecast_mem > 75" | bc -l) )); then
            recommendations+=("Memory usage will reach ${forecast_mem}% in $forecast_days days. Plan for memory expansion soon.")
        fi
    fi
    # Check Disk forecast
    if [ -n "${forecasted_usage[disk_usage]}" ]; then
        forecast_disk="${forecasted_usage[disk_usage]}"
        if (( $(echo "$forecast_disk > 85" | bc -l) )); then
            recommendations+=("Disk usage will reach ${forecast_disk}% in $forecast_days days, which exceeds the 85% threshold. Increase storage capacity.")
        elif (( $(echo "$forecast_disk > 70" | bc -l) )); then
            recommendations+=("Disk usage will reach ${forecast_disk}% in $forecast_days days. Plan for storage expansion.")
        fi
    fi
    # Check Request forecast
    if [ -n "${forecasted_usage[requests]}" ]; then
        growth_pct=$(echo "scale=2; 100 * ${growth_rates[requests]}" | bc -l)
        requests_per_day=$(grep -v '^#' "$metrics_dir/requests.metrics" |
                          tail -n 288 |
                          awk '{sum+=$3} END {print sum}')
        # Dividing by 1 makes bc apply scale=0, so the result is a whole number
        forecasted_daily_requests=$(echo "scale=0; ($requests_per_day * (1 + ${growth_rates[requests]}) ^ $forecast_days) / 1" | bc -l)
        if (( $(echo "$growth_pct > 5" | bc -l) )); then
            recommendations+=("Request volume is growing at ${growth_pct}% daily. Forecast of ${forecasted_daily_requests} daily requests in $forecast_days days. Review scaling strategy.")
        fi
    fi
    # Generate the forecast report
    echo "Writing capacity forecast to $output_file..."
    # Format JSON with recommendations as array
    # (the recommendation strings above contain no characters that need JSON escaping)
    rec_json=""
    for i in "${!recommendations[@]}"; do
        if [ "$i" -gt 0 ]; then
            rec_json+=", "
        fi
        rec_json+="\"${recommendations[$i]}\""
    done
    # Create JSON output
    cat > "$output_file" << EOF
{
  "service": "$service",
  "forecast_date": "$(date '+%Y-%m-%d')",
  "forecast_period_days": $forecast_days,
  "current_usage": {
$(for metric in "${!current_usage[@]}"; do echo "    \"$metric\": ${current_usage[$metric]},"; done | sed '$s/,$//')
  },
  "growth_rates_daily": {
$(for metric in "${!growth_rates[@]}"; do echo "    \"$metric\": ${growth_rates[$metric]},"; done | sed '$s/,$//')
  },
  "forecasted_usage": {
$(for metric in "${!forecasted_usage[@]}"; do echo "    \"$metric\": ${forecasted_usage[$metric]},"; done | sed '$s/,$//')
  },
  "recommendations": [
    $rec_json
  ]
}
EOF

    # Generate a human-readable summary
    echo "Capacity Forecast Summary for $service:"
    echo "======================================"
    echo "Forecast period: $forecast_days days"
    echo ""
    echo "Current Usage:"
    for metric in "${!current_usage[@]}"; do
        echo "  $metric: ${current_usage[$metric]}"
    done
    echo ""
    echo "Daily Growth Rates:"
    for metric in "${!growth_rates[@]}"; do
        growth_pct=$(echo "scale=4; 100 * ${growth_rates[$metric]}" | bc -l)
        echo "  $metric: ${growth_pct}% per day"
    done
    echo ""
    echo "Forecasted Usage (in $forecast_days days):"
    for metric in "${!forecasted_usage[@]}"; do
        echo "  $metric: ${forecasted_usage[$metric]}"
    done
    echo ""
    echo "Recommendations:"
    for rec in "${recommendations[@]}"; do
        echo "  * $rec"
    done
    return 0
}

# Example usage
forecast_capacity "web-service" "/var/lib/metrics/web-service" 60

This function implements the capacity planning principles from the SRE book:

“Capacity planning should take future usage into account, not just current usage.” — Google SRE Book

By analyzing growth trends and making forecasts, this script helps you anticipate capacity needs before they become critical. Robert Love’s “Linux System Programming” emphasizes that proactive capacity management is essential for maintaining system reliability under increasing load.
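
The forecast rests on simple compound-growth arithmetic: the daily rate is the N-th root of current/past minus one, and the projection multiplies current usage by (1 + rate) raised to the number of forecast days. Here is a minimal sketch of just that arithmetic, assuming nothing beyond a POSIX `awk`. The helper names `daily_growth_rate` and `forecast_value` and the sample numbers are hypothetical. The sketch uses `awk` rather than `bc` only to keep it self-contained; the root is computed with `exp` and `log` because `bc`'s `^` operator accepts integer exponents only.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the compound-growth arithmetic.

# daily_growth_rate PAST CURRENT DAYS -> fractional daily growth rate
daily_growth_rate() {
    awk -v past="$1" -v cur="$2" -v days="$3" 'BEGIN {
        # Without a positive baseline there is no meaningful rate; report zero growth
        if (past <= 0 || cur <= 0 || days <= 0) { print "0.000000"; exit }
        printf "%.6f\n", exp(log(cur / past) / days) - 1
    }'
}

# forecast_value CURRENT RATE DAYS -> projected value after DAYS (assumes RATE > -1)
forecast_value() {
    awk -v cur="$1" -v rate="$2" -v days="$3" 'BEGIN {
        printf "%.2f\n", cur * exp(days * log(1 + rate))
    }'
}

# Made-up example: CPU usage rose from 50% to 60% over 30 days
rate=$(daily_growth_rate 50 60 30)
echo "Daily growth rate: $rate"
echo "Projected usage in 60 days: $(forecast_value 60 "$rate" 60)"
```

Compound growth extrapolates aggressively, so treat long-horizon results as a prompt to investigate, not as a prediction. A percentage metric forecast above 100% simply means the resource is expected to run out before the horizon.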

Conclusion: Embracing the SRE Mindset with Bash

Throughout this article, we’ve explored how to implement Google’s SRE principles using Bash scripts. From monitoring the four golden signals to implementing error budgets, these practices can transform your operations from reactive firefighting to proactive engineering.

As the Google SRE book concludes:

“The SRE approach is about applying engineering to operations, creating scalable and reliable software systems.” — Google SRE Book

The functions we’ve explored demonstrate that Bash scripts, when written with SRE principles in mind, can be powerful tools for implementing reliable automation. By combining the insights from “Linux System Programming” by Robert Love, “Unix Power Tools”, “How Linux Works” by Brian Ward, and Google’s SRE books, we can create scripts that don’t just automate tasks but build more reliable systems.

I encourage you to integrate these patterns into your own Bash scripts. Start small, perhaps by implementing monitoring for the four golden signals, and gradually incorporate more SRE principles as you become comfortable with them. Remember that, as “Shell Scripting” guides emphasize, the goal isn’t perfection on day one but continuous improvement toward more reliable systems.

What SRE principles have you implemented in your Bash scripts? I’d love to hear your experiences and additional techniques in the comments below!
