Bash String Functions: Search, Split, Count, Extract
Karandeep Singh
• 8 minutes

Summary

Bash string functions for length validation, case conversion, substitution, truncation, counting, splitting, capitalization, ROT13, and field extraction (index, substring, join).

These functions cover the bulk of day-to-day string work in ETL pipelines: validating field lengths, normalizing case, substituting values, counting occurrences, splitting delimited data, and extracting fixed-width or delimited fields. Each one is paired with a real data-processing problem it solves.

LEN: Field Validation for Database Imports

The len function is useful for validating data before database insertion:

function len {
    echo "${#1}"
}

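This uses Bash’s built-in parameter expansion ${#var}, so no external process is spawned. The result is a character count in the current locale, not a byte count. Calling the function as $(len "$x") still forks a subshell, so in a tight loop it is cheaper to write ${#x} directly.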

Example usage in a data validation pipeline:

# Database schema constraints
MAX_EMAIL_LENGTH=255
MAX_STATUS_LENGTH=50
MAX_COMMENT_LENGTH=1000

while IFS=',' read -r email status comment; do
    email=$(trim "$email")
    status=$(trim "$status")
    comment=$(trim "$comment")

    # Validate field lengths before insert (quoted so empty values don't break the test)
    if [ "$(len "$email")" -gt "$MAX_EMAIL_LENGTH" ]; then
        echo "ERROR: Email too long ($(len "$email") chars): $email" >&2
        continue
    fi

    if [ "$(len "$status")" -gt "$MAX_STATUS_LENGTH" ]; then
        echo "ERROR: Status too long ($(len "$status") chars): $status" >&2
        continue
    fi

    if [ "$(len "$comment")" -gt "$MAX_COMMENT_LENGTH" ]; then
        # Truncate comment instead of rejecting
        comment="${comment:0:$MAX_COMMENT_LENGTH}"
    fi

    echo "INSERT INTO records (email, status, comment) VALUES ('$email', '$status', '$comment');"
done < import_data.csv

Length validation like this prevents database errors caused by field length violations.
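
The loop above calls a trim helper that is not defined in this article, and it places field values directly inside single-quoted SQL literals. The following is a minimal sketch of both pieces, not the author's implementation: trim is an assumed pure-Bash version, and sql_escape is a hypothetical helper that doubles single quotes so a value such as "it's" does not end the literal early.

```bash
# Hypothetical helper: strip leading and trailing whitespace (pure Bash).
function trim {
    local s="$1"
    s="${s#"${s%%[![:space:]]*}"}"   # drop leading whitespace
    s="${s%"${s##*[![:space:]]}"}"   # drop trailing whitespace
    printf '%s\n' "$s"
}

# Hypothetical helper: double each single quote for use inside '...'.
function sql_escape {
    local q="'"
    printf '%s\n' "${1//$q/$q$q}"
}

comment=$(trim "  it's fine  ")
echo "VALUES ('$(sql_escape "$comment")');"
```

Quote doubling covers only the simplest case. Where the database client supports parameterized statements or bulk loading, those are usually the more robust choice.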

UPPERCASE and LOWERCASE: Case Normalization

Case conversion functions are essential for data normalization:

function uppercase {
    echo "$1" | tr '[:lower:]' '[:upper:]'
}

function lowercase {
    echo "$1" | tr '[:upper:]' '[:lower:]'
}

A common problem these solve: email addresses imported from different systems with inconsistent casing:

# System A: all lowercase
john.doe@example.com

# System B: mixed case
John.Doe@Example.com

# System C: uppercase
JOHN.DOE@EXAMPLE.COM

This can cause duplicate user accounts when a system treats these as different emails. The fix:

# Normalize all emails to lowercase before import
while IFS=',' read -r user_id email name; do
    normalized_email=$(lowercase "$(trim "$email")")
    echo "INSERT INTO users (id, email, name) VALUES ($user_id, '$normalized_email', '$name');"
done < user_import.csv

Uppercase is also useful for standardizing status codes:

# Normalize status codes to uppercase
status=$(uppercase "$(trim "$status_field")")
case "$status" in
    ACTIVE|PENDING|SUSPENDED)
        # Valid status
        ;;
    *)
        echo "ERROR: Invalid status: $status" >&2
        status="UNKNOWN"
        ;;
esac

These simple functions help prevent duplicate accounts and standardize status codes across different data sources.
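
If the script only needs to run on Bash 4.0 or newer, the same conversions can be done with the ${var^^} and ${var,,} expansions, which avoid the pipe to tr. This is a sketch under that version assumption. The _native function names are illustrative and do not come from the original functions.

```bash
# Assumes Bash 4.0 or newer; the *_native names are illustrative only.
function uppercase_native {
    printf '%s\n' "${1^^}"
}

function lowercase_native {
    printf '%s\n' "${1,,}"
}

lowercase_native "John.Doe@Example.com"
uppercase_native "pending"
```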

SUBSTITUTE: Path Transformation for Multi-Environment Deploys

The substitute function solves a common problem in deployment scripts - transforming file paths between development, staging, and production environments.

function substitute {
    echo "$1" | sed "s/$2/$3/g"
}

Example usage in a deployment script:

# Configuration file paths differ across environments
# Dev:     /opt/dev/app/config/database.yml
# Staging: /opt/staging/app/config/database.yml
# Prod:    /opt/prod/app/config/database.yml

# Deploy script transforms paths based on target environment
TARGET_ENV="$1"  # dev, staging, or prod

while read -r config_line; do
    # Transform path based on environment
    case "$TARGET_ENV" in
        staging)
            transformed=$(substitute "$config_line" "/opt/dev/" "/opt/staging/")
            ;;
        prod)
            transformed=$(substitute "$config_line" "/opt/dev/" "/opt/prod/")
            ;;
        *)
            transformed="$config_line"
            ;;
    esac
    echo "$transformed"
done < config_template.yml > "config_${TARGET_ENV}.yml"

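However, this function has two serious bugs. First, it interpolates its arguments into a sed expression that uses / as the delimiter, so any argument containing a slash, which includes every path in the deploy script above, makes sed fail with a syntax error. Second, it doesn’t escape regex metacharacters, so a dot in the search string matches any character:

# BUG 1: Slashes in the arguments collide with the s/// delimiter
$ substitute "/opt/dev/app" "/opt/dev/" "/opt/staging/"
# sed exits with a syntax error and prints nothing to stdout

# BUG 2: Dots are regex wildcards in sed
$ substitute "appXv1 app.v1" "app.v1" "app.v2"
app.v2 app.v2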

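The fixed version escapes special characters in both the search and replacement strings and uses | as the delimiter:

function substitute_safe {
    local input="$1" search replace
    # Escape regex metacharacters and the | delimiter in the search string
    search=$(printf '%s\n' "$2" | sed 's/[.[\*^$|]/\\&/g')
    # Escape \, & and | so they are taken literally in the replacement
    replace=$(printf '%s\n' "$3" | sed 's/[\&|]/\\&/g')
    printf '%s\n' "$input" | sed "s|$search|$replace|g"
}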

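For simple substitutions, Bash parameter expansion avoids sed entirely. The search pattern must be quoted inside the expansion, otherwise *, ? and [ are treated as glob characters:

# Pure Bash approach - no sed, no regex issues
function substitute_bash {
    # Quoting "$2" makes the pattern match literally
    echo "${1//"$2"/$3}"
}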

This is both faster and safer. In a rough local benchmark, the Bash-native version ran several times faster than the sed-based one (exact times depend on your machine):

# Benchmark: 10,000 substitutions
time for i in {1..10000}; do substitute_safe "test.path/config" "test.path" "prod.path" > /dev/null; done
# Real: noticeably slower (spawns sed each iteration)

time for i in {1..10000}; do substitute_bash "test.path/config" "test.path" "prod.path" > /dev/null; done
# Real: much faster (no external process)

The Bash-native approach is significantly faster because it spawns no external process, and it treats special characters literally as long as the search pattern is quoted inside the expansion.
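
The effect of quoting the pattern in a ${var//pattern/replacement} expansion can be seen in a small sketch. The variable names and sample values are made up for illustration.

```bash
text="price: 5*2"
search="5*"

# Unquoted pattern: * is a glob wildcard, so "5*" matches "5*2"
unquoted="${text//$search/X}"

# Quoted pattern: matched literally, so only the two characters "5*" are replaced
quoted="${text//"$search"/X}"

echo "$unquoted"   # price: X
echo "$quoted"     # price: X2
```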

TRUNCATE: Display Formatting for Long Error Messages

The truncate function shortens long error messages before they reach a monitoring dashboard. Note that defining it shadows the coreutils truncate command for the rest of the script:

function truncate {
    local str="$1"
    local len="$2"
    if [ "${#str}" -gt "$len" ]; then
        echo "${str:0:$len}..."
    else
        echo "$str"
    fi
}

Example usage in an alert system:

# Parse error logs and send truncated messages to Slack
tail -n 100 /var/log/app/error.log | while read -r timestamp level message; do
    # Slack has 4000 char limit, but keep alerts concise
    truncated_msg=$(truncate "$message" 200)

    # Send to Slack webhook
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"[$level] $truncated_msg\"}" \
        "$SLACK_WEBHOOK_URL"
done

This keeps the channel usable: without truncation, long Java stack traces would be posted in full and bury other alerts.
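
The alert loop above builds its JSON payload by string interpolation, so a message containing a double quote or a backslash would produce invalid JSON. Below is a minimal sketch of an escaping step, assuming a hypothetical json_escape helper. It handles only backslash, double quote and tab, and other control characters would need more work or a tool such as jq.

```bash
# Hypothetical helper: escape the characters most likely to break a JSON string.
function json_escape {
    local s=$1 bs='\' dq='"' tab=$'\t'
    s=${s//"$bs"/$bs$bs}     # backslash first, so later escapes are not doubled
    s=${s//"$dq"/$bs$dq}     # double quote
    s=${s//"$tab"/${bs}t}    # literal tab becomes \t
    printf '%s\n' "$s"
}

payload="{\"text\":\"$(json_escape 'disk "sda" full')\"}"
echo "$payload"
```

In the loop, the escaped value would replace $truncated_msg inside the --data argument.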

COUNT: Analyzing Log Patterns

The count function helps identify frequently occurring errors in logs:

function count {
    # NF is 0 for an empty line, so guard against printing -1
    echo "$1" | awk -v FS="$2" '{print (NF > 0 ? NF - 1 : 0)}'
}

Example usage - flagging log lines that contain an unusually high number of ERROR occurrences:

# Count ERROR occurrences in each log line
while read -r line; do
    error_count=$(count "$line" "ERROR")
    if [ "$error_count" -gt 5 ]; then
        echo "High error density: $error_count errors in single log line"
        echo "$line"
    fi
done < /var/log/app/application.log

# Also used for CSV field counting
csv_line="field1,field2,field3,field4"
field_count=$(($(count "$csv_line" ",") + 1))
expected_fields=4

if [ "$field_count" -ne "$expected_fields" ]; then
    echo "ERROR: CSV has $field_count fields, expected $expected_fields"
fi

This is useful for catching cases where exception messages containing “ERROR” as part of the message text get counted multiple times in monitoring metrics.
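
Because awk treats a multi-character FS as a regular expression, a search string containing characters such as . or * can be counted incorrectly by the awk version. Here is a sketch of a literal-matching alternative in pure Bash. count_literal is a hypothetical name, and the function counts non-overlapping occurrences.

```bash
# Hypothetical alternative to count: literal (non-regex) matching, no awk.
function count_literal {
    local str="$1" search="$2"
    # An empty search string has no meaningful count
    [ -n "$search" ] || { echo 0; return; }
    # Remove every occurrence, then divide the length difference by the search length
    local stripped="${str//"$search"/}"
    echo $(( (${#str} - ${#stripped}) / ${#search} ))
}

count_literal "ERROR: retry after ERROR" "ERROR"   # 2
count_literal "a.b.c" "."                          # 2
count_literal "" "ERROR"                           # 0
```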

SPLIT: CSV and Log Field Parsing

The split function is essential for parsing delimited data. It prints one field per line so that fields containing spaces stay intact:

function split {
    local IFS="$2"
    local -a arr
    read -ra arr <<< "$1"
    printf '%s\n' "${arr[@]}"
}

Example usage parsing Apache access logs:

# Apache log format: IP - - [timestamp] "request" status size
log_line='192.168.1.100 - - [15/Jan/2024:10:30:45 +0000] "GET /api/users HTTP/1.1" 200 1234'

# Split by quotes to extract request
IFS='"' read -ra parts <<< "$log_line"
request="${parts[1]}"  # GET /api/users HTTP/1.1

# Split request by spaces
IFS=' ' read -ra request_parts <<< "$request"
method="${request_parts[0]}"   # GET
path="${request_parts[1]}"     # /api/users
protocol="${request_parts[2]}" # HTTP/1.1

# Track API endpoint hits
echo "$path" >> /tmp/api_hits.log

This parsing approach can feed an API usage analytics dashboard.

CAPITALIZE: Report Generation

The capitalize function formats customer names in generated reports. It relies on the \b and \u extensions of GNU sed, so it will not work with BSD or macOS sed:

function capitalize {
    echo "$1" | sed 's/\b\([a-z]\)/\u\1/g'
}

Example usage in a monthly report generator:

# Customer names in database are all lowercase (legacy system)
# Reports need proper capitalization

psql -t -c "SELECT customer_name FROM customers" | while read -r name; do
    formatted_name=$(capitalize "$name")
    echo "Customer: $formatted_name"
done > monthly_report.txt

# Examples:
# Input:  "john smith"
# Output: "John Smith"
#
# Input:  "mary-jane watson"
# Output: "Mary-Jane Watson"

This makes automated reports look more professional than printing the legacy lowercase names as-is.
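
Where GNU sed is not available, a similar result can be sketched with the ${var^} expansion. This assumes Bash 4.0 or newer, and capitalize_words is an illustrative name. Unlike the sed version, it raises only the first letter of each whitespace-separated word, so a hyphenated name keeps its second part in lowercase.

```bash
# Assumes Bash 4.0 or newer; capitalize_words is an illustrative name.
function capitalize_words {
    local -a words out=()
    local word
    read -ra words <<< "$1"          # split on whitespace without globbing
    for word in "${words[@]}"; do
        out+=("${word^}")            # uppercase the first character only
    done
    printf '%s\n' "${out[*]}"        # rejoin with single spaces
}

capitalize_words "john smith"         # John Smith
capitalize_words "mary-jane watson"   # Mary-jane Watson
```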

ROT13: Obfuscating Sensitive Data in Logs

The rot13 function provides simple obfuscation for sensitive data in debug logs:

function rot13 {
    echo "$1" | tr 'A-Za-z' 'N-ZA-Mn-za-m'
}

Example usage in a debug logging system:

# Debug logs need to show email patterns without exposing actual addresses
function log_debug {
    local level="$1"
    local message="$2"
    local email="$3"

    # Obfuscate email address in logs
    if [ -n "$email" ]; then
        obfuscated=$(rot13 "$email")
        echo "$(date -Is) [$level] $message | user_email_rot13: $obfuscated" >> /var/log/app/debug.log
    else
        echo "$(date -Is) [$level] $message" >> /var/log/app/debug.log
    fi
}

# Usage
log_debug "INFO" "User authentication successful" "john.doe@example.com"
# Logs: 2024-01-15T10:30:45+00:00 [INFO] User authentication successful | user_email_rot13: wbua.qbr@rknzcyr.pbz

# Security team can decode if needed:
$ echo "wbua.qbr@rknzcyr.pbz" | tr 'A-Za-z' 'N-ZA-Mn-za-m'
john.doe@example.com

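This keeps email addresses from appearing verbatim in application logs while still allowing debugging, since the value can be decoded on demand. ROT13 is trivially reversible, so treat it as protection against casual reading only, not as a substitute for masking or encrypting PII.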

STRING EXTRACTION: INDEX, SUBSTRING, and JOIN

These utility functions support parsing and rebuilding strings in data pipelines.

INDEX: Finding Delimiters

function index {
    local str="$1"
    local search="$2"
    expr index "$str" "$search"
}

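Used to locate field separators in variable-format data. expr index is not part of POSIX expr; where available, it returns the 1-based position of the first character in the string that appears in the search set, or 0 if none does: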

# Some CSV files use comma, some use pipe
data="john|doe|johndoe@example.com"

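comma_pos=$(index "$data" ",")
pipe_pos=$(index "$data" "|")

# A position of 0 means "not found", so compare positions only when a comma exists
if [ "$pipe_pos" -gt 0 ] && { [ "$comma_pos" -eq 0 ] || [ "$pipe_pos" -lt "$comma_pos" ]; }; then
    delimiter="|"
else
    delimiter=","
fi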

echo "Detected delimiter: $delimiter"
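
On systems whose expr lacks the index operator, a pure-Bash lookup is one option. The sketch below uses a hypothetical index_of function. It differs from expr index in that it searches for the whole substring rather than for any character of a set, and it returns 0 when the substring is absent.

```bash
# Hypothetical pure-Bash variant: 1-based position of the first occurrence
# of a substring, or 0 when it is absent.
function index_of {
    local str="$1" search="$2"
    # Strip the first match and everything after it
    local prefix="${str%%"$search"*}"
    if [ "$prefix" = "$str" ]; then
        echo 0
    else
        echo $(( ${#prefix} + 1 ))
    fi
}

index_of "john|doe|johndoe@example.com" "|"   # 5
index_of "john|doe|johndoe@example.com" ","   # 0
```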

SUBSTRING: Extracting Fixed-Width Fields

function substring {
    local str="$1"
    local start="$2"
    local len="$3"
    echo "${str:$start:$len}"
}

Example usage parsing a fixed-width legacy file format:

# Legacy mainframe export format:
# Columns 1-10: Account ID (right-padded)
# Columns 11-50: Account Name
# Columns 51-65: Balance (right-aligned, 2 decimals)

# IFS= keeps leading and trailing spaces so the column offsets stay correct
while IFS= read -r line; do
    account_id=$(trim "$(substring "$line" 0 10)")
    account_name=$(trim "$(substring "$line" 10 40)")
    balance=$(trim "$(substring "$line" 50 15)")

    echo "INSERT INTO accounts VALUES ('$account_id', '$account_name', $balance);"
done < mainframe_export.txt

This approach handles account exports from a legacy system whose output format can’t be changed.
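
To make the offsets concrete, the sketch below builds a sample 65-column record with printf and extracts two fields from it. The record values are made up for illustration, and substring is repeated so the example runs on its own.

```bash
# Same definition as above, repeated so the sketch is self-contained
function substring {
    local str="$1" start="$2" len="$3"
    echo "${str:$start:$len}"
}

# Build a 65-column sample record (values are made up for illustration)
line=$(printf '%-10s%-40s%15s' "42" "Jane Example" "1234.50")

echo "[$(substring "$line" 0 10)]"    # account ID, still right-padded
echo "[$(substring "$line" 50 15)]"   # balance, still left-padded
```

The brackets show that the padding is still present at this point, which is why each field is passed through trim before use.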

JOIN: Building Delimited Strings

function join {
    local IFS="$1"
    shift
    printf '%s' "$*"
}

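Used to rebuild CSV lines after field manipulation. Because "$*" joins its arguments with only the first character of IFS, the delimiter must be a single character: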

# Read CSV, modify specific field, write back
while IFS=',' read -ra fields; do
    # Modify third field (status)
    fields[2]=$(uppercase "${fields[2]}")

    # Rebuild CSV line
    modified_line=$(join "," "${fields[@]}")
    echo "$modified_line"
done < input.csv > output.csv

These three functions together handle parsing and rebuilding of different legacy data formats in an integration layer.
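
When the separator is longer than one character, the IFS-based join no longer works, since only the first character of IFS is used. A minimal sketch of a loop-based variant follows. join_str is a hypothetical name and is not part of the original set of functions.

```bash
# Hypothetical variant of join for separators longer than one character.
function join_str {
    local sep="$1" out item
    shift
    [ "$#" -gt 0 ] || return 0   # nothing to join
    out="$1"
    shift
    for item in "$@"; do
        out+="$sep$item"
    done
    printf '%s' "$out"
}

join_str ", " a b c      # a, b, c
join_str " | " "x y" z   # x y | z
```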

This is part of the Advanced Bash String Operations series.

Question

What string manipulation challenges have you encountered in production data pipelines?
