Bash String Functions: Trimming, Case, and Reversal

Karandeep Singh
• 5 minutes

Summary

Bash string functions for stripping leading and trailing whitespace (ltrim, rtrim, trim) and reversing strings, with performance comparisons against sed and awk.

Whitespace from vendor CSV exports and inconsistent log formats is a frequent source of silent data corruption. The trim functions below handle mixed whitespace (spaces, tabs, newlines) with parameter expansion, which avoids spawning external processes. A string-reversal function rounds out the set.

LTRIM: The First Attempt (That Failed)

A naive first solution looks like this:

# First attempt - doesn't work properly
function ltrim_broken {
    echo "$1" | sed 's/^ *//'
}

This works for spaces but fails on tabs and other whitespace. When a CSV has mixed tabs and spaces, this function misses the tabs entirely.

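Here’s what goes wrong:

$ ltrim_broken $'\t  data'    # Tab + spaces + text
	  data                     # Unchanged: the leading tab and spaces remain

The sed pattern ^ * matches only spaces. Because the string starts with a tab, the pattern matches zero characters at the start of the line and sed removes nothing, not even the spaces after the tab. A robust solution needs to handle all whitespace types.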
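
If the sed approach is kept, one way to cover tabs is the POSIX [[:space:]] character class. A minimal sketch, where the name ltrim_sed is hypothetical and not one of the original functions:

```bash
# Hypothetical sed-based variant: [[:space:]] matches spaces, tabs,
# and the other characters in the POSIX space class.
ltrim_sed() {
    printf '%s\n' "$1" | sed 's/^[[:space:]]*//'
}

ltrim_sed $'\t  data'    # prints: data
```

This still starts a sed process on every call, which matters once the function runs inside a loop.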

LTRIM: Production Solution with Performance Testing

A reliable approach uses Bash parameter expansion:

function ltrim {
    # printf instead of echo, so inputs such as "-n" or "-e" are printed literally
    printf '%s\n' "${1#"${1%%[![:space:]]*}"}"
}

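This uses two nested parameter expansions:

  1. ${1%%[![:space:]]*} - removes the longest suffix that starts at the first non-whitespace character, leaving only the leading whitespace
  2. ${1#...} - removes that leading whitespace from the start of the string; the inner double quotes make it match literally rather than as a pattern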
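
The two steps can be traced on a sample string. This is only an illustration; the variable names s, leading, and trimmed are made up for the example:

```bash
s=$'\t  data  '

# Step 1: strip the longest suffix that starts at the first
# non-whitespace character; only the leading whitespace is left.
leading="${s%%[![:space:]]*}"

# Step 2: strip that exact prefix (quoted, so it is matched literally)
# from the original string. Trailing whitespace is untouched.
trimmed="${s#"$leading"}"

printf '[%s]\n' "$trimmed"    # prints [data  ]
```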

Here is a performance comparison over 100,000 lines. The exact numbers below come from a rough local benchmark and will vary with your hardware and shell; treat them as illustrative of the relative ordering, not as precise measurements:

# Test file: 100K lines with leading whitespace
seq 1 100000 | awk '{print "   "$0}' > test_data.txt

# Method 1: ltrim with parameter expansion
# IFS= and -r keep read from stripping the leading whitespace itself
time while IFS= read -r line; do ltrim "$line" > /dev/null; done < test_data.txt
# Real: ~8s

# Method 2: sed approach (spawns a process per line)
time while IFS= read -r line; do echo "$line" | sed 's/^[[:space:]]*//'; done < test_data.txt > /dev/null
# Real: ~40s

# Method 3: awk approach (single process)
time awk '{sub(/^[[:space:]]+/, ""); print}' test_data.txt > /dev/null
# Real: ~1s

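The parameter expansion approach is several times faster than the per-line sed loop because it starts no external process per line, but it is slower than a single awk pass over the whole file. For bulk processing of a file, prefer the single awk pass. For trimming individual strings inside a script, the function is more convenient and keeps the logic in Bash.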
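
When the trimmed value has to be stored rather than printed, capturing it with $(ltrim ...) forks a subshell on every call. One possible way around that, sketched here with a hypothetical helper named ltrim_into, is to assign the result to a caller-named variable using the printf -v builtin option:

```bash
# Hypothetical helper: ltrim_into VARNAME STRING
# Stores the trimmed string in VARNAME instead of printing it, so the
# caller does not need a $(...) command substitution.
ltrim_into() {
    local __name=$1 __value=$2
    __value="${__value#"${__value%%[![:space:]]*}"}"
    printf -v "$__name" '%s' "$__value"
}

ltrim_into cleaned $'\t  data'
printf '[%s]\n' "$cleaned"    # prints [data]
```

The double-underscore local names are an assumption meant to reduce the chance of clashing with the caller's variable name; a caller passing __name or __value as the target would still collide.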

RTRIM: Log Format Normalization

A common need for rtrim comes from log aggregation. Different microservices use different logging formats, some adding trailing spaces, some adding newlines. This can break a log parsing regex.

Here’s the kind of bug this causes:

# Microservice A log format (no trailing space)
echo "2024-01-15 ERROR UserService failed"

# Microservice B log format (trailing spaces)
echo "2024-01-15 ERROR PaymentService failed  "

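# The regex pattern expects no trailing whitespace
log_line="2024-01-15 ERROR PaymentService failed  "
if [[ $log_line =~ ^([0-9-]+)\ ([A-Z]+)\ (.+)$ ]]; then
    # Group 3 captures "PaymentService failed  " including the trailing
    # spaces, which then breaks downstream JSON generation
    message="${BASH_REMATCH[3]}"
fi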

The rtrim function fixes this:

function rtrim {
    # printf instead of echo, so inputs such as "-n" or "-e" are printed literally
    printf '%s\n' "${1%"${1##*[![:space:]]}"}"
}

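This mirrors ltrim but works from the right side:

  1. ${1##*[![:space:]]} - removes the longest prefix that ends at the last non-whitespace character, leaving only the trailing whitespace
  2. ${1%...} - removes that trailing whitespace from the end of the string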
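
A few edge cases show how the expansion behaves. The snippet repeats the rtrim expansion from above so that it runs on its own; the sample inputs are invented:

```bash
rtrim() {
    printf '%s\n' "${1%"${1##*[![:space:]]}"}"
}

# Mixed trailing whitespace (space, tab, space) is removed in one step.
printf '[%s]\n' "$(rtrim $'ERROR failed \t ')"   # prints [ERROR failed]

# With no non-whitespace character, the inner pattern matches nothing,
# so the inner expansion yields the whole string and all of it is removed.
printf '[%s]\n' "$(rtrim '   ')"                 # prints []

# The empty string passes through unchanged.
printf '[%s]\n' "$(rtrim '')"                    # prints []
```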

Example usage in a log parser:

while IFS= read -r line; do
    clean_line=$(rtrim "$line")
    if [[ $clean_line =~ ^([0-9-]+)\ ([A-Z]+)\ (.+)$ ]]; then
        timestamp="${BASH_REMATCH[1]}"
        level="${BASH_REMATCH[2]}"
        message="${BASH_REMATCH[3]}"
        # Generate JSON for log aggregator
        # (values are interpolated as-is; a message containing " or \ would need escaping)
        echo "{\"ts\":\"$timestamp\",\"level\":\"$level\",\"msg\":\"$message\"}"
    fi
done < service.log

This normalizes log lines across services, removing the whitespace inconsistencies that can cause logs to fail parsing.

TRIM: A Workhorse Function

The trim function is a workhorse, called frequently in CSV processing:

function trim {
    # rtrim already prints its result, so no outer echo "$(...)" is needed
    rtrim "$(ltrim "$1")"
}

This simply combines both operations. Why not use a single sed regex? In a rough local benchmark, the parameter expansion version finished roughly an order of magnitude faster than shelling out to sed in a loop (exact times depend on your machine):

# Benchmark: 10,000 trim operations
time for i in {1..10000}; do
    trim "  test data  " > /dev/null
done
# Real: a few seconds

# Versus sed approach
time for i in {1..10000}; do
    echo "  test data  " | sed 's/^[[:space:]]*//;s/[[:space:]]*$//' > /dev/null
done
# Real: roughly 10x longer

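The parameter expansion approach wins by a wide margin because it never executes an external program. Any command substitution inside trim still forks a subshell, but that is typically much cheaper than launching sed on every call.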
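
The subshell inside trim can also be avoided by applying both expansions to a local variable. A minimal sketch, with the hypothetical name trim_inline to keep it apart from the trim function above:

```bash
# Hypothetical single-function variant: both expansions run in the
# current shell, with no nested command substitution.
trim_inline() {
    local s=$1
    s="${s#"${s%%[![:space:]]*}"}"    # drop leading whitespace
    s="${s%"${s##*[![:space:]]}"}"    # drop trailing whitespace
    printf '%s\n' "$s"
}

trim_inline $'  \ttest data \t '    # prints: test data
```

Whitespace inside the string is left alone; only the two ends are trimmed.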

Here’s an example CSV import script that uses it:

#!/bin/bash
# Process vendor CSV with inconsistent whitespace
# Runs daily at 2 AM processing 50K+ rows

source string_functions.sh

while IFS=',' read -r id email status; do
    # Trim all fields
    clean_id=$(trim "$id")
    clean_email=$(trim "$email")
    clean_status=$(trim "$status")

    # Validate and insert
    if [[ $clean_id =~ ^[0-9]+$ ]] && [[ $clean_email =~ ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$ ]]; then
        echo "INSERT INTO users (id, email, status) VALUES ($clean_id, '$clean_email', '$clean_status');"
    else
        echo "WARN: Skipped invalid row: id=$id email=$email" >&2
    fi
done < vendor_export.csv > import.sql

Trim functions like these guard against the whitespace issues that cause imports to fail silently.

REVERSE: Palindrome Detection

The reverse function can solve a specific problem in a data validation pipeline: detecting palindromic transaction IDs that might be flagged as potential duplicates.

Here’s the function:

function reverse {
  local str="$1"
  local reversed=""
  local len=${#str}
  local i    # keep the loop counter out of the caller's scope
  for ((i=len-1; i>=0; i--))
  do
    reversed="$reversed${str:i:1}"
  done
  printf '%s\n' "$reversed"
}

This uses a C-style for loop to iterate backwards through the string. It’s slower than the rev command but more portable: rev is not specified by POSIX, so minimal environments may not ship it.

Example usage:

# Check if transaction ID is palindrome (potential duplicate)
function is_palindrome {
    local str="$1"
    local rev
    rev=$(reverse "$str")    # assigned separately so local does not mask the exit status
    [[ "$str" == "$rev" ]]
}

# Process transaction file
while IFS=',' read -r txn_id amount status; do    # split on commas, since the input is a CSV
    if is_palindrome "$txn_id"; then
        echo "WARN: Palindrome transaction ID $txn_id - flagging for review" >&2
    fi
    # Process transaction...
done < transactions.csv
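
For the palindrome check alone, building the reversed string is not strictly necessary. One alternative, sketched here under the hypothetical name is_palindrome_inplace, compares characters from both ends and stops at the first mismatch:

```bash
# Hypothetical variant: two indexes walk toward the middle of the string.
# Returns 0 (true) for a palindrome, 1 otherwise.
is_palindrome_inplace() {
    local s=$1 i j
    for (( i = 0, j = ${#s} - 1; i < j; i++, j-- )); do
        [[ "${s:i:1}" == "${s:j:1}" ]] || return 1
    done
    return 0
}

if is_palindrome_inplace "12321"; then
    echo "palindrome"
fi
```

Like the reverse-based version, this treats the empty string and single-character strings as palindromes.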

Performance note: This function is slow for long strings because it walks the string character by character in pure Bash. As a rough illustration, when processing 100,000 ten-character strings, the character-by-character Bash loop is dramatically slower than delegating to a compiled tool:

# Bash loop approach: slowest by far (tens of seconds)
# rev command: fastest (well under a second)
# awk approach: in between (around a second or two)
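
The awk approach is mentioned in the comparison but not shown. One way it could look, as a sketch under the hypothetical name reverse_awk:

```bash
# Hypothetical awk-based reversal: walk the line from its last
# character to its first and append each one to the output.
reverse_awk() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = length($0); i >= 1; i--)
            out = out substr($0, i, 1)
        print out
    }'
}

reverse_awk "abc123"    # prints: 321cba
```

Called once per string, this pays the process start-up cost each time; the single-pass figure above applies when one awk process handles the whole file.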

This function is useful for portability in containerized environments where rev might not be available, but for performance-critical code, use rev:

function reverse_fast {
    # printf instead of echo, so inputs such as "-n" are passed through literally
    printf '%s\n' "$1" | rev
}
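
The two approaches can be combined so a script uses rev when it exists and falls back to pure Bash otherwise. A minimal sketch, assuming a hypothetical wrapper named reverse_auto:

```bash
# Hypothetical wrapper: prefer rev when it is on PATH, otherwise
# fall back to the character-by-character Bash loop.
reverse_auto() {
    if command -v rev >/dev/null 2>&1; then
        printf '%s\n' "$1" | rev
    else
        local str=$1 reversed="" i
        for (( i = ${#str} - 1; i >= 0; i-- )); do
            reversed+="${str:i:1}"
        done
        printf '%s\n' "$reversed"
    fi
}

reverse_auto "TXN-001"    # prints: 100-NXT
```

The command -v lookup runs on every call; a script that reverses many strings could do the check once at start-up and store the result.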

This is part of the Advanced Bash String Operations series.

Question

What string manipulation challenges have you encountered in production data pipelines?
