Bash String Functions: Search, Split, Count, Extract

Summary
These functions cover the bulk of day-to-day string work in ETL pipelines: validating field lengths, normalizing case, substituting values, counting occurrences, splitting delimited data, and extracting fixed-width or delimited fields. Each one is paired with a real data-processing problem it solves.
LEN: Field Validation for Database Imports
The len function is useful for validating data before database insertion:
function len {
echo "${#1}"
}
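This uses Bash’s built-in parameter expansion ${#var}, which spawns no external process. Calling it as $(len "$x") still forks a subshell for the command substitution, so in a hot loop the inline form ${#x} is cheaper.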
Example usage in a data validation pipeline:
# Database schema constraints
MAX_EMAIL_LENGTH=255
MAX_STATUS_LENGTH=50
MAX_COMMENT_LENGTH=1000
# trim is assumed to be defined elsewhere (it is not shown in this article)
# and to strip leading and trailing whitespace from its argument.
while IFS=',' read -r email status comment; do
    email=$(trim "$email")
    status=$(trim "$status")
    comment=$(trim "$comment")
    # Validate field lengths before insert
    if [ "$(len "$email")" -gt "$MAX_EMAIL_LENGTH" ]; then
        echo "ERROR: Email too long ($(len "$email") chars): $email" >&2
        continue
    fi
    if [ "$(len "$status")" -gt "$MAX_STATUS_LENGTH" ]; then
        echo "ERROR: Status too long ($(len "$status") chars): $status" >&2
        continue
    fi
    if [ "$(len "$comment")" -gt "$MAX_COMMENT_LENGTH" ]; then
        # Truncate comment instead of rejecting
        comment="${comment:0:$MAX_COMMENT_LENGTH}"
    fi
    # Values are interpolated as-is: a single quote in any field breaks the statement
    echo "INSERT INTO records (email, status, comment) VALUES ('$email', '$status', '$comment');"
done < import_data.csv
Length validation like this prevents database errors caused by field length violations.
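When the length check sits inside a tight loop, the same test can be written without command substitution. The following is a minimal sketch; the helper name fits_within is hypothetical and not part of the original function set.

```bash
# Hypothetical helper: succeeds when the value fits within max characters.
fits_within() {
    local value=$1 max=$2
    # ${#value} is expanded by the shell itself: no subshell, no external process.
    [ "${#value}" -le "$max" ]
}

email="john.doe@example.com"
if fits_within "$email" 255; then
    echo "email ok (${#email} chars)"
else
    echo "email too long" >&2
fi
```

Note that ${#value} counts characters in the current locale. If the database column limit is expressed in bytes, multi-byte UTF-8 input may need a byte-based check instead.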
UPPERCASE and LOWERCASE: Case Normalization
Case conversion functions are essential for data normalization:
function uppercase {
    echo "$1" | tr '[:lower:]' '[:upper:]'
}
function lowercase {
    echo "$1" | tr '[:upper:]' '[:lower:]'
}
A common problem these solve: email addresses imported from different systems with inconsistent casing:
# System A: all lowercase
john.doe@example.com
# System B: mixed case
John.Doe@Example.com
# System C: uppercase
JOHN.DOE@EXAMPLE.COM
This can cause duplicate user accounts when a system treats these as different emails. The fix:
# Normalize all emails to lowercase before import
while IFS=',' read -r user_id email name; do
    normalized_email=$(lowercase "$(trim "$email")")
    echo "INSERT INTO users (id, email, name) VALUES ($user_id, '$normalized_email', '$name');"
done < user_import.csv
Uppercase is also useful for standardizing status codes:
# Normalize status codes to uppercase
status=$(uppercase "$(trim "$status_field")")
case "$status" in
    ACTIVE|PENDING|SUSPENDED)
        # Valid status
        ;;
    *)
        echo "ERROR: Invalid status: $status" >&2
        status="UNKNOWN"
        ;;
esac
These simple functions help prevent duplicate accounts and standardize status codes across different data sources.
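On Bash 4.0 or later, case conversion is also available as a parameter expansion, which avoids the pipe to tr. This is a sketch with hypothetical function names; the stock Bash 3.2 shipped with macOS does not support these expansions.

```bash
# Sketch: case conversion with Bash 4+ parameter expansion (no tr, no pipe).
lowercase_bash() { printf '%s\n' "${1,,}"; }
uppercase_bash() { printf '%s\n' "${1^^}"; }

lowercase_bash "John.Doe@Example.com"   # john.doe@example.com
uppercase_bash "pending"                # PENDING
```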
SUBSTITUTE: Path Transformation for Multi-Environment Deploys
The substitute function solves a common problem in deployment scripts - transforming file paths between development, staging, and production environments.
function substitute {
    # | as the sed delimiter so that search strings containing / can be passed in
    echo "$1" | sed "s|$2|$3|g"
}
Example usage in a deployment script:
# Configuration file paths differ across environments
# Dev: /opt/dev/app/config/database.yml
# Staging: /opt/staging/app/config/database.yml
# Prod: /opt/prod/app/config/database.yml
# Deploy script transforms paths based on target environment
TARGET_ENV="$1" # dev, staging, or prod
while read -r config_line; do
    # Transform path based on environment
    case "$TARGET_ENV" in
        staging)
            transformed=$(substitute "$config_line" "/opt/dev/" "/opt/staging/")
            ;;
        prod)
            transformed=$(substitute "$config_line" "/opt/dev/" "/opt/prod/")
            ;;
        *)
            transformed="$config_line"
            ;;
    esac
    echo "$transformed"
done < config_template.yml > "config_${TARGET_ENV}.yml"
However, this function has a critical bug - it doesn’t escape special regex characters. A dot in the search string matches any character, so a name such as app.v1 can match more than intended:
# BUG: an unescaped dot is a regex wildcard in sed
$ substitute "app.v1 appXv1" "app.v1" "app.v2"
app.v2 app.v2
# appXv1 was rewritten too, because "." matched the "X"
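The fixed version escapes special characters in both the search string and the replacement:
function substitute_safe {
    local input="$1"
    local search replace
    # Escape BRE metacharacters and / in the search string
    search=$(printf '%s\n' "$2" | sed 's|[[\/.^$*]|\\&|g')
    # Escape \, / and & in the replacement (& would otherwise mean "the matched text")
    replace=$(printf '%s\n' "$3" | sed 's|[\/&]|\\&|g')
    printf '%s\n' "$input" | sed "s/$search/$replace/g"
}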
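For simple substitutions, Bash parameter expansion avoids the regex problem entirely. Quote the pattern so that glob characters such as * and ? are also taken literally:
# Pure Bash approach - no sed, no regex issues
function substitute_bash {
    # Quoted pattern: no glob matching. Quoted replacement: & stays literal in Bash 5.2+
    echo "${1//"$2"/"$3"}"
}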
This is both faster and safer. In a rough local benchmark, the Bash-native version ran several times faster than the sed-based one (exact times depend on your machine):
# Benchmark: 10,000 substitutions
time for i in {1..10000}; do substitute_safe "test.path/config" "test.path" "prod.path" > /dev/null; done
# Real: noticeably slower (spawns sed each iteration)
time for i in {1..10000}; do substitute_bash "test.path/config" "test.path" "prod.path" > /dev/null; done
# Real: much faster (no external process)
The Bash-native approach is significantly faster because it spawns no external process, and it treats dots and other regex metacharacters as literal text.
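One caveat with the parameter-expansion form is that an unquoted pattern is matched as a glob, so * and ? are not literal. The sketch below contrasts the two behaviors; both function names are hypothetical, and the nested quoting assumes Bash 4.3 or later.

```bash
# Pattern and replacement are quoted, so both are taken literally.
substitute_literal() {
    printf '%s\n' "${1//"$2"/"$3"}"
}

# Unquoted pattern for comparison: * is treated as a glob wildcard.
substitute_glob() {
    printf '%s\n' "${1//$2/$3}"
}

substitute_literal "rate: 5*3" "*" "x"   # rate: 5x3
substitute_glob    "rate: 5*3" "*" "x"   # not the intended result: * matched the whole string
```

Quoting the replacement as well keeps & literal on Bash 5.2 and later, where an unquoted & in the replacement can stand for the matched text.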
TRUNCATE: Display Formatting for Long Error Messages
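The truncate function prevents a monitoring dashboard from displaying massive error messages. Note that a function named truncate shadows the coreutils truncate command for the rest of the script:
function truncate {
    local str="$1"
    local len="$2"
    if [ "${#str}" -gt "$len" ]; then
        # The result is len characters plus the three-character ellipsis
        echo "${str:0:$len}..."
    else
        echo "$str"
    fi
}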
Example usage in an alert system:
# Parse error logs and send truncated messages to Slack
tail -n 100 /var/log/app/error.log | while read -r timestamp level message; do
    # Slack has 4000 char limit, but keep alerts concise
    truncated_msg=$(truncate "$message" 200)
    # Send to Slack webhook
    # NOTE: the message is embedded unescaped; a double quote or backslash in it yields invalid JSON
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"[$level] $truncated_msg\"}" \
        "$SLACK_WEBHOOK_URL"
done
This keeps long Java stack traces from being posted in full, which would otherwise flood the Slack channel and bury other alerts.
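Because truncate appends "..." after cutting, its output can be three characters longer than the requested length. Where a hard width limit matters, a variant along these lines keeps the total within the limit. It is a sketch under that assumption, with a hypothetical name chosen to avoid shadowing the coreutils command.

```bash
# Hypothetical variant: the result, including the ellipsis, never exceeds max characters.
truncate_to_width() {
    local str=$1 max=$2 ellipsis="..."
    if [ "${#str}" -le "$max" ]; then
        printf '%s\n' "$str"
    elif [ "$max" -le "${#ellipsis}" ]; then
        # Too narrow to fit an ellipsis: hard cut
        printf '%s\n' "${str:0:max}"
    else
        printf '%s\n' "${str:0:max-${#ellipsis}}$ellipsis"
    fi
}

truncate_to_width "java.lang.NullPointerException at line 42" 20
```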
COUNT: Analyzing Log Patterns
The count function helps identify frequently occurring errors in logs:
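function count {
    # awk splits each line on $2; a multi-character FS is treated as a regex.
    # An empty line has NF=0, so guard against printing -1.
    echo "$1" | awk -v FS="$2" '{print (NF > 0 ? NF-1 : 0)}'
}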
Example usage - flagging log lines with an unusually high number of ERROR occurrences:
# Count ERROR occurrences in each log line
while read -r line; do
    error_count=$(count "$line" "ERROR")
    if [ "$error_count" -gt 5 ]; then
        echo "High error density: $error_count errors in single log line"
        echo "$line"
    fi
done < /var/log/app/application.log
# Also used for CSV field counting
csv_line="field1,field2,field3,field4"
field_count=$(($(count "$csv_line" ",") + 1))
expected_fields=4
if [ "$field_count" -ne "$expected_fields" ]; then
    echo "ERROR: CSV has $field_count fields, expected $expected_fields"
fi
This is useful for catching cases where exception messages containing “ERROR” as part of the message text get counted multiple times in monitoring metrics.
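When the thing being counted is a literal substring rather than a pattern, the count can also be derived in pure Bash: delete every occurrence and compare lengths. This is a sketch with a hypothetical function name, not a drop-in replacement for the awk version, which accepts regular expressions.

```bash
# Hypothetical helper: counts non-overlapping literal occurrences of needle in str.
count_literal() {
    local str=$1 needle=$2
    # An empty needle would divide by zero below
    [ -n "$needle" ] || { echo 0; return; }
    # Remove every occurrence; the quoted pattern is matched literally
    local stripped=${str//"$needle"/}
    echo $(( (${#str} - ${#stripped}) / ${#needle} ))
}

count_literal "ERROR: retry failed: ERROR 503" "ERROR"   # 2
count_literal "field1,field2,field3,field4" ","          # 3
```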
SPLIT: CSV and Log Field Parsing
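The split function breaks a delimited string into fields and prints one field per line:
function split {
    local IFS="$2"
    local -a arr
    read -ra arr <<< "$1"
    # One field per line; echo "${arr[@]}" would re-join the fields with spaces
    printf '%s\n' "${arr[@]}"
}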
The same IFS and read -ra technique, applied inline, parses Apache access logs:
# Apache log format: IP - - [timestamp] "request" status size
log_line='192.168.1.100 - - [15/Jan/2024:10:30:45 +0000] "GET /api/users HTTP/1.1" 200 1234'
# Split by quotes to extract request
IFS='"' read -ra parts <<< "$log_line"
request="${parts[1]}" # GET /api/users HTTP/1.1
# Split request by spaces
IFS=' ' read -ra request_parts <<< "$request"
method="${request_parts[0]}" # GET
path="${request_parts[1]}" # /api/users
protocol="${request_parts[2]}" # HTTP/1.1
# Track API endpoint hits
echo "$path" >> /tmp/api_hits.log
This parsing approach can feed an API usage analytics dashboard.
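A function cannot return an array directly, so one way to get the fields back into the caller is to print them one per line and collect them with mapfile. The sketch below assumes Bash 4 or later and fields that contain no newlines; split_lines is a hypothetical name.

```bash
# Hypothetical helper: prints one field per line so the caller can rebuild an array.
split_lines() {
    local IFS=$2
    local -a fields
    read -ra fields <<< "$1"
    printf '%s\n' "${fields[@]}"
}

# Collect the printed fields into an array in the calling shell
mapfile -t parts < <(split_lines "alice,admin,active" ",")
echo "fields: ${#parts[@]}, second: ${parts[1]}"
```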
CAPITALIZE: Report Generation
The capitalize function formats customer names in generated reports. It relies on \b and \u, which are GNU sed extensions and are not available in BSD/macOS sed:
function capitalize {
    echo "$1" | sed 's/\b\([a-z]\)/\u\1/g'
}
Example usage in a monthly report generator:
# Customer names in database are all lowercase (legacy system)
# Reports need proper capitalization
psql -t -c "SELECT customer_name FROM customers" | while read -r name; do
    formatted_name=$(capitalize "$name")
    echo "Customer: $formatted_name"
done > monthly_report.txt
# Examples:
# Input: "john smith"
# Output: "John Smith"
#
# Input: "mary-jane watson"
# Output: "Mary-Jane Watson"
Consistent capitalization makes automated reports look more professional.
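Where GNU sed cannot be assumed, a similar result is possible with the Bash 4 ${var^} expansion. The sketch below is one possible approach, not the original implementation: capitalize_words is a hypothetical name, it only treats spaces and hyphens as word boundaries, and it collapses runs of whitespace.

```bash
# Hypothetical pure-Bash variant: capitalizes the first letter after spaces and hyphens.
capitalize_words() {
    local -a words parts
    local word part joined out=""
    read -ra words <<< "$1"
    for word in "${words[@]}"; do
        # Split hyphenated names so each part is capitalized
        IFS='-' read -ra parts <<< "$word"
        joined=""
        for part in "${parts[@]}"; do
            joined+="${joined:+-}${part^}"
        done
        out+="${out:+ }$joined"
    done
    printf '%s\n' "$out"
}

capitalize_words "john smith"         # John Smith
capitalize_words "mary-jane watson"   # Mary-Jane Watson
```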
ROT13: Obfuscating Sensitive Data in Logs
The rot13 function provides simple obfuscation for sensitive data in debug logs:
function rot13 {
    echo "$1" | tr 'A-Za-z' 'N-ZA-Mn-za-m'
}
Example usage in a debug logging system:
# Debug logs need to show email patterns without exposing actual addresses
function log_debug {
    local level="$1"
    local message="$2"
    local email="$3"
    # Obfuscate email address in logs
    if [ -n "$email" ]; then
        obfuscated=$(rot13 "$email")
        echo "$(date -Is) [$level] $message | user_email_rot13: $obfuscated" >> /var/log/app/debug.log
    else
        echo "$(date -Is) [$level] $message" >> /var/log/app/debug.log
    fi
}
# Usage
log_debug "INFO" "User authentication successful" "john.doe@example.com"
# Logs: 2024-01-15T10:30:45+00:00 [INFO] User authentication successful | user_email_rot13: wbua.qbr@rknzcyr.pbz
# Security team can decode if needed:
$ echo "wbua.qbr@rknzcyr.pbz" | tr 'A-Za-z' 'N-ZA-Mn-za-m'
john.doe@example.com
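ROT13 is not encryption: anyone who can read the log can reverse it with the same tr command. It keeps addresses from being readable or greppable at a glance while still letting engineers decode a value on demand, so it should not be relied on where PII must actually be protected.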
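Decoding works with the same command because ROT13 is its own inverse: shifting by 13 twice covers the full 26-letter alphabet. A short sketch of the round trip, repeating the rot13 definition so the snippet runs on its own:

```bash
rot13() {
    printf '%s\n' "$1" | tr 'A-Za-z' 'N-ZA-Mn-za-m'
}

encoded=$(rot13 "john.doe@example.com")
decoded=$(rot13 "$encoded")
echo "$encoded"   # wbua.qbr@rknzcyr.pbz
echo "$decoded"   # john.doe@example.com
```

Digits and punctuation pass through unchanged, which is why the shape of the address stays recognizable.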
STRING EXTRACTION: INDEX, SUBSTRING, and JOIN
These utility functions support parsing and rebuilding strings in data pipelines.
INDEX: Finding Delimiters
function index {
    local str="$1"
    local search="$2"
    # GNU expr: 1-based position of the first character of str that occurs in search, 0 if none
    expr index "$str" "$search"
}
Used to locate field separators in variable-format data:
# Some CSV files use comma, some use pipe
data="john|doe|johndoe@example.com"
comma_pos=$(index "$data" ",")
pipe_pos=$(index "$data" "|")
# A position of 0 means "not found", so it must not win the comparison
if [ "$pipe_pos" -gt 0 ] && { [ "$comma_pos" -eq 0 ] || [ "$pipe_pos" -lt "$comma_pos" ]; }; then
    delimiter="|"
else
    delimiter=","
fi
echo "Detected delimiter: $delimiter"
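expr index is a GNU extension, and it looks for any single character from its second argument rather than for a substring. Where a substring position is needed, or GNU expr is not available, a pure-Bash sketch such as the following may help; index_of is a hypothetical name.

```bash
# Hypothetical helper: 1-based position of the first occurrence of search in str, 0 if absent.
index_of() {
    local str=$1 search=$2
    [ -n "$search" ] || { echo 0; return; }
    # Strip the first occurrence of search and everything after it
    local prefix=${str%%"$search"*}
    if [ "$prefix" = "$str" ]; then
        echo 0
    else
        echo $(( ${#prefix} + 1 ))
    fi
}

index_of "john|doe|johndoe@example.com" "|"   # 5
index_of "john|doe|johndoe@example.com" ","   # 0
```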
SUBSTRING: Extracting Fixed-Width Fields
function substring {
    local str="$1"
    local start="$2"
    local len="$3"
    echo "${str:$start:$len}"
}
Example usage parsing a fixed-width legacy file format:
# Legacy mainframe export format:
# Columns 1-10: Account ID (right-padded)
# Columns 11-50: Account Name
# Columns 51-65: Balance (right-aligned, 2 decimals)
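# IFS= keeps leading spaces, which would otherwise shift every column
while IFS= read -r line; do
    # trim takes its input as an argument, as in the earlier examples
    account_id=$(trim "$(substring "$line" 0 10)")
    account_name=$(trim "$(substring "$line" 10 40)")
    balance=$(trim "$(substring "$line" 50 15)")
    echo "INSERT INTO accounts VALUES ('$account_id', '$account_name', $balance);"
done < mainframe_export.txt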
This approach handles account exports from a legacy system whose output format can’t be changed.
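Fixed-width specifications are usually written as 1-based column ranges, while ${str:offset:length} takes a 0-based offset and a length. A small wrapper can do the conversion so the call sites mirror the spec. This is a sketch; column_range is a hypothetical name.

```bash
# Hypothetical helper: extracts 1-based, inclusive columns first..last from a line.
column_range() {
    local line=$1 first=$2 last=$3
    # Offset and length are arithmetic expressions, so variables need no $ here
    printf '%s\n' "${line:first-1:last-first+1}"
}

# Build a sample record matching the layout above: 10 + 40 + 15 characters
printf -v record '%-10s%-40s%15s' "A123" "Jane Doe" "1050.25"
column_range "$record" 51 65
```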
JOIN: Building Delimited Strings
function join {
    local IFS="$1"
    shift
    printf '%s' "$*"
}
Used to rebuild CSV lines after field manipulation:
# Read CSV, modify specific field, write back
while IFS=',' read -ra fields; do
    # Modify third field (status)
    fields[2]=$(uppercase "${fields[2]}")
    # Rebuild CSV line
    modified_line=$(join "," "${fields[@]}")
    echo "$modified_line"
done < input.csv > output.csv
These three functions together handle parsing and rebuilding of different legacy data formats in an integration layer.
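Because "$*" joins with only the first character of IFS, the join function above cannot produce multi-character separators such as ", " or " | ". One commonly used alternative prefixes every element after the first with the delimiter. The sketch below uses a hypothetical name, join_by, which also avoids shadowing the coreutils join command, and assumes Bash 4.3 or later for the nested quoting.

```bash
# Hypothetical helper: joins the remaining arguments with a delimiter of any length.
join_by() {
    local delim=${1-} first=${2-}
    # shift 2 fails when there are no elements to join, in which case nothing is printed
    if shift 2; then
        # Prefix every remaining element with the delimiter, then print them back to back
        printf '%s' "$first" "${@/#/"$delim"}"
    fi
}

join_by ", " alice bob carol   # alice, bob, carol
```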
This is part of the Advanced Bash String Operations series.
What string manipulation challenges have you encountered in production data pipelines?

