Every DevOps engineer manages processes. You kill stuck workers, restart crashed services, and watch memory usage climb until someone gets paged. Linux gives you the tools for all of this: ps, kill, systemctl, nice, ulimit. We are going to learn each one, then build the same patterns in Go until we have a mini process supervisor with health checks and auto-restart.
Prerequisites
- A Linux system (native, WSL, or SSH to a remote server)
- Go 1.21+ installed (run go version to check)
Create a project directory:
mkdir process-mgmt && cd process-mgmt
go mod init process-mgmt
Step 1: Finding and Inspecting Processes
Linux Commands
The first thing you need when something is wrong is a list of what is running.
ps aux
This prints every process on the system. The columns:
| Column | Meaning |
|---|---|
| USER | Who owns the process |
| PID | Process ID, the number you use to kill it |
| %CPU | CPU usage right now |
| %MEM | Physical memory usage as a percentage |
| VSZ | Virtual memory size in KB (address space reserved) |
| RSS | Resident Set Size in KB (actual physical memory used) |
| STAT | Process state: S (sleeping), R (running), Z (zombie), D (uninterruptible sleep) |
| COMMAND | The command that started this process |
RSS is the one you care about most. VSZ can be huge but harmless; it includes memory the process asked for but never touched. RSS is what is actually in RAM.
Find the top 20 memory hogs:
ps aux --sort=-%mem | head -20
See parent-child relationships in a tree:
ps -eo pid,ppid,cmd --forest
This shows which process spawned which. When you kill a parent, children might become orphans (adopted by PID 1) or die too. It depends on how the parent set things up.
Find a process by name:
pgrep -af nginx
The -a flag shows the full command line. The -f flag matches against the full command, not just the process name.
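One quick Go aside before the full step1 program: /proc/[pid]/status is the human-readable sibling of stat, one Key: value pair per line (Name, PPid, VmRSS, Threads, and so on), so it needs no clever parsing. A minimal sketch that prints a few of those fields for any PID (here our own); the PPid value is exactly what ps --forest uses to draw the tree:
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// printStatus dumps a few fields from /proc/[pid]/status.
// The file is a simple "Key:\tvalue" list, so no tricky parsing is needed.
func printStatus(pid int) error {
    f, err := os.Open(fmt.Sprintf("/proc/%d/status", pid))
    if err != nil {
        return err
    }
    defer f.Close()

    wanted := map[string]bool{"Name": true, "PPid": true, "VmRSS": true, "Threads": true}
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        key, value, ok := strings.Cut(scanner.Text(), ":")
        if ok && wanted[key] {
            fmt.Printf("%-8s %s\n", key, strings.TrimSpace(value))
        }
    }
    return scanner.Err()
}

func main() {
    // Inspect ourselves; pass any PID you own.
    if err := printStatus(os.Getpid()); err != nil {
        fmt.Println("Error:", err)
    }
}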
Go Code: List Processes From /proc
On Linux, every process has a directory under /proc/. The file /proc/[pid]/stat has the raw stats and /proc/[pid]/status has human-readable info.
step1/main.go
package main
import (
"fmt"
"os"
"path/filepath"
"sort"
"strconv"
"strings"
)
type ProcessInfo struct {
PID int
Name string
State string
RSS int // in KB
Threads int
}
func listProcesses() ([]ProcessInfo, error) {
entries, err := os.ReadDir("/proc")
if err != nil {
return nil, err
}
var procs []ProcessInfo
for _, entry := range entries {
if !entry.IsDir() {
continue
}
pid, err := strconv.Atoi(entry.Name())
if err != nil {
continue // not a PID directory
}
info, err := readProcessInfo(pid)
if err != nil {
continue // process may have exited
}
procs = append(procs, info)
}
return procs, nil
}
func readProcessInfo(pid int) (ProcessInfo, error) {
data, err := os.ReadFile(filepath.Join("/proc", strconv.Itoa(pid), "stat"))
if err != nil {
return ProcessInfo{}, err
}
line := string(data)
fields := strings.Fields(line)
// BUG: parse name from fields[1], RSS from fields[23]
name := strings.Trim(fields[1], "()")
rssPages, _ := strconv.Atoi(fields[23])
pageSize := os.Getpagesize()
rssKB := (rssPages * pageSize) / 1024
return ProcessInfo{
PID: pid,
Name: name,
RSS: rssKB,
}, nil
}
func main() {
procs, err := listProcesses()
if err != nil {
fmt.Println("Error:", err)
os.Exit(1)
}
// Sort by RSS descending
sort.Slice(procs, func(i, j int) bool {
return procs[i].RSS > procs[j].RSS
})
fmt.Printf("%-8s %-20s %10s\n", "PID", "NAME", "RSS (KB)")
fmt.Println(strings.Repeat("-", 42))
limit := 20
if len(procs) < limit {
limit = len(procs)
}
for _, p := range procs[:limit] {
fmt.Printf("%-8d %-20s %10d\n", p.PID, p.Name, p.RSS)
}
}
Run it:
go run step1/main.go
Expected output:
PID NAME RSS (KB)
------------------------------------------
1234 firefox 524288
5678 code 312456
9012 node 184320
3456 go 98304
...
The Bug
This code has a problem. The /proc/[pid]/stat file looks like this:
1234 (Web Content) S 1200 1234 1200 ...
The process name is in parentheses and can contain spaces. When the name is (Web Content), strings.Fields splits it into (Web and Content), and every later index shifts by one. fields[23] no longer holds RSS; it holds a neighboring value.
This bug is silent. You get wrong numbers and nothing crashes. The worst kind.
The Fix
Parse the name by finding the last ) in the line. Everything after that closing parenthesis has fixed field positions.
step1/main.go: fixed readProcessInfo:
func readProcessInfo(pid int) (ProcessInfo, error) {
data, err := os.ReadFile(filepath.Join("/proc", strconv.Itoa(pid), "stat"))
if err != nil {
return ProcessInfo{}, err
}
line := string(data)
// Find the last ')' — everything after it has fixed positions
closeIdx := strings.LastIndex(line, ")")
if closeIdx == -1 {
return ProcessInfo{}, fmt.Errorf("bad stat format for pid %d", pid)
}
// Name is between first '(' and last ')'
openIdx := strings.Index(line, "(")
name := line[openIdx+1 : closeIdx]
// Fields after ')' — skip the space after ')'
rest := strings.Fields(line[closeIdx+2:])
// rest[0] = state, rest[1] = ppid, ...; rest[21] = rss (field 24 in proc(5); index 23 in the naive fields split above)
if len(rest) < 22 {
return ProcessInfo{}, fmt.Errorf("not enough fields for pid %d", pid)
}
state := rest[0]
rssPages, _ := strconv.Atoi(rest[21])
pageSize := os.Getpagesize()
rssKB := (rssPages * pageSize) / 1024
threads, _ := strconv.Atoi(rest[17]) // rest[17] = num_threads (field 20 in proc(5))
return ProcessInfo{
PID: pid,
Name: name,
State: state,
RSS: rssKB,
Threads: threads,
}, nil
}
The key insight: strings.LastIndex(line, ")") handles any process name, even ones with nested parentheses. The Linux kernel guarantees the name is wrapped in ( and ).
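If you want to convince yourself the fix survives awkward names, run the same last-closing-parenthesis logic against synthetic stat lines. The parseStat helper below is hypothetical (step1/main.go does the parsing inside readProcessInfo), but the logic is identical:
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parseStat extracts name, state, and the RSS page count from a raw
// /proc/[pid]/stat line using the last-')' trick from the fix above.
func parseStat(line string) (name, state string, rssPages int, err error) {
    closeIdx := strings.LastIndex(line, ")")
    openIdx := strings.Index(line, "(")
    if openIdx == -1 || closeIdx == -1 || closeIdx < openIdx {
        return "", "", 0, fmt.Errorf("bad stat format")
    }
    name = line[openIdx+1 : closeIdx]
    rest := strings.Fields(line[closeIdx+1:])
    if len(rest) < 22 {
        return "", "", 0, fmt.Errorf("not enough fields")
    }
    state = rest[0]
    rssPages, err = strconv.Atoi(rest[21])
    return name, state, rssPages, err
}

func main() {
    // Synthetic lines: state, 20 filler fields, then rss = 999 pages.
    filler := strings.Repeat("0 ", 20)
    cases := []string{
        "1234 (nginx) S " + filler + "999 0 0",
        "1234 (Web Content) S " + filler + "999 0 0",
        "1234 (a) b (weird)) R " + filler + "999 0 0",
    }
    for _, line := range cases {
        name, state, rss, err := parseStat(line)
        fmt.Printf("name=%q state=%s rssPages=%d err=%v\n", name, state, rss, err)
    }
}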
Step 2: Killing Processes and Signals
Linux Commands
List all signals your system supports:
kill -l
You will see more than 60 entries on Linux; roughly the first 31 are the standard signals, and the rest are real-time signals you will rarely touch. The ones you use daily:
| Signal | Number | Meaning |
|---|---|---|
| SIGTERM | 15 | “Please stop.” The process gets a chance to clean up: close files, finish requests, flush buffers. |
| SIGKILL | 9 | “Stop now.” The kernel removes the process immediately. No cleanup, no signal handler, no choice. |
| SIGHUP | 1 | “Reload config.” Nginx, Apache, and many daemons reload their config on SIGHUP without restarting. |
| SIGUSR1 | 10 | Custom signal. Nginx uses it to reopen log files after rotation. |
| SIGINT | 2 | What Ctrl+C sends. Same as SIGTERM in most programs. |
Send SIGTERM (the default):
kill PID
Send SIGKILL when SIGTERM is ignored:
kill -9 PID
Reload a config:
kill -HUP PID
Kill by name pattern:
pkill -f "python server.py"
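These commands have direct Go equivalents even for processes you did not start yourself (so there is no exec.Cmd to hold on to): os.FindProcess plus Process.Signal. Sending signal 0 first is the classic probe for "does this PID exist and am I allowed to signal it". A minimal sketch; the sigsend name and argument handling are just for illustration:
package main

import (
    "fmt"
    "os"
    "strconv"
    "syscall"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Println("usage: sigsend <pid>")
        os.Exit(1)
    }
    pid, err := strconv.Atoi(os.Args[1])
    if err != nil {
        fmt.Println("bad pid:", err)
        os.Exit(1)
    }

    // On Unix, FindProcess always succeeds; it does not check that the PID exists.
    proc, _ := os.FindProcess(pid)

    // Signal 0 delivers nothing but reports whether the process exists
    // and whether we have permission to signal it.
    if err := proc.Signal(syscall.Signal(0)); err != nil {
        fmt.Printf("pid %d is not signalable: %v\n", pid, err)
        os.Exit(1)
    }

    // The equivalent of: kill -TERM <pid>
    if err := proc.Signal(syscall.SIGTERM); err != nil {
        fmt.Println("failed to send SIGTERM:", err)
        os.Exit(1)
    }
    fmt.Printf("sent SIGTERM to pid %d\n", pid)
}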
Go Code: Graceful Kill With Timeout
Write a program that starts a child process, sends SIGTERM, waits up to 5 seconds, then falls back to SIGKILL.
step2/main.go
package main
import (
"fmt"
"os"
"os/exec"
"time"
)
func main() {
// Start a child process that sleeps for 60 seconds
cmd := exec.Command("sleep", "60")
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
if err := cmd.Start(); err != nil {
fmt.Println("Failed to start:", err)
os.Exit(1)
}
fmt.Printf("Started child process (pid=%d)\n", cmd.Process.Pid)
// Wait 2 seconds then try to stop it
time.Sleep(2 * time.Second)
fmt.Println("Stopping child process...")
// BUG: just kill it immediately
err := cmd.Process.Kill()
if err != nil {
fmt.Println("Kill error:", err)
}
err = cmd.Wait()
fmt.Println("Process exited:", err)
}
Run it:
go run step2/main.go
Expected output:
Started child process (pid=12345)
Stopping child process...
Process exited: signal: killed
The Bug
cmd.Process.Kill() sends SIGKILL directly. The child never gets a chance to clean up. If the child had open files, database connections, or was in the middle of writing data, all of that is lost.
With sleep this doesn’t matter. But replace sleep with a real service and you get corrupted data, half-written files, and leaked resources.
The Fix
Send SIGTERM first. Wait with a timeout. Fall back to SIGKILL only if the process ignores SIGTERM.
step2/main.go: fixed:
package main
import (
"fmt"
"os"
"os/exec"
"syscall"
"time"
)
func gracefulStop(cmd *exec.Cmd, timeout time.Duration) error {
// Step 1: send SIGTERM
fmt.Println("Sending SIGTERM...")
if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
return fmt.Errorf("failed to send SIGTERM: %w", err)
}
// Step 2: wait with a timeout
done := make(chan error, 1)
go func() {
done <- cmd.Wait()
}()
select {
case err := <-done:
fmt.Println("Process exited gracefully")
return err
case <-time.After(timeout):
// Step 3: SIGTERM was ignored, send SIGKILL
fmt.Println("Timeout — sending SIGKILL...")
if err := cmd.Process.Kill(); err != nil {
return fmt.Errorf("failed to kill: %w", err)
}
return <-done
}
}
func main() {
cmd := exec.Command("sleep", "60")
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
if err := cmd.Start(); err != nil {
fmt.Println("Failed to start:", err)
os.Exit(1)
}
fmt.Printf("Started child process (pid=%d)\n", cmd.Process.Pid)
time.Sleep(2 * time.Second)
fmt.Println("Stopping child process...")
err := gracefulStop(cmd, 5*time.Second)
if err != nil {
fmt.Println("Process exited:", err)
}
}
Expected output:
Started child process (pid=12345)
Stopping child process...
Sending SIGTERM...
Process exited gracefully
Process exited: signal: terminated
The sleep command installs no SIGTERM handler, but the signal's default action terminates it, so it dies right away and you see the graceful branch. To exercise the timeout path, supervise something that ignores SIGTERM, for example sh -c 'trap "" TERM; sleep 60'. A real service with a signal handler also exits in the first branch, after finishing its cleanup.
The pattern is: SIGTERM, wait, SIGKILL. This is what docker stop does. It sends SIGTERM, waits 10 seconds (configurable with --time), then sends SIGKILL. Now you know why.
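Since Go 1.20 the standard library can express this escalation for you: exec.CommandContext kills the child when its context is canceled, Cmd.Cancel overrides what "kill" means (the default is SIGKILL), and Cmd.WaitDelay bounds how long to wait before forcing SIGKILL anyway. A minimal sketch of the same SIGTERM-then-SIGKILL policy; the 2-second context timeout just stands in for "time to stop":
package main

import (
    "context"
    "fmt"
    "os/exec"
    "syscall"
    "time"
)

func main() {
    // Cancel the context after 2 seconds to simulate "time to stop".
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    cmd := exec.CommandContext(ctx, "sleep", "60")

    // When ctx is done, send SIGTERM instead of the default SIGKILL.
    cmd.Cancel = func() error {
        return cmd.Process.Signal(syscall.SIGTERM)
    }
    // If the child ignores SIGTERM, force SIGKILL after 5 more seconds.
    cmd.WaitDelay = 5 * time.Second

    err := cmd.Run()
    fmt.Println("process exited:", err)
}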
Step 3: Handling Signals in Your Own Process
Linux Commands
When your Go program is running, you can send it signals from another terminal:
# Terminal 1: Run your Go program
go run main.go
# Terminal 2: Find the process and send signals
kill -TERM $(pgrep -f "go run main.go")
kill -HUP $(pgrep -f "go run main.go")
Or press Ctrl+C to send SIGINT.
Go Code: Catch Signals
Write a program that catches SIGTERM, SIGINT, and SIGHUP. On SIGTERM or SIGINT, do a graceful shutdown. On SIGHUP, reload configuration.
step3/main.go
package main
import (
"fmt"
"os"
"os/signal"
"syscall"
"time"
)
var config = map[string]string{
"log_level": "info",
"port": "8080",
}
func reloadConfig() {
fmt.Println("[config] reloading configuration...")
// In a real app, re-read the config file here
config["log_level"] = "debug"
fmt.Printf("[config] log_level is now: %s\n", config["log_level"])
}
func gracefulShutdown() {
fmt.Println("[shutdown] closing database connections...")
time.Sleep(500 * time.Millisecond)
fmt.Println("[shutdown] flushing logs...")
time.Sleep(300 * time.Millisecond)
fmt.Println("[shutdown] done. goodbye.")
}
func main() {
fmt.Println("[app] starting up...")
fmt.Printf("[app] config: %v\n", config)
// BUG: unbuffered channel
sigChan := make(chan os.Signal)
signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT, syscall.SIGHUP)
fmt.Println("[app] running. send me signals:")
fmt.Println(" kill -TERM <pid> — graceful shutdown")
fmt.Println(" kill -HUP <pid> — reload config")
fmt.Println(" Ctrl+C — graceful shutdown")
fmt.Printf("[app] my PID is %d\n", os.Getpid())
for {
sig := <-sigChan
switch sig {
case syscall.SIGHUP:
reloadConfig()
case syscall.SIGTERM, syscall.SIGINT:
fmt.Printf("\n[app] received %s\n", sig)
gracefulShutdown()
os.Exit(0)
}
}
}
Run it:
go run step3/main.go
Expected output:
[app] starting up...
[app] config: map[log_level:info port:8080]
[app] running. send me signals:
kill -TERM <pid> — graceful shutdown
kill -HUP <pid> — reload config
Ctrl+C — graceful shutdown
[app] my PID is 54321
Send SIGHUP from another terminal:
kill -HUP 54321
You see:
[config] reloading configuration...
[config] log_level is now: debug
Then send SIGTERM:
kill -TERM 54321
You see:
[app] received terminated
[shutdown] closing database connections...
[shutdown] flushing logs...
[shutdown] done. goodbye.
The Bug
Look at this line:
sigChan := make(chan os.Signal)
This creates an unbuffered channel. The signal.Notify function sends signals to this channel, but it does not block. If the channel is not ready to receive when the signal arrives, the signal is dropped silently.
This can happen in practice: if you send two signals quickly (SIGHUP then SIGTERM), and your code is busy handling the first one, the second signal is lost. Your process ignores SIGTERM and keeps running. You think it is hung. You send SIGKILL.
The Fix
Use a buffered channel:
sigChan := make(chan os.Signal, 1)
step3/main.go: the fixed line:
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT, syscall.SIGHUP)
The Go documentation for signal.Notify says it explicitly:
Package signal will not block sending to c: the caller must ensure that c has sufficient buffer space to keep up with the expected signal rate.
A buffer of 1 is enough for most cases. If you handle multiple signal types and worry about rapid delivery, use a larger buffer. But 1 is the standard pattern.
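If all you need is "shut down on SIGTERM or SIGINT" with no per-signal handling (unlike the SIGHUP reload above), signal.NotifyContext wraps the buffered-channel pattern for you and returns a context that is canceled on the first signal. A minimal sketch:
package main

import (
    "context"
    "fmt"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // ctx is canceled when SIGTERM or SIGINT arrives; stop() releases the registration.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    fmt.Printf("running (pid=%d), send SIGTERM or press Ctrl+C\n", os.Getpid())

    <-ctx.Done() // block until a signal cancels the context
    fmt.Println("signal received, shutting down...")
    // do cleanup here: close listeners, flush logs, and so on
}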
Step 4: Resource Limits
Linux Commands
Show current limits for your shell session:
ulimit -a
You will see things like:
open files (-n) 1024
max user processes (-u) 63304
virtual memory (-v) unlimited
Set the max open files higher (common for web servers and databases):
ulimit -n 65536
Set max virtual memory in KB:
ulimit -v 1048576
Run a CPU-heavy task with the lowest priority:
nice -n 19 ./heavy-task
Change the priority of a running process:
renice -n 10 -p PID
Nice values range from -20 (highest priority) to 19 (lowest). Only root can set negative nice values.
For a service managed by systemd, set the limit in the unit file instead, with LimitNOFILE=65536; the ulimit command only affects the current shell session.
Go Code: Set Resource Limits on Child Processes
Write a program that starts a child process with memory limits using syscall.Setrlimit and the SysProcAttr on exec.Cmd.
step4/main.go
package main
import (
"fmt"
"os"
"os/exec"
"syscall"
)
func main() {
// Start a child process with resource limits
cmd := exec.Command("sh", "-c", `
echo "Child started (pid=$$)"
echo "Allocating memory..."
# Allocate ~50MB by reading from /dev/urandom
head -c 52428800 /dev/urandom > /dev/null
echo "Done"
`)
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
// Set memory limit to 10MB (too low)
memLimit := uint64(10 * 1024 * 1024) // 10MB in bytes
cmd.SysProcAttr = &syscall.SysProcAttr{}
// Set RLIMIT_AS (address space limit) before starting
// We need to use a wrapper approach since SysProcAttr doesn't
// directly support rlimits on the child. Instead, set them in
// the child using a prlimit approach.
// For simplicity, we'll set limits using prlimit command
cmd = exec.Command("sh", "-c", fmt.Sprintf(`
ulimit -v %d
echo "Child started (pid=$$)"
echo "Memory limit set to %d KB"
echo "Allocating memory..."
# Try to allocate a large block
python3 -c "x = bytearray(%d); print('Allocated', len(x), 'bytes')"
`, memLimit/1024, memLimit/1024, 50*1024*1024))
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
fmt.Println("Starting child with memory limit...")
err := cmd.Run()
if err != nil {
// BUG: just print the error with no context
fmt.Println("Error:", err)
}
}
Run it:
go run step4/main.go
Expected output:
Starting child with memory limit...
Child started (pid=12345)
Memory limit set to 10240 KB
Allocating memory...
Error: exit status 1
The Bug
The child process gets killed or fails, and all you see is Error: exit status 1 or Error: signal: killed. There is no context about why it was killed.
In production, you see signal: killed in your logs and have no idea why. Was it the OOM killer? Was someone running kill -9? Was it a resource limit? You check dmesg, check systemd journal, check three dashboards, and waste 30 minutes.
The Fix
Check the exit status for signal-based kills. If the child was killed by a signal, report which signal. If it was SIGKILL and you know there is a resource limit, say so.
step4/main.go: fixed:
package main
import (
"errors"
"fmt"
"os"
"os/exec"
"syscall"
)
func runWithLimits(command string, memLimitKB uint64) error {
cmd := exec.Command("sh", "-c", fmt.Sprintf(`
ulimit -v %d 2>/dev/null
exec %s
`, memLimitKB, command))
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
fmt.Printf("Starting child with memory limit %d KB...\n", memLimitKB)
err := cmd.Run()
if err == nil {
fmt.Println("Child exited normally")
return nil
}
// Check if it was killed by a signal
var exitErr *exec.ExitError
if errors.As(err, &exitErr) {
status, ok := exitErr.Sys().(syscall.WaitStatus)
if ok {
if status.Signaled() {
sig := status.Signal()
fmt.Printf("Child killed by signal: %s (%d)\n", sig, sig)
if sig == syscall.SIGKILL {
fmt.Println("SIGKILL + resource limit = likely hit memory limit")
fmt.Println("Try increasing the memory limit or reducing usage")
}
if sig == syscall.SIGXFSZ {
fmt.Println("Hit file size limit (RLIMIT_FSIZE)")
}
return err
}
fmt.Printf("Child exited with code: %d\n", status.ExitStatus())
return err
}
}
fmt.Println("Error:", err)
return err
}
func main() {
fmt.Println("=== Test 1: Low memory limit (will likely fail) ===")
runWithLimits(
`python3 -c "x = bytearray(50*1024*1024); print('Allocated', len(x), 'bytes')"`,
10*1024, // 10MB
)
fmt.Println()
fmt.Println("=== Test 2: Generous memory limit (should succeed) ===")
runWithLimits(
`python3 -c "x = bytearray(10*1024*1024); print('Allocated', len(x), 'bytes')"`,
512*1024, // 512MB
)
}
Expected output:
=== Test 1: Low memory limit (will likely fail) ===
Starting child with memory limit 10240 KB...
Child killed by signal: killed (9)
SIGKILL + resource limit = likely hit memory limit
Try increasing the memory limit or reducing usage
=== Test 2: Generous memory limit (should succeed) ===
Starting child with memory limit 524288 KB...
Allocated 10485760 bytes
Child exited normally
The key is the exec.ExitError type. When a process is killed by a signal, Go wraps the exit status in this type. You can extract the WaitStatus and check Signaled() to see if it was a signal, and Signal() to see which one. This turns "signal: killed" into actionable information. One caveat: RLIMIT_AS usually makes allocations fail inside the process rather than killing it, so on many systems Test 1 takes the exit-code branch (Python raises MemoryError and exits with code 1) instead of the SIGKILL branch. A bare signal: killed with no limit configured more often points to the kernel OOM killer.
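The examples above set the limit in the child through a shell ulimit wrapper. A Go process can also adjust its own limits with syscall.Getrlimit and syscall.Setrlimit, the programmatic equivalent of ulimit -n. A minimal sketch that raises the soft open-file limit to the hard limit:
package main

import (
    "fmt"
    "syscall"
)

func main() {
    var lim syscall.Rlimit
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
        fmt.Println("getrlimit:", err)
        return
    }
    fmt.Printf("open files: soft=%d hard=%d\n", lim.Cur, lim.Max)

    // Raise the soft limit to the hard limit; going above the hard limit needs root.
    lim.Cur = lim.Max
    if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
        fmt.Println("setrlimit:", err)
        return
    }
    fmt.Printf("raised soft limit to %d\n", lim.Cur)
}
Recent Go releases already raise the soft open-file limit toward the hard limit at program start, so the two numbers may already match on your machine.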
Step 5: Running Processes in the Background
Linux Commands
The simplest way to run something in the background:
nohup ./server &
nohup means “no hangup,” and the process survives if you close the terminal. The & puts it in the background. But nohup is crude: no log management, no automatic restart, no resource limits.
Detach a running process from the shell:
disown
For production, use systemd. Here is a minimal unit file:
[Unit]
Description=My Go Service
After=network.target
[Service]
ExecStart=/usr/local/bin/myservice
Restart=always
RestartSec=5
LimitNOFILE=65536
User=deploy
[Install]
WantedBy=multi-user.target
Save this as /etc/systemd/system/myservice.service, then:
sudo systemctl daemon-reload
sudo systemctl start myservice
sudo systemctl status myservice
sudo journalctl -u myservice -f
Restart=always means systemd restarts the process if it exits for any reason. RestartSec=5 waits 5 seconds between restarts. LimitNOFILE=65536 sets the open file limit. This is what nohup cannot do.
Go Code: Auto-Restart Loop
Build a mini supervisor that keeps a child process running. If the child exits, restart it.
step5/main.go
package main
import (
"fmt"
"os"
"os/exec"
"time"
)
func main() {
command := "sh"
args := []string{"-c", `echo "Worker started (pid=$$)"; sleep 3; echo "Worker done"; exit 1`}
restarts := 0
for {
fmt.Printf("[supervisor] starting: %s %v\n", command, args)
cmd := exec.Command(command, args...)
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
startTime := time.Now()
err := cmd.Run()
elapsed := time.Since(startTime)
restarts++
fmt.Printf("[supervisor] process exited after %v (err=%v)\n", elapsed.Round(time.Millisecond), err)
// BUG: restart immediately every time
fmt.Printf("[supervisor] restarting (attempt %d)...\n\n", restarts)
}
}
Run it:
go run step5/main.go
Expected output:
[supervisor] starting: sh [-c echo "Worker started (pid=$$)"; sleep 3; echo "Worker done"; exit 1]
Worker started (pid=23456)
Worker done
[supervisor] process exited after 3.004s (err=exit status 1)
[supervisor] restarting (attempt 1)...
[supervisor] starting: sh [-c echo "Worker started (pid=$$)"; sleep 3; echo "Worker done"; exit 1]
Worker started (pid=23457)
Worker done
[supervisor] process exited after 3.003s (err=exit status 1)
[supervisor] restarting (attempt 2)...
The Bug
This supervisor restarts the child immediately every time it exits. That works fine when the child runs for 3 seconds before crashing. But what if the child crashes on startup? Bad config, missing file, permission error.
The child starts, crashes in 50 milliseconds, supervisor restarts it, crashes in 50 milliseconds, restarts, crashes. You get an infinite loop that:
- Fills the disk with log messages
- Burns CPU for no reason
- Makes the problem invisible because it scrolls off screen
This is called a crash loop. Kubernetes has the same problem, which is why it has CrashLoopBackOff.
The Fix
Add exponential backoff. If the child crashes quickly (within 5 seconds of starting), double the restart delay. If the child runs for more than 30 seconds before crashing, reset the delay. The crash is probably a new issue, not a startup failure.
step5/main.go: fixed:
package main
import (
"fmt"
"os"
"os/exec"
"time"
)
type BackoffPolicy struct {
Delay time.Duration
MinDelay time.Duration
MaxDelay time.Duration
QuickCrash time.Duration // if child runs less than this, it's a "fast crash"
StableTime time.Duration // if child runs more than this, reset delay
}
func defaultBackoff() BackoffPolicy {
return BackoffPolicy{
Delay: 1 * time.Second,
MinDelay: 1 * time.Second,
MaxDelay: 60 * time.Second,
QuickCrash: 5 * time.Second,
StableTime: 30 * time.Second,
}
}
func main() {
command := "sh"
args := []string{"-c", `echo "Worker started (pid=$$)"; sleep 1; echo "Worker crashed"; exit 1`}
restarts := 0
backoff := defaultBackoff()
for {
fmt.Printf("[supervisor] starting: %s\n", command)
cmd := exec.Command(command, args...)
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
startTime := time.Now()
err := cmd.Run()
elapsed := time.Since(startTime)
restarts++
fmt.Printf("[supervisor] process exited after %v (err=%v)\n",
elapsed.Round(time.Millisecond), err)
if elapsed < backoff.QuickCrash {
// Fast crash — increase backoff
backoff.Delay *= 2
if backoff.Delay > backoff.MaxDelay {
backoff.Delay = backoff.MaxDelay
}
fmt.Printf("[supervisor] fast crash detected — backoff: %v\n", backoff.Delay)
} else if elapsed > backoff.StableTime {
// Ran for a while — reset backoff
backoff.Delay = backoff.MinDelay
fmt.Println("[supervisor] process was stable — resetting backoff")
}
fmt.Printf("[supervisor] restarting in %v (attempt %d)...\n\n", backoff.Delay, restarts)
time.Sleep(backoff.Delay)
}
}
Expected output:
[supervisor] starting: sh
Worker started (pid=23456)
Worker crashed
[supervisor] process exited after 1.003s (err=exit status 1)
[supervisor] fast crash detected — backoff: 2s
[supervisor] restarting in 2s (attempt 1)...
[supervisor] starting: sh
Worker started (pid=23457)
Worker crashed
[supervisor] process exited after 1.002s (err=exit status 1)
[supervisor] fast crash detected — backoff: 4s
[supervisor] restarting in 4s (attempt 2)...
[supervisor] starting: sh
Worker started (pid=23458)
Worker crashed
[supervisor] process exited after 1.003s (err=exit status 1)
[supervisor] fast crash detected — backoff: 8s
[supervisor] restarting in 8s (attempt 3)...
The delay doubles on each fast crash: 2s, 4s, 8s, 16s, 32s, then caps at 60s. This gives you time to notice the problem in logs, fix the config, and the supervisor won't have burned through 10,000 restart cycles by then.
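One refinement the code above skips: if many supervisors restart workers that all crashed for the same reason (say, a shared database went down), pure doubling makes them retry in lockstep. A little random jitter spreads the retries out. A sketch of a hypothetical nextDelay helper, not part of step5/main.go:
package main

import (
    "fmt"
    "math/rand"
    "time"
)

// nextDelay doubles the delay, caps it at max, then adds up to 25% random
// jitter on top (so the actual wait can slightly exceed max). The jitter keeps
// a fleet of supervisors from restarting their workers at the same instant.
func nextDelay(current, max time.Duration) time.Duration {
    next := current * 2
    if next > max {
        next = max
    }
    jitter := time.Duration(rand.Int63n(int64(next / 4)))
    return next + jitter
}

func main() {
    d := 1 * time.Second
    for i := 0; i < 6; i++ {
        d = nextDelay(d, 60*time.Second)
        fmt.Printf("restart %d: wait %v\n", i+1, d.Round(time.Millisecond))
    }
}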
Step 6: Build a Process Supervisor With Health Checks
This is the final step. We combine everything from the previous steps into a mini process supervisor that:
- Starts a process
- Checks its health via HTTP every 10 seconds
- Restarts it if health checks fail 3 times in a row
- Handles SIGTERM gracefully (forwards it to the child)
- Uses exponential backoff for fast crashes
- Logs everything with timestamps and ANSI colors
The Supervised Process
First, a tiny HTTP server to supervise. This is the “worker” that our supervisor manages.
step6/worker/main.go
package main
import (
"fmt"
"log"
"net/http"
"os"
"os/signal"
"syscall"
"time"
)
func main() {
fmt.Printf("Worker started (pid=%d)\n", os.Getpid())
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
fmt.Fprintln(w, "ok")
})
server := &http.Server{Addr: ":8080"}
// Graceful shutdown on SIGTERM
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT)
go func() {
sig := <-sigChan
fmt.Printf("\nWorker received %s, shutting down...\n", sig)
server.Close()
}()
fmt.Println("Worker listening on :8080")
if err := server.ListenAndServe(); err != http.ErrServerClosed {
log.Fatal(err)
}
fmt.Println("Worker stopped cleanly")
// Simulate cleanup
time.Sleep(500 * time.Millisecond)
fmt.Println("Worker cleanup done")
}
Build it:
cd step6/worker && go build -o worker . && cd ../..
The Supervisor
step6/main.go
package main
import (
"errors"
"flag"
"fmt"
"net/http"
"os"
"os/exec"
"os/signal"
"strings"
"syscall"
"time"
)
// ANSI colors
const (
colorReset = "\033[0m"
colorRed = "\033[31m"
colorGreen = "\033[32m"
colorYellow = "\033[33m"
colorCyan = "\033[36m"
)
type SupervisorConfig struct {
Command string
HealthURL string
CheckInterval time.Duration
MaxFailures int
GraceTimeout time.Duration
MinBackoff time.Duration
MaxBackoff time.Duration
QuickCrashTime time.Duration
StableTime time.Duration
}
type Supervisor struct {
config SupervisorConfig
cmd *exec.Cmd
restarts int
startTime time.Time
backoff time.Duration
lastHealth string
running bool
shutdownChan chan struct{}
}
func NewSupervisor(cfg SupervisorConfig) *Supervisor {
return &Supervisor{
config: cfg,
backoff: cfg.MinBackoff,
shutdownChan: make(chan struct{}),
}
}
func (s *Supervisor) log(color, format string, args ...interface{}) {
msg := fmt.Sprintf(format, args...)
timestamp := time.Now().Format("15:04:05")
fmt.Printf("%s[supervisor %s]%s %s\n", color, timestamp, colorReset, msg)
}
func (s *Supervisor) startProcess() error {
parts := strings.Fields(s.config.Command)
s.cmd = exec.Command(parts[0], parts[1:]...)
s.cmd.Stdout = os.Stdout
s.cmd.Stderr = os.Stderr
// Set process group so we can kill the whole tree
s.cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
if err := s.cmd.Start(); err != nil {
return fmt.Errorf("failed to start: %w", err)
}
s.startTime = time.Now()
s.running = true
s.log(colorGreen, "process started (pid=%d)", s.cmd.Process.Pid)
return nil
}
func (s *Supervisor) stopProcess() error {
if s.cmd == nil || s.cmd.Process == nil {
return nil
}
s.log(colorYellow, "sending SIGTERM to pid=%d", s.cmd.Process.Pid)
if err := s.cmd.Process.Signal(syscall.SIGTERM); err != nil {
return err
}
done := make(chan error, 1)
go func() {
done <- s.cmd.Wait()
}()
select {
case err := <-done:
s.running = false
s.log(colorGreen, "process exited gracefully")
return err
case <-time.After(s.config.GraceTimeout):
s.log(colorRed, "grace timeout — sending SIGKILL")
s.cmd.Process.Kill()
err := <-done
s.running = false
return err
}
}
func (s *Supervisor) checkHealth() (int, error) {
client := &http.Client{Timeout: 5 * time.Second}
resp, err := client.Get(s.config.HealthURL)
if err != nil {
return 0, err
}
defer resp.Body.Close()
return resp.StatusCode, nil
}
func (s *Supervisor) waitForProcess() error {
return s.cmd.Wait()
}
func (s *Supervisor) updateBackoff(elapsed time.Duration) {
if elapsed < s.config.QuickCrashTime {
s.backoff *= 2
if s.backoff > s.config.MaxBackoff {
s.backoff = s.config.MaxBackoff
}
s.log(colorYellow, "fast crash (%v) — backoff: %v", elapsed.Round(time.Millisecond), s.backoff)
} else if elapsed > s.config.StableTime {
s.backoff = s.config.MinBackoff
s.log(colorCyan, "process was stable — resetting backoff")
}
}
func (s *Supervisor) Run() {
// Handle signals for graceful shutdown of supervisor itself
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT)
go func() {
sig := <-sigChan
s.log(colorYellow, "received %s — shutting down", sig)
close(s.shutdownChan)
if s.running {
s.stopProcess()
}
os.Exit(0)
}()
for {
s.log(colorCyan, "starting: %s", s.config.Command)
if err := s.startProcess(); err != nil {
s.log(colorRed, "start failed: %v", err)
s.log(colorYellow, "retrying in %v...", s.backoff)
time.Sleep(s.backoff)
s.backoff *= 2
if s.backoff > s.config.MaxBackoff {
s.backoff = s.config.MaxBackoff
}
continue
}
// Wait for the process to exit in the background
exitChan := make(chan error, 1)
go func() {
exitChan <- s.waitForProcess()
}()
// Run health check loop
failures := 0
ticker := time.NewTicker(s.config.CheckInterval)
healthLoop:
for {
select {
case err := <-exitChan:
// Process exited on its own
ticker.Stop()
s.running = false
elapsed := time.Since(s.startTime)
if err != nil {
s.log(colorRed, "process exited with error after %v: %v",
elapsed.Round(time.Millisecond), err)
// Check if killed by signal
var exitErr *exec.ExitError
if errors.As(err, &exitErr) {
if status, ok := exitErr.Sys().(syscall.WaitStatus); ok && status.Signaled() {
s.log(colorRed, "killed by signal: %s", status.Signal())
}
}
} else {
s.log(colorYellow, "process exited cleanly after %v",
elapsed.Round(time.Millisecond))
}
s.updateBackoff(elapsed)
break healthLoop
case <-ticker.C:
// Health check
statusCode, err := s.checkHealth()
uptime := time.Since(s.startTime).Round(time.Second)
if err != nil {
failures++
s.lastHealth = fmt.Sprintf("FAILED (%v)", err)
s.log(colorRed, "health check FAILED (%v) [%d/%d]",
err, failures, s.config.MaxFailures)
} else if statusCode >= 200 && statusCode < 300 {
failures = 0
s.lastHealth = fmt.Sprintf("OK (%d)", statusCode)
s.log(colorGreen, "health check OK (%d) — uptime: %v, restarts: %d",
statusCode, uptime, s.restarts)
} else {
failures++
s.lastHealth = fmt.Sprintf("UNHEALTHY (%d)", statusCode)
s.log(colorRed, "health check UNHEALTHY (%d) [%d/%d]",
statusCode, failures, s.config.MaxFailures)
}
if failures >= s.config.MaxFailures {
s.log(colorRed, "health check failed %d times — restarting process",
s.config.MaxFailures)
ticker.Stop()
s.stopProcess()
break healthLoop
}
case <-s.shutdownChan:
ticker.Stop()
return
}
}
s.restarts++
s.log(colorYellow, "restarting in %v (restart #%d)...", s.backoff, s.restarts)
time.Sleep(s.backoff)
}
}
func main() {
cmdFlag := flag.String("cmd", "", "Command to run and supervise")
healthFlag := flag.String("health", "", "Health check URL (e.g. http://localhost:8080/health)")
intervalFlag := flag.Duration("interval", 10*time.Second, "Health check interval")
maxFailFlag := flag.Int("max-failures", 3, "Consecutive failures before restart")
graceFlag := flag.Duration("grace", 10*time.Second, "Grace period for SIGTERM before SIGKILL")
flag.Parse()
if *cmdFlag == "" {
fmt.Println("Usage: supervisor --cmd './myserver' --health 'http://localhost:8080/health'")
fmt.Println()
fmt.Println("Flags:")
flag.PrintDefaults()
os.Exit(1)
}
config := SupervisorConfig{
Command: *cmdFlag,
HealthURL: *healthFlag,
CheckInterval: *intervalFlag,
MaxFailures: *maxFailFlag,
GraceTimeout: *graceFlag,
MinBackoff: 1 * time.Second,
MaxBackoff: 60 * time.Second,
QuickCrashTime: 5 * time.Second,
StableTime: 30 * time.Second,
}
s := NewSupervisor(config)
fmt.Println("=== Process Supervisor ===")
fmt.Printf("Command: %s\n", config.Command)
fmt.Printf("Health URL: %s\n", config.HealthURL)
fmt.Printf("Check interval: %v\n", config.CheckInterval)
fmt.Printf("Max failures: %d\n", config.MaxFailures)
fmt.Printf("Grace timeout: %v\n", config.GraceTimeout)
fmt.Println()
s.Run()
}
Build and run:
go build -o supervisor step6/main.go
./supervisor --cmd "./step6/worker/worker" --health "http://localhost:8080/health" --interval 10s
Expected output:
=== Process Supervisor ===
Command: ./step6/worker/worker
Health URL: http://localhost:8080/health
Check interval: 10s
Max failures: 3
Grace timeout: 10s
[supervisor 14:30:00] starting: ./step6/worker/worker
Worker started (pid=12345)
Worker listening on :8080
[supervisor 14:30:00] process started (pid=12345)
[supervisor 14:30:10] health check OK (200) — uptime: 10s, restarts: 0
[supervisor 14:30:20] health check OK (200) — uptime: 20s, restarts: 0
Now kill the worker from another terminal to simulate a crash:
kill -9 $(pgrep -f worker)
You will see:
[supervisor 14:30:25] process exited with error after 25.003s: signal: killed
[supervisor 14:30:25] killed by signal: killed
[supervisor 14:30:25] restarting in 1s (restart #1)...
[supervisor 14:30:26] starting: ./step6/worker/worker
Worker started (pid=12346)
Worker listening on :8080
[supervisor 14:30:26] process started (pid=12346)
[supervisor 14:30:36] health check OK (200) — uptime: 10s, restarts: 1
To test health check failure, stop the worker’s HTTP listener without killing it (or just test with a command that does not serve HTTP):
./supervisor --cmd "sleep 300" --health "http://localhost:8080/health" --interval 5s
[supervisor 14:35:00] starting: sleep 300
[supervisor 14:35:00] process started (pid=12347)
[supervisor 14:35:05] health check FAILED (Get "http://localhost:8080/health": dial tcp 127.0.0.1:8080: connect: connection refused) [1/3]
[supervisor 14:35:10] health check FAILED (Get "http://localhost:8080/health": dial tcp 127.0.0.1:8080: connect: connection refused) [2/3]
[supervisor 14:35:15] health check FAILED (Get "http://localhost:8080/health": dial tcp 127.0.0.1:8080: connect: connection refused) [3/3]
[supervisor 14:35:15] health check failed 3 times — restarting process
[supervisor 14:35:15] sending SIGTERM to pid=12347
[supervisor 14:35:15] process exited gracefully
[supervisor 14:35:15] restarting in 1s (restart #1)...
Send SIGTERM to the supervisor itself for a clean shutdown:
kill -TERM $(pgrep -f supervisor)
[supervisor 14:40:00] received terminated — shutting down
[supervisor 14:40:00] sending SIGTERM to pid=12346
Worker received terminated, shutting down...
Worker stopped cleanly
Worker cleanup done
[supervisor 14:40:01] process exited gracefully
The supervisor forwards SIGTERM to the child, waits for it to exit, then exits itself. No orphaned processes.
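One loose end: startProcess sets Setpgid: true so the worker gets its own process group, but stopProcess only signals the worker process itself. If the worker forks children of its own, you can signal the whole group by sending to the negative process-group ID, which is what the Setpgid was for. A sketch of the idea as a standalone program; signalGroup is a hypothetical helper, not part of step6/main.go:
package main

import (
    "fmt"
    "os"
    "os/exec"
    "syscall"
    "time"
)

// signalGroup sends sig to every process in pid's process group.
// It assumes the child was started with Setpgid: true, so the group
// contains the child and its descendants but not the supervisor.
func signalGroup(pid int, sig syscall.Signal) error {
    pgid, err := syscall.Getpgid(pid)
    if err != nil {
        return err
    }
    return syscall.Kill(-pgid, sig) // negative PID targets the whole group
}

func main() {
    // A worker that spawns its own child: sh -> sleep.
    cmd := exec.Command("sh", "-c", "sleep 300 & wait")
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
    if err := cmd.Start(); err != nil {
        fmt.Println("start:", err)
        os.Exit(1)
    }
    fmt.Printf("started sh (pid=%d) and its sleep child\n", cmd.Process.Pid)

    time.Sleep(1 * time.Second)
    if err := signalGroup(cmd.Process.Pid, syscall.SIGTERM); err != nil {
        fmt.Println("signal group:", err)
    }
    fmt.Println("wait:", cmd.Wait())
}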
What We Built
Each step showed the Linux command first, then built the same thing in Go:
- ps aux became a Go process lister that reads /proc directly. Trap: process names with spaces break naive parsing. Fix: find the last ) and parse from there.
- kill became a Go graceful shutdown function. Trap: using Kill() (SIGKILL) directly skips cleanup. Fix: send SIGTERM first, wait with a timeout, then SIGKILL.
- signal.Notify became a Go signal handler for SIGTERM, SIGINT, and SIGHUP. Trap: unbuffered channels drop signals. Fix: always use make(chan os.Signal, 1).
- ulimit and nice became Go resource limits on child processes. Trap: child gets SIGKILL with no useful error. Fix: check exec.ExitError and WaitStatus.Signal().
- nohup and systemd became a Go auto-restart loop. Trap: infinite restart on fast crashes. Fix: exponential backoff. Double the delay on quick crashes, reset on stable runs.
- Combined everything into a process supervisor with health checks, backoff, graceful shutdown, and colored log output.
Cheat Sheet
Linux Commands
ps aux --sort=-%mem | head -20 # top processes by memory
ps -eo pid,ppid,cmd --forest # process tree
pgrep -af nginx # find by name
kill -TERM PID # graceful stop
kill -9 PID # forced stop
kill -HUP PID # reload config
nice -n 19 ./task # low priority
ulimit -n 65536 # max open files
systemctl status myservice # service status
journalctl -u myservice -f # follow logs
Go Patterns
- os.FindProcess(pid) + proc.Signal(syscall.SIGTERM) for graceful stop of external processes
- signal.Notify(make(chan os.Signal, 1), ...): always use a buffered channel
- exec.Cmd.SysProcAttr for resource limits and namespaces on child processes
- exec.ExitError with WaitStatus.Signaled() and WaitStatus.Signal() to check how a process died
- Exponential backoff: double delay on fast crashes, reset on long runs, cap at a maximum
Key Rules
- Always SIGTERM first, SIGKILL as last resort
- Always use buffered channels for signal.Notify
- systemd is better than nohup for production; it handles logging, resource limits, and restarts
- Check if a child was killed by a signal vs exited normally; they need different handling
- Parse /proc/[pid]/stat carefully, because the process name field can contain spaces
References and Further Reading
- The /proc Filesystem, Linux Kernel Documentation
- signal(7), Linux Manual Page
- os/signal Package, Go Standard Library
- exec Package, Go Standard Library
- systemd.service, systemd Unit Configuration
- Exponential Backoff, Google Cloud Architecture
Keep Reading
- CPU Monitoring: From Linux Commands to a Go Dashboard: read /proc/stat for CPU metrics, the same filesystem used here for process data.
- Task Automation: From Cron and Make to a Go Task Runner: schedule and manage the processes you learned to supervise here.
- Service Health Checks: From curl to a Go Health Monitor: monitor the services your supervisor keeps running.
What's the worst runaway process you've dealt with in production? The kind that ate all the memory or filled the disk before anyone noticed?