
Process Management: From Linux Commands to a Go Supervisor

Karandeep Singh
• 26 minutes read

Summary

Process management from command line to Go code. Each step shows the Linux command first, then builds it in Go. Kill processes, send signals, set resource limits, build a process supervisor with auto-restart.

Every DevOps engineer manages processes. You kill stuck workers, restart crashed services, and watch memory usage climb until someone gets paged. Linux gives you the tools for all of this: ps, kill, systemctl, nice, ulimit. We are going to learn each one, then build the same patterns in Go until we have a mini process supervisor with health checks and auto-restart.

Prerequisites

  • A Linux system (native, WSL, or SSH to a remote server)
  • Go 1.21+ installed (go version to check)

Create a project directory:

mkdir process-mgmt && cd process-mgmt
go mod init process-mgmt

Step 1: Finding and Inspecting Processes

Linux Commands

The first thing you need when something is wrong is a list of what is running.

ps aux

This prints every process on the system. The columns:

Column    Meaning
USER      Who owns the process
PID       Process ID, the number you use to kill it
%CPU      CPU usage right now
%MEM      Physical memory usage as a percentage
VSZ       Virtual memory size in KB (address space reserved)
RSS       Resident Set Size in KB (actual physical memory used)
STAT      Process state: S (sleeping), R (running), Z (zombie), D (uninterruptible sleep)
COMMAND   The command that started this process

RSS is the one you care about most. VSZ can be huge but harmless; it includes memory the process asked for but never touched. RSS is what is actually in RAM.

Find the top 20 memory hogs:

ps aux --sort=-%mem | head -20

See parent-child relationships in a tree:

ps -eo pid,ppid,cmd --forest

This shows which process spawned which. When you kill a parent, children might become orphans (adopted by PID 1) or die too. It depends on how the parent set things up.

Find a process by name:

pgrep -af nginx

The -a flag shows the full command line. The -f flag matches against the full command, not just the process name.
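
pgrep is also easy to approximate from Go by scanning /proc/[pid]/cmdline, which is handy in tooling that should not shell out. A minimal sketch (the pgrepf helper and the nginx pattern are just illustrative):

package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// pgrepf returns PIDs whose full command line contains pattern,
// roughly what `pgrep -af pattern` does.
func pgrepf(pattern string) map[int]string {
	matches := map[int]string{}
	entries, _ := os.ReadDir("/proc")
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a PID directory
		}
		raw, err := os.ReadFile(filepath.Join("/proc", e.Name(), "cmdline"))
		if err != nil || len(raw) == 0 {
			continue // kernel thread or process already gone
		}
		// cmdline separates arguments with NUL bytes
		cmdline := strings.TrimSpace(string(bytes.ReplaceAll(raw, []byte{0}, []byte{' '})))
		if strings.Contains(cmdline, pattern) {
			matches[pid] = cmdline
		}
	}
	return matches
}

func main() {
	for pid, cmdline := range pgrepf("nginx") {
		fmt.Printf("%d %s\n", pid, cmdline)
	}
}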

Go Code: List Processes From /proc

On Linux, every process has a directory under /proc/. The file /proc/[pid]/stat has the raw stats and /proc/[pid]/status has human-readable info.

step1/main.go

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

type ProcessInfo struct {
	PID     int
	Name    string
	State   string
	RSS     int // in KB
	Threads int
}

func listProcesses() ([]ProcessInfo, error) {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return nil, err
	}

	var procs []ProcessInfo
	for _, entry := range entries {
		if !entry.IsDir() {
			continue
		}
		pid, err := strconv.Atoi(entry.Name())
		if err != nil {
			continue // not a PID directory
		}

		info, err := readProcessInfo(pid)
		if err != nil {
			continue // process may have exited
		}
		procs = append(procs, info)
	}

	return procs, nil
}

func readProcessInfo(pid int) (ProcessInfo, error) {
	data, err := os.ReadFile(filepath.Join("/proc", strconv.Itoa(pid), "stat"))
	if err != nil {
		return ProcessInfo{}, err
	}

	line := string(data)
	fields := strings.Fields(line)

	// BUG: parse name from fields[1], RSS from fields[23]
	name := strings.Trim(fields[1], "()")

	rssPages, _ := strconv.Atoi(fields[23])
	pageSize := os.Getpagesize()
	rssKB := (rssPages * pageSize) / 1024

	return ProcessInfo{
		PID:  pid,
		Name: name,
		RSS:  rssKB,
	}, nil
}

func main() {
	procs, err := listProcesses()
	if err != nil {
		fmt.Println("Error:", err)
		os.Exit(1)
	}

	// Sort by RSS descending
	sort.Slice(procs, func(i, j int) bool {
		return procs[i].RSS > procs[j].RSS
	})

	fmt.Printf("%-8s %-20s %10s\n", "PID", "NAME", "RSS (KB)")
	fmt.Println(strings.Repeat("-", 42))

	limit := 20
	if len(procs) < limit {
		limit = len(procs)
	}
	for _, p := range procs[:limit] {
		fmt.Printf("%-8d %-20s %10d\n", p.PID, p.Name, p.RSS)
	}
}

Run it:

go run step1/main.go

Expected output:

PID      NAME                   RSS (KB)
------------------------------------------
1234     firefox                 524288
5678     code                    312456
9012     node                    184320
3456     go                       98304
...

The Bug

This code has a problem. The /proc/[pid]/stat file looks like this:

1234 (Web Content) S 1200 1234 1200 ...

The process name is in parentheses and can contain spaces. When the name is (Web Content), strings.Fields splits it into (Web and Content), and every later index shifts by one. fields[23] is no longer RSS; it is some other value.

This bug is silent. You get wrong numbers and nothing crashes. The worst kind.

The Fix

Parse the name by finding the last ) in the line. Everything after that closing parenthesis has fixed field positions.

step1/main.go: fixed readProcessInfo:

func readProcessInfo(pid int) (ProcessInfo, error) {
	data, err := os.ReadFile(filepath.Join("/proc", strconv.Itoa(pid), "stat"))
	if err != nil {
		return ProcessInfo{}, err
	}

	line := string(data)

	// Find the last ')' — everything after it has fixed positions
	closeIdx := strings.LastIndex(line, ")")
	if closeIdx == -1 {
		return ProcessInfo{}, fmt.Errorf("bad stat format for pid %d", pid)
	}

	// Name is between first '(' and last ')'
	openIdx := strings.Index(line, "(")
	name := line[openIdx+1 : closeIdx]

	// Fields after ')' — skip the space after ')'
	rest := strings.Fields(line[closeIdx+2:])
	// rest[0] = state, rest[1] = ppid, ..., rest[21] = RSS
	// (field 24 in proc(5); index 21 here because state is index 0)
	if len(rest) < 22 {
		return ProcessInfo{}, fmt.Errorf("not enough fields for pid %d", pid)
	}

	state := rest[0]
	rssPages, _ := strconv.Atoi(rest[21])
	pageSize := os.Getpagesize()
	rssKB := (rssPages * pageSize) / 1024

	threads, _ := strconv.Atoi(rest[17])

	return ProcessInfo{
		PID:     pid,
		Name:    name,
		State:   state,
		RSS:     rssKB,
		Threads: threads,
	}, nil
}

The key insight: strings.LastIndex(line, ")") handles any process name, even ones with nested parentheses. The Linux kernel guarantees the name is wrapped in ( and ).
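
To convince yourself the fix holds, exercise the parsing logic against a hard-coded stat line with a space in the name. A small sketch (parseStatLine is a hypothetical extraction of the logic inside readProcessInfo):

package main

import (
	"fmt"
	"strings"
)

// parseStatLine is a standalone version of the name/state parsing in
// readProcessInfo, so it can be checked against a fixed input.
func parseStatLine(line string) (name, state string, ok bool) {
	openIdx := strings.Index(line, "(")
	closeIdx := strings.LastIndex(line, ")")
	if openIdx == -1 || closeIdx == -1 || closeIdx < openIdx {
		return "", "", false
	}
	rest := strings.Fields(line[closeIdx+1:])
	if len(rest) < 1 {
		return "", "", false
	}
	return line[openIdx+1 : closeIdx], rest[0], true
}

func main() {
	// A name with a space — the case that breaks strings.Fields
	line := "1234 (Web Content) S 1200 1234 1200 0 -1"
	name, state, ok := parseStatLine(line)
	fmt.Println(name, state, ok) // Web Content S true
}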


Step 2: Killing Processes and Signals

Linux Commands

List all signals your system supports:

kill -l

You will see about 30 signals. The ones you use daily:

Signal    Number  Meaning
SIGTERM   15      “Please stop.” The process gets a chance to clean up: close files, finish requests, flush buffers.
SIGKILL   9       “Stop now.” The kernel removes the process immediately. No cleanup, no signal handler, no choice.
SIGHUP    1       “Reload config.” Nginx, Apache, and many daemons reload their config on SIGHUP without restarting.
SIGUSR1   10      Custom signal. Nginx uses it to reopen log files after rotation.
SIGINT    2       What Ctrl+C sends. Same as SIGTERM in most programs.

Send SIGTERM (the default):

kill PID

Send SIGKILL when SIGTERM is ignored:

kill -9 PID

Reload a config:

kill -HUP PID

Kill by name pattern:

pkill -f "python server.py"
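
From Go, the equivalent of kill for a process you did not start yourself is os.FindProcess plus Signal. A minimal sketch (on Linux, FindProcess always succeeds even for a dead PID, so the real error comes from Signal):

package main

import (
	"fmt"
	"os"
	"strconv"
	"syscall"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: sigterm <pid>")
		os.Exit(1)
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Println("bad pid:", err)
		os.Exit(1)
	}

	// On Linux, FindProcess never fails; it just wraps the PID.
	proc, _ := os.FindProcess(pid)

	// Signal returns an error if the process does not exist
	// or we lack permission to signal it.
	if err := proc.Signal(syscall.SIGTERM); err != nil {
		fmt.Println("failed to send SIGTERM:", err)
		os.Exit(1)
	}
	fmt.Println("sent SIGTERM to", pid)
}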

Go Code: Graceful Kill With Timeout

Write a program that starts a child process, sends SIGTERM, waits up to 5 seconds, then falls back to SIGKILL.

step2/main.go

package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	// Start a child process that sleeps for 60 seconds
	cmd := exec.Command("sleep", "60")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Start(); err != nil {
		fmt.Println("Failed to start:", err)
		os.Exit(1)
	}
	fmt.Printf("Started child process (pid=%d)\n", cmd.Process.Pid)

	// Wait 2 seconds then try to stop it
	time.Sleep(2 * time.Second)
	fmt.Println("Stopping child process...")

	// BUG: just kill it immediately
	err := cmd.Process.Kill()
	if err != nil {
		fmt.Println("Kill error:", err)
	}

	err = cmd.Wait()
	fmt.Println("Process exited:", err)
}

Run it:

go run step2/main.go

Expected output:

Started child process (pid=12345)
Stopping child process...
Process exited: signal: killed

The Bug

cmd.Process.Kill() sends SIGKILL directly. The child never gets a chance to clean up. If the child had open files, database connections, or was in the middle of writing data, all of that is lost.

With sleep this doesn’t matter. But replace sleep with a real service and you get corrupted data, half-written files, and leaked resources.

The Fix

Send SIGTERM first. Wait with a timeout. Fall back to SIGKILL only if the process ignores SIGTERM.

step2/main.go: fixed:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

func gracefulStop(cmd *exec.Cmd, timeout time.Duration) error {
	// Step 1: send SIGTERM
	fmt.Println("Sending SIGTERM...")
	if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return fmt.Errorf("failed to send SIGTERM: %w", err)
	}

	// Step 2: wait with a timeout
	done := make(chan error, 1)
	go func() {
		done <- cmd.Wait()
	}()

	select {
	case err := <-done:
		fmt.Println("Process exited gracefully")
		return err
	case <-time.After(timeout):
		// Step 3: SIGTERM was ignored, send SIGKILL
		fmt.Println("Timeout — sending SIGKILL...")
		if err := cmd.Process.Kill(); err != nil {
			return fmt.Errorf("failed to kill: %w", err)
		}
		return <-done
	}
}

func main() {
	cmd := exec.Command("sleep", "60")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Start(); err != nil {
		fmt.Println("Failed to start:", err)
		os.Exit(1)
	}
	fmt.Printf("Started child process (pid=%d)\n", cmd.Process.Pid)

	time.Sleep(2 * time.Second)
	fmt.Println("Stopping child process...")

	err := gracefulStop(cmd, 5*time.Second)
	if err != nil {
		fmt.Println("Process exited:", err)
	}
}

Expected output:

Started child process (pid=12345)
Stopping child process...
Sending SIGTERM...
Process exited gracefully
Process exited: signal: terminated

The sleep command installs no SIGTERM handler, but the default action for SIGTERM is to terminate, so it exits as soon as the signal arrives and you see the graceful path. The SIGKILL fallback only fires when the child ignores or blocks SIGTERM, or when its shutdown takes longer than the timeout.

The pattern is: SIGTERM, wait, SIGKILL. This is what docker stop does. It sends SIGTERM, waits 10 seconds (configurable with --time), then sends SIGKILL. Now you know why.
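
If you want to see the fallback branch fire with this demo, run a child that ignores SIGTERM. One quick sketch is to wrap the sleep in a shell that traps and discards the signal:

// Swap this in for the exec.Command call above: the empty trap makes
// the shell ignore SIGTERM, so only the SIGKILL fallback can stop it.
cmd := exec.Command("sh", "-c", `trap "" TERM; sleep 60`)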


Step 3: Handling Signals in Your Own Process

Linux Commands

When your Go program is running, you can send it signals from another terminal:

# Terminal 1: Run your Go program
go run main.go

# Terminal 2: Find the process and send signals
kill -TERM $(pgrep -f "go run main.go")
kill -HUP $(pgrep -f "go run main.go")

Or press Ctrl+C to send SIGINT.

Go Code: Catch Signals

Write a program that catches SIGTERM, SIGINT, and SIGHUP. On SIGTERM or SIGINT, do a graceful shutdown. On SIGHUP, reload configuration.

step3/main.go

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

var config = map[string]string{
	"log_level": "info",
	"port":      "8080",
}

func reloadConfig() {
	fmt.Println("[config] reloading configuration...")
	// In a real app, re-read the config file here
	config["log_level"] = "debug"
	fmt.Printf("[config] log_level is now: %s\n", config["log_level"])
}

func gracefulShutdown() {
	fmt.Println("[shutdown] closing database connections...")
	time.Sleep(500 * time.Millisecond)
	fmt.Println("[shutdown] flushing logs...")
	time.Sleep(300 * time.Millisecond)
	fmt.Println("[shutdown] done. goodbye.")
}

func main() {
	fmt.Println("[app] starting up...")
	fmt.Printf("[app] config: %v\n", config)

	// BUG: unbuffered channel
	sigChan := make(chan os.Signal)
	signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT, syscall.SIGHUP)

	fmt.Println("[app] running. send me signals:")
	fmt.Println("  kill -TERM <pid>  — graceful shutdown")
	fmt.Println("  kill -HUP <pid>   — reload config")
	fmt.Println("  Ctrl+C            — graceful shutdown")
	fmt.Printf("[app] my PID is %d\n", os.Getpid())

	for {
		sig := <-sigChan
		switch sig {
		case syscall.SIGHUP:
			reloadConfig()
		case syscall.SIGTERM, syscall.SIGINT:
			fmt.Printf("\n[app] received %s\n", sig)
			gracefulShutdown()
			os.Exit(0)
		}
	}
}

Run it:

go run step3/main.go

Expected output:

[app] starting up...
[app] config: map[log_level:info port:8080]
[app] running. send me signals:
  kill -TERM <pid>  — graceful shutdown
  kill -HUP <pid>   — reload config
  Ctrl+C            — graceful shutdown
[app] my PID is 54321

Send SIGHUP from another terminal:

kill -HUP 54321

You see:

[config] reloading configuration...
[config] log_level is now: debug

Then send SIGTERM:

kill -TERM 54321

You see:

[app] received terminated
[shutdown] closing database connections...
[shutdown] flushing logs...
[shutdown] done. goodbye.

The Bug

Look at this line:

sigChan := make(chan os.Signal)

This creates an unbuffered channel. The signal.Notify function sends signals to this channel, but it does not block. If the channel is not ready to receive when the signal arrives, the signal is dropped silently.

This can happen in practice: if you send two signals quickly (SIGHUP then SIGTERM), and your code is busy handling the first one, the second signal is lost. Your process ignores SIGTERM and keeps running. You think it is hung. You send SIGKILL.

The Fix

Use a buffered channel:

sigChan := make(chan os.Signal, 1)

step3/main.go: the fixed line:

sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT, syscall.SIGHUP)

The Go documentation for signal.Notify says it explicitly:

Package signal will not block sending to c: the caller must ensure that c has sufficient buffer space to keep up with the expected signal rate.

A buffer of 1 is enough for most cases. If you handle multiple signal types and worry about rapid delivery, use a larger buffer. But 1 is the standard pattern.
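
Go 1.16 also added signal.NotifyContext, which wraps the buffered-channel pattern in a context. A sketch of the shutdown half (SIGHUP config reloads still need an explicit channel, so they are omitted here):

package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// ctx.Done() is closed when SIGTERM or SIGINT arrives;
	// stop() undoes the signal registration.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	fmt.Println("[app] running — press Ctrl+C or send SIGTERM to stop")

	<-ctx.Done()
	fmt.Println("[app] signal received, shutting down...")
	time.Sleep(300 * time.Millisecond) // simulate cleanup
	fmt.Println("[app] done. goodbye.")
}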


Step 4: Resource Limits

Linux Commands

Show current limits for your shell session:

ulimit -a

You will see things like:

open files          (-n) 1024
max user processes  (-u) 63304
virtual memory      (-v) unlimited

Set the max open files higher (common for web servers and databases):

ulimit -n 65536

Set max virtual memory in KB:

ulimit -v 1048576

Run a CPU-heavy task with the lowest priority:

nice -n 19 ./heavy-task

Change the priority of a running process:

renice -n 10 -p PID

Nice values range from -20 (highest priority) to 19 (lowest). Only root can set negative nice values.
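
The Go counterpart of renice is syscall.Setpriority with PRIO_PROCESS. A minimal sketch (a hypothetical mini-renice; negative values still require root):

package main

import (
	"fmt"
	"os"
	"strconv"
	"syscall"
)

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: renice <pid> <niceness>")
		os.Exit(1)
	}
	pid, _ := strconv.Atoi(os.Args[1])
	niceness, _ := strconv.Atoi(os.Args[2])

	// Equivalent of: renice -n <niceness> -p <pid>
	if err := syscall.Setpriority(syscall.PRIO_PROCESS, pid, niceness); err != nil {
		fmt.Println("setpriority failed:", err)
		os.Exit(1)
	}
	fmt.Printf("set nice value of pid %d to %d\n", pid, niceness)
}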

Go Code: Set Resource Limits on Child Processes

Write a program that starts a child process with a memory limit. exec.Cmd has no direct rlimit support, so the limit is applied with ulimit inside the child shell before the workload runs.

step4/main.go

package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Memory limit for the child: 10MB (deliberately too low)
	memLimit := uint64(10 * 1024 * 1024) // bytes

	// exec.Cmd has no direct rlimit support, so set the limit inside
	// the child shell with ulimit before running the workload.
	cmd := exec.Command("sh", "-c", fmt.Sprintf(`
		ulimit -v %d
		echo "Child started (pid=$$)"
		echo "Memory limit set to %d KB"
		echo "Allocating memory..."
		# Try to allocate a 50MB block
		python3 -c "x = bytearray(%d); print('Allocated', len(x), 'bytes')"
	`, memLimit/1024, memLimit/1024, 50*1024*1024))

	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	fmt.Println("Starting child with memory limit...")
	err := cmd.Run()
	if err != nil {
		// BUG: just print the error with no context
		fmt.Println("Error:", err)
	}
}

Run it:

go run step4/main.go

Expected output:

Starting child with memory limit...
Child started (pid=12345)
Memory limit set to 10240 KB
Allocating memory...
Error: exit status 1

The Bug

The child process gets killed or fails, and all you see is Error: exit status 1 or Error: signal: killed. There is no context about why it was killed.

In production, you see signal: killed in your logs and have no idea why. Was it the OOM killer? Was someone running kill -9? Was it a resource limit? You check dmesg, check systemd journal, check three dashboards, and waste 30 minutes.

The Fix

Check the exit status for signal-based kills. If the child was killed by a signal, report which signal. If it was SIGKILL and you know there is a resource limit, say so.

step4/main.go: fixed:

package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func runWithLimits(command string, memLimitKB uint64) error {
	cmd := exec.Command("sh", "-c", fmt.Sprintf(`
		ulimit -v %d 2>/dev/null
		exec %s
	`, memLimitKB, command))

	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	fmt.Printf("Starting child with memory limit %d KB...\n", memLimitKB)
	err := cmd.Run()
	if err == nil {
		fmt.Println("Child exited normally")
		return nil
	}

	// Check if it was killed by a signal
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		status, ok := exitErr.Sys().(syscall.WaitStatus)
		if ok {
			if status.Signaled() {
				sig := status.Signal()
				fmt.Printf("Child killed by signal: %s (%d)\n", sig, sig)

				if sig == syscall.SIGKILL {
					fmt.Println("SIGKILL + resource limit = likely hit memory limit")
					fmt.Println("Try increasing the memory limit or reducing usage")
				}
				if sig == syscall.SIGXFSZ {
					fmt.Println("Hit file size limit (RLIMIT_FSIZE)")
				}
				return err
			}

			fmt.Printf("Child exited with code: %d\n", status.ExitStatus())
			return err
		}
	}

	fmt.Println("Error:", err)
	return err
}

func main() {
	fmt.Println("=== Test 1: Low memory limit (will likely fail) ===")
	runWithLimits(
		`python3 -c "x = bytearray(50*1024*1024); print('Allocated', len(x), 'bytes')"`,
		10*1024, // 10MB
	)

	fmt.Println()

	fmt.Println("=== Test 2: Generous memory limit (should succeed) ===")
	runWithLimits(
		`python3 -c "x = bytearray(10*1024*1024); print('Allocated', len(x), 'bytes')"`,
		512*1024, // 512MB
	)
}

Expected output (the exact failure message and exit code from python3 vary by system):

=== Test 1: Low memory limit (will likely fail) ===
Starting child with memory limit 10240 KB...
MemoryError
Child exited with code: 1

=== Test 2: Generous memory limit (should succeed) ===
Starting child with memory limit 524288 KB...
Allocated 10485760 bytes
Child exited normally

The key is the exec.ExitError type. When a process exits abnormally, Go wraps the exit status in this type. You can extract the WaitStatus, check Signaled() to see whether a signal ended the process, and Signal() to see which one. Note the two failure modes: ulimit -v (RLIMIT_AS) makes allocations fail inside the child, so you see an ordinary non-zero exit, while cgroup memory limits and the kernel OOM killer deliver SIGKILL. The signal check is what turns that “signal: killed” into actionable information.
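
If you would rather not wrap the command in a shell, the prlimit(2) syscall can apply a limit to an already-running child from the parent. A sketch using the golang.org/x/sys/unix package (an extra dependency; RLIMIT_NOFILE is used here purely as an illustration):

package main

import (
	"fmt"
	"os"
	"os/exec"

	"golang.org/x/sys/unix"
)

func main() {
	// Start a child, then cap its open-file limit from the parent.
	cmd := exec.Command("sleep", "30")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		fmt.Println("start failed:", err)
		os.Exit(1)
	}

	// prlimit(2) sets (and optionally reads back) a limit on another PID.
	limit := unix.Rlimit{Cur: 64, Max: 64}
	var old unix.Rlimit
	if err := unix.Prlimit(cmd.Process.Pid, unix.RLIMIT_NOFILE, &limit, &old); err != nil {
		fmt.Println("prlimit failed:", err)
	} else {
		fmt.Printf("RLIMIT_NOFILE for pid %d: %d -> %d\n", cmd.Process.Pid, old.Cur, limit.Cur)
	}

	cmd.Wait()
}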


Step 5: Running Processes in the Background

Linux Commands

The simplest way to run something in the background:

nohup ./server &

nohup means “no hangup,” and the process survives if you close the terminal. The & puts it in the background. But nohup is crude: no log management, no automatic restart, no resource limits.

Detach a running process from the shell:

disown

For production, use systemd. Here is a minimal unit file:

[Unit]
Description=My Go Service
After=network.target

[Service]
ExecStart=/usr/local/bin/myservice
Restart=always
RestartSec=5
LimitNOFILE=65536
User=deploy

[Install]
WantedBy=multi-user.target

Save this as /etc/systemd/system/myservice.service, then:

sudo systemctl daemon-reload
sudo systemctl start myservice
sudo systemctl status myservice
sudo journalctl -u myservice -f

Restart=always means systemd restarts the process if it exits for any reason. RestartSec=5 waits 5 seconds between restarts. LimitNOFILE=65536 sets the open file limit. This is what nohup cannot do.

Go Code: Auto-Restart Loop

Build a mini supervisor that keeps a child process running. If the child exits, restart it.

step5/main.go

package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	command := "sh"
	args := []string{"-c", `echo "Worker started (pid=$$)"; sleep 3; echo "Worker done"; exit 1`}

	restarts := 0

	for {
		fmt.Printf("[supervisor] starting: %s %v\n", command, args)

		cmd := exec.Command(command, args...)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr

		startTime := time.Now()
		err := cmd.Run()
		elapsed := time.Since(startTime)

		restarts++
		fmt.Printf("[supervisor] process exited after %v (err=%v)\n", elapsed.Round(time.Millisecond), err)

		// BUG: restart immediately every time
		fmt.Printf("[supervisor] restarting (attempt %d)...\n\n", restarts)
	}
}

Run it:

go run step5/main.go

Expected output:

[supervisor] starting: sh [-c echo "Worker started (pid=$$)"; sleep 3; echo "Worker done"; exit 1]
Worker started (pid=23456)
Worker done
[supervisor] process exited after 3.004s (err=exit status 1)
[supervisor] restarting (attempt 1)...

[supervisor] starting: sh [-c echo "Worker started (pid=$$)"; sleep 3; echo "Worker done"; exit 1]
Worker started (pid=23457)
Worker done
[supervisor] process exited after 3.003s (err=exit status 1)
[supervisor] restarting (attempt 2)...

The Bug

This supervisor restarts the child immediately every time it exits. That works fine when the child runs for 3 seconds before crashing. But what if the child crashes on startup? Bad config, missing file, permission error.

The child starts, crashes in 50 milliseconds, supervisor restarts it, crashes in 50 milliseconds, restarts, crashes. You get an infinite loop that:

  • Fills the disk with log messages
  • Burns CPU for no reason
  • Makes the problem invisible because it scrolls off screen

This is called a crash loop. Kubernetes has the same problem, which is why it has CrashLoopBackOff.

The Fix

Add exponential backoff. If the child crashes quickly (within 5 seconds of starting), double the restart delay. If the child runs for more than 30 seconds before crashing, reset the delay. The crash is probably a new issue, not a startup failure.

step5/main.go: fixed:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

type BackoffPolicy struct {
	Delay      time.Duration
	MinDelay   time.Duration
	MaxDelay   time.Duration
	QuickCrash time.Duration // if child runs less than this, it's a "fast crash"
	StableTime time.Duration // if child runs more than this, reset delay
}

func defaultBackoff() BackoffPolicy {
	return BackoffPolicy{
		Delay:      1 * time.Second,
		MinDelay:   1 * time.Second,
		MaxDelay:   60 * time.Second,
		QuickCrash: 5 * time.Second,
		StableTime: 30 * time.Second,
	}
}

func main() {
	command := "sh"
	args := []string{"-c", `echo "Worker started (pid=$$)"; sleep 1; echo "Worker crashed"; exit 1`}

	restarts := 0
	backoff := defaultBackoff()

	for {
		fmt.Printf("[supervisor] starting: %s\n", command)

		cmd := exec.Command(command, args...)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr

		startTime := time.Now()
		err := cmd.Run()
		elapsed := time.Since(startTime)

		restarts++
		fmt.Printf("[supervisor] process exited after %v (err=%v)\n",
			elapsed.Round(time.Millisecond), err)

		if elapsed < backoff.QuickCrash {
			// Fast crash — increase backoff
			backoff.Delay *= 2
			if backoff.Delay > backoff.MaxDelay {
				backoff.Delay = backoff.MaxDelay
			}
			fmt.Printf("[supervisor] fast crash detected — backoff: %v\n", backoff.Delay)
		} else if elapsed > backoff.StableTime {
			// Ran for a while — reset backoff
			backoff.Delay = backoff.MinDelay
			fmt.Println("[supervisor] process was stable — resetting backoff")
		}

		fmt.Printf("[supervisor] restarting in %v (attempt %d)...\n\n", backoff.Delay, restarts)
		time.Sleep(backoff.Delay)
	}
}

Expected output:

[supervisor] starting: sh
Worker started (pid=23456)
Worker crashed
[supervisor] process exited after 1.003s (err=exit status 1)
[supervisor] fast crash detected — backoff: 2s
[supervisor] restarting in 2s (attempt 1)...

[supervisor] starting: sh
Worker started (pid=23457)
Worker crashed
[supervisor] process exited after 1.002s (err=exit status 1)
[supervisor] fast crash detected — backoff: 4s
[supervisor] restarting in 4s (attempt 2)...

[supervisor] starting: sh
Worker started (pid=23458)
Worker crashed
[supervisor] process exited after 1.003s (err=exit status 1)
[supervisor] fast crash detected — backoff: 8s
[supervisor] restarting in 8s (attempt 3)...

The delay doubles on each fast crash: 2s, 4s, 8s, 16s, 32s, then 60s (capped). This gives you time to notice the problem in the logs and fix the config, and the supervisor will not have burned through 10,000 restart cycles in the meantime.
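
One refinement worth knowing about: supervisors that manage many workers often add random jitter to the delay so a whole fleet does not restart in lockstep. A small sketch (the ±20% figure is an arbitrary choice):

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredDelay spreads restarts out by up to ±20% of the base delay
// so many workers do not all come back at the same instant.
func jitteredDelay(base time.Duration) time.Duration {
	jitter := time.Duration(rand.Int63n(int64(base)/5 + 1))
	if rand.Intn(2) == 0 {
		return base - jitter
	}
	return base + jitter
}

func main() {
	base := 4 * time.Second
	for i := 0; i < 3; i++ {
		fmt.Println(jitteredDelay(base))
	}
}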


Step 6: Build a Process Supervisor With Health Checks

This is the final step. We combine everything from the previous steps into a mini process supervisor that:

  1. Starts a process
  2. Checks its health via HTTP every 10 seconds
  3. Restarts it if health checks fail 3 times in a row
  4. Handles SIGTERM gracefully (forwards it to the child)
  5. Uses exponential backoff for fast crashes
  6. Logs everything with timestamps and ANSI colors

The Supervised Process

First, a tiny HTTP server to supervise. This is the “worker” that our supervisor manages.

step6/worker/main.go

package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	fmt.Printf("Worker started (pid=%d)\n", os.Getpid())

	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	})

	server := &http.Server{Addr: ":8080"}

	// Graceful shutdown on SIGTERM
	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT)

	go func() {
		sig := <-sigChan
		fmt.Printf("\nWorker received %s, shutting down...\n", sig)
		server.Close()
	}()

	fmt.Println("Worker listening on :8080")
	if err := server.ListenAndServe(); err != http.ErrServerClosed {
		log.Fatal(err)
	}
	fmt.Println("Worker stopped cleanly")

	// Simulate cleanup
	time.Sleep(500 * time.Millisecond)
	fmt.Println("Worker cleanup done")
}

Build it:

cd step6/worker && go build -o worker . && cd ../..

The Supervisor

step6/main.go

package main

import (
	"errors"
	"flag"
	"fmt"
	"net/http"
	"os"
	"os/exec"
	"os/signal"
	"strings"
	"syscall"
	"time"
)

// ANSI colors
const (
	colorReset  = "\033[0m"
	colorRed    = "\033[31m"
	colorGreen  = "\033[32m"
	colorYellow = "\033[33m"
	colorCyan   = "\033[36m"
)

type SupervisorConfig struct {
	Command        string
	HealthURL      string
	CheckInterval  time.Duration
	MaxFailures    int
	GraceTimeout   time.Duration
	MinBackoff     time.Duration
	MaxBackoff     time.Duration
	QuickCrashTime time.Duration
	StableTime     time.Duration
}

type Supervisor struct {
	config       SupervisorConfig
	cmd          *exec.Cmd
	exitChan     chan error // result of cmd.Wait, delivered exactly once per run
	restarts     int
	startTime    time.Time
	backoff      time.Duration
	lastHealth   string
	running      bool
	shutdownChan chan struct{}
}

func NewSupervisor(cfg SupervisorConfig) *Supervisor {
	return &Supervisor{
		config:       cfg,
		backoff:      cfg.MinBackoff,
		shutdownChan: make(chan struct{}),
	}
}

func (s *Supervisor) log(color, format string, args ...interface{}) {
	msg := fmt.Sprintf(format, args...)
	timestamp := time.Now().Format("15:04:05")
	fmt.Printf("%s[supervisor %s]%s %s\n", color, timestamp, colorReset, msg)
}

func (s *Supervisor) startProcess() error {
	parts := strings.Fields(s.config.Command)
	s.cmd = exec.Command(parts[0], parts[1:]...)
	s.cmd.Stdout = os.Stdout
	s.cmd.Stderr = os.Stderr

	// Run the child in its own process group so terminal signals aimed
	// at the supervisor (like Ctrl+C) are not delivered to it directly.
	s.cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}

	if err := s.cmd.Start(); err != nil {
		return fmt.Errorf("failed to start: %w", err)
	}

	// Call Wait exactly once per process and fan the result out through
	// exitChan; calling cmd.Wait from two goroutines is a race.
	s.exitChan = make(chan error, 1)
	go func() {
		s.exitChan <- s.cmd.Wait()
	}()

	s.startTime = time.Now()
	s.running = true
	s.log(colorGreen, "process started (pid=%d)", s.cmd.Process.Pid)
	return nil
}

func (s *Supervisor) stopProcess() error {
	if s.cmd == nil || s.cmd.Process == nil {
		return nil
	}

	s.log(colorYellow, "sending SIGTERM to pid=%d", s.cmd.Process.Pid)
	if err := s.cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return err
	}

	// Reuse the single Wait result published by startProcess instead of
	// calling cmd.Wait a second time here.
	select {
	case err := <-s.exitChan:
		s.running = false
		s.log(colorGreen, "process exited gracefully")
		return err
	case <-time.After(s.config.GraceTimeout):
		s.log(colorRed, "grace timeout — sending SIGKILL")
		s.cmd.Process.Kill()
		err := <-s.exitChan
		s.running = false
		return err
	}
}

func (s *Supervisor) checkHealth() (int, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(s.config.HealthURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func (s *Supervisor) updateBackoff(elapsed time.Duration) {
	if elapsed < s.config.QuickCrashTime {
		s.backoff *= 2
		if s.backoff > s.config.MaxBackoff {
			s.backoff = s.config.MaxBackoff
		}
		s.log(colorYellow, "fast crash (%v) — backoff: %v", elapsed.Round(time.Millisecond), s.backoff)
	} else if elapsed > s.config.StableTime {
		s.backoff = s.config.MinBackoff
		s.log(colorCyan, "process was stable — resetting backoff")
	}
}

func (s *Supervisor) Run() {
	// Handle signals for graceful shutdown of supervisor itself
	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT)

	go func() {
		sig := <-sigChan
		s.log(colorYellow, "received %s — shutting down", sig)
		// Closing shutdownChan unblocks the main loop, which stops the
		// child and returns.
		close(s.shutdownChan)
	}()

	for {
		s.log(colorCyan, "starting: %s", s.config.Command)

		if err := s.startProcess(); err != nil {
			s.log(colorRed, "start failed: %v", err)
			s.log(colorYellow, "retrying in %v...", s.backoff)
			time.Sleep(s.backoff)
			s.backoff *= 2
			if s.backoff > s.config.MaxBackoff {
				s.backoff = s.config.MaxBackoff
			}
			continue
		}

		// Run health check loop
		failures := 0
		ticker := time.NewTicker(s.config.CheckInterval)

	healthLoop:
		for {
			select {
			case err := <-s.exitChan:
				// Process exited on its own
				ticker.Stop()
				s.running = false
				elapsed := time.Since(s.startTime)

				if err != nil {
					s.log(colorRed, "process exited with error after %v: %v",
						elapsed.Round(time.Millisecond), err)

					// Check if killed by signal
					var exitErr *exec.ExitError
					if errors.As(err, &exitErr) {
						if status, ok := exitErr.Sys().(syscall.WaitStatus); ok && status.Signaled() {
							s.log(colorRed, "killed by signal: %s", status.Signal())
						}
					}
				} else {
					s.log(colorYellow, "process exited cleanly after %v",
						elapsed.Round(time.Millisecond))
				}

				s.updateBackoff(elapsed)
				break healthLoop

			case <-ticker.C:
				// Health check
				statusCode, err := s.checkHealth()
				uptime := time.Since(s.startTime).Round(time.Second)

				if err != nil {
					failures++
					s.lastHealth = fmt.Sprintf("FAILED (%v)", err)
					s.log(colorRed, "health check FAILED (%v) [%d/%d]",
						err, failures, s.config.MaxFailures)
				} else if statusCode >= 200 && statusCode < 300 {
					failures = 0
					s.lastHealth = fmt.Sprintf("OK (%d)", statusCode)
					s.log(colorGreen, "health check OK (%d) — uptime: %v, restarts: %d",
						statusCode, uptime, s.restarts)
				} else {
					failures++
					s.lastHealth = fmt.Sprintf("UNHEALTHY (%d)", statusCode)
					s.log(colorRed, "health check UNHEALTHY (%d) [%d/%d]",
						statusCode, failures, s.config.MaxFailures)
				}

				if failures >= s.config.MaxFailures {
					s.log(colorRed, "health check failed %d times — restarting process",
						s.config.MaxFailures)
					ticker.Stop()
					s.stopProcess()
					break healthLoop
				}

			case <-s.shutdownChan:
				ticker.Stop()
				s.stopProcess()
				return
			}
		}

		s.restarts++
		s.log(colorYellow, "restarting in %v (restart #%d)...", s.backoff, s.restarts)
		select {
		case <-time.After(s.backoff):
		case <-s.shutdownChan:
			return
		}
	}
}

func main() {
	cmdFlag := flag.String("cmd", "", "Command to run and supervise")
	healthFlag := flag.String("health", "", "Health check URL (e.g. http://localhost:8080/health)")
	intervalFlag := flag.Duration("interval", 10*time.Second, "Health check interval")
	maxFailFlag := flag.Int("max-failures", 3, "Consecutive failures before restart")
	graceFlag := flag.Duration("grace", 10*time.Second, "Grace period for SIGTERM before SIGKILL")
	flag.Parse()

	if *cmdFlag == "" {
		fmt.Println("Usage: supervisor --cmd './myserver' --health 'http://localhost:8080/health'")
		fmt.Println()
		fmt.Println("Flags:")
		flag.PrintDefaults()
		os.Exit(1)
	}

	config := SupervisorConfig{
		Command:        *cmdFlag,
		HealthURL:      *healthFlag,
		CheckInterval:  *intervalFlag,
		MaxFailures:    *maxFailFlag,
		GraceTimeout:   *graceFlag,
		MinBackoff:     1 * time.Second,
		MaxBackoff:     60 * time.Second,
		QuickCrashTime: 5 * time.Second,
		StableTime:     30 * time.Second,
	}

	s := NewSupervisor(config)

	fmt.Println("=== Process Supervisor ===")
	fmt.Printf("Command:        %s\n", config.Command)
	fmt.Printf("Health URL:     %s\n", config.HealthURL)
	fmt.Printf("Check interval: %v\n", config.CheckInterval)
	fmt.Printf("Max failures:   %d\n", config.MaxFailures)
	fmt.Printf("Grace timeout:  %v\n", config.GraceTimeout)
	fmt.Println()

	s.Run()
}

Build and run:

go build -o supervisor step6/main.go
./supervisor --cmd "./step6/worker/worker" --health "http://localhost:8080/health" --interval 10s

Expected output:

=== Process Supervisor ===
Command:        ./step6/worker/worker
Health URL:     http://localhost:8080/health
Check interval: 10s
Max failures:   3
Grace timeout:  10s

[supervisor 14:30:00] starting: ./step6/worker/worker
Worker started (pid=12345)
Worker listening on :8080
[supervisor 14:30:00] process started (pid=12345)
[supervisor 14:30:10] health check OK (200) — uptime: 10s, restarts: 0
[supervisor 14:30:20] health check OK (200) — uptime: 20s, restarts: 0

Now kill the worker from another terminal to simulate a crash:

kill -9 $(pgrep -x worker)

You will see:

[supervisor 14:30:25] process exited with error after 25.003s: signal: killed
[supervisor 14:30:25] killed by signal: killed
[supervisor 14:30:25] restarting in 1s (restart #1)...

[supervisor 14:30:26] starting: ./step6/worker/worker
Worker started (pid=12346)
Worker listening on :8080
[supervisor 14:30:26] process started (pid=12346)
[supervisor 14:30:36] health check OK (200) — uptime: 10s, restarts: 1

To test health check failure, stop the worker’s HTTP listener without killing it (or just test with a command that does not serve HTTP):

./supervisor --cmd "sleep 300" --health "http://localhost:8080/health" --interval 5s
[supervisor 14:35:00] starting: sleep 300
[supervisor 14:35:00] process started (pid=12347)
[supervisor 14:35:05] health check FAILED (Get "http://localhost:8080/health": dial tcp 127.0.0.1:8080: connect: connection refused) [1/3]
[supervisor 14:35:10] health check FAILED (Get "http://localhost:8080/health": dial tcp 127.0.0.1:8080: connect: connection refused) [2/3]
[supervisor 14:35:15] health check FAILED (Get "http://localhost:8080/health": dial tcp 127.0.0.1:8080: connect: connection refused) [3/3]
[supervisor 14:35:15] health check failed 3 times — restarting process
[supervisor 14:35:15] sending SIGTERM to pid=12347
[supervisor 14:35:15] process exited gracefully
[supervisor 14:35:15] restarting in 1s (restart #1)...

Send SIGTERM to the supervisor itself for a clean shutdown:

kill -TERM $(pgrep -f supervisor)
[supervisor 14:40:00] received terminated — shutting down
[supervisor 14:40:00] sending SIGTERM to pid=12346
Worker received terminated, shutting down...
Worker stopped cleanly
Worker cleanup done
[supervisor 14:40:01] process exited gracefully

The supervisor forwards SIGTERM to the child, waits for it to exit, then exits itself. No orphaned processes.


What We Built

Each step showed the Linux command first, then built the same thing in Go:

  1. ps aux became a Go process lister that reads /proc directly. Trap: process names with spaces break naive parsing. Fix: find the last ) and parse from there.

  2. kill became a Go graceful shutdown function. Trap: using Kill() (SIGKILL) directly skips cleanup. Fix: send SIGTERM first, wait with a timeout, then SIGKILL.

  3. signal.Notify became a Go signal handler for SIGTERM, SIGINT, and SIGHUP. Trap: unbuffered channels drop signals. Fix: always use make(chan os.Signal, 1).

  4. ulimit and nice became Go resource limits on child processes. Trap: the child fails or is killed with only an opaque exit error. Fix: check exec.ExitError and WaitStatus.Signal().

  5. nohup and systemd became a Go auto-restart loop. Trap: infinite restart on fast crashes. Fix: exponential backoff. Double the delay on quick crashes, reset on stable runs.

  6. Combined everything into a process supervisor with health checks, backoff, graceful shutdown, and colored log output.


Cheat Sheet

Linux Commands

ps aux --sort=-%mem | head -20         # top processes by memory
ps -eo pid,ppid,cmd --forest           # process tree
pgrep -af nginx                        # find by name
kill -TERM PID                         # graceful stop
kill -9 PID                            # forced stop
kill -HUP PID                          # reload config
nice -n 19 ./task                      # low priority
ulimit -n 65536                        # max open files
systemctl status myservice             # service status
journalctl -u myservice -f             # follow logs

Go Patterns

  • os.FindProcess(pid) + proc.Signal(syscall.SIGTERM) for graceful stop of external processes
  • signal.Notify(make(chan os.Signal, 1), ...): always use a buffered channel
  • exec.Cmd.SysProcAttr for resource limits and namespaces on child processes
  • exec.ExitError with WaitStatus.Signaled() and WaitStatus.Signal() to check how a process died
  • Exponential backoff: double delay on fast crashes, reset on long runs, cap at a maximum

Key Rules

  • Always SIGTERM first, SIGKILL as last resort
  • Always use buffered channels for signal.Notify
  • Always add backoff to restart loops. Infinite restarts fill disks and hide the real problem
  • systemd is better than nohup for production; it handles logging, resource limits, and restarts
  • Check if a child was killed by a signal vs exited normally; they need different handling
  • Parse /proc/[pid]/stat carefully, because the process name field can contain spaces

Question

What's the worst runaway process you've dealt with in production? The kind that ate all the memory or filled the disk before anyone noticed?
