/user/KayD @ karandeepsingh.ca :~$ cat service-health-checks.md

Service Health Checks: From curl to a Go Health Monitor

Karandeep Singh
• 26 minute read

Summary

Master service health monitoring with Linux commands and Go. From curl and /proc basics to building a complete health monitor with HTTP checks, TCP probes, resource monitoring, and retry-aware state tracking.

Every service goes down eventually. The question is whether you find out before your users do or after.

This guide starts with basic Linux commands for checking service health. Then it builds each check in Go. By the end, you will have a complete health monitoring tool that checks HTTP endpoints, TCP ports, disk space, memory, and processes. It tracks state, retries on failures, and prints color-coded output to your terminal.

Each step follows the same pattern. Run the Linux command first. Then build the same thing in Go. Then make a mistake, see it break, and fix it.

Step 1: HTTP Health Checks with curl

Most services expose an HTTP endpoint for health checks. The path is usually /health or /healthz. The response tells you if the service is running.
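
If you need a local target to experiment against, here is a minimal sketch of such an endpoint in Go. The port and path match the examples in this guide; a real handler would verify its dependencies (database, cache) before answering.

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		// A real service would check its dependencies here before reporting OK.
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}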

Linux commands

The simplest check uses curl with two flags. The -s flag silences the progress meter and error messages. The -f flag makes curl return a non-zero exit code on HTTP errors (4xx and 5xx).

curl -sf http://localhost:8080/health

If the service is healthy, you get the response body. If the server returns a 4xx or 5xx, curl exits with code 22 and prints nothing; if the connection fails outright, curl exits with a different non-zero code (7 for connection refused). Check the exit code with $?.

curl -sf http://localhost:8080/health
echo $?
# 0 means healthy, non-zero means something is wrong

Sometimes you only care about the status code. Use -o /dev/null to discard the body and -w to print just the HTTP code.

curl -o /dev/null -s -w "%{http_code}\n" http://localhost:8080/health
# Output: 200

Timeouts matter. Without them, curl will hang if the server accepts the connection but never responds. Set a connection timeout and a total timeout.

curl --connect-timeout 5 --max-time 10 -sf http://localhost:8080/health

--connect-timeout 5 gives the TCP handshake 5 seconds. --max-time 10 caps the entire request at 10 seconds. If either limit is hit, curl exits with an error.

Build it in Go

Here is a basic HTTP health checker in Go.

package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func checkHTTP(url string) error {
	client := &http.Client{
		Timeout: 10 * time.Second,
	}

	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("request failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("unhealthy status code: %d", resp.StatusCode)
	}

	return nil
}

func main() {
	url := "http://localhost:8080/health"
	if len(os.Args) > 1 {
		url = os.Args[1]
	}

	err := checkHTTP(url)
	if err != nil {
		fmt.Fprintf(os.Stderr, "UNHEALTHY: %s\n", err)
		os.Exit(1)
	}

	fmt.Println("HEALTHY")
}

The http.Client has a 10-second timeout. This covers the entire request: DNS lookup, TCP handshake, TLS handshake, sending the request, and reading the response. If anything takes longer than 10 seconds total, the client returns an error.
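
The client-wide Timeout is usually what a health checker wants. If you need a different deadline per request, you can use a context instead. A minimal sketch (assumes the context, fmt, net/http, and time imports):

func checkHTTPWithDeadline(client *http.Client, url string, d time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return fmt.Errorf("build request: %w", err)
	}

	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("request failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("unhealthy status code: %d", resp.StatusCode)
	}
	return nil
}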

The bug: no timeout

What happens if you use http.Get() directly?

package main

import (
	"fmt"
	"net/http"
)

func checkHTTPBroken(url string) error {
	// BUG: http.Get uses the default client, which has no timeout
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != 200 {
		return fmt.Errorf("status: %d", resp.StatusCode)
	}

	return nil
}

func main() {
	err := checkHTTPBroken("http://localhost:8080/health")
	if err != nil {
		fmt.Println("UNHEALTHY:", err)
		return
	}
	fmt.Println("HEALTHY")
}

The default http.Client in Go has no timeout. Zero. If the server accepts the TCP connection but never sends a response, your program hangs forever. This is a real problem in health checks. A health checker that hangs is worse than no health checker at all. It will consume a goroutine and a file descriptor, and you will never get a result.

The fix

Always create your own http.Client with a timeout.

client := &http.Client{
	Timeout: 10 * time.Second,
}
resp, err := client.Get(url)

This is the single most important rule for HTTP requests in Go. Never use http.DefaultClient in production code. Never use http.Get() directly. Always set a timeout.

Step 2: TCP Port Checks

Not every service has an HTTP endpoint. Databases, caches, and message queues often speak binary protocols. But you can still check if they are listening on a port.

Linux commands

The classic tool is nc (netcat). The -z flag means scan mode (do not send data). The -v flag prints the result.

nc -zv localhost 5432
# Connection to localhost 5432 port [tcp/postgresql] succeeded!

If the port is closed or the host is unreachable, nc prints an error and exits with a non-zero code.

You can also use bash's built-in TCP support. This opens a TCP connection to the given host and port. The timeout command kills the attempt after 3 seconds if it does not complete.

timeout 3 bash -c '</dev/tcp/localhost/5432' && echo "Port open" || echo "Port closed"

To check what is listening on your system, use ss. The -t flag selects TCP sockets, -l shows only listening sockets, -n prints numeric addresses instead of resolving names, and -p shows the owning process.

ss -tlnp | grep :5432
# LISTEN  0  244  0.0.0.0:5432  0.0.0.0:*  users:(("postgres",pid=1234,fd=5))

This tells you that PostgreSQL is listening on port 5432, its PID is 1234, and it is using file descriptor 5.

Build it in Go

Go’s net package provides DialTimeout, which does exactly what we need.

package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func checkTCP(host string, port string) error {
	address := net.JoinHostPort(host, port)
	conn, err := net.DialTimeout("tcp", address, 5*time.Second)
	if err != nil {
		return fmt.Errorf("tcp connect failed: %w", err)
	}
	defer conn.Close()

	return nil
}

func main() {
	host := "localhost"
	port := "5432"
	if len(os.Args) > 2 {
		host = os.Args[1]
		port = os.Args[2]
	}

	err := checkTCP(host, port)
	if err != nil {
		fmt.Fprintf(os.Stderr, "UNHEALTHY: %s\n", err)
		os.Exit(1)
	}

	fmt.Printf("HEALTHY: %s:%s is accepting connections\n", host, port)
}

net.DialTimeout attempts a TCP connection. If the port is listening, it returns a connection. If the port is closed or the timeout is reached, it returns an error.

net.JoinHostPort properly handles IPv6 addresses by wrapping them in brackets. Always use it instead of fmt.Sprintf("%s:%s", host, port).
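
For example:

fmt.Println(net.JoinHostPort("::1", "5432"))       // [::1]:5432
fmt.Println(net.JoinHostPort("localhost", "5432")) // localhost:5432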

The bug: file descriptor leak

Here is a version that checks the port but forgets to close the connection.

package main

import (
	"fmt"
	"net"
	"time"
)

func checkTCPBroken(host string, port string) error {
	address := net.JoinHostPort(host, port)
	conn, err := net.DialTimeout("tcp", address, 5*time.Second)
	if err != nil {
		return fmt.Errorf("tcp connect failed: %w", err)
	}

	// BUG: we opened a connection but never closed it
	_ = conn

	return nil
}

func main() {
	for i := 0; i < 100000; i++ {
		err := checkTCPBroken("localhost", "5432")
		if err != nil {
			fmt.Println("check failed:", err)
			return
		}
	}
	fmt.Println("all checks passed")
}

Each call to DialTimeout opens a TCP connection and creates a file descriptor. Without closing the connection, these file descriptors pile up. On Linux, the default file descriptor limit is usually 1024 per process. After roughly 1020 health checks, the next dial fails with socket: too many open files.

Even if your health check runs once per minute, after 17 hours you hit the limit. In practice, you also leak connections on the server side. The server sees thousands of established connections that will never close (until the OS cleans them up with TCP keepalive timeouts, which can be hours).
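
You can check how close you are to the limit from inside the process. A minimal sketch using syscall.Getrlimit (Linux):

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// RLIMIT_NOFILE is the per-process cap on open file descriptors.
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Printf("fd limit: soft=%d hard=%d\n", rl.Cur, rl.Max)
}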

The fix

Close the connection immediately after a successful dial.

conn, err := net.DialTimeout("tcp", address, 5*time.Second)
if err != nil {
	return fmt.Errorf("tcp connect failed: %w", err)
}
defer conn.Close()

The defer conn.Close() line must come right after the nil check on err. This is a standard Go pattern: defer the cleanup call immediately after the error check succeeds. If you put other code between the successful open and the defer, you risk returning early without closing the resource.

Step 3: Disk and Memory Checks from /proc

A service can respond to HTTP requests and accept TCP connections but still be minutes away from crashing. If disk space runs out, logs stop writing, databases corrupt, and containers get evicted. If memory runs out, the OOM killer picks a process and terminates it.

Linux commands

Check disk usage with df. The -h flag makes it human-readable.

df -h /
# Filesystem      Size  Used Avail Use%  Mounted on
# /dev/sda1       100G   67G   33G  67%  /

The Use% column is what matters. When it hits 90%, you should be worried. At 100%, things break.

Check memory with free. The -m flag shows megabytes.

free -m
#               total    used    free   shared  buff/cache  available
# Mem:          16384    8192    2048      512        6144       7680
# Swap:          4096       0    4096

The available column is the important one, not free. Linux uses free memory for disk cache. The available column accounts for this and tells you how much memory is actually available for new processes.

You can read the raw data from /proc/meminfo.

grep MemAvailable /proc/meminfo
# MemAvailable:    7864320 kB

Check how much space your logs are consuming.

du -sh /var/log/
# 2.3G    /var/log/

Build it in Go

Go can read /proc/meminfo like any text file. For disk space, use syscall.Statfs.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

type MemInfo struct {
	TotalKB     uint64
	AvailableKB uint64
}

func readMemInfo() (MemInfo, error) {
	file, err := os.Open("/proc/meminfo")
	if err != nil {
		return MemInfo{}, fmt.Errorf("open /proc/meminfo: %w", err)
	}
	defer file.Close()

	var info MemInfo
	scanner := bufio.NewScanner(file)

	for scanner.Scan() {
		line := scanner.Text()
		parts := strings.Fields(line)
		if len(parts) < 2 {
			continue
		}

		key := strings.TrimSuffix(parts[0], ":")
		value, err := strconv.ParseUint(parts[1], 10, 64)
		if err != nil {
			continue
		}

		switch key {
		case "MemTotal":
			info.TotalKB = value
		case "MemAvailable":
			info.AvailableKB = value
		}
	}

	if err := scanner.Err(); err != nil {
		return MemInfo{}, fmt.Errorf("read /proc/meminfo: %w", err)
	}

	return info, nil
}

type DiskInfo struct {
	TotalBytes uint64
	FreeBytes  uint64
	UsedPct    float64
}

func readDiskInfo(path string) (DiskInfo, error) {
	var stat syscall.Statfs_t
	err := syscall.Statfs(path, &stat)
	if err != nil {
		return DiskInfo{}, fmt.Errorf("statfs %s: %w", path, err)
	}

	total := stat.Blocks * uint64(stat.Bsize)
	free := stat.Bavail * uint64(stat.Bsize)
	used := total - free
	usedPct := float64(used) / float64(total) * 100

	return DiskInfo{
		TotalBytes: total,
		FreeBytes:  free,
		UsedPct:    usedPct,
	}, nil
}

func main() {
	mem, err := readMemInfo()
	if err != nil {
		fmt.Fprintf(os.Stderr, "memory check failed: %s\n", err)
		os.Exit(1)
	}

	usedKB := mem.TotalKB - mem.AvailableKB
	usedPct := float64(usedKB) / float64(mem.TotalKB) * 100

	fmt.Printf("Memory: %d MB total, %d MB available (%.1f%% used)\n",
		mem.TotalKB/1024, mem.AvailableKB/1024, usedPct)

	disk, err := readDiskInfo("/")
	if err != nil {
		fmt.Fprintf(os.Stderr, "disk check failed: %s\n", err)
		os.Exit(1)
	}

	fmt.Printf("Disk /: %.1f GB total, %.1f GB free (%.1f%% used)\n",
		float64(disk.TotalBytes)/1e9, float64(disk.FreeBytes)/1e9, disk.UsedPct)
}

syscall.Statfs fills a Statfs_t struct with filesystem statistics. Blocks is the total number of blocks. Bavail is the number of blocks available to unprivileged users (this accounts for the root-reserved space). Bsize is the block size in bytes. Note that df computes Use% as used divided by used plus available, which excludes the root-reserved blocks from the denominator, so its percentage can differ from this calculation by a point or two.

The bug: wrong unit conversion

Here is a version with a subtle math error.

func readMemInfoBroken() {
	// Pretend we parsed MemTotal: 16384000 kB from /proc/meminfo
	memTotalKB := uint64(16384000)

	// BUG: treating kB value as bytes
	memTotalGB := float64(memTotalKB) / (1024 * 1024 * 1024)

	fmt.Printf("Total memory: %.2f GB\n", memTotalGB)
	// Output: Total memory: 0.02 GB
	// That is wrong. A machine with 16 GB of RAM reports 0.02 GB.
}

The value in /proc/meminfo is in kB (kibibytes, meaning 1024 bytes). But the code treats it as if it were in bytes, dividing by 1024 three times instead of two.

The correct conversion from kB to GB is: divide by 1024 twice (kB to MB, then MB to GB). Or multiply by 1024 first (kB to bytes), then divide by 1024 three times.

The fix

Multiply the kB value by 1024 to get bytes first, then convert.

func readMemInfoFixed() {
	memTotalKB := uint64(16384000)

	// Correct: kB * 1024 = bytes, then bytes / (1024^3) = GB
	// Simplified: kB / (1024 * 1024) = GB
	memTotalGB := float64(memTotalKB) / (1024 * 1024)

	fmt.Printf("Total memory: %.2f GB\n", memTotalGB)
	// Output: Total memory: 15.63 GB
	// Correct. 16384000 kB is about 15.63 GiB.
}

This is a common mistake. The /proc/meminfo values are labeled kB but they actually mean KiB (kibibytes, 1024 bytes), not kilobytes (1000 bytes). The Linux kernel documentation confirms this. Always multiply by 1024 when converting the values in /proc/meminfo to bytes.

Step 4: Process Health Checks

Sometimes you need to check if a specific process is running. The service might not have an HTTP endpoint. Or it might have one, but you want to verify the process itself has not become a zombie.

Linux commands

Find a process by name with pgrep. The -f flag matches the full command line, not just the process name.

pgrep -f myapp
# 12345

If the process exists, pgrep prints its PID and exits with code 0. If not, it prints nothing and exits with code 1.

Check if a specific PID is alive with signal 0. Signal 0 does not actually send a signal. It just checks if the process exists and you have permission to signal it.

kill -0 12345
echo $?
# 0 means alive, 1 means dead or no permission

Read the process state from /proc.

grep State /proc/12345/status
# State:    S (sleeping)

The state letters are:

  • R = running or runnable
  • S = sleeping (waiting for an event)
  • D = uninterruptible sleep (usually disk I/O, cannot be killed)
  • Z = zombie (finished but parent has not collected the exit status)
  • T = stopped (by a signal or debugger)

A zombie process is a problem. It means the parent process is not calling wait(). If you see a D state process, something is stuck in I/O. Both of these warrant investigation.

Count open file descriptors for a process.

ls /proc/12345/fd/ | wc -l
# 47

If this number is growing over time, you have a file descriptor leak. Compare it against the limit.

grep "open files" /proc/12345/limits
# Max open files  1024  1048576  files

Build it in Go

Here is a Go function that checks if a process is alive.

package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

type ProcessInfo struct {
	PID   int
	State string
	Name  string
	FDs   int
}

func checkProcess(pid int) (ProcessInfo, error) {
	info := ProcessInfo{PID: pid}

	// Check if process exists using signal 0
	err := syscall.Kill(pid, 0)
	if err != nil {
		return info, fmt.Errorf("process %d not found: %w", pid, err)
	}

	// Read process status
	statusPath := fmt.Sprintf("/proc/%d/status", pid)
	file, err := os.Open(statusPath)
	if err != nil {
		return info, fmt.Errorf("open %s: %w", statusPath, err)
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Text()
		parts := strings.SplitN(line, ":\t", 2)
		if len(parts) != 2 {
			continue
		}

		key := strings.TrimSpace(parts[0])
		value := strings.TrimSpace(parts[1])

		switch key {
		case "Name":
			info.Name = value
		case "State":
			info.State = value
		}
	}

	// Count open file descriptors
	fdPath := fmt.Sprintf("/proc/%d/fd", pid)
	entries, err := os.ReadDir(fdPath)
	if err == nil {
		info.FDs = len(entries)
	}

	return info, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: processcheck <pid>")
		os.Exit(1)
	}

	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Fprintf(os.Stderr, "invalid pid: %s\n", os.Args[1])
		os.Exit(1)
	}

	info, err := checkProcess(pid)
	if err != nil {
		fmt.Fprintf(os.Stderr, "UNHEALTHY: %s\n", err)
		os.Exit(1)
	}

	fmt.Printf("Process %d (%s): %s, %d open FDs\n",
		info.PID, info.Name, info.State, info.FDs)
}

This function does three things. First, it sends signal 0 to check if the process exists. Second, it reads /proc/<pid>/status to get the process name and state. Third, it counts entries in /proc/<pid>/fd to find the number of open file descriptors.

The bug: os.FindProcess always succeeds

Here is a version that tries to use os.FindProcess to check if a process exists.

package main

import (
	"fmt"
	"os"
	"strconv"
)

func checkProcessBroken(pid int) error {
	// BUG: os.FindProcess on Linux ALWAYS returns a non-nil Process
	// It does not check if the process actually exists
	proc, err := os.FindProcess(pid)
	if err != nil {
		return fmt.Errorf("process %d not found", pid)
	}

	fmt.Printf("Found process: %+v\n", proc)
	return nil
}

func main() {
	pid := 99999999 // almost certainly does not exist
	if len(os.Args) > 1 {
		pid, _ = strconv.Atoi(os.Args[1])
	}

	err := checkProcessBroken(pid)
	if err != nil {
		fmt.Println("Process not found")
	} else {
		fmt.Println("Process found!")
		// Output: Process found!
		// This is WRONG. PID 99999999 does not exist.
	}
}

On Linux, os.FindProcess always returns a valid Process struct and a nil error, regardless of whether the PID exists. It is designed this way because on Unix systems, FindProcess is just a wrapper that stores the PID. It does not perform any system call to verify the process.

This means your health check will always report “process found” for any PID, even completely nonexistent ones.

The fix

Use syscall.Kill with signal 0. This actually makes a system call.

func checkProcessFixed(pid int) error {
	err := syscall.Kill(pid, 0)
	if err == nil {
		return nil // process exists and we have permission
	}

	if err == syscall.ESRCH {
		return fmt.Errorf("process %d does not exist", pid)
	}

	if err == syscall.EPERM {
		// Process exists but we do not have permission to signal it.
		// This still means the process is alive.
		return nil
	}

	return fmt.Errorf("unexpected error checking process %d: %w", pid, err)
}

syscall.Kill(pid, 0) sends signal 0, which does nothing to the process but returns an error if the process does not exist. The kernel returns ESRCH (no such process) if the PID is not found. It returns EPERM (permission denied) if the process exists but you do not have permission to signal it. EPERM means the process IS alive. You just cannot kill it. That is still a successful health check.

Step 5: Health Checks with Retries and Degraded State

A single failed check does not always mean the service is down. Networks hiccup. Servers take a moment to warm up. Disk usage spikes during log rotation and drops again. A good health checker accounts for this.

Linux commands

Here is a bash script that runs the health check several times in a row and uses exit codes to distinguish healthy, degraded, and unhealthy.

#!/bin/bash
# healthcheck.sh - exit 0=healthy, 1=degraded, 2=unhealthy

URL="http://localhost:8080/health"
MAX_RETRIES=3
RETRY_DELAY=2

failures=0

for i in $(seq 1 $MAX_RETRIES); do
    if curl --connect-timeout 5 --max-time 10 -sf "$URL" > /dev/null 2>&1; then
        echo "Check $i: OK"
    else
        failures=$((failures + 1))
        echo "Check $i: FAILED"
    fi
    if [ $i -lt $MAX_RETRIES ]; then
        sleep $RETRY_DELAY
    fi
done

if [ $failures -eq 0 ]; then
    echo "HEALTHY: all $MAX_RETRIES checks passed"
    exit 0
elif [ $failures -eq $MAX_RETRIES ]; then
    echo "UNHEALTHY: all $MAX_RETRIES checks failed"
    exit 2
else
    echo "DEGRADED: $failures of $MAX_RETRIES checks failed"
    exit 1
fi

You can watch a health check in real time with watch.

watch -n 5 'curl -sf localhost:8080/health && echo OK || echo FAIL'

This runs the curl command every 5 seconds and displays the result. It is useful for manual monitoring during deploys or incidents.

Build it in Go

A real health checker needs to track state over time. A single failure should not immediately mark the service as unhealthy. And a single success should not immediately mark it as healthy after a long outage. This is the same logic that Kubernetes uses for readiness and liveness probes.

package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

type HealthState int

const (
	StateHealthy  HealthState = iota
	StateDegraded
	StateUnhealthy
)

func (s HealthState) String() string {
	switch s {
	case StateHealthy:
		return "HEALTHY"
	case StateDegraded:
		return "DEGRADED"
	case StateUnhealthy:
		return "UNHEALTHY"
	default:
		return "UNKNOWN"
	}
}

type CheckResult struct {
	Healthy  bool
	Duration time.Duration
	Error    string
}

type StateTracker struct {
	mu                  sync.Mutex
	state               HealthState
	consecutiveFailures int
	consecutiveSuccess  int
	failureThreshold    int
	successThreshold    int
	lastCheck           time.Time
	lastResult          CheckResult
}

func NewStateTracker(failureThreshold, successThreshold int) *StateTracker {
	return &StateTracker{
		state:            StateHealthy,
		failureThreshold: failureThreshold,
		successThreshold: successThreshold,
	}
}

func (t *StateTracker) RecordResult(result CheckResult) HealthState {
	t.mu.Lock()
	defer t.mu.Unlock()

	t.lastCheck = time.Now()
	t.lastResult = result

	if result.Healthy {
		t.consecutiveFailures = 0
		t.consecutiveSuccess++

		if t.consecutiveSuccess >= t.successThreshold {
			t.state = StateHealthy
		}
	} else {
		t.consecutiveSuccess = 0
		t.consecutiveFailures++

		if t.consecutiveFailures >= t.failureThreshold {
			t.state = StateUnhealthy
		} else if t.consecutiveFailures >= 1 && t.state == StateHealthy {
			t.state = StateDegraded
		}
	}

	return t.state
}

func (t *StateTracker) State() HealthState {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.state
}

func httpCheck(url string, timeout time.Duration) CheckResult {
	start := time.Now()
	client := &http.Client{Timeout: timeout}

	resp, err := client.Get(url)
	duration := time.Since(start)

	if err != nil {
		return CheckResult{
			Healthy:  false,
			Duration: duration,
			Error:    err.Error(),
		}
	}
	defer resp.Body.Close()

	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return CheckResult{
			Healthy:  false,
			Duration: duration,
			Error:    fmt.Sprintf("status code %d", resp.StatusCode),
		}
	}

	return CheckResult{
		Healthy:  true,
		Duration: duration,
	}
}

func main() {
	tracker := NewStateTracker(3, 2)
	url := "http://localhost:8080/health"
	interval := 5 * time.Second
	timeout := 10 * time.Second

	fmt.Printf("Monitoring %s every %s\n", url, interval)
	fmt.Printf("Failure threshold: 3 consecutive, Recovery threshold: 2 consecutive\n\n")

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	// Run first check immediately
	result := httpCheck(url, timeout)
	state := tracker.RecordResult(result)
	printResult(url, result, state)

	for range ticker.C {
		result := httpCheck(url, timeout)
		state := tracker.RecordResult(result)
		printResult(url, result, state)
	}
}

func printResult(url string, result CheckResult, state HealthState) {
	status := "OK"
	if !result.Healthy {
		status = "FAIL"
	}
	fmt.Printf("[%s] %s %s (%s) - state: %s\n",
		time.Now().Format("15:04:05"),
		status,
		url,
		result.Duration.Round(time.Millisecond),
		state)
	if result.Error != "" {
		fmt.Printf("  error: %s\n", result.Error)
	}
}

The StateTracker requires 3 consecutive failures before changing from healthy to unhealthy. And it requires 2 consecutive successes before changing from unhealthy back to healthy. This prevents flapping.

The bug: state flapping

Here is what happens without thresholds.

func recordResultBroken(healthy bool) string {
	// BUG: state changes on every single check result
	if healthy {
		return "HEALTHY"
	}
	return "UNHEALTHY"
}

If a server drops one request out of ten, the health check oscillates between healthy and unhealthy every few seconds. Downstream systems that depend on the health check (like load balancers) will keep adding and removing the server. This causes connection errors for users as their requests are routed to a server that is being removed, or traffic spikes on other servers as this one is taken out of rotation.

14:00:01 HEALTHY
14:00:06 UNHEALTHY   <- one timeout
14:00:11 HEALTHY     <- back to normal
14:00:16 HEALTHY
14:00:21 UNHEALTHY   <- another timeout
14:00:26 HEALTHY     <- back again

The fix

Require N consecutive failures before changing state. The StateTracker above implements this. With a failure threshold of 3, a single timeout does not change the state. The server must fail 3 times in a row before it is marked unhealthy. And it must succeed 2 times in a row before it is marked healthy again.

This is exactly how Kubernetes readiness probes work. The failureThreshold and successThreshold fields in a pod spec control the same behavior.

Step 6: Complete Health Monitor Dashboard

Now combine everything into a single tool. This monitor checks HTTP endpoints, TCP ports, disk space, memory, and processes. It prints a color-coded dashboard to the terminal.

Configuration

Define checks as a list of structs.

package main

import (
	"bufio"
	"fmt"
	"net"
	"net/http"
	"os"
	"strconv"
	"strings"
	"sync"
	"syscall"
	"time"
)

type CheckType string

const (
	CheckHTTP    CheckType = "http"
	CheckTCP     CheckType = "tcp"
	CheckProcess CheckType = "process"
	CheckDisk    CheckType = "disk"
	CheckMemory  CheckType = "memory"
)

type HealthState int

const (
	StateHealthy  HealthState = iota
	StateDegraded
	StateUnhealthy
)

func (s HealthState) String() string {
	switch s {
	case StateHealthy:
		return "HEALTHY"
	case StateDegraded:
		return "DEGRADED"
	case StateUnhealthy:
		return "UNHEALTHY"
	default:
		return "UNKNOWN"
	}
}

type CheckConfig struct {
	Name     string
	Type     CheckType
	Target   string    // URL, host:port, PID, or path depending on type
	Timeout  time.Duration
	Warning  float64   // warning threshold (percentage for disk/memory)
	Critical float64   // critical threshold
}

type CheckResult struct {
	Healthy  bool
	Duration time.Duration
	Error    string
	Details  string
}

type CheckState struct {
	mu                  sync.Mutex
	config              CheckConfig
	state               HealthState
	consecutiveFailures int
	consecutiveSuccess  int
	lastCheck           time.Time
	lastResult          CheckResult
	history             []HealthState
}

func NewCheckState(config CheckConfig) *CheckState {
	return &CheckState{
		config:  config,
		state:   StateHealthy,
		history: make([]HealthState, 0, 10),
	}
}

func (cs *CheckState) RecordResult(result CheckResult) {
	cs.mu.Lock()
	defer cs.mu.Unlock()

	cs.lastCheck = time.Now()
	cs.lastResult = result

	if result.Healthy {
		cs.consecutiveFailures = 0
		cs.consecutiveSuccess++
		if cs.consecutiveSuccess >= 2 {
			cs.state = StateHealthy
		}
	} else {
		cs.consecutiveSuccess = 0
		cs.consecutiveFailures++
		if cs.consecutiveFailures >= 3 {
			cs.state = StateUnhealthy
		} else if cs.state == StateHealthy {
			cs.state = StateDegraded
		}
	}

	cs.history = append(cs.history, cs.state)
	if len(cs.history) > 10 {
		cs.history = cs.history[len(cs.history)-10:]
	}
}

Check implementations

Each check type has its own function. All return a CheckResult.

func runCheck(config CheckConfig) CheckResult {
	switch config.Type {
	case CheckHTTP:
		return checkHTTP(config)
	case CheckTCP:
		return checkTCP(config)
	case CheckProcess:
		return checkProcess(config)
	case CheckDisk:
		return checkDisk(config)
	case CheckMemory:
		return checkMemory(config)
	default:
		return CheckResult{
			Healthy: false,
			Error:   fmt.Sprintf("unknown check type: %s", config.Type),
		}
	}
}

func checkHTTP(config CheckConfig) CheckResult {
	start := time.Now()
	client := &http.Client{Timeout: config.Timeout}

	resp, err := client.Get(config.Target)
	duration := time.Since(start)

	if err != nil {
		return CheckResult{
			Healthy:  false,
			Duration: duration,
			Error:    err.Error(),
		}
	}
	defer resp.Body.Close()

	healthy := resp.StatusCode >= 200 && resp.StatusCode < 300
	result := CheckResult{
		Healthy:  healthy,
		Duration: duration,
		Details:  fmt.Sprintf("HTTP %d", resp.StatusCode),
	}
	if !healthy {
		result.Error = fmt.Sprintf("status code %d", resp.StatusCode)
	}
	return result
}

func checkTCP(config CheckConfig) CheckResult {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", config.Target, config.Timeout)
	duration := time.Since(start)

	if err != nil {
		return CheckResult{
			Healthy:  false,
			Duration: duration,
			Error:    err.Error(),
		}
	}
	conn.Close()

	return CheckResult{
		Healthy:  true,
		Duration: duration,
		Details:  "connection accepted",
	}
}

func checkProcess(config CheckConfig) CheckResult {
	start := time.Now()
	pid, err := strconv.Atoi(config.Target)
	if err != nil {
		return CheckResult{
			Healthy:  false,
			Duration: time.Since(start),
			Error:    fmt.Sprintf("invalid PID: %s", config.Target),
		}
	}

	err = syscall.Kill(pid, 0)
	duration := time.Since(start)

	if err == nil || err == syscall.EPERM {
		// Read process state
		state := readProcessState(pid)
		return CheckResult{
			Healthy:  true,
			Duration: duration,
			Details:  fmt.Sprintf("PID %d alive, state: %s", pid, state),
		}
	}

	return CheckResult{
		Healthy:  false,
		Duration: duration,
		Error:    fmt.Sprintf("PID %d not found", pid),
	}
}

func readProcessState(pid int) string {
	path := fmt.Sprintf("/proc/%d/status", pid)
	file, err := os.Open(path)
	if err != nil {
		return "unknown"
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "State:") {
			parts := strings.SplitN(line, ":\t", 2)
			if len(parts) == 2 {
				return strings.TrimSpace(parts[1])
			}
		}
	}
	return "unknown"
}

func checkDisk(config CheckConfig) CheckResult {
	start := time.Now()
	var stat syscall.Statfs_t
	err := syscall.Statfs(config.Target, &stat)
	duration := time.Since(start)

	if err != nil {
		return CheckResult{
			Healthy:  false,
			Duration: duration,
			Error:    fmt.Sprintf("statfs failed: %s", err),
		}
	}

	total := stat.Blocks * uint64(stat.Bsize)
	free := stat.Bavail * uint64(stat.Bsize)
	used := total - free
	usedPct := float64(used) / float64(total) * 100

	healthy := usedPct < config.Critical
	details := fmt.Sprintf("%.1f%% used (%.1f GB free)",
		usedPct, float64(free)/1e9)

	result := CheckResult{
		Healthy:  healthy,
		Duration: duration,
		Details:  details,
	}
	if !healthy {
		result.Error = fmt.Sprintf("disk usage %.1f%% exceeds threshold %.1f%%",
			usedPct, config.Critical)
	}
	return result
}

func checkMemory(config CheckConfig) CheckResult {
	start := time.Now()
	file, err := os.Open("/proc/meminfo")
	if err != nil {
		return CheckResult{
			Healthy:  false,
			Duration: time.Since(start),
			Error:    fmt.Sprintf("open /proc/meminfo: %s", err),
		}
	}
	defer file.Close()

	var totalKB, availableKB uint64
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Text()
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		key := strings.TrimSuffix(fields[0], ":")
		val, parseErr := strconv.ParseUint(fields[1], 10, 64)
		if parseErr != nil {
			continue
		}
		switch key {
		case "MemTotal":
			totalKB = val
		case "MemAvailable":
			availableKB = val
		}
	}

	duration := time.Since(start)

	if totalKB == 0 {
		return CheckResult{
			Healthy:  false,
			Duration: duration,
			Error:    "could not read memory info",
		}
	}

	usedKB := totalKB - availableKB
	usedPct := float64(usedKB) / float64(totalKB) * 100

	healthy := usedPct < config.Critical
	details := fmt.Sprintf("%.1f%% used (%d MB available)",
		usedPct, availableKB/1024)

	result := CheckResult{
		Healthy:  healthy,
		Duration: duration,
		Details:  details,
	}
	if !healthy {
		result.Error = fmt.Sprintf("memory usage %.1f%% exceeds threshold %.1f%%",
			usedPct, config.Critical)
	}
	return result
}

Color-coded terminal output

Use ANSI escape codes for color. Green for healthy, yellow for degraded, red for unhealthy.

const (
	colorReset  = "\033[0m"
	colorRed    = "\033[31m"
	colorGreen  = "\033[32m"
	colorYellow = "\033[33m"
	colorBold   = "\033[1m"
	colorDim    = "\033[2m"
)

func stateColor(state HealthState) string {
	switch state {
	case StateHealthy:
		return colorGreen
	case StateDegraded:
		return colorYellow
	case StateUnhealthy:
		return colorRed
	default:
		return colorReset
	}
}

func printDashboard(checks []*CheckState) {
	// Clear screen
	fmt.Print("\033[2J\033[H")

	fmt.Printf("%s%s=== Health Monitor Dashboard ===%s\n",
		colorBold, colorGreen, colorReset)
	fmt.Printf("%sUpdated: %s%s\n\n",
		colorDim, time.Now().Format("2006-01-02 15:04:05"), colorReset)

	// Header
	fmt.Printf("%-20s %-10s %-10s %-12s %-8s %s\n",
		"CHECK", "TYPE", "STATE", "LATENCY", "FAILS", "DETAILS")
	fmt.Println(strings.Repeat("-", 80))

	for _, cs := range checks {
		cs.mu.Lock()

		color := stateColor(cs.state)
		stateStr := cs.state.String()

		latency := "-"
		if !cs.lastCheck.IsZero() {
			latency = cs.lastResult.Duration.Round(time.Millisecond).String()
		}

		details := cs.lastResult.Details
		if cs.lastResult.Error != "" {
			details = cs.lastResult.Error
		}
		if len(details) > 30 {
			details = details[:30] + "..."
		}

		fmt.Printf("%-20s %-10s %s%-10s%s %-12s %-8d %s\n",
			cs.config.Name,
			cs.config.Type,
			color, stateStr, colorReset,
			latency,
			cs.consecutiveFailures,
			details,
		)

		cs.mu.Unlock()
	}

	// Print history
	fmt.Printf("\n%sState History (last 10 checks):%s\n", colorBold, colorReset)
	for _, cs := range checks {
		cs.mu.Lock()
		fmt.Printf("  %-20s ", cs.config.Name)
		for _, state := range cs.history {
			color := stateColor(state)
			switch state {
			case StateHealthy:
				fmt.Printf("%s+%s", color, colorReset)
			case StateDegraded:
				fmt.Printf("%s~%s", color, colorReset)
			case StateUnhealthy:
				fmt.Printf("%s!%s", color, colorReset)
			}
		}
		fmt.Println()
		cs.mu.Unlock()
	}
}

Main loop

Tie it all together.

func main() {
	configs := []CheckConfig{
		{
			Name:    "web-app",
			Type:    CheckHTTP,
			Target:  "http://localhost:8080/health",
			Timeout: 10 * time.Second,
		},
		{
			Name:    "api-server",
			Type:    CheckHTTP,
			Target:  "http://localhost:3000/healthz",
			Timeout: 10 * time.Second,
		},
		{
			Name:    "postgresql",
			Type:    CheckTCP,
			Target:  "localhost:5432",
			Timeout: 5 * time.Second,
		},
		{
			Name:    "redis",
			Type:    CheckTCP,
			Target:  "localhost:6379",
			Timeout: 5 * time.Second,
		},
		{
			Name:     "disk-root",
			Type:     CheckDisk,
			Target:   "/",
			Warning:  80,
			Critical: 90,
		},
		{
			Name:     "memory",
			Type:     CheckMemory,
			Target:   "",
			Warning:  80,
			Critical: 95,
		},
	}

	checks := make([]*CheckState, len(configs))
	for i, cfg := range configs {
		checks[i] = NewCheckState(cfg)
	}

	interval := 5 * time.Second
	fmt.Printf("Starting health monitor with %d checks, interval %s\n",
		len(checks), interval)
	fmt.Println("Press Ctrl+C to stop.")
	time.Sleep(2 * time.Second)

	// Run checks in parallel
	var wg sync.WaitGroup

	runAllChecks := func() {
		wg.Add(len(checks))
		for _, cs := range checks {
			go func(check *CheckState) {
				defer wg.Done()
				result := runCheck(check.config)
				check.RecordResult(result)
			}(cs)
		}
		wg.Wait()
		printDashboard(checks)
	}

	// First run
	runAllChecks()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		runAllChecks()
	}
}

This runs all checks in parallel using goroutines. Each check runs independently, so a slow HTTP check does not block the TCP checks. After all checks complete, it prints the dashboard.

Sample output

When you run this monitor, you see something like this (imagine the colors).

=== Health Monitor Dashboard ===
Updated: 2024-08-28 14:23:15

CHECK                TYPE       STATE      LATENCY      FAILS    DETAILS
--------------------------------------------------------------------------------
web-app              http       HEALTHY    45ms         0        HTTP 200
api-server           http       UNHEALTHY  10.001s      7        context deadline exceeded...
postgresql           tcp        HEALTHY    2ms          0        connection accepted
redis                tcp        HEALTHY    1ms          0        connection accepted
disk-root            disk       HEALTHY    0ms          0        67.2% used (33.1 GB free)
memory               memory     HEALTHY    0ms          0        52.3% used (7680 MB available)

State History (last 10 checks):
  web-app              ++++++++++
  api-server           +++~~!!!!!
  postgresql           ++++++++++
  redis                ++++++++++
  disk-root            ++++++++++
  memory               ++++++++++

The api-server went from healthy (+) to degraded (~) after the first failure, then to unhealthy (!) after 3 consecutive failures. All other services remain healthy.

Extending the monitor

There are several directions you can take this.

Add a process check. If you know the PID of a service, add a process check to the configs list. You could also have the monitor find the PID by reading a PID file, or by searching /proc for a process with a matching command line.

{
    Name:    "nginx",
    Type:    CheckProcess,
    Target:  "1234", // PID
    Timeout: 5 * time.Second,
},
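
For the PID file approach mentioned above, a hypothetical helper (the file path and format are assumptions):

// readPIDFile reads a PID from a file such as /var/run/myapp.pid and returns
// it as a string suitable for CheckConfig.Target.
func readPIDFile(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("read pid file %s: %w", path, err)
	}
	pid := strings.TrimSpace(string(data))
	if _, err := strconv.Atoi(pid); err != nil {
		return "", fmt.Errorf("invalid pid %q in %s", pid, path)
	}
	return pid, nil
}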

Add alerting. When a check transitions from healthy to unhealthy, send a notification. This could be a simple webhook call, an email, or a write to a log file that another tool monitors.
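
A minimal sketch of the webhook variant. The URL and payload shape are assumptions, not a real alerting API:

// notifyWebhook POSTs a small JSON payload when a check changes state.
func notifyWebhook(url, checkName string, state HealthState) error {
	payload := fmt.Sprintf(`{"check":%q,"state":%q}`, checkName, state.String())
	client := &http.Client{Timeout: 5 * time.Second} // never skip the timeout
	resp, err := client.Post(url, "application/json", strings.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("webhook returned %d", resp.StatusCode)
	}
	return nil
}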

Add configuration from a file. Instead of hard-coding the checks, read them from a YAML or JSON file. This makes the monitor reusable across different environments.
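
A sketch of the JSON version, assuming an encoding/json import and a file containing an array of CheckConfig objects. Note that Timeout unmarshals from a nanosecond count; a custom UnmarshalJSON would let you write "10s" instead.

// loadConfigs reads check definitions from a JSON file.
func loadConfigs(path string) ([]CheckConfig, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("read config: %w", err)
	}
	var configs []CheckConfig
	if err := json.Unmarshal(data, &configs); err != nil {
		return nil, fmt.Errorf("parse config %s: %w", path, err)
	}
	return configs, nil
}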

Add an HTTP endpoint. Make the monitor itself expose a /health endpoint that returns the aggregate status of all checks. Now other monitors can check your monitor.
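
A sketch of that aggregate endpoint, reusing the CheckState list from the monitor (the listen address is an assumption):

// serveAggregate exposes the monitor's own /health endpoint. It returns 200
// only if every tracked check is currently healthy, 503 otherwise.
func serveAggregate(checks []*CheckState, addr string) {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		for _, cs := range checks {
			cs.mu.Lock()
			state := cs.state
			name := cs.config.Name
			cs.mu.Unlock()
			if state != StateHealthy {
				w.WriteHeader(http.StatusServiceUnavailable)
				fmt.Fprintf(w, "UNHEALTHY: %s\n", name)
				return
			}
		}
		fmt.Fprintln(w, "OK")
	})
	go http.ListenAndServe(addr, nil) // e.g. ":9090"
}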

Wrapping Up

Health checks are not complicated. The tools are built into every Linux system. curl checks HTTP. nc checks TCP. /proc has memory, disk, and process information. ss shows listening ports. These commands work everywhere, with no installation needed.

Building the same checks in Go gives you timeouts, retries, state tracking, and structured output. The key lessons:

  • Always set timeouts on HTTP clients. The default Go HTTP client has no timeout.
  • Always close connections after checking TCP ports. File descriptors are finite.
  • Read /proc/meminfo values as kB (kibibytes). Multiply by 1024 to get bytes.
  • Use syscall.Kill(pid, 0) to check if a process exists, not os.FindProcess.
  • Require multiple consecutive failures before marking a service unhealthy. Single-check flapping is worse than no monitoring at all.

The complete monitor in this guide uses only Go standard library packages. No external dependencies. It compiles to a single binary that runs on any Linux system.
