Every service goes down eventually. The question is whether you find out before your users do or after.
This guide starts with basic Linux commands for checking service health. Then it builds each check in Go. By the end, you will have a complete health monitoring tool that checks HTTP endpoints, TCP ports, disk space, memory, and processes. It tracks state, retries on failures, and prints color-coded output to your terminal.
Each step follows the same pattern. Run the Linux command first. Then build the same thing in Go. Then make a mistake, see it break, and fix it.
Step 1: HTTP Health Checks with curl
Most services expose an HTTP endpoint for health checks. The path is usually /health or /healthz. The response tells you if the service is running.
Linux commands
The simplest check uses curl with two flags. The -s flag silences the progress meter and error messages. The -f flag makes curl return a non-zero exit code on HTTP errors (4xx and 5xx).
curl -sf http://localhost:8080/health
If the service is healthy, you get the response body. If the server returns an HTTP error, curl exits with code 22 and prints nothing. If the connection fails outright, you get a different non-zero code (7 for connection refused, 28 for a timeout). Check the exit code with $?.
curl -sf http://localhost:8080/health
echo $?
# 0 means healthy, non-zero means something is wrong
Sometimes you only care about the status code. Use -o /dev/null to discard the body and -w to print just the HTTP code.
curl -o /dev/null -s -w "%{http_code}\n" http://localhost:8080/health
# Output: 200
Timeouts matter. Without them, curl will hang if the server accepts the connection but never responds. Set a connection timeout and a total timeout.
curl --connect-timeout 5 --max-time 10 -sf http://localhost:8080/health
--connect-timeout 5 gives the TCP handshake 5 seconds. --max-time 10 caps the entire request at 10 seconds. If either limit is hit, curl exits with an error.
Build it in Go
Here is a basic HTTP health checker in Go.
package main
import (
"fmt"
"net/http"
"os"
"time"
)
func checkHTTP(url string) error {
client := &http.Client{
Timeout: 10 * time.Second,
}
resp, err := client.Get(url)
if err != nil {
return fmt.Errorf("request failed: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
return fmt.Errorf("unhealthy status code: %d", resp.StatusCode)
}
return nil
}
func main() {
url := "http://localhost:8080/health"
if len(os.Args) > 1 {
url = os.Args[1]
}
err := checkHTTP(url)
if err != nil {
fmt.Fprintf(os.Stderr, "UNHEALTHY: %s\n", err)
os.Exit(1)
}
fmt.Println("HEALTHY")
}
The http.Client has a 10-second timeout. This covers the entire request: DNS lookup, TCP handshake, TLS handshake, sending the request, and reading the response. If anything takes longer than 10 seconds total, the client returns an error.
The bug: no timeout
What happens if you use http.Get() directly?
package main
import (
"fmt"
"net/http"
)
func checkHTTPBroken(url string) error {
// BUG: http.Get uses the default client, which has no timeout
resp, err := http.Get(url)
if err != nil {
return err
}
defer resp.Body.Close()
if resp.StatusCode != 200 {
return fmt.Errorf("status: %d", resp.StatusCode)
}
return nil
}
func main() {
err := checkHTTPBroken("http://localhost:8080/health")
if err != nil {
fmt.Println("UNHEALTHY:", err)
return
}
fmt.Println("HEALTHY")
}
The default http.Client in Go has no timeout. Zero. If the server accepts the TCP connection but never sends a response, your program hangs forever. This is a real problem in health checks. A health checker that hangs is worse than no health checker at all. It will consume a goroutine and a file descriptor, and you will never get a result.
The fix
Always create your own http.Client with a timeout.
client := &http.Client{
Timeout: 10 * time.Second,
}
resp, err := client.Get(url)
This is the single most important rule for HTTP requests in Go. Never use http.DefaultClient in production code. Never use http.Get() directly. Always set a timeout.
Step 2: TCP Port Checks
Not every service has an HTTP endpoint. Databases, caches, and message queues often speak binary protocols. But you can still check if they are listening on a port.
Linux commands
The classic tool is nc (netcat). The -z flag means scan mode (do not send data). The -v flag prints the result.
nc -zv localhost 5432
# Connection to localhost 5432 port [tcp/postgresql] succeeded!
If the port is closed or the host is unreachable, nc prints an error and exits with a non-zero code.
You can also use bash built-in TCP support. This opens a TCP connection to the given host and port. The timeout command kills it after 3 seconds if it does not complete.
timeout 3 bash -c '</dev/tcp/localhost/5432' && echo "Port open" || echo "Port closed"
To check what is listening on your system, use ss. The -t flag means TCP, -l means listening, -n means numeric (do not resolve names), and -p shows the owning process.
ss -tlnp | grep :5432
# LISTEN 0 244 0.0.0.0:5432 0.0.0.0:* users:(("postgres",pid=1234,fd=5))
This tells you that PostgreSQL is listening on port 5432, its PID is 1234, and it is using file descriptor 5.
Build it in Go
Go’s net package provides DialTimeout, which does exactly what we need.
package main
import (
"fmt"
"net"
"os"
"time"
)
func checkTCP(host string, port string) error {
address := net.JoinHostPort(host, port)
conn, err := net.DialTimeout("tcp", address, 5*time.Second)
if err != nil {
return fmt.Errorf("tcp connect failed: %w", err)
}
defer conn.Close()
return nil
}
func main() {
host := "localhost"
port := "5432"
if len(os.Args) > 2 {
host = os.Args[1]
port = os.Args[2]
}
err := checkTCP(host, port)
if err != nil {
fmt.Fprintf(os.Stderr, "UNHEALTHY: %s\n", err)
os.Exit(1)
}
fmt.Printf("HEALTHY: %s:%s is accepting connections\n", host, port)
}
net.DialTimeout attempts a TCP connection. If the port is listening, it returns a connection. If the port is closed or the timeout is reached, it returns an error.
net.JoinHostPort properly handles IPv6 addresses by wrapping them in brackets. Always use it instead of fmt.Sprintf("%s:%s", host, port).
The bug: file descriptor leak
Here is a version that checks the port but forgets to close the connection.
package main
import (
"fmt"
"net"
"time"
)
func checkTCPBroken(host string, port string) error {
address := net.JoinHostPort(host, port)
conn, err := net.DialTimeout("tcp", address, 5*time.Second)
if err != nil {
return fmt.Errorf("tcp connect failed: %w", err)
}
// BUG: we opened a connection but never closed it
_ = conn
return nil
}
func main() {
for i := 0; i < 100000; i++ {
err := checkTCPBroken("localhost", "5432")
if err != nil {
fmt.Println("check failed:", err)
return
}
}
fmt.Println("all checks passed")
}
Each call to DialTimeout opens a TCP connection and creates a file descriptor. Without closing the connection, these descriptors pile up. On Linux, the default soft limit is usually 1024 per process. After roughly 1020 leaked connections (a few descriptors are already taken by stdin, stdout, and stderr), every new dial fails with socket: too many open files.
Even if your health check runs once per minute, after 17 hours you hit the limit. In practice, you also leak connections on the server side. The server sees thousands of established connections that will never close (until the OS cleans them up with TCP keepalive timeouts, which can be hours).
The fix
Close the connection immediately after a successful dial.
conn, err := net.DialTimeout("tcp", address, 5*time.Second)
if err != nil {
return fmt.Errorf("tcp connect failed: %w", err)
}
defer conn.Close()
The defer conn.Close() call must come immediately after the nil check on err. This is a standard Go pattern: defer the cleanup right after confirming the resource was acquired. If you put other code between the successful open and the defer, you risk returning early without closing the resource.
Step 3: Disk and Memory Checks from /proc
A service can respond to HTTP requests and accept TCP connections but still be minutes away from crashing. If disk space runs out, logs stop writing, databases corrupt, and containers get evicted. If memory runs out, the OOM killer picks a process and terminates it.
Linux commands
Check disk usage with df. The -h flag makes it human-readable.
df -h /
# Filesystem Size Used Avail Use% Mounted on
# /dev/sda1 100G 67G 33G 67% /
The Use% column is what matters. When it hits 90%, you should be worried. At 100%, things break.
Check memory with free. The -m flag shows megabytes.
free -m
# total used free shared buff/cache available
# Mem: 16384 8192 2048 512 6144 7680
# Swap: 4096 0 4096
The available column is the important one, not free. Linux uses free memory for disk cache. The available column accounts for this and tells you how much memory is actually available for new processes.
You can read the raw data from /proc/meminfo.
grep MemAvailable /proc/meminfo
# MemAvailable: 7864320 kB
Check how much space your logs are consuming.
du -sh /var/log/
# 2.3G /var/log/
Build it in Go
Go can read /proc/meminfo like any text file. For disk space, use syscall.Statfs.
package main
import (
"bufio"
"fmt"
"os"
"strconv"
"strings"
"syscall"
)
type MemInfo struct {
TotalKB uint64
AvailableKB uint64
}
func readMemInfo() (MemInfo, error) {
file, err := os.Open("/proc/meminfo")
if err != nil {
return MemInfo{}, fmt.Errorf("open /proc/meminfo: %w", err)
}
defer file.Close()
var info MemInfo
scanner := bufio.NewScanner(file)
for scanner.Scan() {
line := scanner.Text()
parts := strings.Fields(line)
if len(parts) < 2 {
continue
}
key := strings.TrimSuffix(parts[0], ":")
value, err := strconv.ParseUint(parts[1], 10, 64)
if err != nil {
continue
}
switch key {
case "MemTotal":
info.TotalKB = value
case "MemAvailable":
info.AvailableKB = value
}
}
if err := scanner.Err(); err != nil {
return MemInfo{}, fmt.Errorf("read /proc/meminfo: %w", err)
}
return info, nil
}
type DiskInfo struct {
TotalBytes uint64
FreeBytes uint64
UsedPct float64
}
func readDiskInfo(path string) (DiskInfo, error) {
var stat syscall.Statfs_t
err := syscall.Statfs(path, &stat)
if err != nil {
return DiskInfo{}, fmt.Errorf("statfs %s: %w", path, err)
}
total := stat.Blocks * uint64(stat.Bsize)
free := stat.Bavail * uint64(stat.Bsize)
used := total - free
usedPct := float64(used) / float64(total) * 100
return DiskInfo{
TotalBytes: total,
FreeBytes: free,
UsedPct: usedPct,
}, nil
}
func main() {
mem, err := readMemInfo()
if err != nil {
fmt.Fprintf(os.Stderr, "memory check failed: %s\n", err)
os.Exit(1)
}
usedKB := mem.TotalKB - mem.AvailableKB
usedPct := float64(usedKB) / float64(mem.TotalKB) * 100
fmt.Printf("Memory: %d MB total, %d MB available (%.1f%% used)\n",
mem.TotalKB/1024, mem.AvailableKB/1024, usedPct)
disk, err := readDiskInfo("/")
if err != nil {
fmt.Fprintf(os.Stderr, "disk check failed: %s\n", err)
os.Exit(1)
}
fmt.Printf("Disk /: %.1f GB total, %.1f GB free (%.1f%% used)\n",
float64(disk.TotalBytes)/1e9, float64(disk.FreeBytes)/1e9, disk.UsedPct)
}
syscall.Statfs fills a Statfs_t struct with filesystem statistics. Blocks is the total number of blocks. Bavail is the number of blocks available to unprivileged users (this accounts for the root reserved space). Bsize is the block size in bytes.
The bug: wrong unit conversion
Here is a version with a subtle math error.
func readMemInfoBroken() {
// Pretend we parsed MemTotal: 16384000 kB from /proc/meminfo
memTotalKB := uint64(16384000)
// BUG: treating kB value as bytes
memTotalGB := float64(memTotalKB) / (1024 * 1024 * 1024)
fmt.Printf("Total memory: %.2f GB\n", memTotalGB)
// Output: Total memory: 0.02 GB
// That is wrong. A machine with 16 GB of RAM reports 0.02 GB.
}
The value in /proc/meminfo is in kB (kibibytes, meaning 1024 bytes). But the code treats it as if it were in bytes, dividing by 1024 three times instead of two.
The correct conversion from kB to GB is: divide by 1024 twice (kB to MB, then MB to GB). Or multiply by 1024 first (kB to bytes), then divide by 1024 three times.
The fix
Multiply the kB value by 1024 to get bytes first, then convert.
func readMemInfoFixed() {
memTotalKB := uint64(16384000)
// Correct: kB * 1024 = bytes, then bytes / (1024^3) = GB
// Simplified: kB / (1024 * 1024) = GB
memTotalGB := float64(memTotalKB) / (1024 * 1024)
fmt.Printf("Total memory: %.2f GB\n", memTotalGB)
// Output: Total memory: 15.63 GB
// Correct. 16384000 kB is about 15.63 GiB.
}
This is a common mistake. The /proc/meminfo values are labeled kB but they actually mean KiB (1024 bytes), not KB (1000 bytes). The Linux kernel documentation confirms this. Always multiply by 1024 when converting from the values in /proc/meminfo to bytes.
Step 4: Process Health Checks
Sometimes you need to check if a specific process is running. The service might not have an HTTP endpoint. Or it might have one, but you want to verify the process itself has not become a zombie.
Linux commands
Find a process by name with pgrep. The -f flag matches the full command line, not just the process name.
pgrep -f myapp
# 12345
If the process exists, pgrep prints its PID and exits with code 0. If not, it prints nothing and exits with code 1.
Check if a specific PID is alive with signal 0. Signal 0 does not actually send a signal. It just checks if the process exists and you have permission to signal it.
kill -0 12345
echo $?
# 0 means alive, 1 means dead or no permission
Read the process state from /proc.
grep State /proc/12345/status
# State: S (sleeping)
The state letters are:
- R = running or runnable
- S = sleeping (waiting for an event)
- D = uninterruptible sleep (usually disk I/O, cannot be killed)
- Z = zombie (finished but parent has not collected the exit status)
- T = stopped (by a signal or debugger)
A zombie process is a problem. It means the parent process is not calling wait(). If you see a D state process, something is stuck in I/O. Both of these warrant investigation.
Count open file descriptors for a process.
ls /proc/12345/fd/ | wc -l
# 47
If this number is growing over time, you have a file descriptor leak. Compare it against the limit.
grep "open files" /proc/12345/limits
# Max open files 1024 1048576 files
Build it in Go
Here is a Go function that checks if a process is alive.
package main
import (
"bufio"
"fmt"
"os"
"path/filepath"
"strconv"
"strings"
"syscall"
)
type ProcessInfo struct {
PID int
State string
Name string
FDs int
}
func checkProcess(pid int) (ProcessInfo, error) {
info := ProcessInfo{PID: pid}
// Check if process exists using signal 0
err := syscall.Kill(pid, 0)
if err != nil {
return info, fmt.Errorf("process %d not found: %w", pid, err)
}
// Read process status
statusPath := fmt.Sprintf("/proc/%d/status", pid)
file, err := os.Open(statusPath)
if err != nil {
return info, fmt.Errorf("open %s: %w", statusPath, err)
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
line := scanner.Text()
parts := strings.SplitN(line, ":\t", 2)
if len(parts) != 2 {
continue
}
key := strings.TrimSpace(parts[0])
value := strings.TrimSpace(parts[1])
switch key {
case "Name":
info.Name = value
case "State":
info.State = value
}
}
// Count open file descriptors
fdPath := filepath.Join("/proc", strconv.Itoa(pid), "fd")
entries, err := os.ReadDir(fdPath)
if err == nil {
info.FDs = len(entries)
}
return info, nil
}
func main() {
if len(os.Args) < 2 {
fmt.Fprintln(os.Stderr, "usage: processcheck <pid>")
os.Exit(1)
}
pid, err := strconv.Atoi(os.Args[1])
if err != nil {
fmt.Fprintf(os.Stderr, "invalid pid: %s\n", os.Args[1])
os.Exit(1)
}
info, err := checkProcess(pid)
if err != nil {
fmt.Fprintf(os.Stderr, "UNHEALTHY: %s\n", err)
os.Exit(1)
}
fmt.Printf("Process %d (%s): %s, %d open FDs\n",
info.PID, info.Name, info.State, info.FDs)
}
This function does three things. First, it sends signal 0 to check if the process exists. Second, it reads /proc/<pid>/status to get the process name and state. Third, it counts entries in /proc/<pid>/fd to find the number of open file descriptors.
The bug: os.FindProcess always succeeds
Here is a version that tries to use os.FindProcess to check if a process exists.
package main
import (
"fmt"
"os"
"strconv"
)
func checkProcessBroken(pid int) error {
// BUG: os.FindProcess on Linux ALWAYS returns a non-nil Process
// It does not check if the process actually exists
proc, err := os.FindProcess(pid)
if err != nil {
return fmt.Errorf("process %d not found", pid)
}
fmt.Printf("Found process: %+v\n", proc)
return nil
}
func main() {
pid := 99999999 // almost certainly does not exist
if len(os.Args) > 1 {
pid, _ = strconv.Atoi(os.Args[1])
}
err := checkProcessBroken(pid)
if err != nil {
fmt.Println("Process not found")
} else {
fmt.Println("Process found!")
// Output: Process found!
// This is WRONG. PID 99999999 does not exist.
}
}
On Linux, os.FindProcess always returns a valid Process struct and a nil error, regardless of whether the PID exists. It is designed this way because on Unix systems, FindProcess is just a wrapper that stores the PID. It does not perform any system call to verify the process.
This means your health check will always report “process found” for any PID, even completely nonexistent ones.
The fix
Use syscall.Kill with signal 0. This actually makes a system call.
func checkProcessFixed(pid int) error {
err := syscall.Kill(pid, 0)
if err == nil {
return nil // process exists and we have permission
}
if err == syscall.ESRCH {
return fmt.Errorf("process %d does not exist", pid)
}
if err == syscall.EPERM {
// Process exists but we do not have permission to signal it.
// This still means the process is alive.
return nil
}
return fmt.Errorf("unexpected error checking process %d: %w", pid, err)
}
syscall.Kill(pid, 0) sends signal 0, which does nothing to the process but returns an error if the process does not exist. The kernel returns ESRCH (no such process) if the PID is not found. It returns EPERM (permission denied) if the process exists but you do not have permission to signal it. EPERM means the process IS alive. You just cannot kill it. That is still a successful health check.
Step 5: Health Checks with Retries and Degraded State
A single failed check does not always mean the service is down. Networks hiccup. Servers take a moment to warm up. Disk usage spikes during log rotation and drops again. A good health checker accounts for this.
Linux commands
Here is a bash script that runs the health check several times and uses exit codes to report state. Note that it runs all the checks before deciding: if it exited on the first success, the DEGRADED case could never be reached.
#!/bin/bash
# healthcheck.sh - exit 0=healthy, 1=degraded, 2=unhealthy
URL="http://localhost:8080/health"
MAX_RETRIES=3
RETRY_DELAY=2
failures=0
for i in $(seq 1 $MAX_RETRIES); do
if curl --connect-timeout 5 --max-time 10 -sf "$URL" > /dev/null 2>&1; then
echo "Check $i: OK"
else
failures=$((failures + 1))
echo "Check $i: FAILED"
fi
if [ "$i" -lt "$MAX_RETRIES" ]; then
sleep $RETRY_DELAY
fi
done
if [ "$failures" -eq 0 ]; then
echo "HEALTHY: all $MAX_RETRIES checks passed"
exit 0
elif [ "$failures" -eq "$MAX_RETRIES" ]; then
echo "UNHEALTHY: all $MAX_RETRIES checks failed"
exit 2
else
echo "DEGRADED: $failures of $MAX_RETRIES checks failed"
exit 1
fi
You can watch a health check in real time with watch.
watch -n 5 'curl -sf localhost:8080/health && echo OK || echo FAIL'
This runs the curl command every 5 seconds and displays the result. It is useful for manual monitoring during deploys or incidents.
Build it in Go
A real health checker needs to track state over time. A single failure should not immediately mark the service as unhealthy. And a single success should not immediately mark it as healthy after a long outage. This is the same logic that Kubernetes uses for readiness and liveness probes.
package main
import (
"fmt"
"net/http"
"sync"
"time"
)
type HealthState int
const (
StateHealthy HealthState = iota
StateDegraded
StateUnhealthy
)
func (s HealthState) String() string {
switch s {
case StateHealthy:
return "HEALTHY"
case StateDegraded:
return "DEGRADED"
case StateUnhealthy:
return "UNHEALTHY"
default:
return "UNKNOWN"
}
}
type CheckResult struct {
Healthy bool
Duration time.Duration
Error string
}
type StateTracker struct {
mu sync.Mutex
state HealthState
consecutiveFailures int
consecutiveSuccess int
failureThreshold int
successThreshold int
lastCheck time.Time
lastResult CheckResult
}
func NewStateTracker(failureThreshold, successThreshold int) *StateTracker {
return &StateTracker{
state: StateHealthy,
failureThreshold: failureThreshold,
successThreshold: successThreshold,
}
}
func (t *StateTracker) RecordResult(result CheckResult) HealthState {
t.mu.Lock()
defer t.mu.Unlock()
t.lastCheck = time.Now()
t.lastResult = result
if result.Healthy {
t.consecutiveFailures = 0
t.consecutiveSuccess++
if t.consecutiveSuccess >= t.successThreshold {
t.state = StateHealthy
}
} else {
t.consecutiveSuccess = 0
t.consecutiveFailures++
if t.consecutiveFailures >= t.failureThreshold {
t.state = StateUnhealthy
} else if t.state == StateHealthy {
t.state = StateDegraded
}
}
return t.state
}
func (t *StateTracker) State() HealthState {
t.mu.Lock()
defer t.mu.Unlock()
return t.state
}
func httpCheck(url string, timeout time.Duration) CheckResult {
start := time.Now()
client := &http.Client{Timeout: timeout}
resp, err := client.Get(url)
duration := time.Since(start)
if err != nil {
return CheckResult{
Healthy: false,
Duration: duration,
Error: err.Error(),
}
}
defer resp.Body.Close()
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
return CheckResult{
Healthy: false,
Duration: duration,
Error: fmt.Sprintf("status code %d", resp.StatusCode),
}
}
return CheckResult{
Healthy: true,
Duration: duration,
}
}
func main() {
tracker := NewStateTracker(3, 2)
url := "http://localhost:8080/health"
interval := 5 * time.Second
timeout := 10 * time.Second
fmt.Printf("Monitoring %s every %s\n", url, interval)
fmt.Printf("Failure threshold: 3 consecutive, Recovery threshold: 2 consecutive\n\n")
ticker := time.NewTicker(interval)
defer ticker.Stop()
// Run first check immediately
result := httpCheck(url, timeout)
state := tracker.RecordResult(result)
printResult(url, result, state)
for range ticker.C {
result := httpCheck(url, timeout)
state := tracker.RecordResult(result)
printResult(url, result, state)
}
}
func printResult(url string, result CheckResult, state HealthState) {
status := "OK"
if !result.Healthy {
status = "FAIL"
}
fmt.Printf("[%s] %s %s (%s) - state: %s\n",
time.Now().Format("15:04:05"),
status,
url,
result.Duration.Round(time.Millisecond),
state)
if result.Error != "" {
fmt.Printf(" error: %s\n", result.Error)
}
}
The StateTracker requires 3 consecutive failures before changing from healthy to unhealthy. And it requires 2 consecutive successes before changing from unhealthy back to healthy. This prevents flapping.
The bug: state flapping
Here is what happens without thresholds.
func recordResultBroken(healthy bool) string {
// BUG: state changes on every single check result
if healthy {
return "HEALTHY"
}
return "UNHEALTHY"
}
If a server drops one request out of ten, the health check oscillates between healthy and unhealthy every few seconds. Downstream systems that depend on the health check (like load balancers) will keep adding and removing the server. This causes connection errors for users as their requests are routed to a server that is being removed, or traffic spikes on other servers as this one is taken out of rotation.
14:00:01 HEALTHY
14:00:06 UNHEALTHY <- one timeout
14:00:11 HEALTHY <- back to normal
14:00:16 HEALTHY
14:00:21 UNHEALTHY <- another timeout
14:00:26 HEALTHY <- back again
The fix
Require N consecutive failures before changing state. The StateTracker above implements this. With a failure threshold of 3, a single timeout does not change the state. The server must fail 3 times in a row before it is marked unhealthy. And it must succeed 2 times in a row before it is marked healthy again.
This is exactly how Kubernetes readiness probes work. The failureThreshold and successThreshold fields in a pod spec control the same behavior.
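For comparison, the equivalent knobs in a Kubernetes pod spec look like this (illustrative fragment with made-up path and port, not part of the Go tool):

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 10
  failureThreshold: 3
  successThreshold: 2
```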
Step 6: Complete Health Monitor Dashboard
Now combine everything into a single tool. This monitor checks HTTP endpoints, TCP ports, disk space, memory, and processes. It prints a color-coded dashboard to the terminal.
Configuration
Define checks as a list of structs.
package main
import (
"bufio"
"fmt"
"net"
"net/http"
"os"
"strconv"
"strings"
"sync"
"syscall"
"time"
)
type CheckType string
const (
CheckHTTP CheckType = "http"
CheckTCP CheckType = "tcp"
CheckProcess CheckType = "process"
CheckDisk CheckType = "disk"
CheckMemory CheckType = "memory"
)
type HealthState int
const (
StateHealthy HealthState = iota
StateDegraded
StateUnhealthy
)
func (s HealthState) String() string {
switch s {
case StateHealthy:
return "HEALTHY"
case StateDegraded:
return "DEGRADED"
case StateUnhealthy:
return "UNHEALTHY"
default:
return "UNKNOWN"
}
}
type CheckConfig struct {
Name string
Type CheckType
Target string // URL, host:port, PID, or path depending on type
Timeout time.Duration
Warning float64 // warning threshold (percentage for disk/memory)
Critical float64 // critical threshold
}
type CheckResult struct {
Healthy bool
Duration time.Duration
Error string
Details string
}
type CheckState struct {
mu sync.Mutex
config CheckConfig
state HealthState
consecutiveFailures int
consecutiveSuccess int
lastCheck time.Time
lastResult CheckResult
history []HealthState
}
func NewCheckState(config CheckConfig) *CheckState {
return &CheckState{
config: config,
state: StateHealthy,
history: make([]HealthState, 0, 10),
}
}
func (cs *CheckState) RecordResult(result CheckResult) {
cs.mu.Lock()
defer cs.mu.Unlock()
cs.lastCheck = time.Now()
cs.lastResult = result
if result.Healthy {
cs.consecutiveFailures = 0
cs.consecutiveSuccess++
if cs.consecutiveSuccess >= 2 {
cs.state = StateHealthy
}
} else {
cs.consecutiveSuccess = 0
cs.consecutiveFailures++
if cs.consecutiveFailures >= 3 {
cs.state = StateUnhealthy
} else if cs.state == StateHealthy {
cs.state = StateDegraded
}
}
cs.history = append(cs.history, cs.state)
if len(cs.history) > 10 {
cs.history = cs.history[len(cs.history)-10:]
}
}
Check implementations
Each check type has its own function. All return a CheckResult.
func runCheck(config CheckConfig) CheckResult {
switch config.Type {
case CheckHTTP:
return checkHTTP(config)
case CheckTCP:
return checkTCP(config)
case CheckProcess:
return checkProcess(config)
case CheckDisk:
return checkDisk(config)
case CheckMemory:
return checkMemory(config)
default:
return CheckResult{
Healthy: false,
Error: fmt.Sprintf("unknown check type: %s", config.Type),
}
}
}
func checkHTTP(config CheckConfig) CheckResult {
start := time.Now()
client := &http.Client{Timeout: config.Timeout}
resp, err := client.Get(config.Target)
duration := time.Since(start)
if err != nil {
return CheckResult{
Healthy: false,
Duration: duration,
Error: err.Error(),
}
}
defer resp.Body.Close()
healthy := resp.StatusCode >= 200 && resp.StatusCode < 300
result := CheckResult{
Healthy: healthy,
Duration: duration,
Details: fmt.Sprintf("HTTP %d", resp.StatusCode),
}
if !healthy {
result.Error = fmt.Sprintf("status code %d", resp.StatusCode)
}
return result
}
func checkTCP(config CheckConfig) CheckResult {
start := time.Now()
conn, err := net.DialTimeout("tcp", config.Target, config.Timeout)
duration := time.Since(start)
if err != nil {
return CheckResult{
Healthy: false,
Duration: duration,
Error: err.Error(),
}
}
conn.Close()
return CheckResult{
Healthy: true,
Duration: duration,
Details: "connection accepted",
}
}
func checkProcess(config CheckConfig) CheckResult {
start := time.Now()
pid, err := strconv.Atoi(config.Target)
if err != nil {
return CheckResult{
Healthy: false,
Duration: time.Since(start),
Error: fmt.Sprintf("invalid PID: %s", config.Target),
}
}
err = syscall.Kill(pid, 0)
duration := time.Since(start)
if err == nil || err == syscall.EPERM {
// Read process state
state := readProcessState(pid)
return CheckResult{
Healthy: true,
Duration: duration,
Details: fmt.Sprintf("PID %d alive, state: %s", pid, state),
}
}
return CheckResult{
Healthy: false,
Duration: duration,
Error: fmt.Sprintf("PID %d not found", pid),
}
}
func readProcessState(pid int) string {
path := fmt.Sprintf("/proc/%d/status", pid)
file, err := os.Open(path)
if err != nil {
return "unknown"
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
line := scanner.Text()
if strings.HasPrefix(line, "State:") {
parts := strings.SplitN(line, ":\t", 2)
if len(parts) == 2 {
return strings.TrimSpace(parts[1])
}
}
}
return "unknown"
}
func checkDisk(config CheckConfig) CheckResult {
start := time.Now()
var stat syscall.Statfs_t
err := syscall.Statfs(config.Target, &stat)
duration := time.Since(start)
if err != nil {
return CheckResult{
Healthy: false,
Duration: duration,
Error: fmt.Sprintf("statfs failed: %s", err),
}
}
total := stat.Blocks * uint64(stat.Bsize)
free := stat.Bavail * uint64(stat.Bsize)
used := total - free
usedPct := float64(used) / float64(total) * 100
healthy := usedPct < config.Critical
details := fmt.Sprintf("%.1f%% used (%.1f GB free)",
usedPct, float64(free)/1e9)
result := CheckResult{
Healthy: healthy,
Duration: duration,
Details: details,
}
if !healthy {
result.Error = fmt.Sprintf("disk usage %.1f%% exceeds threshold %.1f%%",
usedPct, config.Critical)
}
return result
}
func checkMemory(config CheckConfig) CheckResult {
start := time.Now()
file, err := os.Open("/proc/meminfo")
if err != nil {
return CheckResult{
Healthy: false,
Duration: time.Since(start),
Error: fmt.Sprintf("open /proc/meminfo: %s", err),
}
}
defer file.Close()
var totalKB, availableKB uint64
scanner := bufio.NewScanner(file)
for scanner.Scan() {
line := scanner.Text()
fields := strings.Fields(line)
if len(fields) < 2 {
continue
}
key := strings.TrimSuffix(fields[0], ":")
val, parseErr := strconv.ParseUint(fields[1], 10, 64)
if parseErr != nil {
continue
}
switch key {
case "MemTotal":
totalKB = val
case "MemAvailable":
availableKB = val
}
}
duration := time.Since(start)
if totalKB == 0 {
return CheckResult{
Healthy: false,
Duration: duration,
Error: "could not read memory info",
}
}
usedKB := totalKB - availableKB
usedPct := float64(usedKB) / float64(totalKB) * 100
healthy := usedPct < config.Critical
details := fmt.Sprintf("%.1f%% used (%d MB available)",
usedPct, availableKB/1024)
result := CheckResult{
Healthy: healthy,
Duration: duration,
Details: details,
}
if !healthy {
result.Error = fmt.Sprintf("memory usage %.1f%% exceeds threshold %.1f%%",
usedPct, config.Critical)
}
return result
}
Color-coded terminal output
Use ANSI escape codes for color. Green for healthy, yellow for degraded, red for unhealthy.
const (
colorReset = "\033[0m"
colorRed = "\033[31m"
colorGreen = "\033[32m"
colorYellow = "\033[33m"
colorBold = "\033[1m"
colorDim = "\033[2m"
)
func stateColor(state HealthState) string {
switch state {
case StateHealthy:
return colorGreen
case StateDegraded:
return colorYellow
case StateUnhealthy:
return colorRed
default:
return colorReset
}
}
func printDashboard(checks []*CheckState) {
    // Clear screen
    fmt.Print("\033[2J\033[H")
    fmt.Printf("%s%s=== Health Monitor Dashboard ===%s\n",
        colorBold, colorGreen, colorReset)
    fmt.Printf("%sUpdated: %s%s\n\n",
        colorDim, time.Now().Format("2006-01-02 15:04:05"), colorReset)

    // Header
    fmt.Printf("%-20s %-10s %-10s %-12s %-8s %s\n",
        "CHECK", "TYPE", "STATE", "LATENCY", "FAILS", "DETAILS")
    fmt.Println(strings.Repeat("-", 80))

    for _, cs := range checks {
        cs.mu.Lock()
        color := stateColor(cs.state)
        stateStr := cs.state.String()
        latency := "-"
        if !cs.lastCheck.IsZero() {
            latency = cs.lastResult.Duration.Round(time.Millisecond).String()
        }
        details := cs.lastResult.Details
        if cs.lastResult.Error != "" {
            details = cs.lastResult.Error
        }
        if len(details) > 30 {
            details = details[:30] + "..."
        }
        fmt.Printf("%-20s %-10s %s%-10s%s %-12s %-8d %s\n",
            cs.config.Name,
            cs.config.Type,
            color, stateStr, colorReset,
            latency,
            cs.consecutiveFailures,
            details,
        )
        cs.mu.Unlock()
    }

    // Print history
    fmt.Printf("\n%sState History (last 10 checks):%s\n", colorBold, colorReset)
    for _, cs := range checks {
        cs.mu.Lock()
        fmt.Printf(" %-20s ", cs.config.Name)
        for _, state := range cs.history {
            color := stateColor(state)
            switch state {
            case StateHealthy:
                fmt.Printf("%s+%s", color, colorReset)
            case StateDegraded:
                fmt.Printf("%s~%s", color, colorReset)
            case StateUnhealthy:
                fmt.Printf("%s!%s", color, colorReset)
            }
        }
        fmt.Println()
        cs.mu.Unlock()
    }
}
Main loop
Tie it all together.
func main() {
    configs := []CheckConfig{
        {
            Name:    "web-app",
            Type:    CheckHTTP,
            Target:  "http://localhost:8080/health",
            Timeout: 10 * time.Second,
        },
        {
            Name:    "api-server",
            Type:    CheckHTTP,
            Target:  "http://localhost:3000/healthz",
            Timeout: 10 * time.Second,
        },
        {
            Name:    "postgresql",
            Type:    CheckTCP,
            Target:  "localhost:5432",
            Timeout: 5 * time.Second,
        },
        {
            Name:    "redis",
            Type:    CheckTCP,
            Target:  "localhost:6379",
            Timeout: 5 * time.Second,
        },
        {
            Name:     "disk-root",
            Type:     CheckDisk,
            Target:   "/",
            Warning:  80,
            Critical: 90,
        },
        {
            Name:     "memory",
            Type:     CheckMemory,
            Target:   "",
            Warning:  80,
            Critical: 95,
        },
    }

    checks := make([]*CheckState, len(configs))
    for i, cfg := range configs {
        checks[i] = NewCheckState(cfg)
    }

    interval := 5 * time.Second
    fmt.Printf("Starting health monitor with %d checks, interval %s\n",
        len(checks), interval)
    fmt.Println("Press Ctrl+C to stop.")
    time.Sleep(2 * time.Second)

    // Run checks in parallel
    var wg sync.WaitGroup
    runAllChecks := func() {
        wg.Add(len(checks))
        for _, cs := range checks {
            go func(check *CheckState) {
                defer wg.Done()
                result := runCheck(check.config)
                check.RecordResult(result)
            }(cs)
        }
        wg.Wait()
        printDashboard(checks)
    }

    // First run
    runAllChecks()

    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        runAllChecks()
    }
}
This runs all checks in parallel using goroutines. Each check runs independently, so a slow HTTP check does not block the TCP checks. After all checks complete, it prints the dashboard.
Sample output
When you run this monitor, you see something like this (imagine the colors).
=== Health Monitor Dashboard ===
Updated: 2024-08-28 14:23:15
CHECK                TYPE       STATE      LATENCY      FAILS    DETAILS
--------------------------------------------------------------------------------
web-app              http       HEALTHY    45ms         0        HTTP 200
api-server           http       UNHEALTHY  10.001s      7        context deadline exceeded...
postgresql           tcp        HEALTHY    2ms          0        connection accepted
redis                tcp        HEALTHY    1ms          0        connection accepted
disk-root            disk       HEALTHY    0ms          0        67.2% used (33.1 GB free)
memory               memory     HEALTHY    0ms          0        52.3% used (7680 MB available)

State History (last 10 checks):
 web-app              ++++++++++
 api-server           +++~~!!!!!
 postgresql           ++++++++++
 redis                ++++++++++
 disk-root            ++++++++++
 memory               ++++++++++
The api-server went from healthy (+) to degraded (~) after the first failure, then to unhealthy (!) after 3 consecutive failures. All other services remain healthy.
Extending the monitor
There are several directions you can take this.
Add a process check. If you know the PID of a service, add a process check to the configs list. You could also have the monitor find the PID by reading a PID file, or by searching /proc for a process with a matching command line.
{
    Name:    "nginx",
    Type:    CheckProcess,
    Target:  "1234", // PID
    Timeout: 5 * time.Second,
},
Add alerting. When a check transitions from healthy to unhealthy, send a notification. This could be a simple webhook call, an email, or a write to a log file that another tool monitors.
Add configuration from a file. Instead of hard-coding the checks, read them from a YAML or JSON file. This makes the monitor reusable across different environments.
Add an HTTP endpoint. Make the monitor itself expose a /health endpoint that returns the aggregate status of all checks. Now other monitors can check your monitor.
Wrapping Up
Health checks are not complicated. The tools are built into every Linux system. curl checks HTTP. nc checks TCP. /proc has memory, disk, and process information. ss shows listening ports. These commands work everywhere, with no installation needed.
Building the same checks in Go gives you timeouts, retries, state tracking, and structured output. The key lessons:
- Always set timeouts on HTTP clients. The default Go HTTP client has no timeout.
- Always close connections after checking TCP ports. File descriptors are finite.
- Read /proc/meminfo values as kB (kibibytes). Multiply by 1024 to get bytes.
- Use syscall.Kill(pid, 0) to check if a process exists, not os.FindProcess.
- Require multiple consecutive failures before marking a service unhealthy. Single-check flapping is worse than no monitoring at all.
The complete monitor in this guide uses only Go standard library packages. No external dependencies. It compiles to a single binary that runs on any Linux system.
Keep Reading
- CPU Monitoring: From Linux Commands to a Go Dashboard — read /proc/stat for CPU metrics and build a live dashboard with the same patterns used here.
- Docker Log Management: From docker logs to a Go Log Collector — monitor container logs alongside the health checks you built here.
- Process Management: From Linux Commands to a Go Supervisor — keep the services healthy by supervising them with signals and restart logic.