/user/kayd @ devops :~$ cat devops-stack-i-would-pick-2026.md

The DevOps Stack I'd Pick If I Started Over in 2026

Karandeep Singh • 9 minutes

Summary

An opinionated DevOps stack pick from an engineer with years of production experience. Each layer gets a verdict, along with what I'd avoid and why.

If someone who’d been writing CRUD apps for years asked me what stack to learn for moving into DevOps, I’d start typing a list, delete it three paragraphs in, and start over. Then I’d delete that too.

The honest answer is that I’ve picked the wrong tool for the job at every layer of the stack at some point. Jenkins when I should’ve used GitHub Actions. ECS when I should’ve stayed with Lambda. Ansible when I should’ve used Terraform. CloudWatch when I should’ve used Datadog. Bash when I should’ve used Go.

After enough of those mistakes, you stop believing in “best practices” and start believing in the smallest set of tools that make production survivable.

This is that list, in 2026. It’s an opinion piece. Disagree with most of it. Here’s what I’d pick if I were starting over tomorrow.

Cloud: AWS, but barely

I’d still pick AWS, and I’d be unhappy about it.

The case for AWS is the same case it’s been for years: the depth of services means you can run almost any workload without leaving the platform, and the documentation — though painful — is more thorough than any competitor’s. The case against AWS is that the bill is unpredictable, the IAM model is genuinely user-hostile, and the console UI feels designed by a committee that never spoke to each other.

If I were starting over and the workload looked anything like “static site, a few Lambda functions, maybe a small Postgres,” I’d pick Cloudflare instead. Workers + R2 + D1 covers a meaningful slice of what most early-stage teams actually need, at a fraction of the cost, with a developer experience that doesn’t make you want to set your laptop on fire. I migrated a personal project off AWS S3 + CloudFront onto Cloudflare R2 + Workers recently. Build complexity dropped a lot. Bill dropped even more.

But for any workload that touches Kubernetes, EKS is unmatched. So AWS, with the asterisk that I keep evaluating Cloudflare for everything new.

What I’d avoid: GCP unless you’re already deep in BigQuery. Azure unless you’re contractually trapped in the Microsoft ecosystem. The major-three cloud comparison stopped being interesting around 2022.

Container orchestration: nothing, until you’re sure

The most expensive mistake I’ve watched teams make recently was deploying Kubernetes for workloads that fit comfortably on a few EC2 instances behind an ALB.

The honest answer:

  • A small number of services, predictable load: Don’t run Kubernetes. Use ECS Fargate or App Runner or, if you can, plain Lambda + API Gateway. The operational tax of Kubernetes — control plane, networking, RBAC, upgrades, the cluster autoscaler argument that takes a week — is real and it does not amortize until you have a lot of services.
  • A real catalog of services, real autoscaling needs, multiple teams: EKS with Karpenter. Karpenter has been the single biggest improvement to the Kubernetes operational story in years. It removed entire classes of capacity-planning headaches. If you’re on EKS without Karpenter, you’re using a 2021 Kubernetes.
  • Tens of teams, multi-region: EKS with proper platform-engineering investment, internal developer platform, the works.

I run Kubernetes on EKS where it earns its operational tax — where the service catalog and platform investment justify it. I would not run Kubernetes for my own side projects. The threshold matters.

What I’d avoid: Self-managed Kubernetes on EC2 unless you’ve decided your differentiator is being good at Kubernetes. You probably haven’t decided that.

CI/CD: GitHub Actions

I’ve used Jenkins, GitLab CI, AWS CodeBuild, CircleCI, and Travis. After all of that, I’d start with GitHub Actions.

GitHub Actions wins not because it’s the best — it isn’t — but because the friction of using anything else has gotten ridiculous. Your code is already on GitHub. Your secrets manager already integrates. Your team already knows the YAML. The marketplace covers most of what common pipelines need.
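For a green-field repo, the starting point can be genuinely small. Here's a hedged sketch of a minimal workflow (the Go test step is illustrative — substitute your own build command):

```yaml
name: ci
on:
  push:
    branches: [main]
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'
      - run: go test ./...
```

A dozen lines that would have taken a Jenkinsfile, a plugin audit, and a controller upgrade plan to replicate.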

The case for Jenkins is real if you have specific compliance, plugin, or build-orchestration needs that GitHub Actions can’t meet. Jenkins still earns its keep when pipelines have to reach into VMware and on-prem networks. But for anything green-field — GitHub Actions, full stop.

What I’d avoid: Building your own custom CI/CD platform. I have watched several teams do this. None of them shipped faster as a result.

Infrastructure as Code: Terraform, with caveats

Terraform is winning by inertia. The OpenTofu fork drama hasn't materially changed the on-the-ground experience, and the community module ecosystem is still on Terraform.

I’d still write new infrastructure in Terraform in 2026, but I’d be much pickier about what I put in Terraform.

  • Long-lived foundational stuff (VPCs, IAM roles, S3 buckets, RDS, EKS clusters): Terraform.
  • Application-level config (deployments, ConfigMaps, Helm values): not Terraform. Use the tool of the platform — Helm, Kustomize, or App-of-Apps with ArgoCD.
  • Truly app-coupled resources (a queue used by exactly one Lambda): keep it in the same repo as the app code, deploy it with the app’s CI. The “one giant Terraform monorepo” pattern always seems clean for a couple of months and then becomes the slowest part of any change.
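To make the split concrete, here's a hedged sketch (resource and bucket names are hypothetical) — the comments mark which repo each resource belongs in:

```hcl
# Long-lived, slow-changing: belongs in the central Terraform repo.
resource "aws_s3_bucket" "artifacts" {
  bucket = "example-team-artifacts"
}

# App-coupled: a queue owned by exactly one service. Define it next to
# that service's code and apply it from the app's own CI pipeline,
# not from the infrastructure monorepo.
resource "aws_sqs_queue" "ingest" {
  name = "ingest-queue"
}
```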

What I’d avoid: Pulumi unless your team is genuinely better at TypeScript than they are at HCL. They probably aren’t, and the abstraction cost shows up later when you need to read a long codebase to debug a deploy issue.

Configuration & secrets: AWS Parameter Store + Secrets Manager

This is unsexy and correct. AWS Parameter Store handles non-sensitive config. Secrets Manager handles secrets. Both integrate cleanly with most workloads. Both have proper IAM control.
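That IAM control is the point. A hedged sketch of a policy scoping one service to its own config and secret paths (account ID, region, and the `/myapp/prod/` paths are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ssm:GetParameter", "ssm:GetParametersByPath"],
      "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/myapp/prod/*"
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:myapp/prod/*"
    }
  ]
}
```

Per-path scoping like this is exactly what a self-run secret store makes you rebuild by hand.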

The case for HashiCorp Vault is real if you have actual cross-cloud or on-prem needs. The case for Doppler / 1Password Secrets Automation is real for early-stage teams that aren’t ready to think about IAM policies.

For everyone else: just use the AWS-native tools. The “let me run my own secret store” path leads to operational pain.

Observability: Datadog if you can pay, Grafana stack if you can’t

This one hurts to write because Datadog is expensive enough that the bill becomes a regular boardroom topic. But the truth is that if your engineering time is genuinely valuable, the time the team saves not building dashboards is worth more than the Datadog bill.

I’ve spent years arguing about Datadog cost and losing every argument. The teams that switched to self-hosted Prometheus + Grafana + Loki saved money for a while. Then somebody had to maintain the stack. Then a senior engineer left. Then they were paying for Datadog plus the technical debt of a half-maintained Grafana stack.

If you’re at a stage where engineering time is more expensive than vendor bills, pay Datadog. If you’re truly cost-constrained, AWS-native CloudWatch + a thin Grafana layer for the dashboards your team actually looks at. Self-hosting a full Prometheus + Grafana + Loki + Tempo stack is an entire engineering job. Don’t pretend otherwise.

What I’d avoid: Splunk in any form, in 2026. Yes, I know. No, I’m not changing my mind.

Languages: Bash → Go, when it crosses the threshold

This is the take I’m most confident in.

Bash is fine until your script reaches the point where you need:

  • Tests
  • Arguments with proper validation
  • Concurrency
  • Anything you’d want to debug in a debugger

Then it’s not fine. Then it’s a liability. Then you write Go and find out Go is just better at every operational task that crosses a certain threshold.

The pattern I’ve moved to: start every operational tool in Bash. Rewrite in Go the moment it stops being a one-screen script or anyone other than me has to touch it. This rule alone has saved me more debugging time than every fancy DevOps tool combined.

What I write Go for now:

  • Custom task runners (replaced cron + systemd-timer hairballs)
  • Internal CLI tools that wrap AWS APIs
  • Webhook handlers that need to be reliable
  • Anything that talks to Kafka, SQS, or any queue

What I keep in Bash:

  • Quick one-off scripts
  • Interactive runbooks
  • The occasional tail | grep | awk pipeline

What I’d avoid: Python for ops scripting in 2026. The packaging story is still bad. The startup time is still bad. Go is just better at this category.

Editor & dotfiles: stop optimizing, start writing

I went through the whole pipeline: Sublime → Atom → VS Code → Neovim with a sprawling config → back to Neovim with a stripped-down one → back to VS Code. The conclusion is that the editor doesn’t matter, and the time spent optimizing it is the most expensive activity in DevOps.

Pick something. Stop tweaking. Write things.

If you must use Neovim, use LazyVim and don’t customize. If you must use VS Code, install a small set of extensions and stop. The hours you spend tuning your editor are hours you don’t spend writing the runbook that would’ve prevented this week’s incident.

What I’d refuse to use in 2026

The non-controversial list:

  • Chef, Puppet — replaced by Ansible, then by infrastructure as code
  • Self-hosted GitLab CE — too much operational overhead vs GitHub
  • Heroku for new workloads — Salesforce-era pricing, not 2026 pricing
  • Travis CI — barely a product anymore
  • Self-managed Jenkins on a single VM — see the CI/CD section above for why

The controversial list:

  • Docker Desktop on macOS — Colima or OrbStack. Don’t pay Docker Inc. for what should be a free runtime.
  • Helm charts you didn’t write — for anything more complex than the canonical examples. The opacity-to-customization ratio is terrible.
  • Sprawling YAML — when configuration crosses a length where you’d prefer to refactor it, that’s not configuration anymore, it’s a programming language. Use jsonnet, CDK, or write a generator.

What I’d invest in that I haven’t yet

If I had a free month for tooling investment in 2026, I’d spend it on:

  • A real local-development setup that mirrors production. Most of my “works on my laptop, not in CI” debugging time would disappear.
  • A personal wiki for runbooks. I have most of mine scattered across notes apps. Consolidating would pay back quickly.
  • Real AI-assisted ops workflow. I use Claude and ChatGPT casually. I haven’t wired them into incident response or PR review the way the productive teams I know are doing.

The honest take

Most of the value in any DevOps stack is in the parts you choose not to add. Every tool you bring into production is something you have to upgrade, monitor, secure, document, and eventually replace. The teams I’ve watched ship the fastest are not the teams with the most tools. They are the teams who are willing to say “we don’t need that yet.”

If I started over tomorrow with the stack above, I’d ship a real product faster than I shipped my first one, and most of the speedup would come from not making the mistakes I made. The tooling matters. But the tooling matters less than the discipline to keep things small.

Pick fewer things. Use them harder. Document the ones you keep. Delete the ones you don’t.

That’s the stack.


Disagree? I’d genuinely like to hear which call you’d flip. Reach out on LinkedIn or GitHub.
