/user/kayd @ devops :~$ cat what-i-got-wrong-about-kubernetes-2025.md

What I Got Wrong About Kubernetes in 2025

Karandeep Singh
• 9 minutes

Summary

A year-end retrospective on Kubernetes mistakes from a DevOps engineer. Five things I overinvested in, underinvested in, or simply got wrong while running EKS in 2025.

I started running Kubernetes seriously a couple of years ago. By the end of 2025 I had built and torn down enough infrastructure to fill a small graveyard. The cluster is still up — that part went fine. What didn’t go fine was almost every decision I made about what to put on top of it.

This is the honest list. Five things I got wrong in 2025. None of these are fresh hot takes you haven’t read elsewhere. They are mistakes I made anyway, with my own eyes open, surrounded by people who told me not to. Maybe they’ll save you the same months.

1. I deployed a service mesh nobody needed

We had a small handful of services. We bought Istio.

The argument I made at the time was reasonable in shape and wrong in fact: “We’re going to scale to many more services next year. Better to put the mesh in now while the surface area is small than retrofit it later.” It’s a completely standard piece of architectural reasoning. It is also exactly the wrong call when you’re at a small handful of services and your engineering team is small.

Here’s what actually happened across most of 2025:

  • We spent weeks designing the mesh architecture, mTLS rollout, and traffic policies for our staging environment.
  • We hit several Istio version-incompatibility issues during EKS upgrades.
  • Our P99 latency went up noticeably. None of us could agree on whether the mesh caused it or whether one of our application changes caused it. We never figured out the answer.
  • One service — exactly one — used a mesh feature we could not have replicated with a simpler tool.
  • The platform engineer who knew Istio well moved on. The remaining team was afraid to touch it for a long time after.

Eventually we ripped Istio out and replaced it with EKS native pod-to-pod IAM and a couple of NetworkPolicies. The replacement was the work of a sprint. The Istio investment had been the work of many.
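
The NetworkPolicy side of that replacement is genuinely small. Here is a minimal sketch of the shape of rule we ended up with; the names, labels, namespace, and port are illustrative rather than our real services, and enforcement assumes a CNI that actually implements NetworkPolicy (on EKS that means enabling network policy in the VPC CNI, or running something like Calico):

```yaml
# Illustrative sketch, not our real manifests: only "checkout" pods may
# reach the "payments" pods on their service port; all other ingress is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payments-allow-checkout-only
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: payments
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: checkout
      ports:
        - protocol: TCP
          port: 8443
```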

The lesson isn’t “service meshes are bad.” The lesson is that service mesh is a tool for the problem of “I have so many services that point-to-point communication is unmanageable.” If you don’t have that problem yet, you don’t need the tool yet. We didn’t have it. I bought it anyway.

If I were starting fresh on EKS today with anything short of a sprawling service catalog, I would not consider a service mesh until somebody on the team could point at a specific cross-service problem the mesh solves and the simpler tools cannot. Not before.

2. I built autoscaling for traffic we never had

The cluster autoscaler took weeks to tune. The horizontal pod autoscaler took longer. We wrote custom metrics, configured target utilization, ran load tests, and validated that the autoscaling behaved correctly across multiple traffic patterns.

Then we put it in production and watched it never autoscale.

The traffic profile of our SaaS workload is flat. Tuesday looks like Wednesday looks like Sunday with very small variance. The peak traffic was barely above the average. Our autoscaler was configured to handle multiples of that. We had carefully built a system to handle a load spike that, for the structure of our customer base, mathematically could not occur.

What I should have built instead: a fixed pod count slightly above the steady-state load, with a manual scaling runbook for unusual events. Total time to build that: a fraction of an afternoon.
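
In Kubernetes terms, that alternative is nothing more exotic than a replicas field and honest resource requests. A minimal sketch, with made-up numbers:

```yaml
# Illustrative sketch: fixed capacity a notch above measured steady state.
# Replica count, image, and resource numbers are made up for the example.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 4          # steady state fits in ~3; one extra for headroom and rollouts
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.42.0   # pinned tag
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              memory: 512Mi
```

The manual scaling runbook is essentially one `kubectl scale deployment/api --replicas=8` command plus a note about when a human should run it.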

The premature autoscaling cost us in two ways. First, the design time itself. Second, the cluster was provisioned for the autoscaling worst case, which means we paid for capacity we never used, every single month.

The honest lesson: scale your infrastructure for the load you’ve actually measured, plus a small safety margin. Build autoscaling when the measured peak-to-average ratio crosses a threshold that fixed provisioning can’t handle. For most B2B SaaS workloads, that day never comes.
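
And if that day ever comes, the first version does not need custom metrics. Something like a plain CPU-utilization HPA is where I would start; the numbers below are illustrative:

```yaml
# Illustrative sketch of a first autoscaler, if measured traffic ever
# justifies one: plain CPU utilization, no custom metrics.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 4
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```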

3. I trusted Helm charts I never actually read

We installed plenty of community Helm charts in 2025. I read very few of them carefully. The rest got the “this has lots of GitHub stars and the README looks reasonable” treatment.

At one point, one of them silently changed its image tag from a pinned version to :latest between two patch releases. We caught it in staging because an observant engineer happened to look at the deployed image. We could just as easily have shipped it to production.

Later, a different chart updated its dependencies, pulling in a new sub-chart that ran a privileged init container we had not seen, did not need, and could not justify.

I had treated helm install like apt install: a packaged, vetted, safe-by-default operation. It is not. Charts are arbitrary YAML and arbitrary container images, often maintained by one or two volunteers, often pulling in transitive dependencies that the chart author didn’t audit either. Every Helm chart you install is the equivalent of curl | sudo bash for your cluster.

What I do now:

  • Pin chart versions in source control, never latest, never the un-pinned implicit upgrade path (see the sketch after this list).
  • Pin container image tags inside the chart values, never trust the chart’s defaults.
  • Render the chart to YAML and read it once before installing. Yes, the whole thing.
  • Renovate or Dependabot configured to flag chart upgrades for human review, not auto-merge.
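
In practice, the first two bullets come down to an exact dependency pin plus explicit image values. The chart name, repository, and values keys below are illustrative, since every chart exposes its own values schema:

```yaml
# Illustrative Chart.yaml of an umbrella chart: exact version, never a range.
apiVersion: v2
name: platform
version: 0.3.0
dependencies:
  - name: some-community-chart            # illustrative name
    version: 4.12.3                       # exact pin, never "^4" or "latest"
    repository: https://charts.example.com
---
# Illustrative values.yaml: pin the image tag instead of trusting the default.
# The exact keys depend on the chart's own values schema.
some-community-chart:
  image:
    repository: ghcr.io/example/some-app
    tag: "2.8.1"
```

The render-and-read step is then a `helm template` run over that umbrella chart, with the output actually read before anything gets applied.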

The lesson is depressing but real: the convenience of community Helm charts sometimes costs more than the operational tax of writing your own simpler manifests. For complex things like Postgres operators or cert-manager, the chart is unavoidable. For everything simpler, write the YAML yourself. You’ll read it more carefully when you have to write it.

4. I built our internal developer platform too early

Earlier in the year our team had a long meeting. The conclusion was that our developer experience had become bad: deploys took ages, the YAML was confusing, our most senior engineers were spending a real chunk of their time as a help desk for the rest of the team.

So I started building an Internal Developer Platform. Backstage on top, custom CRDs in the middle, a polished make deploy flow for app teams.

Months in, I had built most of what was needed. The team that was supposed to use it had churned through several people in the meantime. The remaining users were frustrated with the platform’s gaps but understandably did not want to learn the half-finished version when the rough manual deploy still worked.

By the end of the year, I retired the platform without it ever being adopted. Sunk cost.

The mistake here was not building the platform — it was building it before there was a stable team to maintain it. An Internal Developer Platform is a long-term commitment with a high maintenance ceiling. If you don’t have engineers committed to it for the long haul, you are building debt that will outlive whoever made the call.

What I should have done: ship targeted improvements to the manual flow. A better Makefile. A kubectl plugin for the operations everyone repeated. A two-page documented “happy path” with screenshots. Total elapsed time: a tiny fraction of what the platform took. Outcome: every engineer’s Tuesday gets meaningfully better.

The grand vision is seductive. It almost always loses to the small, scrappy, immediately-useful change.

5. I treated observability as a setup task instead of a maintenance task

The original observability rollout took us a few weeks. After that, I considered observability done. The dashboards existed. The alerts fired. We could see CPU and memory and request latency.

Across the year, observability quietly broke in several different ways without me noticing:

  • A handful of custom metrics drifted from their source — someone renamed a function, the metric name changed with it, and no one updated the dashboard.
  • A couple of alerts had been muted “for this week” earlier in the year and were still muted months later.
  • The log-volume cost grew significantly because nobody was rotating the verbose-logging config back to default after debugging sessions.
  • The on-call runbook referenced a dashboard URL that had been deleted in a workspace cleanup.
  • A bunch of alerts had thresholds that hadn’t been recalibrated since the original rollout, even though the traffic profile had changed.

The pattern is the same in each case: observability rots if you don’t maintain it. It is not infrastructure that you provision once. It is a body of code, dashboards, alerts, and runbooks that has the same software-rot characteristics as any other codebase. We were treating it like the wiring in a house, when it is actually like the wiring in a car.

What I do now:

  • A monthly observability review on the calendar — walk through the alerts that fired, the alerts that didn’t, the dashboards we used, the dashboards nobody opened.
  • A check that every on-call alert has been triggered at least once in a reasonable window. If it hasn’t, it’s either too lax or genuinely irrelevant.
  • A single “log-volume cost” line item that we look at regularly (see the sketch after this list). It has caught more than one cost runaway since I started watching it.
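
The log-volume line item works better as an alert than as a dashboard someone has to remember to open. Here is a sketch of the shape of rule I mean, assuming Prometheus-style alerting; the metric name is an assumption, and the real one depends on your logging stack:

```yaml
# Hypothetical Prometheus alerting rule: open a ticket when log ingestion runs
# well above last week's rate for a sustained period. The metric name
# log_ingestion_bytes_total is an assumption; substitute whatever your
# logging stack actually exposes.
groups:
  - name: observability-hygiene
    rules:
      - alert: LogVolumeRunaway
        expr: >
          sum(rate(log_ingestion_bytes_total[1h]))
            > 2 * sum(rate(log_ingestion_bytes_total[1h] offset 7d))
        for: 6h
        labels:
          severity: ticket
        annotations:
          summary: "Log ingestion is more than twice what it was a week ago"
          description: "Check for debug logging that was never rolled back."
```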

If I had been doing this from the start, I would have caught these problems as they happened instead of in a deep-dive months later.

The meta-mistake

The pattern across all five of these is the same. In each case I optimized for the future I was imagining instead of the present I was actually operating in.

Service mesh for the service catalog we didn’t have. Autoscaling for spikes that didn’t exist. Helm charts for tooling sophistication we hadn’t earned. Internal platform for a team that hadn’t stabilized. Observability frozen at rollout, for a system that was never going to stay the way it was at rollout.

In every case, less infrastructure earlier would have shipped better outcomes faster. In every case, the future-proofing turned into present-paying.

The thing they don’t tell you about Kubernetes is that it gives you so much room to over-engineer that the over-engineering becomes the default state. Every YAML file is an invitation to add another field. Every Helm chart is an invitation to install another component. Every architectural meeting is an invitation to anticipate a problem that may never arrive.

The discipline that actually wins is exactly the discipline you don’t want to have: not building the thing yet. Saying “we don’t need that until we measure that we need it.” Letting the system stay rough and slightly painful until the pain crosses a threshold that justifies real investment.

I will keep getting this wrong. I just want to get it wrong less in 2026.


If you ran Kubernetes in 2025 and your retrospective list looks different from mine, I genuinely want to hear it. Reach out on LinkedIn or GitHub. Disagreement is the most useful feedback.
