How I Cut Kubernetes Costs by 47% Without Sacrificing Performance
A near-50% reduction in daily cloud spend is hard to ignore.
That’s exactly what happened after I completed a three-month infrastructure optimization project for a company running a large-scale batch processing platform on EKS.
The cluster ran thousands of concurrent pods, processing millions of work items daily. By the end of the project, we achieved a 47% reduction in compute costs—translating to nearly seven figures in annual savings—with zero downtime and no degradation in output quality.
This wasn’t a greenfield environment. This was production infrastructure serving paying customers, running workloads that could not tolerate mid-execution interruptions. The usual advice of “just use spot instances and handle interruptions gracefully” didn’t apply here. We needed a more deliberate approach.
The Problem: Paying for Compute We Weren’t Using
The metrics told a clear story.
The EKS cluster was running at roughly 45% CPU utilization and under 20% memory utilization. We were paying for significantly more capacity than we actually needed.
The cluster used Cluster Autoscaler (the traditional Kubernetes node autoscaler) with conservative settings. It maintained a large buffer of unused capacity and could provision only a narrow set of instance types. At the same time, pod-level resource requests were padded with generous safety margins that no longer matched real usage.
The workload itself made optimization even more difficult. Jobs arrived in predictable hourly waves, spiking sharply at the top of each hour and then processing steadily for most of the hour. As jobs completed and new ones arrived, pod counts were in constant motion, preventing the autoscaler from ever settling into an efficient, stable node layout.
The First Attempt: Partial Success
I migrated the cluster to Karpenter (a modern node autoscaler) with aggressive consolidation enabled. This immediately unlocked better bin-packing and access to a wider range of instance types.
The result was promising: an immediate ~30% cost reduction.
Then the problems started.
To raise node utilization, Karpenter consolidated aggressively: it drained underutilized nodes and evicted their pods so they could be rescheduled onto fewer nodes. As a result, pods were being terminated before their jobs completed. For this particular workload, a mid-execution termination meant a failed job that couldn’t be resumed.
We had two realistic options:
- Disable aggressive consolidation and accept minimal savings
- Change the workload architecture to tolerate interruptions
Refactoring the application to be fully fault-tolerant wasn’t practical given the nature of the jobs and the scope of changes required.
So I rolled node consolidation back to conservative settings. The disruptions stopped—but so did most of the savings.
The Breakthrough: Inverting the Problem
The key insight was this:
Karpenter couldn’t stabilize because the workload itself never stabilized.
With pods continuously scaling up and down, nodes almost never remained idle long enough for safe consolidation. Rather than allowing workload volatility to shape the infrastructure, I flipped the approach:
Establish a stable infrastructure baseline and adapt the workload to fit it.
I analyzed historical data and found that each job type had remarkably consistent peak pod counts that repeated daily.
The proposal:
- Replace dynamic job scaling with fixed-replica Deployments configured to handle typical peak workloads
- Modify the application to poll for work continuously rather than terminating when queues emptied
This approach stabilizes pod count at a predictable baseline. During peak hours, jobs may experience slightly longer queue times, but message queues are specifically designed to handle this buffering. The key advantage: stable pod counts enable aggressive consolidation without the constant churn of pods spinning up and down.
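To make the queue-time trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes a burst that arrives all at once at the top of the hour and a fixed pool of pods with uniform throughput. The function names and every number are hypothetical, chosen only for illustration, and are not taken from the project.

```python
def drain_minutes(burst_items, replicas, items_per_pod_per_minute):
    """Minutes for a fixed pool of pods to clear a burst that arrives all at once."""
    if replicas <= 0 or items_per_pod_per_minute <= 0:
        raise ValueError("replicas and throughput must be positive")
    return burst_items / (replicas * items_per_pod_per_minute)


def fits_window(burst_items, replicas, items_per_pod_per_minute, window_minutes=60):
    """True if the burst is cleared before the next hourly wave arrives."""
    drain = drain_minutes(burst_items, replicas, items_per_pod_per_minute)
    return drain <= window_minutes


# Hypothetical numbers: 60,000 items per wave, 12 items per pod per minute.
print(drain_minutes(60_000, 100, 12))  # 50.0
print(fits_window(60_000, 100, 12))    # True
print(fits_window(60_000, 50, 12))     # False
```

The drain time is also roughly the longest any item waits in the queue, so a replica count is viable under this model only if the drain time stays inside both the hourly window and the SLA.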
The Three-Phase Implementation
Phase 1: Karpenter Foundation
I configured Karpenter NodePools to provision from a broad selection of spot instance types spanning multiple families—moving beyond the handful the team had previously monitored manually. This unlocked flexibility for both pricing and bin-packing, allowing Karpenter to optimize across dozens of instance options.
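For readers who have not seen a NodePool, the sketch below builds one as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. It is an illustration, not the configuration used in the project. The field names follow the Karpenter v1 NodePool API as best understood here and should be checked against the version you run. The pool name, the instance categories, the `default` node class, and the `5m` delay are all assumptions.

```python
import json


def spot_nodepool(name, instance_categories, node_class_name):
    """Build a NodePool manifest limited to spot capacity across several families."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Assumed to reference an existing EC2NodeClass.
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class_name,
                    },
                    "requirements": [
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["spot"],
                        },
                        {
                            # Broad families give the scheduler more packing options.
                            "key": "karpenter.k8s.aws/instance-category",
                            "operator": "In",
                            "values": list(instance_categories),
                        },
                    ],
                }
            },
            # Conservative setting: only remove nodes that are already empty.
            "disruption": {
                "consolidationPolicy": "WhenEmpty",
                "consolidateAfter": "5m",
            },
        },
    }


manifest = spot_nodepool("batch-spot", ["c", "m", "r"], "default")
print(json.dumps(manifest, indent=2))
```

Listing categories rather than individual instance types keeps the list short while still exposing dozens of candidate types to the autoscaler.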
Even with conservative consolidation settings, this delivered measurable cost reduction and validated the approach.
Phase 2: Workload Stabilization
I replaced dynamic scaling with fixed Deployments, rolling out changes incrementally by job type. Replica counts were based on observed workload patterns, configured to handle typical peaks with modest headroom.
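One way to turn observed patterns into replica counts is sketched below. It assumes pod-count samples are available as (job type, day, pod count) tuples, takes the median of the daily peaks as the "typical peak", and pads it with a percentage of headroom. The job type names, the sample values, and the 10% default are hypothetical. The original sizing method may well have differed.

```python
import math
from collections import defaultdict
from statistics import median


def daily_peaks(samples):
    """Group (job_type, day, pod_count) samples into {job_type: {day: peak}}."""
    peaks = defaultdict(dict)
    for job_type, day, pod_count in samples:
        peaks[job_type][day] = max(peaks[job_type].get(day, 0), pod_count)
    return peaks


def fixed_replicas(samples, headroom_pct=10):
    """Median daily peak per job type, padded by headroom_pct percent."""
    return {
        job_type: math.ceil(median(by_day.values()) * (100 + headroom_pct) / 100)
        for job_type, by_day in daily_peaks(samples).items()
    }


# Hypothetical samples for two made-up job types.
samples = [
    ("thumbnails", "mon", 80), ("thumbnails", "mon", 100),
    ("thumbnails", "tue", 104), ("thumbnails", "wed", 98),
    ("reports", "mon", 20), ("reports", "tue", 20), ("reports", "wed", 22),
]
print(fixed_replicas(samples))  # {'thumbnails': 110, 'reports': 22}
```

Using the median rather than the maximum keeps a single unusual day from inflating the baseline. The queue absorbs the occasional overshoot.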
The application was modified to poll continuously with exponential backoff:
- Check queue for work
- If empty, wait briefly
- Gradually increase wait time up to a maximum
- Reset when work arrives
Pods remained alive during quiet periods, eliminating the constant pod churn that had been undermining consolidation.
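The polling loop described above can be sketched as follows. This is a minimal illustration, not the application's actual code: `fetch` and `handle` stand in for whatever queue client and job handler the worker uses, and the delay values are assumptions. The `sleep` and `should_stop` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def poll_for_work(fetch, handle, base_delay=1.0, max_delay=30.0, factor=2.0,
                  sleep=time.sleep, should_stop=lambda: False):
    """Poll a queue continuously, backing off exponentially while it is empty."""
    delay = base_delay
    while not should_stop():
        item = fetch()                 # check queue for work
        if item is not None:
            handle(item)
            delay = base_delay         # reset when work arrives
            continue
        sleep(delay)                   # queue empty: wait briefly
        delay = min(delay * factor, max_delay)  # grow the wait up to a maximum


# Demonstration with a fake queue that is empty three times, then has one job.
queue = iter([None, None, None, "job-1", None])
calls_left = [5]
waits, done = [], []


def fake_fetch():
    calls_left[0] -= 1
    return next(queue)


poll_for_work(fake_fetch, done.append, base_delay=1.0, max_delay=3.0,
              sleep=waits.append, should_stop=lambda: calls_left[0] == 0)
print(waits)  # [1.0, 2.0, 3.0, 1.0]
print(done)   # ['job-1']
```

Capping the delay bounds how long a pod can sit idle after new work lands, so the cap should stay well below the acceptable queue latency.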
I deployed these changes over two weeks, monitoring queue depth and latency throughout. During peak hours, queue wait times increased modestly—well within SLA requirements and delivery commitments.
With pod counts stabilized, I re-enabled aggressive consolidation.
This time, it worked. Costs dropped significantly with zero disruption complaints.
Phase 3: Resource Right-Sizing
With pod counts stable and consolidation running smoothly, I turned to the resource requests themselves. The numbers told a familiar story: memory requests had been set conservatively early in the platform’s life and never revisited. Over time, actual usage had drifted far below those original estimates—in some cases, pods were using less than 20% of their requested memory.
I built right-sized resource profiles based on historical usage data. Memory requests came down significantly, often by 50% or more, while CPU requests were adjusted to match observed consumption. Limits stayed high enough to absorb legitimate spikes without triggering CPU throttling or OOM kills.
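A right-sized profile can be derived from usage samples along these lines. The policy shown is an assumption made for illustration: the request is set to the 95th percentile of observed memory usage plus 20% headroom, and the limit to the observed maximum plus 50%. The actual percentiles and margins used in the project are not stated, and the sample data is made up.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no usage samples")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def memory_profile(usage_mib, request_pct=95, headroom_pct=20, limit_margin_pct=50):
    """Suggest a memory request and limit (MiB) from observed usage samples."""
    typical = percentile(usage_mib, request_pct)
    return {
        "request_mib": math.ceil(typical * (100 + headroom_pct) / 100),
        "limit_mib": math.ceil(max(usage_mib) * (100 + limit_margin_pct) / 100),
    }


# Hypothetical samples: 20 readings from 100 MiB to 290 MiB.
usage = list(range(100, 300, 10))
print(memory_profile(usage))  # {'request_mib': 336, 'limit_mib': 435}
```

Sizing the request from a high percentile rather than the peak is what lets pods pack tightly, while the limit, set from the peak, is what absorbs the spikes.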
As with the previous phases, changes were rolled out incrementally with monitoring at each step. No OOM kills. No sustained throttling. Both CPU and memory utilization moved into healthy ranges above 60%, which gave Karpenter exactly what it needed: tightly packed pods and far fewer nodes to pay for.
The Results
By the end of the project, daily spot instance compute costs were down 47%, cutting the cluster’s compute spend nearly in half. The changes were rolled out with zero downtime and no customer impact.
Those savings came from three overlapping phases:
- Karpenter migration: ~5–10% from improved bin-packing and instance selection
- Workload stabilization and consolidation: ~20–25% by eliminating constant pod and node churn
- Pod right-sizing: ~15–20% by aligning resource requests with actual usage
These weren’t isolated improvements. Each step reinforced the next. Right-sized pods packed more efficiently. Better packing made consolidation both safer and more aggressive. And deeper consolidation ultimately translated into a much smaller AWS bill.
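The link between smaller requests and fewer nodes can be illustrated with a toy bin-packing model. The sketch below uses first-fit decreasing on a single resource dimension. That is a deliberate simplification: it is not how Karpenter or the Kubernetes scheduler actually place pods, and the pod sizes and node capacity are hypothetical.

```python
def nodes_needed(pod_requests, node_capacity):
    """Count nodes used when packing pod requests with first-fit decreasing."""
    free = []  # remaining capacity on each node opened so far
    for request in sorted(pod_requests, reverse=True):
        if request > node_capacity:
            raise ValueError("pod request exceeds node capacity")
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] = remaining - request
                break
        else:
            # No existing node has room, so open a new one.
            free.append(node_capacity - request)
    return len(free)


# Hypothetical: 12 pods on 16 GiB nodes, before and after halving requests.
print(nodes_needed([4] * 12, 16))  # 3
print(nodes_needed([2] * 12, 16))  # 2
```

In this toy case, halving the requests removes one node in three. Real clusters also have to account for CPU, daemon pods, and per-node overhead, so the actual ratio will differ.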
What This Means for You
If you’re running Kubernetes in production and your cluster utilization sits below 60%, there are likely significant annual savings waiting to be captured.
The 47% reduction I achieved came from three systematic changes: migrating to Karpenter for better bin-packing and instance selection, stabilizing workloads to eliminate constant pod and node churn, and right-sizing pods to match actual usage. None of these required sacrificing reliability—they required understanding what the infrastructure was actually doing versus what we assumed it needed.
Most Kubernetes clusters grow organically without ongoing optimization. Resource requests become historical artifacts, instance diversity stays artificially narrow, and autoscaling masks inefficiencies rather than fixing them. The savings are there. It’s a matter of knowing where to look.
Want to see what’s possible with your infrastructure? I help engineering teams identify and capture Kubernetes cost savings without impacting reliability. Let’s talk.