
Why Your Kubernetes Cluster Works Fine Until Traffic Spikes


DevOps & Cloud Engineer — building scalable, automated, and intelligent systems. Developer of sorts | Automator | Innovator

A Deep Dive into Resource Requests, Limits, and Real-World Failures

A Kubernetes cluster often appears healthy during normal operation. Pods are running, dashboards are green, and alerts stay quiet. Then traffic spikes. Suddenly requests slow down, pods restart, and some workloads disappear entirely. This situation surprises many teams because nothing changed in the cluster configuration. The problem usually lies in how resource requests and limits were defined, or not defined at all.

This article explains why these failures happen, how Kubernetes actually uses CPU and memory settings, and how small configuration choices can prevent large production outages.


The False Comfort of a Quiet Cluster

A cluster that handles low traffic smoothly is not necessarily well configured. During calm periods, most applications consume only a fraction of their potential resources. Kubernetes does not enforce limits aggressively when there is no contention. This creates the illusion that default settings are sufficient.

When traffic increases, applications begin to compete for CPU and memory. At that point, Kubernetes must make decisions quickly. If resource definitions are unrealistic, the scheduler and the kubelet respond in ways that feel unpredictable.


What Resource Requests Really Mean

A resource request is a promise. When a pod declares a CPU or memory request, it is telling Kubernetes how much it needs to function reliably. The scheduler uses this information to decide where the pod can run.

Consider a simple deployment:

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"

This configuration means Kubernetes will place the pod only on a node that has at least 250 millicores of CPU and 256 MiB of memory available. Even if the pod uses much less most of the time, the scheduler reserves this capacity.

If requests are set too low, Kubernetes may pack too many pods onto the same node. Everything works until load increases. At that point, the node becomes overwhelmed.


What Limits Actually Do

Limits define the maximum resources a container can use.

For CPU, hitting the limit causes throttling. The kernel caps how much CPU time the container receives in each scheduling period, so the application does not crash but becomes slower. Latency increases and timeouts appear.

For memory, exceeding the limit results in immediate termination. The container is OOMKilled and restarted according to the pod's restart policy.

Example:

resources:
  limits:
    cpu: "500m"
    memory: "512Mi"

If the application suddenly needs more memory during a traffic spike, Kubernetes will not negotiate. The container is terminated.


A Common Real-World Failure Scenario

Imagine an API service running three replicas on a small cluster. Each pod has low requests and tight memory limits.

resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    memory: "256Mi"

During normal usage, memory consumption stays below 150 MiB. Everything looks fine.

Now a traffic spike occurs. More requests mean more objects in memory, larger request payloads, and more concurrent goroutines or threads. Memory usage climbs past 256 MiB. Kubernetes kills the container. The pod restarts and immediately receives traffic again. The cycle repeats.

From the outside, this looks like instability or a Kubernetes bug. In reality, the limits were never aligned with real application behavior.
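
One way to soften this loop is to keep a restarting pod out of Service rotation until it is actually ready to serve. A minimal sketch, assuming a hypothetical /healthz endpoint on port 8080; substitute your application's real health endpoint:

```yaml
# Sketch: probes that stop a freshly restarted pod from receiving
# traffic before it is ready. Path, port, and image are placeholders.
containers:
  - name: api
    image: example/api:latest
    readinessProbe:          # gates Service traffic
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:           # restarts a truly stuck container
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
```

This does not fix an undersized memory limit, but it prevents a pod that is still warming up from being hit with full traffic the instant it restarts.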


Why Pod Evictions Appear During Traffic Spikes

Even if no container reaches its limit, nodes can still come under pressure. When total memory usage on a node approaches capacity, the kubelet starts evicting pods.

Pods whose usage most exceeds their requests, along with pods of lower priority, are evicted first. If requests are unrealistically small, critical workloads exceed them quickly and may be removed before less important ones.

This is why teams sometimes see monitoring agents or tracing systems disappear under load, even though the main application survives.
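
If certain workloads must survive node pressure, an explicit PriorityClass makes the eviction order deliberate rather than accidental. A sketch with illustrative names and values:

```yaml
# Sketch: a PriorityClass for workloads that should be evicted last.
# The name and value are illustrative, not recommendations.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-observability
value: 100000
globalDefault: false
description: "Evict these pods last under node pressure."
```

A pod opts in by setting priorityClassName: critical-observability in its spec, alongside honest resource requests.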


The Hidden Cost of Copy-Pasted Values

Many Helm charts ship with conservative defaults. These values are designed to work everywhere, not to reflect your workload.

Copying these defaults into production without measurement leads to two common problems:

  • Requests are too low, causing overscheduling and node pressure

  • Limits are too tight, causing restarts under load

The cluster appears cost efficient but becomes fragile.
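
Rather than editing a chart's defaults in place, resource values can be overridden per environment. A sketch of a hypothetical values override file; the exact key path depends on the chart, so check its values.yaml:

```yaml
# values-production.yaml -- sketch of overriding a chart's default
# resources with measured values. Key location varies by chart.
resources:
  requests:
    cpu: "300m"
    memory: "350Mi"
  limits:
    memory: "600Mi"
```

Applied with something like helm upgrade --install my-release ./chart -f values-production.yaml, this keeps measured production values separate from the chart's generic defaults.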


Observing the Problem Properly

Kubernetes provides enough signals to understand what is happening if they are used correctly.

kubectl top pod shows a point-in-time snapshot of CPU and memory usage.
kubectl describe pod reveals restart counts, OOMKilled states, and eviction reasons.
Node metrics, such as kubectl top node, show whether failures are isolated or systemic; CPU throttling appears in container runtime metrics rather than in describe output.

The key is to observe during peak traffic, not during quiet periods.


Setting Requests Based on Reality

A practical approach is to observe average usage under moderate load and add a safety margin.

If a service consistently uses 300 MiB of memory during busy periods, setting a request around 350 MiB and a limit around 600 MiB gives Kubernetes room to operate.

Example:

resources:
  requests:
    cpu: "300m"
    memory: "350Mi"
  limits:
    memory: "600Mi"

This configuration allows bursts without immediate termination and helps the scheduler place pods intelligently.
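
For context, here is how that fragment might sit inside a full Deployment. The names and image are placeholders; like the fragment above, it deliberately leaves the CPU limit unset, trading throttling protection for burst headroom:

```yaml
# Sketch: the measured requests and limits in a complete Deployment.
# Name, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example/api:latest
          resources:
            requests:
              cpu: "300m"
              memory: "350Mi"
            limits:
              memory: "600Mi"
```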


Why Autoscaling Alone Does Not Save You

Horizontal Pod Autoscalers react to metrics. They do not prevent individual pods from being killed. If pods are crashing due to memory limits, scaling replicas only increases the number of failing pods.

Autoscaling works best when each pod is stable under load and requests reflect real needs.
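
With requests set honestly, a standard autoscaling/v2 HorizontalPodAutoscaler can do its job, because CPU utilization targets are measured relative to the pod's request. A sketch with placeholder names:

```yaml
# Sketch: an HPA that scales on CPU utilization relative to requests.
# The target Deployment name and thresholds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

If the request is far below real usage, a 70% utilization target is crossed almost immediately and the HPA scales erratically, which is another reason requests must reflect reality.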


A Stable Cluster Is an Honest Cluster

Kubernetes does exactly what it is told. If resource definitions are optimistic, the cluster becomes optimistic too. Traffic spikes expose this optimism brutally.

Clusters that survive real world traffic are not over provisioned. They are well measured, honestly configured, and continuously observed.

The difference between a calm cluster and a resilient one is rarely the number of nodes. It is almost always the quality of resource requests and limits.
