How I Reduced Jaeger Trace Loss in a Small AKS Cluster


DevOps & Cloud Engineer — building scalable, automated, and intelligent systems. Developer of sorts | Automator | Innovator

Introduction

Distributed tracing helps understand how services behave when requests travel through multiple microservices. Tools like Jaeger are very useful, but they can also consume significant memory and CPU resources. This becomes especially noticeable when the Kubernetes cluster is small.

In my case, I deployed Jaeger in a single-node Azure Kubernetes Service (AKS) cluster running on a Standard_D3_v2 node. At first, everything looked fine. But under real workload, traces began to disappear. Only some traces showed up, and sometimes none at all.

This is the story of why that happened and how I fixed it.


Cluster Context

Component                  | Details
Cloud Provider             | Azure Kubernetes Service (AKS)
Node Count                 | 1
Node Type                  | Standard_D3_v2 (4 vCPU, 14 GB RAM)
Jaeger Deployment          | All-in-One (Collector + Query + Agent)
Storage Backend (Initial)  | In-memory
Storage Backend (Final)    | Badger
Workload Type              | Python application with moderate tracing volume

The Problem

The issue was trace loss. When requests increased, Jaeger was not able to keep up. The UI showed missing spans and partial traces. Sometimes entire traces disappeared.

Why does this happen?

Reason #1: In-memory Storage Flushes Data Quickly

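The default Jaeger all-in-one deployment uses in-memory storage. This is fast but extremely volatile, because traces live only in the memory of the Jaeger process. Once the store hits its limit, or the pod is restarted under memory pressure, older traces are gone without warning.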

Memory full → drop oldest spans → missing traces in UI
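
To make that eviction behaviour concrete, here is a minimal Python sketch of a capped in-memory store. It is a toy model and not Jaeger's implementation: the class name `BoundedTraceStore`, the `max_traces` cap of 3, and the trace IDs are all made up for illustration.

```python
from collections import OrderedDict


class BoundedTraceStore:
    """Toy in-memory store that keeps at most max_traces traces (illustrative only)."""

    def __init__(self, max_traces):
        self.max_traces = max_traces
        self.traces = OrderedDict()  # trace_id -> list of span names, oldest first
        self.evicted = 0

    def add_span(self, trace_id, span):
        if trace_id not in self.traces:
            if len(self.traces) >= self.max_traces:
                # The store is full: silently drop the oldest trace to make room.
                self.traces.popitem(last=False)
                self.evicted += 1
            self.traces[trace_id] = []
        self.traces[trace_id].append(span)

    def find(self, trace_id):
        # Returns None when the trace was never stored or has been evicted.
        return self.traces.get(trace_id)


store = BoundedTraceStore(max_traces=3)  # hypothetical cap
for i in range(5):
    store.add_span(f"trace-{i}", "GET /orders")

print(store.find("trace-0"))  # None: the oldest trace was evicted
print(store.find("trace-4"))  # ['GET /orders']
```

The caller that wrote `trace-0` never learns it was removed, which matches how the missing traces looked from the UI.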

Reason #2: Single Node Means Shared Resource Pressure

A Standard_D3_v2 node has only 4 vCPUs. When the Jaeger collector competes with application workloads for them, the collector queue can start backing up:

Incoming spans → Collector queue → Storage

When the queue fills:

New spans are dropped
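
The same drop-on-full behaviour can be sketched with a bounded queue from the Python standard library. This only models the idea and is not how the Jaeger collector is written. The queue size of 100 and the burst of 250 spans are arbitrary numbers chosen for the example.

```python
import queue


def enqueue_spans(span_queue, spans):
    """Try to enqueue each span without blocking; return how many were dropped."""
    dropped = 0
    for span in spans:
        try:
            span_queue.put_nowait(span)
        except queue.Full:
            # No consumer has freed a slot, so the span is discarded.
            dropped += 1
    return dropped


collector_queue = queue.Queue(maxsize=100)  # hypothetical queue size
burst = [f"span-{i}" for i in range(250)]   # hypothetical traffic burst

dropped = enqueue_spans(collector_queue, burst)
print(f"accepted={collector_queue.qsize()} dropped={dropped}")
# accepted=100 dropped=150
```

In this model nothing drains the queue during the burst. That is roughly what happens when the writer side is starved of CPU: the queue stays full and every extra span is lost.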

How I Diagnosed The Issue

I observed these symptoms:

  1. CPU usage spiking to nearly 80% on the Jaeger pod

  2. Memory usage climbing until the pod was OOMKilled or the stored traces were cleared

  3. Jaeger logs showing queue warnings:

queue is full, dropping spans

This confirmed the root cause: Jaeger was overloaded.
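
To get a rough sense of how often the queue overflowed, the warning lines can be counted with a few lines of Python. This is a sketch under assumptions: the sample log lines below are invented for the example, and only the phrase "queue is full" is taken from the warning shown above.

```python
def count_queue_full_warnings(log_lines, needle="queue is full"):
    """Count log lines that contain the queue-overflow warning (case-insensitive)."""
    return sum(1 for line in log_lines if needle in line.lower())


# Invented sample lines; real collector logs carry more fields than this.
sample_logs = [
    "info: span received",
    "warn: queue is full, dropping spans",
    "warn: Queue is full, dropping spans",
    "info: span written",
]

print(count_queue_full_warnings(sample_logs))  # 2
```

In practice the lines would come from the pod's log output, for example the output of `kubectl logs` saved to a file and read line by line.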


The Fix

I applied two changes:

1. Increased Resource Requests and Limits

Example configuration:

resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1500m"
    memory: "2Gi"

The requests guarantee Jaeger a baseline share of CPU and memory on the node, and the limits leave headroom for bursts.
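
As a quick sanity check on those numbers, the sketch below works out what share of the node they represent. It assumes the full 4 vCPU and 14 GB (treated as 14 GiB) are available; the capacity AKS actually leaves for pods is lower because some is reserved for the system. The helper names are made up for this example, and the memory parser only handles the `Mi` and `Gi` suffixes.

```python
NODE_CPU_MILLICORES = 4 * 1000  # Standard_D3_v2: 4 vCPU
NODE_MEMORY_MIB = 14 * 1024     # 14 GB RAM, treated here as 14 GiB (assumption)


def cpu_to_millicores(value):
    """Convert a CPU quantity such as "500m" or "2" to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def memory_to_mib(value):
    """Convert a memory quantity ending in Mi or Gi to MiB."""
    if value.endswith("Gi"):
        return int(float(value[:-2]) * 1024)
    if value.endswith("Mi"):
        return int(float(value[:-2]))
    raise ValueError(f"unsupported memory unit in {value!r}")


def node_share(resources):
    """Return each request/limit as a percentage of the node's raw capacity."""
    share = {}
    for kind, spec in resources.items():
        share[kind] = {
            "cpu_pct": 100 * cpu_to_millicores(spec["cpu"]) / NODE_CPU_MILLICORES,
            "memory_pct": 100 * memory_to_mib(spec["memory"]) / NODE_MEMORY_MIB,
        }
    return share


resources = {
    "requests": {"cpu": "500m", "memory": "1Gi"},
    "limits": {"cpu": "1500m", "memory": "2Gi"},
}

share = node_share(resources)
for kind, pct in share.items():
    print(f"{kind}: cpu={pct['cpu_pct']:.1f}% memory={pct['memory_pct']:.1f}%")
# requests: cpu=12.5% memory=7.1%
# limits: cpu=37.5% memory=14.3%
```

Even at its limits, Jaeger stays well under half of the node, which leaves room for the application pods.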

2. Switched Storage Backend from In-Memory to Badger

Badger is a lightweight key-value storage engine suitable for single-node setups.
It avoids trace eviction caused by memory pressure.

Helm values change:

storage:
  type: badger
  badger:
    directory: /badger
    ephemeral: false

This allowed traces to persist longer and prevented sudden drops.
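
The difference between the two backends comes down to where the data lives. The sketch below illustrates it with Python's standard `shelve` module standing in for a disk-backed key-value store. It has nothing to do with Badger's actual API, and the helper names, file path, and span names are invented for the example.

```python
import os
import shelve
import tempfile


def write_trace(db_path, trace_id, spans):
    """Persist a trace to a disk-backed key-value file."""
    with shelve.open(db_path) as db:
        db[trace_id] = spans


def read_trace(db_path, trace_id):
    """Read a trace back from disk; returns None if it is not there."""
    with shelve.open(db_path) as db:
        return db.get(trace_id)


data_dir = tempfile.mkdtemp()                # stands in for the storage directory
db_path = os.path.join(data_dir, "traces")

in_memory = {"trace-1": ["GET /orders", "SELECT orders"]}
write_trace(db_path, "trace-1", in_memory["trace-1"])

in_memory = {}  # simulate a process restart: everything held in memory is gone

print(in_memory.get("trace-1"))        # None
print(read_trace(db_path, "trace-1"))  # ['GET /orders', 'SELECT orders']
```

One caveat for Kubernetes: data on disk only survives a pod being recreated if the directory is backed by a persistent volume.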


Results

Before                        | After
Frequent trace loss           | Stable trace retention
Collector queue warning logs  | No queue overflow warnings
Unreliable observability      | Consistent tracing in UI
Memory spikes                 | Controlled usage

Simple Architecture Diagram

[Diagram: Application Pods → Jaeger Collector → Badger Storage]

Now let us break down what each component does:

1. Application Pods

Your microservices or application containers generate trace spans.
Each span represents one part of a request, such as:

  • A database query

  • An HTTP call to another service

  • A background job

These spans help show how long each part of a request takes.


2. Jaeger Collector

The collector receives the span data from the application.

Think of it like a “mail sorting center”:

Apps send spans → Collector receives, batches, and writes them.

If the collector becomes overloaded or has no place to store spans, they get dropped — this is where trace loss happened before the fix.


3. Badger Storage

This is where the spans are actually stored.

  • With in-memory storage, data disappears when memory fills.

  • With Badger, spans are written to a lightweight key-value store on disk.

So instead of traces evaporating, they are retained long enough to view and analyze.


Simplifying the Flow

You can think of the whole system like this:

Your App → Sends Traces → Collector → Stores Traces → UI shows Traces

If any part in the middle breaks, you get missing traces.

By giving Jaeger more CPU and memory and switching storage to Badger, I ensured:

  • Collector did not choke under pressure

  • Spans were not dropped due to memory eviction

  • Traces became reliable and consistent


Key Takeaways

  • In-memory storage is fine for testing, but not for real workloads.

  • Even a small cluster can run Jaeger effectively, but resources must be allocated carefully.

  • Badger is a practical lightweight backend for non-production or low-volume production clusters.

  • Observability is only useful when it is reliable.


Conclusion

Jaeger is powerful, but it requires thoughtful resource planning. In a small AKS cluster, the default configuration is not enough. Switching from in-memory storage to Badger, combined with slight resource tuning, solved the trace loss issue completely. The result was consistent distributed tracing and better insight into system behavior.

If you run a low-resource Kubernetes environment and notice missing traces, this approach may help you fix it quickly.
