Reducing Jaeger Trace Loss

Introduction

Distributed tracing helps understand how services behave when requests travel through multiple microservices. Tools like Jaeger are very useful, but they can also consume significant memory and CPU resources. This becomes especially noticeable when the Kubernetes cluster is small.

In my case, I deployed Jaeger in a single-node Azure Kubernetes Service (AKS) cluster running on a Standard_D3_v2 node. At first, everything looked fine. But under real workload, traces began to disappear. Only some traces showed up, and sometimes none at all.

This is the story of why that happened and how I fixed it.

Cluster Context

Component	Details
Cloud Provider	Azure Kubernetes Service (AKS)
Node Count	1
Node Type	Standard_D3_v2 (4 vCPU, 14 GB RAM)
Jaeger Deployment	All-in-One (Collector + Query + Agent)
Storage Backend (Initial)	In-memory
Storage Backend (Final)	Badger
Workload Type	Python application with moderate tracing volume

The Problem

The issue was trace loss. When requests increased, Jaeger was not able to keep up. The UI showed missing spans and partial traces. Sometimes entire traces disappeared.

Why does this happen?

Reason #1: In-memory Storage Flushes Data Quickly

The default Jaeger all-in-one deployment uses in-memory storage. This is fast but extremely volatile. When memory fills up, older traces are dropped without warning.

Memory full → drop oldest spans → missing traces in UI

Reason #2: Single Node Means Shared Resource Pressure

AKS Standard_D3_v2 has only 4 vCPUs. When Jaeger collector competes with application workloads, the collector queue can start backing up:

Incoming spans → Collector queue → Storage

When the queue fills:

New spans are dropped

How I Diagnosed The Issue

I observed these symptoms:

CPU usage spikes near 80% on the Jaeger pod
Memory usage increasing until pod OOMKilled or auto-cleared
Jaeger logs showing queue warnings:

queue is full, dropping spans

This confirmed the root cause: Jaeger was overloaded.

The Fix

I applied two changes:

1. Increased Resource Requests and Limits

Example configuration:

resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "1500m"
    memory: "2Gi"

This ensured Jaeger received enough CPU shares and memory headroom.

2. Switched Storage Backend from In-Memory to Badger

Badger is a lightweight key-value storage engine suitable for single-node setups.
It avoids trace eviction caused by memory pressure.

Helm values change:

storage:
  type: badger
  badger:
    directory: /badger
    ephemeral: false

This allowed traces to persist longer and prevented sudden drops.

Results

Before	After
Frequent trace loss	Stable trace retention
Collector queue warning logs	No queue overflow warnings
Unreliable observability	Consistent tracing in UI
Memory spikes	Controlled usage

Simple Architecture Diagram

Now let us break down what each component does:

1. Application Pods

Your microservices or application containers generate trace spans.
Each span represents one part of a request, such as:

A database query
An HTTP call to another service
A background job

These spans help show how long each part of a request takes.

2. Jaeger Collector

The collector receives the span data from the application.

Think of it like a “mail sorting center”:

Apps send spans → Collector receives, batches, and writes them.

If the collector becomes overloaded or has no place to store spans, they get dropped — this is where trace loss happened before the fix.

3. Badger Storage

This is where the spans are actually stored.

With in-memory storage, data disappears when memory fills.
With Badger, spans are written to a lightweight key-value store on disk.

So instead of traces evaporating, they are retained long enough to view and analyze.

Simplifying the Flow

You can think of the whole system like this:

Your App → Sends Traces → Collector → Stores Traces → UI shows Traces

If any part in the middle breaks, you get missing traces.

By giving Jaeger more CPU / memory and switching storage to Badger, you ensured:

Collector did not choke under pressure
Spans were not dropped due to memory eviction
Traces became reliable and consistent

Key Takeaways

In-memory storage is fine for testing, but not for real workloads.
Even a small cluster can run Jaeger effectively, but resources must be allocated carefully.
Badger is a practical lightweight backend for non-production or low volume production clusters.
Observability is only useful when it is reliable.

Conclusion

Jaeger is powerful, but it requires thoughtful resource planning. In a small AKS cluster, the default configuration is not enough. Switching from in-memory storage to Badger, combined with slight resource tuning, solved the trace loss issue completely. The result was consistent distributed tracing and better insight into system behavior.

If you run a low-resource Kubernetes environment and notice missing traces, this approach may help you fix it quickly.

How I Reduced Jaeger Trace Loss in a Small AKS Cluster

Introduction

Cluster Context

The Problem

Reason #1: In-memory Storage Flushes Data Quickly

Reason #2: Single Node Means Shared Resource Pressure

How I Diagnosed The Issue

The Fix

1. Increased Resource Requests and Limits

2. Switched Storage Backend from In-Memory to Badger

Results

Simple Architecture Diagram

1. Application Pods

2. Jaeger Collector

3. Badger Storage

Simplifying the Flow

Key Takeaways

Conclusion

More from this blog

When SSL Lies: Debugging PostgreSQL “server does not support SSL” in Kubernetes

A Real World Journey Building on Tencent Cloud

Lessons Learned Building a CI Pipeline That Auto-Tags and Deploys Docker Images

What I Learned Migrating a Real App from Docker Compose to Kubernetes

Running Apache Flink on Kubernetes: From Zero to a Fully Utilized Cluster

Command Palette

Introduction

Cluster Context

The Problem

Reason #1: In-memory Storage Flushes Data Quickly

Reason #2: Single Node Means Shared Resource Pressure

How I Diagnosed The Issue

The Fix

1. Increased Resource Requests and Limits

2. Switched Storage Backend from In-Memory to Badger

Results

Simple Architecture Diagram

1. Application Pods

2. Jaeger Collector

3. Badger Storage

Simplifying the Flow

Key Takeaways

Conclusion

More from this blog