Apache Flink, Kubernetes, and How It Works

Imagine you have a factory that processes things. Flink is like that factory, and Kubernetes is like the factory floor manager that decides where machines go and how they run.
1. What is Flink?
Apache Flink is a distributed data processing engine.
“Distributed” means it can run across multiple computers (nodes) at the same time.
“Data processing” means it takes data streams or batches and transforms them into results.
It can process data as it arrives (streaming) or process a fixed dataset (batch).
Flink is stateful and fault-tolerant: it remembers important information and can recover if a worker dies.
Example:
Like counting how many people enter a mall every minute:
Flink will keep a running total.
If a computer crashes, it will recover the total from where it left off.
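The mall example can be sketched in a few lines of plain Python. This is a toy model, not Flink code: the `MinuteCounter` class and its `snapshot` and `restore` methods are made-up names that only mimic what a stateful operator and a checkpoint do.

```python
from collections import defaultdict


class MinuteCounter:
    """Toy stateful operator: counts visitors per minute."""

    def __init__(self):
        self.counts = defaultdict(int)  # minute -> visitors seen so far

    def on_event(self, timestamp_seconds):
        minute = timestamp_seconds // 60
        self.counts[minute] += 1

    def snapshot(self):
        # Copy the state so later events do not change the checkpoint.
        return dict(self.counts)

    def restore(self, checkpoint):
        self.counts = defaultdict(int, checkpoint)


counter = MinuteCounter()
for ts in [5, 20, 59, 61]:
    counter.on_event(ts)

checkpoint = counter.snapshot()  # durable copy taken here

# Simulate a crash: the in-memory operator is gone and a new one starts.
recovered = MinuteCounter()
recovered.restore(checkpoint)
recovered.on_event(75)  # processing resumes after the checkpoint

print(dict(recovered.counts))  # {0: 3, 1: 2}
```

The recovered counter picks up from the snapshot instead of starting at zero, which is the behaviour the example above describes.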
2. Flink Cluster Components

A Flink cluster has two main roles:
JobManager (JM): The Brain
The JobManager is like the manager of the factory.
Responsibilities:
Accept jobs (the “instructions” for the factory)
Split the job into smaller tasks
Decide which worker (TaskManager) will execute each task
Coordinate checkpoints and keep track of their metadata
Handle failure recovery
Usually, 1 JobManager pod in Kubernetes
Needs access to persistent storage (a PVC in this setup) where checkpoints and savepoints are kept
TaskManager (TM): The Worker
The TaskManager is like a worker machine on the factory floor.
Responsibilities:
Execute tasks assigned by the JobManager
Keep temporary state in memory or local disk (ephemeral)
Report back progress to JobManager
TaskManagers are ephemeral, and the state they hold locally is only a working copy:
If a TaskManager dies, Kubernetes starts a replacement pod, possibly on another node
The JobManager restarts the affected tasks from the latest completed checkpoint
Each TaskManager pod can have one or more slots (think “hands” to do work)
3. Slots: Hands of the TaskManager
Each TaskManager has slots, which are units of parallel work.
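Each slot runs one parallel slice of a Flink job. With Flink's default slot sharing, that slice can hold one subtask of each operator in the pipeline.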
Analogy:
Imagine a worker has 2 hands → they can work on 2 small tasks at the same time
If you have 4 workers, each with 2 hands → 8 tasks can be worked on simultaneously
Slots let Flink divide work and split a TaskManager's managed memory between tasks. CPU is not isolated per slot.
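The slot arithmetic from the analogy can be written down as a small Python sketch. The function name `slot_layout` and the 1024 MB figure are made up for illustration. The even split of managed memory is a simplification of how a TaskManager divides memory among the slots set by `taskmanager.numberOfTaskSlots`.

```python
def slot_layout(task_managers, slots_per_tm, managed_memory_mb_per_tm):
    """Return total slots and the managed memory each slot gets (in MB)."""
    total_slots = task_managers * slots_per_tm
    # Assumption: managed memory is divided evenly between a TaskManager's slots.
    memory_per_slot_mb = managed_memory_mb_per_tm / slots_per_tm
    return total_slots, memory_per_slot_mb


total_slots, per_slot_mb = slot_layout(
    task_managers=4, slots_per_tm=2, managed_memory_mb_per_tm=1024
)
print(total_slots, per_slot_mb)  # 8 512.0
```

More slots per TaskManager means more parallel work per pod, but each slot then gets a smaller share of that pod's memory.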
4. Parallelism: How Many Hands Work
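Parallelism is how many parallel subtasks each operator of a job is split into.
With default slot sharing, a job needs as many slots as its highest operator parallelism.
Highest parallelism the cluster can run = total slots in cluster (this is not Flink's separate "max parallelism" setting, which limits how far a job can be rescaled)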
Example:
| Component | Pods | Slots per Pod | Total Slots |
| JobManager | 1 | 0 | 0 |
| TaskManager | 4 | 2 | 8 |
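parallelism.default = 2 → job will run with 2 parallel subtasks if no -p is specified
Job parallelism = 6 → 6 subtasks run, distributed across the 4 TaskManagers
Job parallelism = 10 → the job needs 10 slots but only 8 exist; with the default scheduler it does not start partially, it waits for more slots and fails after a timeout unless TaskManagers are added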
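To make the numbers in this example concrete, here is a rough Python sketch that hands out slots to subtasks. The function `place_subtasks` is a made-up helper, and the spread-across-TaskManagers order is an illustration, not Flink's real placement logic. The sketch treats a request for more slots than exist as an error, which is an assumption about the default scheduler rather than a statement of how every Flink setup behaves.

```python
def place_subtasks(parallelism, task_managers, slots_per_tm):
    """Map subtask index -> (task_manager, slot), or raise if the cluster is too small."""
    # Fill slot 0 on every TaskManager first, then slot 1, and so on.
    free_slots = [
        (tm, slot)
        for slot in range(slots_per_tm)
        for tm in range(1, task_managers + 1)
    ]
    if parallelism > len(free_slots):
        raise RuntimeError(
            f"need {parallelism} slots, only {len(free_slots)} available"
        )
    return {subtask: free_slots[subtask] for subtask in range(parallelism)}


placement = place_subtasks(6, task_managers=4, slots_per_tm=2)
print(placement)

try:
    place_subtasks(10, task_managers=4, slots_per_tm=2)
except RuntimeError as err:
    print("cannot schedule:", err)
```

With parallelism 6, every TaskManager gets one subtask and two of them get a second one. With parallelism 10, the 8 slots are not enough.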
5. How TaskManagers store data
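TaskManagers are ephemeral; they use local memory or local disk for their working state.
Nothing on a TaskManager's local disk is durable. If a TaskManager dies, its local data is lost.
Durable state lives on the PVC, under the JobManager's coordination:
Checkpoints
Savepoints
Checkpoint flow:
JobManager asks TaskManagers to snapshot their local state
TaskManagers write their snapshots to the checkpoint storage and acknowledge to the JobManager
JobManager records the checkpoint as complete (its metadata goes to the same persistent storage)
If a TM dies, it is restarted → its tasks are restored from the latest completed checkpoint
Note: for TaskManagers to write snapshots there, the checkpoint directory has to be reachable from them too, for example a shared (ReadWriteMany) volume or an object store such as S3.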
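The checkpoint flow can be played through with a small Python simulation. None of this is Flink API: `DurableStore`, `Worker`, and `Coordinator` are invented stand-ins for the PVC, a TaskManager, and the JobManager. In this sketch each worker writes its own snapshot to the shared store and the coordinator only tracks which checkpoints completed. Recovery is simplified to replacing a single worker.

```python
class DurableStore:
    """Stand-in for the PVC: survives worker crashes."""

    def __init__(self):
        self.files = {}      # (checkpoint_id, worker_name) -> state copy
        self.completed = []  # ids of checkpoints every worker acknowledged


class Worker:
    def __init__(self, name):
        self.name = name
        self.state = {}  # local working state, lost if the worker dies

    def snapshot(self, checkpoint_id, store):
        store.files[(checkpoint_id, self.name)] = dict(self.state)
        return self.name  # acknowledgement sent back to the coordinator

    def restore(self, checkpoint_id, store):
        self.state = dict(store.files[(checkpoint_id, self.name)])


class Coordinator:
    def __init__(self, workers, store):
        self.workers = workers  # name -> Worker
        self.store = store
        self.next_id = 1

    def trigger_checkpoint(self):
        checkpoint_id = self.next_id
        self.next_id += 1
        acks = {w.snapshot(checkpoint_id, self.store) for w in self.workers.values()}
        if acks == set(self.workers):  # complete only when everyone acknowledged
            self.store.completed.append(checkpoint_id)
        return checkpoint_id

    def recover(self, name):
        # Start a fresh worker and load the latest completed checkpoint.
        latest = self.store.completed[-1]
        replacement = Worker(name)
        replacement.restore(latest, self.store)
        self.workers[name] = replacement
        return replacement


store = DurableStore()
workers = {"tm1": Worker("tm1"), "tm2": Worker("tm2")}
jm = Coordinator(workers, store)

workers["tm1"].state["count"] = 10
workers["tm2"].state["count"] = 7
jm.trigger_checkpoint()

workers["tm1"].state["count"] = 12  # progress made after the checkpoint
new_tm1 = jm.recover("tm1")         # tm1 died, its local state is gone
print(new_tm1.state)                # {'count': 10}
```

The replacement worker comes back with the value from the checkpoint (10), not the later in-memory value (12). Work done after the last checkpoint has to be redone.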
6. How Kubernetes schedules TaskManagers
As an example, consider this scenario:
You requested 4 TaskManager pods
Kubernetes decides which node each pod runs on
You added soft (preferred) pod anti-affinity → the scheduler tries to spread them across the 2 nodes, but may still co-locate pods
Example of your cluster after scheduling:
| TaskManager | Node |
| TM1 | node1 |
| TM2 | node1 |
| TM3 | node2 |
| TM4 | node2 |
JobManager pod runs on node1, mounts PVC for persistent storage
TaskManagers use local ephemeral storage
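For reference, a soft anti-affinity rule of the kind described above might look like the fragment below, built as a Python dict and printed as JSON. The field names follow the Kubernetes pod spec. The `component: taskmanager` label and the weight of 100 are assumptions, so adjust them to whatever labels your TaskManager pods actually carry.

```python
import json

# Fragment of a pod spec: prefer not to share a node with other TaskManager pods.
task_manager_affinity = {
    "affinity": {
        "podAntiAffinity": {
            # "preferred" makes this a soft rule: the scheduler may ignore it.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "podAffinityTerm": {
                        "labelSelector": {
                            # Assumed label on the TaskManager pods.
                            "matchLabels": {"component": "taskmanager"}
                        },
                        # Spread per node (one topology domain per hostname).
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }
}

print(json.dumps(task_manager_affinity, indent=2))
```

Because the rule is only preferred, 4 pods on 2 nodes will still be scheduled, with some pods sharing a node as in the table above.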
7. Step-by-step of job execution
Submit a job (example: WordCount)
JobManager splits job into subtasks according to parallelism
Assigns subtasks to TaskManager slots
TaskManagers execute tasks, keep temporary state
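Periodically, the JobManager triggers a checkpoint → TaskManagers snapshot their state to persistent storage (the PVC)
If a TaskManager dies → Kubernetes restarts pod → JobManager restarts the affected tasks from the latest checkpoint
Job continues from that checkpoint; records processed since then are processed again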
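The "split into subtasks" step can be illustrated with a WordCount in plain Python. This is not the Flink API: `subtask_for` and `word_count` are made-up helpers, and the CRC32 hash is a stand-in for Flink's own key partitioning. The idea it shows is that every word is routed to exactly one parallel subtask, and each subtask keeps its own counts.

```python
import zlib
from collections import Counter


def subtask_for(word, parallelism):
    # Stable hash, so the same word always lands on the same subtask.
    return zlib.crc32(word.encode("utf-8")) % parallelism


def word_count(lines, parallelism):
    """Return one Counter per subtask, each holding only the words routed to it."""
    subtask_state = [Counter() for _ in range(parallelism)]
    for line in lines:
        for word in line.lower().split():
            subtask_state[subtask_for(word, parallelism)][word] += 1
    return subtask_state


lines = ["to be or not to be", "to do is to be"]
per_subtask = word_count(lines, parallelism=2)

# Merging the per-subtask results gives the overall counts.
total = sum(per_subtask, Counter())
print(total["to"], total["be"])  # 4 3
```

Raising `parallelism` spreads the words over more subtasks, and each of those subtasks would need a slot on some TaskManager.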
8. Example analogy
JobManager → Manager in a factory
TaskManagers → Workers on the floor
Slots → Worker’s hands
Parallelism → Number of hands used on a job
PVC → Manager’s filing cabinet with important records
TaskManager ephemeral storage → Paper on worker’s desk (temporary, lost if worker leaves)
Key takeaways
Flink separates computation (TaskManagers) from durable state (checkpoints on the PVC, coordinated by the JobManager)
Slots cap how much parallelism the cluster can run
TaskManagers are disposable: Kubernetes can reschedule them on any node
JobManager is critical: together with the PVC it ensures the job can recover if TMs die
parallelism.default is just a default; the parallelism a job can actually run at is capped by the total slots in the cluster






