<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[CodeOps Studies]]></title><description><![CDATA[DevOps, CodeOps, Practices : The good and bad, Solutions and mistakes committed along the journey of development to deployment]]></description><link>https://blog.nyzex.in</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1761603500195/640df845-995c-4545-b309-34aca8439dc9.png</url><title>CodeOps Studies</title><link>https://blog.nyzex.in</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 24 Apr 2026 21:33:33 GMT</lastBuildDate><atom:link href="https://blog.nyzex.in/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[When SSL Lies: Debugging PostgreSQL “server does not support SSL” in Kubernetes]]></title><description><![CDATA[You spin up a Kubernetes workload. You connect to PostgreSQL.
You add sslmode=require because security matters.
And then PostgreSQL replies with:
server does not support SSL, but SSL was required

Wai]]></description><link>https://blog.nyzex.in/when-ssl-lies-debugging-postgresql-server-does-not-support-ssl-in-kubernetes</link><guid isPermaLink="true">https://blog.nyzex.in/when-ssl-lies-debugging-postgresql-server-does-not-support-ssl-in-kubernetes</guid><category><![CDATA[Devops]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[software development]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Wed, 22 Apr 2026 18:45:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/61b66c78-00c1-4841-9afe-0ec8e9907ecc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You spin up a Kubernetes workload. You connect to PostgreSQL.</p>
<p>You add <code>sslmode=require</code> because security matters.</p>
<p>And then PostgreSQL replies with:</p>
<pre><code class="language-plaintext">server does not support SSL, but SSL was required
</code></pre>
<p>Wait. What?</p>
<p>This is one of those errors that sounds like a configuration typo but usually reveals something deeper about your infrastructure. Let’s break it down properly and fix it the right way.</p>
<h2>The Setup</h2>
<p>This was a fairly common environment:</p>
<ul>
<li><p>Application running inside Kubernetes</p>
</li>
<li><p>PostgreSQL running on a private IP inside the VPC</p>
</li>
<li><p>Direct <code>psql</code> access used for testing connectivity</p>
</li>
<li><p>Connection string explicitly enforcing SSL</p>
</li>
</ul>
<p>The test command looked like this:</p>
<pre><code class="language-plaintext">psql "host=10.1.1.17 port=5432 dbname=mydb user=myuser sslmode=require"
</code></pre>
<p>And PostgreSQL responded:</p>
<pre><code class="language-plaintext">psql: error: connection to server at "10.1.1.17", port 5432 failed:
server does not support SSL, but SSL was required
</code></pre>
<p>At first glance, it feels contradictory. PostgreSQL supports SSL. So why is it saying it does not?</p>
<hr />
<h2>What <code>sslmode=require</code> Actually Does</h2>
<p>When you set:</p>
<pre><code class="language-plaintext">sslmode=require
</code></pre>
<p>you are telling the PostgreSQL client:</p>
<blockquote>
<p>If SSL cannot be negotiated, fail immediately.</p>
</blockquote>
<p>No fallback. No downgrade. No retry without encryption.</p>
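<p>For reference, <code>require</code> is only one of libpq's <code>sslmode</code> values, and they trade off differently:</p>
<pre><code class="language-plaintext">disable      never use SSL
allow        try plain first, upgrade to SSL if the server insists
prefer       try SSL first, fall back to plain (the default)
require      SSL only, but no certificate verification
verify-ca    SSL only, verify the server certificate chain
verify-full  SSL only, verify the chain and the host name
</code></pre>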
<p>The connection flow looks like this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/cc741a68-29e8-4128-8221-7671bdaf334c.png" alt="" style="display:block;margin:0 auto" />

<p>If the server is not configured with SSL enabled, the handshake never happens and the client exits.</p>
<p>That is exactly what happened here.</p>
<hr />
<h2>Why This Happens in Kubernetes Environments</h2>
<p>In many Kubernetes setups, PostgreSQL is:</p>
<ul>
<li><p>A container inside the cluster</p>
</li>
<li><p>A VM inside a private subnet</p>
</li>
<li><p>An internal managed database</p>
</li>
<li><p>Or a local development database</p>
</li>
</ul>
<p>In these cases, SSL is often disabled by default.</p>
<p>This creates a mismatch:</p>
<ul>
<li><p>The client demands encryption</p>
</li>
<li><p>The server is configured for plain TCP</p>
</li>
<li><p>PostgreSQL refuses to downgrade</p>
</li>
<li><p>The connection fails</p>
</li>
</ul>
<p>From PostgreSQL’s perspective, nothing is wrong. It simply does not have SSL enabled.</p>
<h2>Step 1: Confirm Whether the Server Supports SSL</h2>
<p>Before changing anything, verify the server configuration.</p>
<p>If you can connect without SSL, try:</p>
<pre><code class="language-plaintext">psql "host=10.1.1.17 port=5432 dbname=mydb user=myuser sslmode=disable"
</code></pre>
<p>If that works, then the database is reachable and the issue is purely SSL related.</p>
<p>Now check whether SSL is enabled on the server:</p>
<pre><code class="language-plaintext">SHOW ssl;
</code></pre>
<p>If it returns:</p>
<pre><code class="language-plaintext">ssl | off
</code></pre>
<p>then the error makes complete sense.</p>
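<p>You can also probe this from any machine without <code>psql</code>. A PostgreSQL client opens the conversation by sending an 8-byte SSLRequest packet, and the server answers with a single byte: <code>S</code> if SSL is available, <code>N</code> if not. A minimal Python sketch (the helper name and timeout are my own, not part of any client library):</p>
<pre><code class="language-python">import socket
import struct

# SSLRequest: message length (8) followed by the magic code 80877103
SSL_REQUEST = struct.pack("!II", 8, 80877103)

def server_supports_ssl(host, port=5432, timeout=3.0):
    """Return True if the server answers 'S' (SSL available), False on 'N'."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(SSL_REQUEST)
        reply = sock.recv(1)
    return reply == b"S"
</code></pre>
<p>An <code>N</code> reply is exactly the situation behind the error message: the server is reachable, but not configured for SSL.</p>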
<hr />
<h2>Step 2: Understand Your Architecture</h2>
<p>Here is the real question:</p>
<p>Should SSL even be required here?</p>
<p>Look at the traffic path.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/68b346b0-eff4-4877-ab9c-1019ab30223e.png" alt="" style="display:block;margin:0 auto" />

<p>If all communication is happening:</p>
<ul>
<li><p>Inside the same Kubernetes cluster</p>
</li>
<li><p>Or inside a private VPC</p>
</li>
<li><p>Without exposure to the public internet</p>
</li>
</ul>
<p>then the network layer is already isolated.</p>
<p>In such cases, enforcing SSL is sometimes unnecessary and only adds complexity.</p>
<p>However, if the database is:</p>
<ul>
<li><p>Publicly accessible</p>
</li>
<li><p>Cross region</p>
</li>
<li><p>Or accessed over the internet</p>
</li>
</ul>
<p>then SSL should absolutely be enabled.</p>
<hr />
<h2>Step 3: The Two Real Fixes</h2>
<p>There are only two correct solutions.</p>
<h3>Option One: Disable SSL on the Client</h3>
<p>If this is an internal trusted network, change:</p>
<pre><code class="language-plaintext">sslmode=require
</code></pre>
<p>to:</p>
<pre><code class="language-plaintext">sslmode=disable
</code></pre>
<p>or simply remove the parameter.</p>
<p>Test:</p>
<pre><code class="language-plaintext">psql "host=10.1.1.17 port=5432 dbname=mydb user=myuser"
</code></pre>
<p>If it connects successfully, you are done.</p>
<p>This is common in:</p>
<ul>
<li><p>Local development</p>
</li>
<li><p>Internal Kubernetes clusters</p>
</li>
<li><p>Private staging environments</p>
</li>
</ul>
<h3>Option Two: Enable SSL on PostgreSQL Properly</h3>
<p>If encryption is required, then the server must be configured correctly.</p>
<p>In postgresql.conf:</p>
<pre><code class="language-plaintext">ssl = on
ssl_cert_file = 'server.crt'
ssl_key_file = 'server.key'
</code></pre>
<p>Then update pg_hba.conf:</p>
<p>For example, a <code>hostssl</code> rule that only accepts SSL connections (the address range and auth method here are illustrative, adjust them to your network):</p>
<pre><code class="language-plaintext">hostssl  mydb  myuser  10.1.0.0/16  scram-sha-256
</code></pre>
<p>Restart PostgreSQL.</p>
<p>Now test:</p>
<pre><code class="language-plaintext">psql "host=10.1.1.17 port=5432 dbname=mydb user=myuser sslmode=require"
</code></pre>
<p>And confirm, using the <code>sslinfo</code> extension:</p>
<pre><code class="language-plaintext">SELECT ssl_is_used();
</code></pre>
<p>If it returns true, the handshake succeeded.</p>
<p>The flow now looks like this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/05d539b2-6fc3-4428-ac1d-bb2ec28067b8.png" alt="" style="display:block;margin:0 auto" />

<p>That is the correct configuration if encryption is part of your compliance or security requirement.</p>
<hr />
<h2>A Common Hidden Cause in Cloud Environments</h2>
<p>There is one more subtle scenario.</p>
<p>Sometimes you are used to managed services like Amazon RDS where SSL is enabled automatically.</p>
<p>Then you switch to:</p>
<ul>
<li><p>Self managed PostgreSQL</p>
</li>
<li><p>A container image</p>
</li>
<li><p>A VM based deployment</p>
</li>
</ul>
<p>And assume SSL is still enabled.</p>
<p>It is not.</p>
<p>Managed database services often handle:</p>
<ul>
<li><p>Certificate provisioning</p>
</li>
<li><p>Key rotation</p>
</li>
<li><p>TLS negotiation</p>
</li>
<li><p>Client CA bundles</p>
</li>
</ul>
<p>Self managed PostgreSQL does none of that unless you configure it explicitly.</p>
<p>That assumption gap is where this error usually comes from.</p>
<h2>Practical Debug Checklist</h2>
<p>Whenever you see this error, follow this mental model:</p>
<img src="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/a5b3fce1-27e9-4a9d-a36c-65d7e302cfde.png" alt="" style="display:block;margin:0 auto" />

<p>Keep it simple.</p>
<p>Test the network first. Then test without SSL.</p>
<p>Then make an intentional security decision.</p>
<hr />
<h2>Lessons Learned</h2>
<p>The key takeaway is this:</p>
<p>Just because PostgreSQL supports SSL does not mean your server instance is configured for it. Blindly adding <code>sslmode=require</code> because it feels secure can actually break perfectly valid internal architectures.</p>
<p>Security is about understanding your network boundaries, not just toggling flags in connection strings. If traffic never leaves a private subnet, SSL might be optional.</p>
<p>If traffic crosses public infrastructure, SSL should be enforced and configured correctly.</p>
<p>The important part is intention.</p>
<hr />
<h2>Final Thought</h2>
<p>This error is not PostgreSQL lying. It is PostgreSQL being honest.</p>
<p>It is simply telling you:</p>
<p>“I do not have SSL enabled, and you told me not to connect without it.”</p>
<p>Once you understand that, the fix becomes straightforward.</p>
<p>And like most infrastructure issues, the real problem was not PostgreSQL.</p>
<p>It was an assumption.</p>
]]></content:encoded></item><item><title><![CDATA[A Real World Journey Building on Tencent Cloud]]></title><description><![CDATA[When you first approach Tencent Cloud, it feels familiar if you come from AWS or Azure. There are VPCs, Kubernetes clusters, load balancers, object storage, IAM, everything you would expect. But once ]]></description><link>https://blog.nyzex.in/a-real-world-journey-building-on-tencent-cloud</link><guid isPermaLink="true">https://blog.nyzex.in/a-real-world-journey-building-on-tencent-cloud</guid><category><![CDATA[Devops]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[Devops articles]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Mon, 16 Mar 2026 03:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/94e36bdb-7591-4565-b9eb-c33debb36c4a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When you first approach Tencent Cloud, it feels familiar if you come from AWS or Azure. There are VPCs, Kubernetes clusters, load balancers, object storage, IAM, everything you would expect. But once you actually start building and deploying a real system, the differences start to show up in very real ways.</p>
<p>This is a story of building a production ready setup on Tencent Cloud, running into ICP restrictions, redesigning architecture, and eventually stabilizing with a private Kubernetes cluster, bastion access, and a Hong Kong deployment.</p>
<p>It is also a breakdown of how Tencent Cloud actually works under the hood, and how it compares to AWS and Azure in practical DevOps workflows.</p>
<hr />
<h2>Understanding Tencent Cloud at a Core Level</h2>
<p>Tencent Cloud follows a similar mental model to other cloud providers, but the naming and behavior have subtle differences.</p>
<p>At the foundation you have a VPC. Inside the VPC you define subnets which can be public or private. Compute resources like CVMs or Kubernetes nodes live inside these subnets. Networking is controlled through security groups and routing tables.</p>
<p>Kubernetes is offered through TKE which is Tencent Kubernetes Engine. It abstracts control plane management but still gives you flexibility over networking, node pools, and scaling.</p>
<p>Object storage is COS which behaves similarly to S3.</p>
<p>Load balancing is handled through CLB which supports both public and private exposure.</p>
<p>At a glance everything looks standard. The complexity begins when you try to expose services publicly inside mainland China.</p>
<hr />
<h2>The ICP Reality</h2>
<p>One of the biggest turning points in this journey was understanding ICP.</p>
<p>If you deploy infrastructure in mainland China regions such as Shanghai or Beijing, you cannot simply expose a public website or API like you would on AWS. You need an ICP license issued by the Chinese government.</p>
<p>Without ICP approval, public endpoints may not work reliably, may be blocked, or may never become accessible.</p>
<p>This creates a very real constraint.</p>
<p>You can build everything correctly from a technical standpoint and still fail at the final step of making your service reachable.</p>
<p>That is exactly what happened.</p>
<p>Everything was deployed in Shanghai. Kubernetes cluster was up. Services were running. Ingress was configured. But external access became the bottleneck due to ICP restrictions.</p>
<hr />
<h2>The Shift to Hong Kong</h2>
<p>The solution was not a code change. It was a region change.</p>
<p>Tencent Cloud’s Hong Kong region operates outside mainland China regulations. That means no ICP requirement.</p>
<p>Moving to Hong Kong immediately removed the compliance barrier and allowed public exposure of services without regulatory friction.</p>
<p>This shift also impacted latency and accessibility for global users in a positive way.</p>
<p>The architecture itself remained similar, but the operational experience improved drastically.</p>
<hr />
<h2>Architecture Overview</h2>
<p>The final architecture evolved into a secure and production ready setup with strong isolation.</p>
<p>You have a private Kubernetes cluster running inside a VPC. The nodes are not directly exposed to the internet. Instead, access is controlled through a bastion host.</p>
<p>Ingress is handled through NGINX inside the cluster. External traffic flows through a load balancer into the ingress controller, and then to services.</p>
<p>TCP services such as device communication ports are also exposed through the ingress configuration rather than separate load balancers.</p>
<p>Here is a simplified flow of the architecture:</p>
<img src="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/5ef1315c-dd43-4b2c-b6e8-10761164ee48.png" alt="" style="display:block;margin:0 auto" />

<hr />
<h2>Private Cluster and Bastion Access</h2>
<p>One of the key design decisions was keeping the Kubernetes cluster private.</p>
<p>This means nodes do not have public IPs. Direct SSH or API access from the internet is not allowed.</p>
<p>Instead, a bastion host is deployed in a public subnet.</p>
<p>You connect to the bastion, and from there access internal resources.</p>
<p>This adds a strong security boundary.</p>
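<p>In practice this is a two-hop SSH, which OpenSSH collapses into one command with its <code>ProxyJump</code> flag (the user, bastion host, and internal IP below are illustrative):</p>
<pre><code class="language-plaintext">ssh -J ops@bastion.example.com ops@10.0.2.15
</code></pre>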
<img src="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/ba0d58c4-248d-4ea9-9f5d-a0c597137656.png" alt="" style="display:block;margin:0 auto" />

<p>This approach is very similar to hardened AWS architectures, but in Tencent Cloud it feels more necessary because of networking defaults and access patterns.</p>
<hr />
<h2>Kubernetes Exposure Strategy</h2>
<p>Instead of creating multiple load balancers for each service, the setup uses a single ingress controller.</p>
<p>NGINX ingress handles both HTTP and TCP traffic.</p>
<p>This is especially useful when dealing with systems like GPS tracking devices or custom protocols where TCP ports need to be exposed.</p>
<p>Configuration is done through a ConfigMap that maps external ports to internal services.</p>
<p>This reduces cost and complexity since Tencent CLB instances are not as flexible or cheap as AWS alternatives in some cases.</p>
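<p>With the NGINX ingress controller, that mapping lives in its <code>tcp-services</code> ConfigMap, where each key is an external port and each value is <code>namespace/service:port</code>. A sketch with illustrative service names:</p>
<pre><code class="language-plaintext">apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  # expose external port 5000 to the gps-gateway service
  "5000": "default/gps-gateway:5000"
</code></pre>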
<hr />
<h2>Storage and Data Flow</h2>
<p>Object storage using COS works very similarly to S3.</p>
<p>You can upload files, serve static assets, and integrate with applications easily.</p>
<p>A simple flow looks like this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/e098d5dd-9791-4a64-85dc-925e45827a7e.png" alt="" style="display:block;margin:0 auto" />

<p>The APIs are slightly different, but the mental model remains the same.</p>
<hr />
<h2>CI/CD and Deployment</h2>
<p>Your pipeline builds Docker images, pushes them to a registry, and deploys them to Kubernetes.</p>
<p>Tencent Cloud can integrate with CI systems, but often external tools like GitHub Actions or self managed pipelines provide more flexibility.</p>
<p>The deployment flow looks like this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/e92ae9fb-74ee-4e29-9255-f7a3b4ae9a46.png" alt="" style="display:block;margin:0 auto" />

<p>This part feels very similar to AWS and Azure workflows.</p>
<hr />
<h2>Key Differences from AWS and Azure</h2>
<p>Tencent Cloud is powerful, but the experience differs in important ways.</p>
<p>Documentation is not as consistent or detailed as AWS. You often need to experiment or translate concepts mentally.</p>
<p>Naming conventions are different which creates a small learning curve.</p>
<p>Networking behavior can feel stricter or less intuitive, especially with private clusters and routing.</p>
<p>The ICP requirement is something you never deal with in AWS or Azure global regions.</p>
<p>The console UI is functional but less polished compared to AWS.</p>
<p>On the positive side, Tencent Cloud integrates well with the Chinese ecosystem and provides strong performance in Asia.</p>
<hr />
<h2>Lessons Learned</h2>
<p>The biggest lesson is that infrastructure is not just about technology. It is also about geography and regulation.</p>
<p>Choosing the wrong region can block your entire system even if everything is technically correct.</p>
<p>Private clusters with bastion access provide strong security and should be the default for production setups.</p>
<p>Ingress based exposure is more efficient than multiple load balancers, especially when dealing with mixed traffic types.</p>
<p>Always design for flexibility. Being able to shift regions saved the entire setup.</p>
<hr />
<h2>Conclusion</h2>
<p>Tencent Cloud is a capable platform, but it requires a different mindset compared to AWS and Azure.</p>
<p>Once you understand the constraints, especially around ICP and networking, you can build robust and scalable systems.</p>
<p>The journey from Shanghai to Hong Kong was not just a migration. It was a deeper understanding of how cloud infrastructure interacts with real world regulations and architecture decisions.</p>
<p>If you approach Tencent Cloud with the right expectations, it becomes a powerful tool rather than a frustrating one.</p>
]]></content:encoded></item><item><title><![CDATA[Lessons Learned Building a CI Pipeline That Auto-Tags and Deploys Docker Images]]></title><description><![CDATA[When I first automated Docker builds and deployments, I thought the hard part would be writing the YAML. It was not.
The real challenges were versioning, preventing accidental rollbacks, handling envi]]></description><link>https://blog.nyzex.in/lessons-learned-building-a-ci-pipeline-that-auto-tags-and-deploys-docker-images</link><guid isPermaLink="true">https://blog.nyzex.in/lessons-learned-building-a-ci-pipeline-that-auto-tags-and-deploys-docker-images</guid><category><![CDATA[Devops]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[software development]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Tue, 24 Feb 2026 08:58:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/3af115d6-9bff-48e3-b520-539d90be9f95.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I first automated Docker builds and deployments, I thought the hard part would be writing the YAML. It was not.</p>
<p>The real challenges were versioning, preventing accidental rollbacks, handling environment drift, and making deployments predictable. Over time, I built a CI pipeline that automatically tags Docker images, pushes them to a container registry, and deploys the latest version to a server without manual intervention.</p>
<p>This article walks through what worked, what broke, and what I learned while building a production-ready auto-tag and auto-deploy pipeline.</p>
<h2>The Goal</h2>
<p>The objective was simple:</p>
<ul>
<li><p>Every merge to the main branch should build a Docker image</p>
</li>
<li><p>The image should get a unique incrementing version tag</p>
</li>
<li><p>The image should be pushed to a container registry</p>
</li>
<li><p>The deployment server should pull the new version and restart the service automatically</p>
</li>
<li><p>No manual SSH, no manual tagging, no human version bumps</p>
</li>
</ul>
<p>The reality was more nuanced.</p>
<h2>The High-Level Architecture</h2>
<p>The system had four moving parts:</p>
<ul>
<li><p>Source repository</p>
</li>
<li><p>CI workflow</p>
</li>
<li><p>Container registry</p>
</li>
<li><p>Deployment server</p>
</li>
</ul>
<p>Here is the simplified flow.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/ef65f440-4ade-4935-88a5-5399f7397ddc.png" alt="" style="display:block;margin:0 auto" />

<p>At first glance, this looks trivial. The devil was in version control and deployment consistency.</p>
<hr />
<h2>Problem: Manual Versioning Does Not Scale</h2>
<p>Initially, I hardcoded the image tag like this:</p>
<pre><code class="language-plaintext">myapp:latest
</code></pre>
<p>That worked until it didn’t.</p>
<p>Using <code>latest</code> creates ambiguity. If something breaks, you cannot easily roll back. You also do not know what code is actually running in production.</p>
<p>So I moved to semantic versioning:</p>
<pre><code class="language-plaintext">0.0.1
0.0.2
0.0.3
</code></pre>
<p>But manually updating the version before each commit quickly became annoying and error prone.</p>
<p>The fix was automatic version incrementing inside the CI pipeline.</p>
<hr />
<h2>Automatic Version Tagging Strategy</h2>
<p>The pipeline logic became:</p>
<ul>
<li><p>Fetch the latest tag from the registry</p>
</li>
<li><p>Parse the version</p>
</li>
<li><p>Increment the patch number</p>
</li>
<li><p>Tag the new image with the incremented version</p>
</li>
<li><p>Push it</p>
</li>
</ul>
<p>Conceptually:</p>
<img src="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/b448d8af-21f5-49c9-bd22-a8a0c9e3a3fb.png" alt="" style="display:block;margin:0 auto" />
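<p>The increment step itself is small. A sketch in Python (fetching the newest tag from the registry is registry-specific and omitted; <code>bump_patch</code> is a name I made up for illustration):</p>
<pre><code class="language-python">def bump_patch(tag):
    """Turn a 'MAJOR.MINOR.PATCH' tag into the next patch release."""
    major, minor, patch = (int(part) for part in tag.split("."))
    return f"{major}.{minor}.{patch + 1}"

latest = "0.0.7"              # newest tag found in the registry
next_tag = bump_patch(latest)
print(next_tag)               # 0.0.8
</code></pre>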

<p>This solved several problems:</p>
<ul>
<li><p>Every image became uniquely identifiable</p>
</li>
<li><p>Rollbacks became trivial</p>
</li>
<li><p>Production state became transparent</p>
</li>
</ul>
<p>One major lesson here was to avoid deriving version numbers from Git commit hashes for user-facing services. While hashes are unique, semantic versions are easier to reason about operationally.</p>
<h2>CI Pipeline Flow</h2>
<p>The CI pipeline was responsible for:</p>
<ul>
<li><p>Checking out the code</p>
</li>
<li><p>Logging into the registry</p>
</li>
<li><p>Building the Docker image</p>
</li>
<li><p>Tagging with the new version</p>
</li>
<li><p>Pushing both version tag and latest</p>
</li>
<li><p>Triggering deployment</p>
</li>
</ul>
<p>The full CI flow looked like this:</p>
<img src="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/c9653595-3087-4227-9b35-ed9d822638c3.png" alt="" style="display:block;margin:0 auto" />

<p>One key insight was pushing both the version tag and <code>latest</code>.</p>
<p>The version tag gives traceability. The <code>latest</code> tag simplifies pull logic on the server.</p>
<h2>Deployment Automation</h2>
<p>The deployment server had a simple responsibility:</p>
<ul>
<li><p>Pull the newest image</p>
</li>
<li><p>Restart the container</p>
</li>
</ul>
<p>At first, I used a naive approach:</p>
<pre><code class="language-plaintext">docker pull myapp:latest
docker-compose up -d
</code></pre>
<p>This works, but only if you are disciplined.</p>
<p>The issue appears when the image digest does not change or when the server has cached layers in a strange state.</p>
<p>A more robust flow became:</p>
<img src="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/e1f334c3-f459-482b-a090-a866fe20f6cf.png" alt="" style="display:block;margin:0 auto" />

<p>This avoids unnecessary restarts and reduces downtime.</p>
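<p>The decision in that flow reduces to a digest comparison. A sketch (the digest strings would come from <code>docker inspect</code>; the helper is illustrative):</p>
<pre><code class="language-python">def should_restart(running_digest, pulled_digest):
    """Restart only when the pull actually brought a new image."""
    if running_digest is None:      # nothing running yet
        return True
    return running_digest != pulled_digest

print(should_restart("sha256:aaa", "sha256:aaa"))  # False, skip the restart
print(should_restart("sha256:aaa", "sha256:bbb"))  # True, redeploy
</code></pre>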
<h2>Avoiding Downtime</h2>
<p>Restarting a container blindly can cause momentary service disruption.</p>
<p>Two improvements helped:</p>
<ul>
<li><p>Health checks inside Docker</p>
</li>
<li><p>Graceful restart strategy</p>
</li>
</ul>
<p>Instead of stopping first and then starting, the improved approach was:</p>
<ul>
<li><p>Start new container</p>
</li>
<li><p>Verify health</p>
</li>
<li><p>Stop old container</p>
</li>
</ul>
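<p>A rough command-level sketch of that sequence, assuming the image defines a Docker <code>HEALTHCHECK</code> (container names and tag are illustrative):</p>
<pre><code class="language-plaintext"># start the new version alongside the old one
docker run -d --name myapp-new myapp:0.0.8

# poll until Docker reports it healthy
docker inspect --format '{{.State.Health.Status}}' myapp-new

# once healthy, retire the old container
docker stop myapp-old
docker rm myapp-old
</code></pre>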
<p>This pattern mimics blue green deployment at a smaller scale.</p>
<img src="https://cdn.hashnode.com/uploads/covers/68fb4c09020910660a830e8f/e840fa51-6975-4bcf-8acf-d1b1b7530793.png" alt="" style="display:block;margin:0 auto" />

<p>This reduced deployment risk significantly.</p>
<hr />
<h2>Security Lessons</h2>
<p>Several important security practices emerged:</p>
<ul>
<li><p>Never store registry credentials in plain text</p>
</li>
<li><p>Use CI secrets properly</p>
</li>
<li><p>Use short-lived tokens when possible</p>
</li>
<li><p>Restrict server SSH access</p>
</li>
</ul>
<p>Another key lesson was separating build and deploy permissions. The CI pipeline should not have unrestricted server access. Ideally, it triggers deployment via a webhook or controlled SSH user with limited privileges.</p>
<hr />
<h2>Observability Matters</h2>
<p>The first time a deployment silently failed, I realized logs were not optional.</p>
<p>You need:</p>
<ul>
<li><p>CI logs that clearly show version generated</p>
</li>
<li><p>Registry confirmation logs</p>
</li>
<li><p>Server deployment logs</p>
</li>
<li><p>Application startup logs</p>
</li>
</ul>
<p>Without observability, automation becomes guesswork.</p>
<hr />
<h2>Rollback Strategy</h2>
<p>One of the biggest advantages of version tagging is clean rollback.</p>
<p>If production breaks:</p>
<pre><code class="language-plaintext">docker pull myapp:0.0.7
docker run myapp:0.0.7
</code></pre>
<p>No rebuild required.</p>
<p>Rollback becomes a configuration change rather than a panic-driven patch.</p>
<hr />
<h2>What I Would Do Differently</h2>
<p>If building from scratch again:</p>
<ul>
<li><p>Use immutable image references by digest in production</p>
</li>
<li><p>Introduce deployment locking to prevent concurrent runs</p>
</li>
<li><p>Add structured logging for CI</p>
</li>
<li><p>Add automated smoke tests after deployment</p>
</li>
</ul>
<p>Automation is not about speed alone. It is about predictability.</p>
<hr />
<h2>Conclusion</h2>
<p>Building an auto-tagging and auto-deploy CI pipeline sounds simple. It is not.</p>
<p>The complexity lies in:</p>
<ul>
<li><p>Version consistency</p>
</li>
<li><p>Deployment safety</p>
</li>
<li><p>Rollback reliability</p>
</li>
<li><p>Security boundaries</p>
</li>
<li><p>Observability</p>
</li>
</ul>
<p>Once implemented correctly, the workflow changes how you ship software. Deployments stop being events and start becoming routine.</p>
<p>If you are still manually tagging Docker images or SSHing into servers to deploy, start automating today. That shift in mindset is the real upgrade.</p>
]]></content:encoded></item><item><title><![CDATA[What I Learned Migrating a Real App from Docker Compose to Kubernetes]]></title><description><![CDATA[For a long time, Docker Compose felt like the perfect solution. Simple YAML, fast local setup, predictable behavior. For a single service or even a small stack, it works beautifully.
But at some point, reality catches up.
As the application grew, tra...]]></description><link>https://blog.nyzex.in/what-i-learned-migrating-a-real-app-from-docker-compose-to-kubernetes</link><guid isPermaLink="true">https://blog.nyzex.in/what-i-learned-migrating-a-real-app-from-docker-compose-to-kubernetes</guid><category><![CDATA[Devops]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[Developer]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Thu, 12 Feb 2026 10:25:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770891894557/3014893b-4e5d-4122-a02c-e1ab204fc812.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For a long time, Docker Compose felt like the perfect solution. Simple YAML, fast local setup, predictable behavior. For a single service or even a small stack, it works beautifully.</p>
<p>But at some point, reality catches up.</p>
<p>As the application grew, traffic became less predictable, deployments needed to be safer, and uptime started to matter more than convenience. That was the point where migrating from Docker Compose to Kubernetes stopped being optional and became inevitable.</p>
<p>This post is not a Kubernetes tutorial. It’s a reflection on what actually changed, what broke, and what I wish I had understood earlier before making the move.</p>
<hr />
<h2 id="heading-docker-compose-worked-until-it-didnt">Docker Compose Worked Until It Didn’t</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770891455318/80ad2f1d-95d7-4808-a1b2-b6f5e711388c.png" alt class="image--center mx-auto" /></p>
<p>Docker Compose is excellent at answering one question:<br />“How do I run multiple containers together on one machine?”</p>
<p>The problems started when my needs shifted to different questions:</p>
<ul>
<li><p>How do I scale only one service without touching others?</p>
</li>
<li><p>How do I deploy without downtime?</p>
</li>
<li><p>How do I expose HTTP and raw TCP services reliably?</p>
</li>
<li><p>How do I survive a node restart without manual intervention?</p>
</li>
</ul>
<p>Compose can technically handle some of this, but only with scripts, conventions, and a lot of discipline. Over time, the setup became fragile. A single bad deploy could take everything down.</p>
<p>Kubernetes didn’t magically solve these problems, but it gave me primitives that were designed for them.</p>
<hr />
<h2 id="heading-the-biggest-mental-shift-stop-thinking-in-containers">The Biggest Mental Shift: Stop Thinking in Containers</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770891569122/8340a6a8-86d0-4c6b-8dd1-e8f8e8312af6.png" alt class="image--center mx-auto" /></p>
<p>In Docker Compose, you think in containers.<br />In Kubernetes, you think in systems.</p>
<p>At first, this was uncomfortable. I kept asking questions like:</p>
<ul>
<li><p>Where is my container?</p>
</li>
<li><p>Why did Kubernetes restart it?</p>
</li>
<li><p>Why are there three replicas when I only started one?</p>
</li>
</ul>
<p>Eventually, I realized Kubernetes doesn’t care about my containers. It cares about desired state.</p>
<p>Once I stopped fighting that idea and started defining what I wanted instead of how to do it, things clicked.</p>
<p>I no longer deployed containers. I declared intentions.</p>
<hr />
<h2 id="heading-configuration-management-became-a-first-class-concern">Configuration Management Became a First-Class Concern</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770891580762/e41582e2-9687-45dc-883a-92866fa52382.png" alt class="image--center mx-auto" /></p>
<p>In Compose, environment variables often live in <code>.env</code> files or directly in YAML. That works until you have multiple environments, secrets, and rotating credentials.</p>
<p>Kubernetes forced me to clean this up.</p>
<p>ConfigMaps and Secrets felt verbose at first, but they created a clean separation:</p>
<ul>
<li><p>Application code stopped knowing where configuration came from</p>
</li>
<li><p>Sensitive values were no longer mixed with runtime logic</p>
</li>
<li><p>Environment differences became explicit instead of accidental</p>
</li>
</ul>
<p>This alone reduced production mistakes more than any CI rule I had before.</p>
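<p>As a hedged sketch of that separation (all names and values here are illustrative, not from my real cluster): non-sensitive settings go in a ConfigMap while credentials live in a Secret.</p>

```sh
# Non-sensitive config and a base64-encoded credential, kept apart.
# "app-config", "app-secrets", and the values are hypothetical.
DB_PASSWORD_B64=$(printf 'example-password' | base64)

cat > app-config.yaml <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: "info"
  DB_HOST: "postgres.internal"
---
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
data:
  DB_PASSWORD: "${DB_PASSWORD_B64}"
EOF

if command -v kubectl >/dev/null 2>&1; then
  kubectl apply -f app-config.yaml || true
fi
```

<p>The application consumes both as environment variables or mounted files and never needs to know which store a value came from.</p>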
<hr />
<h2 id="heading-networking-was-simpler-and-more-complicated-at-the-same-time">Networking Was Simpler and More Complicated at the Same Time</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770891708608/b7acf759-fe9e-4c74-9f17-aaca7af8c85d.png" alt class="image--center mx-auto" /></p>
<p>This was one of the most surprising parts.</p>
<p>Inside the cluster, service discovery is easier than Docker Compose. Every service gets a stable DNS name. Containers can restart, reschedule, or scale without breaking internal communication.</p>
<p>But ingress is where things got interesting.</p>
<p>Exposing HTTP traffic was straightforward once I adopted an ingress controller. TLS termination, routing, and host-based rules became declarative and repeatable.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770891730142/ad984e9c-96dc-48f6-b4ff-f721c670d22b.png" alt class="image--center mx-auto" /></p>
<p>Exposing raw TCP ports was harder. This is something Compose hides from you. In Kubernetes, you must understand:</p>
<ul>
<li><p>Services</p>
</li>
<li><p>NodePorts</p>
</li>
<li><p>LoadBalancers</p>
</li>
<li><p>Ingress TCP mappings</p>
</li>
</ul>
<p>Once configured correctly, it was more reliable than my old setup. Getting there required patience and a lot of reading logs.</p>
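<p>For example, with the ingress-nginx controller, raw TCP ports are exposed through a dedicated ConfigMap that maps an external port to <code>namespace/service:port</code>. This is a hedged sketch; the <code>db/postgres</code> service and port are hypothetical:</p>

```sh
# Map external port 5432 to the "postgres" service in the "db"
# namespace (hypothetical names). ingress-nginx must be started with
# --tcp-services-configmap pointing at this ConfigMap.
cat > tcp-services.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  "5432": "db/postgres:5432"
EOF

if command -v kubectl >/dev/null 2>&1; then
  kubectl apply -f tcp-services.yaml || true
fi
```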
<hr />
<h2 id="heading-scaling-is-not-just-replicas">Scaling Is Not Just Replicas</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770891667148/8ec93e3b-255c-41f5-9fef-41db763be789.png" alt class="image--center mx-auto" /></p>
<p>In Compose, scaling usually means <code>--scale</code> and hoping nothing breaks.</p>
<p>In Kubernetes, scaling forced me to confront assumptions I didn’t know I had:</p>
<ul>
<li><p>Is my app stateless?</p>
</li>
<li><p>What happens if two replicas process the same request?</p>
</li>
<li><p>Where does session data live?</p>
</li>
<li><p>What happens during rolling updates?</p>
</li>
</ul>
<p>Horizontal Pod Autoscaling was powerful, but it exposed poor application design immediately. Anything relying on local state or filesystem assumptions broke fast.</p>
<p>This pain was useful. It forced architectural improvements that made the system more resilient overall.</p>
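<p>A minimal autoscaler sketch for a hypothetical stateless <code>web</code> Deployment (the thresholds are illustrative):</p>

```sh
# Scale between 2 and 10 replicas, targeting 70% average CPU.
cat > web-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
EOF

if command -v kubectl >/dev/null 2>&1; then
  kubectl apply -f web-hpa.yaml || true
fi
```

<p>None of this helps if replicas share local state: the autoscaler assumes any replica can serve any request.</p>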
<hr />
<h2 id="heading-health-checks-are-not-optional-anymore">Health Checks Are Not Optional Anymore</h2>
<p>Docker Compose lets unhealthy services limp along. Kubernetes does not.</p>
<p>Once liveness and readiness probes were in place, I learned two important lessons:</p>
<ul>
<li><p>A service can be running and still be unusable</p>
</li>
<li><p>Restarting a container is often better than keeping it alive</p>
</li>
</ul>
<p>Bad health checks caused cascading failures. Good ones made deployments boring. Boring deployments are the goal.</p>
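<p>As a sketch, probes for a hypothetical <code>web</code> Deployment can be added with a strategic-merge patch (paths, ports, and timings are illustrative):</p>

```sh
# Liveness restarts a stuck container; readiness gates traffic to it.
cat > probes-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
        - name: web
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
EOF

if command -v kubectl >/dev/null 2>&1; then
  kubectl patch deployment web --patch-file probes-patch.yaml || true
fi
```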
<hr />
<h2 id="heading-cicd-became-cleaner-and-more-predictable">CI/CD Became Cleaner and More Predictable</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770891627712/e556b4fe-174a-4d5f-9c96-d23a64758306.png" alt class="image--center mx-auto" /></p>
<p>Before Kubernetes, deployments were procedural.<br />Run this script. Pull this image. Restart that service.</p>
<p>After Kubernetes, deployments became declarative:</p>
<ul>
<li><p>Build image</p>
</li>
<li><p>Push image</p>
</li>
<li><p>Update manifest</p>
</li>
<li><p>Let the cluster reconcile</p>
</li>
</ul>
<p>This reduced the surface area for human error. If something went wrong, Kubernetes told me what failed and why. Logs, events, and pod states became a reliable source of truth instead of guesswork.</p>
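<p>Those four steps can be sketched as a dry-run deploy script. It only prints each command; the registry, image, and deployment names are hypothetical placeholders:</p>

```sh
# Print the pipeline instead of executing it; swap `echo "+ $*"` for
# `"$@"` to run the commands for real.
IMAGE="registry.example.com/web:${GIT_SHA:-dev}"

run() { echo "+ $*"; }

run docker build -t "$IMAGE" .                     # 1. build image
run docker push "$IMAGE"                           # 2. push image
run kubectl set image deployment/web web="$IMAGE"  # 3. update desired state
run kubectl rollout status deployment/web          # 4. wait for reconciliation
```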
<hr />
<h2 id="heading-kubernetes-did-not-reduce-complexity-it-reorganized-it">Kubernetes Did Not Reduce Complexity, It Reorganized It</h2>
<p>This is important to say clearly.</p>
<p>Kubernetes did not make the system simpler. It made the complexity explicit.</p>
<p>Things that were previously hidden inside scripts, assumptions, and tribal knowledge were now written down in YAML. That felt heavy at first, but it also meant:</p>
<ul>
<li><p>New environments were reproducible</p>
</li>
<li><p>Failures were diagnosable</p>
</li>
<li><p>Scaling decisions were intentional</p>
</li>
</ul>
<p>The complexity existed before. Kubernetes just stopped pretending it didn’t.</p>
<hr />
<h2 id="heading-what-i-would-do-differently-next-time">What I Would Do Differently Next Time</h2>
<p>I would start smaller.</p>
<p>Instead of migrating everything at once, I would:</p>
<ul>
<li><p>Move one stateless service first</p>
</li>
<li><p>Get ingress and TLS right early</p>
</li>
<li><p>Invest in logging and metrics from day one</p>
</li>
<li><p>Treat manifests as code, not configuration</p>
</li>
</ul>
<p>Most importantly, I would spend more time understanding Kubernetes concepts before trying to bend them into old patterns.</p>
<hr />
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>Docker Compose is not bad. It’s just honest about what it is.</p>
<p>Kubernetes is not overkill when your system starts needing guarantees instead of convenience.</p>
<p>The migration was not smooth, but it was worth it. Not because Kubernetes is trendy, but because it forced better engineering decisions that I had been postponing.</p>
<p>If you are feeling friction with Docker Compose, that friction is a signal. Listen to it.</p>
]]></content:encoded></item><item><title><![CDATA[Running Apache Flink on Kubernetes: From Zero to a Fully Utilized Cluster]]></title><description><![CDATA[This blog walks through Apache Flink end to end, starting from what Flink is, how its architecture works, and how to deploy and properly utilize a Kubernetes cluster using Flink’s standalone Kubernetes mode. The goal is not just to get Flink running,...]]></description><link>https://blog.nyzex.in/running-apache-flink-on-kubernetes-from-zero-to-a-fully-utilized-cluster</link><guid isPermaLink="true">https://blog.nyzex.in/running-apache-flink-on-kubernetes-from-zero-to-a-fully-utilized-cluster</guid><category><![CDATA[Devops]]></category><category><![CDATA[flink]]></category><category><![CDATA[apache]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[k0s]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Sun, 18 Jan 2026 19:05:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768763003680/5bd6791e-0717-4b9b-a26d-6eaab0204f5a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This blog walks through Apache Flink end to end, starting from what Flink is, how its architecture works, and how to deploy and properly utilize a Kubernetes cluster using Flink’s standalone Kubernetes mode. The goal is not just to get Flink running, but to make sure it runs <em>correctly</em>, efficiently, and in a way that matches how Flink is designed to work.</p>
<p>This guide is based on a real Kubernetes cluster with one control plane and two worker nodes, each with roughly 8 GB RAM.</p>
<hr />
<h2 id="heading-what-is-apache-flink">What is Apache Flink</h2>
<p>Apache Flink is a distributed stream and batch processing engine designed for stateful, low-latency, high-throughput data processing. Unlike traditional batch systems, Flink treats streaming as the primary model, with batch being a special case of bounded streams.</p>
<p>Key properties of Flink:</p>
<ul>
<li><p>True streaming engine, not micro-batching</p>
</li>
<li><p>Stateful processing with exactly-once guarantees</p>
</li>
<li><p>Event-time processing and watermarks</p>
</li>
<li><p>Horizontal scalability</p>
</li>
<li><p>Fault tolerance via checkpoints and state backends</p>
</li>
</ul>
<p>Flink is commonly used for real-time analytics, event-driven applications, fraud detection, metrics aggregation, and complex event processing.</p>
<hr />
<h2 id="heading-core-flink-architecture">Core Flink Architecture</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768767510476/c49bd565-ef67-4ddb-b9ec-b7da5e5a9bfd.png" alt class="image--center mx-auto" /></p>
<p>A Flink cluster is composed of a small number of well-defined components.</p>
<h3 id="heading-jobmanager">JobManager</h3>
<p>The JobManager is the brain of the cluster.</p>
<p>It is responsible for:</p>
<ul>
<li><p>Accepting jobs</p>
</li>
<li><p>Creating execution graphs</p>
</li>
<li><p>Scheduling tasks</p>
</li>
<li><p>Coordinating checkpoints</p>
</li>
<li><p>Handling failures and restarts</p>
</li>
</ul>
<p>Only one active JobManager exists at a time in standalone mode.</p>
<hr />
<h3 id="heading-taskmanager">TaskManager</h3>
<p>TaskManagers are the workers of the Flink cluster.</p>
<p>Each TaskManager:</p>
<ul>
<li><p>Runs tasks (operators)</p>
</li>
<li><p>Manages task slots</p>
</li>
<li><p>Executes user code</p>
</li>
<li><p>Maintains local state</p>
</li>
</ul>
<p>A TaskManager exposes a fixed number of <em>task slots</em>. Slots are the unit of parallelism in Flink.</p>
<hr />
<h3 id="heading-slots-and-parallelism">Slots and Parallelism</h3>
<p>A slot represents a share of a TaskManager’s resources.</p>
<p>Total available parallelism is:</p>
<p>TaskManagers × Slots per TaskManager</p>
<p>For example:</p>
<ul>
<li><p>4 TaskManagers</p>
</li>
<li><p>2 slots each</p>
</li>
<li><p>Total parallelism = 8</p>
</li>
</ul>
<p>Jobs can only run with parallelism up to the available slots.</p>
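<p>The same arithmetic as a quick shell check, using the numbers from the example above:</p>

```sh
# Total parallelism = TaskManagers x slots per TaskManager.
TASKMANAGERS=4
SLOTS_PER_TM=2
TOTAL=$((TASKMANAGERS * SLOTS_PER_TM))
echo "max parallelism: $TOTAL"   # prints: max parallelism: 8
```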
<hr />
<h2 id="heading-why-kubernetes-for-flink">Why Kubernetes for Flink</h2>
<p>Kubernetes provides a natural runtime for Flink because:</p>
<ul>
<li><p>Pods map cleanly to JobManager and TaskManager</p>
</li>
<li><p>Kubernetes scheduler handles placement</p>
</li>
<li><p>Native scaling via replicas</p>
</li>
<li><p>Built-in service discovery</p>
</li>
<li><p>Persistent volumes for state</p>
</li>
</ul>
<p>Flink supports Kubernetes in multiple ways. In this blog we use <strong>Standalone Kubernetes mode</strong>, where Flink runs continuously as a cluster inside Kubernetes.</p>
<hr />
<h2 id="heading-cluster-prerequisites">Cluster Prerequisites</h2>
<p>The Kubernetes cluster used here:</p>
<ul>
<li><p>1 control plane node</p>
</li>
<li><p>2 worker nodes</p>
</li>
<li><p>~8 GB RAM per worker</p>
</li>
<li><p>containerd runtime</p>
</li>
<li><p>local-path storage provisioner</p>
</li>
</ul>
<p>Both worker nodes are labeled to allow Flink scheduling:</p>
<pre><code class="lang-bash">kubectl label node nyzex-worker-node1 flink-role=worker
kubectl label node nyzex-worker-node2 flink-role=worker
</code></pre>
<hr />
<h2 id="heading-namespace-setup">Namespace Setup</h2>
<p>Create a dedicated namespace for Flink.</p>
<pre><code class="lang-bash">kubectl create namespace flink
</code></pre>
<p>This keeps Flink resources isolated and easier to manage.</p>
<hr />
<h2 id="heading-persistent-storage-for-flink">Persistent Storage for Flink</h2>
<p>Flink requires persistent storage for:</p>
<ul>
<li><p>Checkpoints</p>
</li>
<li><p>Savepoints</p>
</li>
<li><p>High availability metadata (optional)</p>
</li>
</ul>
<p>Using a PersistentVolumeClaim allows Kubernetes to dynamically provision storage.</p>
<h3 id="heading-pvc-definition">PVC Definition</h3>
<pre><code class="lang-yaml">apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: flink-storage
  namespace: flink
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: local-path
</code></pre>
<p>Apply it:</p>
<pre><code class="lang-bash">kubectl apply -f flink-pvc.yaml
</code></pre>
<p>With <code>WaitForFirstConsumer</code>, the volume binds only after a pod requests it. This is expected behavior.</p>
<hr />
<h2 id="heading-flink-configuration">Flink Configuration</h2>
<p>Flink is configured using <code>flink-conf.yaml</code> mounted into the pods via ConfigMap.</p>
<p>Key configuration:</p>
<pre><code class="lang-yaml">jobmanager.rpc.address: flink-jobmanager

state.backend: filesystem
state.checkpoints.dir: file:///opt/flink/state/checkpoints
state.savepoints.dir: file:///opt/flink/state/savepoints

execution.checkpointing.interval: 10s

parallelism.default: 2

kubernetes.taskmanager.node-selector.flink-role: worker
kubernetes.jobmanager.node-selector.flink-role: worker
</code></pre>
<p>This ensures:</p>
<ul>
<li><p>State is persisted</p>
</li>
<li><p>Checkpointing is enabled</p>
</li>
<li><p>Pods run only on worker nodes</p>
</li>
</ul>
<p>What I used:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ConfigMap</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">flink-config</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">flink</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-attr">flink-conf.yaml:</span> <span class="hljs-string">|
    jobmanager.rpc.address: flink-jobmanager
    taskmanager.numberOfTaskSlots: 2
    parallelism.default: 8
</span>
    <span class="hljs-comment"># Memory</span>
    <span class="hljs-attr">jobmanager.memory.process.size:</span> <span class="hljs-string">1024m</span>
    <span class="hljs-attr">taskmanager.memory.process.size:</span> <span class="hljs-string">3g</span>

    <span class="hljs-comment"># State backend</span>
    <span class="hljs-attr">state.backend:</span> <span class="hljs-string">rocksdb</span>
    <span class="hljs-attr">state.backend.incremental:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">state.checkpoints.dir:</span> <span class="hljs-string">file:///flink-data/checkpoints</span>
    <span class="hljs-attr">state.savepoints.dir:</span> <span class="hljs-string">file:///flink-data/savepoints</span>

    <span class="hljs-attr">execution.checkpointing.interval:</span> <span class="hljs-string">60s</span>
    <span class="hljs-attr">execution.checkpointing.min-pause:</span> <span class="hljs-string">30s</span>
    <span class="hljs-attr">execution.checkpointing.timeout:</span> <span class="hljs-string">10m</span>
</code></pre>
<hr />
<h2 id="heading-jobmanager-deployment">JobManager Deployment</h2>
<p>The JobManager runs as a Deployment with a single replica.</p>
<p>Key points:</p>
<ul>
<li><p>Uses the Flink image</p>
</li>
<li><p>Exposes RPC and Web UI ports</p>
</li>
<li><p>Mounts persistent storage</p>
</li>
<li><p>Uses a node selector</p>
</li>
</ul>
<pre><code class="lang-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
  namespace: flink
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
        - name: jobmanager
          image: flink:1.18
          args: ["jobmanager"]
          ports:
            - containerPort: 6123
            - containerPort: 8081
          volumeMounts:
            - name: flink-config-volume
              mountPath: /opt/flink/conf
            - name: flink-storage
              mountPath: /flink-data
          resources:
            requests:
              memory: "1Gi"
              cpu: "1"
            limits:
              memory: "1Gi"
              cpu: "1"
      volumes:
        - name: flink-config-volume
          configMap:
            name: flink-config
        - name: flink-storage
          persistentVolumeClaim:
            claimName: flink-storage
</code></pre>
<hr />
<h2 id="heading-jobmanager-service">JobManager Service</h2>
<p>A Kubernetes Service exposes the JobManager internally.</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Service
metadata:
  name: flink-jobmanager
  namespace: flink
spec:
  ports:
    - name: rpc
      port: 6123
    - name: webui
      port: 8081
  selector:
    app: flink
    component: jobmanager
</code></pre>
<hr />
<h2 id="heading-taskmanager-deployment">TaskManager Deployment</h2>
<p>TaskManagers scale horizontally using replicas.</p>
<p>Important aspects:</p>
<ul>
<li><p>Resource requests and limits</p>
</li>
<li><p>Slot configuration</p>
</li>
<li><p>Pod anti-affinity for spreading</p>
</li>
</ul>
<pre><code class="lang-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-taskmanager
  namespace: flink
spec:
  replicas: 4
  selector:
    matchLabels:
      app: flink
      component: taskmanager
  template:
    metadata:
      labels:
        app: flink
        component: taskmanager
    spec:
      nodeSelector:
        flink-role: worker
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    component: taskmanager
                topologyKey: kubernetes.io/hostname
      containers:
        - name: taskmanager
          image: flink:1.18
          args: ["taskmanager"]
          env:
            - name: TASK_MANAGER_NUMBER_OF_TASK_SLOTS
              value: "2"
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "3Gi"
              cpu: "2"
</code></pre>
<p>What I used:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">flink-taskmanager</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">flink</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">4</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">flink</span>
      <span class="hljs-attr">component:</span> <span class="hljs-string">taskmanager</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">flink</span>
        <span class="hljs-attr">component:</span> <span class="hljs-string">taskmanager</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">affinity:</span>
        <span class="hljs-attr">podAntiAffinity:</span>
          <span class="hljs-attr">preferredDuringSchedulingIgnoredDuringExecution:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">weight:</span> <span class="hljs-number">100</span>
              <span class="hljs-attr">podAffinityTerm:</span>
                <span class="hljs-attr">labelSelector:</span>
                  <span class="hljs-attr">matchLabels:</span>
                    <span class="hljs-attr">component:</span> <span class="hljs-string">taskmanager</span>
                <span class="hljs-attr">topologyKey:</span> <span class="hljs-string">kubernetes.io/hostname</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">taskmanager</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">flink:1.18</span>
          <span class="hljs-attr">args:</span> [<span class="hljs-string">"taskmanager"</span>]
          <span class="hljs-attr">resources:</span>
            <span class="hljs-attr">requests:</span>
              <span class="hljs-attr">memory:</span> <span class="hljs-string">"3Gi"</span>
              <span class="hljs-attr">cpu:</span> <span class="hljs-string">"2"</span>
            <span class="hljs-attr">limits:</span>
              <span class="hljs-attr">memory:</span> <span class="hljs-string">"3Gi"</span>
              <span class="hljs-attr">cpu:</span> <span class="hljs-string">"2"</span>
          <span class="hljs-attr">volumeMounts:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">flink-config-volume</span>
              <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/opt/flink/conf</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">flink-tmp</span>
              <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/tmp</span>
      <span class="hljs-attr">volumes:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">flink-config-volume</span>
          <span class="hljs-attr">configMap:</span>
            <span class="hljs-attr">name:</span> <span class="hljs-string">flink-config</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">flink-tmp</span>
          <span class="hljs-attr">emptyDir:</span> {}
</code></pre>
<p>This configuration allows Kubernetes to spread TaskManagers across both worker nodes.</p>
<h2 id="heading-some-important-queries-and-information">Some important queries and information:</h2>
<h3 id="heading-why-taskmanagers-are-getting-distributed-across-nodes">Why TaskManagers get distributed across nodes:</h3>
<p>TaskManagers are distributed because:</p>
<ul>
<li><p>Kubernetes schedules pods</p>
</li>
<li><p>You allowed scheduling on both worker nodes</p>
</li>
<li><p>Flink TaskManagers are <strong>stateless compute workers</strong></p>
</li>
<li><p>You added <strong>anti-affinity</strong>, so Kubernetes spreads them</p>
</li>
</ul>
<p>Flink itself does <strong>not</strong> decide node placement. Kubernetes does.</p>
<h3 id="heading-who-decides-where-a-taskmanager-runs">Who decides where a TaskManager runs?</h3>
<h3 id="heading-kubernetes-scheduler-not-flink">Kubernetes scheduler, not Flink</h3>
<p>When you create this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">replicas:</span> <span class="hljs-number">4</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
</code></pre>
<p>You are telling Kubernetes:</p>
<blockquote>
<p>“I want 4 identical TaskManager pods.”</p>
</blockquote>
<p>Kubernetes then:</p>
<ul>
<li><p>Looks at available nodes</p>
</li>
<li><p>Checks nodeSelector</p>
</li>
<li><p>Checks resource requests</p>
</li>
<li><p>Applies affinity rules</p>
</li>
<li><p>Chooses nodes</p>
</li>
</ul>
<p>Flink only sees:</p>
<blockquote>
<p>“I now have 4 TaskManagers connected to me.”</p>
</blockquote>
<hr />
<h3 id="heading-why-they-dont-all-land-on-one-node-anymore">Why they don’t all land on one node anymore:</h3>
<p>Once this is added:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">podAntiAffinity:</span>
  <span class="hljs-attr">preferredDuringSchedulingIgnoredDuringExecution:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">weight:</span> <span class="hljs-number">100</span>
      <span class="hljs-attr">podAffinityTerm:</span>
        <span class="hljs-attr">labelSelector:</span>
          <span class="hljs-attr">matchLabels:</span>
            <span class="hljs-attr">component:</span> <span class="hljs-string">taskmanager</span>
        <span class="hljs-attr">topologyKey:</span> <span class="hljs-string">kubernetes.io/hostname</span>
</code></pre>
<p>Result:</p>
<ul>
<li><p>Pods spread evenly</p>
</li>
<li><p>2 TaskManagers on worker-node1</p>
</li>
<li><p>2 TaskManagers on worker-node2</p>
</li>
</ul>
<hr />
<h3 id="heading-why-taskmanagers-do-not-use-the-pvc">Why TaskManagers do NOT use the PVC</h3>
<h3 id="heading-key-principle">Key principle</h3>
<p><strong>TaskManagers are ephemeral compute.</strong><br />They should be disposable.</p>
<p>Flink is designed so that:</p>
<ul>
<li><p>TaskManagers can die at any time</p>
</li>
<li><p>State is NOT tied to a specific TaskManager pod</p>
</li>
</ul>
<p>So by default:</p>
<ul>
<li><p>TaskManagers do NOT mount persistent volumes</p>
</li>
<li><p>TaskManagers use <strong>local ephemeral storage</strong></p>
</li>
<li><p>Persistent state lives elsewhere</p>
</li>
</ul>
<p>This is intentional.</p>
<hr />
<h3 id="heading-so-what-is-the-pvc-actually-used-for">So what is the PVC actually used for?</h3>
<p>In this setup, the PVC is mounted <strong>only on the JobManager</strong>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">volumeMounts:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">flink-storage</span>
    <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/opt/flink/state</span>
</code></pre>
<p>And configured in <code>flink-conf.yaml</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">state.backend:</span> <span class="hljs-string">filesystem</span>
<span class="hljs-attr">state.checkpoints.dir:</span> <span class="hljs-string">file:///opt/flink/state/checkpoints</span>
<span class="hljs-attr">state.savepoints.dir:</span> <span class="hljs-string">file:///opt/flink/state/savepoints</span>
</code></pre>
<p>This means:</p>
<ul>
<li><p>Checkpoints are written to the PVC</p>
</li>
<li><p>Savepoints are written to the PVC</p>
</li>
<li><p>Job metadata survives pod restarts</p>
</li>
</ul>
<p>During a checkpoint:</p>
<ol>
<li><p>TaskManagers snapshot their local state</p>
</li>
<li><p>Snapshots are acknowledged to the JobManager (small state can travel with the acknowledgement; larger state files are written to the checkpoint directory)</p>
</li>
<li><p>The JobManager persists the checkpoint metadata to the shared storage</p>
</li>
</ol>
<p>If a TaskManager dies:</p>
<ul>
<li><p>Kubernetes restarts it</p>
</li>
<li><p>Flink restores state from the checkpoint directory</p>
</li>
<li><p>Processing resumes</p>
</li>
</ul>
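<p>A quick way to confirm this wiring, run against the live cluster (adjust the path if your checkpoint directory differs from the one configured above):</p>

```sh
# List persisted checkpoints on the JobManager's mounted volume.
# Guarded so the sketch is harmless without cluster access.
CKPT_DIR=/opt/flink/state/checkpoints
if command -v kubectl >/dev/null 2>&1; then
  kubectl exec -n flink deploy/flink-jobmanager -- ls "$CKPT_DIR" || true
fi
```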
<hr />
<h2 id="heading-verifying-cluster-distribution">Verifying Cluster Distribution</h2>
<p>After deployment:</p>
<pre><code class="lang-bash">kubectl get pods -n flink -o wide
</code></pre>
<p>Expected result:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768761856924/ddf36eae-b1fe-4e03-9005-9124400bc17c.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>JobManager on one worker</p>
</li>
<li><p>TaskManagers evenly split across nodes</p>
</li>
<li><p>No Pending pods</p>
</li>
</ul>
<p>This confirms proper scheduling and full cluster utilization.</p>
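<p>To count TaskManagers per node, the wide output can be fed through <code>awk</code>. Here a captured sample stands in for live output (the pod hashes are made up); against the cluster, pipe <code>kubectl get pods -n flink -o wide</code> in instead:</p>

```sh
# Sample `kubectl get pods -o wide` output (column 7 is NODE).
cat > pods.txt <<'EOF'
NAME                           READY  STATUS   RESTARTS  AGE  IP          NODE
flink-jobmanager-6f7c9-x2v4q   1/1    Running  0         5m   10.244.1.5  nyzex-worker-node1
flink-taskmanager-5b8d7-aaaaa  1/1    Running  0         5m   10.244.1.6  nyzex-worker-node1
flink-taskmanager-5b8d7-bbbbb  1/1    Running  0         5m   10.244.2.4  nyzex-worker-node2
flink-taskmanager-5b8d7-ccccc  1/1    Running  0         5m   10.244.1.7  nyzex-worker-node1
flink-taskmanager-5b8d7-ddddd  1/1    Running  0         5m   10.244.2.5  nyzex-worker-node2
EOF

# Count TaskManager pods per node; an even split confirms the
# anti-affinity rule is doing its job.
awk 'NR > 1 && $1 ~ /taskmanager/ { count[$7]++ }
     END { for (n in count) print n, count[n] }' pods.txt | sort
```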
<hr />
<h2 id="heading-accessing-flink-web-ui">Accessing Flink Web UI</h2>
<p>Port-forward the JobManager service:</p>
<pre><code class="lang-bash">kubectl port-forward svc/flink-jobmanager 8081:8081 -n flink
</code></pre>
<p>Open in browser:</p>
<pre><code class="lang-bash">http://localhost:8081
</code></pre>
<p>You should see:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768761934517/90866881-bfec-4edc-8f01-f3b29917530b.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>All TaskManagers registered</p>
</li>
<li><p>Total available slots</p>
</li>
<li><p>Healthy cluster status</p>
</li>
</ul>
<hr />
<h2 id="heading-running-a-sample-job">Running a Sample Job</h2>
<p>Run a built-in Flink example:</p>
<pre><code class="lang-bash">kubectl exec -n flink deploy/flink-jobmanager -- \
  flink run /opt/flink/examples/streaming/WordCount.jar
</code></pre>
<p>Observe task distribution in the UI.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768761996811/772c46d1-e1a4-4aff-8710-c8cf1cfd87c4.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-how-kubernetes-and-flink-work-together">How Kubernetes and Flink Work Together</h2>
<p>Kubernetes handles:</p>
<ul>
<li><p>Pod scheduling</p>
</li>
<li><p>Resource isolation</p>
</li>
<li><p>Restarting failed pods</p>
</li>
<li><p>Networking</p>
</li>
</ul>
<p>Flink handles:</p>
<ul>
<li><p>Task scheduling</p>
</li>
<li><p>State management</p>
</li>
<li><p>Checkpoints</p>
</li>
<li><p>Fault recovery</p>
</li>
</ul>
<p>This separation keeps responsibilities clean and scalable.</p>
<hr />
<h2 id="heading-what-makes-this-production-ready">What Makes This Production-Ready</h2>
<p>This setup already includes:</p>
<ul>
<li><p>Persistent state</p>
</li>
<li><p>Checkpointing</p>
</li>
<li><p>Horizontal scalability</p>
</li>
<li><p>Proper pod distribution</p>
</li>
</ul>
<p>Next improvements can include:</p>
<ul>
<li><p>High availability JobManager</p>
</li>
<li><p>External state backend (S3, MinIO)</p>
</li>
<li><p>Kafka integration</p>
</li>
<li><p>Prometheus metrics</p>
</li>
<li><p>Autoscaling</p>
</li>
</ul>
<hr />
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>Running Apache Flink on Kubernetes is not just about starting pods. Correct scheduling, slot planning, storage configuration, and understanding Flink’s execution model are critical. With this setup, your cluster resources are fully utilized, workloads scale correctly, and you are ready to run real streaming jobs with confidence.</p>
]]></content:encoded></item><item><title><![CDATA[Apache Flink, Kubernetes, and How It Works]]></title><description><![CDATA[Imagine you have a factory that processes things. Flink is like that factory, and Kubernetes is like the factory floor manager that decides where machines go and how they run.

1. What is Flink?
Apache Flink is a distributed data processing engine.

...]]></description><link>https://blog.nyzex.in/apache-flink-kubernetes-and-how-it-works</link><guid isPermaLink="true">https://blog.nyzex.in/apache-flink-kubernetes-and-how-it-works</guid><category><![CDATA[apache]]></category><category><![CDATA[flink]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Wed, 14 Jan 2026 18:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768764797099/f2dae534-2876-4963-a8f5-447439d9c467.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Imagine you have a <strong>factory</strong> that processes things. Flink is like that <strong>factory</strong>, and Kubernetes is like the <strong>factory floor manager</strong> that decides where machines go and how they run.</p>
<hr />
<h2 id="heading-1-what-is-flink">1. What is Flink?</h2>
<p>Apache Flink is a <strong>distributed data processing engine</strong>.</p>
<ul>
<li><p>“Distributed” means it can run across <strong>multiple computers (nodes)</strong> at the same time.</p>
</li>
<li><p>“Data processing” means it takes <strong>data streams or batches</strong> and transforms them into results.</p>
</li>
<li><p>It can <strong>process data as it arrives</strong> (streaming) or <strong>process a fixed dataset</strong> (batch).</p>
</li>
<li><p>Flink is <strong>stateful and fault-tolerant</strong>: it remembers important information and can recover if a worker dies.</p>
</li>
</ul>
<p><strong>Example:</strong></p>
<p>Like counting how many people enter a mall every minute:</p>
<ul>
<li><p>Flink will keep a <strong>running total</strong>.</p>
</li>
<li><p>If a computer crashes, it will <strong>recover the total from where it left off</strong>.</p>
</li>
</ul>
<hr />
<h2 id="heading-2-flink-cluster-components">2. Flink Cluster Components</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768767663876/bfe7225f-f057-42f0-8b6e-ffe860a879c7.png" alt class="image--center mx-auto" /></p>
<p>A Flink cluster has <strong>two main roles</strong>:</p>
<h3 id="heading-jobmanager-jm-the-brain">JobManager (JM): The Brain</h3>
<ul>
<li><p>The <strong>JobManager is like the manager of the factory</strong>.</p>
</li>
<li><p>Responsibilities:</p>
<ul>
<li><p>Accept jobs (the “instructions” for the factory)</p>
</li>
<li><p>Split the job into smaller tasks</p>
</li>
<li><p>Decide which worker (TaskManager) will execute each task</p>
</li>
<li><p>Keep track of the state and checkpoints</p>
</li>
<li><p>Handle failure recovery</p>
</li>
</ul>
</li>
<li><p>Usually, <strong>1 JobManager pod</strong> in Kubernetes</p>
</li>
<li><p>Needs <strong>persistent storage</strong> (PVC) because it stores checkpoints and savepoints</p>
</li>
</ul>
<hr />
<h3 id="heading-taskmanager-tm-the-worker">TaskManager (TM): The Worker</h3>
<ul>
<li><p>The <strong>TaskManager is like a worker machine</strong> on the factory floor.</p>
</li>
<li><p>Responsibilities:</p>
<ul>
<li><p>Execute tasks assigned by the JobManager</p>
</li>
<li><p>Keep temporary state in memory or local disk (ephemeral)</p>
</li>
<li><p>Report back progress to JobManager</p>
</li>
</ul>
</li>
<li><p>TaskManagers are <strong>stateless</strong> and <strong>ephemeral</strong>:</p>
<ul>
<li><p>If a TaskManager dies, Kubernetes will restart it somewhere else</p>
</li>
<li><p>JobManager uses checkpoints to restore the state</p>
</li>
</ul>
</li>
<li><p>Each TaskManager pod can have <strong>one or more slots</strong> (think “hands” to do work)</p>
</li>
</ul>
<hr />
<h2 id="heading-3-slots-hands-of-the-taskmanager">3. Slots: Hands of the TaskManager</h2>
<ul>
<li><p>Each TaskManager has <strong>slots</strong>, which are units of parallel work.</p>
</li>
<li><p>Each slot can run <strong>one subtask</strong> of a Flink job.</p>
</li>
</ul>
<p><strong>Analogy:</strong></p>
<ul>
<li><p>Imagine a worker has 2 hands → they can work on 2 small tasks at the same time</p>
</li>
<li><p>If you have 4 workers, each with 2 hands → 8 tasks can be worked on simultaneously</p>
</li>
<li><p>Slots allow Flink to <strong>divide work and control resources</strong> (memory, CPU) per task</p>
</li>
</ul>
<hr />
<h2 id="heading-4-parallelism-how-many-hands-work">4. Parallelism: How Many Hands Work</h2>
<p>Parallelism is <strong>how many subtasks a job is divided into</strong>.</p>
<ul>
<li><p>Each subtask needs one slot.</p>
</li>
<li><p>Maximum parallelism = total slots in cluster</p>
</li>
</ul>
<p><strong>Example:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Component</td><td>Pods</td><td>Slots per Pod</td><td>Total Slots</td></tr>
</thead>
<tbody>
<tr>
<td>JobManager</td><td>1</td><td>0</td><td>0</td></tr>
<tr>
<td>TaskManager</td><td>4</td><td>2</td><td>8</td></tr>
</tbody>
</table>
</div><ul>
<li><p><code>parallelism.default = 2</code> → job will run 2 subtasks if no <code>-p</code> is specified</p>
</li>
<li><p>Job parallelism = 6 → 6 subtasks run, distributed across the 4 TaskManagers</p>
</li>
<li><p>Job parallelism = 10 → only 8 subtasks get slots (the cluster total); the remaining 2 wait for a free slot, and a streaming job may fail with a slot-allocation timeout if none becomes available</p>
</li>
</ul>
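<p>The numbers in the table and the bullets above map directly onto Flink configuration. A minimal sketch, with values matching the example:</p>
<pre><code class="lang-yaml"># flink-conf.yaml (fragment)
taskmanager.numberOfTaskSlots: 2   # "hands" per TaskManager pod
parallelism.default: 2             # used when a job does not pass -p
</code></pre>
<p>Submitting a job with <code>flink run -p 6 job.jar</code> overrides the default and spreads 6 subtasks across the 8 available slots.</p>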
<hr />
<h2 id="heading-5-how-taskmanagers-store-data">5. How TaskManagers store data</h2>
<ul>
<li><p><strong>TaskManagers are ephemeral</strong>; they use local memory or local disk for temporary state.</p>
</li>
<li><p>They <strong>do NOT use PVC</strong>. If a TaskManager dies, its data is lost.</p>
</li>
<li><p><strong>JobManager + PVC</strong> is where <strong>durable state</strong> lives:</p>
<ul>
<li><p>Checkpoints</p>
</li>
<li><p>Savepoints</p>
</li>
</ul>
</li>
</ul>
<p><strong>Checkpoint flow:</strong></p>
<ol>
<li><p>JobManager asks TaskManagers to snapshot their local state</p>
</li>
<li><p>TaskManagers send snapshots to JobManager</p>
</li>
<li><p>JobManager writes snapshots to PVC (persistent storage)</p>
</li>
<li><p>If a TM dies, it is restarted → state is restored from PVC</p>
</li>
</ol>
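<p>The checkpoint flow above is driven by configuration such as the following sketch, assuming the PVC is mounted at <code>/flink-data</code> (the path and interval are illustrative):</p>
<pre><code class="lang-yaml"># flink-conf.yaml (fragment)
execution.checkpointing.interval: 60000                 # snapshot every 60 s
state.checkpoints.dir: file:///flink-data/checkpoints   # directory on the mounted PVC
state.savepoints.dir: file:///flink-data/savepoints
</code></pre>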
<hr />
<h2 id="heading-6-how-kubernetes-schedules-taskmanagers">6. How Kubernetes schedules TaskManagers</h2>
<p>As an example, let us see this scenario:</p>
<ul>
<li><p>You requested <strong>4 TaskManager pods</strong></p>
</li>
<li><p>Kubernetes decides <strong>which node each pod runs on</strong></p>
</li>
<li><p>You added <strong>soft anti-affinity</strong> → tries to spread them across both nodes</p>
</li>
</ul>
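<p>The soft anti-affinity mentioned above can be expressed on the TaskManager pod template roughly like this (the label values are assumptions based on a typical Flink deployment):</p>
<pre><code class="lang-yaml">affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:   # "soft": a preference, not a hard rule
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            component: taskmanager
        topologyKey: kubernetes.io/hostname            # spread across nodes
</code></pre>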
<p>Example of your cluster after scheduling:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>TaskManager</td><td>Node</td></tr>
</thead>
<tbody>
<tr>
<td>TM1</td><td>node1</td></tr>
<tr>
<td>TM2</td><td>node1</td></tr>
<tr>
<td>TM3</td><td>node2</td></tr>
<tr>
<td>TM4</td><td>node2</td></tr>
</tbody>
</table>
</div><ul>
<li><p>JobManager pod runs on <strong>node1</strong>, mounts PVC for persistent storage</p>
</li>
<li><p>TaskManagers use <strong>local ephemeral storage</strong></p>
</li>
</ul>
<hr />
<h2 id="heading-7-step-by-step-of-job-execution">7. Step-by-step of job execution</h2>
<ol>
<li><p>Submit a job (example: WordCount)</p>
</li>
<li><p>JobManager splits job into subtasks according to parallelism</p>
</li>
<li><p>Assigns subtasks to TaskManager slots</p>
</li>
<li><p>TaskManagers execute tasks, keep temporary state</p>
</li>
<li><p>Periodically, TaskManagers checkpoint state to JobManager → written to PVC</p>
</li>
<li><p>If a TaskManager dies → Kubernetes restarts pod → JobManager restores state</p>
</li>
<li><p>Job continues processing seamlessly</p>
</li>
</ol>
<hr />
<h2 id="heading-8-example-analogy">8. Example analogy</h2>
<ul>
<li><p><strong>JobManager</strong> → Manager in a factory</p>
</li>
<li><p><strong>TaskManagers</strong> → Workers on the floor</p>
</li>
<li><p><strong>Slots</strong> → Worker’s hands</p>
</li>
<li><p><strong>Parallelism</strong> → Number of hands used on a job</p>
</li>
<li><p><strong>PVC</strong> → Manager’s filing cabinet with important records</p>
</li>
<li><p><strong>TaskManager ephemeral storage</strong> → Paper on worker’s desk (temporary, lost if worker leaves)</p>
</li>
</ul>
<hr />
<h2 id="heading-key-takeaways">Key takeaways</h2>
<ul>
<li><p>Flink separates <strong>computation</strong> (TaskManagers) from <strong>state</strong> (JobManager + PVC)</p>
</li>
<li><p><strong>Slots control parallelism at runtime</strong></p>
</li>
<li><p><strong>TaskManagers are disposable</strong>, Kubernetes can reschedule them anywhere</p>
</li>
<li><p><strong>JobManager is critical</strong>: PVC ensures job can recover if TMs die</p>
</li>
<li><p><strong>Parallelism.default</strong> is just a default; maximum parallelism is determined by <strong>total slots in cluster</strong></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Running Traccar on Kubernetes: Lessons Learned from Ingress, TCP Services, and Scaling]]></title><description><![CDATA[Traccar looks simple on the surface. It is just a GPS tracking server with a web interface. Once you attempt to run it in Kubernetes, especially for real device traffic, you quickly realize that it is not a typical HTTP application. Traccar is a mix ...]]></description><link>https://blog.nyzex.in/running-traccar-on-kubernetes-lessons-learned-from-ingress-tcp-services-and-scaling</link><guid isPermaLink="true">https://blog.nyzex.in/running-traccar-on-kubernetes-lessons-learned-from-ingress-tcp-services-and-scaling</guid><category><![CDATA[traccar]]></category><category><![CDATA[Devops]]></category><category><![CDATA[ci-cd]]></category><category><![CDATA[software development]]></category><category><![CDATA[Open Source]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Mon, 29 Dec 2025 12:04:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767009921542/786d82fa-b148-4398-928f-5843b8543fea.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Traccar looks simple on the surface. It is just a GPS tracking server with a web interface. Once you attempt to run it in Kubernetes, especially for real device traffic, you quickly realize that it is not a typical HTTP application. Traccar is a mix of HTTP, long-lived TCP connections, multiple device protocols, and stateful behavior that does not always align cleanly with cloud-native assumptions.</p>
<p>This post documents what worked, what did not, and why certain architectural decisions were made while running Traccar on Kubernetes. The goal is not to present a perfect reference architecture, but to share practical lessons learned from real deployments and iterations.</p>
<hr />
<h2 id="heading-understanding-traccars-traffic-model">Understanding Traccar’s Traffic Model</h2>
<p>Before touching Kubernetes, it is important to understand how Traccar actually receives traffic.</p>
<p>The Traccar web interface is a standard HTTP application. It runs on port 8082 by default and can be exposed using any normal HTTP reverse proxy.</p>
<p>Device traffic is very different. Each GPS device speaks its own protocol. These protocols are almost always raw TCP. Examples include:</p>
<ul>
<li><p>Port 5027 for Teltonika devices like FMB125</p>
</li>
<li><p>Port 5004 for OsmAnd</p>
</li>
<li><p>Many other ports depending on protocol configuration</p>
</li>
</ul>
<p>Devices open long-lived TCP connections and continuously send data. They do not behave like short HTTP requests. This distinction heavily influences how Kubernetes networking must be designed.</p>
<hr />
<h2 id="heading-initial-attempt-treating-traccar-like-a-normal-web-app">Initial Attempt: Treating Traccar Like a Normal Web App</h2>
<p>The first deployment followed a standard Kubernetes pattern.</p>
<ul>
<li><p>Traccar ran in a Deployment</p>
</li>
<li><p>A ClusterIP Service exposed ports 8082, 5027, and 5004</p>
</li>
<li><p>An NGINX Ingress exposed the web interface on a domain</p>
</li>
</ul>
<p>The web interface worked immediately. Logging in, viewing devices, and maps all functioned correctly.</p>
<p>Device connections did not.</p>
<p>At first, it looked like a firewall or security group issue. Ports were open. Services existed. Pods were running. Logs showed no incoming device traffic.</p>
<p>The core mistake was assuming that Kubernetes Ingress could route arbitrary TCP traffic in the same way it routes HTTP.</p>
<hr />
<h2 id="heading-why-standard-ingress-does-not-work-for-device-ports">Why Standard Ingress Does Not Work for Device Ports</h2>
<p>Kubernetes Ingress is an HTTP abstraction. It understands hosts, paths, headers, and HTTP semantics.</p>
<p>Traccar device protocols are raw TCP. There is no HTTP handshake, no headers, and no routing metadata.</p>
<p>Most Ingress controllers, including NGINX Ingress, completely ignore non-HTTP traffic unless explicitly configured to handle TCP streams.</p>
<p>This is why simply adding device ports to a Service and expecting Ingress to route them does not work.</p>
<h2 id="heading-initial-naive-architecture-what-did-not-work">Initial Naive Architecture (What Did Not Work)</h2>
<p>This represents the first attempt where Traccar was treated like a normal HTTP application.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767009425849/f7eaad29-2082-4aa4-adfd-baf59ec943f3.png" alt class="image--center mx-auto" /></p>
<p><strong>Why this failed</strong></p>
<ul>
<li><p>Ingress only understood HTTP</p>
</li>
<li><p>TCP packets from devices were silently dropped</p>
</li>
<li><p>No errors were obvious unless Ingress logs were inspected carefully</p>
</li>
</ul>
<p>This matters because <strong>nothing looked wrong from a Kubernetes resource perspective</strong>, yet device traffic never arrived.</p>
<hr />
<h2 id="heading-option-1-separate-loadbalancer-for-device-traffic">Option 1: Separate LoadBalancer for Device Traffic</h2>
<p>The simplest working solution was to create a separate Service of type LoadBalancer for device ports.</p>
<ul>
<li><p>One LoadBalancer for ports 5027, 5004, and others</p>
</li>
<li><p>Another Ingress for the web UI</p>
</li>
</ul>
<p>This worked immediately. Devices connected successfully and data started flowing.</p>
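<p>A sketch of that device-traffic Service (the namespace and selector labels are assumptions):</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Service
metadata:
  name: traccar-devices
  namespace: traccar
spec:
  type: LoadBalancer      # provisions a cloud load balancer that forwards raw TCP
  selector:
    app: traccar
  ports:
  - name: teltonika
    port: 5027
    protocol: TCP
  - name: osmand
    port: 5004
    protocol: TCP
</code></pre>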
<p>However, this approach had clear downsides.</p>
<ul>
<li><p>Every LoadBalancer costs money</p>
</li>
<li><p>Managing DNS and certificates becomes fragmented</p>
</li>
<li><p>Operational complexity increases with each additional protocol</p>
</li>
</ul>
<p>For a small setup this might be acceptable. For a production system with many protocols, it quickly becomes messy.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767009465530/ff76cdaa-ca0f-4b29-8f35-03c7a4166556.png" alt class="image--center mx-auto" /></p>
<p><strong>Why this worked</strong></p>
<ul>
<li><p>Kubernetes LoadBalancer services handle raw TCP natively</p>
</li>
<li><p>Devices connected immediately</p>
</li>
</ul>
<p><strong>Why this was not ideal</strong></p>
<ul>
<li><p>Multiple public IPs</p>
</li>
<li><p>Higher cost</p>
</li>
<li><p>DNS and certificate management became fragmented</p>
</li>
<li><p>Scaling to many protocols would multiply LoadBalancers</p>
</li>
</ul>
<p>This is important because it shows a <strong>valid stepping stone</strong>, not a mistake.</p>
<hr />
<h2 id="heading-option-2-nginx-ingress-with-tcp-services">Option 2: NGINX Ingress with TCP Services</h2>
<p>The more scalable approach was to use <strong>NGINX Ingress TCP services</strong>.</p>
<p>NGINX Ingress supports raw TCP forwarding through a ConfigMap. This feature is not enabled by default and requires explicit configuration.</p>
<h3 id="heading-how-tcp-services-work">How TCP Services Work</h3>
<p>Instead of using an Ingress resource, TCP ports are mapped directly in a ConfigMap.</p>
<p>Example:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ConfigMap</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">tcp-services</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">ingress-nginx</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-attr">"5027":</span> <span class="hljs-string">"traccar/traccar:5027"</span>
  <span class="hljs-attr">"5004":</span> <span class="hljs-string">"traccar/traccar:5004"</span>
</code></pre>
<p>This tells the NGINX Ingress controller:</p>
<ul>
<li><p>Listen on port 5027</p>
</li>
<li><p>Forward raw TCP traffic to the Traccar Service on port 5027</p>
</li>
</ul>
<p>The NGINX Ingress controller must also be started with flags enabling TCP services:</p>
<pre><code class="lang-yaml"><span class="hljs-string">--tcp-services-configmap=ingress-nginx/tcp-services</span>
</code></pre>
<p>Once this was configured correctly, device traffic started flowing without any separate LoadBalancer.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767009495353/0c1f2e8d-d2ba-4d78-a384-e442ee409e75.png" alt class="image--center mx-auto" /></p>
<p><strong>Key points this diagram communicates</strong></p>
<ul>
<li><p>One public entry point</p>
</li>
<li><p>HTTP and TCP are handled differently but coexist cleanly</p>
</li>
<li><p>No extra LoadBalancers</p>
</li>
<li><p>Devices and users share the same domain, different ports</p>
</li>
</ul>
<p>This is the centerpiece of the blog.</p>
<hr />
<h2 id="heading-single-loadbalancer-multiple-protocols">Single LoadBalancer, Multiple Protocols</h2>
<p>With TCP services enabled, the architecture became much cleaner.</p>
<ul>
<li><p>One NGINX Ingress LoadBalancer</p>
</li>
<li><p>HTTP traffic routed via Ingress rules</p>
</li>
<li><p>TCP traffic routed via ConfigMap</p>
</li>
<li><p>One public IP</p>
</li>
<li><p>One DNS domain</p>
</li>
</ul>
<p>Devices connected to the same IP or domain, simply using different ports.</p>
<p>This was the first setup that felt production-ready.</p>
<hr />
<h2 id="heading-dns-and-domain-based-device-connections">DNS and Domain-Based Device Connections</h2>
<p>Some devices, including Teltonika FMB125, support connecting to a domain name instead of an IP address.</p>
<p>This was important for flexibility.</p>
<p>A CNAME record was created pointing to the Ingress LoadBalancer DNS name.</p>
<p>Example:</p>
<pre><code class="lang-yaml"><span class="hljs-string">traccar.example.com</span> <span class="hljs-string">-&gt;</span> <span class="hljs-string">ingress-lb.amazonaws.com</span>
</code></pre>
<p>Devices were configured to connect to <a target="_blank" href="http://traccar.example.com:5027"><code>traccar.example.com:5027</code></a>.</p>
<p>This allowed infrastructure changes without touching device configurations, which is critical once devices are deployed in the field.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767009545563/604c8cf0-e6cd-470c-802e-2a4ac3aa1e34.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>Devices never depend on a fixed IP</p>
</li>
<li><p>Infrastructure can change underneath</p>
</li>
<li><p>Field devices remain untouched</p>
</li>
</ul>
<p>This is often overlooked but is critical in real deployments.</p>
<hr />
<h2 id="heading-scaling-traccar-pods-and-the-tcp-reality">Scaling Traccar Pods and the TCP Reality</h2>
<p>At this point, the next natural step was scaling.</p>
<p>Horizontal Pod Autoscaler was enabled based on CPU usage. Traccar pods scaled up as load increased.</p>
<p>This is where another subtle issue appeared.</p>
<h3 id="heading-tcp-connections-are-sticky">TCP Connections Are Sticky</h3>
<p>When a device connects over TCP, the connection is established to a specific pod through NGINX. That connection stays open for a long time.</p>
<p>If the pod restarts, the connection drops. Devices reconnect, but not always immediately.</p>
<p>If traffic is distributed across multiple pods, each pod holds its own set of device connections. This is not inherently bad, but it has consequences:</p>
<ul>
<li><p>Pod restarts cause device disconnects</p>
</li>
<li><p>Rolling updates must be carefully controlled</p>
</li>
<li><p>Aggressive autoscaling can harm connection stability</p>
</li>
</ul>
<p>For this reason, scaling Traccar is not as simple as scaling stateless HTTP services.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767009601660/d3de90b0-e859-4dbe-8620-4e46618ed9b8.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>TCP connections are sticky</p>
</li>
<li><p>Restarting a pod drops active devices</p>
</li>
<li><p>Autoscaling must be conservative</p>
</li>
<li><p>Rolling updates must avoid simultaneous pod restarts</p>
</li>
</ul>
<p>This visually explains why Traccar is not truly stateless.</p>
<h3 id="heading-lessons-on-scaling-strategy">Lessons on Scaling Strategy</h3>
<p>A few practical rules emerged.</p>
<ul>
<li><p>Keep a minimum number of replicas to avoid cold starts</p>
</li>
<li><p>Avoid frequent pod restarts</p>
</li>
<li><p>Use rolling updates with maxUnavailable set to zero</p>
</li>
<li><p>Scale based on memory and connection count, not only CPU</p>
</li>
</ul>
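<p>The rolling-update rule above translates into a Deployment strategy roughly like this:</p>
<pre><code class="lang-yaml">spec:
  replicas: 3               # keep a minimum replica count to avoid cold starts
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0     # never take an old pod down before its replacement is ready
      maxSurge: 1           # roll one pod at a time, limiting device disconnects
</code></pre>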
<p>In some cases, vertical scaling provided more stability than horizontal scaling.</p>
<hr />
<h2 id="heading-database-considerations">Database Considerations</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1767009629255/2d11926b-4b2c-4ae3-8923-9204880cd4b6.png" alt class="image--center mx-auto" /></p>
<p>Running the database inside the cluster was tested initially.</p>
<p>This was quickly abandoned.</p>
<p>Traccar is stateful. Device positions, events, and history must never be lost. Kubernetes pods are ephemeral by design.</p>
<p>Moving PostgreSQL to a managed service like RDS simplified operations significantly.</p>
<ul>
<li><p>Backups became reliable</p>
</li>
<li><p>Pod restarts no longer risked data integrity</p>
</li>
<li><p>Performance was more predictable</p>
</li>
</ul>
<p>This separation of concerns was one of the most important architectural decisions.</p>
<hr />
<h2 id="heading-observability-and-debugging-device-traffic">Observability and Debugging Device Traffic</h2>
<p>Debugging TCP traffic is harder than debugging HTTP.</p>
<p>A few practices helped significantly:</p>
<ul>
<li><p>Enable detailed Traccar protocol logs temporarily</p>
</li>
<li><p>Use <code>kubectl logs</code> with timestamps</p>
</li>
<li><p>Test device connections using netcat or protocol simulators</p>
</li>
<li><p>Monitor NGINX Ingress logs for connection errors</p>
</li>
</ul>
<p>Blindly assuming that devices are sending data is a common mistake. Always validate traffic at each layer.</p>
<hr />
<h2 id="heading-cicd-and-image-strategy">CI/CD and Image Strategy</h2>
<p>Building a custom Traccar image simplified deployment.</p>
<ul>
<li><p>Web UI built once</p>
</li>
<li><p>Backend and frontend shipped together</p>
</li>
<li><p>Image pushed to a registry</p>
</li>
<li><p>Kubernetes deployment updated with new tag</p>
</li>
</ul>
<p>This avoided runtime builds and reduced startup time.</p>
<p>Automated image tagging and controlled rollouts were critical to avoid accidental mass disconnects during updates.</p>
<hr />
<h2 id="heading-final-architecture-summary">Final Architecture Summary</h2>
<p>The final stable setup looked like this:</p>
<ul>
<li><p>Traccar runs as a Deployment in Kubernetes</p>
</li>
<li><p>PostgreSQL runs outside the cluster</p>
</li>
<li><p>NGINX Ingress exposes:</p>
<ul>
<li><p>HTTP via Ingress rules</p>
</li>
<li><p>TCP device ports via TCP services ConfigMap</p>
</li>
</ul>
</li>
<li><p>One LoadBalancer</p>
</li>
<li><p>Devices connect using a domain name</p>
</li>
<li><p>Scaling is conservative and connection-aware</p>
</li>
</ul>
<p>This architecture balanced Kubernetes flexibility with the realities of long-lived TCP connections.</p>
<hr />
<h2 id="heading-closing-thoughts">Closing Thoughts</h2>
<p>Traccar can run very well on Kubernetes, but only if it is treated as a mixed-protocol, semi-stateful system rather than a simple web application.</p>
<p>The biggest lesson was that Kubernetes abstractions are powerful, but they do not remove the need to understand how applications actually communicate.</p>
<p>If you respect Traccar’s networking model and design around it, Kubernetes becomes an advantage rather than a source of constant friction.</p>
<p>Thanks to <a target="_blank" href="http://mermaid.live">mermaid.live</a>, which was used to create these flow diagrams!</p>
]]></content:encoded></item><item><title><![CDATA[Why Your Kubernetes Cluster Works Fine Until Traffic Spikes]]></title><description><![CDATA[A Deep Dive into Resource Requests, Limits, and Real World Failures
A Kubernetes cluster often appears healthy during normal operation. Pods are running, dashboards are green, and alerts stay quiet. Then traffic spikes. Suddenly requests slow down, p...]]></description><link>https://blog.nyzex.in/why-your-kubernetes-cluster-works-fine-until-traffic-spikes</link><guid isPermaLink="true">https://blog.nyzex.in/why-your-kubernetes-cluster-works-fine-until-traffic-spikes</guid><category><![CDATA[Devops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[k8s]]></category><category><![CDATA[cluster]]></category><category><![CDATA[resources]]></category><category><![CDATA[YAML]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Mon, 15 Dec 2025 18:56:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765824912114/eb3a9446-0072-4b90-a2ba-cd6c5226ad1c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-a-deep-dive-into-resource-requests-limits-and-real-world-failures">A Deep Dive into Resource Requests, Limits, and Real World Failures</h2>
<p>A Kubernetes cluster often appears healthy during normal operation. Pods are running, dashboards are green, and alerts stay quiet. Then traffic spikes. Suddenly requests slow down, pods restart, and some workloads disappear entirely. This situation surprises many teams because nothing changed in the cluster configuration. The problem usually lies in how resource requests and limits were defined, or not defined at all.</p>
<p>This article explains why these failures happen, how Kubernetes actually uses CPU and memory settings, and how small configuration choices can prevent large production outages.</p>
<hr />
<h2 id="heading-the-false-comfort-of-a-quiet-cluster">The False Comfort of a Quiet Cluster</h2>
<p>A cluster that handles low traffic smoothly is not necessarily well configured. During calm periods, most applications consume only a fraction of their potential resources. Kubernetes does not enforce limits aggressively when there is no contention. This creates the illusion that default settings are sufficient.</p>
<p>When traffic increases, applications begin to compete for CPU and memory. At that point, Kubernetes must make decisions quickly. If resource definitions are unrealistic, the scheduler and the kubelet respond in ways that feel unpredictable.</p>
<hr />
<h2 id="heading-what-resource-requests-really-mean">What Resource Requests Really Mean</h2>
<p>A resource request is a promise. When a pod declares a CPU or memory request, it is telling Kubernetes how much it needs to function reliably. The scheduler uses this information to decide where the pod can run.</p>
<p>Consider a simple deployment:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">resources:</span>
  <span class="hljs-attr">requests:</span>
    <span class="hljs-attr">cpu:</span> <span class="hljs-string">"250m"</span>
    <span class="hljs-attr">memory:</span> <span class="hljs-string">"256Mi"</span>
</code></pre>
<p>This configuration means Kubernetes will place the pod only on a node that has at least 250 millicores of CPU and 256 MiB of memory available. Even if the pod uses much less most of the time, the scheduler reserves this capacity.</p>
<p>If requests are set too low, Kubernetes may pack too many pods onto the same node. Everything works until load increases. At that point, the node becomes overwhelmed.</p>
<hr />
<h2 id="heading-what-limits-actually-do">What Limits Actually Do</h2>
<p>Limits define the maximum resources a container can use.</p>
<p>For CPU, exceeding the limit causes throttling. The application does not crash but becomes slower. Latency increases and timeouts appear.</p>
<p>For memory, exceeding the limit results in an immediate termination. The container is killed with an Out Of Memory error and restarted if a restart policy exists.</p>
<p>Example:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">resources:</span>
  <span class="hljs-attr">limits:</span>
    <span class="hljs-attr">cpu:</span> <span class="hljs-string">"500m"</span>
    <span class="hljs-attr">memory:</span> <span class="hljs-string">"512Mi"</span>
</code></pre>
<p>If the application suddenly needs more memory during a traffic spike, Kubernetes will not negotiate. The container is terminated.</p>
<hr />
<h2 id="heading-a-common-real-world-failure-scenario">A Common Real World Failure Scenario</h2>
<p>Imagine an API service running three replicas on a small cluster. Each pod has low requests and tight memory limits.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">resources:</span>
  <span class="hljs-attr">requests:</span>
    <span class="hljs-attr">cpu:</span> <span class="hljs-string">"100m"</span>
    <span class="hljs-attr">memory:</span> <span class="hljs-string">"128Mi"</span>
  <span class="hljs-attr">limits:</span>
    <span class="hljs-attr">memory:</span> <span class="hljs-string">"256Mi"</span>
</code></pre>
<p>During normal usage, memory consumption stays below 150 MiB. Everything looks fine.</p>
<p>Now a traffic spike occurs. More requests mean more objects in memory, larger request payloads, and more concurrent goroutines or threads. Memory usage climbs past 256 MiB. Kubernetes kills the container. The pod restarts and immediately receives traffic again. The cycle repeats.</p>
<p>From the outside, this looks like instability or a Kubernetes bug. In reality, the limits were never aligned with real application behavior.</p>
<hr />
<h2 id="heading-why-pod-evictions-appear-during-traffic-spikes">Why Pod Evictions Appear During Traffic Spikes</h2>
<p>Even if limits are not reached, nodes can still come under pressure. When total memory usage on a node approaches capacity, Kubernetes starts evicting pods.</p>
<p>Pods with lower priority and lower memory requests are evicted first. If requests are unrealistically small, critical workloads may be removed before less important ones.</p>
<p>This is why teams sometimes see monitoring agents or tracing systems disappear under load, even though the main application survives.</p>
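<p>One mitigation is to give critical workloads an explicit priority so they are evicted last. A sketch (the class name and value are illustrative):</p>
<pre><code class="lang-yaml">apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-api
value: 100000            # higher value = evicted later
globalDefault: false
description: "Critical API workloads, evicted after best-effort pods"
</code></pre>
<p>Pods opt in by setting <code>priorityClassName: critical-api</code> in their spec.</p>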
<hr />
<h2 id="heading-the-hidden-cost-of-copy-pasted-values">The Hidden Cost of Copy Pasted Values</h2>
<p>Many Helm charts ship with conservative defaults. These values are designed to work everywhere, not to reflect your workload.</p>
<p>Copying these defaults into production without measurement leads to two common problems:</p>
<ul>
<li><p>Requests are too low, causing over scheduling and node pressure</p>
</li>
<li><p>Limits are too tight, causing restarts under load</p>
</li>
</ul>
<p>The cluster appears cost efficient but becomes fragile.</p>
<hr />
<h2 id="heading-observing-the-problem-properly">Observing the Problem Properly</h2>
<p>Kubernetes provides enough signals to understand what is happening if they are used correctly.</p>
<p><code>kubectl top pod</code> shows real CPU and memory usage over time.<br /><code>kubectl describe pod</code> reveals throttling, OOMKills, and eviction reasons.<br />Node metrics show whether failures are isolated or systemic.</p>
<p>The key is to observe during peak traffic, not during quiet periods.</p>
<hr />
<h2 id="heading-setting-requests-based-on-reality">Setting Requests Based on Reality</h2>
<p>A practical approach is to observe average usage under moderate load and add a safety margin.</p>
<p>If a service consistently uses 300 MiB of memory during busy periods, setting a request around 350 MiB and a limit around 600 MiB gives Kubernetes room to operate.</p>
<p>Example:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">resources:</span>
  <span class="hljs-attr">requests:</span>
    <span class="hljs-attr">cpu:</span> <span class="hljs-string">"300m"</span>
    <span class="hljs-attr">memory:</span> <span class="hljs-string">"350Mi"</span>
  <span class="hljs-attr">limits:</span>
    <span class="hljs-attr">memory:</span> <span class="hljs-string">"600Mi"</span>
</code></pre>
<p>This configuration allows bursts without immediate termination and helps the scheduler place pods intelligently.</p>
<hr />
<h2 id="heading-why-autoscaling-alone-does-not-save-you">Why Autoscaling Alone Does Not Save You</h2>
<p>Horizontal Pod Autoscalers react to metrics. They do not prevent individual pods from being killed. If pods are crashing due to memory limits, scaling replicas only increases the number of failing pods.</p>
<p>Autoscaling works best when each pod is stable under load and requests reflect real needs.</p>
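<p>For reference, a minimal HPA manifest looks like the sketch below; the deployment name and target utilization are illustrative, and the autoscaler only behaves well once the pods it scales have honest requests:</p>

```yaml
# Illustrative HPA: scales a hypothetical "web-api" deployment on CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percentage of the pod's CPU *request*
```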
<hr />
<h2 id="heading-a-stable-cluster-is-an-honest-cluster">A Stable Cluster Is an Honest Cluster</h2>
<p>Kubernetes does exactly what it is told. If resource definitions are optimistic, the cluster becomes optimistic too. Traffic spikes expose this optimism brutally.</p>
<p>Clusters that survive real world traffic are not over provisioned. They are well measured, honestly configured, and continuously observed.</p>
<p>The difference between a calm cluster and a resilient one is rarely the number of nodes. It is almost always the quality of resource requests and limits.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding Pod Evictions: Why Kubernetes Removes Your Pods And How To Prevent It]]></title><description><![CDATA[Running applications on Kubernetes usually feels smooth and stable until one day a pod suddenly disappears. Kubernetes calls this process eviction, and it happens when the cluster decides that removing a pod is the safest way to protect the node or t...]]></description><link>https://blog.nyzex.in/understanding-pod-evictions-why-kubernetes-removes-your-pods-and-how-to-prevent-it</link><guid isPermaLink="true">https://blog.nyzex.in/understanding-pod-evictions-why-kubernetes-removes-your-pods-and-how-to-prevent-it</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Devops]]></category><category><![CDATA[monitoring]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[k8s]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Wed, 10 Dec 2025 19:31:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765395054165/c02244a1-2bb2-4248-9f77-884e744d6aa2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Running applications on Kubernetes usually feels smooth and stable until one day a pod suddenly disappears. Kubernetes calls this process <strong>eviction</strong>, and it happens when the cluster decides that removing a pod is the safest way to protect the node or the rest of the workload. Pod evictions can feel mysterious at first, especially when they occur during heavy traffic or during important workloads. Fortunately, Kubernetes eviction patterns are predictable once you understand what triggers them and how to prevent them.</p>
<p>This guide explains why pod evictions happen, how to identify the root cause, and the practical steps you can follow to keep your pods running reliably. Everything is supported with real Kubernetes commands, YAML examples, and explainers that make the entire topic easy to apply in your own cluster.</p>
<hr />
<h2 id="heading-what-exactly-is-a-pod-eviction"><strong>What Exactly Is A Pod Eviction</strong></h2>
<p>A pod eviction happens when Kubernetes removes a pod from a node because the node is under pressure. For node-pressure evictions this decision is made by the kubelet on the affected node. This is not the same as a pod crashing. Eviction is a deliberate decision taken by Kubernetes to protect node stability.</p>
<p>When an eviction happens, you will usually see events such as:</p>
<pre><code class="lang-text">The node had disk pressure
The node had memory pressure
The node had PID pressure
The node was unreachable
Evicting Pod due to node condition
</code></pre>
<p>Pods that are part of a deployment will be recreated on another node, but single node clusters or clusters with insufficient resources often suffer downtime as a result.</p>
<hr />
<h2 id="heading-the-main-reasons-why-pods-get-evicted"><strong>The Main Reasons Why Pods Get Evicted</strong></h2>
<p>Although Kubernetes can report several types of pressure, these three are the most common causes.</p>
<h3 id="heading-memory-pressure"><strong>Memory Pressure</strong></h3>
<p>This is the most frequent reason for evictions. When a node runs out of memory, Kubernetes starts removing pods that exceed their memory requests or pods with lower priority.</p>
<p>Common signs:</p>
<ul>
<li><p>Eviction messages mentioning memory pressure</p>
</li>
<li><p>OOMKilled events</p>
</li>
<li><p>Node metrics showing high memory usage</p>
</li>
</ul>
<h3 id="heading-disk-pressure"><strong>Disk Pressure</strong></h3>
<p>This happens when the node runs low on either free disk space or free inodes. Logging, large ephemeral storage usage, container images, and runaway volumes often cause this.</p>
<p>Signs include:</p>
<ul>
<li><p>Events mentioning disk pressure</p>
</li>
<li><p>Eviction messages citing low disk availability</p>
</li>
</ul>
<h3 id="heading-pid-pressure"><strong>PID Pressure</strong></h3>
<p>This occurs when a node runs out of process identifiers. Too many running processes or certain badly designed sidecars can trigger this.</p>
<hr />
<h2 id="heading-how-to-inspect-pod-evictions-in-your-cluster"><strong>How To Inspect Pod Evictions In Your Cluster</strong></h2>
<h3 id="heading-checking-events"><strong>Checking Events</strong></h3>
<p>Events provide the first clue. Run:</p>
<pre><code class="lang-bash">kubectl get events --sort-by=.lastTimestamp
</code></pre>
<p>Evicted pods usually report messages like:</p>
<pre><code class="lang-bash">Evicted Pod         The node had memory pressure
Evicted Pod         The node had disk pressure
</code></pre>
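<p>To surface only the evicted pods, the listing can be filtered on the status column; the sketch below runs the filter against invented <code>kubectl get pods -A</code> output:</p>

```shell
# Stand-in for `kubectl get pods -A --no-headers` output; in a real cluster,
# pipe the command itself. Column 4 holds the pod status.
pods='default      api-7f9c4      0/1   Evicted   0   5m
kube-system  coredns-6d4b7  1/1   Running   0   2d'

# Print namespace/name for every evicted pod.
evicted=$(echo "$pods" | awk '$4 == "Evicted" {print $1 "/" $2}')
echo "$evicted"
```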
<h3 id="heading-describing-the-pod"><strong>Describing The Pod</strong></h3>
<p>Even after eviction, the history remains:</p>
<pre><code class="lang-bash">kubectl describe pod &lt;pod-name&gt; -n &lt;namespace&gt;
</code></pre>
<p>Look for the “Status” and “Last State” sections. They usually contain a reason such as:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">Reason:</span> <span class="hljs-string">Evicted</span>
<span class="hljs-attr">Message:</span> <span class="hljs-string">The</span> <span class="hljs-string">node</span> <span class="hljs-string">had</span> <span class="hljs-string">memory</span> <span class="hljs-string">pressure</span>
</code></pre>
<h3 id="heading-checking-node-conditions"><strong>Checking Node Conditions</strong></h3>
<p>Nodes show their pressure state clearly:</p>
<pre><code class="lang-bash">kubectl describe node &lt;node-name&gt;
</code></pre>
<p>You might see conditions like:</p>
<pre><code class="lang-yaml"><span class="hljs-string">MemoryPressure</span>   <span class="hljs-literal">True</span>
<span class="hljs-string">DiskPressure</span>     <span class="hljs-literal">True</span>
<span class="hljs-string">PIDPressure</span>      <span class="hljs-literal">False</span>
</code></pre>
<hr />
<h2 id="heading-how-kubernetes-decides-which-pods-to-evict"><strong>How Kubernetes Decides Which Pods To Evict</strong></h2>
<p>Kubernetes follows a predictable order when choosing which pods to evict.</p>
<h3 id="heading-pod-priority"><strong>Pod Priority</strong></h3>
<p>Pods with higher priority are protected. Lower priority pods face eviction first.</p>
<p>Example of a priority class:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">scheduling.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">PriorityClass</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">critical-service</span>
<span class="hljs-attr">value:</span> <span class="hljs-number">100000</span>
<span class="hljs-attr">globalDefault:</span> <span class="hljs-literal">false</span>
<span class="hljs-attr">description:</span> <span class="hljs-string">"Critical system components"</span>
</code></pre>
<p>Use it in a pod or deployment:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">spec:</span>
  <span class="hljs-attr">priorityClassName:</span> <span class="hljs-string">critical-service</span>
</code></pre>
<h3 id="heading-resource-requests"><strong>Resource Requests</strong></h3>
<p>Pods that request low memory but actually use much more are frequent eviction targets. Kubernetes tries to preserve pods that are within their declared requests.</p>
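<p>One related lever: how requests relate to limits determines the pod's QoS class, which feeds into eviction order. A hedged sketch:</p>

```yaml
# Requests equal to limits on every container give the pod the Guaranteed
# QoS class, which the kubelet evicts last under node pressure.
# Values are illustrative.
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "500m"
    memory: "1Gi"
```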
<hr />
<h2 id="heading-how-to-prevent-pod-evictions"><strong>How To Prevent Pod Evictions</strong></h2>
<h3 id="heading-set-proper-resource-requests-and-limits"><strong>Set Proper Resource Requests And Limits</strong></h3>
<p>One of the strongest defenses against eviction is correct sizing. A pod that consumes far more memory than it requests will be evicted during memory pressure.</p>
<p>Example:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">resources:</span>
  <span class="hljs-attr">requests:</span>
    <span class="hljs-attr">cpu:</span> <span class="hljs-string">"200m"</span>
    <span class="hljs-attr">memory:</span> <span class="hljs-string">"512Mi"</span>
  <span class="hljs-attr">limits:</span>
    <span class="hljs-attr">cpu:</span> <span class="hljs-string">"500m"</span>
    <span class="hljs-attr">memory:</span> <span class="hljs-string">"1Gi"</span>
</code></pre>
<p>Requests tell Kubernetes how much the pod needs to run reliably. If these numbers are too low, Kubernetes will treat the pod as a low priority candidate for eviction.</p>
<h3 id="heading-enable-limit-ranges-and-resource-quotas"><strong>Enable Limit Ranges And Resource Quotas</strong></h3>
<p>In shared namespaces, this prevents runaway pods from disrupting others:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">LimitRange</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">default-limits</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">limits:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">type:</span> <span class="hljs-string">Container</span>
    <span class="hljs-attr">defaultRequest:</span>
      <span class="hljs-attr">memory:</span> <span class="hljs-string">256Mi</span>
      <span class="hljs-attr">cpu:</span> <span class="hljs-string">100m</span>
</code></pre>
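<p>The heading also mentions resource quotas; a companion sketch (name and ceilings are illustrative) caps what an entire namespace can request in total:</p>

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
spec:
  hard:
    requests.cpu: "4"        # total CPU requests across the namespace
    requests.memory: 8Gi     # total memory requests across the namespace
    limits.memory: 16Gi      # total memory limits across the namespace
```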
<h3 id="heading-use-pod-priority-classes"><strong>Use Pod Priority Classes</strong></h3>
<p>Critical workloads deserve higher protection, especially in clusters with limited nodes.</p>
<h3 id="heading-avoid-overcommitting-the-node"><strong>Avoid Overcommitting The Node</strong></h3>
<p>Overcommitting CPU is usually safe because containers are merely throttled, but overcommitting memory is dangerous. Pods will be removed during memory pressure, even if your deployment expects them to stay alive.</p>
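<p>A back-of-envelope check makes the point concrete; the numbers below are invented for illustration:</p>

```shell
# Suppose a node exposes roughly 7000 MiB allocatable after system
# reservations, and six pods each request 1 GiB of memory.
node_allocatable_mi=7000
sum_requests_mi=$((6 * 1024))   # 6144 MiB

# Requests fit, so the scheduler places all six pods. Trouble starts only
# if actual usage climbs past the requests toward the node's capacity.
if [ "$sum_requests_mi" -le "$node_allocatable_mi" ]; then
  verdict="fits"
else
  verdict="overcommitted"
fi
echo "$verdict"
```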
<h3 id="heading-manage-disk-usage-properly"><strong>Manage Disk Usage Properly</strong></h3>
<p>Disk pressure is often caused by:</p>
<ul>
<li><p>Large logging output</p>
</li>
<li><p>Growing emptyDir volumes</p>
</li>
<li><p>Too many container images on the node</p>
</li>
</ul>
<p>Use log rotation or a logging agent with proper configuration.</p>
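<p>One pod-level mitigation is to cap ephemeral usage so a runaway container hits its own limit before the node hits disk pressure; all names and sizes here are illustrative:</p>

```yaml
spec:
  containers:
  - name: app
    image: my-app:latest            # illustrative image name
    resources:
      limits:
        ephemeral-storage: "2Gi"    # this pod alone is evicted if exceeded
    volumeMounts:
    - name: scratch
      mountPath: /tmp/scratch
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 1Gi                # caps the emptyDir before the disk fills
```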
<h3 id="heading-use-node-autoscaling-when-possible"><strong>Use Node Autoscaling When Possible</strong></h3>
<p>In managed Kubernetes like AKS or EKS, enabling cluster autoscaling significantly reduces unwanted evictions because new nodes appear when pressure increases.</p>
<hr />
<h2 id="heading-a-real-example-diagnosing-and-fixing-a-memory-eviction"><strong>A Real Example: Diagnosing And Fixing A Memory Eviction</strong></h2>
<p>Imagine a cluster using a single Standard D3 class VM. Under heavy traffic, one of your application pods disappears. The event log shows:</p>
<pre><code class="lang-text">Evicted: The node had memory pressure
</code></pre>
<p>Next steps:</p>
<h3 id="heading-step-1-check-node-memory"><strong>Step 1: Check Node Memory</strong></h3>
<pre><code class="lang-bash">kubectl describe node &lt;node-name&gt;
</code></pre>
<p>Output:</p>
<pre><code class="lang-yaml"><span class="hljs-string">MemoryPressure</span>   <span class="hljs-literal">True</span>
</code></pre>
<h3 id="heading-step-2-compare-consumption-with-requests"><strong>Step 2: Compare Consumption With Requests</strong></h3>
<p>You notice the pod requests only 128Mi but uses over 800Mi during traffic spikes. The node cannot allocate enough memory, so Kubernetes removes the pod.</p>
<h3 id="heading-step-3-fix-the-deployment"><strong>Step 3: Fix The Deployment</strong></h3>
<p>Update the deployment with realistic memory requests.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">resources:</span>
  <span class="hljs-attr">requests:</span>
    <span class="hljs-attr">memory:</span> <span class="hljs-string">"512Mi"</span>
  <span class="hljs-attr">limits:</span>
    <span class="hljs-attr">memory:</span> <span class="hljs-string">"1Gi"</span>
</code></pre>
<h3 id="heading-step-4-add-a-priorityclass-if-necessary"><strong>Step 4: Add A PriorityClass If Necessary</strong></h3>
<p>Critical services can be protected with a higher priority.</p>
<h3 id="heading-step-5-monitor"><strong>Step 5: Monitor</strong></h3>
<p>Use Prometheus, Grafana, or Kubelet metrics to observe memory growth and make adjustments.</p>
<p>If you want to see why monitoring (and a minimal Kubernetes Loki setup) is so important, read my previous blog:<br /><a target="_blank" href="https://blog.nyzex.in/why-observability-is-the-unsung-hero-in-modern-cloud-applications">https://blog.nyzex.in/why-observability-is-the-unsung-hero-in-modern-cloud-applications</a></p>
<p>After this fix, the pod no longer gets evicted during traffic spikes.</p>
<hr />
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Pod evictions are not random events. They are Kubernetes actively protecting your cluster from resource exhaustion. Once you understand the signals that trigger these evictions and how Kubernetes chooses which pods to remove, you can build workloads that are far more resilient.</p>
<p>By setting correct resource requests, monitoring node pressure, controlling logging and ephemeral storage, tuning pod priorities, and sizing nodes properly, you ensure that your critical applications continue running without interruption.</p>
]]></content:encoded></item><item><title><![CDATA[Why Observability is the Unsung Hero in Modern Cloud Applications]]></title><description><![CDATA[Modern cloud applications are incredibly powerful, but with great power comes great complexity. Applications today often consist of multiple microservices, databases, and third-party integrations running across distributed environments. This makes it...]]></description><link>https://blog.nyzex.in/why-observability-is-the-unsung-hero-in-modern-cloud-applications</link><guid isPermaLink="true">https://blog.nyzex.in/why-observability-is-the-unsung-hero-in-modern-cloud-applications</guid><category><![CDATA[monitoring]]></category><category><![CDATA[Devops]]></category><category><![CDATA[observability]]></category><category><![CDATA[Grafana]]></category><category><![CDATA[#prometheus]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Mon, 08 Dec 2025 19:27:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765222043109/ea59b623-1d3f-4e60-adaf-8edc71d3b4e4.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Modern cloud applications are incredibly powerful, but with great power comes great complexity. Applications today often consist of multiple microservices, databases, and third-party integrations running across distributed environments. This makes it challenging for developers and operations teams to understand what is happening inside the system, especially when something goes wrong. Traditional monitoring can only take you so far. This is where <strong>observability</strong> becomes essential.</p>
<p>Observability is the practice of designing systems in such a way that their internal state can be inferred from external outputs. It provides actionable insights that help engineers understand, diagnose, and optimize system performance in real time.</p>
<hr />
<h2 id="heading-understanding-observability">Understanding Observability</h2>
<p>Before diving into tools, it is important to distinguish <strong>monitoring</strong>, <strong>logging</strong>, and <strong>observability</strong>:</p>
<ul>
<li><p><strong>Monitoring</strong> collects predefined metrics such as CPU usage, memory usage, or request latency. Alerts notify teams when thresholds are crossed.</p>
</li>
<li><p><strong>Logging</strong> captures events and errors, providing detailed records of system behavior for debugging.</p>
</li>
<li><p><strong>Observability</strong> goes further. It combines metrics, logs, and traces to give engineers a complete understanding of system behavior. It allows you to ask “why” something happened, not just “what” happened.</p>
</li>
</ul>
<p>Observability enables answers to questions like:</p>
<ul>
<li><p>Why did a specific request take unusually long to process?</p>
</li>
<li><p>Which microservice caused a failure in a transaction?</p>
</li>
<li><p>How does a change in one component affect others across the system?</p>
</li>
</ul>
<hr />
<h2 id="heading-real-world-example-an-e-commerce-platform">Real-World Example: An E-Commerce Platform</h2>
<p>Consider an e-commerce platform built using microservices for inventory, payment, and shipping. During a holiday sale, the checkout process slows down dramatically. Without observability, the engineering team might only see high CPU usage in one service but not understand the root cause.</p>
<p>By implementing observability, the team can:</p>
<ul>
<li><p><strong>Trace requests</strong> from the frontend through the payment, inventory, and shipping services.</p>
</li>
<li><p><strong>Analyze logs</strong> to identify error spikes in payment validation.</p>
</li>
<li><p><strong>Inspect metrics</strong> to find latency bottlenecks in database queries.</p>
</li>
</ul>
<p>In this example, observability allows the team to pinpoint the root cause: a slow payment gateway integration rather than blindly optimizing unrelated services.</p>
<hr />
<h2 id="heading-observability-tools-and-how-to-use-them">Observability Tools and How to Use Them</h2>
<p>Here are some widely used tools for observability, along with installation and basic usage examples.</p>
<h3 id="heading-1-jaeger-distributed-tracing">1. Jaeger (Distributed Tracing)</h3>
<p>Jaeger helps track requests as they move through microservices.</p>
<p><strong>Installation (local setup with Docker):</strong></p>
<pre><code class="lang-bash">docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 14250:14250 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.41
</code></pre>
<p><strong>Usage:</strong></p>
<ul>
<li><p>Access the Jaeger UI at <a target="_blank" href="http://localhost:16686"><code>http://localhost:16686</code></a>.</p>
</li>
<li><p>Send traces from your application using OpenTelemetry or Jaeger client libraries.</p>
</li>
<li><p>Explore request paths to identify latency and bottlenecks.</p>
</li>
</ul>
<hr />
<h3 id="heading-2-prometheus-metrics-collection">2. Prometheus (Metrics Collection)</h3>
<p>Prometheus collects and stores metrics from your application.</p>
<p><strong>Installation (local setup with Docker):</strong></p>
<pre><code class="lang-bash">docker run -d --name prometheus \
  -p 9090:9090 \
  -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
</code></pre>
<p><strong>Example prometheus.yml:</strong></p>
<pre><code class="lang-yaml">global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['host.docker.internal:8000']
</code></pre>
<p><strong>Usage:</strong></p>
<ul>
<li><p>Access Prometheus at <a target="_blank" href="http://localhost:9090"><code>http://localhost:9090</code></a>.</p>
</li>
<li><p>Query metrics such as request duration or error rate.</p>
</li>
<li><p>Combine with Grafana to create dashboards.</p>
</li>
</ul>
<hr />
<h3 id="heading-3-grafana-visualization">3. Grafana (Visualization)</h3>
<p>Grafana provides rich dashboards for visualizing metrics and logs.</p>
<p><strong>Installation (local setup with Docker):</strong></p>
<pre><code class="lang-bash">docker run -d -p 3000:3000 --name=grafana grafana/grafana
</code></pre>
<p><strong>Usage:</strong></p>
<ul>
<li><p>Access Grafana at <a target="_blank" href="http://localhost:3000"><code>http://localhost:3000</code></a>.</p>
</li>
<li><p>Connect Prometheus as a data source.</p>
</li>
<li><p>Build dashboards to monitor service performance and visualize latency, error rates, and request throughput.</p>
</li>
</ul>
<p>When working with Kubernetes, I tend to use the loki-stack Helm chart, as it comes with everything built in!</p>
<pre><code class="lang-yaml">
<span class="hljs-attr">loki:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">isDefault:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">url:</span> <span class="hljs-string">http://{{(include</span> <span class="hljs-string">"loki.serviceName"</span> <span class="hljs-string">.)}}:{{</span> <span class="hljs-string">.Values.loki.service.port</span> <span class="hljs-string">}}</span>
  <span class="hljs-attr">readinessProbe:</span>
    <span class="hljs-attr">httpGet:</span>
      <span class="hljs-attr">path:</span> <span class="hljs-string">/ready</span>
      <span class="hljs-attr">port:</span> <span class="hljs-string">http-metrics</span>
    <span class="hljs-attr">initialDelaySeconds:</span> <span class="hljs-number">45</span>
  <span class="hljs-attr">livenessProbe:</span>
    <span class="hljs-attr">httpGet:</span>
      <span class="hljs-attr">path:</span> <span class="hljs-string">/ready</span>
      <span class="hljs-attr">port:</span> <span class="hljs-string">http-metrics</span>
    <span class="hljs-attr">initialDelaySeconds:</span> <span class="hljs-number">45</span>
  <span class="hljs-attr">datasource:</span>
    <span class="hljs-attr">jsonData:</span> <span class="hljs-string">"{}"</span>
    <span class="hljs-attr">uid:</span> <span class="hljs-string">""</span>


<span class="hljs-attr">promtail:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">config:</span>
    <span class="hljs-attr">logLevel:</span> <span class="hljs-string">info</span>
    <span class="hljs-attr">serverPort:</span> <span class="hljs-number">3101</span>
    <span class="hljs-attr">clients:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">url:</span> <span class="hljs-string">http://{{</span> <span class="hljs-string">.Release.Name</span> <span class="hljs-string">}}:3100/loki/api/v1/push</span>


<span class="hljs-attr">grafana:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">adminUser:</span> <span class="hljs-string">"admin"</span>
  <span class="hljs-attr">adminPassword:</span> <span class="hljs-string">"devops123"</span>
  <span class="hljs-attr">image:</span>
    <span class="hljs-comment">#tag: 10.3.3</span>
    <span class="hljs-attr">tag:</span> <span class="hljs-number">11.4</span><span class="hljs-number">.0</span>
  <span class="hljs-attr">datasources:</span>
    <span class="hljs-attr">datasources.yaml:</span>
      <span class="hljs-attr">apiVersion:</span> <span class="hljs-number">1</span>
      <span class="hljs-attr">datasources:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Prometheus</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">prometheus</span>
          <span class="hljs-attr">access:</span> <span class="hljs-string">proxy</span>
          <span class="hljs-attr">url:</span> <span class="hljs-string">http://{{</span> <span class="hljs-string">include</span> <span class="hljs-string">"prometheus.fullname"</span> <span class="hljs-string">.</span> <span class="hljs-string">}}:9090</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Loki</span>
          <span class="hljs-attr">type:</span> <span class="hljs-string">loki</span>
          <span class="hljs-attr">access:</span> <span class="hljs-string">proxy</span>
          <span class="hljs-attr">url:</span> <span class="hljs-string">http://{{</span> <span class="hljs-string">include</span> <span class="hljs-string">"loki.fullname"</span> <span class="hljs-string">.</span> <span class="hljs-string">}}:3100</span>



<span class="hljs-attr">prometheus:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">isDefault:</span> <span class="hljs-literal">false</span>
  <span class="hljs-attr">server:</span>
    <span class="hljs-attr">service:</span>
      <span class="hljs-attr">servicePort:</span> <span class="hljs-number">9090</span>
  <span class="hljs-attr">url:</span> <span class="hljs-string">http://{{</span> <span class="hljs-string">include</span> <span class="hljs-string">"prometheus.fullname"</span> <span class="hljs-string">.}}:{{</span> <span class="hljs-string">.Values.prometheus.server.service.servicePort</span> <span class="hljs-string">}}{{</span> <span class="hljs-string">.Values.prometheus.server.prefixURL</span> <span class="hljs-string">}}</span>
  <span class="hljs-attr">datasource:</span>
    <span class="hljs-attr">jsonData:</span> <span class="hljs-string">"{}"</span>


<span class="hljs-attr">filebeat:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>


<span class="hljs-attr">logstash:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>


<span class="hljs-attr">fluent-bit:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">false</span>
</code></pre>
<p>I simply apply it with:</p>
<pre><code class="lang-bash">helm repo add grafana https://grafana.github.io/helm-charts  
helm repo update

helm upgrade --install loki-stack grafana/loki-stack -n loki -f loki_stack_values.yaml
</code></pre>
<p>To check:</p>
<pre><code class="lang-bash">nyzex@nyzex-systems % kubectl get pods -n loki                                                      
NAME                                                 READY   STATUS    RESTARTS   AGE
loki-stack-0                                         1/1     Running   0          20m
loki-stack-alertmanager-0                            1/1     Running   0          20m
loki-stack-grafana-7d4fdcd58c-cs8fk                  2/2     Running   0          20m
loki-stack-kube-state-metrics-fb7f548d6-jg2cq        1/1     Running   0          20m
loki-stack-prometheus-node-exporter-cg57k            1/1     Running   0          20m
loki-stack-prometheus-pushgateway-5649b6944b-9k9fj   1/1     Running   0          20m
loki-stack-prometheus-server-5c8c8f584d-6chxx        2/2     Running   0          20m
loki-stack-promtail-blwfh                            1/1     Running   0          20m

nyzex@nyzex-systems % kubectl get svc -n loki 
NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
loki-stack                            ClusterIP   10.43.174.77    &lt;none&gt;        3100/TCP   21m
loki-stack-alertmanager               ClusterIP   10.43.32.102    &lt;none&gt;        9093/TCP   21m
loki-stack-alertmanager-headless      ClusterIP   None            &lt;none&gt;        9093/TCP   21m
loki-stack-grafana                    ClusterIP   10.43.16.157    &lt;none&gt;        80/TCP     21m
loki-stack-headless                   ClusterIP   None            &lt;none&gt;        3100/TCP   21m
loki-stack-kube-state-metrics         ClusterIP   10.43.119.248   &lt;none&gt;        8080/TCP   21m
loki-stack-memberlist                 ClusterIP   None            &lt;none&gt;        7946/TCP   21m
loki-stack-prometheus-node-exporter   ClusterIP   10.43.20.145    &lt;none&gt;        9100/TCP   21m
loki-stack-prometheus-pushgateway     ClusterIP   10.43.113.177   &lt;none&gt;        9091/TCP   21m
loki-stack-prometheus-server          ClusterIP   10.43.201.93    &lt;none&gt;        9090/TCP   21m
</code></pre>
<p>Then we can just add an Ingress in front of Grafana and we are good to go!</p>
<p><strong>Note:</strong> the loki-stack chart has been deprecated as of December 2025:</p>
<p><a target="_blank" href="https://artifacthub.io/packages/helm/grafana/loki-stack">https://artifacthub.io/packages/helm/grafana/loki-stack</a></p>
<hr />
<h3 id="heading-putting-it-all-together">Putting It All Together</h3>
<p>By combining these tools:</p>
<ul>
<li><p>Prometheus provides metrics and system health data.</p>
</li>
<li><p>Jaeger provides request traces across microservices.</p>
</li>
<li><p>Grafana visualizes both metrics and traces.</p>
</li>
</ul>
<p>For example, a slow API call can be traced using Jaeger, the request load can be seen in Prometheus, and a Grafana dashboard can provide a real-time view of system performance. This makes identifying and resolving issues faster and more reliable.</p>
<hr />
<h2 id="heading-practical-tips-for-observability">Practical Tips for Observability</h2>
<ol>
<li><p><strong>Start Small</strong>: Focus on critical services and gradually expand coverage.</p>
</li>
<li><p><strong>Instrument Key Components</strong>: Ensure metrics, logs, and traces are collected for all important paths.</p>
</li>
<li><p><strong>Automate Alerts</strong>: Configure alerts in Prometheus or Grafana to proactively detect anomalies.</p>
</li>
<li><p><strong>Review and Iterate</strong>: Observability is an ongoing process. Learn from incidents and refine instrumentation.</p>
</li>
</ol>
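<p>As a small illustration of the alerting tip above, a Prometheus alerting rule can look like the following sketch (the metric names, threshold, and labels here are hypothetical, not taken from any specific setup):</p>
<pre><code class="lang-yaml">groups:
  - name: service-health
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of requests fail over 5 minutes (hypothetical metrics)
        expr: rate(http_requests_errors_total[5m]) / rate(http_requests_total[5m]) &gt; 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 5% for {{ $labels.job }}"
</code></pre>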
<hr />
<h2 id="heading-conclusion">Conclusion</h2>
<p>Observability is no longer optional for modern cloud applications. While monitoring and logging provide snapshots of system behavior, observability delivers a deep understanding of how systems operate and interact. By implementing tools like Jaeger, Prometheus, and Grafana, teams can diagnose problems faster, optimize performance, and ensure a reliable user experience. Observability is the unsung hero that allows engineers to manage complexity effectively, transforming chaos into clarity.</p>
]]></content:encoded></item><item><title><![CDATA[How I Automated GitHub Repository Deployments Into K3s Using SSH and Helm]]></title><description><![CDATA[Building a smooth deployment pipeline for Kubernetes is usually a challenge for small teams and individual developers. Many solutions require complex CI pipelines, container registries, secret management and multiple integration steps. My goal was to...]]></description><link>https://blog.nyzex.in/how-i-automated-github-repository-deployments-into-k3s-using-ssh-and-helm</link><guid isPermaLink="true">https://blog.nyzex.in/how-i-automated-github-repository-deployments-into-k3s-using-ssh-and-helm</guid><category><![CDATA[Devops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[k3s]]></category><category><![CDATA[automation]]></category><category><![CDATA[GitHub]]></category><category><![CDATA[Docker]]></category><category><![CDATA[ci-cd]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Fri, 05 Dec 2025 19:02:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764961329799/f40cf393-210b-4b57-b9d1-d488c8ffd995.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Building a smooth deployment pipeline for Kubernetes is usually a challenge for small teams and individual developers. Many solutions require complex CI pipelines, container registries, secret management and multiple integration steps. My goal was to create something simpler. I wanted a system that could take any GitHub repository containing a Dockerfile, build the image on a remote server, and deploy it directly into my K3s cluster.</p>
<p>This article explains the complete architecture and the final workflow that made this possible. It uses SSH for secure communication and Helm for templated Kubernetes deployments. Everything happens automatically once a repository is selected.</p>
<p>I have kept this blog short since it was a large Proof of Concept (PoC); we will revisit it in more detail some other day :)</p>
<hr />
<h2 id="heading-the-overall-objective"><strong>The Overall Objective</strong></h2>
<p>The intention was to create a simple but powerful “Select a GitHub repository and deploy it to K3s” experience. Achieving this required solving the following problems:</p>
<ul>
<li><p>Authenticating users through GitHub OAuth</p>
</li>
<li><p>Cloning the selected repository on the remote server</p>
</li>
<li><p>Building a Docker image using the repository’s Dockerfile</p>
</li>
<li><p>Extracting the application port from the Dockerfile</p>
</li>
<li><p>Pushing the image to a container registry when required</p>
</li>
<li><p>Generating Kubernetes manifests dynamically</p>
</li>
<li><p>Applying them to a K3s cluster using remote kubectl</p>
</li>
<li><p>Keeping the workflow secure without exposing SSH passwords</p>
</li>
</ul>
<p>With these requirements in mind, I built the system around a Streamlit based interface, a backend with SSH utilities, and a Helm driven deployment mechanism.</p>
<hr />
<h2 id="heading-authentication-through-github-oauth"><strong>Authentication Through GitHub OAuth</strong></h2>
<p>The first step was allowing users to log in with GitHub and access their repositories. During OAuth callback, I stored the GitHub username and token in a database so that future interactions could occur without requesting credentials again.</p>
<p>Once authenticated, the application fetched all repositories that the user had access to. The user could then pick the repository that needed deployment. This made the process extremely convenient because there was no need for manual cloning or token entry.</p>
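<p>As a rough sketch, the repository listing step can be done against the GitHub REST API like this (the function names and pagination value are my own illustration, not the application's exact code):</p>
<pre><code class="lang-python">import requests

API_URL = "https://api.github.com/user/repos"  # standard GitHub REST endpoint

def auth_headers(token):
    # GitHub accepts OAuth tokens in the Authorization header
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }

def list_repositories(token):
    # Fetch repositories the authenticated user can access (network call)
    resp = requests.get(API_URL, headers=auth_headers(token), params={"per_page": 100})
    resp.raise_for_status()
    return [repo["full_name"] for repo in resp.json()]
</code></pre>
<p>The returned full names feed directly into the repository picker in the UI.</p>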
<hr />
<h2 id="heading-cloning-the-repository-over-ssh"><strong>Cloning the Repository Over SSH</strong></h2>
<p>After a repository was selected, the system connected to the remote host where K3s was installed. The connection used a PEM key, not a password. This removed the need to expose secrets and made the communication secure.</p>
<p>The repository was cloned into a temporary directory on the remote server:</p>
<pre><code class="lang-bash">git clone https://github.com/&lt;user&gt;/&lt;repo&gt;.git /tmp/deploy/&lt;repo-name&gt;
</code></pre>
<p>The entire build and deployment workflow was executed inside this directory.</p>
<hr />
<h2 id="heading-building-the-docker-image-remotely"><strong>Building the Docker Image Remotely</strong></h2>
<p>Instead of building the image locally and pushing it to a registry, I chose to build the image directly on the remote server. This avoided large uploads and made the pipeline significantly faster.</p>
<pre><code class="lang-bash">sudo docker build -t &lt;repo-name&gt;:latest .
</code></pre>
<p>Since K3s uses containerd by default, I installed Docker separately and configured the system so that it could load the built images into K3s, importing them from Docker into containerd whenever necessary.</p>
<p>If AWS ECR integration was enabled, the image was tagged and pushed to the registry. That part was optional and only used in certain deployments.</p>
<hr />
<h2 id="heading-reading-the-application-port-from-the-dockerfile"><strong>Reading the Application Port From the Dockerfile</strong></h2>
<p>Many applications expose ports through the Dockerfile. To make deployments dynamic, I extracted the port by scanning for the EXPOSE instruction:</p>
<pre><code class="lang-dockerfile">EXPOSE 8080
</code></pre>
<p>If this instruction was present, it became the container port in the Kubernetes Deployment manifest. If it was missing, the system used a default value that could be configured.</p>
<p>This simple extraction made deployments far more flexible. It removed the need for manual adjustments each time a repository changed its application port.</p>
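<p>The extraction described above can be sketched in a few lines of Python (this is my approximation of the logic; the default port value is a stand-in for the configurable one):</p>
<pre><code class="lang-python">import re

DEFAULT_PORT = 8080  # fallback when no EXPOSE instruction is present (assumed value)

def extract_port(dockerfile_text):
    # Return the first EXPOSE port found, or the configured default
    for line in dockerfile_text.splitlines():
        match = re.match(r"\s*EXPOSE\s+(\d+)", line, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return DEFAULT_PORT
</code></pre>
<p>For example, <code>extract_port("FROM node:20\nEXPOSE 3000")</code> returns <code>3000</code>.</p>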
<hr />
<h2 id="heading-generating-kubernetes-manifests-automatically"><strong>Generating Kubernetes Manifests Automatically</strong></h2>
<p>Each repository received its own namespace. Namespaces were created if they did not already exist:</p>
<pre><code class="lang-bash">kubectl create namespace &lt;repo-name&gt;
</code></pre>
<p>I used Helm to generate the Deployment, Service and optional Ingress files. A lightweight chart template was created and values were injected programmatically. The values included:</p>
<ul>
<li><p>image name</p>
</li>
<li><p>container port</p>
</li>
<li><p>replica count</p>
</li>
<li><p>environment variables</p>
</li>
<li><p>namespace</p>
</li>
</ul>
<p>The resulting Helm command looked like this:</p>
<pre><code class="lang-bash">helm upgrade --install &lt;repo-name&gt; ./chart \
  --namespace &lt;repo-name&gt; \
  --set image.repository=&lt;image&gt; \
  --set image.tag=latest \
  --set containerPort=&lt;port&gt;
</code></pre>
<p>Helm provided a clean way to handle templating and versioning without writing YAML repeatedly.</p>
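<p>In my setup this command was assembled programmatically before being executed over SSH; here is a sketch of building the argument list (the function name and parameters are illustrative, not the exact application code):</p>
<pre><code class="lang-python">def build_helm_command(repo, image, port, chart_path="./chart"):
    # Build the argv list for helm upgrade --install; a list avoids shell quoting issues
    return [
        "helm", "upgrade", "--install", repo, chart_path,
        "--namespace", repo,
        "--set", f"image.repository={image}",
        "--set", "image.tag=latest",
        "--set", f"containerPort={port}",
    ]
</code></pre>
<p>Passing this list to the SSH execution layer keeps repository names and ports from being mangled by shell interpolation.</p>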
<hr />
<h2 id="heading-applying-the-manifests-to-k3s"><strong>Applying the Manifests to K3s</strong></h2>
<p>Once Helm processed the templates, the K3s cluster applied everything instantly. The deployment became active with one replica. The Service exposed the container port. If Ingress was configured, the application became public with a stable URL.</p>
<p>All of this ran within the remote environment using SSH commands executed from my Python application.</p>
<hr />
<h2 id="heading-the-final-automated-workflow"><strong>The Final Automated Workflow</strong></h2>
<p>After completing the entire pipeline, this became the final experience for the user:</p>
<ol>
<li><p>Log in with GitHub.</p>
</li>
<li><p>Select a repository from the list.</p>
</li>
<li><p>Click “Deploy”.</p>
</li>
</ol>
<p>Behind the scenes, the system handled:</p>
<ul>
<li><p>cloning</p>
</li>
<li><p>image building</p>
</li>
<li><p>port extraction</p>
</li>
<li><p>manifest creation</p>
</li>
<li><p>Helm deployment</p>
</li>
</ul>
<p>The user did not need to interact with Docker commands, kubectl or YAML files. Everything was automatic.</p>
<hr />
<h2 id="heading-challenges-and-solutions"><strong>Challenges and Solutions</strong></h2>
<p><strong>1. Large repositories and slow builds</strong><br />To solve this, I ensured that the remote server had cached layers whenever possible by reusing previous build directories.</p>
<p><strong>2. Managing SSH timeouts</strong><br />Increasing the SSH keep alive configuration and using a resilient execution wrapper helped prevent failures during long deployments.</p>
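<p>The resilient execution wrapper mentioned above can be approximated as follows (here <code>subprocess</code> stands in for the actual SSH call, and the retry counts are assumptions):</p>
<pre><code class="lang-python">import subprocess
import time

def run_with_retries(command, retries=3, delay=5):
    # Re-run a command a few times before giving up; returns the last completed process
    last = None
    for attempt in range(retries):
        last = subprocess.run(command, capture_output=True, text=True)
        if last.returncode == 0:
            return last
        time.sleep(delay)
    return last
</code></pre>
<p>The same pattern wraps the remote Docker and kubectl invocations so that transient network failures do not abort a deployment.</p>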
<p><strong>3. Ensuring reproducible deployments</strong><br />Helm values files were logged to the database so that each deployment had a traceable configuration history.</p>
<p><strong>4. Handling errors gracefully</strong><br />Whenever Docker or kubectl produced errors, they were captured and displayed in the Streamlit UI so that the user could fix the repository or Dockerfile.</p>
<hr />
<h2 id="heading-benefits-of-this-approach"><strong>Benefits of This Approach</strong></h2>
<p>This approach is not designed to replace full CI systems. However, it solves a specific problem extremely well. It provides a fast method to deploy small to medium applications into K3s without the overhead of pipelines, registries or private runners.</p>
<p>Some benefits include:</p>
<ul>
<li><p>Very quick setup for new applications</p>
</li>
<li><p>No complex infrastructure required</p>
</li>
<li><p>Secure SSH communication</p>
</li>
<li><p>Automatic image handling</p>
</li>
<li><p>Helm based templating</p>
</li>
<li><p>Namespaced isolation for each deployment</p>
</li>
</ul>
<p>It is ideal for personal projects, prototypes, microservices or self-hosted internal tools.</p>
<hr />
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Automating GitHub repository deployments into a K3s cluster became an enjoyable project. By combining SSH, Docker, Kubernetes and Helm into a single workflow, I created a flexible and dynamic deployment system. It saves time, reduces manual work and makes it possible to deploy new applications with a simple click.</p>
]]></content:encoded></item><item><title><![CDATA[Mastering Port Forwarding as a Service: Running Kubernetes Port Forwards with systemd]]></title><description><![CDATA[Kubernetes port forwarding is extremely useful for quick access to internal services. It allows local tools to reach cluster applications without exposing them through Ingress or LoadBalancer services. The problem is that port forwarding breaks easil...]]></description><link>https://blog.nyzex.in/mastering-port-forwarding-as-a-service-running-kubernetes-port-forwards-with-systemd</link><guid isPermaLink="true">https://blog.nyzex.in/mastering-port-forwarding-as-a-service-running-kubernetes-port-forwards-with-systemd</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Devops]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[#kubernetes #container ]]></category><category><![CDATA[portfowarding]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Tue, 02 Dec 2025 19:56:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764705314146/4a74fbbc-59e2-4ea8-b868-4610254d886f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kubernetes port forwarding is extremely useful for quick access to internal services. It allows local tools to reach cluster applications without exposing them through Ingress or LoadBalancer services. The problem is that port forwarding breaks easily. It stops when the terminal closes, when the network changes, or when the forwarding process restarts. Many engineers use it only during development because of these limitations.</p>
<p>The good news is that port forwarding can be turned into a persistent background service that runs automatically on boot and restarts on failure. This can be achieved by running <code>kubectl port-forward</code> under systemd. Once configured, the forwarding behaves like any other system-level service.</p>
<h2 id="heading-understanding-the-port-forward-setup-for-postgres">Understanding the Port-Forward Setup for Postgres</h2>
<p>The goal was to expose a PostgreSQL instance running <strong>inside the Kubernetes cluster</strong> so that it could be accessed securely from <strong>another machine</strong>. Instead of exposing the database publicly or creating a LoadBalancer, I used a <strong>port-forward</strong> running as a persistent systemd service. This behaves very much like an <strong>SSH tunnel</strong>, but carried over the Kubernetes API and kept alive automatically by systemd.</p>
<p>Accessing a PostgreSQL database that lives inside a Kubernetes cluster often starts with a quick command:</p>
<pre><code class="lang-bash">kubectl port-forward pod/postgres 15432:5432 -n postgresql
</code></pre>
<p>It works.<br />It is simple.<br />And it also <strong>breaks</strong> the moment:</p>
<ul>
<li><p>your SSH session closes</p>
</li>
<li><p>the terminal dies</p>
</li>
<li><p>the pod restarts</p>
</li>
<li><p>network hiccups occur</p>
</li>
</ul>
<p>For development or for connecting external applications, this becomes frustrating. You want PostgreSQL running inside the cluster to feel like it is running locally. Always reachable. No interruptions.</p>
<p>This is exactly where a <strong>persistent port-forwarding setup</strong> becomes extremely useful.</p>
<p>In this blog, I explain how I automated PostgreSQL port-forwarding using:</p>
<ul>
<li><p>a simple shell script</p>
</li>
<li><p>a systemd service</p>
</li>
<li><p>dynamic pod detection</p>
</li>
<li><p>persistent reconnection</p>
</li>
</ul>
<p>This ensures that even if the pod restarts or the port-forward crashes, the system automatically brings it back up.</p>
<hr />
<h2 id="heading-why-manual-port-forwarding-fails-over-time"><strong>Why Manual Port Forwarding Fails Over Time</strong></h2>
<p>Port-forwarding is not designed for long-term, production-grade networking. It is a debugging convenience.</p>
<p>When you run:</p>
<pre><code class="lang-bash">kubectl port-forward pod/postgres 15432:5432
</code></pre>
<p>You are telling Kubernetes:</p>
<ul>
<li><p>Open <strong>15432</strong> on your <strong>local machine</strong></p>
</li>
<li><p>Forward all traffic from this local port</p>
</li>
<li><p>To <strong>port 5432</strong> inside the <strong>postgres pod</strong></p>
</li>
</ul>
<p>The moment the terminal stops, or the pod is replaced by a new one during a deployment or restart, the connection is lost.</p>
<p>This leads to:</p>
<ul>
<li><p>connection refused</p>
</li>
<li><p>ECONNRESET</p>
</li>
<li><p>your app cannot connect to the database</p>
</li>
<li><p>your scripts or migrations failing mid-run</p>
</li>
</ul>
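<p>A quick way to tell from the client side whether the forward is still alive is a plain TCP connect, sketched below (the host and port are the values used in this article):</p>
<pre><code class="lang-python">import socket

def is_port_open(host, port, timeout=2.0):
    # Try a TCP connection; True means something is listening on that port
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
</code></pre>
<p>For example, <code>is_port_open("localhost", 15432)</code> should return <code>True</code> while the forward is running.</p>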
<p>The ideal fix is using a LoadBalancer, NodePort, an internal service mesh, or VPN into the cluster.<br />But when that is not possible (for example in a locked-down internal environment), <strong>persistent port-forwarding is a surprisingly useful workaround</strong>.</p>
<h2 id="heading-solution-overview"><strong>Solution Overview</strong></h2>
<p>We will set up:</p>
<ol>
<li><p><strong>A shell script</strong> that:</p>
<ul>
<li><p>Continuously finds the current PostgreSQL pod</p>
</li>
<li><p>Establishes the port-forward</p>
</li>
<li><p>Automatically retries if the pod changes or the forward crashes</p>
</li>
</ul>
</li>
<li><p><strong>A systemd service</strong> that:</p>
<ul>
<li><p>Runs this script in the background</p>
</li>
<li><p>Starts automatically on boot</p>
</li>
<li><p>Restarts on failure</p>
</li>
</ul>
</li>
</ol>
<p>This results in a self-healing port-forward that always stays alive.</p>
<hr />
<h2 id="heading-step-1-the-final-working-script"><strong>Step 1: The Final Working Script</strong></h2>
<p>Save it as:</p>
<pre><code class="lang-bash">/home/ubuntu/pg-portforward.sh
</code></pre>
<pre><code class="lang-bash">#!/bin/bash
# "set -e" is intentionally omitted: the loop must survive kubectl failures

export KUBECONFIG="${HOME}/.k0s/kubeconfig"

NAMESPACE="postgresql"
LOCAL_PORT=15432
REMOTE_PORT=5432

while true; do
  POD=$(kubectl get pod -n $NAMESPACE -l app=postgres \
        -o jsonpath='{.items[0].metadata.name}' 2&gt;/dev/null)

  if [ -z "$POD" ]; then
    echo "[$(date)] Postgres pod not found. Retrying in 5s..."
    sleep 5
    continue
  fi

  echo "[$(date)] Forwarding to pod: $POD"
  kubectl -n $NAMESPACE port-forward pod/$POD ${LOCAL_PORT}:${REMOTE_PORT}

  echo "[$(date)] Port-forward crashed. Restarting in 5s..."
  sleep 5
done
</code></pre>
<p>Make it executable:</p>
<pre><code class="lang-bash">chmod +x pg-portforward.sh
</code></pre>
<hr />
<h2 id="heading-step-2-the-systemd-unit-file"><strong>Step 2: The systemd Unit File</strong></h2>
<p>Create:</p>
<pre><code class="lang-bash">/etc/systemd/system/pg-portforward.service
</code></pre>
<pre><code class="lang-ini">[Unit]
Description=Persistent port-forward for PostgreSQL pod
After=network.target

[Service]
User=ubuntu
Environment="KUBECONFIG=/home/ubuntu/.k0s/kubeconfig"
Environment=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
ExecStart=/home/ubuntu/pg-portforward.sh
Restart=always
RestartSec=5
KillMode=process

[Install]
WantedBy=multi-user.target
</code></pre>
<p>Enable and start it:</p>
<pre><code class="lang-bash">sudo systemctl daemon-reload
sudo systemctl enable pg-portforward
sudo systemctl start pg-portforward
sudo systemctl status pg-portforward
</code></pre>
<hr />
<h2 id="heading-how-this-works"><strong>How This Works</strong></h2>
<h3 id="heading-1-dynamic-pod-discovery"><strong>1. Dynamic Pod Discovery</strong></h3>
<p>PostgreSQL pods often restart during:</p>
<ul>
<li><p>upgrades</p>
</li>
<li><p>node draining</p>
</li>
<li><p>scaling events</p>
</li>
</ul>
<p>A static pod name does not work.<br />The script uses a label selector:</p>
<pre><code class="lang-bash">-l app=postgres
</code></pre>
<p>You can substitute whatever label your Helm chart applies.</p>
<h3 id="heading-2-automatic-reconnection"><strong>2. Automatic Reconnection</strong></h3>
<p>If the port-forward dies (common), the script simply loops and starts again.</p>
<h3 id="heading-3-systemd-keeps-it-alive"><strong>3. systemd keeps it alive</strong></h3>
<p>If the script itself fails, systemd restarts it.</p>
<p>If the machine reboots, the service auto-starts.</p>
<p>This ensures <strong>PostgreSQL inside Kubernetes remains reachable on localhost:15432 at all times</strong>.</p>
<h3 id="heading-why-a-systemd-service">Why a Systemd Service</h3>
<p>Port-forwarding dies when:</p>
<ul>
<li><p>the pod restarts</p>
</li>
<li><p>the connection breaks</p>
</li>
<li><p>kubectl crashes</p>
</li>
<li><p>the terminal closes</p>
</li>
</ul>
<p>To keep the port-forward <strong>alive forever</strong>, I wrapped it in:</p>
<ul>
<li><p>a small bash script that loops forever, automatically reconnecting</p>
</li>
<li><p>a systemd service that starts on boot and restarts on failure</p>
</li>
</ul>
<p>This means your Postgres database is always reachable through that forward without babysitting the terminal.</p>
<hr />
<h2 id="heading-testing-the-setup"><strong>Testing the Setup</strong></h2>
<p>From your local machine (note that <code>kubectl port-forward</code> binds only to 127.0.0.1 on the server by default, so reaching it through the node IP requires adding <code>--address 0.0.0.0</code> to the forward command):</p>
<pre><code class="lang-bash">psql -h &lt;your-node-ip&gt; -p 15432 -U postgres -d yourdb
</code></pre>
<p>If SSH-tunneling:</p>
<pre><code class="lang-bash">ssh -L 5432:localhost:15432 ubuntu@&lt;server&gt;
psql -h localhost -p 5432
</code></pre>
<p>If port-forward is running correctly, the connection will be instant.</p>
<p>In DBeaver:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764705001160/cfe7b46c-878b-4c53-9d27-dfe7c2d78590.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764705003805/0f055eb1-ac85-4ddf-86c7-0260e4fe8256.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-common-issues-and-fixes"><strong>Common Issues and Fixes</strong></h2>
<h3 id="heading-kubectl-not-found-inside-systemd"><strong>kubectl not found inside systemd</strong></h3>
<p>systemd does not automatically inherit your PATH.<br />This is why we added:</p>
<pre><code class="lang-ini">Environment=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
</code></pre>
<h3 id="heading-kubeconfig-not-respected"><strong>KUBECONFIG not respected</strong></h3>
<p>My kubeconfig was located here:</p>
<pre><code class="lang-bash">/home/ubuntu/.k0s/kubeconfig
</code></pre>
<p>So we exported it in both the script <em>and</em> the service.</p>
<h3 id="heading-pod-name-changing"><strong>Pod name changing</strong></h3>
<p>The script automatically handles this.</p>
<h3 id="heading-security-benefits">Security Benefits</h3>
<p>This setup avoids exposing Postgres publicly. Only users who can SSH into the server can reach the database. No load balancers. No Ingress. No NodePort. Just a controlled, encrypted tunnel.</p>
<hr />
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Port forwarding is usually thought of as a temporary debugging feature, but with a small wrapper script and a systemd unit, it becomes a powerful mechanism for stable access to internal services. For PostgreSQL in particular, this method delivers reliable availability without relaxing cluster security.</p>
<p>This setup is easy to maintain, self-healing, and ideal for environments where PostgreSQL must be reachable consistently through a trusted machine.</p>
]]></content:encoded></item><item><title><![CDATA[Brainwave Visualization Using ESP32, BioAmpEXG, FastAPI, and Interactive Charts]]></title><description><![CDATA[Monitoring brainwave activity in real time has always fascinated me. I wanted to build something that could collect EEG signals, process them on a lightweight device, and display the results on a clean web dashboard. With an ESP32, a MAX30100 sensor,...]]></description><link>https://blog.nyzex.in/brainwave-visualization-using-esp32-bioampexg-fastapi-and-interactive-charts</link><guid isPermaLink="true">https://blog.nyzex.in/brainwave-visualization-using-esp32-bioampexg-fastapi-and-interactive-charts</guid><category><![CDATA[bioamp]]></category><category><![CDATA[brainwave]]></category><category><![CDATA[eeg]]></category><category><![CDATA[visualization]]></category><category><![CDATA[iot]]></category><category><![CDATA[Internet of Things]]></category><category><![CDATA[ESP32]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Mon, 01 Dec 2025 03:49:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764560916745/0054b306-9459-416d-a074-3f8db63e43c8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Monitoring brainwave activity in real time has always fascinated me. I wanted to build something that could collect EEG signals, process them on a lightweight device, and display the results on a clean web dashboard. With an ESP32, a MAX30100 sensor, a small analog EEG input, and a FastAPI backend, I was able to create a portable system that measures Alpha, Beta, and Gamma brainwave bands and visualises them as simple bar graphs.</p>
<p>This article explains how I built the entire pipeline, from data collection to real-time visualisation.</p>
<p>I also wrote a short study based on this and used M5Stack for it:<br /><a target="_blank" href="https://www.researchgate.net/publication/391839761_Short-Term_Neurophysiological_Changes_During_Transcendental_Meditation_A_Pilot_EEG_and_ECG-Based_Study">https://www.researchgate.net/publication/391839761_Short-Term_Neurophysiological_Changes_During_Transcendental_Meditation_A_Pilot_EEG_and_ECG-Based_Study</a></p>
<hr />
<h2 id="heading-why-i-built-this"><strong>Why I Built This</strong></h2>
<p>I wanted a portable and affordable setup that could:</p>
<ul>
<li><p>Collect EEG data through a simple analog pin</p>
</li>
<li><p>Compute frequency bands using Fast Fourier Transform</p>
</li>
<li><p>Add additional biometric data from a MAX30100 sensor</p>
</li>
<li><p>Send all readings to a backend server over Wi-Fi</p>
</li>
<li><p>Show clean and simple charts on a dashboard</p>
</li>
<li><p>Make the system completely wireless</p>
</li>
</ul>
<p>The ESP32 Zero 2 WH (or any small ESP32 board) was perfect because it is inexpensive, efficient, and supports both Wi-Fi and continuous sensor sampling.</p>
<hr />
<h1 id="heading-hardware-setup"><strong>Hardware Setup</strong></h1>
<h3 id="heading-components-used"><strong>Components Used</strong></h3>
<ul>
<li><p><strong>ESP32 (M5Core2 or Zero 2 WH)</strong></p>
</li>
<li><p><strong>MAX30100 sensor</strong> for infrared and red readings (later replaced with an ECG module)</p>
</li>
<li><p><strong>EEG analog signal</strong> (BioAmpEXG Pill) connected to an ADC pin</p>
</li>
<li><p><strong>Wi-Fi</strong> to push data to backend</p>
</li>
<li><p><strong>Power source</strong> (USB or portable battery)</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764560486606/3f3113c1-b753-4137-bbc7-2dd29994d015.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-wiring-overview"><strong>Wiring Overview</strong></h3>
<ul>
<li><p>MAX30100 SDA → ESP32 SDA pin</p>
</li>
<li><p>MAX30100 SCL → ESP32 SCL pin</p>
</li>
<li><p>EEG analog output → ESP32 ADC pin</p>
</li>
<li><p>Common ground for all components</p>
</li>
</ul>
<p>The MAX30100 is optional for brainwave detection, but I wanted additional IR/RED data to calculate orderliness and signal health. It was the initial choice, but I later switched to the AD8232 ECG module.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764560575071/f817df19-8025-4e42-9e76-bc2a3e158fdb.png" alt class="image--center mx-auto" /></p>
<hr />
<h1 id="heading-esp32-firmware-logic"><strong>ESP32 Firmware Logic</strong></h1>
<p>The ESP32 collects data continuously. Each cycle does the following:</p>
<ol>
<li><p>Read <strong>raw EEG analog values</strong></p>
</li>
<li><p>Read <strong>IR and RED values</strong> from MAX30100</p>
</li>
<li><p>Apply <strong>Fast Fourier Transform</strong> to EEG samples</p>
</li>
<li><p>Extract <strong>band powers</strong>:</p>
<ul>
<li><p>Alpha (8 to 12 Hz)</p>
</li>
<li><p>Beta (12 to 30 Hz)</p>
</li>
<li><p>Gamma (30 to 100 Hz)</p>
</li>
</ul>
</li>
<li><p>Package everything into a JSON payload</p>
</li>
<li><p>Send the JSON data to the FastAPI backend via Wi-Fi</p>
</li>
</ol>
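<p>Steps 3 and 4 can be reproduced on the backend side with NumPy for sanity-checking the firmware output (the firmware itself ran an on-device FFT; this Python version and the sampling rate are only my approximation of the same math):</p>
<pre><code class="lang-python">import numpy as np

FS = 256  # assumed sampling rate in Hz

def band_powers(samples):
    # Power per frequency bin via the real FFT
    spectrum = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), 1.0 / FS)
    hz = freqs[1] - freqs[0]  # bin width in Hz

    def band(lo, hi):
        # Sum power over bins whose frequency falls in [lo, hi)
        return float(spectrum[int(np.ceil(lo / hz)):int(np.ceil(hi / hz))].sum())

    return {"alpha": band(8, 12), "beta": band(12, 30), "gamma": band(30, 100)}
</code></pre>
<p>Feeding it a pure 10 Hz sine produces a result dominated by the alpha band, which is a handy smoke test for the pipeline.</p>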
<h3 id="heading-example-json-payload"><strong>Example JSON Payload</strong></h3>
<pre><code class="lang-json">{
    <span class="hljs-attr">"alpha"</span>: <span class="hljs-number">42.3</span>,
    <span class="hljs-attr">"beta"</span>: <span class="hljs-number">28.1</span>,
    <span class="hljs-attr">"gamma"</span>: <span class="hljs-number">10.5</span>,
    <span class="hljs-attr">"orderliness"</span>: <span class="hljs-number">0.82</span>,
    <span class="hljs-attr">"ir"</span>: <span class="hljs-number">51200</span>,
    <span class="hljs-attr">"red"</span>: <span class="hljs-number">50390</span>
}
</code></pre>
<hr />
<h1 id="heading-backend-fastapi-application"><strong>Backend: FastAPI Application</strong></h1>
<p>The backend receives data, stores it, and serves it to the dashboard.</p>
<h3 id="heading-key-features"><strong>Key Features</strong></h3>
<ul>
<li><p>Endpoint for receiving ESP32 JSON data</p>
</li>
<li><p>In-memory store or Redis for fast access</p>
</li>
<li><p>REST endpoint for the dashboard</p>
</li>
<li><p>CORS enabled</p>
</li>
<li><p>Very low latency</p>
</li>
</ul>
<h3 id="heading-example-fastapi-endpoint"><strong>Example FastAPI Endpoint</strong></h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> FastAPI
<span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseModel

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">BrainwaveData</span>(<span class="hljs-params">BaseModel</span>):</span>
    alpha: float
    beta: float
    gamma: float
    orderliness: float
    ir: int
    red: int

app = FastAPI()

latest_data = BrainwaveData(
    alpha=<span class="hljs-number">0</span>, beta=<span class="hljs-number">0</span>, gamma=<span class="hljs-number">0</span>, orderliness=<span class="hljs-number">0</span>, ir=<span class="hljs-number">0</span>, red=<span class="hljs-number">0</span>
)

<span class="hljs-meta">@app.post("/update")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">update</span>(<span class="hljs-params">data: BrainwaveData</span>):</span>
    <span class="hljs-keyword">global</span> latest_data
    latest_data = data
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"status"</span>: <span class="hljs-string">"ok"</span>}

<span class="hljs-meta">@app.get("/data")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_data</span>():</span>
    <span class="hljs-keyword">return</span> latest_data
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764560497211/eacc795f-3981-41bf-868a-b44b4f8b659b.png" alt class="image--center mx-auto" /></p>
<hr />
<h1 id="heading-building-the-dashboard"><strong>Building the Dashboard</strong></h1>
<p>I wanted a clean visualisation with no curves, only bar graphs.<br />The dashboard uses:</p>
<ul>
<li><p><strong>HTML + Bootstrap</strong></p>
</li>
<li><p><strong>Chart.js</strong> for bar charts</p>
</li>
<li><p>Auto-refresh using JavaScript</p>
</li>
<li><p>Smooth transitions</p>
</li>
</ul>
<h3 id="heading-why-bar-graphs"><strong>Why Bar Graphs?</strong></h3>
<p>Bar graphs work well because brainwave bands are relative.<br />The magnitude of Alpha versus Beta is the most important insight, and bars make comparison easy.</p>
<hr />
<h1 id="heading-dashboard-layout"><strong>Dashboard Layout</strong></h1>
<p>The dashboard has:</p>
<ul>
<li><p>A bar graph for Alpha, Beta, and Gamma</p>
</li>
<li><p>A card showing orderliness</p>
</li>
<li><p>A small panel showing IR and RED values</p>
</li>
<li><p>A refresh interval of 1 second</p>
</li>
</ul>
<h3 id="heading-example-chartjs-code-snippet"><strong>Example Chart.js Code Snippet</strong></h3>
<pre><code class="lang-javascript"><span class="hljs-keyword">const</span> ctx = <span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">"brainChart"</span>);

<span class="hljs-keyword">const</span> chart = <span class="hljs-keyword">new</span> Chart(ctx, {
    <span class="hljs-attr">type</span>: <span class="hljs-string">"bar"</span>,
    <span class="hljs-attr">data</span>: {
        <span class="hljs-attr">labels</span>: [<span class="hljs-string">"Alpha"</span>, <span class="hljs-string">"Beta"</span>, <span class="hljs-string">"Gamma"</span>],
        <span class="hljs-attr">datasets</span>: [{
            <span class="hljs-attr">data</span>: [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>]
        }]
    },
    <span class="hljs-attr">options</span>: {
        <span class="hljs-attr">animation</span>: <span class="hljs-literal">false</span>,
        <span class="hljs-attr">scales</span>: {
            <span class="hljs-attr">y</span>: { <span class="hljs-attr">beginAtZero</span>: <span class="hljs-literal">true</span> }
        }
    }
});

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">refreshData</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">const</span> r = <span class="hljs-keyword">await</span> fetch(<span class="hljs-string">"/data"</span>);
    <span class="hljs-keyword">const</span> d = <span class="hljs-keyword">await</span> r.json();
    chart.data.datasets[<span class="hljs-number">0</span>].data = [d.alpha, d.beta, d.gamma];
    chart.update();
}

<span class="hljs-built_in">setInterval</span>(refreshData, <span class="hljs-number">1000</span>);
</code></pre>
<p>Here is a graph that I obtained:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764560645072/95106eca-d1c8-44f5-90a1-6078d2ffc458.png" alt class="image--center mx-auto" /></p>
<hr />
<h1 id="heading-how-it-works-together"><strong>How It Works Together</strong></h1>
<h3 id="heading-end-to-end-pipeline"><strong>End-to-End Pipeline</strong></h3>
<ol>
<li><p>ESP32 reads EEG values and MAX30100 values</p>
</li>
<li><p>ESP32 performs FFT and computes band powers</p>
</li>
<li><p>ESP32 sends JSON to FastAPI backend</p>
</li>
<li><p>Dashboard fetches latest data through <code>/data</code></p>
</li>
<li><p>Chart.js updates the bars in real time</p>
</li>
</ol>
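As a rough illustration of step 3, the device-to-backend handoff can be sketched in Python (on the real device this runs as C++ on the ESP32; the field names follow the BrainwaveData model shown earlier, while the URL and values are placeholders):

```python
# Hypothetical client-side sketch of step 3 (illustrative only; the actual
# firmware is C++ on the ESP32). Field names follow the BrainwaveData model
# shown earlier; the URL and the numbers are placeholders.
import json
import urllib.request

payload = {"alpha": 12.4, "beta": 8.1, "gamma": 3.7,
           "orderliness": 0.62, "ir": 10234, "red": 9871}

req = urllib.request.Request(
    "http://localhost:8000/update",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send it; the dashboard then reads the
# same values back from GET /data once per refresh interval.
```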
<hr />
<h1 id="heading-challenges-i-faced"><strong>Challenges I Faced</strong></h1>
<h3 id="heading-1-noise-in-the-eeg-signal"><strong>1. Noise in the EEG Signal</strong></h3>
<p>Low-cost EEG is noisy.<br />I had to apply:</p>
<ul>
<li><p>Moving average filters</p>
</li>
<li><p>Calibration</p>
</li>
<li><p>Proper grounding</p>
</li>
<li><p>FFT windowing</p>
</li>
</ul>
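A moving-average filter of the kind listed above can be sketched as follows (the window size here is an illustrative assumption, not the value used on the device):

```python
# Simple causal moving-average filter over raw EEG samples.
# window=8 is an illustrative choice, not the device's actual setting.
def moving_average(samples, window=8):
    out = []
    for i in range(len(samples)):
        lo = max(0, i - window + 1)   # clamp at the start of the stream
        chunk = samples[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

A constant signal passes through unchanged, while short spikes are smeared out, which is exactly the behaviour you want before running the FFT.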
<h3 id="heading-2-sampling-rate-stability"><strong>2. Sampling Rate Stability</strong></h3>
<p>To extract accurate brainwave bands, the sampling rate must be stable.<br />I locked the ESP32 ADC sampling to a consistent interval.</p>
<h3 id="heading-3-fast-refresh-rendering"><strong>3. Fast Refresh Rendering</strong></h3>
<p>Continuous updates caused stuttering until I disabled animation in Chart.js.</p>
<hr />
<h1 id="heading-final-result"><strong>Final Result</strong></h1>
<p>The dashboard provides a clean and real-time visualisation of:</p>
<ul>
<li><p>Alpha, Beta, Gamma brain activity</p>
</li>
<li><p>Signal orderliness</p>
</li>
<li><p>Infrared and red biometric data</p>
</li>
</ul>
<p>It works smoothly on both desktop and mobile browsers and updates once every second.</p>
<hr />
<h1 id="heading-future-improvements"><strong>Future Improvements</strong></h1>
<p>I plan to enhance the system with:</p>
<ul>
<li><p>WebSocket streaming instead of polling</p>
</li>
<li><p>A rolling timeline view for long sessions</p>
</li>
<li><p>Support for multiple users</p>
</li>
<li><p>A database for storing and analysing sessions</p>
</li>
<li><p>A machine learning model that detects focus, stress, or calmness</p>
</li>
</ul>
<hr />
<h1 id="heading-conclusion"><strong>Conclusion</strong></h1>
<p>This project showed me how much can be done with simple hardware and a clean backend architecture. By combining an ESP32, BioAmpEXG, FFT analysis, FastAPI, and a lightweight dashboard, it is possible to create a fully portable and real-time brainwave monitoring system.</p>
]]></content:encoded></item><item><title><![CDATA[Visualizing Latency Comparisons Between LLM APIs: OpenRouter vs Bedrock]]></title><description><![CDATA[Large Language Models (LLMs) are now integral to modern software applications, powering tasks such as summarization, code generation, and technical explanations. When evaluating multiple LLM APIs, latency, response quality, and consistency are critic...]]></description><link>https://blog.nyzex.in/visualizing-latency-comparisons-between-llm-apis-openrouter-vs-bedrock</link><guid isPermaLink="true">https://blog.nyzex.in/visualizing-latency-comparisons-between-llm-apis-openrouter-vs-bedrock</guid><category><![CDATA[llm]]></category><category><![CDATA[AI]]></category><category><![CDATA[AWS]]></category><category><![CDATA[openrouter]]></category><category><![CDATA[bedrock]]></category><category><![CDATA[Cloud Computing]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Wed, 26 Nov 2025 13:50:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764165025416/7ec4269a-b580-460f-9411-805d59fb7780.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Large Language Models (LLMs) are now integral to modern software applications, powering tasks such as summarization, code generation, and technical explanations. When evaluating multiple LLM APIs, <strong>latency</strong>, <strong>response quality</strong>, and <strong>consistency</strong> are critical. Today, I share a detailed analysis of latency comparison between <strong>OpenRouter</strong> and <strong>Bedrock</strong>, along with methodology, visualization, and insights.</p>
<h2 id="heading-experiment-overview"><strong>Experiment Overview</strong></h2>
<p>The primary objective of this experiment was to measure and compare <strong>response latency</strong> for OpenRouter and Bedrock across multiple prompts. The experiment was designed to capture not only the speed of each API but also its consistency across repeated queries.</p>
<h3 id="heading-prompts-used"><strong>Prompts Used</strong></h3>
<p>Three representative prompts were chosen for the comparison:</p>
<ol>
<li><p>Explain Kubernetes in simple terms for a beginner.</p>
</li>
<li><p>Write a Python function to reverse a linked list.</p>
</li>
<li><p>Summarize the book <em>Atomic Habits</em> in three sentences.</p>
</li>
</ol>
<p>Each prompt was sent <strong>five times</strong> to each API to generate multiple latency measurements for statistical analysis.</p>
<h2 id="heading-data-collection-methodology"><strong>Data Collection Methodology</strong></h2>
<p>The latency comparison was conducted using Python, with the following approach:</p>
<ol>
<li><p><strong>OpenRouter API Calls:</strong></p>
<ul>
<li><p>Sent HTTP POST requests to the OpenRouter API with the prompt, specifying the <code>gpt-oss-20b</code> model.</p>
</li>
<li><p>Measured <strong>start and end timestamps</strong> to calculate latency.</p>
</li>
<li><p>Extracted the text response from the API JSON payload.</p>
</li>
</ul>
</li>
<li><p><strong>AWS Bedrock API Calls:</strong></p>
<ul>
<li><p>Used the <code>boto3</code> client to invoke the Bedrock model <code>openai.gpt-oss-20b-1</code>.</p>
</li>
<li><p>Sent the prompt in the OpenAI-style chat format.</p>
</li>
<li><p>Measured latency from request initiation to response.</p>
</li>
<li><p>Extracted the returned text from the API payload.</p>
</li>
</ul>
</li>
<li><p><strong>Data Storage:</strong></p>
<ul>
<li><p>Each query stored the following fields: prompt, repeat number, OpenRouter response, OpenRouter latency, Bedrock response, and Bedrock latency.</p>
</li>
<li><p>All results were saved into a CSV file (<code>llm_comparison.csv</code>) for analysis and visualization.</p>
</li>
</ul>
</li>
</ol>
<p>This setup ensured a <strong>repeatable and reliable</strong> dataset for performance analysis and comparison.</p>
<p>Here is a <strong>condensed snippet</strong> showing the main idea of the comparison script:</p>
<pre><code class="lang-python"><span class="hljs-keyword">for</span> prompt <span class="hljs-keyword">in</span> prompts:
    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(REPEATS):
        or_text, or_time = call_openrouter(prompt)
        print(<span class="hljs-string">f"OpenRouter [<span class="hljs-subst">{i+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{REPEATS}</span>] Latency: <span class="hljs-subst">{or_time:<span class="hljs-number">.2</span>f}</span>s"</span>)
        br_text, br_time = call_bedrock(prompt)
        print(<span class="hljs-string">f"Bedrock   [<span class="hljs-subst">{i+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{REPEATS}</span>] Latency: <span class="hljs-subst">{br_time:<span class="hljs-number">.2</span>f}</span>s"</span>)
        data_rows.append({
            <span class="hljs-string">"prompt"</span>: prompt,
            <span class="hljs-string">"repeat"</span>: i+<span class="hljs-number">1</span>,
            <span class="hljs-string">"openrouter_response"</span>: or_text,
            <span class="hljs-string">"openrouter_latency"</span>: or_time,
            <span class="hljs-string">"bedrock_response"</span>: br_text,
            <span class="hljs-string">"bedrock_latency"</span>: br_time
        })
</code></pre>
<p>This allowed me to build a <strong>structured dataset</strong> with both responses and latencies for each prompt and repeat.</p>
<h2 id="heading-latency-analysis"><strong>Latency Analysis</strong></h2>
<p>Using the CSV data, we conducted both statistical and visual analysis to compare the APIs.</p>
<h3 id="heading-openrouter-latency"><strong>OpenRouter Latency</strong></h3>
<ul>
<li><p>Minimum Latency: 2.32 seconds</p>
</li>
<li><p>Maximum Latency: 7.28 seconds</p>
</li>
<li><p>Average Latency: Approximately 4.60 seconds</p>
</li>
<li><p>Observation: OpenRouter exhibited <strong>higher variability</strong>, particularly for repeated technical explanation prompts.</p>
</li>
</ul>
<h3 id="heading-bedrock-latency"><strong>Bedrock Latency</strong></h3>
<ul>
<li><p>Minimum Latency: 2.00 seconds</p>
</li>
<li><p>Maximum Latency: 3.24 seconds</p>
</li>
<li><p>Average Latency: Approximately 3.05 seconds</p>
</li>
<li><p>Observation: Bedrock was consistently faster and <strong>more stable</strong> across repeats and prompt types.</p>
</li>
</ul>
<h3 id="heading-prompt-specific-patterns"><strong>Prompt-Specific Patterns</strong></h3>
<ul>
<li><p><strong>Kubernetes Explanation:</strong> Bedrock consistently responded under 3 seconds, while OpenRouter spiked to over 7 seconds in one repeat.</p>
</li>
<li><p><strong>Python Code Reversal:</strong> Both APIs performed similarly in early repeats, but Bedrock remained slightly faster.</p>
</li>
<li><p><strong>Book Summarization:</strong> Bedrock maintained both speed and stability, whereas OpenRouter showed variability in later repeats.</p>
</li>
</ul>
<h2 id="heading-visualization-approach"><strong>Visualization Approach</strong></h2>
<p>To better understand latency differences, the following visualizations were created:</p>
<ol>
<li><p><strong>Boxplot:</strong> Shows overall latency distribution for each API, highlighting median, quartiles, and outliers.</p>
</li>
<li><p><strong>Lineplot Per Prompt:</strong> Displays latency across repeats for each prompt, revealing <strong>consistency and spikes</strong>.</p>
</li>
</ol>
<p>These visualizations make trends immediately clear, allowing developers to make informed choices between APIs.</p>
<h2 id="heading-python-script-for-plotting"><strong>Python Script for Plotting</strong></h2>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns

df = pd.read_csv(<span class="hljs-string">"llm_comparison.csv"</span>)

print(df[[<span class="hljs-string">'openrouter_latency'</span>, <span class="hljs-string">'bedrock_latency'</span>]].describe())

plt.figure(figsize=(<span class="hljs-number">12</span>,<span class="hljs-number">6</span>))
sns.boxplot(data=df[[<span class="hljs-string">'openrouter_latency'</span>, <span class="hljs-string">'bedrock_latency'</span>]])
plt.title(<span class="hljs-string">"Latency Comparison: OpenRouter vs Bedrock"</span>)
plt.ylabel(<span class="hljs-string">"Latency (seconds)"</span>)
plt.show()

plt.figure(figsize=(<span class="hljs-number">14</span>,<span class="hljs-number">6</span>))
<span class="hljs-keyword">for</span> prompt <span class="hljs-keyword">in</span> df[<span class="hljs-string">'prompt'</span>].unique():
    prompt_data = df[df[<span class="hljs-string">'prompt'</span>] == prompt]
    sns.lineplot(x=<span class="hljs-string">'repeat'</span>, y=<span class="hljs-string">'openrouter_latency'</span>, data=prompt_data, label=<span class="hljs-string">f'OpenRouter: <span class="hljs-subst">{prompt}</span>'</span>, marker=<span class="hljs-string">'o'</span>)
    sns.lineplot(x=<span class="hljs-string">'repeat'</span>, y=<span class="hljs-string">'bedrock_latency'</span>, data=prompt_data, label=<span class="hljs-string">f'Bedrock: <span class="hljs-subst">{prompt}</span>'</span>, marker=<span class="hljs-string">'o'</span>)
plt.title(<span class="hljs-string">"Latency Trends Per Prompt Repeat"</span>)
plt.xlabel(<span class="hljs-string">"Repeat Number"</span>)
plt.ylabel(<span class="hljs-string">"Latency (seconds)"</span>)
plt.legend(bbox_to_anchor=(<span class="hljs-number">1.05</span>, <span class="hljs-number">1</span>), loc=<span class="hljs-string">'upper left'</span>)
plt.show()
</code></pre>
<p>We obtained the following from the visualization:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764164561012/6904465f-f831-400b-9041-a0f87fea18c3.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-insights-from-the-data"><strong>Insights From the Data</strong></h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764164571884/f1159935-a5a8-4318-b10f-981e0c0bddda.png" alt class="image--center mx-auto" /></p>
<ol>
<li><p><strong>OpenRouter Latency Observations:</strong></p>
<ul>
<li><p>Minimum Latency: 2.32 seconds</p>
</li>
<li><p>Maximum Latency: 7.28 seconds</p>
</li>
<li><p>Average Latency: Approximately 4.60 seconds</p>
</li>
<li><p>Variability: Significant across different prompts and repeats, indicating inconsistent performance under certain queries.</p>
</li>
</ul>
</li>
<li><p><strong>Bedrock Latency Observations:</strong></p>
<ul>
<li><p>Minimum Latency: 2.00 seconds</p>
</li>
<li><p>Maximum Latency: 3.24 seconds</p>
</li>
<li><p>Average Latency: Approximately 3.05 seconds</p>
</li>
<li><p>Variability: Much lower than OpenRouter, indicating more consistent performance.</p>
</li>
</ul>
</li>
<li><p><strong>Prompt-Specific Trends:</strong></p>
<ul>
<li><p>For <strong>Kubernetes explanation prompts</strong>, OpenRouter latency increased up to 7.28 seconds in the fourth repeat, while Bedrock remained under 3 seconds.</p>
</li>
<li><p>For <strong>code generation prompts</strong>, both APIs performed similarly in early repeats, but Bedrock consistently had faster responses.</p>
</li>
<li><p>For <strong>book summarization</strong>, Bedrock was faster and more stable, with lower standard deviation.</p>
</li>
</ul>
</li>
</ol>
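The repeat-level spread behind these observations can be condensed into a per-prompt summary with a small aggregation. In the real workflow the DataFrame comes from the CSV produced earlier; the inline rows below are made-up examples, not the measured data:

```python
# Illustrative per-prompt latency summary. In practice df would come from
# pd.read_csv("llm_comparison.csv"); these rows are invented examples.
import pandas as pd

df = pd.DataFrame({
    "prompt": ["k8s", "k8s", "code", "code"],
    "openrouter_latency": [2.32, 7.28, 3.10, 4.05],
    "bedrock_latency": [2.00, 2.90, 3.24, 2.50],
})

# mean captures speed, std captures stability, max captures worst case
summary = (df.groupby("prompt")[["openrouter_latency", "bedrock_latency"]]
             .agg(["mean", "std", "max"]))
```

Comparing the std columns side by side is the quickest way to see which API is merely fast and which is fast and predictable.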
<h2 id="heading-takeaways"><strong>Takeaways</strong></h2>
<ol>
<li><p><strong>Consistency Matters:</strong> Bedrock is more predictable, making it preferable for real-time applications.</p>
</li>
<li><p><strong>Measure Repeats:</strong> Single API calls can be misleading; repeated measurements reveal stability.</p>
</li>
<li><p><strong>Latency vs. Prompt Complexity:</strong> Certain prompts can trigger spikes in OpenRouter latency, which developers should consider for production workloads.</p>
</li>
<li><p><strong>Data-Driven Decision Making:</strong> Structured data collection enables informed API selection.</p>
</li>
</ol>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>This experiment shows that <strong>Bedrock provides lower and more consistent latency</strong> across prompts and repeated queries compared to OpenRouter. Collecting and visualizing latency not only reveals performance differences but also helps developers make informed choices about which API to integrate for production systems.</p>
<p>By sharing both the <strong>data collection and visualization workflow</strong>, I hope to provide a <strong>practical template</strong> for evaluating LLM APIs for real-world projects.</p>
]]></content:encoded></item><item><title><![CDATA[How I Migrated an AKS Cluster Across Regions Using Velero]]></title><description><![CDATA[Migrating an entire Kubernetes cluster is one of those tasks that sounds straightforward until you actually begin. When I recently needed to migrate an Azure Kubernetes Service (AKS) cluster from the Central India region to the East US region, the pr...]]></description><link>https://blog.nyzex.in/how-i-migrated-an-aks-cluster-across-regions-using-velero</link><guid isPermaLink="true">https://blog.nyzex.in/how-i-migrated-an-aks-cluster-across-regions-using-velero</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Azure]]></category><category><![CDATA[velero]]></category><category><![CDATA[Backup]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Tue, 25 Nov 2025 18:21:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1764094855291/6fddacd1-ac5a-417a-9f03-19174bcec506.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Migrating an entire Kubernetes cluster is one of those tasks that sounds straightforward until you actually begin. When I recently needed to migrate an Azure Kubernetes Service (AKS) cluster from the Central India region to the East US region, the process involved more considerations than simply exporting YAML files and applying them elsewhere.</p>
<p>The requirement was clear. Migrate every workload, every Persistent Volume Claim, every Secret, every ConfigMap, every Custom Resource Definition, and all namespace-level objects. Everything had to move with accuracy.</p>
<p>After evaluating different approaches, Velero turned out to be the most reliable and practical tool for a complete AKS region migration. Velero supports backup and restoration of cluster state as well as persistent storage, and its Azure plugin works smoothly with both Azure Blob Storage and Azure Disk snapshots.</p>
<p>This guide describes the exact steps I followed, the challenges I encountered, and the final workflow that resulted in a successful migration.</p>
<hr />
<h2 id="heading-1-why-velero-is-ideal-for-aks-region-migration"><strong>1. Why Velero Is Ideal for AKS Region Migration</strong></h2>
<p>Velero has several advantages for full-cluster migration:</p>
<h3 id="heading-backup-and-restore-of-namespaced-and-cluster-level-resources"><strong>Backup and restore of namespaced and cluster-level resources</strong></h3>
<p>This includes Deployments, StatefulSets, Services, Secrets, ConfigMaps, RBAC objects, and Custom Resources.</p>
<h3 id="heading-support-for-disk-snapshots"><strong>Support for disk snapshots</strong></h3>
<p>This is essential when migrating workloads that depend on Persistent Volume Claims.</p>
<h3 id="heading-version-compatibility-with-aks"><strong>Version compatibility with AKS</strong></h3>
<p>Velero supports most Kubernetes API versions used in AKS.</p>
<h3 id="heading-easy-restoration-into-a-completely-new-cluster"><strong>Easy restoration into a completely new cluster</strong></h3>
<p>This is useful when the source and destination are in different regions.</p>
<hr />
<h2 id="heading-2-preparing-azure-for-velero"><strong>2. Preparing Azure for Velero</strong></h2>
<p>Velero requires an Azure Blob Storage account and a resource group for snapshot management.</p>
<h3 id="heading-step-1-create-a-storage-account"><strong>Step 1: Create a Storage Account</strong></h3>
<p>Choose a region for the storage account. You can use either the source or destination region because Velero backups are region independent.</p>
<pre><code class="lang-plaintext">az storage account create \
    --name veleroaccount123 \
    --resource-group velero-rg \
    --location eastus \
    --sku Standard_GRS
</code></pre>
<h3 id="heading-step-2-create-a-blob-container"><strong>Step 2: Create a Blob Container</strong></h3>
<pre><code class="lang-plaintext">az storage container create \
    --name velero-backups \
    --account-name veleroaccount123
</code></pre>
<h3 id="heading-step-3-create-a-service-principal-or-use-a-managed-identity"><strong>Step 3: Create a Service Principal or Use a Managed Identity</strong></h3>
<p>In my case, I used a managed identity created for Velero with the required permissions:</p>
<ul>
<li><p>Contributor role on the Resource Group</p>
</li>
<li><p>Storage Blob Data Contributor on the Storage Account</p>
</li>
</ul>
<p>This avoids the need for storing client secrets.</p>
<hr />
<h2 id="heading-3-installing-velero-on-the-source-aks-cluster"><strong>3. Installing Velero on the Source AKS Cluster</strong></h2>
<p>Once the storage and identity were ready, I installed Velero on the source cluster.</p>
<h3 id="heading-step-1-install-velero-cli"><strong>Step 1: Install Velero CLI</strong></h3>
<pre><code class="lang-plaintext">brew install velero
</code></pre>
<p>(Or download from the official GitHub releases page if not using macOS.)</p>
<h3 id="heading-step-2-install-velero-on-the-cluster"><strong>Step 2: Install Velero on the Cluster</strong></h3>
<pre><code class="lang-plaintext">velero install \
    --provider azure \
    --plugins velero/velero-plugin-for-microsoft-azure:v1.10.0 \
    --bucket velero-backups \
    --secret-file ./credentials-velero \
    --backup-location-config resourceGroup=velero-rg,storageAccount=veleroaccount123 \
    --use-volume-snapshots=true \
    --snapshot-location-config resourceGroup=velero-rg
</code></pre>
<p>Now Velero runs inside the cluster with all necessary permissions.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095088526/7284a72c-a334-479d-9b00-a3b0a27e9587.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-4-creating-backups"><strong>4. Creating Backups</strong></h2>
<p>I wanted a complete backup of every namespace, every CRD, and all persistent volumes.</p>
<h3 id="heading-step-1-confirm-velero-is-healthy"><strong>Step 1: Confirm Velero Is Healthy</strong></h3>
<pre><code class="lang-plaintext">velero version
kubectl get pods -n velero
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764095124616/bbf5426f-dc11-47ad-abe9-faab600ae076.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-step-2-run-a-full-cluster-backup"><strong>Step 2: Run a Full Cluster Backup</strong></h3>
<pre><code class="lang-plaintext">velero backup create aks-full-backup \
    --include-namespaces '*' \
    --wait
</code></pre>
<p>The backup took several minutes because the cluster had StatefulSets and large PVCs.</p>
<h3 id="heading-step-3-verify-the-backup"><strong>Step 3: Verify the Backup</strong></h3>
<pre><code class="lang-plaintext">velero backup describe aks-full-backup
velero backup logs aks-full-backup
</code></pre>
<p>At this stage I had a complete backup stored in Azure Blob Storage and snapshots created for all PVCs.</p>
<hr />
<h2 id="heading-5-creating-the-destination-aks-cluster"><strong>5. Creating the Destination AKS Cluster</strong></h2>
<p>I created a new cluster in East US with the same Kubernetes version as the source cluster. Matching the Kubernetes version is important because restoring cluster-level objects may otherwise cause compatibility issues.</p>
<pre><code class="lang-plaintext">az aks create \
  --resource-group uspreprod-rg \
  --name uspreprod-aks \
  --location eastus \
  --node-count 3 \
  --kubernetes-version 1.29
</code></pre>
<p>After the cluster was ready, I connected to it:</p>
<pre><code class="lang-plaintext">az aks get-credentials \
    --resource-group uspreprod-rg \
    --name uspreprod-aks \
    --overwrite-existing
</code></pre>
<hr />
<h2 id="heading-6-installing-velero-on-the-destination-cluster"><strong>6. Installing Velero on the Destination Cluster</strong></h2>
<p>The installation process is similar to the source cluster:</p>
<pre><code class="lang-plaintext">velero install \
    --provider azure \
    --plugins velero/velero-plugin-for-microsoft-azure:v1.10.0 \
    --bucket velero-backups \
    --secret-file ./credentials-velero \
    --backup-location-config resourceGroup=velero-rg,storageAccount=veleroaccount123 \
    --use-volume-snapshots=true \
    --snapshot-location-config resourceGroup=velero-rg
</code></pre>
<p>Velero now has access to the same storage account and snapshots that were created from the source cluster.</p>
<hr />
<h2 id="heading-7-restoring-the-full-backup"><strong>7. Restoring the Full Backup</strong></h2>
<h3 id="heading-step-1-trigger-the-restore"><strong>Step 1: Trigger the Restore</strong></h3>
<pre><code class="lang-plaintext">velero restore create aks-full-restore \
    --from-backup aks-full-backup \
    --wait
</code></pre>
<p>Velero recreated:</p>
<ul>
<li><p>Namespaces</p>
</li>
<li><p>Deployments</p>
</li>
<li><p>StatefulSets</p>
</li>
<li><p>Services</p>
</li>
<li><p>Ingress objects</p>
</li>
<li><p>Secrets and ConfigMaps</p>
</li>
<li><p>CRDs and CRs</p>
</li>
<li><p>Everything linked to snapshots</p>
</li>
</ul>
<h3 id="heading-step-2-verify-restore-objects"><strong>Step 2: Verify Restore Objects</strong></h3>
<pre><code class="lang-plaintext">velero restore describe aks-full-restore
velero restore logs aks-full-restore
</code></pre>
<h3 id="heading-step-3-validate-cluster-state"><strong>Step 3: Validate Cluster State</strong></h3>
<p>I validated that workloads came up correctly:</p>
<pre><code class="lang-plaintext">kubectl get pods --all-namespaces
kubectl get pvc --all-namespaces
kubectl get ingress --all-namespaces
</code></pre>
<p>Any workload that depended on Persistent Volumes was able to recover because the Azure snapshots were restored successfully.</p>
<hr />
<h2 id="heading-8-issues-encountered-and-fixes"><strong>8. Issues Encountered and Fixes</strong></h2>
<h3 id="heading-missing-crds-before-restore"><strong>Missing CRDs Before Restore</strong></h3>
<p>Some CRDs must exist before restoring their corresponding objects.<br />Solution: Install CRDs manually or let Velero restore cluster-level CRDs first.</p>
<h3 id="heading-snapshot-restore-delay"><strong>Snapshot Restore Delay</strong></h3>
<p>Azure snapshots sometimes take time to rehydrate into new disks.<br />Solution: Wait a few minutes and reapply StatefulSets if needed.</p>
<h3 id="heading-identity-permission-issues"><strong>Identity Permission Issues</strong></h3>
<p>The managed identity must have Contributor access on both resource groups.<br />Without this, PVC restore will fail.</p>
<h3 id="heading-ingress-controller-differences"><strong>Ingress Controller Differences</strong></h3>
<p>The new cluster may create a different external IP for the ingress controller.<br />Update DNS records accordingly.</p>
<hr />
<h2 id="heading-9-final-thoughts"><strong>9. Final Thoughts</strong></h2>
<p>Migrating an AKS cluster across regions can feel overwhelming due to the number of moving parts involved. Velero simplifies the process significantly by offering a predictable and reliable way to back up and restore clusters at scale.</p>
<p>In my case, Velero successfully migrated every namespace, every workload, and every Persistent Volume from the Central India AKS cluster to a completely new cluster in East US. The process was clean, repeatable, and did not require manual recreation of YAML files.</p>
<p>If you are planning a similar migration, I strongly recommend preparing the destination cluster with the same Kubernetes version, ensuring proper identity permissions, and validating your snapshot restores.</p>
<p>Velero is a powerful tool, and with the right configuration, it can handle migrations across regions with very little manual effort.</p>
]]></content:encoded></item><item><title><![CDATA[Deploying Harbor on Kubernetes: A Step-by-Step Guide]]></title><description><![CDATA[Harbor is a cloud-native registry that allows storing, signing, and scanning container images for vulnerabilities. Deploying Harbor on Kubernetes provides a scalable, highly available registry with integrated security features. In this guide, I will ...]]></description><link>https://blog.nyzex.in/deploying-harbor-on-kubernetes-a-step-by-step-guide</link><guid isPermaLink="true">https://blog.nyzex.in/deploying-harbor-on-kubernetes-a-step-by-step-guide</guid><category><![CDATA[Docker]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Security]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Thu, 20 Nov 2025 19:23:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763666565226/ff14e923-2ba1-458f-8acf-f1535989feee.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Harbor is a cloud-native registry that allows storing, signing, and scanning container images for vulnerabilities. Deploying Harbor on Kubernetes provides a scalable, highly available registry with integrated security features. In this guide, I will walk you through a full setup on a Kubernetes cluster using Helm, from installation to pushing your first Docker image.</p>
<hr />
<h2 id="heading-preparing-for-installation">Preparing for Installation</h2>
<p>Before starting, ensure that your Kubernetes cluster is ready and Helm is installed. You should also have an ingress controller configured if you plan to expose Harbor externally.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763665014974/e52930ba-87be-4f7e-92af-8bde5b2c22ef.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-adding-the-harbor-helm-repository">Adding the Harbor Helm Repository</h2>
<p>Helm simplifies deploying Harbor on Kubernetes. Begin by adding the Harbor Helm repository and fetching the chart:</p>
<pre><code class="lang-plaintext">helm repo add harbor https://helm.goharbor.io
helm repo update
helm fetch harbor/harbor --untar
</code></pre>
<p>Harbor provides detailed documentation for a high-availability deployment:</p>
<p><a target="_blank" href="https://goharbor.io/docs/2.14.0/install-config/harbor-ha-helm/">Harbor HA Helm Installation Guide</a>.</p>
<hr />
<h2 id="heading-configuring-values-for-your-deployment">Configuring Values for Your Deployment</h2>
<p>Harbor is highly configurable through a values file. I created a <code>values.yaml</code> to suit my environment. Key configurations included:</p>
<ul>
<li><p>Exposing Harbor via ingress with TLS enabled using the <code>pangolin</code> certificate.</p>
</li>
<li><p>Enabling persistence for Registry, Jobservice, and Trivy to ensure images are not lost during pod restarts.</p>
</li>
<li><p>Configuring internal components like Redis and PostgreSQL.</p>
</li>
</ul>
<p><strong>Values file excerpts:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">expose:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">ingress</span>
  <span class="hljs-attr">tls:</span>
    <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">certSource:</span> <span class="hljs-string">pangolin</span>
  <span class="hljs-attr">ingress:</span>
    <span class="hljs-attr">hosts:</span>
      <span class="hljs-attr">core:</span> <span class="hljs-string">harbor.nyzex.in</span>
    <span class="hljs-attr">className:</span> <span class="hljs-string">"nginx"</span>

<span class="hljs-attr">externalURL:</span> <span class="hljs-string">https://harbor.nyzex.in</span>

<span class="hljs-attr">persistence:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">resourcePolicy:</span> <span class="hljs-string">keep</span>
  <span class="hljs-attr">persistentVolumeClaim:</span>
    <span class="hljs-attr">registry:</span>
      <span class="hljs-attr">size:</span> <span class="hljs-string">10Gi</span>
    <span class="hljs-attr">jobservice:</span>
      <span class="hljs-attr">jobLog:</span>
        <span class="hljs-attr">size:</span> <span class="hljs-string">1Gi</span>
    <span class="hljs-attr">trivy:</span>
      <span class="hljs-attr">size:</span> <span class="hljs-string">5Gi</span>

<span class="hljs-attr">redis:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">password:</span> <span class="hljs-string">"redisStrongPassword"</span>

<span class="hljs-attr">harborAdminPassword:</span> <span class="hljs-string">"Harbor12345"</span>
</code></pre>
<p>Note that Harbor itself is not terminating TLS with its own certificates here; TLS is handled for us by Pangolin (hence <code>certSource: pangolin</code>). To understand that setup, please check my other blog:</p>
<p><a target="_blank" href="https://blog.nyzex.in/exposing-kubernetes-services-over-the-internet-using-metallb-nginx-ingress-and-pangolin">Exposing Kubernetes Services Over the Internet Using MetalLB, NGINX Ingress, and Pangolin</a></p>
<hr />
<h2 id="heading-installing-harbor-via-helm">Installing Harbor via Helm</h2>
<p>Once your values file is ready, install Harbor in the <code>harbor</code> namespace:</p>
<pre><code class="lang-bash">kubectl create namespace harbor
helm install my-release ./harbor -n harbor -f values.yaml
</code></pre>
<p>Check the status of the pods and services:</p>
<pre><code class="lang-bash">kubectl get pods -n harbor
kubectl get svc -n harbor
kubectl get ingress -n harbor
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763665573997/ce0db74d-d4c3-4a48-9926-9cd98b9eb8a0.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-retrieving-the-admin-password">Retrieving the Admin Password</h2>
<p>The admin password is stored in the Harbor core secret, whether you set <code>harborAdminPassword</code> in your values file or left the chart default. To retrieve it:</p>
<pre><code class="lang-bash">kubectl get secret my-release-harbor-core -n harbor -o jsonpath="{.data.HARBOR_ADMIN_PASSWORD}" | base64 --decode
</code></pre>
<p>This password allows you to log in to the Harbor UI at <code>https://harbor.nyzex.in</code>.</p>
<hr />
<h2 id="heading-fixing-the-unauthorized-issue">Fixing the Unauthorized Issue</h2>
<p>After logging in, I encountered an unauthorized error while pushing images. The crucial fix was setting:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">registry:</span>
  <span class="hljs-attr">relativeurls:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>This makes the registry return relative URLs in its <code>Location</code> headers, so clients reaching Harbor through the ingress are not redirected to internal cluster addresses.</p>
<hr />
<h2 id="heading-docker-workflow">Docker Workflow</h2>
<p>After logging in, I tested Harbor by building and pushing a sample Docker image.</p>
<p><strong>Sample Dockerfile:</strong></p>
<pre><code class="lang-dockerfile">FROM busybox:latest
</code></pre>
<p><strong>Build and tag the image:</strong></p>
<pre><code class="lang-bash">docker build -t harbor.nyzex.in/myproj/test-image:latest .
</code></pre>
<p><strong>Login to Harbor:</strong></p>
<pre><code class="lang-bash">docker login harbor.nyzex.in
</code></pre>
<p><strong>Push the image:</strong></p>
<pre><code class="lang-bash">docker push harbor.nyzex.in/myproj/test-image:latest
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763665337086/709bae7d-48b4-4a05-bd20-06a60a40287d.png" alt class="image--center mx-auto" /></p>
<p>The image uploaded successfully, and because Trivy is enabled, Harbor automatically started scanning it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763665725766/4d468cb1-f127-4482-a4ca-bc9064ca45c8.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-understanding-harbor-components-and-persistence">Understanding Harbor Components and Persistence</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763666358092/77ea5e16-359f-4038-bb17-c5886f3c39db.png" alt class="image--center mx-auto" /></p>
<p>Harbor consists of several core components, each with its role:</p>
<ul>
<li><p><strong>Registry:</strong> Stores Docker images. Requires persistent storage to avoid losing images.</p>
</li>
<li><p><strong>Core:</strong> Provides the UI, API, and handles authentication.</p>
</li>
<li><p><strong>Jobservice:</strong> Executes background jobs, such as image replication or garbage collection.</p>
</li>
<li><p><strong>Redis:</strong> Caching and session storage.</p>
</li>
<li><p><strong>PostgreSQL:</strong> Stores metadata, configurations, and user information.</p>
</li>
<li><p><strong>Trivy (optional):</strong> Performs vulnerability scans on images.</p>
</li>
</ul>
<p>Persistent volumes ensure that each component retains data even if pods restart or are rescheduled.</p>
<pre><code class="lang-bash">kubectl get pvc -n harbor
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763665373771/a46b088e-1a12-4613-b148-4e1715ba7ee8.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-harbor-registry-my-release-harbor-registry"><strong>Harbor Registry (</strong><code>my-release-harbor-registry</code>)</h3>
<ul>
<li><p><strong>Role:</strong> This is the <strong>core Docker registry</strong>. All images you push or pull are stored here.</p>
</li>
<li><p><strong>Storage:</strong> <code>/storage</code> inside the pod (mapped to PVC <code>my-release-harbor-registry</code>).</p>
</li>
<li><p><strong>Growth:</strong> Every <code>docker push</code> adds data to this PVC.</p>
</li>
<li><p><strong>Key takeaway:</strong> This is the main storage to monitor and expand if needed.</p>
</li>
</ul>
<h3 id="heading-redis-my-release-harbor-redis"><strong>Redis (</strong><code>my-release-harbor-redis</code>)</h3>
<ul>
<li><p><strong>Role:</strong> Redis is used as <strong>cache and message broker</strong> for Harbor. It speeds up operations like:</p>
<ul>
<li><p>Session management</p>
</li>
<li><p>Job queue for replication</p>
</li>
<li><p>Temporary caching of metadata</p>
</li>
</ul>
</li>
<li><p><strong>Storage:</strong> Minimal, mostly in-memory. PVC (<code>data-my-release-harbor-redis-0</code>) is 1Gi because Redis mostly uses RAM.</p>
</li>
<li><p><strong>Growth:</strong> Usually does <strong>not grow with image pushes</strong>; only used for transient data.</p>
</li>
</ul>
<h3 id="heading-trivy-my-release-harbor-trivy"><strong>Trivy (</strong><code>my-release-harbor-trivy</code>)</h3>
<ul>
<li><p><strong>Role:</strong> Trivy is Harbor’s <strong>vulnerability scanner</strong>. It scans container images for CVEs.</p>
</li>
<li><p><strong>Storage:</strong> Stores vulnerability DB and scan cache in its PVC (<code>data-my-release-harbor-trivy-0</code>).</p>
</li>
<li><p><strong>Growth:</strong> Increases if you scan many images, because it caches scan results and the CVE database (~5Gi here).</p>
</li>
<li><p><strong>Key takeaway:</strong> You don’t need to increase this PVC unless you do tons of scans.</p>
</li>
</ul>
<h3 id="heading-database-my-release-harbor-database"><strong>Database (</strong><code>my-release-harbor-database</code>)</h3>
<ul>
<li><p><strong>Role:</strong> Harbor’s <strong>PostgreSQL database</strong>. Stores:</p>
<ul>
<li><p>Users, projects, and roles</p>
</li>
<li><p>Repository metadata</p>
</li>
<li><p>Scan results</p>
</li>
<li><p>Jobs, quotas, and configurations</p>
</li>
</ul>
</li>
<li><p><strong>Storage:</strong> Your PVC is small (1Gi). Actual usage is tiny at first.</p>
</li>
<li><p><strong>Growth:</strong> Will grow slowly with metadata; pushing images does <strong>not increase it significantly</strong>.</p>
</li>
</ul>
<h3 id="heading-jobservice-my-release-harbor-jobservice"><strong>Jobservice (</strong><code>my-release-harbor-jobservice</code>)</h3>
<ul>
<li><p><strong>Role:</strong> Manages <strong>asynchronous jobs</strong> for Harbor:</p>
<ul>
<li><p>Image replication</p>
</li>
<li><p>Garbage collection</p>
</li>
<li><p>Scan jobs (Trivy)</p>
</li>
<li><p>Retention policies</p>
</li>
</ul>
</li>
<li><p><strong>Storage:</strong> Uses its PVC (<code>my-release-harbor-jobservice</code>) minimally for job queues and logs (~1Gi).</p>
</li>
<li><p><strong>Growth:</strong> Typically small; no impact on image storage.</p>
</li>
</ul>
<p><strong>Summary:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Component</td><td>PVC used</td><td>Grows with image push?</td><td>Notes</td></tr>
</thead>
<tbody>
<tr>
<td>Registry</td><td><code>my-release-harbor-registry</code></td><td>Yes</td><td>This is your main image storage</td></tr>
<tr>
<td>Redis</td><td><code>data-my-release-harbor-redis-0</code></td><td>No</td><td>Only cache, ephemeral data</td></tr>
<tr>
<td>Trivy</td><td><code>data-my-release-harbor-trivy-0</code></td><td>Slightly</td><td>Stores scan DB &amp; results</td></tr>
<tr>
<td>Database</td><td><code>database-data-my-release-harbor-database-0</code></td><td>Minor</td><td>Stores metadata</td></tr>
<tr>
<td>Jobservice</td><td><code>my-release-harbor-jobservice</code></td><td>Minor</td><td>Handles background jobs</td></tr>
</tbody>
</table>
</div><h3 id="heading-quick-check-of-storage">Quick check of <code>/storage</code></h3>
<p>Now that we know <code>/storage</code> is the directory to monitor, we can keep a check on its usage with:</p>
<pre><code class="lang-bash">kubectl exec -n harbor -it deployment/my-release-harbor-registry -- sh -c "du -sh /storage"
</code></pre>
<p>This gives total usage.</p>
<h3 id="heading-check-usage-per-subdirectory-blobs-and-repositories">Check usage per subdirectory (<code>blobs</code> and <code>repositories</code>)</h3>
<pre><code class="lang-bash">kubectl exec -n harbor -it deployment/my-release-harbor-registry -- sh -c "du -sh /storage/*"
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763665485335/0463af93-e97a-4879-885a-d65fce95622f.png" alt class="image--center mx-auto" /></p>
<hr />
<h2 id="heading-lessons-learned">Lessons Learned</h2>
<ul>
<li><p>The <code>registry.relativeurls</code> fix is crucial to avoid push failures.</p>
</li>
<li><p>Persistence ensures images, logs, and scan data are retained safely.</p>
</li>
<li><p>Properly configuring ingress and TLS certificates is essential for secure access.</p>
</li>
<li><p>Redis and PostgreSQL are critical for Harbor functionality and must be monitored.</p>
</li>
<li><p>Docker login, build, and push are straightforward once Harbor is correctly configured.</p>
</li>
</ul>
<p>Deploying Harbor with Helm on Kubernetes is straightforward if you follow these steps carefully. With persistence, security, and a proper workflow, Harbor becomes a reliable, enterprise-ready registry for your container images.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding Kubernetes Networking, Load Balancers, Subnets, and MetalLB]]></title><description><![CDATA[Kubernetes networking often feels abstract when you first begin working with clusters, Services, and Ingress controllers. Many engineers are able to “make things work” without fully understanding how packets travel across machines, how load balancers...]]></description><link>https://blog.nyzex.in/understanding-kubernetes-networking-load-balancers-subnets-and-metallb</link><guid isPermaLink="true">https://blog.nyzex.in/understanding-kubernetes-networking-load-balancers-subnets-and-metallb</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Load Balancing]]></category><category><![CDATA[metallb]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[architecture]]></category><category><![CDATA[Security]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Tue, 18 Nov 2025 19:51:38 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763495477632/85e919e9-0092-4a5c-8faf-529482d714e7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kubernetes networking often feels abstract when you first begin working with clusters, Services, and Ingress controllers. Many engineers are able to “make things work” without fully understanding how packets travel across machines, how load balancers actually assign IP addresses, or what role MetalLB plays in bare-metal deployments.</p>
<p>This guide removes the mystery. It begins with foundational networking concepts and gradually builds toward a complete, unified understanding of Kubernetes networking, external access, load balancers, ingress controllers, and MetalLB.</p>
<p>If you want a single resource that connects all of these concepts clearly, this article is designed to be your go-to reference.</p>
<hr />
<h2 id="heading-the-fundamentals-what-is-a-network"><strong>The Fundamentals: What Is a Network?</strong></h2>
<p>A network is a group of connected machines that communicate by sending packets to each other. Every machine on the network must have a unique address so that packets know where to go.</p>
<p>That unique address is an <strong>IP address</strong>.</p>
<hr />
<h2 id="heading-what-is-an-ip-address"><strong>What Is an IP Address?</strong></h2>
<p>An IP address is a numerical identifier assigned to every device on a network. The most common format is IPv4, which looks like this:</p>
<pre><code class="lang-yaml"><span class="hljs-number">192.168</span><span class="hljs-number">.1</span><span class="hljs-number">.50</span>
</code></pre>
<p>Each IP address has two parts:</p>
<ol>
<li><p><strong>Network portion</strong></p>
</li>
<li><p><strong>Host portion</strong></p>
</li>
</ol>
<p>The network portion identifies which network the device belongs to.<br />The host portion identifies which specific device it is.</p>
<p>How do we know where one portion ends and the other begins?</p>
<p>That is decided by the <strong>subnet mask</strong>.</p>
<hr />
<h2 id="heading-what-is-a-subnet"><strong>What Is a Subnet?</strong></h2>
<p>A <strong>subnet</strong> (sub-network) divides a large network into smaller logical networks.</p>
<p>Example:</p>
<pre><code class="lang-yaml"><span class="hljs-number">192.168</span><span class="hljs-number">.1</span><span class="hljs-number">.0</span><span class="hljs-string">/24</span>
</code></pre>
<p>Here:</p>
<ul>
<li><p><code>/24</code> means the first 24 bits are the network portion.</p>
</li>
<li><p>The network contains:</p>
<ul>
<li><p>192.168.1.0 (network address)</p>
</li>
<li><p>192.168.1.1 to 192.168.1.254 (usable host IPs)</p>
</li>
<li><p>192.168.1.255 (broadcast address)</p>
</li>
</ul>
</li>
</ul>
<p>This gives 254 usable IPs. So:</p>
<ul>
<li><p>256 total addresses (0–255)</p>
</li>
<li><p>254 usable host addresses (1–254)</p>
</li>
<li><p>1 network address (.0)</p>
</li>
<li><p>1 broadcast address (.255)</p>
</li>
</ul>
<p>A subnet tells devices where to send packets.<br />If an IP is inside the same subnet, communication is local.<br />If the IP is outside the subnet, traffic goes through a gateway.</p>
<p>Subnets define the boundaries within which Kubernetes nodes, pods, and services receive IP addresses.</p>
<p>Your home router typically gives IPs from this subnet to devices using <strong>DHCP</strong>.</p>
<hr />
<h2 id="heading-how-are-ips-allocated-in-a-network"><strong>How Are IPs Allocated in a Network?</strong></h2>
<p>There are two ways:</p>
<h3 id="heading-static-allocation"><strong>Static allocation</strong></h3>
<p>The IP is manually configured.<br />Example:<br />Your server is set to always use <code>192.168.1.200</code>.</p>
<h3 id="heading-dynamic-allocation-dhcp"><strong>Dynamic allocation (DHCP)</strong></h3>
<p>Your router assigns IPs automatically.</p>
<p>Most home networks use DHCP for laptops, phones, TV, etc.<br />Servers and Kubernetes nodes often use static IPs.</p>
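<p>As a sketch of static allocation on a Linux server, a netplan configuration might look like this (the interface name, addresses, and DNS server below are illustrative placeholders; adjust them to your network):</p>
<pre><code class="lang-yaml"># /etc/netplan/01-static.yaml (illustrative sketch)
network:
  version: 2
  ethernets:
    eth0:                          # your interface name may differ (e.g. enp3s0)
      dhcp4: false                 # opt out of DHCP
      addresses: [192.168.1.200/24]
      routes:
        - to: default
          via: 192.168.1.1         # your router / gateway
      nameservers:
        addresses: [1.1.1.1]
</code></pre>
<p>Apply it with <code>netplan apply</code>; pinning node IPs this way keeps cluster membership stable across reboots.</p>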
<hr />
<h2 id="heading-how-kubernetes-uses-ip-addresses"><strong>How Kubernetes Uses IP Addresses</strong></h2>
<p>Kubernetes uses <strong>three layers</strong> of IP assignment:</p>
<h3 id="heading-1-node-ips"><strong>1. Node IPs</strong></h3>
<p>These are normal IPs assigned by your network (your router or your cloud).<br />Examples:</p>
<pre><code class="lang-yaml"><span class="hljs-number">192.168</span><span class="hljs-number">.1</span><span class="hljs-number">.10</span>
<span class="hljs-number">192.168</span><span class="hljs-number">.1</span><span class="hljs-number">.11</span>
</code></pre>
<h3 id="heading-2-pod-ips"><strong>2. Pod IPs</strong></h3>
<p>Assigned by Kubernetes CNI (Container Network Interface).<br />Pods must be reachable from any node.<br />They belong to the cluster's internal network, such as:</p>
<pre><code class="lang-yaml"><span class="hljs-number">10.244</span><span class="hljs-number">.0</span><span class="hljs-number">.15</span>
<span class="hljs-number">10.244</span><span class="hljs-number">.1</span><span class="hljs-number">.9</span>
</code></pre>
<h3 id="heading-3-service-ips"><strong>3. Service IPs</strong></h3>
<p>These are stable virtual IPs created by Kubernetes for Services.<br />Examples:</p>
<pre><code class="lang-yaml"><span class="hljs-number">10.98</span><span class="hljs-number">.50</span><span class="hljs-number">.1</span>
<span class="hljs-number">10.109</span><span class="hljs-number">.22</span><span class="hljs-number">.18</span>
</code></pre>
<p>Pod IPs change.<br />Service IPs never change.</p>
<p>Services act as <strong>stable front doors</strong> that point to Pods.</p>
<hr />
<h2 id="heading-what-is-a-kubernetes-service"><strong>What Is a Kubernetes Service?</strong></h2>
<p>A Service groups Pods and exposes them in predictable ways.</p>
<h3 id="heading-types-of-services"><strong>Types of Services</strong></h3>
<ol>
<li><p><strong>ClusterIP</strong> (internal only)</p>
</li>
<li><p><strong>NodePort</strong> (opens ports on each node)</p>
</li>
<li><p><strong>LoadBalancer</strong> (gets a real external IP)</p>
</li>
<li><p><strong>Headless Service</strong> (no virtual IP, direct Pod DNS)</p>
</li>
</ol>
<p>For exposing applications outside the cluster, NodePort and LoadBalancer are relevant.</p>
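<p>As a minimal sketch, a LoadBalancer Service manifest looks like this (the names, labels, and ports are illustrative):</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Service
metadata:
  name: app1               # illustrative name
spec:
  type: LoadBalancer       # ask the cloud provider (or MetalLB) for an external IP
  selector:
    app: app1              # forwards to Pods carrying this label
  ports:
    - port: 80             # port exposed on the external IP
      targetPort: 8080     # port the Pods listen on
</code></pre>
<p>On a cloud cluster this provisions a cloud load balancer; on bare metal the external IP stays <code>Pending</code> until something like MetalLB assigns it.</p>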
<hr />
<h2 id="heading-what-is-a-load-balancer"><strong>What Is a Load Balancer?</strong></h2>
<p>A load balancer accepts traffic on one IP address and distributes it to backend servers (Pods).</p>
<p>There are two major kinds:</p>
<h3 id="heading-a-cloud-load-balancers-aws-azure-gcp"><strong>A. Cloud Load Balancers (AWS, Azure, GCP)</strong></h3>
<p>When you create:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
</code></pre>
<p>The cloud provider:</p>
<ul>
<li><p>Allocates a public IP</p>
</li>
<li><p>Creates a load balancer appliance</p>
</li>
<li><p>Forwards traffic to your Service</p>
</li>
</ul>
<p>Example:</p>
<pre><code class="lang-yaml"><span class="hljs-number">52.14</span><span class="hljs-number">.222</span><span class="hljs-number">.8</span> <span class="hljs-string">→</span> <span class="hljs-string">Kubernetes</span> <span class="hljs-string">Service</span> <span class="hljs-string">→</span> <span class="hljs-string">Pods</span>
</code></pre>
<h3 id="heading-b-bare-metal-load-balancers-metallb"><strong>B. Bare-Metal Load Balancers (MetalLB)</strong></h3>
<p>On bare-metal or home labs you do not have a cloud provider.<br />Kubernetes cannot create an external LoadBalancer by itself.</p>
<p>This is where <strong>MetalLB</strong> comes in.</p>
<hr />
<h2 id="heading-what-is-metallb"><strong>What Is MetalLB?</strong></h2>
<p>MetalLB implements LoadBalancer behavior for bare-metal clusters.</p>
<p>When you create:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
</code></pre>
<p>MetalLB:</p>
<ul>
<li><p>Picks an IP from a configured pool</p>
</li>
<li><p>Assigns it to the Service</p>
</li>
<li><p>Announces the IP on your local network using ARP or BGP</p>
</li>
</ul>
<p>Your router and devices now believe that your cluster owns that IP.</p>
<hr />
<h2 id="heading-how-metallb-assigns-ips"><strong>How Does MetalLB Assign IPs?</strong></h2>
<p>You create an IPAddressPool, for example:</p>
<pre><code class="lang-yaml"><span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.200</span> <span class="hljs-bullet">-</span> <span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.220</span>
</code></pre>
<p>This is a range of <strong>21 IP addresses</strong>.</p>
<p>MetalLB can assign at most <strong>21 LoadBalancer Services</strong> at the same time.</p>
<p>This has nothing to do with the number of nodes.</p>
<p>Nodes = compute<br />Pool = number of external Services</p>
<p>You could have:</p>
<ul>
<li><p>1 node</p>
</li>
<li><p>100 nodes</p>
</li>
<li><p>500 nodes</p>
</li>
</ul>
<p>The number of nodes does not change your available LoadBalancer IPs.</p>
<h2 id="heading-what-metallb-actually-does"><strong>What MetalLB Actually Does</strong></h2>
<p>MetalLB is a load balancer implementation for bare-metal Kubernetes clusters.</p>
<p>MetalLB does <strong>not</strong> route traffic like a Layer 7 reverse proxy.<br />It does not do TLS termination or HTTP routing.<br />It simply:</p>
<ul>
<li><p>assigns external IPs for LoadBalancer services</p>
</li>
<li><p>makes nodes answer ARP/NDP for those IPs</p>
</li>
</ul>
<p>This allows devices on your LAN to send traffic to Kubernetes.</p>
<h3 id="heading-the-layer-2-mode"><strong>The Layer 2 Mode</strong></h3>
<p>Your MetalLB Layer 2 configuration (shown here in the legacy ConfigMap format):</p>
<pre><code class="lang-yaml"><span class="hljs-attr">address-pools:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">default</span>
  <span class="hljs-attr">protocol:</span> <span class="hljs-string">layer2</span>
  <span class="hljs-attr">addresses:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.200</span><span class="hljs-number">-192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.220</span>
</code></pre>
<p>MetalLB will:</p>
<ul>
<li><p>take one IP from that range</p>
</li>
<li><p>respond on the network as if the node owns it</p>
</li>
<li><p>direct traffic toward the appropriate service</p>
</li>
</ul>
<p>If you have 21 IPs in that range, you can have <strong>21 LoadBalancer services</strong>, regardless of how many Kubernetes nodes exist.</p>
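<p>On newer MetalLB releases (v0.13+), the same Layer 2 pool is declared with CRDs instead of a ConfigMap; a sketch of the equivalent resources (resource names are illustrative):</p>
<pre><code class="lang-yaml">apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default
  namespace: metallb-system
spec:
  addresses:
    - 192.168.29.200-192.168.29.220   # 21 assignable external IPs
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
    - default                          # announce IPs from the pool above via ARP/NDP
</code></pre>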
<hr />
<h2 id="heading-what-happens-when-a-loadbalancer-ip-is-assigned"><strong>What Happens When a LoadBalancer IP Is Assigned?</strong></h2>
<p>Example:</p>
<p>Your ingress-nginx Service gets:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">EXTERNAL-IP:</span> <span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.200</span>
</code></pre>
<p>MetalLB announces that &#8220;192.168.29.200 is located here&#8221;, either:</p>
<ul>
<li><p>via ARP from the node where the ingress-nginx Pod is running (Layer 2 mode), or</p>
</li>
<li><p>via BGP to your router (BGP mode)</p>
</li>
</ul>
<p>When a client sends traffic to:</p>
<pre><code class="lang-yaml"><span class="hljs-string">http://192.168.29.200</span>
</code></pre>
<p>It reaches the correct node, then:</p>
<ul>
<li><p>kube-proxy forwards it to the nginx Pod</p>
</li>
<li><p>nginx reads the host header</p>
</li>
<li><p>nginx forwards to the correct internal Service</p>
</li>
<li><p>Service load balances traffic to Pods</p>
</li>
<li><p>Pod responds back through the chain</p>
</li>
<li><p>The client receives the response</p>
</li>
</ul>
<h2 id="heading-why-the-number-of-ips-has-no-relation-to-the-number-of-nodes"><strong>Why the Number of IPs Has No Relation to the Number of Nodes</strong></h2>
<p>Nodes have their own IPs from your LAN or DHCP server.<br />MetalLB uses <strong>completely separate IPs</strong> from the pool.</p>
<p>For example:</p>
<ul>
<li><p>Your LAN might be <code>192.168.29.0/24</code></p>
</li>
<li><p>Your nodes may be <code>192.168.29.101</code>, <code>.102</code>, <code>.103</code></p>
</li>
<li><p>MetalLB pool might be <code>192.168.29.200-220</code></p>
</li>
</ul>
<p>These ranges do not overlap and have different purposes.</p>
<p><strong>Node IPs do not limit the number of LoadBalancer IPs.</strong><br /><strong>LoadBalancer IPs do not limit the number of nodes.</strong></p>
<p>They serve two different layers:</p>
<ul>
<li><p>Nodes = physical cluster machines</p>
</li>
<li><p>External IPs = addresses for exposing services</p>
</li>
</ul>
<p>In simpler terms,</p>
<p>Node count = how many machines run Pods<br />Pool size = how many external IPs are available for Services</p>
<p>You could have:</p>
<ul>
<li><p>100 nodes</p>
</li>
<li><p>But only 5 LoadBalancer IPs in MetalLB</p>
</li>
</ul>
<p>Then:</p>
<ul>
<li><p>At most 5 Services can have external IPs</p>
</li>
<li><p>But 100 nodes can run thousands of Pods inside the cluster</p>
</li>
</ul>
<p>They are <strong>independent resources</strong>.</p>
<hr />
<h2 id="heading-what-is-an-ingress-controller"><strong>What Is an Ingress Controller?</strong></h2>
<p>Ingress defines <em>routing rules</em>:</p>
<p>Example:</p>
<pre><code class="lang-yaml"><span class="hljs-string">app1.example.com</span> <span class="hljs-string">→</span> <span class="hljs-attr">service:</span> <span class="hljs-string">app1</span>
<span class="hljs-string">app2.example.com</span> <span class="hljs-string">→</span> <span class="hljs-attr">service:</span> <span class="hljs-string">app2</span>
<span class="hljs-string">grafana.example.com</span> <span class="hljs-string">→</span> <span class="hljs-attr">service:</span> <span class="hljs-string">grafana</span>
</code></pre>
<p>However, Ingress <strong>does not</strong> provide an IP.<br />It needs a Service to expose it.</p>
<p>This is why ingress-nginx uses:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
</code></pre>
<p>MetalLB gives this Service an IP.<br />That single IP can route <em>any number of</em> HTTP/HTTPS applications using hostnames.</p>
<p>This saves your IP pool.</p>
<p>Nginx Ingress Controller is essentially a Layer 7 HTTP/S reverse proxy inside your cluster.</p>
<p>You expose the Ingress Controller using a LoadBalancer service:</p>
<pre><code class="lang-yaml"><span class="hljs-string">ingress-nginx-controller</span>  <span class="hljs-string">→</span> <span class="hljs-string">MetalLB</span> <span class="hljs-string">assigns</span> <span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.200</span>
</code></pre>
<p>Your users hit <strong>192.168.29.200</strong>, and Nginx:</p>
<ul>
<li><p>receives HTTP/S traffic</p>
</li>
<li><p>routes it to services inside the cluster</p>
</li>
<li><p>handles hostnames, paths, TLS, rate limits, etc.</p>
</li>
</ul>
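<p>The hostname routing described above is declared with an Ingress resource; a sketch (the hostname and service names are illustrative):</p>
<pre><code class="lang-yaml">apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app1-ingress           # illustrative name
spec:
  ingressClassName: nginx      # handled by the ingress-nginx controller
  rules:
    - host: app1.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app1-service   # internal ClusterIP Service
                port:
                  number: 80
</code></pre>
<p>Many such Ingress resources can share the controller&#8217;s single MetalLB-assigned IP.</p>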
<h2 id="heading-full-end-to-end-traffic-flow"><strong>Full End-to-End Traffic Flow</strong></h2>
<h3 id="heading-step-1-a-client-requests"><strong>Step 1: A client requests:</strong></h3>
<pre><code class="lang-yaml"><span class="hljs-string">https://app1.example.com</span>
</code></pre>
<h3 id="heading-step-2-dns-resolves-it-to"><strong>Step 2: DNS resolves it to:</strong></h3>
<pre><code class="lang-yaml"><span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.200</span>
</code></pre>
<p>(This IP is provided by MetalLB)</p>
<h3 id="heading-step-3-the-packet-reaches-the-node"><strong>Step 3: The packet reaches the node</strong></h3>
<p>MetalLB Speaker announced this IP via ARP or BGP.</p>
<h3 id="heading-step-4-ingress-nginx-pod-receives-traffic"><strong>Step 4: ingress-nginx Pod receives traffic</strong></h3>
<p>Its LoadBalancer Service forwards port 80 or 443 to nginx.</p>
<h3 id="heading-step-5-nginx-reads-routing-rules"><strong>Step 5: nginx reads routing rules</strong></h3>
<p>Defined in Kubernetes Ingress resources.</p>
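<p>A minimal Ingress resource for the hypothetical <code>app1.example.com</code> host might look like this (the service name and port are assumptions for illustration):</p>
<pre><code class="lang-yaml">apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app1
spec:
  ingressClassName: nginx
  rules:
    - host: app1.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app1-service
                port:
                  number: 80
</code></pre>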
<h3 id="heading-step-6-nginx-forwards-to-the-correct-service"><strong>Step 6: nginx forwards to the correct Service</strong></h3>
<p>Example:</p>
<pre><code class="lang-yaml"><span class="hljs-string">app1-service</span> <span class="hljs-string">→</span> <span class="hljs-string">Pod(s)</span>
</code></pre>
<h3 id="heading-step-7-pod-responds"><strong>Step 7: Pod responds</strong></h3>
<p>Traffic flows back through the same path to the client.</p>
<hr />
<h2 id="heading-why-ingress-saves-ip-addresses"><strong>Why Ingress Saves IP Addresses</strong></h2>
<p>Without ingress:</p>
<ul>
<li>10 Services exposed externally = 10 LoadBalancer IPs consumed</li>
</ul>
<p>With ingress:</p>
<ul>
<li><p>10 Services exposed externally = 1 LoadBalancer IP used</p>
</li>
<li><p>nginx handles routing internally</p>
</li>
</ul>
<p>This is the reason most production setups use ingress controllers.</p>
<hr />
<h2 id="heading-practical-example-ip-planning-home-lab"><strong>Practical Example: IP Planning (Home Lab)</strong></h2>
<p>Suppose your subnet is:</p>
<pre><code class="lang-yaml"><span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.0</span><span class="hljs-string">/24</span>
</code></pre>
<p>Your router uses:</p>
<pre><code class="lang-yaml"><span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.1</span>
</code></pre>
<p>Your DHCP range:</p>
<pre><code class="lang-yaml"><span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.50</span> <span class="hljs-bullet">-</span> <span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.150</span>
</code></pre>
<p>You can choose:</p>
<pre><code class="lang-yaml"><span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.200</span> <span class="hljs-bullet">-</span> <span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.220</span>
</code></pre>
<p>This is:</p>
<ul>
<li><p>outside the DHCP range</p>
</li>
<li><p>inside the same subnet</p>
</li>
<li><p>safe to use for MetalLB</p>
</li>
</ul>
<p>This gives 21 external IPs.</p>
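<p>With MetalLB installed, this pool is declared through its CRDs. A sketch (the resource names are arbitrary; the namespace follows MetalLB's default install):</p>
<pre><code class="lang-yaml">apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: homelab-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.29.200-192.168.29.220
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: homelab-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - homelab-pool
</code></pre>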
<hr />
<h2 id="heading-flow-and-architecture"><strong>Flow and Architecture</strong></h2>
<h3 id="heading-a-network-overview"><strong>A. Network overview</strong></h3>
<pre><code class="lang-yaml">                 <span class="hljs-string">+----------------------+</span>
                 <span class="hljs-string">|</span>    <span class="hljs-string">Your</span> <span class="hljs-string">Router</span>       <span class="hljs-string">|
                 |   192.168.29.1       |
                 +----------+-----------+
                            |
                            |
                 Local Network (L2)
                            |
</span>           <span class="hljs-string">-----------------------------------</span>
           <span class="hljs-string">|</span>                 <span class="hljs-string">|</span>               <span class="hljs-string">|
    +-------------+   +-------------+   +-------------+
    | Kubernetes  |   | Kubernetes  |   | Kubernetes  |
    |   Node 1    |   |   Node 2    |   |   Node 3    |
    |192.168.29.10|   |192.168.29.11|   |192.168.29.12|
    +------+------+   +------+------+   +------+------+
           |                 |               |
        MetalLB Speaker on all nodes
           |                 |               |
       Announces IPs such as 192.168.29.200</span>
</code></pre>
<h3 id="heading-b-ingress-routing"><strong>B. Ingress routing</strong></h3>
<pre><code class="lang-yaml"><span class="hljs-string">Client</span> <span class="hljs-string">→</span> <span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.200</span> <span class="hljs-string">→</span> <span class="hljs-string">ingress-nginx</span> <span class="hljs-string">→</span> <span class="hljs-string">app1-service</span> <span class="hljs-string">→</span> <span class="hljs-string">app1</span> <span class="hljs-string">Pods</span>
                                              <span class="hljs-string">↳</span> <span class="hljs-string">app2-service</span> <span class="hljs-string">→</span> <span class="hljs-string">app2</span> <span class="hljs-string">Pods</span>
                                              <span class="hljs-string">↳</span> <span class="hljs-string">grafana-service</span> <span class="hljs-string">→</span> <span class="hljs-string">grafana</span> <span class="hljs-string">Pods</span>
</code></pre>
<hr />
<h2 id="heading-understanding-load-balancers-in-general"><strong>Understanding Load Balancers in General</strong></h2>
<p>A load balancer distributes incoming network traffic across multiple targets.<br />There are two broad categories:</p>
<h3 id="heading-layer-4-load-balancers"><strong>Layer 4 Load Balancers</strong></h3>
<p>These operate at the connection level (TCP/UDP).<br />They see ports and IP addresses only.</p>
<p>Examples:</p>
<ul>
<li><p>MetalLB</p>
</li>
<li><p>AWS NLB</p>
</li>
<li><p>HAProxy in TCP mode</p>
</li>
</ul>
<h3 id="heading-layer-7-load-balancers"><strong>Layer 7 Load Balancers</strong></h3>
<p>These operate at the application layer (HTTP/S).<br />They understand paths, headers, cookies, hostnames.</p>
<p>Examples:</p>
<ul>
<li><p>Nginx Ingress Controller</p>
</li>
<li><p>Traefik</p>
</li>
<li><p>AWS ALB</p>
</li>
</ul>
<p>In Kubernetes, it is common to use both:</p>
<ul>
<li><p><strong>MetalLB for Layer 4 external IP allocation</strong></p>
</li>
<li><p><strong>Nginx for Layer 7 routing</strong></p>
</li>
</ul>
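<p>The two layers meet at the ingress controller's Service. A sketch of pinning that Service to a specific pool address (the names follow the upstream ingress-nginx install but may differ in yours; <code>spec.loadBalancerIP</code> is deprecated in recent Kubernetes releases in favour of a MetalLB annotation, yet still widely used):</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  loadBalancerIP: 192.168.29.200
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443
</code></pre>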
<hr />
<h2 id="heading-how-everything-connects-together"><strong>How Everything Connects Together</strong></h2>
<p>Here is the conceptual hierarchy:</p>
<h3 id="heading-level-0-the-network"><strong>Level 0 – The Network</strong></h3>
<ul>
<li><p>You have a subnet such as <code>192.168.29.0/24</code></p>
</li>
<li><p>The subnet defines the address space for your LAN</p>
</li>
<li><p>Nodes receive IPs from this range</p>
</li>
</ul>
<h3 id="heading-level-1-kubernetes"><strong>Level 1 – Kubernetes</strong></h3>
<ul>
<li><p>Pods get IPs from the Pod CIDR</p>
</li>
<li><p>Services get Cluster IPs</p>
</li>
<li><p>Nodes route traffic internally via CNI</p>
</li>
</ul>
<h3 id="heading-level-2-metallb"><strong>Level 2 – MetalLB</strong></h3>
<ul>
<li><p>Provides external IPs from a dedicated pool</p>
</li>
<li><p>These IPs map to LoadBalancer services</p>
</li>
<li><p>MetalLB advertises these IPs at Layer 2</p>
</li>
</ul>
<h3 id="heading-level-3-ingress"><strong>Level 3 – Ingress</strong></h3>
<ul>
<li><p>Receives HTTP/S traffic at the MetalLB IP</p>
</li>
<li><p>Routes requests to internal services and pods</p>
</li>
<li><p>Handles hostnames, TLS, etc.</p>
</li>
</ul>
<h3 id="heading-level-4-your-applications"><strong>Level 4 – Your Applications</strong></h3>
<ul>
<li>Finally receive traffic that originated outside the cluster</li>
</ul>
<p>This layered architecture is what makes Kubernetes networking powerful, scalable, and modular.</p>
<hr />
<h2 id="heading-summary-table"><strong>Summary Table</strong></h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Component</td><td>Purpose</td><td>IP Source</td><td>Layer</td></tr>
</thead>
<tbody>
<tr>
<td>Node IP</td><td>Identify physical machines</td><td>LAN/DHCP</td><td>Layer 3</td></tr>
<tr>
<td>Pod IP</td><td>Identify individual containers</td><td>Pod CIDR</td><td>Layer 3</td></tr>
<tr>
<td>Service IP</td><td>Internal virtual service endpoints</td><td>Cluster CIDR</td><td>Layer 3</td></tr>
<tr>
<td>MetalLB IP</td><td>External access IPs for services</td><td>MetalLB pool</td><td>Layer 2</td></tr>
<tr>
<td>Ingress Controller</td><td>Routes HTTP/S traffic</td><td>Behind MetalLB</td><td>Layer 7</td></tr>
</tbody>
</table>
</div><h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>Kubernetes networking is far easier to understand when each layer is viewed separately and then combined into a complete model. Nodes receive IPs from your network. Pods receive internal IPs from Kubernetes. Services act as stable access points. Load balancers provide external connectivity. MetalLB brings cloud-style load balancers to bare-metal clusters. It does not limit how many nodes you can have. Ingress controllers consolidate routing so that many applications can share one external IP. Your IP pool only limits the number of <strong>LoadBalancer services</strong> you can expose, not the number of worker nodes, pods, or applications.</p>
<p>With these concepts understood together, you gain complete control over how your workloads are exposed and how your cluster interacts with the outside world.</p>
]]></content:encoded></item><item><title><![CDATA[Building Your Own Home Kubernetes Cluster with k0s and Remote Access]]></title><description><![CDATA[Kubernetes is the powerhouse of modern container orchestration, but setting it up at home or on minimal infrastructure can feel daunting. In this blog, I will walk you through creating a lightweight, fully functional Kubernetes cluster using k0s, com...]]></description><link>https://blog.nyzex.in/building-your-own-home-kubernetes-cluster-with-k0s-and-remote-access</link><guid isPermaLink="true">https://blog.nyzex.in/building-your-own-home-kubernetes-cluster-with-k0s-and-remote-access</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Homelab]]></category><category><![CDATA[tunneling]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Mon, 17 Nov 2025 21:45:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763415932228/ac277a71-3ef8-473e-9329-e84f0f82e670.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kubernetes is the powerhouse of modern container orchestration, but setting it up at home or on minimal infrastructure can feel daunting. In this blog, I will walk you through creating a <strong>lightweight, fully functional Kubernetes cluster</strong> using <strong>k0s</strong>, complete with a <strong>control plane and a worker node</strong>, and make it accessible <strong>remotely via a Pangolin tunnel</strong>.</p>
<p>By the end, you will have a cluster you can experiment on from anywhere.</p>
<hr />
<h2 id="heading-why-k0s"><strong>Why k0s?</strong></h2>
<p>k0s is a <strong>lightweight, all-in-one Kubernetes distribution</strong> that simplifies the setup process:</p>
<ul>
<li><p>Single binary for control plane and worker.</p>
</li>
<li><p>Minimal resource usage: ideal for home servers or VMs.</p>
</li>
<li><p>Easy to manage, yet fully compliant with Kubernetes APIs.</p>
</li>
<li><p>Perfect for learning, experimentation, or small production projects.</p>
</li>
</ul>
<p>This makes it ideal for our goal: a <strong>home lab cluster</strong> with remote access.</p>
<hr />
<h2 id="heading-step-1-setting-up-the-control-plane"><strong>Step 1: Setting Up the Control Plane</strong></h2>
<p>The control plane is the “brain” of the Kubernetes cluster: it manages nodes, schedules workloads, and exposes the API server.</p>
<h3 id="heading-install-k0s"><strong>Install k0s</strong></h3>
<pre><code class="lang-yaml"><span class="hljs-string">sudo</span> <span class="hljs-string">apt</span> <span class="hljs-string">update</span> <span class="hljs-string">&amp;&amp;</span> <span class="hljs-string">sudo</span> <span class="hljs-string">apt</span> <span class="hljs-string">install</span> <span class="hljs-string">curl</span> <span class="hljs-string">-y</span>
<span class="hljs-string">curl</span> <span class="hljs-string">-sSLf</span> <span class="hljs-string">https://get.k0s.sh</span> <span class="hljs-string">|</span> <span class="hljs-string">sudo</span> <span class="hljs-string">bash</span>
<span class="hljs-string">k0s</span> <span class="hljs-string">version</span>
</code></pre>
<p>Next, install the <strong>controller</strong> and start it:</p>
<pre><code class="lang-yaml"><span class="hljs-string">sudo</span> <span class="hljs-string">k0s</span> <span class="hljs-string">install</span> <span class="hljs-string">controller</span>
<span class="hljs-string">sudo</span> <span class="hljs-string">k0s</span> <span class="hljs-string">start</span>
<span class="hljs-string">sudo</span> <span class="hljs-string">k0s</span> <span class="hljs-string">status</span>
</code></pre>
<p><strong>Output example:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">Version:</span> <span class="hljs-string">v1.34.1+k0s.1</span>
<span class="hljs-attr">Role:</span> <span class="hljs-string">controller</span>
<span class="hljs-attr">Workloads:</span> <span class="hljs-literal">false</span>
<span class="hljs-attr">SingleNode:</span> <span class="hljs-literal">false</span>
</code></pre>
<p>This confirms the control plane is running.</p>
<hr />
<h2 id="heading-step-2-configure-kubectl-on-the-control-plane"><strong>Step 2: Configure kubectl on the Control Plane</strong></h2>
<p>To interact with Kubernetes, we need <strong>kubectl</strong>, the CLI tool.</p>
<ol>
<li>Generate your kubeconfig:</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-string">mkdir</span> <span class="hljs-string">-p</span> <span class="hljs-string">~/.kube</span>
<span class="hljs-string">sudo</span> <span class="hljs-string">k0s</span> <span class="hljs-string">kubeconfig</span> <span class="hljs-string">admin</span> <span class="hljs-string">&gt;</span> <span class="hljs-string">~/.kube/config</span>
<span class="hljs-string">chmod</span> <span class="hljs-number">600</span> <span class="hljs-string">~/.kube/config</span>
</code></pre>
<ol start="2">
<li>Install kubectl:</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-string">curl</span> <span class="hljs-string">-LO</span> <span class="hljs-string">"https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"</span>
<span class="hljs-string">sudo</span> <span class="hljs-string">install</span> <span class="hljs-string">-o</span> <span class="hljs-string">root</span> <span class="hljs-string">-g</span> <span class="hljs-string">root</span> <span class="hljs-string">-m</span> <span class="hljs-number">0755 </span><span class="hljs-string">kubectl</span> <span class="hljs-string">/usr/local/bin/kubectl</span>
</code></pre>
<ol start="3">
<li>Verify the cluster is reachable:</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-string">kubectl</span> <span class="hljs-string">get</span> <span class="hljs-string">nodes</span>
</code></pre>
<p>At this point, you have a <strong>single-node control plane</strong> ready.</p>
<hr />
<h2 id="heading-step-3-add-a-worker-node"><strong>Step 3: Add a Worker Node</strong></h2>
<p>The worker node is where your workloads (pods, deployments, services) will actually run.</p>
<ol>
<li>On the control plane, generate a token for the worker:</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-string">sudo</span> <span class="hljs-string">k0s</span> <span class="hljs-string">token</span> <span class="hljs-string">create</span> <span class="hljs-string">--role=worker</span>
</code></pre>
<ol start="2">
<li>On the worker machine:</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-string">sudo</span> <span class="hljs-string">apt</span> <span class="hljs-string">update</span> <span class="hljs-string">&amp;&amp;</span> <span class="hljs-string">sudo</span> <span class="hljs-string">apt</span> <span class="hljs-string">install</span> <span class="hljs-string">curl</span> <span class="hljs-string">-y</span>
<span class="hljs-string">curl</span> <span class="hljs-string">-sSLf</span> <span class="hljs-string">https://get.k0s.sh</span> <span class="hljs-string">|</span> <span class="hljs-string">sudo</span> <span class="hljs-string">bash</span>
<span class="hljs-string">nano</span> <span class="hljs-string">tokenfile</span>  <span class="hljs-comment"># paste the token from control plane</span>
<span class="hljs-string">sudo</span> <span class="hljs-string">k0s</span> <span class="hljs-string">install</span> <span class="hljs-string">worker</span> <span class="hljs-string">--token-file</span> <span class="hljs-string">tokenfile</span>
<span class="hljs-string">sudo</span> <span class="hljs-string">k0s</span> <span class="hljs-string">start</span>
<span class="hljs-string">sudo</span> <span class="hljs-string">systemctl</span> <span class="hljs-string">enable</span> <span class="hljs-string">--now</span> <span class="hljs-string">k0sworker</span>
<span class="hljs-string">sudo</span> <span class="hljs-string">journalctl</span> <span class="hljs-string">-fu</span> <span class="hljs-string">k0sworker</span>
</code></pre>
<ol start="3">
<li>Back on the control plane, verify the worker joined:</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-string">kubectl</span> <span class="hljs-string">get</span> <span class="hljs-string">nodes</span> <span class="hljs-string">-o</span> <span class="hljs-string">wide</span>
</code></pre>
<p><strong>Output:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-string">NAME</span>                <span class="hljs-string">STATUS</span>   <span class="hljs-string">ROLES</span>    <span class="hljs-string">AGE</span>   <span class="hljs-string">VERSION</span>       <span class="hljs-string">INTERNAL-IP</span>     <span class="hljs-string">EXTERNAL-IP</span>   <span class="hljs-string">OS-IMAGE</span>             <span class="hljs-string">KERNEL-VERSION</span>     <span class="hljs-string">CONTAINER-RUNTIME</span>
<span class="hljs-string">ubuntu-workernode</span>   <span class="hljs-string">Ready</span>    <span class="hljs-string">&lt;none&gt;</span>   <span class="hljs-string">20m</span>   <span class="hljs-string">v1.34.1+k0s</span>   <span class="hljs-number">192.168</span><span class="hljs-number">.29</span><span class="hljs-number">.39</span>   <span class="hljs-string">&lt;none&gt;</span>        <span class="hljs-string">Ubuntu</span> <span class="hljs-number">24.04</span><span class="hljs-number">.3</span> <span class="hljs-string">LTS</span>   <span class="hljs-number">6.8</span><span class="hljs-number">.0</span><span class="hljs-number">-87</span><span class="hljs-string">-generic</span>   <span class="hljs-string">containerd://1.7.28</span>
</code></pre>
<hr />
<h2 id="heading-step-4-accessing-your-cluster-remotely"><strong>Step 4: Accessing Your Cluster Remotely</strong></h2>
<p>One of the most exciting parts is accessing your cluster from <strong>outside your network</strong>. For this, we will use a <strong>Pangolin tunnel</strong> to expose the control plane.</p>
<p>You can follow my previous blog regarding Pangolin setup:</p>
<p><a target="_blank" href="https://blog.nyzex.in/self-hosting-pangolin-newt-on-your-own-server">https://blog.nyzex.in/self-hosting-pangolin-newt-on-your-own-server</a></p>
<ol>
<li>Copy your kubeconfig to your remote machine.</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-comment">#first create the kubeconfig file in the controlplane vm</span>
<span class="hljs-string">sudo</span> <span class="hljs-string">k0s</span> <span class="hljs-string">kubeconfig</span> <span class="hljs-string">admin</span> <span class="hljs-string">&gt;</span> <span class="hljs-string">~/.kube/config</span>
<span class="hljs-string">chmod</span> <span class="hljs-number">600</span> <span class="hljs-string">~/.kube/config</span>
<span class="hljs-string">cat</span> <span class="hljs-string">~/.kube/config</span>
</code></pre>
<ol start="2">
<li>Update the <code>server:</code> field to your Pangolin hostname (after copying this config to our remote machine):</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-attr">clusters:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">cluster:</span>
    <span class="hljs-attr">server:</span> <span class="hljs-string">https://tunnel.nyzex.in:6443</span>
    <span class="hljs-attr">insecure-skip-tls-verify:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">local</span>
</code></pre>
<blockquote>
<p>Note: <code>insecure-skip-tls-verify: true</code> bypasses the TLS hostname check since our certificate is for internal names. This is fine for personal labs, but not recommended for production.</p>
</blockquote>
<ol start="3">
<li>Set your kubeconfig and verify:</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-string">export</span> <span class="hljs-string">KUBECONFIG=$(pwd)/kubeconfig</span>
<span class="hljs-string">kubectl</span> <span class="hljs-string">get</span> <span class="hljs-string">nodes</span> <span class="hljs-string">-o</span> <span class="hljs-string">wide</span>
</code></pre>
<p>You should see both the <strong>control plane and worker node</strong>, now accessible remotely.</p>
<hr />
<h2 id="heading-step-5-how-it-works"><strong>Step 5: How It Works</strong></h2>
<p>Here’s a simple view of the setup:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763415103427/452915c0-479b-4c63-8d2c-28952ef56ac9.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p><strong>Control Plane</strong>: API server and cluster management.</p>
</li>
<li><p><strong>Worker Node</strong>: Runs workloads.</p>
</li>
<li><p><strong>Remote Machine</strong>: Access via Pangolin tunnel.</p>
</li>
</ul>
<hr />
<h2 id="heading-what-to-do-next"><strong>What to do next?</strong></h2>
<ul>
<li><p>For production-grade security, generate a <strong>certificate that includes your external hostname</strong> instead of skipping TLS verification.</p>
</li>
<li><p>Add more workers to scale your cluster.</p>
</li>
<li><p>Deploy your first workloads and explore Kubernetes features.</p>
</li>
</ul>
<hr />
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>With a few steps, you now have a <strong>home Kubernetes lab</strong>:</p>
<ul>
<li><p>Control plane + worker node cluster.</p>
</li>
<li><p>Remote kubectl access via Pangolin tunnel.</p>
</li>
<li><p>Fully functional, ready to deploy workloads.</p>
</li>
</ul>
<p>This setup is perfect for experimenting with Kubernetes, testing CI/CD pipelines, or just learning cluster management hands-on.</p>
]]></content:encoded></item><item><title><![CDATA[Talos OS: A Hard Earned Understanding Of Storage, Certificates, And Access]]></title><description><![CDATA[Talos OS promises a fully immutable, API driven Kubernetes experience. It removes the idea of logging into nodes, changing files manually, or performing maintenance through traditional means. This design brings a high level of security and predictabi...]]></description><link>https://blog.nyzex.in/talos-os-a-hard-earned-understanding-of-storage-certificates-and-access</link><guid isPermaLink="true">https://blog.nyzex.in/talos-os-a-hard-earned-understanding-of-storage-certificates-and-access</guid><category><![CDATA[Homelab]]></category><category><![CDATA[Security]]></category><category><![CDATA[Devops]]></category><category><![CDATA[talos-linux]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[operating system]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Mon, 17 Nov 2025 13:54:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763387659592/7b24e21f-7a20-4508-afea-88f6e8f54988.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Talos OS promises a fully immutable, API driven Kubernetes experience. It removes the idea of logging into nodes, changing files manually, or performing maintenance through traditional means. This design brings a high level of security and predictability. It also brings a set of challenges that many users, including myself, only discover once Talos becomes part of a real cluster.</p>
<p>During my recent effort to run PostgreSQL on a Talos cluster, I experienced several failures, unexpected behaviours, and some difficult recovery situations. This post documents that entire experience. I want this to help anyone who is trying to use Talos in a small cluster or a homelab environment, because the learning curve is very steep.</p>
<h2 id="heading-understanding-why-storage-becomes-difficult"><strong>Understanding Why Storage Becomes Difficult</strong></h2>
<p>Talos is an immutable operating system. This sounds ideal until you try to mount storage. Many Kubernetes setups allow you to create directories directly on the node using simple commands. Talos does not allow this approach.</p>
<p>These are the limitations that immediately matter:</p>
<ul>
<li><p>You cannot create directories on the host manually</p>
</li>
<li><p>You cannot change permissions manually</p>
</li>
<li><p>You cannot rely on paths that do not already exist</p>
</li>
<li><p>You cannot SSH into the node to fix things</p>
</li>
<li><p>You cannot depend on anything that is not part of the machine configuration</p>
</li>
</ul>
<p>My PostgreSQL deployment required a PersistentVolume backed by local storage. I created a PersistentVolume that pointed to a path like <code>/var/lib/postgres</code> or <code>/mnt/postgres</code>. Each attempt failed with the same error.</p>
<pre><code class="lang-yaml"><span class="hljs-string">MountVolume.NewMounter</span> <span class="hljs-string">initialization</span> <span class="hljs-string">failed</span> <span class="hljs-string">for</span> <span class="hljs-string">volume</span> <span class="hljs-string">"pv-postgres"</span> <span class="hljs-string">:</span> <span class="hljs-string">path</span> <span class="hljs-string">"/var/lib/postgres"</span> <span class="hljs-string">does</span> <span class="hljs-string">not</span> <span class="hljs-string">exist</span>
</code></pre>
<p>Talos refuses to mount a path that does not exist. Since I was unable to create that directory manually, I needed a path that already existed.</p>
<p>The solution was surprisingly simple!</p>
<p>I used a directory that Talos already creates by default.</p>
<pre><code class="lang-yaml"><span class="hljs-string">/var/local</span>
</code></pre>
<p>The moment I pointed my PersistentVolume to <code>/var/local</code>, PostgreSQL started successfully. The directory existed, the kubelet accepted it, and the pod finally mounted the data volume.</p>
<p>This taught me the most important lesson about Talos and storage. Any stateful workload that needs a hostPath or a local PersistentVolume must rely on a path that exists at boot through the machine configuration. If the path is not created by Talos, then the pod will fail.</p>
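<p>If a dedicated directory is preferred over reusing <code>/var/local</code>, it can be declared in the Talos machine configuration as an extra kubelet mount so it exists at boot. A sketch based on the <code>machine.kubelet.extraMounts</code> field (the path name here is illustrative; check the Talos documentation for your version before applying):</p>
<pre><code class="lang-yaml">machine:
  kubelet:
    extraMounts:
      - destination: /var/mnt/postgres
        type: bind
        source: /var/mnt/postgres
        options:
          - bind
          - rshared
          - rw
</code></pre>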
<p>Here is the full PersistentVolume manifest pointing at <code>/var/local</code>:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">PersistentVolume</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">pv-postgres</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">storageClassName:</span> <span class="hljs-string">localstorage</span>
  <span class="hljs-attr">capacity:</span>
    <span class="hljs-attr">storage:</span> <span class="hljs-string">10Gi</span>
  <span class="hljs-attr">accessModes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">ReadWriteOnce</span>
  <span class="hljs-attr">local:</span>
    <span class="hljs-attr">path:</span> <span class="hljs-string">/var/local</span>
  <span class="hljs-attr">nodeAffinity:</span>
    <span class="hljs-attr">required:</span>
      <span class="hljs-attr">nodeSelectorTerms:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">matchExpressions:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">kubernetes.io/hostname</span>
          <span class="hljs-attr">operator:</span> <span class="hljs-string">In</span>
          <span class="hljs-attr">values:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-string">talos-1to-jsz</span>
</code></pre>
<p>The corresponding PVC was:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">PersistentVolumeClaim</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">data-postgresql</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">postgresql</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">accessModes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">ReadWriteOnce</span>
  <span class="hljs-attr">storageClassName:</span> <span class="hljs-string">localstorage</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">storage:</span> <span class="hljs-string">10Gi</span>
</code></pre>
<p>Then I applied my Deployment manifest, alongside its Service:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">postgresql</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">postgres</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">postgres:16</span>
          <span class="hljs-attr">imagePullPolicy:</span> <span class="hljs-string">IfNotPresent</span>
          <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">containerPort:</span> <span class="hljs-number">5432</span>
          <span class="hljs-attr">env:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">POSTGRES_DB</span>
              <span class="hljs-attr">value:</span> <span class="hljs-string">mydb</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">POSTGRES_USER</span>
              <span class="hljs-attr">value:</span> <span class="hljs-string">myuser</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">POSTGRES_PASSWORD</span>
              <span class="hljs-attr">value:</span> <span class="hljs-string">mypassword</span>
          <span class="hljs-attr">volumeMounts:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-data</span>
              <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/var/lib/postgresql/data</span>
      <span class="hljs-attr">volumes:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">postgres-data</span>
          <span class="hljs-attr">persistentVolumeClaim:</span>
            <span class="hljs-attr">claimName:</span> <span class="hljs-string">data-postgresql</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">postgresql</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">postgres</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">postgres</span>
      <span class="hljs-attr">protocol:</span> <span class="hljs-string">TCP</span>
      <span class="hljs-attr">port:</span> <span class="hljs-number">5432</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">5432</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">ClusterIP</span>
</code></pre>
<p>After this, the PostgreSQL pod successfully started and mounted the volume. I was able to connect to it using a test pod:</p>
<pre><code class="lang-yaml"><span class="hljs-string">kubectl</span> <span class="hljs-string">run</span> <span class="hljs-string">psql-test</span> <span class="hljs-string">\</span>
  <span class="hljs-string">--rm</span> <span class="hljs-string">-it</span> <span class="hljs-string">\</span>
  <span class="hljs-string">--image=postgres:16</span> <span class="hljs-string">\</span>
  <span class="hljs-string">--namespace</span> <span class="hljs-string">postgresql</span> <span class="hljs-string">\</span>
  <span class="hljs-string">--env="PGPASSWORD=mypassword"</span> <span class="hljs-string">\</span>
  <span class="hljs-string">--</span> <span class="hljs-string">psql</span> <span class="hljs-string">-h</span> <span class="hljs-string">postgres</span> <span class="hljs-string">-U</span> <span class="hljs-string">myuser</span> <span class="hljs-string">-d</span> <span class="hljs-string">mydb</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763387499925/0d219043-d1a7-4090-b3bb-70e79cfbdb79.png" alt class="image--center mx-auto" /></p>
<p>This experience taught me a critical lesson about Talos and storage: any stateful workload that needs a hostPath or local PV must use directories that either exist at boot or are created through the machine configuration. Pointing at arbitrary directories will fail. While <code>/var/local</code> worked, it is not a best practice and should be avoided in production.</p>
<h2 id="heading-why-using-existing-directories-like-varlocal-works-but-is-not-a-good-practice"><strong>Why Using Existing Directories Like</strong> <code>/var/local</code> Works But Is Not A Good Practice</h2>
<p>When I struggled to mount storage for PostgreSQL on Talos, I eventually discovered that <code>/var/local</code> already existed on every node. Talos creates this directory during early boot. As soon as I pointed my PersistentVolume to <code>/var/local</code>, the database pod started without any issues. The directory already existed, the kubelet was satisfied, and the pod finally mounted the data volume.</p>
<p>This seems like a convenient solution, but it is not a recommended approach. It introduces several long term problems and operational risks.</p>
<p>Here is why it is not a good practice.</p>
<h3 id="heading-1-this-directory-is-not-meant-for-application-data"><strong>1. This directory is not meant for application data</strong></h3>
<p><code>/var/local</code> is an internal Talos directory and is not designed for stateful workloads. Talos can modify or use this directory for its own purposes in future releases. Talos does not document or guarantee that this directory will always exist, or that it will behave the same way across upgrades.</p>
<p>You are depending on behaviour that is a side effect of the operating system rather than a stable feature.</p>
<h3 id="heading-2-all-applications-will-share-the-same-storage-location"><strong>2. All applications will share the same storage location</strong></h3>
<p>If you reuse <code>/var/local</code> for every PersistentVolume, then every stateful application on that node will write data into the same directory. This will cause:</p>
<ul>
<li><p>A lack of isolation between applications</p>
</li>
<li><p>Possible permission conflicts</p>
</li>
<li><p>A risk of one application filling the entire directory and breaking the others</p>
</li>
<li><p>Difficulty with debugging and storage visibility</p>
</li>
</ul>
<p>It also becomes impossible to safely delete or migrate individual application data.</p>
<h3 id="heading-3-talos-does-not-enforce-size-limits"><strong>3. Talos does not enforce size limits</strong></h3>
<p>Kubernetes does not enforce the size declared in a PersistentVolume. Since <code>/var/local</code> is just a directory, any application can exceed the declared ten gibibytes. There is no guarantee that the node will not run out of space.</p>
<p>In a worst case scenario, the node can crash due to disk pressure.</p>
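<p>Since neither Kubernetes nor Talos enforces the declared capacity, it is worth checking the real usage from time to time. A minimal sketch, assuming the PostgreSQL Deployment from the manifests above is named <code>postgres</code>:</p>
<pre><code class="lang-bash"># Actual space used by the data directory (not limited by the PV's declared size)
kubectl exec -n postgresql deploy/postgres -- du -sh /var/lib/postgresql/data

# Free space remaining on the node filesystem backing the volume
kubectl exec -n postgresql deploy/postgres -- df -h /var/lib/postgresql/data
</code></pre>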
<h3 id="heading-4-this-breaks-the-philosophy-of-talos"><strong>4. This breaks the philosophy of Talos</strong></h3>
<p>Talos is meant to be fully declarative. Anything that exists should be defined through the MachineConfig, not through accidental filesystem structure that happens to be present.</p>
<p>If the directory is not created explicitly by configuration, then it is not an intentional part of your infrastructure. It is risky to build application level storage on top of something that was not designed for this purpose.</p>
<h3 id="heading-5-upgrades-and-reinstallations-can-remove-the-directory"><strong>5. Upgrades and reinstallations can remove the directory</strong></h3>
<p>During major upgrades or node reinstalls, Talos can reset or restructure internal directories. If <code>/var/local</code> is removed, renamed, or reformatted, then every application that relies on it will lose its data.</p>
<p>Your data becomes fragile and tied to undocumented filesystem details.</p>
<hr />
<h2 id="heading-the-correct-talos-approved-way"><strong>The Correct Talos Approved Way</strong></h2>
<p>The recommended way to create storage paths in Talos is:</p>
<p><strong>Use a MachineConfig patch to create directories with explicit permissions and ownership.</strong></p>
<p>For example, you can declare this:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">machine:</span>
  <span class="hljs-attr">files:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">/var/data/postgres</span>
      <span class="hljs-attr">permissions:</span> <span class="hljs-string">0o755</span>
      <span class="hljs-attr">owner:</span> <span class="hljs-number">0</span>
      <span class="hljs-attr">group:</span> <span class="hljs-number">0</span>
      <span class="hljs-attr">directory:</span> <span class="hljs-literal">true</span>
</code></pre>
<p>You can create as many directories as you need. This is the clean, controlled, and safe method.</p>
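<p>On a node that is already running, a patch like this can be applied live instead of regenerating the full <code>worker.yaml</code>. A sketch, assuming the patch above is saved as <code>dir-patch.yaml</code> (a filename of my choosing):</p>
<pre><code class="lang-bash"># Merge the patch into the node's active machine configuration
talosctl patch machineconfig -n &lt;node-ip&gt; --patch @dir-patch.yaml
</code></pre>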
<p>You then point your PersistentVolumes to these paths with confidence that:</p>
<ul>
<li><p>They will exist on every node</p>
</li>
<li><p>They will survive upgrades</p>
</li>
<li><p>They are dedicated to the correct application</p>
</li>
<li><p>They will not conflict with Talos internals</p>
</li>
</ul>
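<p>A local PersistentVolume built on such a declared path could then look like the sketch below. The names, StorageClass, and the 10Gi size are illustrative; replace <code>&lt;node-hostname&gt;</code> with your worker's hostname:</p>
<pre><code class="lang-yaml">apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-postgres
spec:
  storageClassName: local-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  local:
    path: /var/data/postgres   # the directory declared in the MachineConfig
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - &lt;node-hostname&gt;
</code></pre>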
<p>This is the maintainable long-term approach, but it was exactly here that I ran into the certificate error, which made life difficult!</p>
<h2 id="heading-expanding-persistent-volumes">Expanding Persistent Volumes</h2>
<p>In the example above, I claimed the full 10Gi of available storage on the node. If I want to expand storage later, I must first ensure more disk space is available at the path and then patch the PVC to request the larger size. Kubernetes will handle resizing if the StorageClass allows it.</p>
<p>For multiple services requiring persistent storage, I can technically reuse <code>/var/local</code>, but this is also not recommended. Each workload should ideally have its own dedicated storage path or volume managed through a proper storage provider.</p>
<hr />
<h2 id="heading-why-longhorn-does-not-work-easily"><strong>Why Longhorn Does Not Work Easily</strong></h2>
<p>Longhorn requires certain kernel modules, directory mounts, and filesystem behaviours that Talos does not support by default. Talos aims for very minimal host configuration. Longhorn expects the opposite. The two conflict in many ways.</p>
<p>As a result, most users who attempt Longhorn on Talos experience failures. This includes random crashes, volume mount issues, replica failures, and inability to start the Longhorn UI.</p>
<p>The safer alternative is to use Talos MachineConfig patches to create custom paths and then rely on local PersistentVolumes. This reduces flexibility but increases stability.</p>
<hr />
<h2 id="heading-the-strange-behaviour-of-persistentvolumes-in-talos"><strong>The Strange Behaviour Of PersistentVolumes In Talos</strong></h2>
<p>When I created a 10 GiB PersistentVolume and a matching PersistentVolumeClaim, it worked immediately on <code>/var/local</code>. This made me curious about what was actually happening.</p>
<p>Here is the explanation.</p>
<ul>
<li><p>The size declared in a PersistentVolume does not actually allocate disk space</p>
</li>
<li><p>The directory simply points to the host filesystem</p>
</li>
<li><p>Talos does not perform any reservations</p>
</li>
<li><p>Kubelet does not enforce storage consumption limits</p>
</li>
<li><p>The declared capacity only informs Kubernetes scheduling</p>
</li>
</ul>
<p>This means that even if you declare ten gibibytes, the actual host directory is unbounded. PostgreSQL can consume much more than the advertised size if it needs to. The responsibility of storage growth sits entirely on you as the administrator.</p>
<p>If you want to expand a PersistentVolume in Talos:</p>
<ol>
<li><p>You need a larger physical directory available</p>
</li>
<li><p>You resize the PersistentVolumeClaim</p>
</li>
<li><p>The underlying filesystem must support online expansion</p>
</li>
</ol>
<p>This works with filesystems like ext4 and xfs if configured correctly by the node.</p>
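<p>As a sketch of step 2: with a statically provisioned local PV there is no provisioner to react to the claim, so in practice the PV's declared capacity is usually edited alongside the claim. Kubernetes may also reject the PVC patch unless the StorageClass sets <code>allowVolumeExpansion: true</code>. The PV name and the 20Gi target below are illustrative:</p>
<pre><code class="lang-bash"># Raise the PV's advertised capacity, then the claim's request
kubectl patch pv pv-postgres -p '{"spec":{"capacity":{"storage":"20Gi"}}}'
kubectl patch pvc data-postgresql -n postgresql -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
</code></pre>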
<hr />
<h2 id="heading-certificates-and-node-access">Certificates and Node Access</h2>
<p>Another issue I faced was related to Talos certificates. After initial node creation, when I tried to apply a new worker configuration (adding the <code>machine.files</code> entry that creates the directory), I received errors like:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">error applying new configuration: rpc error:</span> <span class="hljs-string">code</span> <span class="hljs-string">=</span> <span class="hljs-string">Unavailable</span> <span class="hljs-string">desc</span> <span class="hljs-string">=</span> <span class="hljs-attr">connection error:</span> <span class="hljs-string">desc</span> <span class="hljs-string">=</span> <span class="hljs-string">"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority"</span>
</code></pre>
<p>Even though I used the same <code>worker.yaml</code> file, Talos refused the connection. The lesson here is that Talos tightly couples node certificates with the cluster PKI. If a certificate becomes invalid or untrusted, you can lose direct access.</p>
<p>There are several scenarios where this failure can occur:</p>
<ul>
<li><p>You regenerate a machine configuration file</p>
</li>
<li><p>You recreate a node with a slightly different configuration</p>
</li>
<li><p>You lose your talosconfig file</p>
</li>
<li><p>The cluster CA is overridden when bootstrap runs again</p>
</li>
<li><p>Node IP addresses change</p>
</li>
<li><p>You accidentally mix configuration files from different clusters</p>
</li>
</ul>
<p>When this happens, you cannot log into the node. You cannot fix files manually. You cannot mount a debug shell. This is both a security feature and a very serious operational risk.</p>
<p>This is the moment when Talos begins to feel unforgiving. A lost certificate means that you lose control of the node unless you have backups of your original configuration, certificates, and bootstrap secrets.</p>
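<p>Backing up those few files takes seconds and is the cheapest insurance Talos offers. A minimal sketch, assuming the default filenames produced by <code>talosctl gen config</code>:</p>
<pre><code class="lang-bash"># Keep this archive offline and out of version control - it grants full cluster access
tar czf talos-backup.tar.gz secrets.yaml talosconfig controlplane.yaml worker.yaml
</code></pre>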
<h2 id="heading-the-difficulty-of-troubleshooting-and-recovery"><strong>The Difficulty Of Troubleshooting And Recovery</strong></h2>
<p>Troubleshooting Talos is not like troubleshooting a normal Linux server. There is no SSH access, no direct shell, and no persistent filesystem to examine. Everything flows through the Talos API. If the Talos API is broken because of mismatched certificates, then you are locked out completely.</p>
<p>The only reliable recovery options are:</p>
<ul>
<li><p>Reboot into maintenance mode</p>
</li>
<li><p>Reapply a machine configuration</p>
</li>
<li><p>Restore backed up secrets</p>
</li>
<li><p>Reinstall the node if nothing else works</p>
</li>
</ul>
<p>This can feel very restrictive. It requires a mindset shift. Talos does not want administrators to fix issues manually. Talos wants everything to be declared in configuration files from the start.</p>
<p>This is powerful, but very easy to break if any detail is forgotten. And if you need to reset the node, consider your previous work gone :(</p>
<h3 id="heading-recovering-node-access">Recovering Node Access</h3>
<p>To regain access to a Talos node when the certificate fails:</p>
<ol>
<li>Apply the configuration insecurely (this works while the node is in maintenance mode):</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-string">talosctl</span> <span class="hljs-string">apply-config</span> <span class="hljs-string">--insecure</span> <span class="hljs-string">-n</span> <span class="hljs-string">&lt;node-ip&gt;</span> <span class="hljs-string">--file</span> <span class="hljs-string">worker.yaml</span>
</code></pre>
<ol start="2">
<li>Alternatively, retrieve a fresh Kubernetes kubeconfig from the node using:</li>
</ol>
<pre><code class="lang-yaml"><span class="hljs-string">talosctl</span> <span class="hljs-string">-n</span> <span class="hljs-string">&lt;node-ip&gt;</span> <span class="hljs-string">kubeconfig</span> <span class="hljs-string">-f</span> <span class="hljs-string">kubeconfig.yaml</span>
</code></pre>
<ol start="3">
<li><p>Use <code>talosctl --insecure</code> carefully, as it bypasses certificate validation.</p>
</li>
<li><p>Always back up your <code>talosconfig</code> files and certificates. Losing them makes access recovery difficult.</p>
</li>
</ol>
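<p>Once access is restored, it should be possible to talk to the node over mutual TLS again. A quick sanity check, assuming your restored <code>talosconfig</code> is in the working directory:</p>
<pre><code class="lang-bash">talosctl --talosconfig talosconfig -n &lt;node-ip&gt; version
talosctl --talosconfig talosconfig -n &lt;node-ip&gt; services
</code></pre>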
<h2 id="heading-key-lessons">Key Lessons</h2>
<ul>
<li><p>Talos does not allow arbitrary hostPath directories. Only paths present at boot or created via machine configuration are valid for PVs.</p>
</li>
<li><p>Use <code>/var/local</code> as a temporary solution, but do not rely on it for production workloads.</p>
</li>
<li><p>Always backup Talos certificates and configuration to avoid losing access.</p>
</li>
<li><p>Stateful workloads require careful planning of persistent storage in Talos.</p>
</li>
</ul>
<p>Talos is secure and minimal by design, but these features make working with storage and configuration more challenging than standard Linux nodes.</p>
<hr />
<h2 id="heading-my-conclusion-after-working-through-these-issues"><strong>My Conclusion After Working Through These Issues</strong></h2>
<p>Talos OS is impressive. It is secure and consistent. At the same time, it brings operational challenges that are not obvious until you experience them directly.</p>
<p>The three biggest issues I faced were:</p>
<ol>
<li><p>Storage paths that cannot be created manually</p>
</li>
<li><p>Certificates that break node access</p>
</li>
<li><p>Recovery procedures that depend entirely on correct configuration files</p>
</li>
</ol>
<p>Talos is ideal for large production environments where every configuration is version controlled, stable, and tested. It can be difficult for homelab environments where experimentation is common and nodes often change.</p>
<p>In the end I created working PersistentVolumes, understood the certificate failures, recovered node access, and built functional PostgreSQL storage. This journey helped me understand Talos in a much deeper way. I now appreciate how strict and predictable it is, even though that strictness caused many of the problems.</p>
<p>If you are planning to learn Talos, try to keep configuration backups from the very beginning. It will save you much more time later!</p>
<p>I am still learning, and perhaps there are simpler ways to run Talos in homelab experiments, but for now I will stick to more experiment-friendly Kubernetes distributions.</p>
]]></content:encoded></item><item><title><![CDATA[Running Jenkins on Kubernetes – Complete Setup Experience]]></title><description><![CDATA[Running Jenkins on Kubernetes is one of those tasks that seems simple in theory but teaches a lot once you go through it. I wanted to host Jenkins in my cluster for CI workloads and learn how it behaves with persistent volumes, service accounts, and ...]]></description><link>https://blog.nyzex.in/running-jenkins-on-kubernetes-complete-setup-experience</link><guid isPermaLink="true">https://blog.nyzex.in/running-jenkins-on-kubernetes-complete-setup-experience</guid><category><![CDATA[ci-cd]]></category><category><![CDATA[Jenkins]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[talos-linux]]></category><category><![CDATA[pangolin]]></category><dc:creator><![CDATA[Sanjeev Kumar Bharadwaj]]></dc:creator><pubDate>Thu, 13 Nov 2025 18:22:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1763057983747/6115e023-4ffa-4f89-b953-1644c6405112.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Running Jenkins on Kubernetes is one of those tasks that seems simple in theory but teaches a lot once you go through it. I wanted to host Jenkins in my cluster for CI workloads and learn how it behaves with persistent volumes, service accounts, and ingress exposure. This is how I set it up step by step.</p>
<hr />
<h3 id="heading-understanding-the-goal">Understanding the Goal</h3>
<p>My objective was clear:</p>
<ul>
<li><p>Deploy Jenkins in a dedicated namespace.</p>
</li>
<li><p>Persist Jenkins data using a local PersistentVolume.</p>
</li>
<li><p>Run Jenkins on a worker node, not on the control plane.</p>
</li>
<li><p>Expose Jenkins externally through an NGINX ingress using my domain.</p>
</li>
</ul>
<p>Since I already had an ingress setup with MetalLB and a domain configured through Pangolin tunnel, Jenkins exposure had to follow the same model as my existing services.</p>
<p>You can check out my previous blog for this setup!<br /><a target="_blank" href="https://blog.nyzex.in/exposing-kubernetes-services-over-the-internet-using-metallb-nginx-ingress-and-pangolin">https://blog.nyzex.in/exposing-kubernetes-services-over-the-internet-using-metallb-nginx-ingress-and-pangolin</a></p>
<hr />
<h3 id="heading-setting-up-the-namespace-and-storage">Setting up the Namespace and Storage</h3>
<p>The first step was to prepare the storage layer. Jenkins requires persistent data for plugins, jobs, and configurations, so I decided to use a <strong>local PersistentVolume</strong> that maps to a directory on one of my worker nodes.</p>
<p>Below is the YAML I used for the storage setup:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">kind:</span> <span class="hljs-string">StorageClass</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">storage.k8s.io/v1</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">local-storage</span>
<span class="hljs-attr">provisioner:</span> <span class="hljs-string">kubernetes.io/no-provisioner</span>
<span class="hljs-attr">volumeBindingMode:</span> <span class="hljs-string">WaitForFirstConsumer</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">PersistentVolume</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">pv-jenkins</span>
  <span class="hljs-attr">labels:</span>
    <span class="hljs-attr">type:</span> <span class="hljs-string">local</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">storageClassName:</span> <span class="hljs-string">local-storage</span>
  <span class="hljs-attr">capacity:</span>
    <span class="hljs-attr">storage:</span> <span class="hljs-string">20Gi</span>
  <span class="hljs-attr">accessModes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">ReadWriteOnce</span>
  <span class="hljs-attr">local:</span>
    <span class="hljs-attr">path:</span> <span class="hljs-string">/mnt</span>
  <span class="hljs-attr">nodeAffinity:</span>
    <span class="hljs-attr">required:</span>
      <span class="hljs-attr">nodeSelectorTerms:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">matchExpressions:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">kubernetes.io/hostname</span>
              <span class="hljs-attr">operator:</span> <span class="hljs-string">In</span>
              <span class="hljs-attr">values:</span>
                <span class="hljs-bullet">-</span> <span class="hljs-string">talos-1to-jsz</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">PersistentVolumeClaim</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">pvc-jenkins</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">jenkins</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">storageClassName:</span> <span class="hljs-string">local-storage</span>
  <span class="hljs-attr">accessModes:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">ReadWriteOnce</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">requests:</span>
      <span class="hljs-attr">storage:</span> <span class="hljs-string">10Gi</span>
</code></pre>
<p>Here, <code>talos-1to-jsz</code> is my worker node where the Jenkins pod must run. The PV uses the local path <code>/mnt</code>, and the PVC binds to it successfully. Using <code>WaitForFirstConsumer</code> ensures that the PVC only binds when the pod is scheduled.</p>
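<p>With <code>WaitForFirstConsumer</code>, a freshly created claim sitting in <code>Pending</code> is expected rather than an error; it binds only once the Jenkins pod is scheduled:</p>
<pre><code class="lang-bash"># STATUS stays Pending until the pod lands on talos-1to-jsz, then turns Bound
kubectl get pvc pvc-jenkins -n jenkins
kubectl get pv pv-jenkins
</code></pre>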
<hr />
<h3 id="heading-configuring-the-service-account-and-rbac">Configuring the Service Account and RBAC</h3>
<p>Jenkins often needs to interact with the Kubernetes API for jobs and dynamic agent provisioning. To make sure it had sufficient access, I created a service account with a ClusterRole and a ClusterRoleBinding.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">rbac.authorization.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterRole</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">admin-jenkins</span>
<span class="hljs-attr">rules:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">apiGroups:</span> [<span class="hljs-string">""</span>]
    <span class="hljs-attr">resources:</span> [<span class="hljs-string">"*"</span>]
    <span class="hljs-attr">verbs:</span> [<span class="hljs-string">"*"</span>]
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ServiceAccount</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">admin-jenkins</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">jenkins</span>
<span class="hljs-meta">---</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">rbac.authorization.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterRoleBinding</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">admin-jenkins</span>
<span class="hljs-attr">roleRef:</span>
  <span class="hljs-attr">apiGroup:</span> <span class="hljs-string">rbac.authorization.k8s.io</span>
  <span class="hljs-attr">kind:</span> <span class="hljs-string">ClusterRole</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">admin-jenkins</span>
<span class="hljs-attr">subjects:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">kind:</span> <span class="hljs-string">ServiceAccount</span>
    <span class="hljs-attr">name:</span> <span class="hljs-string">admin-jenkins</span>
    <span class="hljs-attr">namespace:</span> <span class="hljs-string">jenkins</span>
</code></pre>
<p>This gave Jenkins full administrative access to the cluster, which is acceptable for a controlled environment. In production, it is recommended to restrict permissions according to actual needs.</p>
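<p>As a sketch of what such restriction could look like: the Jenkins Kubernetes plugin mostly needs to manage agent pods in its own namespace, so a namespaced Role (bound with a RoleBinding instead of the ClusterRoleBinding above) with roughly these verbs is usually enough. The rule set here is an assumption to adapt, not a drop-in replacement:</p>
<pre><code class="lang-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: jenkins-agents
  namespace: jenkins
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "delete", "get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec", "pods/log"]
    verbs: ["get", "create", "watch"]
</code></pre>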
<hr />
<h3 id="heading-deploying-jenkins">Deploying Jenkins</h3>
<p>With storage and RBAC in place, I created the deployment for Jenkins. The image used was <code>jenkins/jenkins:lts</code>. I ensured that the pod always runs on the worker node and uses the persistent volume claim for <code>/var/jenkins_home</code>.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">apps/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Deployment</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">deployment-jenkins</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">jenkins</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">1</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">matchLabels:</span>
      <span class="hljs-attr">app:</span> <span class="hljs-string">server-jenkins</span>
  <span class="hljs-attr">template:</span>
    <span class="hljs-attr">metadata:</span>
      <span class="hljs-attr">labels:</span>
        <span class="hljs-attr">app:</span> <span class="hljs-string">server-jenkins</span>
    <span class="hljs-attr">spec:</span>
      <span class="hljs-attr">securityContext:</span>
        <span class="hljs-attr">fsGroup:</span> <span class="hljs-number">1000</span>
        <span class="hljs-attr">runAsUser:</span> <span class="hljs-number">1000</span>
      <span class="hljs-attr">serviceAccountName:</span> <span class="hljs-string">admin-jenkins</span>
      <span class="hljs-attr">nodeSelector:</span>
        <span class="hljs-attr">kubernetes.io/hostname:</span> <span class="hljs-string">talos-1to-jsz</span>
      <span class="hljs-attr">containers:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">deployment-jenkins</span>
          <span class="hljs-attr">image:</span> <span class="hljs-string">jenkins/jenkins:lts</span>
          <span class="hljs-attr">resources:</span>
            <span class="hljs-attr">limits:</span>
              <span class="hljs-attr">memory:</span> <span class="hljs-string">"2Gi"</span>
              <span class="hljs-attr">cpu:</span> <span class="hljs-string">"1000m"</span>
            <span class="hljs-attr">requests:</span>
              <span class="hljs-attr">memory:</span> <span class="hljs-string">"500Mi"</span>
              <span class="hljs-attr">cpu:</span> <span class="hljs-string">"500m"</span>
          <span class="hljs-attr">ports:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">httpport</span>
              <span class="hljs-attr">containerPort:</span> <span class="hljs-number">8080</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">jnlpport</span>
              <span class="hljs-attr">containerPort:</span> <span class="hljs-number">50000</span>
          <span class="hljs-attr">livenessProbe:</span>
            <span class="hljs-attr">httpGet:</span>
              <span class="hljs-attr">path:</span> <span class="hljs-string">"/login"</span>
              <span class="hljs-attr">port:</span> <span class="hljs-number">8080</span>
            <span class="hljs-attr">initialDelaySeconds:</span> <span class="hljs-number">90</span>
            <span class="hljs-attr">periodSeconds:</span> <span class="hljs-number">10</span>
          <span class="hljs-attr">readinessProbe:</span>
            <span class="hljs-attr">httpGet:</span>
              <span class="hljs-attr">path:</span> <span class="hljs-string">"/login"</span>
              <span class="hljs-attr">port:</span> <span class="hljs-number">8080</span>
            <span class="hljs-attr">initialDelaySeconds:</span> <span class="hljs-number">60</span>
            <span class="hljs-attr">periodSeconds:</span> <span class="hljs-number">10</span>
          <span class="hljs-attr">volumeMounts:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">data-jenkins</span>
              <span class="hljs-attr">mountPath:</span> <span class="hljs-string">/var/jenkins_home</span>
      <span class="hljs-attr">volumes:</span>
        <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">data-jenkins</span>
          <span class="hljs-attr">persistentVolumeClaim:</span>
            <span class="hljs-attr">claimName:</span> <span class="hljs-string">pvc-jenkins</span>
</code></pre>
<p>Once deployed, the pod initially went into a <strong>Pending</strong> state because the PersistentVolume node affinity did not match. After correcting the hostname to <code>talos-1to-jsz</code>, it started running successfully.</p>
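<p>The mismatch is easy to spot from the scheduler's events. When a pod sticks in <code>Pending</code>, these two commands usually name the exact constraint that failed (node affinity, in my case):</p>
<pre><code class="lang-bash">kubectl describe pod -n jenkins -l app=server-jenkins
kubectl get events -n jenkins --sort-by=.lastTimestamp
</code></pre>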
<hr />
<h3 id="heading-creating-the-service">Creating the Service</h3>
<p>To expose Jenkins internally, I created a simple ClusterIP service. This would later be used by the ingress controller.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Service</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">service-jenkins</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">jenkins</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">selector:</span>
    <span class="hljs-attr">app:</span> <span class="hljs-string">server-jenkins</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">ClusterIP</span>
  <span class="hljs-attr">ports:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">port:</span> <span class="hljs-number">8080</span>
      <span class="hljs-attr">targetPort:</span> <span class="hljs-number">8080</span>
</code></pre>
<p>At this point, I verified that the service correctly routed traffic to the pod. For a quick test, I used port forwarding:</p>
<pre><code class="lang-yaml"><span class="hljs-string">kubectl</span> <span class="hljs-string">port-forward</span> <span class="hljs-string">-n</span> <span class="hljs-string">jenkins</span> <span class="hljs-string">deployment/deployment-jenkins</span> <span class="hljs-number">8080</span><span class="hljs-string">:8080</span>
</code></pre>
<p>Opening <a target="_blank" href="http://localhost:8080"><code>http://localhost:8080</code></a> brought up the Jenkins setup page.</p>
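<p>If the setup page does not load, a quick way to confirm the Service selector actually matches the pod labels is to inspect the endpoints; an empty list usually means a selector/label mismatch:</p>
<pre><code class="lang-bash"># Should list the pod IP on port 8080; empty output means the selector matched no pods
kubectl get endpoints -n jenkins service-jenkins
</code></pre>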
<hr />
<h3 id="heading-unlocking-jenkins">Unlocking Jenkins</h3>
<p>During the first startup, Jenkins requires an administrator password stored inside the container. The message on the web interface pointed to the file <code>/var/jenkins_home/secrets/initialAdminPassword</code>. I retrieved it using:</p>
<pre><code class="lang-bash">kubectl exec -it -n jenkins deployment/deployment-jenkins -- cat /var/jenkins_home/secrets/initialAdminPassword
</code></pre>
<p>After entering this password in the web interface, Jenkins allowed me to continue the setup and install the recommended plugins.</p>
<hr />
<h3 id="heading-exposing-jenkins-through-ingress">Exposing Jenkins through Ingress</h3>
<p>Once Jenkins was fully functional, I exposed it externally using my NGINX ingress controller. Since I already had MetalLB and a working ingress for another service (kubenav), I followed the same pattern.</p>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">networking.k8s.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Ingress</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">jenkins</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">jenkins</span>
  <span class="hljs-attr">annotations:</span>
    <span class="hljs-attr">nginx.ingress.kubernetes.io/rewrite-target:</span> <span class="hljs-string">/</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">ingressClassName:</span> <span class="hljs-string">nginx</span>
  <span class="hljs-attr">rules:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">host:</span> <span class="hljs-string">jenkins.nyzex.in</span>
      <span class="hljs-attr">http:</span>
        <span class="hljs-attr">paths:</span>
          <span class="hljs-bullet">-</span> <span class="hljs-attr">path:</span> <span class="hljs-string">/</span>
            <span class="hljs-attr">pathType:</span> <span class="hljs-string">Prefix</span>
            <span class="hljs-attr">backend:</span>
              <span class="hljs-attr">service:</span>
                <span class="hljs-attr">name:</span> <span class="hljs-string">service-jenkins</span>
                <span class="hljs-attr">port:</span>
                  <span class="hljs-attr">number:</span> <span class="hljs-number">8080</span>
</code></pre>
<p>After applying this and pointing my DNS entry <a target="_blank" href="http://jenkins.nyzex.in"><code>jenkins.nyzex.in</code></a> to the MetalLB IP of my ingress controller, I was able to access Jenkins directly at:</p>
<pre><code class="lang-plaintext">http://jenkins.nyzex.in
</code></pre>
<p>It loaded perfectly through the ingress, confirming that the setup worked as intended.</p>
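<p>Before updating DNS, it is worth confirming that the Ingress actually picked up the MetalLB address; the ADDRESS column should show the IP assigned to the ingress controller:</p>
<pre><code class="lang-bash"># ADDRESS should show the MetalLB IP of the ingress controller
kubectl get ingress -n jenkins jenkins
</code></pre>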
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763057090356/ffe983e5-075b-48fb-bb15-235000644da6.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763057104997/c8888f4c-6194-4033-a2cd-b9927e91ad76.png" alt class="image--center mx-auto" /></p>
<p>Jenkins is now ready to use!</p>
<hr />
<h3 id="heading-conclusion">Conclusion</h3>
<p>This exercise helped me understand how Jenkins interacts with Kubernetes components such as PersistentVolumes, ServiceAccounts, and Ingress controllers. It also emphasized the importance of node affinity in local storage setups, especially when working with Talos nodes.</p>
<p>With this configuration, Jenkins runs reliably on my worker node, stores its data persistently, and is accessible through my domain managed via MetalLB and Pangolin. The next step will be to integrate Jenkins with GitHub and container registries to build a complete CI workflow.</p>
]]></content:encoded></item></channel></rss>