Building a Production-Ready Kubernetes Platform on Azure with GitOps: A Deep Dive

When I started working on this project, the goal sounded simple:
deploy a microservices application stack on Azure, make it secure, scalable, observable, and easy for developers to ship changes.
In reality, it touched almost every part of cloud-native architecture: networking, databases, CI/CD, monitoring, and Kubernetes internals. This blog is my walkthrough of how everything fit together, what decisions we made, and the lessons I learned along the way.
The stack included:
Azure Kubernetes Service (AKS)
Azure Postgres Flexible Server (Private Access)
ACR, GitHub Actions, ArgoCD GitOps
RabbitMQ, DragonflyDB, ClickHouse, Prometheus, Loki, Jaeger
The goal was to create a secure, scalable, and observable microservices deployment environment while maintaining developer-friendly workflows.
Why We Chose Azure + AKS
We needed:
A managed Kubernetes platform → Azure Kubernetes Service (AKS) made sense.
A managed PostgreSQL database with the ability to keep it private → Azure Postgres Flexible Server.
Native container registry integration → Azure Container Registry (ACR).
A CI/CD setup that didn't depend on manual triggers → GitHub Actions + ArgoCD GitOps.
But the key requirement was:
The Postgres database must not be publicly accessible.
It must only be reachable from inside the same Azure network.
This requirement shaped many of the decisions in the architecture.
Infrastructure with Terraform
All Azure resources were created using Terraform:
Resource Groups
Virtual Network + Subnets
AKS Cluster (with Managed Identity & Node Pools)
Azure Container Registry
Postgres Flexible Server (Private Networking Mode)
Terraform handled provisioning:
```
/infra
├── main.tf
├── variables.tf
├── outputs.tf
└── modules/
    ├── network/
    ├── aks/
    ├── postgres/
    └── acr/
```
Network Module
Created a VNet with multiple subnets to isolate components:
| Subnet | Purpose | Notes |
| --- | --- | --- |
| aks-subnet | Worker nodes + Pods | Delegated to Microsoft.ContainerService |
| db-subnet | Postgres Flexible Server | Service endpoint + private DNS required |
Resource Group
This is the foundational unit in Azure.
Everything (VNet, AKS, DB, ACR) is placed inside this resource group.
```hcl
resource "azurerm_resource_group" "rg" {
  name     = var.resource_group_name
  location = var.location
}
```
Example variables:
```hcl
variable "resource_group_name" {
  description = "Name of the Azure Resource Group"
  type        = string
}

variable "location" {
  description = "Azure Region"
  type        = string
  default     = "eastus"
}
```
All other resources will reference `azurerm_resource_group.rg.name` and `.location`.
Virtual Network & Subnets (with Isolation)
```
Resource Group
└── Virtual Network
    ├── aks-subnet (for cluster nodes & pods)
    └── db-subnet (for private Postgres)
```
Terraform:
```hcl
resource "azurerm_virtual_network" "vnet" {
  name                = "main-vnet"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  address_space       = ["10.0.0.0/16"]
}
```
AKS Subnet
```hcl
resource "azurerm_subnet" "aks" {
  name                 = "aks-subnet"
  resource_group_name  = azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.0.1.0/24"]

  delegation {
    name = "aks-delegation"
    service_delegation {
      name = "Microsoft.ContainerService/managedClusters"
    }
  }
}
```
Postgres Flexible Server (Private)
resource "azurerm_postgresql_flexible_server" "postgres" { network { delegated_subnet_id = azurerm_subnet.db.id private_dns_zone_id = azurerm_private_dns_zone.postgres.id } }AKS Cluster
Used Managed Identity (no service principal passwords).
resource "azurerm_kubernetes_cluster" "aks" { identity { type = "SystemAssigned" } }Takeaway
The private DNS zone linking is what makes AKS → Postgres connectivity work.
If you miss that, queries fail even though networking looks correct.The above are just snippets, and not exactly what I used, but you get the idea!
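On the application side, pods simply connect to the server's FQDN, which resolves to a private IP inside the VNet thanks to that DNS zone link. A hypothetical sketch of how a workload consumes it (names, database, and credentials are placeholders):
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-backend-db
  namespace: preprod
type: Opaque
stringData:
  # This hostname only resolves (to a private IP) via the linked private DNS zone
  DATABASE_URL: "postgres://appuser:<password>@mydb.postgres.database.azure.com:5432/appdb?sslmode=require"
```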
CI: GitHub Actions → ACR
Here is a sample workflow file (one I often used) for the CI pipeline from GitHub Actions to Azure Container Registry (ACR):
```yaml
name: Build and Push to ACR

on:
  push:
    branches:
      - preprod

env:
  IMAGE_NAME: my-backend

permissions:
  id-token: write
  contents: read

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Source Code
        uses: actions/checkout@v4

      - name: Azure Login (OIDC)
        uses: azure/login@v1
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Verify Azure Session
        run: az account show

      - name: Docker Login to ACR
        uses: azure/docker-login@v1
        with:
          login-server: ${{ secrets.AZURE_URL }}
          username: ${{ secrets.ACR_USERNAME }}
          password: ${{ secrets.ACR_PASSWORD }}

      - name: Determine Latest Tag and Bump Version
        id: versioning
        run: |
          echo "Fetching latest image tag..."
          latest=$(az acr repository show-tags \
            --name ${{ secrets.AZURE_REGISTRY_NAME }} \
            --repository ${{ env.IMAGE_NAME }} \
            --orderby time_desc \
            --output tsv --query '[0]') || latest="0.0.0"
          echo "Latest version: $latest"
          IFS='.' read -r major minor patch <<< "$latest"
          new_patch=$((patch + 1))
          new_tag="$major.$minor.$new_patch"
          echo "New version: $new_tag"
          echo "IMAGE_TAG=$new_tag" >> $GITHUB_ENV

      - name: Build & Push Image to ACR
        uses: docker/build-push-action@v2
        with:
          push: true
          tags: |
            ${{ secrets.AZURE_URL }}/${{ env.IMAGE_NAME }}:${{ env.IMAGE_TAG }}

      - name: Notify Discord
        uses: sarisia/actions-status-discord@v1
        if: always()
        with:
          webhook: ${{ secrets.DISCORD_WEBHOOK }}
          message: |
            *Build & Push Completed*
            Repository: `${{ env.IMAGE_NAME }}`
            Version: `${{ env.IMAGE_TAG }}`
            Registry: `${{ secrets.AZURE_URL }}`
```
Note: you have to set these secrets in GitHub and have a federated credential issued from Azure for the above CI to work!
When developers pushed code to the preprod branch, GitHub Actions:
Logged into Azure via OIDC (federated credentials).
Pulled the latest image tag from ACR.
Incremented the patch version (1.2.7 → 1.2.8).
Built the Docker image.
Pushed it back to ACR.
Sent a Discord notification.
Why auto-increment tags?
No manual versioning needed.
ArgoCD can detect updated tags and redeploy automatically.
Why not use the `latest` tag?
Because in Kubernetes, `latest` means uncontrolled deployments and broken rollbacks.
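For illustration (the registry URL and names here are hypothetical), this is the kind of pinned image reference that ends up in the manifest repo, which ArgoCD then syncs:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-backend
  template:
    metadata:
      labels:
        app: my-backend
    spec:
      containers:
        - name: my-backend
          # Pinned tag bumped by CI; rolling back is just reverting this line in Git
          image: myregistry.azurecr.io/my-backend:1.2.8
          ports:
            - containerPort: 8080
```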
Important Improvement (Recommended)
The azure/login step above already authenticates via OIDC (hence the id-token: write permission), but the ACR login still relies on a stored username and password. You can drop those static credentials entirely by logging into ACR through the same OIDC session:
```yaml
permissions:
  id-token: write

# ...then, after the azure/login (OIDC) step, replace azure/docker-login with:
- name: Login to ACR via Azure CLI
  run: az acr login --name ${{ secrets.AZURE_REGISTRY_NAME }}
```
This part was surprisingly fun!
There is also the federated credential setup to be done on the Azure side; otherwise GitHub Actions cannot push to ACR. Perhaps that will be covered in an in-depth Azure CI/CD article, so stay tuned!
CD: GitOps with ArgoCD
This was where the deployment pipeline got elegant.
Instead of GitHub Actions pushing to Kubernetes, ArgoCD pulls from Git.
Each microservice had its Kubernetes manifests in a separate repo.
ArgoCD continuously monitored those repos.
When an image tag changed → ArgoCD automatically synced the workload to AKS.
This gave us:
Version-controlled cluster state:
- Easy rollbacks
- Drift detection (“actual cluster ≠ what Git says it should be”)
- No kubectl on engineers’ laptops needed
ArgoCD continuously:
Watches Git
Detects manifest change or image tag update
Syncs cluster state to match
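As a rough sketch (the repo URL, paths, and names are placeholders, not our actual setup), an ArgoCD Application for one service looked something like this:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-backend
  namespace: argocd
spec:
  project: default
  source:
    # Manifest repo whose image tag gets bumped on each release
    repoURL: https://github.com/example-org/my-backend-manifests.git
    targetRevision: main
    path: overlays/preprod
  destination:
    server: https://kubernetes.default.svc
    namespace: preprod
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```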
What this enabled:
| Feature | Result |
| --- | --- |
| Rollback | `argocd app rollback service-1` |
| Drift Detection | Alerts when someone manually runs `kubectl apply` |
| Deployment History | Full auditability |
This eliminated the need for engineers to use kubectl in most cases.
This is when the system started feeling production-grade.
Internal Shared Services Inside AKS
We deployed multiple platform services inside the cluster:
| Component | Purpose |
| --- | --- |
| RabbitMQ | Async messaging |
| DragonflyDB | Redis alternative (lower memory usage) |
| ClickHouse | Real-time analytics storage |
| Uptime Kuma | Service uptime monitoring |
| Prometheus + Grafana | Metrics + dashboards |
| Loki | Log aggregation (stored logs in Azure Blob → cheaper) |
| Jaeger | Distributed tracing |
Some workloads were stateful, so we used:
Persistent Volume Claims (PVCs)
Backed by Azure Managed Disks
This ensured durability across pod restarts.
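For example, a stateful component's volume claim looked roughly like this (size, namespace, and names are illustrative; `managed-csi` is AKS's built-in Azure Disk storage class):
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clickhouse-data
  namespace: platform
spec:
  accessModes:
    - ReadWriteOnce          # Azure Managed Disks attach to a single node
  storageClassName: managed-csi
  resources:
    requests:
      storage: 100Gi
```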
How Services Were Exposed (Ingress Setup)
We used:
NGINX Ingress Controller
Cert-Manager for auto HTTPS
Example ingress rule:
host: api.<domain>
service: my-backend
port: 8080
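Expanded into an actual manifest, that rule would look something like this (the host, names, and ClusterIssuer are placeholders; assumes the NGINX Ingress Controller and cert-manager are already installed):
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-backend
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # cert-manager issues the TLS cert
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-example-com-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-backend
                port:
                  number: 8080
```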
A Cool Trick: Internal Env-Config Service
We built a small internal admin service that lets us update environment variables in namespaces, but it is not exposed publicly; we call it "InfraAdmin".
We only access it by:
kubectl port-forward service/infra-admin 8080:8080
This helped avoid:
editing Secrets manually
restarts in uncontrolled ways
exposing admin endpoints to the world
It feels “small”, but it made developer workflows smooth.
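Keeping it private is simply a matter of giving it a ClusterIP Service and never creating an Ingress for it (a sketch, with hypothetical names):
```yaml
apiVersion: v1
kind: Service
metadata:
  name: infra-admin
  namespace: platform
spec:
  type: ClusterIP      # only reachable inside the cluster, or via kubectl port-forward
  selector:
    app: infra-admin
  ports:
    - port: 8080
      targetPort: 8080
```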
Observability: Seeing Everything
To run production systems confidently, we needed visibility:
Grafana dashboards for application performance
Loki + Blob Storage for cost-efficient logs
Jaeger to trace slow requests across microservices
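To feed those dashboards, each service just needed to expose metrics and be scraped by Prometheus. Assuming the Prometheus Operator is installed (e.g., via kube-prometheus-stack), that's a small ServiceMonitor; names, namespaces, and ports here are illustrative:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-backend
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-backend
  namespaceSelector:
    matchNames:
      - preprod
  endpoints:
    - port: http         # named Service port exposing /metrics
      path: /metrics
      interval: 30s
```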
This helped us answer questions like:
Why is checkout slow?
Which service is causing the bottleneck?
Is latency due to DB, network, or internal API calls?
This is where Kubernetes truly felt like a platform, not “containers glued together”.
Challenges Faced
| Problem | Solution |
| --- | --- |
| Private Postgres connectivity failures | Correct subnet delegation + DNS zone linking |
| Image version conflicts | Automated semantic version bump script |
| Log storage cost growing quickly | Moved Loki storage from disk → Azure Blob backend |
| Too many manual deployments | Adopted full GitOps with ArgoCD |
Every issue taught something valuable about how cloud platforms behave in real environments.
Debugging & Troubleshooting Lessons
| Issue | Fix |
| --- | --- |
| Pod-to-DB connection errors | Postgres private DNS zone linking |
| RabbitMQ memory pressure | Switched memory allocator + tuned consumer prefetch |
| ClickHouse OOM | Set max_memory_usage & used ZSTD table compression |
| Loki disk pressure | Switched to Azure Blob |
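The Loki change, for reference, is mostly a matter of pointing its object store at Blob Storage. Roughly (account and container names are placeholders, and exact keys vary by Loki version):
```yaml
storage_config:
  azure:
    account_name: mylokistorageacct
    account_key: ${AZURE_STORAGE_KEY}   # placeholder; better sourced from a secret or managed identity
    container_name: loki-chunks

schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: azure      # chunks and index land in Blob instead of local disk
      schema: v13
      index:
        prefix: index_
        period: 24h
```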
What I Learned
Kubernetes is not difficult: networking and identity are.
GitOps dramatically simplifies CI/CD pipelines.
Observability stack is mandatory, not optional.
Cost controls must be built from day one.
Perhaps we will cover the Azure setup in depth in a separate article, along with the monitoring setup (including BotKube and Uptime Kuma). This post was meant to walk through the steps and decisions from my project on Azure, and it was a fun learning experience!






