
Building a Production-Ready Kubernetes Platform on Azure with GitOps: A Deep Dive


When I started working on this project, the goal sounded simple:
deploy a microservices application stack on Azure, make it secure, scalable, observable, and easy for developers to ship changes.

In reality, it touched almost every part of cloud-native architecture: networking, databases, CI/CD, monitoring, and Kubernetes internals. This blog is my walkthrough of how everything fit together, what decisions we made, and the lessons I learned along the way.

The stack included:

  • Azure Kubernetes Service (AKS)

  • Azure Postgres Flexible Server (Private Access)

  • ACR, GitHub Actions, ArgoCD GitOps

  • RabbitMQ, DragonflyDB, ClickHouse, Prometheus, Loki, Jaeger

The goal was to create a secure, scalable, and observable microservices deployment environment while maintaining developer-friendly workflows.

Why We Chose Azure + AKS

We needed:

  • A managed Kubernetes platform → Azure Kubernetes Service (AKS) made sense.

  • A managed PostgreSQL database with the ability to keep it private → Azure Postgres Flexible Server.

  • Native container registry integration → Azure Container Registry (ACR).

  • A CI/CD setup that didn't depend on manual triggers → GitHub Actions + ArgoCD GitOps.

But the key requirement was:

The Postgres database must not be publicly accessible.
It must only be reachable from inside the same Azure network.

This requirement shaped many of the decisions in the architecture.

Infrastructure with Terraform

All Azure resources were created using Terraform:

  • Resource Groups

  • Virtual Network + Subnets

  • AKS Cluster (with Managed Identity & Node Pools)

  • Azure Container Registry

  • Postgres Flexible Server (Private Networking Mode)

Terraform handled provisioning with the following module layout:

      /infra
      ├── main.tf
      ├── variables.tf
      ├── outputs.tf
      └── modules/
          ├── network/
          ├── aks/
          ├── postgres/
          └── acr/
    

    Network Module

    Created a VNet with multiple subnets to isolate components:

    | Subnet | Purpose | Notes |
    | --- | --- | --- |
    | aks-subnet | Worker nodes + Pods | Delegated to Microsoft.ContainerService |
    | db-subnet | Postgres Flexible Server | Service endpoint + private DNS required |

    Resource Group

    This is the foundational unit in Azure.
    Everything (VNet, AKS, DB, ACR) is placed inside this resource group.

      resource "azurerm_resource_group" "rg" {
        name     = var.resource_group_name
        location = var.location
      }
    

    Example variables:

      variable "resource_group_name" {
        description = "Name of the Azure Resource Group"
        type        = string
      }
    
      variable "location" {
        description = "Azure Region"
        type        = string
        default     = "eastus"
      }
    

    All other resources will reference azurerm_resource_group.rg.name and .location.

    Virtual Network & Subnets (with Isolation)

      Resource Group
         └── Virtual Network
              ├── aks-subnet (for cluster nodes & pods)
              └── db-subnet (for private Postgres)
    

    Terraform:

      resource "azurerm_virtual_network" "vnet" {
        name                = "main-vnet"
        location            = azurerm_resource_group.rg.location
        resource_group_name = azurerm_resource_group.rg.name
        address_space       = ["10.0.0.0/16"]
      }
    

    AKS Subnet

      resource "azurerm_subnet" "aks" {
        name                 = "aks-subnet"
        resource_group_name  = azurerm_resource_group.rg.name
        virtual_network_name = azurerm_virtual_network.vnet.name
        address_prefixes     = ["10.0.1.0/24"]
    
        delegation {
          name = "aks-delegation"
          service_delegation {
            name = "Microsoft.ContainerService/managedClusters"
          }
        }
      }
    

    Postgres Flexible Server (Private)

      resource "azurerm_postgresql_flexible_server" "postgres" {
        network {
          delegated_subnet_id = azurerm_subnet.db.id
          private_dns_zone_id = azurerm_private_dns_zone.postgres.id
        }
      }
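
    The snippet above references a db subnet and a private DNS zone that have to exist as well. Here is a minimal sketch of what those can look like (names, CIDRs, and the DNS zone prefix are illustrative, not my exact config):

      # Dedicated subnet for the Flexible Server, delegated to the Postgres service
      resource "azurerm_subnet" "db" {
        name                 = "db-subnet"
        resource_group_name  = azurerm_resource_group.rg.name
        virtual_network_name = azurerm_virtual_network.vnet.name
        address_prefixes     = ["10.0.2.0/24"]

        delegation {
          name = "postgres-delegation"
          service_delegation {
            name    = "Microsoft.DBforPostgreSQL/flexibleServers"
            actions = ["Microsoft.Network/virtualNetworks/subnets/join/action"]
          }
        }
      }

      # Private DNS zone for the server (zone name typically ends in postgres.database.azure.com)
      resource "azurerm_private_dns_zone" "postgres" {
        name                = "myproject.postgres.database.azure.com"
        resource_group_name = azurerm_resource_group.rg.name
      }

      # Link the zone to the VNet so AKS pods can resolve the private DB hostname
      resource "azurerm_private_dns_zone_virtual_network_link" "postgres" {
        name                  = "postgres-dns-link"
        resource_group_name   = azurerm_resource_group.rg.name
        private_dns_zone_name = azurerm_private_dns_zone.postgres.name
        virtual_network_id    = azurerm_virtual_network.vnet.id
      }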
    

    AKS Cluster

    Used Managed Identity (no service principal passwords).

      resource "azurerm_kubernetes_cluster" "aks" {
        identity {
          type = "SystemAssigned"
        }
      }
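
    To let AKS pull images from ACR without registry passwords, the cluster's kubelet identity also needs the AcrPull role. A rough sketch, assuming the registry resource is named azurerm_container_registry.acr:

      # Grant the AKS kubelet identity pull access on the container registry
      resource "azurerm_role_assignment" "aks_acr_pull" {
        scope                = azurerm_container_registry.acr.id
        role_definition_name = "AcrPull"
        principal_id         = azurerm_kubernetes_cluster.aks.kubelet_identity[0].object_id
      }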
    

    Takeaway

    The private DNS zone linking is what makes AKS → Postgres connectivity work.
    If you miss that, queries fail even though networking looks correct.

    The above are just snippets, and not exactly what I used, but you get the idea!

CI: GitHub Actions → ACR

Here is a sample workflow file (one I often used) for the CI pipeline: GitHub Actions → Azure Container Registry (ACR).

name: Build and Push to ACR 

on:
  push:
    branches:
      - preprod

env:
  IMAGE_NAME: my-backend

permissions:
  id-token: write
  contents: read

jobs:
  build-and-push:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout Source Code
        uses: actions/checkout@v4

      - name: Azure Login (OIDC)
        uses: azure/login@v1
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Verify Azure Session
        run: az account show

      - name: Docker Login to ACR
        uses: azure/docker-login@v1
        with:
          login-server: ${{ secrets.AZURE_URL }}       
          username: ${{ secrets.ACR_USERNAME }}
          password: ${{ secrets.ACR_PASSWORD }}

      - name: Determine Latest Tag and Bump Version
        id: versioning
        run: |
          echo "Fetching latest image tag..."
          latest=$(az acr repository show-tags \
            --name ${{ secrets.AZURE_REGISTRY_NAME }} \
            --repository ${{ env.IMAGE_NAME }} \
            --orderby time_desc \
            --output tsv --query '[0]') || latest="0.0.0"

          echo "Latest version: $latest"

          IFS='.' read -r major minor patch <<< "$latest"
          new_patch=$((patch + 1))
          new_tag="$major.$minor.$new_patch"

          echo "New version: $new_tag"
          echo "IMAGE_TAG=$new_tag" >> $GITHUB_ENV

      - name: Build & Push Image to ACR
        uses: docker/build-push-action@v2
        with:
          push: true
          tags: |
            ${{ secrets.AZURE_URL }}/${{ env.IMAGE_NAME }}:${{ env.IMAGE_TAG }}

      - name: Notify Discord
        uses: sarisia/actions-status-discord@v1
        if: always()
        with:
          webhook: ${{ secrets.DISCORD_WEBHOOK }}
          message: |
            *Build & Push Completed*
            Repository: `${{ env.IMAGE_NAME }}`
            Version: `${{ env.IMAGE_TAG }}`
            Registry: `${{ secrets.AZURE_URL }}`

Note: You have to set your secrets in GitHub and have a federated credential configured in Azure for the above CI to work!

When developers pushed code to the preprod branch, GitHub Actions:

  1. Logged into Azure via OIDC (federated credentials).

  2. Pulled the latest image tag from ACR.

  3. Incremented the patch version (1.2.7 → 1.2.8).

  4. Built the Docker image.

  5. Pushed it back to ACR.

  6. Sent a Discord notification.

Why auto-increment tags?

  • No manual versioning needed.

  • ArgoCD can detect updated tags and redeploy automatically.

Why not use latest tag?
Because in Kubernetes:

latest = uncontrolled deployments and broken rollbacks.

Replace the username/password ACR login with OIDC token-based auth. With the id-token permission set:

permissions:
  id-token: write

azure/login performs the federated (OIDC) exchange itself, so you can then log in to ACR without stored registry credentials:

- uses: azure/login@v1
  with:
    client-id: ${{ secrets.AZURE_CLIENT_ID }}
    tenant-id: ${{ secrets.AZURE_TENANT_ID }}
    subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

- name: Docker Login to ACR (no password)
  run: az acr login --name ${{ secrets.AZURE_REGISTRY_NAME }}

This part was surprisingly fun!

There is also federated credential setup to be done on the Azure side; without it, GitHub CI cannot push to ACR. Perhaps that will be covered in an in-depth Azure CI/CD article, so stay tuned!
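
As a quick teaser, the rough shape of that setup with the Azure CLI looks like this (the app ID, org/repo, and branch are placeholders):

az ad app federated-credential create \
  --id <APP_OBJECT_ID> \
  --parameters '{
    "name": "github-preprod",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:<org>/<repo>:ref:refs/heads/preprod",
    "audiences": ["api://AzureADTokenExchange"]
  }'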


CD: GitOps with ArgoCD

This was where the deployment pipeline got elegant.

Instead of GitHub Actions pushing to Kubernetes, ArgoCD pulls from Git.

  • Each microservice had its Kubernetes manifests in a separate repo.

  • ArgoCD continuously monitored those repos.

  • When an image tag changed → ArgoCD automatically synced the workload to AKS.

This gave us:

Version-controlled cluster state:
- Easy rollbacks
- Drift detection (“actual cluster ≠ what Git says it should be”)
- No kubectl on engineers’ laptops needed

ArgoCD continuously:

  • Watches Git

  • Detects manifest change or image tag update

  • Syncs cluster state to match
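
Here is a minimal sketch of what one of those ArgoCD Applications can look like (the repo URL, path, and namespaces are placeholders, not our actual values):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-backend
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/my-backend-manifests.git
    targetRevision: main
    path: overlays/preprod
  destination:
    server: https://kubernetes.default.svc
    namespace: preprod
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift back to the Git state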

What this enabled:

| Feature | Result |
| --- | --- |
| Rollback | argocd app rollback service-1 |
| Drift Detection | Alerts when someone manually kubectl apply's |
| Deployment History | Full auditability |

This eliminated the need for engineers to use kubectl in most cases.

This is when the system started feeling production-grade.

Internal Shared Services Inside AKS

We deployed multiple platform services inside the cluster:

| Component | Purpose |
| --- | --- |
| RabbitMQ | async messaging |
| DragonflyDB | Redis alternative (lower memory usage) |
| ClickHouse | real-time analytics storage |
| Uptime Kuma | service uptime monitoring |
| Prometheus + Grafana | metrics + dashboards |
| Loki | log aggregation (stored logs in Azure Blob → cheaper) |
| Jaeger | distributed tracing |

Some workloads were stateful, so we used:

  • Persistent Volume Claims (PVCs)

  • Backed by Azure Managed Disks

This ensured durability across pod restarts.
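
As an illustration, a stateful workload's volume claim looked roughly like this (the claim name and size are examples; managed-csi is the AKS-provided Azure Disk storage class):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clickhouse-data
spec:
  accessModes:
    - ReadWriteOnce         # single-node read/write, typical for a disk-backed database
  storageClassName: managed-csi
  resources:
    requests:
      storage: 50Gi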

How Services Were Exposed (Ingress Setup)

We used:

  • NGINX Ingress Controller

  • Cert-Manager for auto HTTPS

Example ingress rule:

host: api.<domain>
service: my-backend
port: 8080
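
Fleshed out, that rule becomes something like the following (host, TLS secret, and ClusterIssuer names are placeholders):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-backend
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # cert-manager issues the TLS cert
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-backend
                port:
                  number: 8080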

A Cool Trick: Internal Env-Config Service

We built a small internal admin service that lets us update environment variables in namespaces. It is not exposed publicly; we call it "InfraAdmin".

We only access it by:

kubectl port-forward service/infra-admin 8080:8080

This helped avoid:

  • editing Secrets manually

  • restarts in uncontrolled ways

  • exposing admin endpoints to the world

It feels “small”, but it made developer workflows smooth.


Observability: Seeing Everything

To run production systems confidently, we needed visibility:

  • Grafana dashboards for application performance

  • Loki + Blob Storage for cost-efficient logs

  • Jaeger to trace slow requests across microservices

This helped us answer questions like:

Why is checkout slow?
Which service is causing the bottleneck?
Is latency due to DB, network, or internal API calls?

This is where Kubernetes truly felt like a platform, not “containers glued together”.


Challenges Faced

| Problem | Solution |
| --- | --- |
| Private Postgres connectivity failures | Correct subnet delegation + DNS zone linking |
| Image version conflicts | Automated semantic version bump script |
| Log storage cost growing quickly | Moved Loki storage from disk → Azure Blob backend |
| Too many manual deployments | Adopted full GitOps with ArgoCD |

Every issue taught something valuable about how cloud platforms behave in real environments.

Debugging & Troubleshooting Lessons

| Issue | Fix |
| --- | --- |
| Pod to DB connection errors | Postgres Private DNS Zone linking |
| RabbitMQ memory pressure | Switched memory allocator + tuned consumer prefetch |
| ClickHouse OOM | Set max_memory_usage & used ZSTD table compression |
| Loki disk pressure | Switched to Azure Blob |
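
The Loki move deserves a snippet. Roughly, this is the shape of pointing Loki's object storage at Azure Blob (storage account, container, and schema details are illustrative; verify against the Loki docs for your version):

storage_config:
  azure:
    account_name: <storage-account>
    account_key: <storage-account-key>
    container_name: loki-chunks

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: azure      # chunks and index shipped to Blob instead of local disk
      schema: v13
      index:
        prefix: index_
        period: 24h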

What I Learned

  1. Kubernetes is not difficult: networking and identity are.

  2. GitOps dramatically simplifies CI/CD pipelines.

  3. Observability stack is mandatory, not optional.

  4. Cost controls must be built from day one.

Perhaps we will talk about Azure setups in depth in a different article, along with the monitoring setups (including BotKube and Uptime Kuma). This one was meant to walk through the steps and decisions from my project on Azure, and it was a fun learning experience!
