
Building a Production-Ready Kubernetes Platform on Azure with GitOps: A Deep Dive


When I started working on this project, the goal sounded simple:
deploy a microservices application stack on Azure, make it secure, scalable, observable, and easy for developers to ship changes.

In reality, it touched almost every part of cloud-native architecture: networking, databases, CI/CD, monitoring, and Kubernetes internals. This blog is my walkthrough of how everything fit together, what decisions we made, and the lessons I learned along the way.

The stack included:

  • Azure Kubernetes Service (AKS)

  • Azure Postgres Flexible Server (Private Access)

  • ACR, GitHub Actions, ArgoCD GitOps

  • RabbitMQ, DragonflyDB, ClickHouse, Prometheus, Loki, Jaeger

The goal was to create a secure, scalable, and observable microservices deployment environment while maintaining developer-friendly workflows.

Why We Chose Azure + AKS

We needed:

  • A managed Kubernetes platform → Azure Kubernetes Service (AKS) made sense.

  • A managed PostgreSQL database with the ability to keep it private → Azure Postgres Flexible Server.

  • Native container registry integration → Azure Container Registry (ACR).

  • A CI/CD setup that didn't depend on manual triggers → GitHub Actions + ArgoCD GitOps.

But the key requirement was:

The Postgres database must not be publicly accessible.
It must only be reachable from inside the same Azure network.

This requirement shaped many of the decisions in the architecture.

Infrastructure with Terraform

All Azure resources were created using Terraform:

  • Resource Groups

  • Virtual Network + Subnets

  • AKS Cluster (with Managed Identity & Node Pools)

  • Azure Container Registry

  • Postgres Flexible Server (Private Networking Mode)

Terraform handled provisioning with the following module layout:

      /infra
      ├── main.tf
      ├── variables.tf
      ├── outputs.tf
      └── modules/
          ├── network/
          ├── aks/
          ├── postgres/
          └── acr/
    

    Network Module

    Created a VNet with multiple subnets to isolate components:

    | Subnet | Purpose | Notes |
    | --- | --- | --- |
    | aks-subnet | Worker nodes + Pods | Delegated to Microsoft.ContainerService |
    | db-subnet | Postgres Flexible Server | Service endpoint + private DNS required |

    Resource Group

    This is the foundational unit in Azure.
    Everything (VNet, AKS, DB, ACR) is placed inside this resource group.

      resource "azurerm_resource_group" "rg" {
        name     = var.resource_group_name
        location = var.location
      }
    

    Example variables:

      variable "resource_group_name" {
        description = "Name of the Azure Resource Group"
        type        = string
      }
    
      variable "location" {
        description = "Azure Region"
        type        = string
        default     = "eastus"
      }
    

    All other resources will reference azurerm_resource_group.rg.name and .location.

    Virtual Network & Subnets (with Isolation)

      Resource Group
         └── Virtual Network
              ├── aks-subnet (for cluster nodes & pods)
              └── db-subnet (for private Postgres)
    

    Terraform:

      resource "azurerm_virtual_network" "vnet" {
        name                = "main-vnet"
        location            = azurerm_resource_group.rg.location
        resource_group_name = azurerm_resource_group.rg.name
        address_space       = ["10.0.0.0/16"]
      }
    

    AKS Subnet

      resource "azurerm_subnet" "aks" {
        name                 = "aks-subnet"
        resource_group_name  = azurerm_resource_group.rg.name
        virtual_network_name = azurerm_virtual_network.vnet.name
        address_prefixes     = ["10.0.1.0/24"]
    
        delegation {
          name = "aks-delegation"
          service_delegation {
            name = "Microsoft.ContainerService/managedClusters"
          }
        }
      }
    

    Postgres Flexible Server (Private)

      resource "azurerm_postgresql_flexible_server" "postgres" {
        network {
          delegated_subnet_id = azurerm_subnet.db.id
          private_dns_zone_id = azurerm_private_dns_zone.postgres.id
        }
      }
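
    The snippet above references a db subnet and a private DNS zone that have to exist as well. Here is a minimal sketch of what those can look like (names, CIDRs, and the DNS zone prefix are illustrative, not my exact config):

      # Dedicated subnet for the Flexible Server, delegated to the Postgres service
      resource "azurerm_subnet" "db" {
        name                 = "db-subnet"
        resource_group_name  = azurerm_resource_group.rg.name
        virtual_network_name = azurerm_virtual_network.vnet.name
        address_prefixes     = ["10.0.2.0/24"]

        delegation {
          name = "postgres-delegation"
          service_delegation {
            name    = "Microsoft.DBforPostgreSQL/flexibleServers"
            actions = ["Microsoft.Network/virtualNetworks/subnets/join/action"]
          }
        }
      }

      # Private DNS zone for the server (zone name typically ends in postgres.database.azure.com)
      resource "azurerm_private_dns_zone" "postgres" {
        name                = "myproject.postgres.database.azure.com"
        resource_group_name = azurerm_resource_group.rg.name
      }

      # Link the zone to the VNet so AKS pods can resolve the private DB hostname
      resource "azurerm_private_dns_zone_virtual_network_link" "postgres" {
        name                  = "postgres-dns-link"
        resource_group_name   = azurerm_resource_group.rg.name
        private_dns_zone_name = azurerm_private_dns_zone.postgres.name
        virtual_network_id    = azurerm_virtual_network.vnet.id
      }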
    

    AKS Cluster

    Used Managed Identity (no service principal passwords).

      resource "azurerm_kubernetes_cluster" "aks" {
        identity {
          type = "SystemAssigned"
        }
      }
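
    To let AKS pull images from ACR without registry passwords, the cluster's kubelet identity also needs the AcrPull role. A rough sketch, assuming the registry resource is named azurerm_container_registry.acr:

      # Grant the AKS kubelet identity pull access on the container registry
      resource "azurerm_role_assignment" "aks_acr_pull" {
        scope                = azurerm_container_registry.acr.id
        role_definition_name = "AcrPull"
        principal_id         = azurerm_kubernetes_cluster.aks.kubelet_identity[0].object_id
      }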
    

    Takeaway

    The private DNS zone linking is what makes AKS → Postgres connectivity work.
    If you miss that, queries fail even though networking looks correct.

    The above are just snippets, and not exactly what I used, but you get the idea!

CI: GitHub Actions → ACR

Here is a sample workflow file (one I often used) for the CI pipeline: GitHub Actions → Azure Container Registry (ACR).

name: Build and Push to ACR 

on:
  push:
    branches:
      - preprod

env:
  IMAGE_NAME: my-backend

permissions:
  id-token: write
  contents: read

jobs:
  build-and-push:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout Source Code
        uses: actions/checkout@v4

      - name: Azure Login (OIDC)
        uses: azure/login@v1
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Verify Azure Session
        run: az account show

      - name: Docker Login to ACR
        uses: azure/docker-login@v1
        with:
          login-server: ${{ secrets.AZURE_URL }}       
          username: ${{ secrets.ACR_USERNAME }}
          password: ${{ secrets.ACR_PASSWORD }}

      - name: Determine Latest Tag and Bump Version
        id: versioning
        run: |
          echo "Fetching latest image tag..."
          latest=$(az acr repository show-tags \
            --name ${{ secrets.AZURE_REGISTRY_NAME }} \
            --repository ${{ env.IMAGE_NAME }} \
            --orderby time_desc \
            --output tsv --query '[0]') || latest="0.0.0"

          echo "Latest version: $latest"

          IFS='.' read -r major minor patch <<< "$latest"
          new_patch=$((patch + 1))
          new_tag="$major.$minor.$new_patch"

          echo "New version: $new_tag"
          echo "IMAGE_TAG=$new_tag" >> $GITHUB_ENV

      - name: Build & Push Image to ACR
        uses: docker/build-push-action@v2
        with:
          push: true
          tags: |
            ${{ secrets.AZURE_URL }}/${{ env.IMAGE_NAME }}:${{ env.IMAGE_TAG }}

      - name: Notify Discord
        uses: sarisia/actions-status-discord@v1
        if: always()
        with:
          webhook: ${{ secrets.DISCORD_WEBHOOK }}
          message: |
            *Build & Push Completed*
            Repository: `${{ env.IMAGE_NAME }}`
            Version: `${{ env.IMAGE_TAG }}`
            Registry: `${{ secrets.AZURE_URL }}`

Note: You have to set your secrets in GitHub and have a federated credential configured in Azure for the above CI to work!

When developers pushed code to the preprod branch, GitHub Actions:

  1. Logged into Azure via OIDC (federated credentials).

  2. Pulled the latest image tag from ACR.

  3. Incremented the patch version (1.2.7 → 1.2.8).

  4. Built the Docker image.

  5. Pushed it back to ACR.

  6. Sent a Discord notification.

Why auto-increment tags?

  • No manual versioning needed.

  • ArgoCD can detect updated tags and redeploy automatically.

Why not use latest tag?
Because in Kubernetes:

latest = uncontrolled deployments and broken rollbacks.

Replace the username/password ACR login with OIDC token-based auth. With the id-token permission set:

permissions:
  id-token: write

azure/login performs the federated (OIDC) exchange itself, so you can then log in to ACR without stored registry credentials:

- uses: azure/login@v1
  with:
    client-id: ${{ secrets.AZURE_CLIENT_ID }}
    tenant-id: ${{ secrets.AZURE_TENANT_ID }}
    subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

- name: Docker Login to ACR (no password)
  run: az acr login --name ${{ secrets.AZURE_REGISTRY_NAME }}

This part was surprisingly fun!

There is also federated credential setup to be done on the Azure side; without it, GitHub CI cannot push to ACR. Perhaps that will be covered in an in-depth Azure CI/CD article, so stay tuned!
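
As a quick teaser, the rough shape of that setup with the Azure CLI looks like this (the app ID, org/repo, and branch are placeholders):

az ad app federated-credential create \
  --id <APP_OBJECT_ID> \
  --parameters '{
    "name": "github-preprod",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:<org>/<repo>:ref:refs/heads/preprod",
    "audiences": ["api://AzureADTokenExchange"]
  }'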


CD: GitOps with ArgoCD

This was where the deployment pipeline got elegant.

Instead of GitHub Actions pushing to Kubernetes, ArgoCD pulls from Git.

  • Each microservice had its Kubernetes manifests in a separate repo.

  • ArgoCD continuously monitored those repos.

  • When an image tag changed → ArgoCD automatically synced the workload to AKS.

This gave us:

Version-controlled cluster state:
- Easy rollbacks
- Drift detection (“actual cluster ≠ what Git says it should be”)
- No kubectl on engineers’ laptops needed

ArgoCD continuously:

  • Watches Git

  • Detects manifest change or image tag update

  • Syncs cluster state to match
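
Here is a minimal sketch of what one of those ArgoCD Applications can look like (the repo URL, path, and namespaces are placeholders, not our actual values):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-backend
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/my-backend-manifests.git
    targetRevision: main
    path: overlays/preprod
  destination:
    server: https://kubernetes.default.svc
    namespace: preprod
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift back to the Git state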

What this enabled:

| Feature | Result |
| --- | --- |
| Rollback | argocd app rollback service-1 |
| Drift Detection | Alerts when someone manually kubectl apply's |
| Deployment History | Full auditability |

This eliminated the need for engineers to use kubectl in most cases.

This is when the system started feeling production-grade.

Internal Shared Services Inside AKS

We deployed multiple platform services inside the cluster:

| Component | Purpose |
| --- | --- |
| RabbitMQ | async messaging |
| DragonflyDB | Redis alternative (lower memory usage) |
| ClickHouse | real-time analytics storage |
| Uptime Kuma | service uptime monitoring |
| Prometheus + Grafana | metrics + dashboards |
| Loki | log aggregation (stored logs in Azure Blob → cheaper) |
| Jaeger | distributed tracing |

Some workloads were stateful, so we used:

  • Persistent Volume Claims (PVCs)

  • Backed by Azure Managed Disks

This ensured durability across pod restarts.
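
As an illustration, a stateful workload's volume claim looked roughly like this (the claim name and size are examples; managed-csi is the AKS-provided Azure Disk storage class):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clickhouse-data
spec:
  accessModes:
    - ReadWriteOnce         # single-node read/write, typical for a disk-backed database
  storageClassName: managed-csi
  resources:
    requests:
      storage: 50Gi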

How Services Were Exposed (Ingress Setup)

We used:

  • NGINX Ingress Controller

  • Cert-Manager for auto HTTPS

Example ingress rule:

host: api.<domain>
service: my-backend
port: 8080
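
Fleshed out, that rule becomes something like the following (host, TLS secret, and ClusterIssuer names are placeholders):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-backend
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # cert-manager issues the TLS cert
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-backend
                port:
                  number: 8080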

A Cool Trick: Internal Env-Config Service

We built a small internal admin service that lets us update environment variables in namespaces. It is not exposed publicly; we call it "InfraAdmin".

We only access it by:

kubectl port-forward service/infra-admin 8080:8080

This helped avoid:

  • editing Secrets manually

  • restarts in uncontrolled ways

  • exposing admin endpoints to the world

It feels “small”, but it made developer workflows smooth.


Observability: Seeing Everything

To run production systems confidently, we needed visibility:

  • Grafana dashboards for application performance

  • Loki + Blob Storage for cost-efficient logs

  • Jaeger to trace slow requests across microservices

This helped us answer questions like:

Why is checkout slow?
Which service is causing the bottleneck?
Is latency due to DB, network, or internal API calls?

This is where Kubernetes truly felt like a platform, not “containers glued together”.


Challenges Faced

| Problem | Solution |
| --- | --- |
| Private Postgres connectivity failures | Correct subnet delegation + DNS zone linking |
| Image version conflicts | Automated semantic version bump script |
| Log storage cost growing quickly | Moved Loki storage from disk → Azure Blob backend |
| Too many manual deployments | Adopted full GitOps with ArgoCD |

Every issue taught something valuable about how cloud platforms behave in real environments.

Debugging & Troubleshooting Lessons

| Issue | Fix |
| --- | --- |
| Pod to DB connection errors | Postgres Private DNS Zone linking |
| RabbitMQ memory pressure | Switched memory allocator + tuned consumer prefetch |
| ClickHouse OOM | Set max_memory_usage & used ZSTD table compression |
| Loki disk pressure | Switched to Azure Blob |
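
The Loki move deserves a snippet. Roughly, this is the shape of pointing Loki's object storage at Azure Blob (storage account, container, and schema details are illustrative; verify against the Loki docs for your version):

storage_config:
  azure:
    account_name: <storage-account>
    account_key: <storage-account-key>
    container_name: loki-chunks

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: azure      # chunks and index shipped to Blob instead of local disk
      schema: v13
      index:
        prefix: index_
        period: 24h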

What I Learned

  1. Kubernetes is not difficult: networking and identity are.

  2. GitOps dramatically simplifies CI/CD pipelines.

  3. Observability stack is mandatory, not optional.

  4. Cost controls must be built from day one.

Perhaps we will talk about Azure setups in depth in a different article, along with the monitoring setups (including BotKube and Uptime Kuma). This one was meant to walk through the steps and decisions from my project on Azure, and it was a fun learning experience!
