
Talos OS: A Hard Earned Understanding Of Storage, Certificates, And Access


Talos OS promises a fully immutable, API driven Kubernetes experience. It removes the idea of logging into nodes, changing files manually, or performing maintenance through traditional means. This design brings a high level of security and predictability. It also brings a set of challenges that many users, including myself, only discover once Talos becomes part of a real cluster.

During my recent effort to run PostgreSQL on a Talos cluster, I experienced several failures, unexpected behaviours, and some difficult recovery situations. This post documents that entire experience. I want this to help anyone who is trying to use Talos in a small cluster or a homelab environment, because the learning curve is very steep.

Understanding Why Storage Becomes Difficult

Talos is an immutable operating system. This sounds ideal until you try to mount storage. Many Kubernetes setups allow you to create directories directly on the node using simple commands. Talos does not allow this approach.

These are the limitations that immediately matter:

  • You cannot create directories on the host manually

  • You cannot change permissions manually

  • You cannot rely on paths that do not already exist

  • You cannot SSH into the node to fix things

  • You cannot depend on anything that is not part of the machine configuration

My PostgreSQL deployment required a PersistentVolume backed by local storage. I created a PersistentVolume that pointed to a path like /var/lib/postgres or /mnt/postgres. Each attempt failed with the same error.

MountVolume.NewMounter initialization failed for volume "pv-postgres" : path "/var/lib/postgres" does not exist

Talos refuses to mount a path that does not exist. Since I was unable to create that directory manually, I needed a path that already existed.

The solution was surprisingly simple!

I used a directory that Talos already creates by default.

/var/local

The moment I pointed my PersistentVolume to /var/local, PostgreSQL started successfully. The directory existed, the kubelet accepted it, and the pod finally mounted the data volume.

This taught me the most important lesson about Talos and storage. Any stateful workload that needs a hostPath or a local PersistentVolume must rely on a path that exists at boot through the machine configuration. If the path is not created by Talos, then the pod will fail.

Here is the full PersistentVolume definition that worked:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-postgres
spec:
  storageClassName: localstorage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  local:
    path: /var/local
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - talos-1to-jsz

The corresponding PVC was:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgresql
  namespace: postgresql
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: localstorage
  resources:
    requests:
      storage: 10Gi
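Both manifests reference a localstorage class that is not shown above. For statically provisioned local volumes, this is typically declared as a no-provisioner StorageClass; a sketch, assuming the name localstorage:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: localstorage
# No dynamic provisioner: PVs of this class are created by hand.
provisioner: kubernetes.io/no-provisioner
# Delay binding until a pod is scheduled, so the PV's nodeAffinity is honoured.
volumeBindingMode: WaitForFirstConsumer
```

Without this class, the PVC would stay Pending with no PV to bind to.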

Then I applied the Deployment manifest, along with its Service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  namespace: postgresql
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_DB
              value: mydb
            - name: POSTGRES_USER
              value: myuser
            - name: POSTGRES_PASSWORD
              value: mypassword
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: postgres-data
          persistentVolumeClaim:
            claimName: data-postgresql
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: postgresql
spec:
  selector:
    app: postgres
  ports:
    - name: postgres
      protocol: TCP
      port: 5432
      targetPort: 5432
  type: ClusterIP

After this, the PostgreSQL pod successfully started and mounted the volume. I was able to connect to it using a test pod:

kubectl run psql-test \
  --rm -it \
  --image=postgres:16 \
  --namespace postgresql \
  --env="PGPASSWORD=mypassword" \
  -- psql -h postgres -U myuser -d mydb

Trying to use arbitrary directories will always fail this way. And while /var/local worked, it is not a best practice and should be avoided in production, as the next section explains.

Why Using Existing Directories Like /var/local Works But Is Not A Good Practice

When I struggled to mount storage for PostgreSQL on Talos, I eventually discovered that /var/local already existed on every node. Talos creates this directory during early boot. As soon as I pointed my PersistentVolume to /var/local, the database pod started without any issues. The directory already existed, the kubelet was satisfied, and the pod finally mounted the data volume.

This seems like a convenient solution, but it is not a recommended approach. It introduces several long term problems and operational risks.

Here is why it is not a good practice.

1. This directory is not meant for application data

/var/local is an internal Talos directory and is not designed for stateful workloads. Talos can modify or use this directory for its own purposes in future releases. Talos does not document or guarantee that this directory will always exist, or that it will behave the same way across upgrades.

You are depending on behaviour that is a side effect of the operating system rather than a stable feature.

2. All applications will share the same storage location

If you reuse /var/local for every PersistentVolume, then every stateful application on that node will write data into the same directory. This will cause:

  • A lack of isolation between applications

  • Possible permission conflicts

  • A risk of one application filling the entire directory and breaking the others

  • Difficulty with debugging and storage visibility

It also becomes impossible to safely delete or migrate individual application data.

3. Talos does not enforce size limits

Kubernetes does not enforce the size declared in a PersistentVolume. Since /var/local is just a directory, any application can exceed the declared ten gibibytes. There is no guarantee that the node will not run out of space.

In a worst case scenario, the node can crash due to disk pressure.
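One way to contain this risk is to give stateful workloads a dedicated partition instead of a shared directory on the system disk. Talos supports extra disks in the machine configuration; a sketch, assuming the node has a spare disk at /dev/sdb (both the device name and the mountpoint here are assumptions for your hardware):

```yaml
machine:
  disks:
    - device: /dev/sdb            # assumption: an unused secondary disk
      partitions:
        - mountpoint: /var/mnt/postgres   # extra mounts live under /var
```

A PersistentVolume pointed at /var/mnt/postgres is then bounded by the size of that disk, so a runaway database cannot fill the root filesystem.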

4. This breaks the philosophy of Talos

Talos is meant to be fully declarative. Anything that exists should be defined through the MachineConfig, not through accidental filesystem structure that happens to be present.

If the directory is not created explicitly by configuration, then it is not an intentional part of your infrastructure. It is risky to build application level storage on top of something that was not designed for this purpose.

5. Upgrades and reinstallations can remove the directory

During major upgrades or node reinstalls, Talos can reset or restructure internal directories. If /var/local is removed, renamed, or reformatted, then every application that relies on it will lose its data.

Your data becomes fragile and tied to undocumented filesystem details.


The Correct Talos Approved Way

The recommended way to create storage paths in Talos is:

Use a MachineConfig patch to create directories with explicit permissions and ownership.

For example, you can declare this:

machine:
  files:
    - path: /var/data/postgres
      permissions: 0o755
      owner: 0
      group: 0
      directory: true

You can create as many directories as you need. This is the clean, controlled, and safe method.
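Assuming the snippet above is saved as a standalone patch file, applying it to a node might look like this (a sketch; the node IP is a placeholder, and the talosctl invocation is shown commented because it needs a reachable cluster):

```shell
# Write the directory patch to a local file.
cat > dir-patch.yaml <<'EOF'
machine:
  files:
    - path: /var/data/postgres
      permissions: 0o755
      owner: 0
      group: 0
      directory: true
EOF

# Apply it through the Talos API (no SSH involved):
# talosctl patch machineconfig -n <node-ip> --patch @dir-patch.yaml
```

The node reconciles the new configuration itself; there is no manual mkdir step anywhere.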

You then point your PersistentVolumes to these paths with confidence that:

  • They will exist on every node

  • They will survive upgrades

  • They are dedicated to the correct application

  • They will not conflict with Talos internals

This is the long term maintainable approach. Unfortunately, applying this patch is where I ran into the certificate error described below, which made life difficult.

Expanding Persistent Volumes

In the example above, I claimed the full 10Gi of available storage on the node. If I want to expand storage later, I must first ensure more disk space is available at the path and then patch the PVC to request the larger size. Kubernetes will handle resizing if the StorageClass allows it.
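Patching the claim can be as simple as re-applying it with a larger request. A sketch of the same PVC grown to 20Gi (note this only succeeds if the StorageClass sets allowVolumeExpansion: true, and statically provisioned local PVs usually also need the PV and underlying storage resized by hand):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgresql
  namespace: postgresql
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: localstorage
  resources:
    requests:
      storage: 20Gi   # raised from the original 10Gi
```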

For multiple services requiring persistent storage, I can technically reuse /var/local, but this is also not recommended. Each workload should ideally have its own dedicated storage path or volume managed through a proper storage provider.


Why Longhorn Does Not Work Easily

Longhorn requires certain kernel modules, directory mounts, and filesystem behaviours that Talos does not support by default. Talos aims for very minimal host configuration. Longhorn expects the opposite. The two conflict in many ways.

As a result, most users who attempt Longhorn on Talos experience failures. This includes random crashes, volume mount issues, replica failures, and inability to start the Longhorn UI.

The safer alternative is to use Talos MachineConfig patches to create custom paths and then rely on local PersistentVolumes. This reduces flexibility but increases stability.


The Strange Behaviour Of PersistentVolumes In Talos

When I created a 10Gi PersistentVolume and a matching PersistentVolumeClaim, it worked immediately on /var/local. This made me curious about what was actually happening.

Here is the explanation.

  • The size declared in a PersistentVolume does not actually allocate disk space

  • The directory simply points to the host filesystem

  • Talos does not perform any reservations

  • Kubelet does not enforce storage consumption limits

  • The declared capacity only informs Kubernetes scheduling

This means that even if you declare ten gibibytes, the actual host directory is unbounded. PostgreSQL can consume much more than the advertised size if it needs to. The responsibility of storage growth sits entirely on you as the administrator.

If you want to expand a PersistentVolume in Talos:

  1. You need a larger physical directory available

  2. You resize the PersistentVolumeClaim

  3. The underlying filesystem must support online expansion

This works with filesystems like ext4 and xfs if configured correctly by the node.


Certificates and Node Access

Another issue I faced was related to Talos certificates. After initial node creation, when I tried to apply a new worker configuration (the machine.files patch that creates the storage directory), I received errors like:

error applying new configuration: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority"

Even though I used the same worker.yaml file, Talos refused the connection. The lesson here is that Talos tightly couples node certificates with the cluster PKI. If a certificate becomes invalid or untrusted, you can lose direct access.

There are several scenarios where this failure can occur:

  • You regenerate a machine configuration file

  • You recreate a node with a slightly different configuration

  • You lose your talosconfig file

  • The cluster CA is overridden when bootstrap runs again

  • Node IP addresses change

  • You accidentally mix configuration files from different clusters

When this happens, you cannot log into the node. You cannot fix files manually. You cannot mount a debug shell. This is both a security feature and a very serious operational risk.

This is the moment when Talos begins to feel unforgiving. A lost certificate means that you lose control of the node unless you have backups of your original configuration, certificates, and bootstrap secrets.
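Those backups are cheap to make. A hypothetical backup routine, assuming the file names used in this post (talosconfig, worker.yaml, and so on); adjust the list to your own setup:

```shell
# Collect the files needed to regain access to a Talos cluster.
BACKUP_DIR="talos-backup-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# Copy whichever of the critical files exist; missing ones are skipped.
for f in talosconfig worker.yaml controlplane.yaml secrets.yaml; do
  if [ -f "$f" ]; then
    cp "$f" "$BACKUP_DIR"/
  fi
done

# Archive the bundle so it can be stored somewhere off the workstation.
tar -czf "$BACKUP_DIR.tar.gz" "$BACKUP_DIR"
```

Store the archive outside the cluster and the workstation that generated it; it is the only way back in if the PKI trust breaks.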

The Difficulty Of Troubleshooting And Recovery

Troubleshooting Talos is not like troubleshooting a normal Linux server. There is no SSH access, no direct shell, and no persistent filesystem to examine. Everything flows through the Talos API. If the Talos API is broken because of mismatched certificates, then you are locked out completely.

The only reliable recovery options are:

  • Reboot into maintenance mode

  • Reapply a machine configuration

  • Restore backed up secrets

  • Reinstall the node if nothing else works

This can feel very restrictive. It requires a mindset shift. Talos does not want administrators to fix issues manually. Talos wants everything to be declared in configuration files from the start.

This is powerful, but it is very easy to break if any detail is forgotten. And if you need to reset the node, consider your previous work gone :(

Recovering Node Access

To regain access to a Talos node when the certificate fails:

  1. If the node is booted into maintenance mode, apply the configuration without client certificates:
talosctl apply-config --insecure -n <node-ip> --file worker.yaml
  2. Alternatively, retrieve a fresh kubeconfig from a reachable node:
talosctl -n <node-ip> kubeconfig -f kubeconfig.yaml
  3. Use talosctl --insecure carefully, as it bypasses certificate validation.

  4. Always back up your talosconfig files and certificates. Losing them makes access recovery difficult.

Key Lessons

  • Talos does not allow arbitrary hostPath directories. Only paths present at boot or created via machine configuration are valid for PVs.

  • Use /var/local as a temporary solution, but do not rely on it for production workloads.

  • Always back up Talos certificates and configuration to avoid losing access.

  • Stateful workloads require careful planning of persistent storage in Talos.

Talos is secure and minimal by design, but these features make working with storage and configuration more challenging than standard Linux nodes.


My Conclusion After Working Through These Issues

Talos OS is impressive. It is secure and consistent. At the same time, it brings operational challenges that are not obvious until you experience them directly.

The three biggest issues I faced were:

  1. Storage paths that cannot be created manually

  2. Certificates that break node access

  3. Recovery procedures that depend entirely on correct configuration files

Talos is ideal for large production environments where every configuration is version controlled, stable, and tested. It can be difficult for homelab environments where experimentation is common and nodes often change.

In the end I created working PersistentVolumes, understood the certificate failures, recovered node access, and built functional PostgreSQL storage. This journey helped me understand Talos in a much deeper way. I now appreciate how strict and predictable it is, even though that strictness caused many of the problems.

If you are planning to learn Talos, try to keep configuration backups from the very beginning. It will save you much more time later!

I am still learning, and perhaps there are simpler ways to run Talos in homelab experiments, but for now I will stick to more experiment friendly Kubernetes distributions.
