HashiCorp Vault on Kubernetes: 5 Production Mistakes
Vault on Kubernetes is one of those "looks easy in the demo, breaks in production" stories. The Helm chart deploys cleanly, the UI is reachable, secrets work in dev. Then the cluster reboots and Vault never comes back. Or auto-unseal fails. Or pod auth quietly stops working and apps cannot retrieve secrets. Five specific mistakes are responsible for almost every production Vault-on-Kubernetes incident.
Mistake 1: using the dev mode listener in non-dev environments
The default Helm chart values include a server.dev.enabled flag. Set it to true and you get a Vault that runs in memory, has a fixed root token, and unseals itself on every restart.
This is great for development. It is catastrophic in production:
- Every secret stored is lost on pod restart.
- The root token is well-known.
- There is no actual encryption of data.
The mistake: someone set server.dev.enabled=true during the initial setup, never changed it, and the team has been "using Vault in production" while losing secrets on every node rotation.
Fix: production setup needs server.dev.enabled=false, an explicit storage backend (Raft for HA, Consul for legacy), and a real persistent volume. Verify by checking pod logs for "Vault server started" without "running in dev mode."
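A quick way to verify, assuming the chart's default StatefulSet naming (vault-0 in the vault namespace):

kubectl logs vault-0 -n vault | grep -i "dev mode"     # should print nothing
kubectl exec vault-0 -n vault -- vault status          # Storage Type should show raft, not inmem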
Mistake 2: not configuring auto-unseal, then losing the unseal keys
By default, Vault starts sealed. To unseal, you provide 3 of 5 unseal keys (or whatever your threshold is). After every pod restart, every Vault upgrade, and every cluster maintenance window, someone has to unseal manually.
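That manual toil looks roughly like this (a sketch assuming the default release name vault in the vault namespace; each command prompts for one key share):

kubectl exec -ti vault-0 -n vault -- vault operator unseal   # paste key share 1 of 3
kubectl exec -ti vault-0 -n vault -- vault operator unseal   # paste key share 2 of 3
kubectl exec -ti vault-0 -n vault -- vault operator unseal   # paste key share 3 of 3, node unseals
# ...then repeat for vault-1, vault-2, and so on, after every restart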
Teams that do not configure auto-unseal end up:
- Storing unseal keys in 1Password / Slack / a wiki ("temporarily")
- Losing track of which engineer holds which key
- Having Vault sealed for hours after a node failover because the on-call engineer cannot find the keys
- Discovering after team turnover that nobody still at the company knows where the keys are
Fix: configure auto-unseal using your cloud provider's KMS. For AWS, use AWS KMS auto-unseal:
seal "awskms" {
region = "us-east-1"
kms_key_id = "alias/vault-unseal"
}
Vault now uses a KMS key to encrypt its master key. On startup, Vault calls KMS to decrypt and self-unseal. The unseal keys become recovery keys (rarely used) instead of operational keys (needed every restart). Same pattern for Azure Key Vault auto-unseal, GCP KMS auto-unseal, and HSM auto-unseal.
This single change eliminates the most common Vault outage cause.
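If the cluster was already initialized with Shamir unseal keys, you do not need to rebuild it. A rough migration sketch (check the seal-migration docs for your Vault version): add the seal stanza to the server config, restart the pods, then migrate the existing key shares into recovery keys:

kubectl exec -ti vault-0 -n vault -- vault operator unseal -migrate   # run once per key share, up to the threshold
# After migration the old unseal keys become recovery keys and KMS handles unsealing.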
Mistake 3: storage backend on ephemeral or non-replicated storage
This kills you on the first node failure.
Vault stores its encrypted data on disk. If you use the Raft storage backend (recommended for K8s), the data is replicated across Vault pods. But the per-pod storage must be persistent: PersistentVolumeClaim with a real StorageClass.
Common mistakes:
- Using emptyDir for the data path. Pod restarts wipe data.
- Using a StorageClass with reclaimPolicy: Delete. PVC deletion deletes the actual data forever.
- Using a single PV shared across multiple Vault replicas (RWX). Raft requires per-pod storage.
- Using a non-replicated storage class (single-AZ EBS). Loss of the AZ loses the data on that node.
Fix: each Vault replica needs its own PVC backed by a multi-AZ-replicated storage class. For AWS EKS, this typically means EBS gp3 or io2 with EBS snapshots backing up off-cluster. Set reclaimPolicy: Retain so accidental PVC deletion does not destroy data. Run regular Raft snapshots (vault operator raft snapshot save) and ship them to S3 with versioning enabled.
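A minimal snapshot job sketch, assuming a snapshot-agent Kubernetes auth role with a policy allowing sys/storage/raft/snapshot and a placeholder S3 bucket name:

#!/bin/sh
set -eu
export VAULT_ADDR="https://vault.vault.svc:8200"
SNAP="vault-raft-$(date +%Y%m%d-%H%M%S).snap"
# Authenticate with the pod's ServiceAccount token via the Kubernetes auth method
VAULT_TOKEN="$(vault write -field=token auth/kubernetes/login \
    role=snapshot-agent \
    jwt=@/var/run/secrets/kubernetes.io/serviceaccount/token)"
export VAULT_TOKEN
vault operator raft snapshot save "/tmp/${SNAP}"
aws s3 cp "/tmp/${SNAP}" "s3://my-vault-snapshots/${SNAP}"   # bucket with versioning enabled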
Mistake 4: pod authentication that silently stops working
The Kubernetes auth method, which the Vault Agent Injector relies on, lets pods authenticate to Vault using their ServiceAccount JWT: the pod presents the JWT, Vault verifies it with the Kubernetes API, and issues a Vault token.
This breaks in subtle ways:
- Vault's K8s auth role binds to a specific ServiceAccount name. Someone renames a service. The old SA name is in the Vault role. New pods with the new SA name fail auth. Cryptic 403 errors.
- JWT verification key changes during cluster upgrade. EKS occasionally rotates the JWT signing key during managed upgrades. If Vault was configured with a static kubernetes_ca_cert instead of using kubernetes_host plus dynamic discovery, auth breaks until you update Vault.
- Network policy blocks Vault from reaching the K8s API. Vault verifies JWTs by calling the K8s API. If a NetworkPolicy or security group restricts Vault's egress, the API call fails and all auth attempts return 500.
- ServiceAccount token rotation (TokenRequest API): K8s 1.22+ rotates SA tokens regularly. Older Vault versions do not handle the rotated tokens correctly.
Fix:
# Configure K8s auth with token reviewer JWT (uses Vault's own SA token)
vault write auth/kubernetes/config \
    token_reviewer_jwt="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
    kubernetes_host="https://kubernetes.default.svc" \
    kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
    disable_iss_validation=true \
    disable_local_ca_jwt=false
The disable_iss_validation=true flag (the default since Vault 1.9) handles the issuer URL changes that break auth on K8s 1.21+. Update Vault to at least 1.13 for the cleanest behavior with TokenRequest.
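With the config in place, bind a role to the exact ServiceAccount name and namespace the workload runs under, then smoke-test the login from a pod. The role, policy, and namespace names below are illustrative:

vault write auth/kubernetes/role/payments-api \
    bound_service_account_names=payments-api \
    bound_service_account_namespaces=payments \
    policies=payments-read \
    ttl=1h

# From inside a pod running as that ServiceAccount:
vault write auth/kubernetes/login \
    role=payments-api \
    jwt=@/var/run/secrets/kubernetes.io/serviceaccount/token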
Mistake 5: Helm chart upgrades that break on the next chart revision
The HashiCorp Vault Helm chart has had multiple breaking changes over its life. Upgrading from one major chart version to the next without reading the upgrade notes has caused:
- Service selectors changing, causing the LoadBalancer to point at the wrong pods.
- StatefulSet PVC retention behavior changes wiping data.
- Auto-injector configuration moving between top-level keys and sub-keys.
- Vault image tag defaulting to the latest, pulling a new Vault version that is incompatible with stored data without a manual vault operator step-down step.
Fix:
- Pin the Vault image version explicitly in your values file. Never let it default to "latest."
- Pin the Helm chart version. Never let CD pick up new chart versions automatically.
- Read every release note for the chart and Vault binary between your version and the new one.
- Take a Raft snapshot before any Vault upgrade.
- Upgrade in dev → staging → prod with at least 24 hours between each stage.
- For Vault binary upgrades on a Raft cluster, follow the documented step-down procedure: upgrade standby nodes first, then step down the leader, then upgrade the former leader. Skipping this is the most common cause of Vault data corruption during upgrade.
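A sketch of that sequence, assuming the chart's default OnDelete update strategy, a release named vault, and vault-0 currently holding leadership (confirm the leader with vault operator raft list-peers before you start):

# Snapshot first, then apply the pinned upgrade
vault operator raft snapshot save pre-upgrade.snap
helm upgrade vault hashicorp/vault --version <tested-chart-version> -f values.yaml -n vault
# OnDelete means pods only restart when deleted: recycle standbys first
kubectl delete pod vault-2 vault-1 -n vault     # wait for each to rejoin and unseal
# Demote the leader, then recycle it last
kubectl exec vault-0 -n vault -- vault operator step-down
kubectl delete pod vault-0 -n vault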
The production-ready values.yaml fragment
Combining the fixes above, here is a production-ready Vault Helm values fragment:
server:
  dev:
    enabled: false                      # ← Mistake 1
  ha:
    enabled: true
    replicas: 5                         # 5 nodes for production HA Raft
    raft:
      enabled: true
      setNodeId: true
      config: |
        ui = true
        listener "tcp" {
          tls_disable     = false
          address         = "[::]:8200"
          cluster_address = "[::]:8201"
          tls_cert_file   = "/vault/userconfig/vault-tls/tls.crt"
          tls_key_file    = "/vault/userconfig/vault-tls/tls.key"
        }
        storage "raft" {
          path = "/vault/data"
        }
        seal "awskms" {                 # ← Mistake 2: auto-unseal
          region     = "us-east-1"
          kms_key_id = "alias/vault-unseal"
        }
        service_registration "kubernetes" {}
  dataStorage:                          # ← Mistake 3: persistent storage
    enabled: true
    size: 50Gi
    storageClass: ebs-gp3
    accessMode: ReadWriteOnce
  image:
    repository: hashicorp/vault
    tag: 1.18.2                         # ← Mistake 5: pin version
  serviceAccount:
    create: true
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/vault-irsa-role
injector:
  enabled: true
  image:
    repository: hashicorp/vault-k8s
    tag: 1.4.2                          # pin injector version too
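Apply it with a pinned chart version; the flag below uses a placeholder, so substitute whatever chart version you have actually tested:

helm repo add hashicorp https://helm.releases.hashicorp.com
helm upgrade --install vault hashicorp/vault \
    --version <pinned-chart-version> \
    --namespace vault --create-namespace \
    -f values.yaml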
Pre-production checklist
- Vault is NOT in dev mode (vault status shows the version, not "running in dev mode").
- Auto-unseal is configured against your cloud KMS.
- Each Vault pod has its own PVC backed by a multi-AZ replicated storage class.
- Raft snapshots run on schedule (cron or operator) and ship to S3.
- Kubernetes auth method works for at least one test pod from each namespace.
- Vault image tag is pinned, chart version is pinned, GitOps does not auto-pull updates.
- Documented runbook for: Vault unseal, Vault leader step-down, Raft snapshot restore, breaking upgrade procedure.
- Recovery keys stored in two separate physical/logical locations (1Password vault + offline paper backup, for example).
- Audit log enabled and shipped to centralized logging (see the commands after this checklist).
- Alerting on Vault sealed status, Raft leader changes, and high error rates.
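For the audit-log and sealed-status items, two concrete starting points (the file path and alert wiring are placeholders for your own setup):

vault audit enable file file_path=/vault/audit/vault_audit.log
vault status -format=json | jq .sealed     # true/false; vault status also exits 2 when sealed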
Securely share Vault recovery keys
Recovery keys should be split among multiple administrators, as Shamir's Secret Sharing intends. Share key shards through zero-knowledge encryption with auto-expiring links instead of email or chat.
The bottom line
Vault on Kubernetes works well in production once you avoid five specific mistakes: never run in dev mode, configure auto-unseal, use real persistent storage, harden the K8s auth integration, and pin every version explicitly. The Helm chart's defaults are tuned for demos, not production. Take an afternoon to review the values file, take a snapshot before any upgrade, and document the runbook before the first time you need it.
Related reading: Vault vs Secrets Manager vs Doppler, Kubernetes Secrets Management, Secrets Management for DevOps Teams, Kubernetes Security Best Practices, and Helm Charts Tutorial.