The Bridge to Vault: Tailscale, Talos, and the Art of One-Shot Rebirth in Kubernetes#
The evolution of the TazLab infrastructure continues according to a well-defined roadmap. The final goal is clear: abandon the static secret management offered by the free tier of Infisical to embrace dynamic secrets and the automated rotation guaranteed by HashiCorp Vault. As I documented in previous articles, Vault has already been successfully migrated to an isolated and protected cloud machine. The problem that arose in this specific session was no longer related to Vault itself, but to how to make the local Kubernetes cluster (based on Proxmox and Talos OS) communicate with that remote Vault, in a secure, deterministic way, and, above all, without exposing ports on the Internet.
The obvious answer to this networking need is Tailscale, a WireGuard-based mesh VPN that allows you to create encrypted private networks (Tailnets) between nodes distributed anywhere. But in the TazLab ecosystem, the “obvious answer” must always deal with a non-negotiable architectural principle: the philosophy of the Ephemeral Castle.
The local Kubernetes cluster is not a sacred entity to be preserved at all costs. It is a disposable compute resource. A “One-Shot Rebirth” process already existed: a script capable of razing virtual machines to the ground and rebuilding the entire environment from scratch, automatically restoring the state via GitOps (Flux) and the data via database restore from S3. The insertion of Tailscale as a bridge to Vault was not supposed to invalidate this automatic regeneration capability in any way. On the contrary, it had to integrate organically, demonstrating that the destruction and reconstruction process (the Rebirth) continued to work perfectly, bringing with it the new connectivity as well.
This infrastructure session therefore developed on a double track: on the one hand, the technical implementation of the Tailscale-Talos bridge; on the other, the in-depth investigation and resolution of a series of fascinating “race conditions” that the addition of these new components brought out during the automated bootstrap cycle.
Managing Secrets in RAM: The Tailscale-Talos Bridge#
The first technical hurdle was joining the Talos nodes to the Tailnet. Talos is a minimal, immutable, API-driven operating system specifically designed for Kubernetes. It has no shell, no SSH, and does not allow the installation of traditional packages. Its features are extended through System Extensions, pre-compiled modules that are “baked” into the operating system image.
Using System Extensions in Talos#
For those unfamiliar with this approach, a System Extension in Talos is not a daemon that is installed at runtime. It is an integrated low-level component. To have Tailscale, I had to update the Terraform configuration (the proxmox-talos module) to point the nodes to a specific schematic (an image generated by the Talos Image Factory) that included the siderolabs/tailscale extension.
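For reference, the Image Factory schematic that bakes in the extension is tiny. A minimal sketch looks like this (the real schematic may pin additional extensions or kernel arguments):

```yaml
# Minimal Talos Image Factory schematic including the Tailscale extension.
# Uploading it to the factory returns a schematic ID that the Terraform
# module references when selecting the node image.
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/tailscale
```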
This architectural choice is superior to deploying Tailscale as a DaemonSet within Kubernetes. By running Tailscale at the operating system (OS) layer, the nodes acquire connectivity before Kubernetes even performs the bootstrap. This ensures that the API server and the fundamental components of the control plane are immediately protected and routed through the mesh VPN, eliminating the networking complexities and overlaps between the Kubernetes CNI (Container Network Interface) and the VPN network interfaces.
The AuthKey Challenge and RAM Injection#
The real operational problem, however, lay in authentication. How do I make a newly born node automatically join the Tailnet? Tailscale offers AuthKeys, pre-generated keys that allow automatic device enrollment. But the golden rule of TazLab categorically forbids writing plain text secrets to disk or saving them in Terraform state files.
If I had passed the AuthKey as a variable to Terraform, it would inevitably have ended up in the terraform.tfstate, violating the security requirements. The solution required a more surgical approach.
I decided to implement AuthKey generation directly in the bootstrap orchestration script (create.sh), requesting it dynamically via the Tailscale APIs just seconds before the nodes are created. This key, generated “on the fly”, lives exclusively in memory (in the RAM of the bash process) and is injected into the Talos nodes only after Terraform has concluded the provisioning of the basic infrastructure.
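The generation half of that flow can be sketched as follows. This is a hedged sketch rather than the script's exact code: the TAILSCALE_API_TOKEN variable is an illustrative name, "-" selects the token's default tailnet in the Tailscale v2 API, and the key's capability flags are discussed in the next subsection.

```bash
# Sketch: mint a short-lived AuthKey and keep it only in this shell's memory.
# TAILSCALE_API_TOKEN is an illustrative name; "-" means the token's default
# tailnet. Capability flags (ephemeral/reusable) are discussed further below.
TS_AUTHKEY=$(
  curl -fsS -u "${TAILSCALE_API_TOKEN}:" \
    -H "Content-Type: application/json" \
    -d '{"expirySeconds": 900, "capabilities": {"devices": {"create": {"preauthorized": true, "tags": ["tag:tazlab-k8s"]}}}}' \
    "https://api.tailscale.com/api/v2/tailnet/-/keys" \
  | jq -r '.key'
)
export TS_AUTHKEY  # consumed by the patching function below, never written to disk
```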
To do this, I used the talosctl apply-config command to apply a patch to the machine configuration (Machine Config) using an ExtensionServiceConfig.
```bash
patch_talos_tailscale_extension() {
  # [...] Preliminary setup and checks
  echo "🔧 Applying RAM-only Tailscale ExtensionServiceConfig patches via talosctl..."
  TALOSCONFIG="$talosconfig" python3 - "$env_file" <<'EOF'
# [...] Node inventory parsing logic
for ordinal, node_ip in entries:
    hostname = f"{cluster_name}-{role}-{ordinal}"
    extension_doc = {
        "apiVersion": "v1alpha1",
        "kind": "ExtensionServiceConfig",
        "name": "tailscale",
        "environment": [
            f"TS_AUTHKEY={os.environ['TS_AUTHKEY']}",
            f"TS_HOSTNAME={hostname}",
            "TS_EXTRA_ARGS=--advertise-tags=tag:tazlab-k8s --accept-routes=false",
            "TS_STATE_DIR=/var/lib/tailscale",
        ],
    }
    # [...] Temporary yaml generation on /dev/shm and application via talosctl
EOF
  echo "✅ Tailscale ExtensionServiceConfig patches applied without persisting TS_AUTHKEY."
}
```

The choice to use a Python construct inside the bash script allowed for clean and secure manipulation of YAML outputs and environment variables. The temporary file with the patch is created in /dev/shm (a pseudo-filesystem residing in RAM) and deleted immediately after application, ensuring that no trace of the key remains long-term.
AuthKey Debugging: Reusable vs Ephemeral#
During the initial tests of this implementation, I encountered anomalous behavior. The control plane (the first node) regularly joined the Tailnet, but the worker node systematically failed authentication. Examining the Talos node system logs via talosctl dmesg, I noticed a Tailscale daemon error: invalid key.
The mental process in these cases is to isolate the variables. Was the network working? Yes, the control plane had joined. Was the patching correct? Yes, the configuration reached its destination. The problem had to lie in the nature of the key itself.
Consulting the Tailscale API documentation, I analyzed the JSON payload I was sending to generate the key. Initially, I had set the flags ephemeral: true (so that nodes would be automatically deregistered if inactive) and reusable: false. The intent was maximum security: a single-use key to prevent abuse.
However, the design of my create.sh script envisioned generating a single key for the entire bootstrap cycle and sharing it among all nodes. Being reusable: false, as soon as the control plane used it, the key was invalidated on the server side. When the worker node attempted to authenticate with the same key a few moments later, Tailscale correctly rejected it.
The solution was to change the flag to reusable: true, maintaining ephemerality. The key is valid only for the duration of the bootstrap cycle (with a forced short-term expiration), but can be used by N nodes in that specific time window. This resolved the error and brought both nodes into the Tailnet with the correct tags (tag:tazlab-k8s), ready for future dialogue with Vault.
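In terms of the JSON payload, the corrected capabilities block therefore looks something like this (a sketch; the 900-second expiry and the preauthorized flag are my illustrative choices):

```json
{
  "expirySeconds": 900,
  "capabilities": {
    "devices": {
      "create": {
        "reusable": true,
        "ephemeral": true,
        "preauthorized": true,
        "tags": ["tag:tazlab-k8s"]
      }
    }
  }
}
```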
Physics vs Logic: Teaching Patience to GitOps#
With the Tailscale bridge up and running, the cluster was theoretically ready. But the real test, as established by the prerequisites, was the “One-Shot Rebirth”. The orchestration script didn’t just have to bring up the VMs: it had to run Terraform to install the core components (like the GitOps operator FluxCD, MetalLB for IPs, and Longhorn for distributed storage) and then let Flux download the manifests from GitHub to rebuild the entire application fleet.
The One-Shot Rebirth already existed, but the introduction of new logic revealed how fragile it could be in the face of the infrastructure’s physical timing.
The Concept of GitOps and Eventual Consistency#
GitOps, operated in my case by Flux, is based on the principle of Eventual Consistency. You declare the desired state in a Git repository, and the operator inside the cluster works in continuous cycles (reconciliation loops) to make the real state converge towards the desired one.
In theory, this model is invulnerable to execution order. If I ask Flux to create a Deployment before the Namespace that must contain it exists, the controller fails the first attempt, waits, and retries. When the Namespace finally appears, the Deployment is created.
In practice, however, a bootstrap from scratch stresses this model to the extreme. The create.sh script is an imperative tool that triggers a declarative process. The problem arises when the imperative tool declares victory too soon, abandoning monitoring before the declarative ecosystem has actually finished its settling work.
The Illusion of Network Readiness (CNI)#
The first symptom of this dichotomy emerged with the cascading failures of the HelmReleases. Flux started to download and apply Helm charts (like Traefik or External-Secrets), but the pods timed out or remained in the ContainerCreating state.
An analysis with kubectl describe pod revealed the snag: failed to setup network for sandbox... plugin type="flannel" failed.
What was going on? Terraform had completed its phase. The Kubernetes nodes were reported as Ready. The script proceeded to trigger Flux. Flux asked the API server to schedule the pods. The API server ordered kubelet to start them. But the CNI (Container Network Interface), in my case Flannel, had not yet finished rolling out its agent pods onto the nodes, allocating subnets, and configuring iptables rules. The containers were being born in a vacuum, without network connectivity.
The solution was not to brutally block the bootstrap script with wait loops (an “imperative” and fragile approach), but to teach patience directly to Flux, adopting a more “Enterprise” strategy. I modified the base Kustomization (the one defining the fundamental layers) by adding native healthChecks.
```yaml
healthChecks:
  - apiVersion: apps/v1
    kind: DaemonSet
    name: kube-flannel
    namespace: kube-system
  - apiVersion: apps/v1
    kind: Deployment
    name: coredns
    namespace: kube-system
```

This approach guarantees that when the high-level orchestrator (Flux) starts requesting resources for subsequent Kustomizations (like Traefik or External-Secrets), the low-level physical and logical layer is ready to answer, holding back the entire GitOps dependency tree until the network is genuinely solid.
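Downstream layers then only need to declare a dependency on this base Kustomization, and Flux gates them on its health. A sketch of such a dependent Kustomization, with illustrative names and paths:

```yaml
# Sketch: a downstream Flux Kustomization that is held back until the base
# layer (carrying the healthChecks above) is reconciled and healthy.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: traefik
  namespace: flux-system
spec:
  dependsOn:
    - name: base            # illustrative name of the base Kustomization
  interval: 10m
  path: ./clusters/tazlab/traefik
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  wait: true                # also wait for this layer's own resources
```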
The Paradox of the Grafana Secret in Disaster Recovery#
The final chapter of this infrastructure investigation concerned a purely application-level misalignment that emerges only during a disaster (real or simulated).
In the cluster, Grafana is used to display metrics and requires access to its own database hosted on PostgreSQL. Initial password management happens via Infisical. An ExternalSecrets operator connects to Infisical, grabs the static password (GRAFANA_DB_PASSWORD) defined upfront, and injects it into the monitoring namespace in the form of a Kubernetes Secret. Grafana boots up by reading this secret.
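A sketch of that day-zero ExternalSecret follows; the store name and field mapping are illustrative assumptions, while the secret name matches the one referenced later in the bootstrap script:

```yaml
# Sketch: pulls the static password from Infisical and materializes it as a
# Kubernetes Secret in the monitoring namespace. Store and field names are
# illustrative.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: tazlab-db-pguser-grafana
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: infisical
  target:
    name: tazlab-db-pguser-grafana
  data:
    - secretKey: password
      remoteRef:
        key: GRAFANA_DB_PASSWORD
```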
This flow works perfectly on day-zero (the very first absolute boot). But the One-Shot Rebirth is not a day-zero; it is a Disaster Recovery.
The Divergence of States#
When the PGO operator restores the PostgreSQL cluster drawing from S3 backups, it doesn’t just copy the bytes of data. It restores the entire logical ecosystem of the database, including users and their encrypted credentials, recreating the necessary Kubernetes Secrets in the tazlab-db namespace (where the database resides).
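For context, with PGO this kind of bootstrap-from-backup is declared on the PostgresCluster spec itself; a sketch, with bucket, endpoint, and paths as illustrative values:

```yaml
# Sketch: PGO (Crunchy Postgres Operator) dataSource stanza that initializes
# a new cluster from pgBackRest backups stored in S3. Values are illustrative.
spec:
  dataSource:
    pgbackrest:
      stanza: db
      configuration:
        - secret:
            name: pgbackrest-s3-creds
      global:
        repo1-path: /pgbackrest/tazlab-db/repo1
      repo:
        name: repo1
        s3:
          bucket: tazlab-backups
          endpoint: s3.eu-central-1.amazonaws.com
          region: eu-central-1
```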
This dynamic generates a formidable state divergence:
- On one hand, in the monitoring namespace, there is an ExternalSecret forcing the presence of an “old” password (the bootstrap one on Infisical).
- On the other hand, the restored Postgres database contains the “new” password, rotated in the past and historicized in the S3 backup. The PGO operator, furthermore, has just issued an updated Secret in the tazlab-db namespace.
Inevitable result: Grafana starts up with the Infisical password, attempts to log into the restored DB, and is rejected with an eloquent pq: password authentication failed for user "grafana". Grafana enters CrashLoopBackOff, monitoring fails, the cluster is not completely healthy.
Post-Restore Synchronization as a Bridge to Vault#
Addressing this paradox required a major design decision. I could have manually updated Infisical with the new password, but this would have violated the cardinal principle of total automation. Or I could have created yet another synchronization tool.
The truth is that this specific problem (a secret that changes cycle after cycle and must be propagated) is the exact symptom of an architectural limit: the dependency on static secrets. As long as I rely on Infisical in its free tier, I don’t have automatic rotation and full lifecycle management of the secret, but only its distribution.
To unlock the One-Shot Rebirth, I implemented a sync_runtime_secrets function within the bootstrap script. This function has a very specific task: it waits patiently for the PGO operator to declare the formal completion of the restore (PostgresDataInitialized=True), after which it deletes the stale ExternalSecret and replaces it with the current truth, captured directly from the database namespace.
```bash
sync_runtime_secrets() {
  echo "🔄 Syncing runtime-generated secrets needed after restore..."

  # 1. Wait for the secret generated by PGO to be available
  local secret_json
  until secret_json=$(kubectl get secret -n tazlab-db tazlab-db-pguser-grafana -o json 2>/dev/null) && [[ -n "$secret_json" ]]; do
    sleep 5
  done

  # 2. Remove static management by Infisical to avoid overwriting data
  kubectl delete externalsecret -n monitoring tazlab-db-pguser-grafana --ignore-not-found >/dev/null 2>&1 || true

  # 3. Transform the PGO secret, change namespace, and apply in 'monitoring'
  echo "$secret_json" | python3 -c '
import json, sys

obj = json.load(sys.stdin)
obj["metadata"]["namespace"] = "monitoring"
obj["metadata"]["name"] = "tazlab-db-pguser-grafana"
# Strip server-populated metadata so the object can be re-applied cleanly
for k in ["uid", "resourceVersion", "creationTimestamp", "managedFields", "ownerReferences", "annotations"]:
    obj["metadata"].pop(k, None)
obj["metadata"].setdefault("labels", {})
obj["metadata"]["labels"]["synced-by"] = "ephemeral-castle-create"
print(json.dumps(obj))
' | kubectl apply -f -

  # 4. Restart Grafana to force rereading the correct secret
  kubectl delete pod -n monitoring -l app.kubernetes.io/name=grafana --ignore-not-found >/dev/null 2>&1 || true
  echo " -> synced tazlab-db-pguser-grafana into monitoring namespace"
}
```

This JSON pipeline manipulation (performed with a small inline Python snippet to ensure proper cleanup of the original Kubernetes metadata) safely transports the Grafana secret from the DB namespace to the monitoring namespace. The subsequent brutal restart of the Grafana pods forces them to boot up mounting the updated Secret. The login succeeds, and monitoring is operational again.
The Architectural Revelation#
The crucial point is that this technical patch is not an elegant solution; it is a painful workaround. It forced me to write imperative logic (the Python script to move the secret) to solve a limitation of my declarative stack.
This workaround was exactly the spark that ignited the entire subsequent project. The inability to natively and fluidly manage database credentials generated post-restore is the precise reason why I decided to abandon Infisical and instantiate my personal Vault. I had initially planned to host it on Oracle Cloud in the Always Free tier, but as I recounted in the article dedicated to the “pivot” of Lushy Corp, the availability and stability limits quickly pushed me to opt for a solid VPS on Hetzner.
Regardless of where it is physically located, the goal does not change: with the dynamic secrets of Vault, I will no longer need Python scripts moving passwords. Vault will dynamically inject a temporary user into PostgreSQL every time Grafana (or any other app) requests it, eliminating the problem of desynchronized static secrets at the root.
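As a preview of that model, wiring up Vault's database secrets engine looks roughly like this. A sketch under stated assumptions: connection URL, admin credentials, grants, and TTLs are all illustrative.

```bash
# Sketch: Vault's database secrets engine minting short-lived Postgres users.
vault secrets enable database

# Point Vault at the (restored) PostgreSQL instance; values are illustrative.
vault write database/config/tazlab-db \
  plugin_name=postgresql-database-plugin \
  allowed_roles="grafana" \
  connection_url="postgresql://{{username}}:{{password}}@tazlab-db.tazlab-db.svc:5432/postgres" \
  username="vault-admin" \
  password="bootstrap-only-password"

# A role that creates an expiring user on demand.
vault write database/roles/grafana \
  db_name=tazlab-db \
  creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
  default_ttl=1h \
  max_ttl=24h

# Each read returns fresh credentials; nothing static left to desynchronize.
vault read database/creds/grafana
```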
This hack on Grafana is, to all intents and purposes, the conceptual bridge that made it imperative to complete the Tailscale-Talos bridge.
Conclusions: Design Always Pays Off#
At the end of this engineering session, I launched the final test. I ran ./destroy.sh in the background, watching the Proxmox APIs pulverize the virtual machines and clear every trace of local state. Immediately after, I ran ./create.sh.
I stepped away from the terminal.
Fourteen minutes later, the script finished execution, reporting the successful retrieval of the LoadBalancer IP address and the successful restoration of the database. A manual verification on the cluster confirmed an immaculate environment: nodes correctly associated with the Tailnet, no local secrets on disk, a fully convergent GitOps stack, Longhorn volumes attached, healthy PostgreSQL replicas, and Grafana operational. Zero manual interventions.
This result demonstrates once again the validity of the working framework I am adopting. The time invested in theoretical design (the Design phase) ensured that, during practical implementation, architectural problems never arose. None of the paradigms (Tailscale on Talos, GitOps, remote backup via S3) was questioned.
The problems faced (single-use authentication, network timeouts, misaligned monitoring secrets post-restore) were exclusively of a chronological and integrative nature. Operational bugs that, in a solid architecture, are faced and defeated by isolating them methodically, transforming a chaotic cascading failure into a predictable and orderly rebirth.
The bridge is built. The Ephemeral Castle is standing once again and, thanks to the mesh VPN, it is finally ready to speak, in the next step, with the isolated instance of HashiCorp Vault.


