Fintech startup
Repatriating from Azure AKS to bare-metal RKE2 — 73% lower spend, roughly 2× the performance
A growth-stage Fintech was spending about €22k/month on AKS and managed Azure databases. We moved production to a HA bare-metal RKE2 platform with a separate dev cluster, dedicated database tier, off-site backups, observability, and CI on-prem. Monthly infrastructure spend dropped to about €6k, while comparable workloads ran roughly twice as fast.
- Duration
- 12 weeks
- Industry
- Fintech startup
- Stack
- RKE2 Metallb Longhorn MariaDB Galera MariaDB MongoDB Velero Prometheus Grafana Loki GitLab CE ArgoCD Ansible
The problem
A growth-stage Fintech team was running production on Azure AKS with managed MariaDB and MongoDB, paying around €22k per month. Cost was climbing faster than revenue, AKS upgrade cycles had repeatedly bitten them, and per-vCPU performance on managed nodes was visibly worse than it should have been on real silicon. They had budget for a hardware purchase, but no internal capacity to run a cluster on it.
What I did
Twelve-week engagement covering hardware sizing, two RKE2 clusters, a separate database tier, backups, and the supporting platform.
Foundation
- Sized hardware against the previous twelve months of cluster usage (CPU/memory p95, IOPS, network), specced three racks in a local data centre under a co-location arrangement
- Built a HA RKE2 control-plane (three control with Corosync/Pacemaker) and a worker pool sized for current load plus 30% headroom
- Stood up a parallel HA dev RKE2 cluster on smaller hardware so the team could rehearse upgrades and platform changes against a real cluster, not a kind/k3d toy
- Longhorn for in-cluster persistent volumes; Velero with off-site object storage for cluster state and PV snapshots
Database tier
- Dedicated bare-metal database cluster, kept separate from the Kubernetes nodes — simpler operations, smaller blast radius
- Three engines on the same fleet: MariaDB Galera (multi-master for the OLTP path), MariaDB primary/replica (for reporting workloads where replication lag is acceptable), MongoDB replica set (for the document store)
- Backups taken at the database layer (logical dumps plus binlog/oplog), shipped off-site nightly — independent of cluster-level Velero backups
Supporting platform
- Self-hosted GitLab CE for source and CI; runners on dedicated nodes outside the prod cluster so a CI surge can never pressure production
- Observability stack on the cluster itself: Prometheus, Grafana, Loki — SLO-based alerting, runbooks linked from every alert
- ArgoCD for GitOps: every change to the cluster, the platform, and the workloads goes through a git review
Migration
- Per service: replicated state onto the new database tier, stood the workload up on RKE2 in shadow mode behind the existing Azure ingress, cut traffic over with a DNS flip, kept the AKS instance running for two weeks as a fallback
- Migrated in dependency order, leaf services first to build operational confidence
- The Azure subscription was only cancelled after two clean billing cycles on the new platform
The result
- Monthly infra spend €22k → €6k, sustained across the first three billing cycles
- ~2× workload performance at equivalent vCPU count vs the previous AKS + managed-DB setup — most of the gain came from removing virtualisation overhead and noisy-neighbour effects on managed nodes
- HA prod cluster, HA dev cluster, dedicated DB tier with off-site backups, on-prem GitLab, and full observability — all on a single Ansible-managed Linux baseline
- The team has been running both clusters without me for the four months since handover
What was deliberately left alone
I did not move email, the corporate identity provider, or anything that genuinely benefits from being managed. The point of repatriation is to keep what you can run well in-house and pay for the rest — not to score points by removing every cloud line item.