Fintech startup

Repatriating from Azure AKS to bare-metal RKE2 — 73% lower spend, roughly 2× the performance

A growth-stage Fintech was spending about €22k/month on AKS and managed Azure databases. We moved production to a HA bare-metal RKE2 platform with a separate dev cluster, dedicated database tier, off-site backups, observability, and CI on-prem. Monthly infrastructure spend dropped to about €6k, while comparable workloads ran roughly twice as fast.

Duration: 12 weeks
Industry: Fintech startup
Stack: RKE2 Metallb Longhorn MariaDB Galera MariaDB MongoDB Velero Prometheus Grafana Loki GitLab CE ArgoCD Ansible

The problem

A growth-stage Fintech team was running production on Azure AKS with managed MariaDB and MongoDB, paying around €22k per month. Cost was climbing faster than revenue, AKS upgrade cycles had repeatedly bitten them, and per-vCPU performance on managed nodes was visibly worse than it should have been on real silicon. They had budget for a hardware purchase, but no internal capacity to run a cluster on it.

What I did

Twelve-week engagement covering hardware sizing, two RKE2 clusters, a separate database tier, backups, and the supporting platform.

Foundation

Sized hardware against the previous twelve months of cluster usage (CPU/memory p95, IOPS, network), specced three racks in a local data centre under a co-location arrangement
Built a HA RKE2 control-plane (three control with Corosync/Pacemaker) and a worker pool sized for current load plus 30% headroom
Stood up a parallel HA dev RKE2 cluster on smaller hardware so the team could rehearse upgrades and platform changes against a real cluster, not a kind/k3d toy
Longhorn for in-cluster persistent volumes; Velero with off-site object storage for cluster state and PV snapshots

Database tier

Dedicated bare-metal database cluster, kept separate from the Kubernetes nodes — simpler operations, smaller blast radius
Three engines on the same fleet: MariaDB Galera (multi-master for the OLTP path), MariaDB primary/replica (for reporting workloads where replication lag is acceptable), MongoDB replica set (for the document store)
Backups taken at the database layer (logical dumps plus binlog/oplog), shipped off-site nightly — independent of cluster-level Velero backups

Supporting platform

Self-hosted GitLab CE for source and CI; runners on dedicated nodes outside the prod cluster so a CI surge can never pressure production
Observability stack on the cluster itself: Prometheus, Grafana, Loki — SLO-based alerting, runbooks linked from every alert
ArgoCD for GitOps: every change to the cluster, the platform, and the workloads goes through a git review

Migration

Per service: replicated state onto the new database tier, stood the workload up on RKE2 in shadow mode behind the existing Azure ingress, cut traffic over with a DNS flip, kept the AKS instance running for two weeks as a fallback
Migrated in dependency order, leaf services first to build operational confidence
The Azure subscription was only cancelled after two clean billing cycles on the new platform

The result

Monthly infra spend €22k → €6k, sustained across the first three billing cycles
~2× workload performance at equivalent vCPU count vs the previous AKS + managed-DB setup — most of the gain came from removing virtualisation overhead and noisy-neighbour effects on managed nodes
HA prod cluster, HA dev cluster, dedicated DB tier with off-site backups, on-prem GitLab, and full observability — all on a single Ansible-managed Linux baseline
The team has been running both clusters without me for the four months since handover

What was deliberately left alone

I did not move email, the corporate identity provider, or anything that genuinely benefits from being managed. The point of repatriation is to keep what you can run well in-house and pay for the rest — not to score points by removing every cloud line item.