$ invictus-solutions

Fintech startup

Repatriating from Azure AKS to bare-metal RKE2 — 73% lower spend, roughly 2× the performance

A growth-stage Fintech was spending about €22k/month on AKS and managed Azure databases. We moved production to a HA bare-metal RKE2 platform with a separate dev cluster, dedicated database tier, off-site backups, observability, and CI on-prem. Monthly infrastructure spend dropped to about €6k, while comparable workloads ran roughly twice as fast.

Duration: 12 weeks
Industry: Fintech startup
Stack: RKE2, MetalLB, Longhorn, MariaDB Galera, MariaDB, MongoDB, Velero, Prometheus, Grafana, Loki, GitLab CE, ArgoCD, Ansible

The problem

A growth-stage Fintech team was running production on Azure AKS with managed MariaDB and MongoDB, paying around €22k per month. Cost was climbing faster than revenue, AKS upgrade cycles had repeatedly bitten them, and per-vCPU performance on managed nodes was visibly worse than it should have been on real silicon. They had budget for a hardware purchase, but no internal capacity to run a cluster on it.

What I did

Twelve-week engagement covering hardware sizing, two RKE2 clusters, a separate database tier, backups, and the supporting platform.

Foundation

  • Sized hardware against the previous twelve months of cluster usage (CPU/memory p95, IOPS, network), specced three racks in a local data centre under a co-location arrangement
  • Built a HA RKE2 control-plane (three control-plane nodes with Corosync/Pacemaker) and a worker pool sized for current load plus 30% headroom
  • Stood up a parallel HA dev RKE2 cluster on smaller hardware so the team could rehearse upgrades and platform changes against a real cluster, not a kind/k3d toy
  • Longhorn for in-cluster persistent volumes; Velero with off-site object storage for cluster state and PV snapshots

Database tier

  • Dedicated bare-metal database cluster, kept separate from the Kubernetes nodes — simpler operations, smaller blast radius
  • Three engines on the same fleet: MariaDB Galera (multi-master for the OLTP path), MariaDB primary/replica (for reporting workloads where replication lag is acceptable), MongoDB replica set (for the document store)
  • Backups taken at the database layer (logical dumps plus binlog/oplog), shipped off-site nightly — independent of cluster-level Velero backups
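
A freshness audit for the off-site database backups described above, as an added example that is not code from the engagement. The listing format, engine labels, the 26-hour window and the size-ratio threshold are all assumptions, and the sample data is constructed.

```python
from datetime import datetime, timedelta, timezone

# Engines from the page, each with the change log its point-in-time restore needs.
ENGINES = {"mariadb-galera": "binlog", "mariadb-replica": "binlog", "mongodb": "oplog"}
MAX_AGE = timedelta(hours=26)  # assumption: nightly shipping plus two hours of slack
MIN_SIZE_RATIO = 0.5           # assumption: a dump under half the previous one is suspect

def audit_backups(objects, now):
    """Return {engine: [problem, ...]} for (engine, kind, finished_at, size_bytes)
    tuples from an off-site listing (hypothetical format); [] means healthy."""
    history = {}  # (engine, kind) -> [(finished_at, size)], non-empty uploads only
    for engine, kind, finished_at, size in objects:
        if size > 0:  # a zero-byte upload is a failed backup, not a backup
            history.setdefault((engine, kind), []).append((finished_at, size))
    report = {}
    for engine, log_kind in ENGINES.items():
        problems = []
        for kind in ("dump", log_kind):
            items = sorted(history.get((engine, kind), []))
            if not items:
                problems.append(f"no usable {kind} off-site")
                continue
            newest_at, newest_size = items[-1]
            if now - newest_at > MAX_AGE:
                problems.append(f"newest {kind} is stale")
            if kind == "dump" and len(items) > 1 and newest_size < MIN_SIZE_RATIO * items[-2][1]:
                problems.append("newest dump shrank sharply")
        report[engine] = problems
    return report

# Constructed sample listing (not data from the engagement).
NOW = datetime(2024, 5, 2, 6, 0, tzinfo=timezone.utc)

def _h(hours):
    return NOW - timedelta(hours=hours)

SAMPLE = [
    ("mariadb-galera", "dump", _h(28), 40_000_000_000),
    ("mariadb-galera", "dump", _h(4), 41_000_000_000),
    ("mariadb-galera", "binlog", _h(1), 900_000_000),
    ("mariadb-replica", "dump", _h(28), 12_000_000_000),
    ("mariadb-replica", "dump", _h(4), 3_000_000_000),   # truncated dump
    ("mariadb-replica", "binlog", _h(30), 200_000_000),  # log shipping stalled
    ("mongodb", "dump", _h(4), 8_000_000_000),
    ("mongodb", "oplog", _h(2), 0),                      # zero-byte upload
]

if __name__ == "__main__":
    for engine, problems in audit_backups(SAMPLE, NOW).items():
        print(engine, "OK" if not problems else "; ".join(problems))
```

A check like this only proves the objects exist and look plausible; a periodic test restore is still what proves they are usable.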

Supporting platform

  • Self-hosted GitLab CE for source and CI; runners on dedicated nodes outside the prod cluster so a CI surge can never pressure production
  • Observability stack on the cluster itself: Prometheus, Grafana, Loki — SLO-based alerting, runbooks linked from every alert
  • ArgoCD for GitOps: every change to the cluster, the platform, and the workloads goes through a git review

Migration

  • Per service: replicated state onto the new database tier, stood the workload up on RKE2 in shadow mode behind the existing Azure ingress, cut traffic over with a DNS flip, kept the AKS instance running for two weeks as a fallback
  • Migrated in dependency order, leaf services first to build operational confidence
  • The Azure subscription was only cancelled after two clean billing cycles on the new platform

The result

  • Monthly infra spend €22k → €6k, sustained across the first three billing cycles
  • ~2× workload performance at equivalent vCPU count vs the previous AKS + managed-DB setup — most of the gain came from removing virtualisation overhead and noisy-neighbour effects on managed nodes
  • HA prod cluster, HA dev cluster, dedicated DB tier with off-site backups, on-prem GitLab, and full observability — all on a single Ansible-managed Linux baseline
  • The team has been running both clusters without me for the four months since handover

What was deliberately left alone

I did not move email, the corporate identity provider, or anything that genuinely benefits from being managed. The point of repatriation is to keep what you can run well in-house and pay for the rest — not to score points by removing every cloud line item.