$ invictus-solutions

Software house

Replatforming a virtualised Kubernetes cluster — k3s to RKE2, GitOps, SSO, and modern observability

An internal k3s cluster on virtualised infrastructure had quietly become production: hand-rolled storage, ad-hoc kubectl deploys, scattered metrics and logs, shared kubeconfigs, and no clean RBAC. Over nine weeks we moved it to RKE2, Rook-Ceph, ArgoCD GitOps, Harbor, a Grafana observability stack, Keycloak SSO, namespace RBAC from group claims, and Renovate-managed platform updates.

Duration: 9 weeks
Industry: Software house
Stack: RKE2, Proxmox VE, Ansible, Rook-Ceph, ArgoCD, Harbor, Prometheus, Mimir, Grafana, Loki, Tempo, Keycloak, Renovate

The problem

A k3s cluster on a virtualised estate had quietly grown into the team's de facto internal platform, running a dozen production apps, without ever being treated as one. Persistent volumes came from ceph-csi pointed at an external Ceph cluster. Prometheus and Grafana existed but were patchy; logs were scattered; there were no traces. Deployments were ad-hoc kubectl apply runs from people's laptops. Cluster login meant kubeconfigs handed around; app login meant per-app users and passwords. RBAC was effectively "cluster-admin or no access". Tool versions had drifted, and "we'll upgrade next quarter" was two years old.

What I did

Nine-week engagement. The sequence mattered: cluster control plane and storage first, then GitOps and the registry, then the observability rebuild, then identity and RBAC, then automation.

k3s → RKE2, in place

  • Wrote Ansible playbooks that provision the cluster VMs on Proxmox (template clone, cloud-init, network and disk layout), then initialise an HA RKE2 control plane and join the workers. A full rebuild from scratch is now a single playbook run
  • Stood up the new RKE2 cluster on the same virtualisation, sized off twelve months of k3s usage data
  • Migrated workloads namespace-by-namespace via cordon/drain on k3s and re-apply on RKE2; PVCs followed via the storage migration below
  • The k3s cluster ran in parallel for a week of soak before being decommissioned
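
The difference between the first control-plane node and every node that joins it can be sketched as follows. This is an illustration, not the playbooks from the engagement: the hostnames, the token value and the rke2_config helper are hypothetical. The config keys (token, server, tls-san) and supervisor port 9345 are standard RKE2 settings.

```python
import json

SUPERVISOR_PORT = 9345  # RKE2 supervisor port that joining nodes register against


def rke2_config(role, index, first_server, token, tls_sans=()):
    """Return the contents of /etc/rancher/rke2/config.yaml for one node.

    role is "server" or "agent"; index is the node's 0-based position
    within that role. All argument values are hypothetical examples.
    """
    if role not in ("server", "agent"):
        raise ValueError(f"unknown role: {role}")

    config = {"token": token}
    is_bootstrap_node = role == "server" and index == 0
    if not is_bootstrap_node:
        # Every node except the first server joins through the first server.
        config["server"] = f"https://{first_server}:{SUPERVISOR_PORT}"
    if role == "server" and tls_sans:
        # Extra names the API server certificate should be valid for.
        config["tls-san"] = list(tls_sans)
    return config


if __name__ == "__main__":
    for role, index in [("server", 0), ("server", 1), ("agent", 0)]:
        cfg = rke2_config(role, index, "cp-1.example.internal",
                          "REPLACE_ME", ["api.example.internal"])
        # JSON is a subset of YAML, so this output is a valid config.yaml.
        print(role, index, json.dumps(cfg))
```

In an Ansible setup the same branching would typically live in a template, with the first server started before the remaining servers and agents.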

ceph-csi → Rook-Ceph

  • Moved the Ceph integration in-cluster under Rook: the same external Ceph cluster still holds the persistent data, but it is now managed through the Rook operator rather than a hand-rolled ceph-csi setup
  • PVCs migrated using rbd snapshot → restore against the new StorageClass; per-app downtime measured in single-digit minutes
  • Storage configuration now lives in the same GitOps stream as everything else

GitOps with ArgoCD

  • ArgoCD installed cluster-side, configured with one app-of-apps per environment
  • Every cluster workload — system add-ons, observability stack, internal apps — moved into a git repo with per-environment overlays
  • The "I ran kubectl apply on my laptop" pattern stopped being a thing
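
For readers unfamiliar with the app-of-apps pattern, a minimal sketch of one parent Application per environment is shown here. The repository URL, the environments/<env>/apps directory layout, the root-<env> naming and the automated sync policy are assumptions made for illustration; only the Application resource shape comes from ArgoCD itself.

```python
import json


def root_application(env, repo_url, revision="main"):
    """Build the parent ArgoCD Application for one environment.

    The parent points at a directory of child Application manifests,
    so adding a workload means adding a file to git.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"root-{env}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "targetRevision": revision,
                # Hypothetical layout: child Applications live in this path.
                "path": f"environments/{env}/apps",
            },
            "destination": {
                # In-cluster API server address used by ArgoCD.
                "server": "https://kubernetes.default.svc",
                "namespace": "argocd",
            },
            # Assumption: automated sync with pruning and self-healing.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    repo = "https://git.example.internal/platform/gitops.git"  # hypothetical
    for env in ("staging", "prod"):
        print(json.dumps(root_application(env, repo), indent=2))
```

With self-healing enabled, a manual kubectl change to a managed resource is reverted to the state in git on the next reconciliation.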

Harbor

  • Self-hosted Harbor as the cluster's pull-through proxy and the team's source-of-truth registry
  • Image signing, vulnerability scanning, and per-project RBAC tied to the same OIDC identities used everywhere else
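
Pulling through Harbor means workloads reference the upstream image under a Harbor proxy-cache project instead of the upstream registry. A small sketch of that rewrite for Docker Hub images, assuming a Harbor host of harbor.example.internal and a proxy-cache project named dockerhub (both hypothetical):

```python
def via_proxy_cache(image, harbor_host, proxy_project):
    """Rewrite a Docker Hub image reference to pull through Harbor.

    Harbor proxy-cache pulls use <harbor>/<proxy project>/<repository>,
    and official Docker Hub images need the explicit "library/" prefix.
    """
    first, sep, rest = image.partition("/")
    # A leading component containing "." or ":" (or "localhost") is a registry host.
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, path = first, rest
    else:
        registry, path = "docker.io", image

    if registry != "docker.io":
        raise ValueError(f"{registry} is not served by this proxy project")
    if "/" not in path:
        path = "library/" + path  # e.g. nginx becomes library/nginx
    return f"{harbor_host}/{proxy_project}/{path}"


if __name__ == "__main__":
    for ref in ("nginx:1.25", "grafana/loki:3.0.0", "docker.io/library/redis:7"):
        print(ref, "->", via_proxy_cache(ref, "harbor.example.internal", "dockerhub"))
```

Other upstream registries would each need their own proxy-cache project, which is why the sketch rejects anything that is not docker.io.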

Modernised observability

  • Prometheus on the cluster; Mimir behind it for long-term, multi-tenant metric storage
  • Loki for logs, Tempo for traces — one Grafana fronts all four backends
  • Exemplars wired across the stack: a metric line jumps to the trace, a trace span jumps to the log line, a log line jumps back to the metric — the on-call doesn't have to switch tools mid-incident
  • SLO-based alerting with runbooks linked from each alert; the legacy Prometheus rules audited and either kept, fixed, or deleted
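
The case study does not say which SLO alerting scheme was used. One common approach is multi-window burn-rate alerting, sketched here under that assumption. The 14.4 threshold corresponds to spending 2% of a 30-day error budget in one hour; the function names are hypothetical.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being spent.

    A burn rate of 1.0 uses up exactly the whole budget over the SLO
    period; 14.4 uses 2% of a 30-day budget in one hour.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_errors, short_window_errors,
                slo_target, threshold=14.4):
    """Page only if both windows burn fast.

    The long window (e.g. 1h) shows the burn is significant; the short
    window (e.g. 5m) shows it is still happening right now.
    """
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)


if __name__ == "__main__":
    # 2% of requests failing against a 99.9% SLO in both windows.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # The incident has already recovered in the short window.
    print(should_page(0.02, 0.001, slo_target=0.999))
```

In practice the two error ratios would come from recording rules over the request metrics rather than being passed in by hand.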

Identity and RBAC

  • Keycloak as the single OIDC provider; kube-apiserver wired to it for kubectl login — no more shared kubeconfigs
  • Same Keycloak realm wired into ArgoCD, Harbor, Grafana and every internal app on the cluster — one login, one place to revoke
  • Granular per-namespace RBAC: roles defined in git, group→namespace bindings driven by Keycloak group claims, so onboarding/offboarding is a Keycloak group change rather than a cluster patch
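
The group-to-namespace mapping can be pictured as a small table in git that expands into RoleBindings. The sketch below assumes the API server is configured with a groups prefix of "oidc:" and uses the built-in edit and view ClusterRoles in place of the engagement's own role definitions; the group names and the role_bindings helper are hypothetical.

```python
import json


def role_bindings(group_access, groups_prefix="oidc:"):
    """Expand {group: {namespace: role}} into RoleBinding manifests.

    groups_prefix must match the prefix the API server adds to group
    claims (assumed here), so subjects line up with the OIDC token.
    """
    bindings = []
    for group, namespaces in sorted(group_access.items()):
        for namespace, role in sorted(namespaces.items()):
            bindings.append({
                "apiVersion": "rbac.authorization.k8s.io/v1",
                "kind": "RoleBinding",
                "metadata": {"name": f"{group}-{role}", "namespace": namespace},
                "subjects": [{
                    "kind": "Group",
                    "apiGroup": "rbac.authorization.k8s.io",
                    "name": f"{groups_prefix}{group}",
                }],
                "roleRef": {
                    "kind": "ClusterRole",
                    "apiGroup": "rbac.authorization.k8s.io",
                    "name": role,
                },
            })
    return bindings


if __name__ == "__main__":
    access = {  # hypothetical Keycloak groups and namespaces
        "team-payments": {"payments": "edit", "shared": "view"},
        "team-web": {"web": "edit"},
    }
    print(json.dumps(role_bindings(access), indent=2))
```

Because the bindings name groups rather than users, adding or removing a person never touches the cluster: only their Keycloak group membership changes.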

Renovate

  • Self-hosted Renovate scanning the GitOps repo, the Kustomize bases, and the platform tool versions (RKE2, Rook, ArgoCD, Harbor, Grafana, Prometheus, Mimir, Loki, Tempo, Keycloak)
  • Renovate proposes the version bumps, GitOps re-applies them once the MR is merged — versioning stops being "next quarter's problem"

The result

  • Cluster on supported RKE2, storage on Rook-Ceph, every workload reconciled by ArgoCD
  • A single Grafana surface across metrics, logs and traces with exemplars wiring them together
  • One identity provider for kubectl, ArgoCD, Harbor, Grafana, and every internal app on the cluster
  • Per-namespace RBAC defined in git and driven from Keycloak group membership — onboarding/offboarding is a single change in the right place
  • Renovate keeps the platform stack current; tool versions stop drifting between engagements

What was deliberately left alone

The team's existing CI was kept: it already built the images and pushed them to the new Harbor instance, and the engagement was about the platform, not the build pipeline. Application-level features and cost optimisation were also out of scope; those are their own conversations.