Software house
Replatforming a virtualised Kubernetes cluster — k3s to RKE2, GitOps, SSO, and modern observability
An internal k3s cluster on virtualised infrastructure had quietly become production: hand-rolled storage, ad-hoc kubectl deploys, scattered metrics and logs, shared kubeconfigs, and no clean RBAC. Over nine weeks we moved it to RKE2, Rook-Ceph, ArgoCD GitOps, Harbor, a Grafana observability stack, Keycloak SSO, namespace RBAC from group claims, and Renovate-managed platform updates.
- Duration: 9 weeks
- Industry: Software house
- Stack: RKE2, Proxmox VE, Ansible, Rook-Ceph, ArgoCD, Harbor, Prometheus, Mimir, Grafana, Loki, Tempo, Keycloak, Renovate
The problem
A k3s cluster on a virtualised estate had quietly grown into the team's de-facto internal platform — a dozen production apps — without ever being treated as one. ceph-csi against an external Ceph cluster handled persistent volumes. A patchy Prometheus and Grafana existed; logs were scattered; there were no traces. Deployments were ad-hoc kubectl apply from people's laptops. Cluster login was kubeconfigs handed around; app login was per-app users and passwords. RBAC was effectively "cluster-admin or no access". Tool versions had drifted, and "we'll upgrade next quarter" was two years old.
What I did
Nine-week engagement. The sequence mattered: cluster control plane and storage first, then GitOps and the registry, then the observability rebuild, then identity and RBAC, then automation.
k3s → RKE2, side by side on the same estate
- Wrote Ansible playbooks that provision the cluster VMs on Proxmox (template clone, cloud-init, network and disk layout), then initialise an HA RKE2 control plane and join the workers; a full rebuild from scratch is now a single playbook run
- Stood up the new RKE2 cluster on the same virtualisation, sized off twelve months of k3s usage data
- Migrated workloads namespace-by-namespace via cordon/drain on k3s and re-apply on RKE2; PVCs followed via the storage migration below
- The k3s cluster ran in parallel for a week of soak before being decommissioned
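The initialise-then-join order in the playbook bullet above can be sketched as follows. This is an illustration in Python, not the Ansible tasks from the engagement. It assumes the standard RKE2 layout, in which each node reads /etc/rancher/rke2/config.yaml, the first server has no `server` key, and every later server or agent points at an existing server on the supervisor port 9345 with a shared token. The function name, hostnames and token are hypothetical.

```python
from typing import Optional


def render_rke2_config(token: str, tls_sans: list[str],
                       join_host: Optional[str] = None) -> str:
    """Render a minimal RKE2 config.yaml body for one node.

    join_host=None means "first server, initialise the cluster";
    otherwise the node joins an existing server on port 9345.
    """
    lines = []
    if join_host is not None:
        lines.append(f"server: https://{join_host}:9345")
    lines.append(f'token: "{token}"')
    if tls_sans:
        # Extra names for the API server certificate (server nodes only).
        lines.append("tls-san:")
        lines.extend(f"  - {san}" for san in tls_sans)
    return "\n".join(lines) + "\n"


# Hypothetical values for illustration only.
first_server = render_rke2_config("s3cret", ["rke2.example.internal"])
joining_node = render_rke2_config("s3cret", [], join_host="cp-1.example.internal")
```

A playbook would typically template the first form onto one control-plane VM, wait for it to become ready, and only then template the second form onto the remaining servers and the workers.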
ceph-csi → Rook-Ceph
- Moved Ceph integration in-cluster under Rook — same external Ceph cluster for the persistent data, but managed via Rook operators rather than a hand-rolled ceph-csi setup
- PVCs migrated using rbd snapshot → restore against the new StorageClass; per-app downtime measured in single-digit minutes
- Storage configuration now lives in the same GitOps stream as everything else
GitOps with ArgoCD
- ArgoCD installed cluster-side, configured with one app-of-apps per environment
- Every cluster workload — system add-ons, observability stack, internal apps — moved into a git repo with per-environment overlays
- The "I ran kubectl apply on my laptop" pattern stopped being a thing
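As a rough illustration of the app-of-apps pattern mentioned above, the sketch below builds one parent Argo CD `Application` per environment and prints it as JSON, which kubectl accepts as well as YAML. The repository URL, the `environments/<env>/apps` directory layout, the `root-<env>` naming and the automated sync policy are assumptions for the example; the source does not describe the actual repo structure or sync settings.

```python
import json


def root_application(env: str, repo_url: str, revision: str = "main") -> dict:
    """Build the parent Application for one environment (app-of-apps)."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"root-{env}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "targetRevision": revision,
                # Hypothetical layout: a directory of child Applications per environment.
                "path": f"environments/{env}/apps",
            },
            "destination": {
                # In-cluster API server address used by Argo CD for its own cluster.
                "server": "https://kubernetes.default.svc",
                "namespace": "argocd",
            },
            # Assumed policy: reconcile automatically, prune removed resources, revert drift.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    app = root_application("production", "https://git.example.internal/platform/gitops.git")
    print(json.dumps(app, indent=2))
```

The parent only points at a directory of child `Application` manifests, so adding a workload to an environment becomes a commit to that directory rather than a command run against the cluster.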
Harbor
- Self-hosted Harbor as the cluster's pull-through proxy cache and the team's source-of-truth registry
- Image signing, vulnerability scanning, and per-project RBAC tied to the same OIDC identities used everywhere else
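To make the pull-through idea concrete, here is a minimal sketch of how an upstream image reference maps onto a Harbor proxy-cache project, following Harbor's `<harbor-host>/<proxy-project>/<repository>` pull format. The Harbor hostname and the project names are hypothetical. The source also does not say whether references were rewritten in manifests or redirected through a container-runtime mirror setting, so treat this as one possible approach.

```python
HARBOR_HOST = "harbor.example.internal"  # hypothetical hostname
# Hypothetical mapping of upstream registry -> Harbor proxy-cache project.
PROXY_PROJECTS = {"docker.io": "dockerhub-proxy", "quay.io": "quay-proxy"}


def split_registry(image: str) -> tuple[str, str]:
    """Split an image reference into (registry, remainder), defaulting to Docker Hub."""
    first, sep, rest = image.partition("/")
    # The first path component is a registry only if it looks like a host.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest
    return "docker.io", image


def via_proxy_cache(image: str) -> str:
    """Rewrite an upstream image reference to pull through Harbor."""
    registry, remainder = split_registry(image)
    project = PROXY_PROJECTS.get(registry)
    if project is None:
        return image  # no proxy project configured: leave the reference untouched
    if registry == "docker.io" and "/" not in remainder:
        remainder = f"library/{remainder}"  # official images live under library/
    return f"{HARBOR_HOST}/{project}/{remainder}"
```

For example, `via_proxy_cache("nginx:1.25")` yields `harbor.example.internal/dockerhub-proxy/library/nginx:1.25`, while a reference to a registry with no configured proxy project is returned unchanged.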
Modernised observability
- Prometheus on the cluster; Mimir behind it for long-term, multi-tenant metric storage
- Loki for logs, Tempo for traces — one Grafana fronts all four backends
- Signals cross-linked in Grafana: exemplars jump from a metric line to the trace, a trace span jumps to its log lines, and a log line jumps back to the metric, so the on-call engineer doesn't have to switch tools mid-incident
- SLO-based alerting with runbooks linked from each alert; the legacy Prometheus rules audited and either kept, fixed, or deleted
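The arithmetic behind SLO-based alerting can be sketched as below. The source does not state the SLO targets or the alerting method, so the 99.9% target, the 30-day period and the multi-window burn-rate threshold of 14.4 (a commonly used paging threshold for a one-hour window) are assumptions for illustration. At that rate, one hour of errors consumes 14.4 / 720 = 2% of a 30-day error budget.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail, e.g. an SLO of 0.999 allows 0.001."""
    return 1.0 - slo


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / error_budget(slo)


def budget_consumed(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Share of the whole period's budget spent if `rate` holds for the window."""
    return rate * window_hours / (period_days * 24)


def should_page(long_ratio: float, short_ratio: float, slo: float,
                threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window exceed the threshold.

    The short window keeps the alert from firing long after errors have stopped.
    """
    return (burn_rate(long_ratio, slo) >= threshold
            and burn_rate(short_ratio, slo) >= threshold)
```

In a Prometheus or Mimir rule, `long_ratio` and `short_ratio` would be the error ratios over, say, a one-hour and a five-minute range, and the alert's annotation would carry the runbook link.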
Identity and RBAC
- Keycloak as the single OIDC provider; kube-apiserver wired to it for kubectl login — no more shared kubeconfigs
- Same Keycloak realm wired into ArgoCD, Harbor, Grafana and every internal app on the cluster — one login, one place to revoke
- Granular per-namespace RBAC: roles defined in git, group→namespace bindings driven by Keycloak group claims, so onboarding/offboarding is a Keycloak group change rather than a cluster patch
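The group-to-namespace bindings described above could look roughly like the following sketch, which generates `RoleBinding` manifests from a small table. The group names, namespaces and the `oidc:` prefix are hypothetical. The built-in `edit` and `view` ClusterRoles stand in for whatever roles the repo actually defines. The subject name has to match the value the API server derives from the token's groups claim, including any prefix set with `--oidc-groups-prefix`.

```python
RBAC_API_GROUP = "rbac.authorization.k8s.io"


def role_binding(namespace: str, group: str, cluster_role: str,
                 groups_prefix: str = "oidc:") -> dict:
    """Bind one identity-provider group to a ClusterRole inside a single namespace."""
    return {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{group}-{cluster_role}", "namespace": namespace},
        "subjects": [{
            "kind": "Group",
            # Groups claim value plus the API server's --oidc-groups-prefix (assumed "oidc:").
            "name": f"{groups_prefix}{group}",
            "apiGroup": RBAC_API_GROUP,
        }],
        "roleRef": {
            "kind": "ClusterRole",
            "name": cluster_role,
            "apiGroup": RBAC_API_GROUP,
        },
    }


# Hypothetical (namespace, Keycloak group, role) rows kept in git.
BINDINGS = [
    ("billing", "team-billing", "edit"),
    ("billing", "platform-oncall", "view"),
]

manifests = [role_binding(ns, grp, role) for ns, grp, role in BINDINGS]
```

Because the binding names a group rather than a user, adding or removing someone in the Keycloak group changes their access on their next token refresh without touching the cluster.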
Renovate
- Self-hosted Renovate scanning the GitOps repo, the Kustomize bases, and the platform tool versions (RKE2, Rook, ArgoCD, Harbor, Grafana, Prometheus, Mimir, Loki, Tempo, Keycloak)
- Renovate proposes the version bumps and GitOps re-applies them once the MR is merged, so versioning stops being "next quarter's problem"
The result
- Cluster on supported RKE2, storage on Rook-Ceph, every workload reconciled by ArgoCD
- A single Grafana surface across metrics, logs and traces with exemplars wiring them together
- One identity provider for kubectl, ArgoCD, Harbor, Grafana, and every internal app on the cluster
- Per-namespace RBAC defined in git and driven from Keycloak group membership — onboarding/offboarding is a single change in the right place
- Renovate keeps the platform stack current; tool versions stop drifting between engagements
What was deliberately left alone
The team's existing CI was kept: it already built the images and pushed them to the new Harbor instance, and the engagement was about the platform, not the build pipeline. Application-level features and cost optimisation were also out of scope; those are their own conversations.