Fintech startup
From single-VM web app to HA Kubernetes — feature-branch CI/CD and zero-downtime deploys
A web platform was split across two Linux VMs: one app server, one database server. Every deploy meant downtime, the single staging environment was a queue, and local development did not match production. I moved it to HA Kubernetes with production Docker images, feature-branch environments, GitLab CI/CD, MariaDB Galera, observability, and a docker-compose setup that mirrors production.
- Duration: 10 weeks
- Industry: Fintech startup
- Stack: Kubernetes, Docker, docker-compose, GitLab CI, GitLab Agent (KAS), Kustomize, MariaDB Galera, Prometheus, Grafana, Loki, Elasticsearch, Kibana
The problem
The web platform ran on two standalone Linux VMs: one application server and one database server. Every release required downtime because the single web VM was production. Developers could not work on different features in parallel, since there was one shared staging environment and a queue for it. Local development used a different stack from production, so a class of bugs surfaced only after deploy. Both VMs were single points of failure.
What I did
Ten-week engagement covering containerisation, the Kubernetes platform, the CI/CD pipeline, the database move, and the observability stack.
Containerisation and local parity
- Wrote production Docker images for the application, with the same base image, runtime versions and configuration shape in every environment
- The same images run locally via docker-compose so a developer on their laptop sees the same runtime they ship to production
- Sessions, file uploads and other shared state moved out of the application container into backing services, so any pod is interchangeable
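One way to keep "same configuration shape" honest is to compare the set of configuration keys defined locally with the set defined in production, ignoring the values. The following is a minimal sketch of that idea, not the check used in this engagement; the function name and every variable name and value in it are hypothetical.

```python
from typing import Mapping, Set, Tuple


def config_shape_diff(
    local: Mapping[str, str], prod: Mapping[str, str]
) -> Tuple[Set[str], Set[str]]:
    """Return (keys missing locally, keys that exist only locally).

    Values are ignored on purpose: a local database host differs from the
    production one, but both environments should define the same keys.
    """
    local_keys, prod_keys = set(local), set(prod)
    return prod_keys - local_keys, local_keys - prod_keys


# Hypothetical variable names and values, for illustration only.
compose_env = {"DB_HOST": "db", "SESSION_STORE_URL": "tcp://cache:6379"}
prod_env = {
    "DB_HOST": "galera.internal",
    "SESSION_STORE_URL": "tcp://sessions:6379",
    "UPLOADS_BUCKET": "uploads",
}

missing, extra = config_shape_diff(compose_env, prod_env)
print(sorted(missing), sorted(extra))  # prints ['UPLOADS_BUCKET'] []
```

A check like this could run in a merge-request pipeline so that a key added for production but forgotten in the docker-compose file fails early.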
Kubernetes platform
- HA cluster with rolling-update Deployments behind an Ingress, PodDisruptionBudgets, and HPA on the application tier
- Workloads sized using usage data from the existing VMs, not guesses
- Old VMs ran in parallel for two weeks before being decommissioned
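The sizing and autoscaling arithmetic can be sketched in a few lines of Python. The replica formula is the one given in the Kubernetes Horizontal Pod Autoscaler documentation; the real controller additionally applies a tolerance band and the configured minimum and maximum replica counts, which are omitted here. Choosing a high percentile of observed usage as the resource request is an assumption made for illustration, and the sample values are invented.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample list (pct in 1..100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def hpa_desired_replicas(
    current_replicas: int, current_metric: float, target_metric: float
) -> int:
    """ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    return math.ceil(current_replicas * current_metric / target_metric)


# Invented CPU usage samples from a VM, in millicores.
cpu_millicores = [120, 150, 180, 200, 240, 260, 300, 310, 350, 900]

print(percentile(cpu_millicores, 90))    # prints 350
print(hpa_desired_replicas(4, 90, 60))   # prints 6
```

In this example a single spike (900m) does not drive the request, while the autoscaler would move from 4 to 6 replicas when average utilisation is 90% against a 60% target.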
Database
- MariaDB Galera multi-master replacing the single DB VM
- Logical and binlog backups taken at the database layer; replication-lag and slow-query alerts wired up from day one
- Schema-changing releases stay in a planned window — the only releases that take downtime — because online schema change with Galera is its own discipline and was scoped out of this engagement
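Galera keeps a cluster writable only while a partition can see a majority of the nodes, which is why node count matters when replacing a single database VM. The sketch below shows that arithmetic under the assumption that all nodes carry equal weight; the node count used in this engagement is not stated here, so the sizes in the example are illustrative.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Strict majority: a partition must see more than half of the cluster."""
    return 2 * reachable_nodes > cluster_size


def tolerated_failures(cluster_size: int) -> int:
    """Simultaneous node losses the cluster survives while keeping quorum."""
    return (cluster_size - 1) // 2


for size in (2, 3, 5):
    print(size, tolerated_failures(size))
# prints:
# 2 0
# 3 1
# 5 2
```

The output shows why two nodes are no better than one for availability: losing either node leaves the survivor without a majority.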
CI/CD with feature branches
- GitLab CI builds the image once per commit; the same artefact runs in every environment
- GitLab Agent (KAS) pulls per-environment Kustomize overlays from the repo — prod, staging and per-feature ephemeral envs — and reconciles them into the cluster on every change; developers ship by merging
- Each feature branch gets an ephemeral namespace and a DNS subdomain on push — two developers can work on two features at once without stepping on each other
- Merge-request pipelines run integration tests against the actual built image, not the developer's host
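Per-branch namespaces and subdomains need a name that is valid both as a Kubernetes namespace and as a DNS label (lowercase alphanumerics and hyphens, at most 63 characters). The sketch below derives one from a branch name using the same rules GitLab documents for `CI_COMMIT_REF_SLUG`. It is an illustration rather than the pipeline's actual code: the `preview-` prefix, the base domain and both function names are hypothetical.

```python
import re
from typing import Tuple

DNS_LABEL_MAX = 63  # limit for a DNS label and for a Kubernetes namespace name


def ref_slug(ref_name: str, max_len: int = DNS_LABEL_MAX) -> str:
    """Lowercase, replace anything outside a-z0-9 with '-', truncate, trim dashes."""
    slug = re.sub(r"[^a-z0-9]", "-", ref_name.lower())
    return slug[:max_len].strip("-")


def preview_env(
    branch: str, base_domain: str, prefix: str = "preview-"
) -> Tuple[str, str]:
    """Return (namespace, hostname) for a branch; prefix and domain are hypothetical."""
    # Leave room for the prefix so the namespace stays within the 63-char limit.
    slug = ref_slug(branch, DNS_LABEL_MAX - len(prefix))
    if not slug:
        raise ValueError(f"branch {branch!r} has no DNS-safe characters")
    return prefix + slug, f"{slug}.{base_domain}"


print(preview_env("Feature/ABC-123_login", "preview.example.com"))
# prints ('preview-feature-abc-123-login', 'feature-abc-123-login.preview.example.com')
```

Two long branch names that share their first characters can still collide after truncation; appending a short hash of the full branch name is one common way to avoid that.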
Observability
- Prometheus, Grafana and Loki on the cluster for metrics and application logs
- Application metrics, golden signals and DB-tier metrics (replication lag, slow queries, connection saturation)
- Web and ingress access logs ship to Elasticsearch and are visualised in pre-built Kibana dashboards — per-status, per-endpoint, latency histograms, top clients
- SLO-based alerting with runbooks linked from every alert — no spam, no orphan pages
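SLO-based alerting is commonly expressed as an error-budget burn rate: how fast the service is spending its allowed failures relative to the SLO. The sketch below shows the arithmetic only. The 99.9% target, the 30-day period and the 14.4 paging threshold are illustrative assumptions borrowed from common SRE practice, not values taken from this engagement.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def budget_consumed(
    rate: float, window_hours: float, slo_period_hours: float = 30 * 24
) -> float:
    """Fraction of the whole period's budget spent at `rate` for `window_hours`."""
    return rate * window_hours / slo_period_hours


def should_page(long_rate: float, short_rate: float, threshold: float = 14.4) -> bool:
    """Page only if both a long and a short window exceed the threshold,
    so an incident that has already recovered does not wake anyone."""
    return long_rate >= threshold and short_rate >= threshold


rate = burn_rate(error_ratio=0.02, slo_target=0.999)  # about 20x the budget
print(should_page(long_rate=rate, short_rate=rate))   # prints True
```

At a burn rate of 14.4, one hour consumes about 2% of a 30-day budget, which is why that figure is often used as a fast-burn paging threshold.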
The result
- Zero-downtime rolling deploys for application changes — a typical release goes out unnoticed
- DB-migration releases still take a planned window, and the release runbook says so explicitly
- Two single-VM SPOFs gone — the web tier scales horizontally, the database is a Galera cluster
- Parallel feature work: each merge request gets its own preview environment; queue contention and merge-conflict pain dropped substantially
- Local stack matches production: the docker-compose dev environment surfaces config and runtime issues before they ever reach a real environment
What was deliberately left alone
The team's existing source of truth (GitLab) and identity stack stayed in place. This engagement was about moving the application onto an HA container platform, not rebuilding everything around it.