Fintech startup

From single-VM web app to HA Kubernetes — feature-branch CI/CD and zero-downtime deploys

A web platform was split across two Linux VMs: one app server, one database server. Every deploy meant downtime, staging was a queue, and local development did not match production. We moved it to HA Kubernetes with production Docker images, feature-branch environments, GitLab CI/CD, MariaDB Galera, observability, and a docker-compose setup that mirrors production.

Duration: 10 weeks
Industry: Fintech startup
Stack: Kubernetes, Docker, docker-compose, GitLab CI, GitLab Agent (KAS), Kustomize, MariaDB Galera, Prometheus, Grafana, Loki, Elasticsearch, Kibana

The problem

The web platform ran on two standalone Linux VMs: one application server and one database server. Every release required downtime, because the single web VM was production. Developers could not work on different features in parallel: there was one shared staging environment, and changes queued for it. Local development used a different stack from production, so a class of bugs only surfaced after deploy. Both VMs were single points of failure.

What I did

Ten-week engagement covering containerisation, the Kubernetes platform, the CI/CD pipeline, the database move, and the observability stack.

Containerisation and local parity

Wrote production Docker images for the application, with the same base image, runtime versions and configuration shape in every environment
The same images run locally via docker-compose so a developer on their laptop sees the same runtime they ship to production
Sessions, file uploads and other shared state moved out of the application container into backing services, so any pod is interchangeable

Kubernetes platform

HA cluster with rolling-update Deployments behind an Ingress, PodDisruptionBudgets, and HPA on the application tier
Workloads sized using usage data from the existing VMs, not guesses
Old VMs ran in parallel for two weeks before being decommissioned
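
Sizing from usage data can be as simple as taking a high percentile of observed usage and adding headroom. A minimal Python sketch, assuming CPU samples in millicores exported from the old VMs; the sample values, the 95th percentile, the 20% headroom and the function names are illustrative assumptions, not figures from this engagement:

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_cpu_request(samples_millicores, pct=95, headroom=0.2):
    """Suggest a CPU request in millicores: percentile of usage plus headroom, rounded up to 10m."""
    base = percentile(samples_millicores, pct)
    return math.ceil(base * (1 + headroom) / 10) * 10


# Hypothetical samples, for illustration only.
usage = [120, 135, 150, 160, 180, 210, 240, 260, 300, 420]
request = suggest_cpu_request(usage)
```

The same calculation applies to memory; the percentile and headroom are policy choices to revisit once real pod metrics are available.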

Database

MariaDB Galera multi-master replacing the single DB VM
Logical and binlog backups taken at the database layer; replication-lag and slow-query alerts wired up from day one
Schema-changing releases stay in a planned window and are the only releases that take downtime; online schema change with Galera is its own discipline and was scoped out of this engagement
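
One design consequence of replacing a single database VM with Galera is quorum: the cluster keeps accepting writes only while a majority of nodes can still see each other. A simplified Python sketch of that majority rule, assuming equally weighted nodes; the function names are hypothetical and the node counts are illustrative, since the cluster size is not stated here:

```python
def has_quorum(reachable_nodes, cluster_size):
    """True if the reachable nodes form a strict majority of the cluster."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 0 and cluster_size")
    return reachable_nodes * 2 > cluster_size


def tolerated_failures(cluster_size):
    """Largest number of nodes that can be lost while the rest keep quorum."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return (cluster_size - 1) // 2
```

Under this rule a three-node cluster survives one node failure, and a fourth node adds no extra tolerance, which is why odd cluster sizes are the usual choice.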

CI/CD with feature branches

GitLab CI builds the image once per commit; the same artefact runs in every environment
GitLab Agent (KAS) pulls per-environment Kustomize overlays from the repo (prod, staging and per-feature ephemeral envs) and reconciles them into the cluster on every change, so developers ship by merging
Each feature branch gets an ephemeral namespace and a DNS subdomain on push — two developers can work on two features at once without stepping on each other
Merge-request pipelines run integration tests against the actual built image, not the developer's host
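
Per-branch environments need a name that is valid both as a Kubernetes namespace and as a DNS label. GitLab exposes a predefined `CI_COMMIT_REF_SLUG` variable for this purpose; the Python sketch below shows the same kind of normalisation for illustration (it additionally collapses runs of separators). The `feat-` prefix, the `preview.example.com` domain and the function names are hypothetical, not taken from the project:

```python
import re

MAX_LABEL = 63  # DNS label limit; also the limit for namespace names


def slugify_branch(branch):
    """Lowercase, replace runs outside a-z0-9 with '-', trim to 63 chars, strip edge dashes."""
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower())
    slug = slug[:MAX_LABEL].strip("-")
    if not slug:
        raise ValueError(f"branch {branch!r} has no usable characters")
    return slug


def preview_environment(branch, prefix="feat-", domain="preview.example.com"):
    """Return (namespace, hostname) for a branch, keeping the namespace within 63 chars."""
    budget = MAX_LABEL - len(prefix)
    slug = slugify_branch(branch)[:budget].strip("-")
    return prefix + slug, f"{slug}.{domain}"


namespace, host = preview_environment("feature/ABC-123_new login")
```

Branches that differ only in punctuation or case map to the same slug, so a real pipeline may want to append a short commit or merge-request identifier.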

Observability

Prometheus, Grafana and Loki on the cluster for metrics and application logs
Application metrics, golden signals and DB-tier metrics (replication lag, slow queries, connection saturation)
Web and ingress access logs ship to Elasticsearch and are visualised in pre-built Kibana dashboards — per-status, per-endpoint, latency histograms, top clients
SLO-based alerting with runbooks linked from every alert — no spam, no orphan pages
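
One common way to get SLO-based alerts that page rarely is multi-window burn-rate alerting: page only when the error budget is being consumed fast over both a long and a short window. Which method was used here is not stated, so the Python sketch below is an assumption; the 99.9% target, the 14.4 threshold and the function names are illustrative:

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_errors, short_window_errors, slo_target=0.999, threshold=14.4):
    """Page only if both windows burn budget faster than the threshold."""
    return (
        burn_rate(long_window_errors, slo_target) >= threshold
        and burn_rate(short_window_errors, slo_target) >= threshold
    )
```

A burn rate of 14.4 sustained for one hour spends about 2% of a 30-day error budget. Requiring the short window to agree means the alert clears soon after the problem stops, which helps keep pages actionable.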

The result

Zero-downtime rolling deploys for application changes — a typical release goes out unnoticed
DB-migration releases still take a planned window, honestly acknowledged in the release runbook
Two single-VM SPOFs gone — the web tier scales horizontally, the database is a Galera cluster
Parallel feature work: each merge request gets its own preview environment; queue contention and merge-conflict pain dropped substantially
Local stack matches production: the docker-compose dev environment surfaces config and runtime issues before they ever reach a real environment

What was deliberately left alone

The team's existing source of truth (GitLab) and identity stack stayed in place. This engagement was about moving the application onto an HA container platform, not rebuilding everything around it.