How Much of Your Infrastructure Can You Really Containerize?
The container pitch is compelling: package everything into images, deploy to Kubernetes, scale horizontally, update with zero downtime. The reality is messier. Some workloads containerize beautifully. Some containerize with caveats. And some should stay on VMs for years to come.
This isn't a theoretical exercise. It's drawn from migrating real production workloads -- identity providers, CI/CD platforms, monitoring stacks, databases, web applications, and AI workers -- from traditional VMs to Kubernetes on bare metal.
What containerizes well
Stateless web applications
This is the easy win. Web frontends, REST APIs, microservices -- anything that reads configuration from environment variables, stores state in a database, and can run as multiple identical replicas.
Characteristics:
- No local filesystem state (or ephemeral only)
- Configuration via environment variables or mounted ConfigMaps
- Horizontal scaling by adding replicas
- Health checks are straightforward (HTTP endpoint returns 200)
Examples from production: Next.js websites, API services, portal frontends. These run as Deployments with 2+ replicas behind a load balancer. Rolling updates with zero downtime. This is exactly what Kubernetes was built for.
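As a sketch, a minimal Deployment for such a service might look like the following. The service name, image, port, and probe path are placeholders, not taken from any specific production setup:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend            # hypothetical service name
spec:
  replicas: 2                   # two identical replicas behind the load balancer
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web
          image: registry.example.com/web-frontend:1.4.2  # placeholder image
          ports:
            - containerPort: 3000
          env:
            - name: DATABASE_URL        # configuration via environment variables
              valueFrom:
                secretKeyRef:
                  name: web-frontend-db
                  key: url
          readinessProbe:               # "HTTP endpoint returns 200"
            httpGet:
              path: /healthz
              port: 3000
```

A rolling update is then just a change to the image tag; Kubernetes replaces pods one at a time while the readiness probe gates traffic.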
Background workers and job processors
Workers that pull tasks from a queue, process them, and write results to a database or object store. They're stateless by nature -- any worker can handle any task.
If a worker crashes, Kubernetes restarts it. If you need more throughput, increase the replica count. The queue absorbs load spikes without manual intervention.
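One hedged sketch of "increase the replica count": a HorizontalPodAutoscaler scaling a hypothetical worker Deployment on CPU utilization. (Scaling directly on queue depth requires custom metrics or a tool like KEDA; CPU is the built-in proxy shown here.)

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: task-worker             # hypothetical worker Deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add workers when average CPU crosses 70%
```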
Monitoring and observability
The entire monitoring stack containerizes cleanly:
- Prometheus -- Runs as a StatefulSet with persistent storage for metrics. Horizontal scaling via sharding or Thanos.
- Grafana -- Stateless (dashboards stored in a database or Git). Runs as a Deployment.
- Loki -- Log aggregation. Runs as a StatefulSet with chunked storage.
- Promtail -- Log collector. Runs as a DaemonSet (one per node). This is a perfect Kubernetes pattern -- guaranteed to run on every node without manual scheduling.
Node exporters run as DaemonSets, Alertmanager runs with replicated state, and the whole stack is managed by Helm charts. Monitoring on Kubernetes is genuinely better than monitoring on VMs because the scheduling model (DaemonSets, node affinity) maps perfectly to monitoring requirements.
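The DaemonSet pattern, sketched for a Promtail-style log collector (the image tag and mount paths are illustrative, not a complete production config):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      containers:
        - name: promtail
          image: grafana/promtail:2.9.0     # version is illustrative
          args:
            - -config.file=/etc/promtail/promtail.yaml
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log    # node's own logs; one collector pod per node
```

Adding a node to the cluster automatically adds a collector -- no Ansible run, no manual scheduling.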
CI/CD tools
GitLab, ArgoCD, and container registries like Harbor are designed to run on Kubernetes. They ship official Helm charts, support horizontal scaling, and store state in databases.
GitLab's Kubernetes deployment includes Gitaly (Git storage), Sidekiq (background jobs), web service replicas, and a runner for CI jobs -- all as separate scalable components. On VMs, GitLab is a monolith. On Kubernetes, it's decomposed into independently scalable services.
What containerizes with caveats
Databases
This is the controversial one. The old wisdom was "never run databases in containers." That was valid in 2018 when Kubernetes storage was immature and operators didn't exist. In 2026, the calculus has changed.
What works:
- PostgreSQL via CloudNativePG (or similar operators). HA clusters with streaming replication, automated failover, continuous WAL archiving. The operator handles what a DBA would handle manually on VMs.
- Redis as a StatefulSet with persistence. Straightforward.
- MongoDB with the community operator.
What the operator buys you:
- Automated failover when the primary dies
- Rolling updates that promote a replica before stopping the old primary
- Backup integration with object storage
- Connection pooling and read replica routing
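A sketch of what "the operator handles it" looks like with CloudNativePG. The cluster name, sizes, storage class, and S3 endpoint are placeholders; the backup section assumes an S3-compatible object store:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3                  # one primary, two streaming replicas
  storage:
    size: 50Gi
    storageClass: ceph-rbd      # fast, reliable block storage (placeholder name)
  backup:
    barmanObjectStore:          # continuous WAL archiving to object storage
      destinationPath: s3://backups/app-db      # placeholder bucket
      endpointURL: https://s3.example.com
      s3Credentials:
        accessKeyId:
          name: backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: SECRET_ACCESS_KEY
```

From this one manifest the operator builds the primary/replica topology, wires up failover, and streams WAL to the object store -- the work a DBA would otherwise script by hand.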
The caveats:
- Performance -- Container networking adds latency compared to a database running directly on bare metal with dedicated network interfaces. For most workloads, the difference is negligible. For latency-critical OLTP at scale, benchmark first.
- Storage -- The PersistentVolume must be backed by fast, reliable storage. Ceph RBD works. NFS does not (for databases). Local SSDs give the best performance but sacrifice the ability to reschedule the pod to another node.
- Operational knowledge -- When the operator handles failover, your team still needs to understand what's happening underneath. A PostgreSQL primary/replica topology doesn't become simpler because it's in Kubernetes -- it just becomes automated. When the automation fails, someone needs to know PostgreSQL.
Bottom line on databases: If you have a Kubernetes operator for your database engine and your storage layer is reliable (Ceph, local NVMe with replication, cloud block storage), containerized databases are production-ready. If you're running MySQL on NFS-backed PVCs, don't.
Identity providers
Authentik and Keycloak both run on Kubernetes. The application itself is stateless (state lives in PostgreSQL). But identity is special because every other service depends on it.
The caveat: if Kubernetes has a scheduling problem and the identity provider pods get evicted or stuck in Pending, every service behind SSO becomes inaccessible simultaneously. Run identity pods with high priority classes and resource guarantees (requests equal to limits) so they're never evicted.
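A sketch of the "never evicted" setup: a PriorityClass plus requests equal to limits, which puts the pod in the Guaranteed QoS class. Names, the priority value, and the resource numbers are placeholders:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: identity-critical
value: 1000000                   # well above default-priority (0) workloads
description: Identity provider pods; evicting them breaks SSO cluster-wide.
---
# Pod template fragment for the identity provider's Deployment:
spec:
  priorityClassName: identity-critical
  containers:
    - name: authentik                 # or keycloak; illustrative
      image: ghcr.io/goauthentik/server:2024.2   # placeholder tag
      resources:
        requests:                     # requests == limits -> Guaranteed QoS
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "1"
          memory: 2Gi
```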
Message queues and caches
Redis, RabbitMQ, and NATS all run on Kubernetes. The caveat is persistence:
- Redis -- If you're using it as a cache (ephemeral data), run it without persistence. If it restarts, the cache rebuilds. If you're using it as a primary data store (job queues, session storage), enable AOF persistence and use a StatefulSet with stable network identifiers.
- RabbitMQ -- Runs as a StatefulSet with the Kubernetes peer discovery plugin. HA queues require at least 3 nodes. The operator simplifies this significantly.
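For the Redis-as-primary-store case, the relevant pieces sketched together: a StatefulSet for stable network identity, AOF enabled via container args, and a persistent volume per pod. Image tag and sizes are placeholders:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis              # headless Service gives each pod a stable DNS name
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7.2        # tag is illustrative
          args: ["--appendonly", "yes"]   # AOF persistence for durable data
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:           # one PVC per pod, survives restarts
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

For the pure-cache case, drop the `--appendonly` flag and the volumeClaimTemplates entirely and run it as a Deployment.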
What should stay on VMs (for now)
Windows Server workloads
Active Directory Domain Controllers, Windows-based application servers, anything requiring Windows Server. Windows containers exist but they're a different universe -- limited base image support, no Linux container interoperability, and a fraction of the ecosystem tooling.
If you're running AD, DNS, DHCP, SCCM, or Exchange on-premises, these stay on VMs. Containerizing Windows Server workloads is technically possible and practically pointless for most organizations.
Appliance-style software
Commercial software that ships as a VM image or requires a specific OS configuration: firewalls (virtual Palo Alto, Fortinet), network appliances, vendor-specific tools that assume they own the entire OS.
These products weren't designed for containers. They expect static IPs, persistent filesystem layouts, and specific kernel modules. Forcing them into containers adds complexity without benefit.
Workloads requiring GPU passthrough
GPU-accelerated workloads (machine learning training, inference, video transcoding) can run on Kubernetes with the NVIDIA device plugin or AMD ROCm. But the setup is complex:
- GPU drivers on the host
- Device plugin DaemonSet
- Resource requests for nvidia.com/gpu or amd.com/gpu
- Node affinity to schedule GPU pods on GPU-equipped nodes
It works, and at scale it's the right approach. But if you have one GPU node running a few inference workloads, the overhead of Kubernetes GPU scheduling may not be worth it compared to running containers directly on the host with --gpus flags.
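The Kubernetes side of that list, sketched as a pod spec. It assumes the NVIDIA device plugin is already installed and that GPU nodes carry a hypothetical `gpu: "true"` label; the image is a placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  nodeSelector:
    gpu: "true"                   # hypothetical label on GPU-equipped nodes
  containers:
    - name: inference
      image: registry.example.com/inference:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1       # resource exposed by the NVIDIA device plugin
```

The single-node alternative is one line: `docker run --gpus all registry.example.com/inference:latest`.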
Legacy applications with filesystem dependencies
Applications that store state in local files, expect specific directory structures, or use file-based locking mechanisms. These can sometimes be adapted with PersistentVolumes, but if the application assumes it's the only process accessing /opt/app/data, running multiple replicas will cause data corruption.
The migration path for these is usually: containerize as a single-replica StatefulSet first, then refactor the application to externalize state to a database or object store. The first step gets you container packaging benefits (image versioning, consistent deployment). The second step gets you horizontal scaling.
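Step one of that migration path, sketched. The application name, image, and data path are hypothetical; the point is the single replica and the dedicated volume for the directory the app assumes it owns:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: legacy-app
spec:
  serviceName: legacy-app
  replicas: 1                     # must stay 1 until state is externalized
  selector:
    matchLabels:
      app: legacy-app
  template:
    metadata:
      labels:
        app: legacy-app
    spec:
      containers:
        - name: app
          image: registry.example.com/legacy-app:2024.1   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /opt/app/data    # the directory the app expects to own
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
```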
The migration decision framework
For each workload, ask:
- Is the state externalized? (database, object store, or ephemeral) -- If yes, containerize. If no, evaluate the effort to externalize it.
- Does it scale horizontally? -- If yes, Kubernetes adds immediate value (auto-scaling, rolling updates). If no, Kubernetes still works but the benefit is reduced to packaging and scheduling.
- Is there a Kubernetes operator or Helm chart? -- Operators encode operational knowledge. A well-maintained operator makes the containerized version easier to run than the VM version. Without an operator, you're writing the operational logic yourself.
- Does it require kernel-level access? -- Custom kernel modules, specific kernel versions, iptables manipulation, or raw device access. These are possible in Kubernetes (privileged containers, host networking) but they break the isolation model.
- Is the team ready? -- The most overlooked factor. Containerizing a database is pointless if nobody on the team understands StatefulSets, PVC lifecycle, or how to recover from a failed pod.
The honest answer
How much can you containerize? In a modern infrastructure with proper storage (Ceph, cloud block storage) and a mature Kubernetes platform: 70-80% of workloads. The remaining 20-30% is Windows Server, legacy appliances, and applications that need deep OS integration.
The goal isn't 100% containerization. The goal is putting the right workloads on the right platform. Kubernetes for stateless services, scheduled jobs, and operator-managed stateful workloads. VMs for everything that doesn't fit that model.
The worst outcome is forcing a workload into Kubernetes to achieve an arbitrary containerization target, then spending more time managing Kubernetes complexity than the workload itself. If a VM works and the operational burden is acceptable, leave it on a VM. Containerize it when there's a clear operational or scaling benefit.
Pragmatism over purity. Every time.