Deploy RKE2 on Bare Metal with Ceph Storage and Cilium
Most Kubernetes deployment guides assume you're clicking buttons in a cloud console. This one doesn't. This guide covers deploying RKE2 on bare metal Ubuntu nodes, wiring Ceph as the persistent storage layer via CSI, and using Cilium as the CNI -- all on hardware you own.
This is based on a production cluster: 3 control plane nodes, 4 workers, 1 auxiliary node, and 1 GPU-capable node. Nine nodes total. The design separates workload traffic from storage traffic using dual bonded interfaces on different subnets.
Why RKE2
RKE2 (also called "RKE Government") is Rancher's Kubernetes distribution focused on security and compliance. It matters for a few reasons:
- CIS hardened by default -- RKE2 ships with CIS Benchmark compliance out of the box. SELinux policies, pod security standards, and audit logging are enabled without extra configuration.
- Containerd, not Docker -- RKE2 uses containerd directly. No Docker dependency, no Docker socket security concerns.
- Air-gap friendly -- All images are bundled. Deployments in disconnected environments work without pulling from public registries.
- etcd built in -- No external etcd cluster to manage. HA etcd runs on control plane nodes automatically.
If you're running Kubernetes in a regulated or security-conscious environment, RKE2 gives you a head start on hardening that other distributions require you to do manually.
Cluster architecture
Node roles and sizing
The cluster separates control plane and worker responsibilities completely. Control plane nodes run lightweight -- they don't need massive compute. Workers handle all application workloads.
Control plane nodes (3x):
- Purpose: API server, etcd, scheduler, controller manager
- CPU: 4 cores is sufficient for clusters under 100 nodes
- RAM: 16 GB handles etcd and API server comfortably
- Storage: 100 GB SSD for OS and etcd data
- Network: Single bond on the workload subnet
Worker nodes (4x):
- Purpose: Application pods, Ceph OSDs
- CPU: Multi-socket Xeon (32+ cores) for pod density
- RAM: 64 GB minimum when running Ceph OSDs alongside workloads
- Storage: OS disk + dedicated Ceph OSD disks (no sharing)
- Network: Dual bonds -- workload subnet and dedicated storage subnet
Why separate control plane from workers? When etcd shares resources with application workloads, a noisy pod can starve etcd of I/O and destabilize the entire cluster. Dedicated control plane nodes eliminate this failure mode. The trade-off is more hardware, but the reliability gain is worth it in production.
Network design
This is where bare metal diverges sharply from cloud Kubernetes. In AWS or GCP, the network "just works." On bare metal, you design it.
The cluster uses two subnets:
- 10.0.30.0/24 -- Workload and management. API server access, pod-to-pod traffic (via Cilium VXLAN overlay), ingress traffic, and node management (SSH).
- 10.0.20.0/24 -- Storage. Ceph OSD replication, recovery traffic, and client I/O. This network exists only on nodes running Ceph OSDs (workers).
Each worker node has two bonded interfaces:
- bond0 on 10.0.30.x -- Workload
- bond1 on 10.0.20.x -- Storage
Control plane nodes have a single bond on 10.0.30.x. They don't need storage network access because they don't run Ceph OSDs.
Why separate storage traffic? Ceph replication generates significant network I/O, especially during recovery events (when an OSD goes down and data rebalances). If that traffic shares the same interface as your pod network, application latency spikes during rebalancing. A dedicated storage network eliminates this entirely.
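On Ubuntu, the dual-bond layout can be expressed with netplan. The sketch below is illustrative: the physical interface names (eno1 through eno4), the LACP bond mode, and the host addresses are assumptions for this example and must match your NICs and switch configuration.

```yaml
# /etc/netplan/01-bonds.yaml -- illustrative sketch, not a drop-in config.
# Interface names, addresses, and bond mode are assumptions.
network:
  version: 2
  ethernets:
    eno1: {}
    eno2: {}
    eno3: {}
    eno4: {}
  bonds:
    bond0:                        # workload / management subnet
      interfaces: [eno1, eno2]
      addresses: [10.0.30.101/24]
      routes:
        - to: default
          via: 10.0.30.1
      parameters:
        mode: 802.3ad             # LACP; requires switch-side support
        mii-monitor-interval: 100
    bond1:                        # dedicated Ceph storage subnet
      interfaces: [eno3, eno4]
      addresses: [10.0.20.101/24] # no default route -- storage only
      parameters:
        mode: 802.3ad
        mii-monitor-interval: 100
```

Apply with netplan apply and verify both bonds come up before installing RKE2. If your switches don't support LACP, active-backup is a safe fallback mode.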
Installing RKE2
Control plane (first node)
On the first control plane node, install RKE2 in server mode:
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE=server sh -
Configure the server before starting. Create /etc/rancher/rke2/config.yaml:
# First control plane node
tls-san:
- 10.0.30.91
- 10.0.30.92
- 10.0.30.93
- k8s.yourdomain.com
# Disable default CNI -- we'll install Cilium separately
cni: none
# Make kubeconfig readable by non-root users (tighten to 0600 for strict CIS compliance)
write-kubeconfig-mode: "0644"
# etcd snapshots for disaster recovery
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 10
Start and enable:
systemctl enable rke2-server --now
Grab the node token for joining additional nodes:
cat /var/lib/rancher/rke2/server/node-token
Additional control plane nodes
On each additional control plane node, install RKE2 and configure it to join the first:
# Additional control plane nodes
server: https://10.0.30.91:9345
token: <node-token-from-first-server>
tls-san:
- 10.0.30.91
- 10.0.30.92
- 10.0.30.93
- k8s.yourdomain.com
cni: none
write-kubeconfig-mode: "0644"
Worker nodes
Workers install the RKE2 agent:
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE=agent sh -
Configure /etc/rancher/rke2/config.yaml:
server: https://10.0.30.91:9345
token: <node-token>
Start and enable:
systemctl enable rke2-agent --now
Repeat for each worker. Within a few minutes, kubectl get nodes should list every node, though they will report NotReady until Cilium is installed.
Cilium as the CNI
RKE2 was configured with cni: none, so nodes will show NotReady until a CNI is installed. Cilium provides eBPF-based networking, network policy enforcement, and observability.
Install via Helm:
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium \
--namespace kube-system \
--set tunnel=vxlan \
--set operator.replicas=2 \
--set ipam.mode=cluster-pool
Note: on Cilium 1.14 and later, the tunnel option is deprecated; use --set routingMode=tunnel --set tunnelProtocol=vxlan instead.
Within a minute, all nodes should transition to Ready. Cilium runs as a DaemonSet -- one pod per node -- and handles pod-to-pod networking, service load balancing, and network policy enforcement.
Why Cilium over Calico or Flannel?
- eBPF -- Packet processing happens in kernel space without iptables chains. At scale, this is measurably faster.
- Network policy -- Cilium supports both Kubernetes NetworkPolicy and its own CiliumNetworkPolicy with L7 awareness (HTTP, gRPC, DNS).
- Hubble -- Built-in observability for network flows. When a pod can't reach a service, Hubble shows you exactly where the traffic is being dropped.
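The L7 awareness is worth a concrete illustration. The policy below restricts ingress to an API workload so that only frontend pods may reach it, and only via GET requests on matching paths. The namespace, labels, port, and path are hypothetical names for this example.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-api    # hypothetical policy name
  namespace: demo             # hypothetical namespace
spec:
  endpointSelector:
    matchLabels:
      app: api                # applies to pods labeled app=api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend     # only frontend pods may connect
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:             # L7 filtering: GET on /v1/* only
              - method: GET
                path: "/v1/.*"
```

A standard Kubernetes NetworkPolicy stops at L4 (ports and peers); the http rules block here is what Cilium adds, and Hubble will show the HTTP-level verdict when traffic is dropped.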
Ceph CSI integration
Ceph provides the persistent storage layer. The assumption here is that you already have a Ceph cluster running (either standalone or via Proxmox). The Kubernetes nodes connect to it via CSI drivers.
Install Ceph CSI
Deploy the Ceph RBD CSI driver for block storage:
helm repo add ceph-csi https://ceph.github.io/csi-charts
helm install ceph-csi-rbd ceph-csi/ceph-csi-rbd \
--namespace ceph-csi \
--create-namespace \
--set csiConfig[0].clusterID=<your-ceph-fsid> \
--set csiConfig[0].monitors[0]=10.0.20.x:6789 \
--set csiConfig[0].monitors[1]=10.0.20.y:6789 \
--set csiConfig[0].monitors[2]=10.0.20.z:6789
Note the monitor addresses are on the storage subnet (10.0.20.x). Ceph client traffic stays on the dedicated storage network.
For CephFS (shared filesystem for ReadWriteMany volumes):
helm install ceph-csi-cephfs ceph-csi/ceph-csi-cephfs \
--namespace ceph-csi-cephfs \
--create-namespace \
--set csiConfig[0].clusterID=<your-ceph-fsid> \
--set csiConfig[0].monitors[0]=10.0.20.x:6789
Storage classes
Create storage classes that map to Ceph pools:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <your-ceph-fsid>
  pool: kubernetes-block
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi
reclaimPolicy: Delete
allowVolumeExpansion: true
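With ceph-block as the default class, any pod can request block storage through an ordinary PersistentVolumeClaim. A minimal example (the claim name and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data             # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce           # RBD volumes mount on a single node at a time
  storageClassName: ceph-block
  resources:
    requests:
      storage: 10Gi
```

Applying this triggers the RBD provisioner to create an image in the kubernetes-block pool and bind it to the claim; no manual PV creation is involved.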
For production data you want to keep (databases, GitLab repositories), create a second class with reclaimPolicy: Retain:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-production
provisioner: cephfs.csi.ceph.com
parameters:
  clusterID: <your-ceph-fsid>
  fsName: cephfs
  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi-cephfs
reclaimPolicy: Retain
allowVolumeExpansion: true
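Unlike RBD, CephFS volumes can be mounted by pods on multiple nodes simultaneously, which is what ReadWriteMany claims require. A sketch of such a claim against the class above (name and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-assets         # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany           # CephFS supports concurrent multi-node mounts
  storageClassName: cephfs-production
  resources:
    requests:
      storage: 50Gi
```

Because the class uses reclaimPolicy: Retain, deleting this claim leaves the underlying CephFS subvolume intact for manual recovery.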
What you end up with
After completing this deployment:
- HA control plane -- 3-node etcd cluster with automatic leader election. Lose any single control plane node and the cluster keeps running.
- Separated networks -- Pod traffic and Ceph replication never compete for bandwidth.
- Dynamic persistent storage -- Any pod can request a PersistentVolumeClaim and get a Ceph RBD volume provisioned automatically.
- CIS-hardened Kubernetes -- RKE2's default configuration passes CIS Benchmark checks without manual intervention.
- eBPF networking -- Cilium provides fast, observable pod networking with kernel-level packet processing.
This is production infrastructure. It runs databases, CI/CD platforms, identity providers, container registries, and web applications -- all on hardware you own, on networks you control, with storage that replicates across physical disks.
The cloud isn't going anywhere. But neither is the need to run infrastructure on metal you own. This is how you do it properly.