Deploy RKE2 on Bare Metal with Ceph Storage and Cilium
Most Kubernetes deployment guides assume you're clicking buttons in a cloud console. This one doesn't. This guide covers deploying RKE2 on bare metal Ubuntu nodes, wiring Ceph as the persistent storage layer via CSI, and using Cilium as the CNI -- all on hardware you own.
This is based on a production cluster: 3 control plane nodes, 4 workers, 1 auxiliary node, and 1 GPU-capable node. Nine nodes total. The design separates workload traffic from storage traffic using dual bonded interfaces on different subnets.
Why RKE2
RKE2 (also called "RKE Government") is Rancher's Kubernetes distribution focused on security and compliance. It matters for a few reasons:
- CIS hardened by default -- RKE2 ships with CIS Benchmark compliance out of the box. SELinux policies, pod security standards, and audit logging are enabled without extra configuration.
- Containerd, not Docker -- RKE2 uses containerd directly. No Docker dependency, no Docker socket security concerns.
- Air-gap friendly -- All images are bundled. Deployments in disconnected environments work without pulling from public registries.
- etcd built in -- No external etcd cluster to manage. HA etcd runs on control plane nodes automatically.
If you're running Kubernetes in a regulated or security-conscious environment, RKE2 gives you a head start on hardening that other distributions require you to do manually.
Cluster architecture
Node roles and sizing
The cluster separates control plane and worker responsibilities completely. Control plane nodes run lightweight -- they don't need massive compute. Workers handle all application workloads.
Control plane nodes (3x):
- Purpose: API server, etcd, scheduler, controller manager
- CPU: 4 cores is sufficient for clusters under 100 nodes
- RAM: 16 GB handles etcd and API server comfortably
- Storage: 100 GB SSD for OS and etcd data
- Network: Single bond on the workload subnet
Worker nodes (4x):
- Purpose: Application pods, Ceph OSDs
- CPU: Multi-socket Xeon (32+ cores) for pod density
- RAM: 64 GB minimum when running Ceph OSDs alongside workloads
- Storage: OS disk + dedicated Ceph OSD disks (no sharing)
- Network: Dual bonds -- workload subnet and dedicated storage subnet
Why separate control plane from workers? When etcd shares resources with application workloads, a noisy pod can starve etcd of I/O and destabilize the entire cluster. Dedicated control plane nodes eliminate this failure mode. The trade-off is more hardware, but the reliability gain is worth it in production.
Network design
This is where bare metal diverges sharply from cloud Kubernetes. In AWS or GCP, the network "just works." On bare metal, you design it.
The cluster uses two subnets:
- 10.0.30.0/24 -- Workload and management. API server access, pod-to-pod traffic (via Cilium VXLAN overlay), ingress traffic, and node management (SSH).
- 10.0.20.0/24 -- Storage. Ceph OSD replication, recovery traffic, and client I/O. This network exists only on nodes running Ceph OSDs (workers).
Each worker node has two bonded interfaces:
- bond0 on 10.0.30.x -- Workload
- bond1 on 10.0.20.x -- Storage
Control plane nodes have a single bond on 10.0.30.x. They don't need storage network access because they don't run Ceph OSDs.
Why separate storage traffic? Ceph replication generates significant network I/O, especially during recovery events (when an OSD goes down and data rebalances). If that traffic shares the same interface as your pod network, application latency spikes during rebalancing. A dedicated storage network eliminates this entirely.
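On Ubuntu, the dual-bond layout can be expressed with netplan. The sketch below is illustrative: the physical interface names (eno1 through eno4), the LACP bond mode, and the host addresses are assumptions for this example and must match your NICs and switch configuration.

```yaml
# /etc/netplan/01-bonds.yaml -- illustrative sketch, not a drop-in config.
# Interface names, addresses, and bond mode are assumptions.
network:
  version: 2
  ethernets:
    eno1: {}
    eno2: {}
    eno3: {}
    eno4: {}
  bonds:
    bond0:                        # workload / management subnet
      interfaces: [eno1, eno2]
      addresses: [10.0.30.101/24]
      routes:
        - to: default
          via: 10.0.30.1
      parameters:
        mode: 802.3ad             # LACP; requires switch-side support
        mii-monitor-interval: 100
    bond1:                        # dedicated Ceph storage subnet
      interfaces: [eno3, eno4]
      addresses: [10.0.20.101/24] # no default route -- storage only
      parameters:
        mode: 802.3ad
        mii-monitor-interval: 100
```

Apply with netplan apply and verify both bonds come up before installing RKE2. If your switches don't support LACP, active-backup is a safe fallback mode.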
Installing RKE2
Control plane (first node)
On the first control plane node, install RKE2 in server mode:
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE=server sh -
Configure the server before starting. Create /etc/rancher/rke2/config.yaml:
# First control plane node
tls-san:
- 10.0.30.91
- 10.0.30.92
- 10.0.30.93
- k8s.yourdomain.com
# Disable default CNI -- we'll install Cilium separately
cni: none
# Make kubeconfig readable by non-root users (tighten to 0600 for strict CIS compliance)
write-kubeconfig-mode: "0644"
# etcd snapshots for disaster recovery
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 10
Start and enable:
systemctl enable rke2-server --now
Grab the node token for joining additional nodes:
cat /var/lib/rancher/rke2/server/node-token
Additional control plane nodes
On each additional control plane node, install RKE2 and configure it to join the first:
# Additional control plane nodes
server: https://10.0.30.91:9345
token: <node-token-from-first-server>
tls-san:
- 10.0.30.91
- 10.0.30.92
- 10.0.30.93
- k8s.yourdomain.com
cni: none
write-kubeconfig-mode: "0644"
Worker nodes
Workers install the RKE2 agent:
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE=agent sh -
Configure /etc/rancher/rke2/config.yaml:
server: https://10.0.30.91:9345
token: <node-token>
Start and enable:
systemctl enable rke2-agent --now
Repeat for each worker. Within a few minutes, kubectl get nodes should list every node, though they will report NotReady until Cilium is installed.
Cilium as the CNI
RKE2 was configured with cni: none, so nodes will show NotReady until a CNI is installed. Cilium provides eBPF-based networking, network policy enforcement, and observability.
Install via Helm:
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium \
--namespace kube-system \
--set tunnel=vxlan \
--set operator.replicas=2 \
--set ipam.mode=cluster-pool
Note: on Cilium 1.14 and later, the tunnel option is deprecated; use --set routingMode=tunnel --set tunnelProtocol=vxlan instead.
Within a minute, all nodes should transition to Ready. Cilium runs as a DaemonSet -- one pod per node -- and handles pod-to-pod networking, service load balancing, and network policy enforcement.
Why Cilium over Calico or Flannel?
- eBPF -- Packet processing happens in kernel space without iptables chains. At scale, this is measurably faster.
- Network policy -- Cilium supports both Kubernetes NetworkPolicy and its own CiliumNetworkPolicy with L7 awareness (HTTP, gRPC, DNS).
- Hubble -- Built-in observability for network flows. When a pod can't reach a service, Hubble shows you exactly where the traffic is being dropped.
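The L7 awareness is worth a concrete illustration. The policy below restricts ingress to an API workload so that only frontend pods may reach it, and only via GET requests on matching paths. The namespace, labels, port, and path are hypothetical names for this example.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-api    # hypothetical policy name
  namespace: demo             # hypothetical namespace
spec:
  endpointSelector:
    matchLabels:
      app: api                # applies to pods labeled app=api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend     # only frontend pods may connect
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:             # L7 filtering: GET on /v1/* only
              - method: GET
                path: "/v1/.*"
```

A standard Kubernetes NetworkPolicy stops at L4 (ports and peers); the http rules block here is what Cilium adds, and Hubble will show the HTTP-level verdict when traffic is dropped.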
Ceph CSI integration
Ceph provides the persistent storage layer. The assumption here is that you already have a Ceph cluster running (either standalone or via Proxmox). The Kubernetes nodes connect to it via CSI drivers.
Install Ceph CSI
Deploy the Ceph RBD CSI driver for block storage:
helm repo add ceph-csi https://ceph.github.io/csi-charts
helm install ceph-csi-rbd ceph-csi/ceph-csi-rbd \
--namespace ceph-csi \
--create-namespace \
--set csiConfig[0].clusterID=<your-ceph-fsid> \
--set csiConfig[0].monitors[0]=10.0.20.x:6789 \
--set csiConfig[0].monitors[1]=10.0.20.y:6789 \
--set csiConfig[0].monitors[2]=10.0.20.z:6789
Note the monitor addresses are on the storage subnet (10.0.20.x). Ceph client traffic stays on the dedicated storage network.
For CephFS (shared filesystem for ReadWriteMany volumes):
helm install ceph-csi-cephfs ceph-csi/ceph-csi-cephfs \
--namespace ceph-csi-cephfs \
--create-namespace \
--set csiConfig[0].clusterID=<your-ceph-fsid> \
--set csiConfig[0].monitors[0]=10.0.20.x:6789
Storage classes
Create storage classes that map to Ceph pools:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <your-ceph-fsid>
  pool: kubernetes-block
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-csi
reclaimPolicy: Delete
allowVolumeExpansion: true
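With ceph-block as the default class, any pod can request block storage through an ordinary PersistentVolumeClaim. A minimal example (the claim name and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data             # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce           # RBD volumes mount on a single node at a time
  storageClassName: ceph-block
  resources:
    requests:
      storage: 10Gi
```

Applying this triggers the RBD provisioner to create an image in the kubernetes-block pool and bind it to the claim; no manual PV creation is involved.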
For production data you want to keep (databases, GitLab repositories), create a second class with reclaimPolicy: Retain:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-production
provisioner: cephfs.csi.ceph.com
parameters:
  clusterID: <your-ceph-fsid>
  fsName: cephfs
  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-csi-cephfs
reclaimPolicy: Retain
allowVolumeExpansion: true
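Unlike RBD, CephFS volumes can be mounted by pods on multiple nodes simultaneously, which is what ReadWriteMany claims require. A sketch of such a claim against the class above (name and size are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-assets         # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany           # CephFS supports concurrent multi-node mounts
  storageClassName: cephfs-production
  resources:
    requests:
      storage: 50Gi
```

Because the class uses reclaimPolicy: Retain, deleting this claim leaves the underlying CephFS subvolume intact for manual recovery.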
What you end up with
After completing this deployment:
- HA control plane -- 3-node etcd cluster with automatic leader election. Lose any single control plane node and the cluster keeps running.
- Separated networks -- Pod traffic and Ceph replication never compete for bandwidth.
- Dynamic persistent storage -- Any pod can request a PersistentVolumeClaim and get a Ceph RBD volume provisioned automatically.
- CIS-hardened Kubernetes -- RKE2's default configuration passes CIS Benchmark checks without manual intervention.
- eBPF networking -- Cilium provides fast, observable pod networking with kernel-level packet processing.
This is production infrastructure. It runs databases, CI/CD platforms, identity providers, container registries, and web applications -- all on hardware you own, on networks you control, with storage that replicates across physical disks.
The cloud isn't going anywhere. But neither is the need to run infrastructure on metal you own. This is how you do it properly.