home-ops/docs/replace-os-disk-with-ceph-mons.md
2026-01-21 21:06:45 +01:00

Talos Node Drive Replacement Checklist (with Rook/Ceph)

Before Starting

  • Identify which nodes are running mons:
    kubectl rook-ceph ceph mon dump
    
  • Verify cluster is healthy:
    kubectl rook-ceph ceph status
    
  • Set noout so that OSDs on a rebooting node are not marked out, which would trigger rebalancing:
    kubectl rook-ceph ceph osd set noout
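
The health check above can be turned into a hard gate before noout is set. This is a minimal sketch, assuming that `kubectl rook-ceph ceph health` prints a line starting with HEALTH_OK, HEALTH_WARN, or HEALTH_ERR; the helper name `health_is_ok` is hypothetical and not part of Rook or Ceph.

```shell
# health_is_ok HEALTH_LINE
# Hypothetical helper: succeeds only when the given health line starts
# with HEALTH_OK. It inspects text passed as an argument, so it can be
# tried out without a cluster.
health_is_ok() {
    case "$1" in
        HEALTH_OK*) return 0 ;;
        *)          return 1 ;;
    esac
}
```

One possible use: `health_is_ok "$(kubectl rook-ceph ceph health)" && kubectl rook-ceph ceph osd set noout`. Once noout is set the cluster reports HEALTH_WARN, so this gate only makes sense before the flag goes on.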
    

Per-Node Procedure

Repeat for each node, one at a time.

1. Pre-replacement

  • If node has a mon, remove it first:
    kubectl rook-ceph ceph mon remove <mon-id>
    
  • Verify that the remaining mons still form a quorum (2 of the original 3 must be healthy):
    kubectl rook-ceph ceph status
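
The 2-of-3 rule follows from Ceph monitors needing a strict majority. The arithmetic can be sketched as below; both function names are hypothetical, and TOTAL is assumed to be the configured mon count (3 in this checklist).

```shell
# mon_majority TOTAL
# Prints the smallest number of mons that forms a strict majority.
mon_majority() {
    echo $(( $1 / 2 + 1 ))
}

# safe_to_take_one_down TOTAL IN_QUORUM
# Succeeds if a majority of TOTAL would still be in quorum after one of
# the IN_QUORUM mons goes away.
safe_to_take_one_down() {
    [ $(( $2 - 1 )) -ge "$(mon_majority "$1")" ]
}
```

With 3 mons the majority is 2, so `safe_to_take_one_down 3 3` succeeds while `safe_to_take_one_down 3 2` fails. That is why every node must be back with all 3 mons in quorum before the next one is touched.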
    

2. Replace drive

  • Physically replace the drive
  • Reapply Talos config to the node
  • Wait for node to boot and rejoin the cluster
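
The config reapply step could look roughly like the sketch below. It assumes the node comes up in Talos maintenance mode after the swap (for example from installer media), which is what makes `--insecure` necessary. The wrapper name, the `DRY_RUN` switch, and the argument order are hypothetical; the node IP and machine config file are whatever your setup uses.

```shell
# apply_talos_config NODE_IP CONFIG_FILE
# Hypothetical wrapper around `talosctl apply-config`.
# --insecure is an assumption: it is only needed while the node is in
# maintenance mode and has no machine config yet.
# Set DRY_RUN=1 to print the command instead of running it.
apply_talos_config() {
    node_ip=$1
    config_file=$2
    set -- talosctl apply-config --insecure --nodes "$node_ip" --file "$config_file"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Running it once with `DRY_RUN=1` shows the exact command before anything is sent to the node.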

3. Post-replacement

  • Verify node is Ready:
    kubectl get nodes
    
  • Wait for OSD to rejoin (if applicable):
    kubectl rook-ceph ceph osd tree
    
  • Wait for mon to be redeployed (if applicable):
    kubectl rook-ceph ceph status
    
  • Confirm 3 mons in quorum before proceeding to next node
  • Wait 10-15 minutes for full stabilization
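
The waiting steps above can be scripted with a small polling helper instead of re-running commands by hand. This is a sketch only; `wait_until` is a hypothetical name, and the timeout and interval are in seconds.

```shell
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND every INTERVAL seconds until it
# succeeds. Gives up with status 1 once about TIMEOUT seconds of waiting
# have accumulated (time spent inside COMMAND itself is not counted).
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep "$interval"
        elapsed=$(( elapsed + interval ))
    done
    return 0
}
```

For example, with `NODE` set to the node's name, `wait_until 900 15 kubectl wait --for=condition=Ready "node/$NODE" --timeout=10s` polls for up to roughly 15 minutes until the node reports Ready.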

After All Nodes Complete

  • Unset noout:
    kubectl rook-ceph ceph osd unset noout
    
  • Final health check:
    kubectl rook-ceph ceph status
    
  • Verify all OSDs are up:
    kubectl rook-ceph ceph osd tree
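
Scanning the tree by eye works for a handful of OSDs. A rough way to count down OSDs instead is sketched here, assuming the usual `ceph osd tree` layout where each OSD row contains its name (`osd.N`) and a status column reading `up` or `down`. The function name is hypothetical.

```shell
# count_down_osds
# Hypothetical helper: reads `ceph osd tree` output on stdin and prints
# how many osd.N rows carry the status "down".
count_down_osds() {
    awk '/osd\.[0-9]+/ {
             for (i = 1; i <= NF; i++)
                 if ($i == "down") { n++; break }
         }
         END { print n + 0 }'
}
```

Usage could be `kubectl rook-ceph ceph osd tree | count_down_osds`; anything other than 0 means at least one OSD has not rejoined.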
    

Important Reminders

  • Always maintain quorum: 2 out of 3 mons must be up
  • Do not rush: wait for each node to fully stabilize before moving on
  • Start with the non-mon node if you want an easy warmup
  • OSD data lives on a separate disk, so the OSD rejoins automatically
  • Mon data lives on the boot drive, so the mon must be removed before the swap and redeployed afterwards