Elgato_dark/home-ops

mirror of https://github.com/ahinko/home-ops synced 2026-04-21 13:37:24 +00:00

Peter Ahinko 93970b3b10

feat: add some minor docs

2026-01-21 21:06:45 +01:00

Talos Node Drive Replacement Checklist (with Rook/Ceph)

Before Starting

Identify which nodes are running mons:
```
kubectl rook-ceph ceph mon dump
```
Verify cluster is healthy:
```
kubectl rook-ceph ceph status
```
Set noout so Ceph does not mark stopped OSDs out and start rebalancing while a node is offline:
```
kubectl rook-ceph ceph osd set noout
```
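The health check and the noout step can be chained so the flag is only set on a healthy cluster. This is a minimal sketch, assuming `ceph health` prints a line starting with `HEALTH_OK`; the helper names `ceph_cmd` and `set_noout_if_healthy` are hypothetical and not part of Rook or Ceph.

```shell
# Thin wrapper around the Rook kubectl plugin (assumes the plugin is installed).
ceph_cmd() {
  kubectl rook-ceph ceph "$@"
}

# Hypothetical helper: set noout only when the cluster reports HEALTH_OK.
set_noout_if_healthy() {
  health=$(ceph_cmd health) || return 1
  # Match on any output line, in case the plugin prints extra log lines.
  if printf '%s\n' "$health" | grep -q '^HEALTH_OK'; then
    ceph_cmd osd set noout
  else
    echo "cluster is not healthy, not setting noout: $health" >&2
    return 1
  fi
}
```

Calling `set_noout_if_healthy` before touching the first node stops the procedure early if the cluster is already degraded.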

Per-Node Procedure

Repeat for each node, one at a time.

1. Pre-replacement

If node has a mon, remove it first:
```
kubectl rook-ceph ceph mon remove <mon-id>
```

Verify quorum (at least 2 of the 3 mons must remain healthy):
```
kubectl rook-ceph ceph status
```
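Instead of reading the status output by eye, the quorum size can be extracted from `ceph mon stat`. The sketch below assumes that command's output ends with text such as `quorum 0,1 a,b` (the exact format may differ between Ceph releases); `ceph_cmd` and `mons_in_quorum` are hypothetical helper names.

```shell
# Thin wrapper around the Rook kubectl plugin (assumes the plugin is installed).
ceph_cmd() {
  kubectl rook-ceph ceph "$@"
}

# Hypothetical helper: print how many mons are currently in quorum.
# Assumes `ceph mon stat` output contains "quorum <rank,rank,...> <names>".
mons_in_quorum() {
  ranks=$(ceph_cmd mon stat | sed -n 's/.*quorum \([0-9][0-9,]*\) .*/\1/p')
  if [ -z "$ranks" ]; then
    echo 0
    return 0
  fi
  # One line per comma-separated rank, then count the lines.
  printf '%s\n' "$ranks" | tr ',' '\n' | wc -l | tr -d ' '
}
```

After removing one mon from a three-mon cluster, `[ "$(mons_in_quorum)" -ge 2 ]` should succeed before the drive is pulled.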

2. Replace drive

Physically replace the drive
Reapply Talos config to the node
Wait for node to boot and rejoin the cluster
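How the Talos config is reapplied depends on how the machine configs are generated and stored, which this checklist does not specify. As one possible sketch, assuming the node boots into maintenance mode after the drive swap and that a rendered machine config file is available locally, `talosctl apply-config` can push it. The helper name `reapply_talos_config` and both arguments are hypothetical.

```shell
# Hypothetical helper: push a machine config to a node waiting in maintenance mode.
#   $1 = node IP address
#   $2 = path to that node's rendered machine config file
reapply_talos_config() {
  node_ip=$1
  config_file=$2
  # --insecure targets the maintenance API, which a node without a config exposes.
  talosctl apply-config --insecure --nodes "$node_ip" --file "$config_file"
}
```

A call would look like `reapply_talos_config <node-ip> <path-to-config>`, with both values taken from your own setup.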

3. Post-replacement

Verify node is Ready:
```
kubectl get nodes
```
Wait for OSD to rejoin (if applicable):
```
kubectl rook-ceph ceph osd tree
```
Wait for mon to be redeployed (if applicable):
```
kubectl rook-ceph ceph status
```
Confirm 3 mons in quorum before proceeding to next node
Wait 10-15 minutes for full stabilization
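The waiting steps above can be scripted with a small retry loop rather than re-running commands by hand. This is a sketch only: `wait_for` and `node_is_ready` are hypothetical helpers, and the attempt count and delay are values to tune for your cluster.

```shell
# Retry a command until it succeeds.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Succeeds when the named node reports its Ready condition as True.
node_is_ready() {
  ready=$(kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [ "$ready" = "True" ]
}
```

For example, `wait_for 60 15 node_is_ready <node-name>` polls every 15 seconds for up to 15 minutes and returns non-zero if the node never becomes Ready.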

After All Nodes Complete

Unset noout:
```
kubectl rook-ceph ceph osd unset noout
```
Final health check:
```
kubectl rook-ceph ceph status
```
Verify all OSDs are up:
```
kubectl rook-ceph ceph osd tree
```
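The final OSD check can also be reduced to a pass/fail result. The sketch below assumes `ceph osd stat` prints a line of the form `3 osds: 3 up (since 2h), 3 in (since 2h); epoch: e45`, which may vary between Ceph releases; `ceph_cmd` and `all_osds_up` are hypothetical helper names.

```shell
# Thin wrapper around the Rook kubectl plugin (assumes the plugin is installed).
ceph_cmd() {
  kubectl rook-ceph ceph "$@"
}

# Hypothetical helper: succeeds only when every known OSD is up.
# Assumes `ceph osd stat` prints "<total> osds: <up> up ...".
all_osds_up() {
  counts=$(ceph_cmd osd stat |
    sed -n 's/^ *\([0-9][0-9]*\) osds: \([0-9][0-9]*\) up.*/\1 \2/p')
  # Split "<total> <up>" into positional parameters local to this function.
  set -- $counts
  [ "$#" -eq 2 ] && [ "$1" -gt 0 ] && [ "$1" -eq "$2" ]
}
```

If `all_osds_up` fails, `kubectl rook-ceph ceph osd tree` shows which OSD is still down.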

Important Reminders

Always maintain quorum: 2 out of 3 mons must be up
Do not rush — wait for each node to fully stabilize
Start with the non-mon node if you want an easy warmup
OSD data lives on the separate disk, so the OSD rejoins automatically
Mon data lives on the boot drive, so the mon must be removed and redeployed