k8s-safe-node-reboot/README.md
a-real-agent 9bc1ed2eb9 Publish k8s-safe-node-reboot for reuse on catu.dev.
Drain, SSH reboot, and uncordon worker nodes sequentially with
configurable FQDN suffix handling for short Kubernetes node names.
2026-06-11 15:18:02 +03:00

# k8s-safe-node-reboot
Safely drain, reboot, and uncordon Kubernetes worker nodes, one node at a time.
## Problem
Rolling reboots of worker nodes are easy to get wrong: pods left running, SSH targeting the wrong hostname, reboots fired on the wrong cluster, or nodes left cordoned after a failed wait. This script sequences the operations and aborts on failure.
## Requirements
- `kubectl` configured for the target cluster
- SSH access to nodes as the current user (`sudo reboot` must work)
- Bash 4+
## Usage
```bash
chmod +x reboot-nodes.sh
# Dry run against one node
./reboot-nodes.sh --dry-run worker-01
# Explicit suffix for short Kubernetes node names
./reboot-nodes.sh --fqdn-suffix .prod.example.com worker-01 worker-02
# Suffix from context mapping file
./reboot-nodes.sh --fqdn-suffix-file ./fqdn-suffix.conf k8s-worker
# Prefix match (all nodes whose name starts with k8s-worker)
./reboot-nodes.sh --fqdn-suffix .prod.example.com k8s-worker
# All nodes in the cluster
./reboot-nodes.sh --fqdn-suffix .prod.example.com all
```
## FQDN resolution
Kubernetes node names are often short (`worker-01`). SSH usually needs a FQDN.
Resolution order for the SSH target:
1. If the node name already contains `.`, use it as-is.
2. Otherwise, append the suffix from `--fqdn-suffix`, the `FQDN_SUFFIX` environment variable, or a matching entry in `--fqdn-suffix-file`.
3. If no suffix is set, SSH uses the short node name.
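
The three rules above can be sketched as a small bash function. This is a minimal illustration, not the script's actual code: `resolve_ssh_target` is a hypothetical name, and it assumes the suffix has already been chosen from the flag, the environment variable, or the mapping file.

```bash
#!/usr/bin/env bash
# Hypothetical helper (name and argument order are assumptions).
# Usage: resolve_ssh_target NODE [SUFFIX]
resolve_ssh_target() {
  local node="$1" suffix="${2:-}"
  if [[ "$node" == *.* ]]; then
    # Rule 1: a name containing a dot is used as-is.
    printf '%s\n' "$node"
  elif [[ -n "$suffix" ]]; then
    # Rule 2: append the configured suffix to a short name.
    printf '%s\n' "${node}${suffix}"
  else
    # Rule 3: no suffix configured, so SSH gets the short name.
    printf '%s\n' "$node"
  fi
}
```

For example, `resolve_ssh_target worker-01 .prod.example.com` prints `worker-01.prod.example.com`, while a name that already contains a dot is printed unchanged.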
### Suffix mapping file
Copy `fqdn-suffix.example.conf` and edit for your environments:
```
production*|.prod.example.com
development*|.dev.example.com
```
Patterns are bash `case` globs matched against `kubectl config current-context`.
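
A lookup over a file in this `pattern|suffix` format might look like the sketch below. `lookup_suffix` is a hypothetical name, and two behaviors are assumptions rather than documented features: the first matching line wins, and lines starting with `#` are skipped as comments.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the suffix mapped to a kubectl context.
# Usage: lookup_suffix CONTEXT FILE
# Returns 1 when no pattern matches.
lookup_suffix() {
  local context="$1" file="$2" pattern suffix
  # The `|| [[ -n ... ]]` part handles a last line without a trailing newline.
  while IFS='|' read -r pattern suffix || [[ -n "$pattern" ]]; do
    # Skip blank lines and comments (comment support is an assumption).
    [[ -z "$pattern" || "$pattern" == \#* ]] && continue
    # Unquoted $pattern is treated as a glob by `case`.
    case "$context" in
      $pattern)
        printf '%s\n' "$suffix"
        return 0
        ;;
    esac
  done < "$file"
  return 1
}
```

It could be called as `lookup_suffix "$(kubectl config current-context)" ./fqdn-suffix.conf`; with the example file above, a context named `production-eu` would yield `.prod.example.com`.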
## Workflow per node
1. Confirm kubectl context (interactive unless `--dry-run`)
2. Drain (with retries) if the node is schedulable
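3. If the node is cordoned and Ready: check remote uptime via SSH and reboot only if uptime exceeds 30 minutes (a shorter uptime is treated as a recent reboot and the reboot is skipped)
4. Wait for the node's bootID to change (preferred) or for the node to report Ready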
5. Uncordon
6. Pause 15 seconds before the next node
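
Steps 3 and 4 can be sketched roughly as follows. The function names, the 600-second timeout, and the 10-second poll interval are all assumptions for illustration; only `UPTIME_THRESHOLD_SEC` and the 30-minute value come from this README. The sketch also assumes Linux nodes that expose `/proc/uptime`.

```bash
#!/usr/bin/env bash
UPTIME_THRESHOLD_SEC=1800  # 30 minutes

# Succeeds when the given uptime (in seconds) exceeds the threshold.
needs_reboot() {
  local uptime_sec="$1"
  (( uptime_sec > UPTIME_THRESHOLD_SEC ))
}

# Whole seconds of uptime on the remote host (assumes Linux /proc/uptime).
remote_uptime_sec() {
  ssh "$1" 'cut -d. -f1 /proc/uptime'
}

# Boot ID that the kubelet reports for a node.
node_boot_id() {
  kubectl get node "$1" -o jsonpath='{.status.nodeInfo.bootID}'
}

# Poll until the boot ID differs from the pre-reboot value.
# Usage: wait_for_new_boot_id NODE OLD_ID [TIMEOUT_SEC] [INTERVAL_SEC]
wait_for_new_boot_id() {
  local node="$1" old_id="$2" timeout="${3:-600}" interval="${4:-10}"
  local waited=0 id
  while (( waited < timeout )); do
    # The API may be unreachable or report nothing mid-reboot; treat as "not yet".
    id="$(node_boot_id "$node" 2>/dev/null || true)"
    if [[ -n "$id" && "$id" != "$old_id" ]]; then
      return 0
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  return 1
}
```

Comparing boot IDs avoids a race that a plain Ready check has: a node can still report Ready for a short while after the reboot command is issued, before the kubelet stops.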
## Options
| Flag | Description |
|------|-------------|
| `--dry-run` | Print planned actions; no drain, SSH, or uncordon |
| `--fqdn-suffix SUFFIX` | Append suffix to short node names for SSH |
| `--fqdn-suffix-file FILE` | Context glob → suffix mappings |
| `--help` | Show usage |
## Limits
- Processes nodes **sequentially**, not in parallel
- Assumes `kubectl drain` with `--ignore-daemonsets --delete-emptydir-data` is acceptable
- Uptime skip threshold defaults to 30 minutes and has no command-line flag (edit `UPTIME_THRESHOLD_SEC` in the script to change it)
- No built-in SSH options (keys, `StrictHostKeyChecking`, jump hosts); configure them in `~/.ssh/config`
- Aborts the entire run if any node fails drain, ready wait, or uncordon
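
The drain-with-retries behavior from the workflow, combined with the drain flags listed above, could be approximated with a generic retry wrapper. `retry` is a hypothetical helper, and the attempt count and delay in the usage comment are assumed values, not the script's.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry ATTEMPTS DELAY_SEC COMMAND [ARGS...]
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if (( n >= attempts )); then
      return 1  # give up; the caller decides whether to abort the run
    fi
    sleep "$delay"
    n=$(( n + 1 ))
  done
}

# Example (assumed values: 3 attempts, 10 seconds apart):
#   retry 3 10 kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
```

A non-zero return after the last attempt lets the caller stop before any reboot is issued, which matches the abort-on-failure behavior described above.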
## License
MIT — see [LICENSE](LICENSE).