Drain, SSH reboot, and uncordon worker nodes sequentially with configurable FQDN suffix handling for short Kubernetes node names.
# k8s-safe-node-reboot

Safely drain, reboot, and uncordon Kubernetes worker nodes, one node at a time.

## Problem
Rolling reboots of worker nodes are easy to get wrong: pods left running, SSH targeting the wrong hostname, reboots fired on the wrong cluster, or nodes left cordoned after a failed wait. This script sequences the operations and aborts on failure.

## Requirements
- `kubectl` configured for the target cluster
- SSH access to nodes as the current user (`sudo reboot` must work)
- Bash 4+

## Usage
```bash
chmod +x reboot-nodes.sh

# Dry run against one node
./reboot-nodes.sh --dry-run worker-01

# Explicit suffix for short Kubernetes node names
./reboot-nodes.sh --fqdn-suffix .prod.example.com worker-01 worker-02

# Suffix from context mapping file
./reboot-nodes.sh --fqdn-suffix-file ./fqdn-suffix.conf k8s-worker

# Prefix match (all nodes whose name starts with k8s-worker)
./reboot-nodes.sh --fqdn-suffix .prod.example.com k8s-worker

# All nodes in the cluster
./reboot-nodes.sh --fqdn-suffix .prod.example.com all
```

## FQDN resolution
Kubernetes node names are often short (`worker-01`), while SSH usually needs an FQDN.

Resolution order for the SSH target:

1. If the node name already contains `.`, use it as-is.
2. Otherwise, append the suffix from `--fqdn-suffix`, the `FQDN_SUFFIX` environment variable, or a matching entry in `--fqdn-suffix-file`.
3. If no suffix is set, SSH uses the short node name.
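The resolution order above can be sketched as a small Bash function. This is an illustration, not the script's actual code: `resolve_ssh_target` and `pick_suffix` are hypothetical names, and the precedence among the three suffix sources (flag, then environment variable, then file match) is an assumption read from the order in which they are listed.

```bash
# Sketch only: helper names are hypothetical, not taken from reboot-nodes.sh.

# Pick the first non-empty suffix: flag value, FQDN_SUFFIX value, file match.
# This precedence is an assumption based on the order the sources are listed in.
pick_suffix() {
  local from_flag="${1:-}" from_env="${2:-}" from_file="${3:-}"
  printf '%s\n' "${from_flag:-${from_env:-$from_file}}"
}

# Print the SSH target for a node name ($1) and an optional suffix ($2).
resolve_ssh_target() {
  local node="$1" suffix="${2:-}"
  if [[ "$node" == *.* ]]; then
    printf '%s\n' "$node"               # 1. already contains a dot: use as-is
  elif [[ -n "$suffix" ]]; then
    printf '%s%s\n' "$node" "$suffix"   # 2. short name plus suffix
  else
    printf '%s\n' "$node"               # 3. no suffix set: short name
  fi
}

# Example: no flag given, FQDN_SUFFIX from the environment, fallback from a file match.
suffix="$(pick_suffix "" "${FQDN_SUFFIX:-}" ".dev.example.com")"
resolve_ssh_target worker-01 "$suffix"
```

Keeping the dot check first means a node that is already registered under its full name is never given a second suffix.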
### Suffix mapping file

Copy `fqdn-suffix.example.conf` and edit it for your environments. Each line has the form `pattern|suffix`:

```
production*|.prod.example.com
development*|.dev.example.com
```

Patterns are bash `case` globs matched against the output of `kubectl config current-context`.
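A lookup against such a file might look like the sketch below. The function name `lookup_suffix`, the first-match-wins behavior, and the skipping of blank and `#` comment lines are assumptions for illustration; the script's own parser may differ.

```bash
# Sketch only: lookup_suffix is a hypothetical helper name.
# Print the suffix of the first pattern in file $1 that matches context $2.
# Returns 1 when no pattern matches.
lookup_suffix() {
  local file="$1" context="$2" pattern suffix
  # Split each line on "|"; the "|| [[ -n ... ]]" also handles a last line without a newline.
  while IFS='|' read -r pattern suffix || [[ -n "$pattern" ]]; do
    # Skip blank lines and comments (comment support is an assumption).
    [[ -z "$pattern" || "$pattern" == \#* ]] && continue
    # Leaving $pattern unquoted makes case treat it as a glob.
    case "$context" in
      $pattern) printf '%s\n' "$suffix"; return 0 ;;
    esac
  done < "$file"
  return 1
}

# Assumed usage with the current kubectl context:
#   suffix="$(lookup_suffix ./fqdn-suffix.conf "$(kubectl config current-context)")"
```

With the example file above, a context named `production-eu` would match `production*` and yield `.prod.example.com`.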
## Workflow per node

1. Confirm kubectl context (interactive unless `--dry-run`)
2. Drain (with retries) if the node is schedulable
3. If the node is cordoned and Ready: check remote uptime via SSH and reboot only if uptime exceeds 30 minutes (a recently rebooted node is skipped)
4. Wait for the node's bootID to change (preferred) or for Ready status
5. Uncordon
6. Pause 15 seconds before the next node
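The uptime check in step 3 could be implemented along these lines. This is a sketch under assumptions: `uptime_exceeds_threshold` is a hypothetical helper, and reading `/proc/uptime` over SSH is a guess at how the script obtains the remote uptime. Only the `UPTIME_THRESHOLD_SEC` name and the 30-minute value come from this README.

```bash
# Sketch only: uptime_exceeds_threshold is a hypothetical helper name.
UPTIME_THRESHOLD_SEC=1800   # 30 minutes, matching the documented threshold

# Succeed if the uptime in $1 (the content of /proc/uptime) is above the threshold.
# Returns 1 when uptime is at or below the threshold, 2 when the input cannot be parsed.
uptime_exceeds_threshold() {
  local raw="$1" seconds
  seconds="${raw%% *}"       # keep the first field, e.g. "5400.12"
  seconds="${seconds%%.*}"   # drop the fractional part
  [[ "$seconds" =~ ^[0-9]+$ ]] || return 2
  (( 10#$seconds > UPTIME_THRESHOLD_SEC ))
}

# Assumed usage (the SSH command is an assumption, not the script's documented behavior):
#   if uptime_exceeds_threshold "$(ssh "$target" cat /proc/uptime)"; then
#     ssh "$target" sudo reboot
#   fi
```

Skipping nodes with a low uptime makes a re-run after an aborted pass cheap: nodes that were already rebooted are not rebooted again.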
## Options

| Flag | Description |
|------|-------------|
| `--dry-run` | Print planned actions; no drain, SSH, or uncordon |
| `--fqdn-suffix SUFFIX` | Append suffix to short node names for SSH |
| `--fqdn-suffix-file FILE` | Context glob → suffix mappings |
| `--help` | Show usage |

## Limits
- Processes nodes **sequentially**, not in parallel
- Assumes `kubectl drain` with `--ignore-daemonsets --delete-emptydir-data` is acceptable
- Uptime skip threshold is fixed at 30 minutes (edit `UPTIME_THRESHOLD_SEC` in the script to change it)
- No built-in SSH options (keys, `StrictHostKeyChecking`, jump hosts); configure them in `~/.ssh/config`
- Aborts the entire run if any node fails drain, ready wait, or uncordon
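For reference, a drain call with the flags assumed above, wrapped in a simple retry loop, might look like the following. The `retry` and `drain_node` helpers, the three attempts, and the 10-second delay are illustrative assumptions; the script's actual retry policy is not documented here.

```bash
# Sketch only: retry and drain_node are hypothetical helper names.

# Run a command up to $1 times, sleeping $2 seconds between failed attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    "$@" && return 0
    (( i < attempts )) && sleep "$delay"
  done
  return 1
}

# Drain one node with the flags listed in the limits above.
# Attempt count and delay are assumed values.
drain_node() {
  retry 3 10 kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}
```

A caller that wants the abort-on-failure behavior described above could use `drain_node "$node" || exit 1`.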
## License

MIT. See [LICENSE](LICENSE).