# k8s-safe-node-reboot

Safely drain, reboot, and uncordon Kubernetes worker nodes, one node at a time.

## Problem

Rolling reboots of worker nodes are easy to get wrong: pods left running, SSH targeting the wrong hostname, reboots fired on the wrong cluster, or nodes left cordoned after a failed wait. This script sequences the operations and aborts on failure.

## Requirements

- `kubectl` configured for the target cluster
- SSH access to nodes as the current user (`sudo reboot` must work)
- Bash 4+

## Usage

```bash
chmod +x reboot-nodes.sh

# Dry run against one node
./reboot-nodes.sh --dry-run worker-01

# Explicit suffix for short Kubernetes node names
./reboot-nodes.sh --fqdn-suffix .prod.example.com worker-01 worker-02

# Suffix from context mapping file
./reboot-nodes.sh --fqdn-suffix-file ./fqdn-suffix.conf k8s-worker

# Prefix match (all nodes whose name starts with k8s-worker)
./reboot-nodes.sh --fqdn-suffix .prod.example.com k8s-worker

# All nodes in the cluster
./reboot-nodes.sh --fqdn-suffix .prod.example.com all
```

## FQDN resolution

Kubernetes node names are often short (`worker-01`). SSH usually needs a FQDN.

Resolution order for the SSH target:

1. If the node name already contains `.`, use it as-is.
2. Else append `--fqdn-suffix`, or `FQDN_SUFFIX` env, or a match from `--fqdn-suffix-file`.
3. If no suffix is set, SSH uses the short node name.

### Suffix mapping file

Copy `fqdn-suffix.example.conf` and edit for your environments:

```
production*|.prod.example.com
development*|.dev.example.com
```

Patterns are bash `case` globs matched against `kubectl config current-context`.

## Workflow per node

1. Confirm kubectl context (interactive unless `--dry-run`)
2. Drain (with retries) if the node is schedulable
3. If cordoned and Ready: check remote uptime via SSH; reboot if uptime exceeds 30 minutes
4. Wait for bootID change (preferred) or Ready status
5. Uncordon
6. Pause 15 seconds before the next node

## Options

| Flag | Description |
|------|-------------|
| `--dry-run` | Print planned actions; no drain, SSH, or uncordon |
| `--fqdn-suffix SUFFIX` | Append suffix to short node names for SSH |
| `--fqdn-suffix-file FILE` | Context glob → suffix mappings |
| `--help` | Show usage |

## Limits

- Processes nodes **sequentially**, not in parallel
- Assumes `kubectl drain` with `--ignore-daemonsets --delete-emptydir-data` is acceptable
- Uptime skip threshold is fixed at 30 minutes (edit `UPTIME_THRESHOLD_SEC` in the script)
- No built-in SSH options (keys, `StrictHostKeyChecking`, jump hosts); configure `~/.ssh/config`
- Aborts the entire run if any node fails drain, ready wait, or uncordon

## License

MIT. See [LICENSE](LICENSE).
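
## Appendix: suffix resolution sketch

The resolution order described under "FQDN resolution" and the `pattern|suffix` mapping file can be sketched roughly as follows. This is an illustration only, not the code in `reboot-nodes.sh`: the function names `lookup_suffix` and `resolve_ssh_target` are hypothetical, and skipping `#` comment lines in the mapping file is an assumption.

```shell
#!/usr/bin/env bash
# Illustrative sketch; helper names are hypothetical, not taken from reboot-nodes.sh.

# Print the suffix of the first "pattern|suffix" line whose glob matches
# the given kubectl context. Returns 1 if no line matches.
lookup_suffix() {
  local context=$1 file=$2 pattern suffix
  while IFS='|' read -r pattern suffix || [[ -n $pattern ]]; do
    # Skip blank lines and comments (comment support is an assumption).
    [[ -z $pattern || $pattern == \#* ]] && continue
    # An unquoted $pattern is interpreted as a glob by `case`.
    case $context in
      $pattern) printf '%s\n' "$suffix"; return 0 ;;
    esac
  done < "$file"
  return 1
}

# Print the SSH target for a node name.
# $1 = node name, $2 = explicit suffix (falls back to the FQDN_SUFFIX env var).
resolve_ssh_target() {
  local node=$1 suffix=${2:-${FQDN_SUFFIX:-}}
  if [[ $node == *.* || -z $suffix ]]; then
    printf '%s\n' "$node"          # already dotted, or no suffix configured
  else
    printf '%s\n' "$node$suffix"   # short name plus suffix
  fi
}

# Example (not executed here):
#   suffix=$(lookup_suffix "$(kubectl config current-context)" ./fqdn-suffix.conf)
#   ssh "$(resolve_ssh_target worker-01 "$suffix")" uptime
```

Keeping the lookup separate from the resolution step means a suffix from any source (flag, environment, or mapping file) goes through the same "dotted names pass through unchanged" check.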