
# k8s-safe-node-reboot

Safely drain, reboot, and uncordon Kubernetes worker nodes, one node at a time.

## Problem

Rolling reboots of worker nodes are easy to get wrong: pods left running, SSH targeting the wrong hostname, reboots fired on the wrong cluster, or nodes left cordoned after a failed wait. This script sequences the operations and aborts on failure.

## Requirements

- `kubectl` configured for the target cluster
- SSH access to nodes as the current user (`sudo reboot` must work)
- Bash 4+

## Usage

```bash
chmod +x reboot-nodes.sh

# Dry run against one node
./reboot-nodes.sh --dry-run worker-01

# Explicit suffix for short Kubernetes node names
./reboot-nodes.sh --fqdn-suffix .prod.example.com worker-01 worker-02

# Suffix from context mapping file
./reboot-nodes.sh --fqdn-suffix-file ./fqdn-suffix.conf k8s-worker

# Prefix match (all nodes whose name starts with k8s-worker)
./reboot-nodes.sh --fqdn-suffix .prod.example.com k8s-worker

# All nodes in the cluster
./reboot-nodes.sh --fqdn-suffix .prod.example.com all
```

## FQDN resolution

Kubernetes node names are often short (`worker-01`). SSH usually needs a FQDN.

Resolution order for the SSH target:

1. If the node name already contains `.`, use it as-is.
2. Otherwise, append the suffix from `--fqdn-suffix`, the `FQDN_SUFFIX` environment variable, or the matching entry in `--fqdn-suffix-file`.
3. If no suffix is set, SSH uses the short node name.
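The resolution steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `resolve_ssh_target` is hypothetical, and the mapping-file lookup is left out so that only the explicit suffix and the `FQDN_SUFFIX` variable are considered.

```bash
# Hypothetical helper; the real script's function names may differ.
# Usage: resolve_ssh_target NODE [SUFFIX]
resolve_ssh_target() {
  local node="$1"
  # An explicit suffix wins; otherwise fall back to FQDN_SUFFIX (may be empty).
  local suffix="${2:-${FQDN_SUFFIX:-}}"

  case "$node" in
    *.*)
      # 1. The name already contains a dot: use it as-is.
      printf '%s\n' "$node"
      ;;
    *)
      # 2. Append the suffix if one is set.
      # 3. With an empty suffix this prints the short name unchanged.
      printf '%s%s\n' "$node" "$suffix"
      ;;
  esac
}
```

For example, `resolve_ssh_target worker-01 .prod.example.com` prints `worker-01.prod.example.com`, while a name that already contains a dot is printed unchanged.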

### Suffix mapping file

Copy `fqdn-suffix.example.conf` and edit for your environments:

```
production*|.prod.example.com
development*|.dev.example.com
```

Each line has the form `PATTERN|SUFFIX`. Patterns are bash `case` globs matched against the output of `kubectl config current-context`.
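A lookup against such a file might look like the sketch below. The function name `lookup_suffix` is hypothetical, and two behaviors are assumptions rather than documented facts: the first matching line wins, and blank lines or lines starting with `#` are skipped.

```bash
# Hypothetical helper: prints the suffix of the first pattern matching CONTEXT.
# Usage: lookup_suffix CONTEXT FILE
lookup_suffix() {
  local context="$1" file="$2" pattern suffix
  # Split each line on '|'; the extra test handles a last line with no newline.
  while IFS='|' read -r pattern suffix || [ -n "$pattern" ]; do
    # Assumption: blank lines and '#' comments are ignored.
    case "$pattern" in
      ''|'#'*) continue ;;
    esac
    # $pattern is deliberately unquoted so it is evaluated as a glob.
    case "$context" in
      $pattern)
        printf '%s\n' "$suffix"
        return 0
        ;;
    esac
  done < "$file"
  # No pattern matched the context.
  return 1
}
```

It could be called as `lookup_suffix "$(kubectl config current-context)" ./fqdn-suffix.conf`; a non-zero exit status means no pattern matched.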

## Workflow per node

1. Confirm kubectl context (interactive unless `--dry-run`)
2. Drain (with retries) if the node is schedulable
3. If cordoned and Ready: check remote uptime via SSH; reboot if uptime exceeds 30 minutes, otherwise skip the reboot
4. Wait for bootID change (preferred) or Ready status
5. Uncordon
6. Pause 15 seconds before the next node
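Steps 3 and 4 above can be approximated with the sketch below. All function names are hypothetical, and the polling parameters (attempt count, interval) are illustrative rather than the script's real values. The sketch reads uptime from `/proc/uptime`, which assumes Linux nodes, and reads the boot ID from the node object's `.status.nodeInfo.bootID` field.

```bash
UPTIME_THRESHOLD_SEC=1800   # 30 minutes, the threshold named in this README

# Prints the remote uptime in whole seconds (assumes Linux nodes).
remote_uptime_sec() {
  ssh "$1" 'cut -d. -f1 /proc/uptime'
}

# Returns 0 (reboot) when uptime exceeds the threshold, 1 (skip) otherwise.
needs_reboot() {
  [ "$1" -gt "$UPTIME_THRESHOLD_SEC" ]
}

# Prints the boot ID that the kubelet reported for the node.
get_boot_id() {
  kubectl get node "$1" -o jsonpath='{.status.nodeInfo.bootID}'
}

# Polls until the boot ID differs from OLD_ID or the attempts run out.
# GETTER is any command that prints the current boot ID.
# Usage: wait_for_boot_id_change OLD_ID ATTEMPTS INTERVAL_SEC GETTER [ARGS...]
wait_for_boot_id_change() {
  local old_id="$1" attempts="$2" interval="$3" current i=0
  shift 3
  while [ "$i" -lt "$attempts" ]; do
    # A failed read is treated as "not changed yet".
    current="$("$@" 2>/dev/null)" || current=""
    if [ -n "$current" ] && [ "$current" != "$old_id" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}
```

A caller would record `old_id="$(get_boot_id "$node")"` before rebooting and then run something like `wait_for_boot_id_change "$old_id" 60 10 get_boot_id "$node"`. Comparing boot IDs avoids the race where a node still reports Ready because the reboot has not yet taken effect.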

## Options

| Flag | Description |
|------|-------------|
| `--dry-run` | Print planned actions; no drain, SSH, or uncordon |
| `--fqdn-suffix SUFFIX` | Append suffix to short node names for SSH |
| `--fqdn-suffix-file FILE` | File mapping kubectl context globs to suffixes |
| `--help` | Show usage |

## Limits

- Processes nodes **sequentially**, not in parallel
- Assumes `kubectl drain` with `--ignore-daemonsets --delete-emptydir-data` is acceptable
- Uptime skip threshold is 30 minutes and has no flag; edit `UPTIME_THRESHOLD_SEC` in the script to change it
- No built-in SSH options (keys, `StrictHostKeyChecking`, jump hosts); configure them in `~/.ssh/config`
- Aborts the entire run if any node fails drain, ready wait, or uncordon
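To illustrate the drain behavior these limits describe, here is a rough sketch of a drain with retries that gives up after a fixed number of attempts. The `retry` and `drain_node` names, the attempt count, and the delay are all hypothetical; only the `kubectl drain` flags are taken from this README.

```bash
# Hypothetical retry wrapper; attempt count and delay are illustrative.
# Usage: retry ATTEMPTS DELAY_SEC COMMAND [ARGS...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while true; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    # Give up once the last allowed attempt has failed.
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Drains one node with the flags listed in the limits above.
drain_node() {
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}
```

A caller that wants abort-on-failure semantics could write `retry 3 10 drain_node "$node" || exit 1`, so that a node which never drains stops the run before any reboot is attempted.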

## License

MIT — see [LICENSE](LICENSE).