---
title: "Etcd data loss"
date: 2022-03-15T13:00:00Z
tags: [kubernetes, ops, etcd, homelab]
draft: false
---

After executing `systemctl reboot`, I noticed that my k8s cluster would not come back up. Digging around in the container logs, I found an error in the etcd container:

```
etcd panic: freepages: failed to get all reachable pages

goroutine 112 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*DB).freepages.func2(0xc42007e720)
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:976 +0xfb
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*DB).freepages
	/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:974 +0x1b7
```

I then found [this GitHub issue in etcd][1] and [a very good comment in another][2], leading me to believe this was caused by powering down my server without gracefully shutting down etcd. This is the price I pay for running a single-node etcd cluster :sad:.

In a perfect world, I would want my k8s homelab setup to:

1. Run a full HA etcd cluster
2. Do rolling shutdowns of the servers for updates
3. Regularly back up `/var/lib/etcd`

In the meantime, all of the Terraform, Longhorn backup and fluxcd work I have done is about to pay off. :grin:

[1]: https://github.com/etcd-io/etcd/issues/10722
[2]: https://github.com/kubernetes/kubernetes/issues/88574#issuecomment-591931659
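For the third wish-list item, a regular backup of `/var/lib/etcd` could look roughly like the sketch below. It is only an illustration, not something I run: the `backup_dir` helper, the `etcd-<timestamp>.tar.gz` naming and the retention count are all assumptions. Note that archiving the data directory while etcd is running can capture an inconsistent database file, so etcd should be stopped first; `etcdctl snapshot save` is the supported way to take a backup from a live member.

```python
import tarfile
import time
from pathlib import Path


def backup_dir(src: Path, dest_dir: Path, keep: int = 7) -> Path:
    """Archive src into dest_dir as a timestamped .tar.gz, then prune old archives.

    Hypothetical helper: the archive naming and the `keep` retention policy
    are illustrative choices, not anything etcd prescribes.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)

    # UTC timestamp in a format that sorts chronologically as plain text.
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime())
    archive = dest_dir / f"etcd-{stamp}.tar.gz"

    # Store the directory under its own name so it restores as e.g. "etcd/...".
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=src.name)

    # Delete everything except the newest `keep` archives.
    archives = sorted(dest_dir.glob("etcd-*.tar.gz"))
    for old in archives[: max(len(archives) - keep, 0)]:
        old.unlink()

    return archive
```

Calling `backup_dir(Path("/var/lib/etcd"), Path("/var/backups/etcd"))` from a cron job or systemd timer would keep the last seven archives; the destination path here is made up, and ideally it would live on a different disk or machine than the etcd node itself.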