title: Etcd data loss
date: 2022-03-15T13:00:00Z
tags: kubernetes, ops, etcd, homelab
draft: false

After executing systemctl reboot I noticed that my k8s cluster would not come back up.

Digging around in the container logs I found an error in the etcd container:

etcd panic: freepages: failed to get all reachable pages

goroutine 112 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*DB).freepages.func2(0xc42007e720)
/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:976 +0xfb
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*DB).freepages
/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/db.go:974 +0x1b7

I then found this GitHub issue in the etcd repository and a very good comment in another, which led me to believe the corruption was caused by powering down my server without gracefully shutting down etcd.

This is the price I pay for running a single-node etcd cluster 😞.

In a perfect world, I would want my k8s homelab setup to:

  1. Run a full HA etcd cluster
  2. Do rolling shutdowns of the servers for updates
  3. Regularly back up /var/lib/etcd
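
For the backup item, one common approach is to have etcd write a consistent snapshot with `etcdctl snapshot save` rather than copying the files under /var/lib/etcd while etcd is running. Below is a minimal sketch of a script that a cron job or systemd timer could call. The endpoint, the certificate paths (kubeadm's default layout) and the helper names `snapshot_command` and `take_snapshot` are all assumptions for illustration, not part of my actual setup.

```python
import os
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def snapshot_command(dest_dir, now=None,
                     endpoint="https://127.0.0.1:2379",   # assumed endpoint
                     pki_dir="/etc/kubernetes/pki/etcd"):  # assumed kubeadm layout
    """Build the etcdctl command line and the timestamped snapshot path."""
    now = now or datetime.now(timezone.utc)
    target = Path(dest_dir) / f"etcd-{now:%Y%m%dT%H%M%SZ}.db"
    cmd = [
        "etcdctl",
        f"--endpoints={endpoint}",
        f"--cacert={pki_dir}/ca.crt",
        f"--cert={pki_dir}/server.crt",
        f"--key={pki_dir}/server.key",
        "snapshot", "save", str(target),
    ]
    return cmd, target


def take_snapshot(dest_dir):
    """Run etcdctl and return the path of the snapshot it wrote."""
    cmd, target = snapshot_command(dest_dir)
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    # etcdctl 3.3 defaults to the v2 API; snapshot is a v3 command.
    env = dict(os.environ, ETCDCTL_API="3")
    subprocess.run(cmd, check=True, env=env)
    return target
```

Calling `take_snapshot("/var/backups/etcd")` (a hypothetical destination) on the etcd host would leave one timestamped `.db` file per run. Those files still need to be copied off the node, otherwise a dead disk takes the backups with it.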

In the meantime, all of the Terraform, Longhorn backup, and FluxCD work I have done is about to pay off. 😁