it's been a bit over a week since the coral castle data loss disaster, so imma give some recommendations to other server admins on how to not fuck up like i did.

1. if you're upgrading your server, test the hardware before trusting it with data. this whole mess was caused by a single bad ram stick that i never tested after installing it.
2. allow postgres to safely crash and stay down if it detects corruption. openrc automatically restarted the service, which created a loop where postgres kept coming back up and writing corrupted pages to storage. (there's a watchdog sketch after this list.)
3. consider pg_basebackup plus wal archiving. take a full-cluster base backup at longer intervals (weekly, monthly), and let the archived wal take care of the smaller changes over shorter periods (hourly). (sketch below.)
4. check whether your backups actually work. when i finally checked one of mine, two issues turned up: first, i was backing up the wrong database (the coral castle neo one), and second, the backups used the wrong encoding (ASCII instead of UTF-8). (verification sketch below.)
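
to make point 2 a bit more concrete, here's a rough watchdog sketch in python: it scans the postgres log for corruption messages and stops the service so nothing keeps respawning it. the log path, the service name and the exact error strings are assumptions about my setup, adjust them for yours (or just cap/disable whatever auto-restart mechanism you're using).

```python
#!/usr/bin/env python3
"""Stop postgres if its log shows signs of page corruption.

Meant to run from cron every minute or so. The log path, the OpenRC
service name and the error strings below are assumptions -- adjust
them for your own setup.
"""
import subprocess
from pathlib import Path

LOG_FILE = Path("/var/lib/postgresql/16/data/log/postgresql.log")  # assumed location
SERVICE = "postgresql-16"                                          # assumed service name

# Messages postgres tends to emit when it hits damaged pages or bad checksums.
CORRUPTION_MARKERS = (
    "invalid page in block",
    "page verification failed",
    "could not read block",
)

def looks_corrupted(log_path: Path) -> bool:
    """Return True if any corruption marker appears near the end of the log."""
    if not log_path.exists():
        return False
    size = log_path.stat().st_size
    with log_path.open("rb") as fh:
        # Only scan the last ~64 KiB so the check stays cheap.
        fh.seek(max(0, size - 64 * 1024))
        tail = fh.read().decode("utf-8", errors="replace")
    return any(marker in tail for marker in CORRUPTION_MARKERS)

if __name__ == "__main__":
    if looks_corrupted(LOG_FILE):
        # Stop the service so the init system can't keep respawning it and
        # flushing damaged pages back to disk. A human decides what happens next.
        subprocess.run(["rc-service", SERVICE, "stop"], check=False)
        print("corruption markers found; postgres stopped, investigate before restarting")
```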
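
for point 3, something like this run from a weekly cron entry covers the full-cluster part; the directories, the role name and the schedule are placeholders. the wal side is just postgresql.conf: archive_mode = on plus an archive_command that copies each finished segment somewhere safe.

```python
#!/usr/bin/env python3
"""Take a full-cluster base backup with pg_basebackup.

Run weekly (or monthly); WAL archiving covers the hours in between.
The backup directory and the role below are placeholders.
"""
import subprocess
from datetime import date
from pathlib import Path

BACKUP_ROOT = Path("/srv/backups/postgres")  # placeholder
PG_USER = "backup_role"                      # placeholder role with REPLICATION

def take_base_backup() -> Path:
    """Write a compressed tar-format base backup into a dated directory."""
    target = BACKUP_ROOT / date.today().isoformat()
    target.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "pg_basebackup",
            "-D", str(target),    # output directory
            "-Ft",                # tar format
            "-z",                 # gzip the tars
            "-X", "stream",       # include the WAL needed to make it consistent
            "-U", PG_USER,
            "--checkpoint=fast",
        ],
        check=True,
    )
    return target

if __name__ == "__main__":
    print(f"base backup written to {take_base_backup()}")
```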
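
and for point 4, the only check that really counts is an actual restore. a rough sketch, assuming a custom-format dump (pg_dump -Fc) and a throwaway local database; the expected database name and the spot-check table are placeholders for whatever matters on your instance.

```python
#!/usr/bin/env python3
"""Sanity-check a custom-format pg_dump backup by actually restoring it.

Assumes the dump was made with `pg_dump -Fc` and that you can create and
drop a throwaway database locally. The expected database name and the
spot-check table are placeholders.
"""
import subprocess
import sys

SCRATCH_DB = "backup_verify_scratch"  # throwaway restore target
EXPECTED_DB = "coral_castle"          # placeholder: the db you meant to dump
SPOT_CHECK_TABLE = "users"            # placeholder: a table that must exist

def run(cmd: list[str]) -> str:
    """Run a command, fail loudly, return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def verify(dump_path: str) -> None:
    # 1. The archive header records which database it came from --
    #    catches "oops, i dumped the wrong database".
    listing = run(["pg_restore", "--list", dump_path])
    if f"dbname: {EXPECTED_DB}" not in listing:
        raise RuntimeError("dump does not come from the expected database")

    # 2. The SQL that pg_restore emits records the dump's client encoding --
    #    catches the ASCII-instead-of-UTF-8 problem.
    preamble = run(["pg_restore", "--schema-only", dump_path])
    if "SET client_encoding = 'UTF8';" not in preamble:
        raise RuntimeError("dump is not UTF-8 encoded")

    # 3. A real restore into a scratch database, plus a row-count spot check.
    run(["dropdb", "--if-exists", SCRATCH_DB])
    run(["createdb", SCRATCH_DB])
    try:
        run(["pg_restore", "--no-owner", "-d", SCRATCH_DB, dump_path])
        count = run(["psql", "-At", "-d", SCRATCH_DB,
                     "-c", f"SELECT count(*) FROM {SPOT_CHECK_TABLE};"]).strip()
        print(f"restore ok, {count} rows in {SPOT_CHECK_TABLE}")
    finally:
        run(["dropdb", SCRATCH_DB])

if __name__ == "__main__":
    verify(sys.argv[1])
```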

that's all everyone. keep your servers safe.
im also realizing that returning to this checkpoint might not have been the best choice, and it has even had some negative effects on some people. so i'd like to ask the community again: do we reset, or stay with the current state of things? the instance would keep the same domain, and might actually federate better, since some follows aren't accounted for in this previous snapshot. if we reset, i'll give everyone a time period to export and archive their data.
why, you may ask, would it federate better? because we can trigger user deletes for accounts created between the checkpoint and the corruption (excluding certain usernames), and then spin up a new instance in the same place with the same instance keys.
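
to be clear about what "user deletes" would mean in practice: something like this could build the list of local accounts registered between the checkpoint and the corruption, minus a keep-list, to feed into the instance's own admin delete tooling. it assumes you can still read an account table from the corrupted cluster (or a later partial dump), and the table/column names (users.nickname, users.inserted_at) are just pleroma/akkoma-ish guesses, not gospel. the timestamps and usernames below are placeholders too.

```python
#!/usr/bin/env python3
"""Build the list of accounts to send deletes for.

A sketch only: assumes a readable `users` table with `nickname` and
`inserted_at` columns. Timestamps, the keep-list and the dbname are
placeholders.
"""
import psycopg2

CHECKPOINT_AT = "2024-11-01 00:00:00+00"    # placeholder: when the restored snapshot was taken
CORRUPTION_AT = "2024-11-20 00:00:00+00"    # placeholder: when the corruption started
KEEP = ["admin", "someone_we_are_keeping"]  # placeholder usernames to exclude

QUERY = """
    SELECT nickname
      FROM users
     WHERE inserted_at >= %s
       AND inserted_at <  %s
       AND nickname != ALL(%s)
     ORDER BY inserted_at;
"""

def accounts_to_delete() -> list[str]:
    """Return usernames registered between the checkpoint and the corruption."""
    with psycopg2.connect("dbname=coral_castle") as conn, conn.cursor() as cur:  # placeholder dbname
        cur.execute(QUERY, (CHECKPOINT_AT, CORRUPTION_AT, KEEP))
        return [row[0] for row in cur.fetchall()]

if __name__ == "__main__":
    for nickname in accounts_to_delete():
        print(nickname)  # feed this into the instance's own admin delete tooling
```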