====== Disaster recovery ======

===== Replace NVM device =====

Only one NVM slot is available, so the idea is to copy the NVM pool to HDDs, replace the device, and then restore the pool onto the new NVM device.

Stop CEPH:
<code bash>
systemctl stop ceph.target
systemctl stop ceph-osd.target
systemctl stop ceph-mgr.target
systemctl stop ceph-mon.target
systemctl stop ceph-mds.target
systemctl stop ceph-crash.service
</code>
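
To verify that no CEPH units are still running:
<code bash>
systemctl list-units 'ceph*' --state=running
</code>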

Backup partition layout:
<code bash>
sgdisk -b nvm.sgdisk /dev/...
sgdisk -p /dev/...
</code>

Move ZFS nvmpool to HDDs:
<code bash>
# create a sparse zvol on hddpool matching the NVM pool size
zfs destroy hddpool/...
zfs create -s -b 8192 -V 387.8G hddpool/...

# find the device node of the new zvol
ls -l /dev/zvol/hddpool/...
lrwxrwxrwx 1 root root 11 01-15 11:00 /dev/zvol/hddpool/... -> ../../zd192

# attach the zvol as a mirror of the existing NVM device
zpool attach nvmpool 7b375b69-3ef9-c94b-bab5-ef68f13df47c /dev/zd192
</code>
And wait for the resilver to finish.
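Resilver progress can be checked (and, on OpenZFS 2.0+, waited for):
<code bash>
zpool status nvmpool
# block until the resilver completes
zpool wait -t resilver nvmpool
</code>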

Remove NVM from ''nvmpool'':
<code bash>
# detach the NVM device, leaving the pool running on the zvol alone
zpool detach nvmpool 7b375b69-3ef9-c94b-bab5-ef68f13df47c
</code>

Remove all ZILs, L2ARCs and swap:
<code bash>
swapoff -a
vi /etc/fstab   # remove or comment out the swap entry on the NVM device

zpool remove hddpool <ZIL DEVICE>
zpool remove hddpool <L2ARC DEVICE>
zpool remove rpool <ZIL DEVICE>
zpool remove rpool <L2ARC DEVICE>
</code>

CEPH OSD will be recreated from scratch, to force a rebuild of the OSD DB (which can be too big due to a metadata bug in a previous CEPH version).

Replace the NVM device.

Recreate the partitions, or restore them from the backup:
<code bash>
sgdisk -l nvm.sgdisk /dev/...
</code>
Partitions:
  * swap
  * rpool_zil
  * hddpool_zil
  * hddpool_l2arc
  * ceph_db (for a 4GB ceph OSD DB create 4096MB+4MB)
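
If creating them manually, a sketch (the device name and sizes are assumptions, adjust to the saved layout):
<code bash>
sgdisk --clear /dev/nvme0n1
sgdisk -n 1:0:+16G   -c 1:swap          /dev/nvme0n1
sgdisk -n 2:0:+8G    -c 2:rpool_zil     /dev/nvme0n1
sgdisk -n 3:0:+8G    -c 3:hddpool_zil   /dev/nvme0n1
sgdisk -n 4:0:+64G   -c 4:hddpool_l2arc /dev/nvme0n1
sgdisk -n 5:0:+4100M -c 5:ceph_db       /dev/nvme0n1
</code>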

Add ZILs, L2ARCs and swap back.
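For example (partition paths via partlabel are assumptions based on the names above):
<code bash>
zpool add rpool   log   /dev/disk/by-partlabel/rpool_zil
zpool add hddpool log   /dev/disk/by-partlabel/hddpool_zil
zpool add hddpool cache /dev/disk/by-partlabel/hddpool_l2arc

mkswap /dev/disk/by-partlabel/swap
vi /etc/fstab   # restore the swap entry
swapon -a
</code>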

Start ''ceph.target'' again and recreate the OSD.
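A sketch of the OSD recreation (OSD id and data device are assumptions):
<code bash>
# remove the old OSD and wipe its data device
ceph osd out 0
pveceph osd destroy 0 --cleanup

# create the OSD again, with its DB on the new NVM partition
pveceph osd create /dev/sdc --db_dev /dev/disk/by-partlabel/ceph_db
</code>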

Move ''nvmpool'' back from the hdd zvol to the new NVM device:
<code bash>
# attach the new NVM partition as a mirror of the zvol
zpool attach nvmpool zd16 426718f1-1b1e-40c0-a6e2-1332fe5c3f2c
# when the resilver has finished, drop the zvol from the pool
zpool detach nvmpool zd16
</code>

===== Replace rpool device =====

Proxmox rpool ZFS is located on the 3rd partition (1st is Grub BOOT, 2nd is EFI, 3rd is ZFS).
To replace a failed device, the partition layout has to be replicated first.

With a new device of greater or equal size, simply replicate the partitions:
<code bash>
# replicate layout from SDA to SDB
sgdisk /dev/sda -R /dev/sdb
# generate new UUIDs:
sgdisk -G /dev/sdb
</code>

To replicate the layout on a smaller device, the partitions have to be created manually:
<code bash>
sgdisk -p /dev/sda

Number  Start (sector)    End (sector)  Size       Code  Name
   1    ...
   2    ...
   3    ...

sgdisk --clear /dev/sdb
sgdisk /dev/sdb -a1 --new 1:...
sgdisk /dev/sdb --new 2:...
sgdisk /dev/sdb --new 3:...
</code>

Restore the bootloader:
<code bash>
proxmox-boot-tool format /dev/sdb2
proxmox-boot-tool init /dev/sdb2
proxmox-boot-tool clean
</code>
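
The result can be verified with:
<code bash>
proxmox-boot-tool status
</code>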

Attach the new device, then drop the failed one:
<code bash>
zpool attach rpool ata-SPCC_Solid_State_Disk_XXXXXXXXXXXX-part3 /dev/...
# wait for the resilver to finish, then remove the failed device
zpool offline rpool ata-SSDPR-CX400-128-G2_XXXXXXXXX-part3
zpool detach rpool ata-SSDPR-CX400-128-G2_XXXXXXXXX-part3
</code>

===== Migrate VM from dead node =====
<code bash>
zfs send rpool2/... | zfs recv ...
zfs send rpool2/...
</code>
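
A fuller sketch of such a migration (pool, dataset and host names are made up for illustration):
<code bash>
# snapshot the VM disk on the recovered pool
zfs snapshot rpool2/data/vm-100-disk-0@migrate
# send it to the target node over ssh
zfs send rpool2/data/vm-100-disk-0@migrate | ssh pve2 zfs recv rpool/data/vm-100-disk-0
</code>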

===== Reinstall node =====

Remember to clean any additional device partitions belonging to the old ''rpool'' (e.g. ZIL/L2ARC on other devices).

Install fresh Proxmox.
Create the common cluster-wide mountpoints on local storage.
Copy all ZFS datasets from the backup ZFS pool:
<code bash>
zfs send rpool2/... | zfs recv ...
...
</code>
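
Instead of sending dataset by dataset, a recursive replication stream can be used (names assumed for illustration):
<code bash>
zfs snapshot -r rpool2/backup@restore
zfs send -R rpool2/backup@restore | zfs recv -d rpool
</code>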
For CT volumes it gets more complicated:
<code>
root@pve3:~# zfs send rpool2/... | zfs recv ...
warning: cannot send 'rpool2/...': ...
cannot receive: failed to read from stream
</code>
The reason for the problem is that the SOURCE dataset is mounted. Solution:
<code bash>
zfs set canmount=off rpool2/...
</code>
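
Once the copy is done, mounting can be re-enabled on the source by inheriting the property again:
<code bash>
zfs inherit canmount rpool2/...
</code>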

Try to join the cluster from the new (reinstalled) node ''pve3''.
The ''--force'' option is needed, because a node with the same name already exists in the cluster.

<code bash>
root@pve3:~# pvecm add ...

Please enter superuser (root) password for '...': ********
Establishing API connection with host '...'
The authenticity of host '...' can't be established.
X509 SHA256 key fingerprint is D2:...
Are you sure you want to continue connecting (yes/no)? yes
Login succeeded.
check cluster join API version
No cluster network links passed explicitly, fallback to local node IP '...'
Request addition of this node
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/...'
waiting for quorum...OK
(re)generate node files
generate new node certificate
merge authorized SSH keys and known hosts
generated new node certificate, restart pveproxy and pvedaemon services
successfully added node 'pve3' to cluster.
</code>