Cluster maintenance

1. All our “high availability clusters” are active-passive pairs: the active node does essentially all the work while the other node sits idle. The most common reason for the passive node to become active is a manual failover for a maintenance reboot: kernel update, libc update, and so on.

2. Older clusters (CentOS 5 and 6) use heartbeat with mon. Newer clusters (CentOS 7 as of this writing) use pacemaker + corosync + pcs. Storage clusters use DRBD 8.4, with RPMs installed from ELRepo.

pcs status

will work on C7 and fail on C5 or 6.

service heartbeat status

will fail on C7 and work on others.

cat /proc/drbd

will work where DRBD is running.
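Those three probes can be rolled into one sketch. The package paths are assumptions (pcs from the pcs RPM, hb_standby from the heartbeat RPM); the PREFIX argument exists only so the function can be exercised against a fake root.

```shell
#!/bin/sh
# Sketch: guess which cluster stack a node runs from the files installed.
# The /usr/sbin/pcs and /usr/share/heartbeat/hb_standby paths are
# assumptions; PREFIX lets you point the check at a fake root for testing.
detect_stack() {
    prefix="${1:-}"
    if [ -e "$prefix/usr/sbin/pcs" ]; then
        echo "pacemaker/pcs (CentOS 7)"
    elif [ -e "$prefix/usr/share/heartbeat/hb_standby" ]; then
        echo "heartbeat (CentOS 5/6)"
    else
        echo "unknown"
    fi
}

detect_stack    # inspect the real root on this node
```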

3. Most DRBD devices are backed by mdadm RAID, so a single failed disk doesn't trigger a cluster failover.
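A degraded array is easy to miss, so here is a sketch that flags one from /proc/mdstat text; the "[UU]" member-state notation (with "_" marking a missing or failed member) is standard mdstat output, and the function reads stdin so it can be tried on a saved copy.

```shell
#!/bin/sh
# Sketch: list degraded md arrays from /proc/mdstat-style text on stdin.
# In mdstat, "[UU]" shows per-slot member state; "_" marks a missing or
# failed member, which is what we want to catch before any failover work.
degraded_arrays() {
    awk '/^md[0-9]/ { dev = $1 }
         /\[[U_]+\]/ && /_/ { print dev }'
}

# Typical use on a node:
#   degraded_arrays < /proc/mdstat
```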

Maintenance reboot

For a software update, first update and reboot the passive node. Check DRBD and cluster status and make sure everything is clear before touching the active node.
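A minimal sketch of the DRBD "all clear" check, assuming the usual /proc/drbd line format from DRBD 8.4 (cs: is the connection state, ds: the disk states); it reads stdin so it can be tried on a saved copy.

```shell
#!/bin/sh
# Sketch: succeed only when every resource line in /proc/drbd-style input
# is Connected with UpToDate data on both sides.
drbd_all_clear() {
    awk 'BEGIN { ok = 1 }
         /cs:/ && !(/cs:Connected/ && /ds:UpToDate\/UpToDate/) { ok = 0 }
         END { exit ok ? 0 : 1 }'
}

# Typical use before touching the cluster:
#   drbd_all_clear < /proc/drbd && echo "safe to proceed"
```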


On heartbeat clusters (C5/6), run on the active node:

/usr/share/heartbeat/hb_standby all

or, on the standby node:

/usr/share/heartbeat/hb_takeover all

On either node, watch

tail -f /var/log/messages

and wait for the failover to finish.
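Instead of eyeballing the log, you can also poll /proc/drbd until this node has dropped to Secondary, meaning the peer took over. A sketch (the file argument exists only so it can be tried against a saved copy):

```shell
#!/bin/sh
# Sketch: block until this node is DRBD Secondary with the peer Primary
# ("ro:Secondary/Primary" in /proc/drbd), i.e. the failover finished.
wait_for_secondary() {
    f="${1:-/proc/drbd}"
    until grep -q 'ro:Secondary/Primary' "$f"; do
        sleep 2
    done
    echo "failover complete: this node is DRBD Secondary"
}
```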

Reboot the primary (formerly active) node when ready. Cluster resources will migrate back to it after it's back up.


On pcs clusters (C7), run on either node:

pcs cluster standby <ACTIVE NODE>

and check

pcs status

After rebooting the primary,

pcs cluster unstandby <ACTIVE NODE>

because the resources will not migrate back until you “un-standby” the node.
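A sketch that scans saved pcs status output for primitive resources that are not Started; the "(ocf:" / "Started" line format is an assumption based on typical CentOS 7 pacemaker output, and clone/master-slave sets are not covered, so verify against your own cluster's output.

```shell
#!/bin/sh
# Sketch: succeed only if every "(ocf:" resource line in a saved
# `pcs status` dump reports Started. Line format is an assumption from
# CentOS 7 pacemaker output; clone/master-slave sets are not checked.
all_started() {
    ! grep '(ocf:' "$1" | grep -qvw Started
}

# Typical use after standby/unstandby:
#   pcs status > /tmp/pcs.out && all_started /tmp/pcs.out && echo "all Started"
```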

DRBD split brain

As of this writing I haven't tried split-brain recovery on a pcs + pacemaker cluster. You may need to schedule maintenance downtime, and power off the passive node until then so its DRBD doesn't get any further out of sync.

On heartbeat clusters it “Just Works™”.
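For reference, the standard manual recovery in DRBD 8.4 (per the DRBD user guide) is to discard the data on one node and resync from the peer. I haven't verified this on our clusters, and "r0" is a placeholder resource name; the sketch below is dry-run by default and only prints what it would do.

```shell
#!/bin/sh
# Sketch of the standard DRBD 8.4 manual split-brain recovery (commands
# from the DRBD user guide). "r0" is a placeholder resource name.
# Dry-run by default: set DRY_RUN= (empty) to execute for real.
DRY_RUN=${DRY_RUN-1}
run() { echo "+ $*"; [ -z "$DRY_RUN" ] && "$@"; return 0; }

RES=r0

# On the split-brain VICTIM -- the node whose changes you throw away:
run drbdadm disconnect "$RES"
run drbdadm secondary  "$RES"
run drbdadm connect --discard-my-data "$RES"

# Then, on the surviving node, only if it dropped to StandAlone:
#   drbdadm connect r0
```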