Hello,
I am running a small cluster on two root servers that are connected via a netcup vLAN. Every day I run into the problem that at some point the cluster falls apart and is no longer reachable. Today I was able to identify the cause: the vLAN interfaces briefly lose their connection to each other at random times (static IPs), which pushes the cluster into an inconsistent state, because resources get migrated needlessly and only halfway until the connection is restored.
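In case anyone wants to reproduce this independently of corosync: a plain ping loop between the static vLAN addresses shows the drops as well. A minimal sketch (192.168.0.2 is only a placeholder for the peer's vLAN IP, adjust to your setup):

while true; do
    if ! ping -c1 -W1 192.168.0.2 >/dev/null 2>&1; then
        # one lost ping = the vLAN peer did not answer within 1 s
        echo "$(date -Is) vLAN peer unreachable" >> /var/log/vlan-drops.log
    fi
    sleep 1
done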
Is such a service interruption to be expected with netcup's vLAN (i.e. a feature), or is it a bug? If it is a feature, can I at least assume that the external network interfaces run stably? In that case I could replace the vLAN with a server-to-server tunnel, roughly along the lines of the sketch below.
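The tunnel I have in mind would be a simple WireGuard point-to-point link over the external interfaces. A rough sketch for mx1, assuming WireGuard is available on CentOS 8; keys, port and the 10.10.10.0/30 transfer net are placeholders:

# /etc/wireguard/wg0.conf on mx1
[Interface]
PrivateKey = <mx1-private-key>       # generate with: wg genkey
Address = 10.10.10.1/30              # placeholder transfer net
ListenPort = 51820

[Peer]
PublicKey = <mx2-public-key>
Endpoint = <mx2-external-ip>:51820
AllowedIPs = 10.10.10.2/32
PersistentKeepalive = 25             # keep the link from going idle

# bring it up and enable at boot (wireguard-tools)
systemctl enable --now wg-quick@wg0

corosync and DRBD would then talk over 10.10.10.1/10.10.10.2 instead of the vLAN addresses.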
OS: CentOS 8 minimal
vLAN: Free 100 mbit
Hosts: RS500 (2 CPUs, 4 GB RAM, 240 GB SAS)
Regards
Attached is the excerpt from the logs that illustrates the problem:
Dec 14 09:00:04 mx1 corosync[1329]: [KNET ] link: host: 2 link: 0 is down
Dec 14 09:00:04 mx1 corosync[1329]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Dec 14 09:00:04 mx1 corosync[1329]: [KNET ] host: host: 2 has no active links
Dec 14 09:00:04 mx1 corosync[1329]: [TOTEM ] Token has not been received in 273 ms
Dec 14 09:00:04 mx1 corosync[1329]: [TOTEM ] A processor failed, forming new configuration.
Dec 14 09:00:05 mx1 corosync[1329]: [TOTEM ] A new membership (1:128) was formed. Members left: 2
Dec 14 09:00:05 mx1 corosync[1329]: [TOTEM ] Failed to receive the leave message. failed: 2
Dec 14 09:00:05 mx1 corosync[1329]: [CPG ] downlist left_list: 1 received
Dec 14 09:00:05 mx1 corosync[1329]: [QUORUM] Members[1]: 1
Dec 14 09:00:05 mx1 corosync[1329]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 14 09:00:05 mx1 pacemaker-attrd[1591]: notice: Lost attribute writer mx2-cr
Dec 14 09:00:05 mx1 pacemakerd[1518]: notice: Node mx2-cr state is now lost
Dec 14 09:00:05 mx1 pacemaker-fenced[1589]: notice: Node mx2-cr state is now lost
Dec 14 09:00:05 mx1 pacemaker-based[1588]: notice: Node mx2-cr state is now lost
Dec 14 09:00:05 mx1 pacemaker-based[1588]: notice: Purged 1 peer with id=2 and/or uname=mx2-cr from the membership cache
Dec 14 09:00:05 mx1 pacemaker-attrd[1591]: notice: Node mx2-cr state is now lost
Dec 14 09:00:05 mx1 pacemaker-attrd[1591]: notice: Removing all mx2-cr attributes for peer loss
Dec 14 09:00:05 mx1 pacemaker-attrd[1591]: notice: Purged 1 peer with id=2 and/or uname=mx2-cr from the membership cache
Dec 14 09:00:05 mx1 pacemaker-attrd[1591]: notice: Recorded local node as attribute writer (was unset)
Dec 14 09:00:05 mx1 pacemaker-controld[1593]: notice: Node mx2-cr state is now lost
Dec 14 09:00:05 mx1 pacemaker-controld[1593]: warning: Our DC node (mx2-cr) left the cluster
Dec 14 09:00:05 mx1 pacemaker-fenced[1589]: notice: Purged 1 peer with id=2 and/or uname=mx2-cr from the membership cache
Dec 14 09:00:05 mx1 pacemaker-controld[1593]: notice: State transition S_NOT_DC -> S_ELECTION
Dec 14 09:00:05 mx1 pacemaker-controld[1593]: notice: State transition S_ELECTION -> S_INTEGRATION
Dec 14 09:00:06 mx1 pacemaker-schedulerd[1592]: notice: On loss of quorum: Ignore
Dec 14 09:00:06 mx1 pacemaker-schedulerd[1592]: notice: Calculated transition 2, saving inputs in /var/lib/pacemaker/pengine/pe-input-226.bz2
Dec 14 09:00:06 mx1 pacemaker-controld[1593]: notice: Transition 2 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-226.bz2): Complete
Dec 14 09:00:06 mx1 pacemaker-controld[1593]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Dec 14 09:00:07 mx1 kernel: drbd bst mx2: PingAck did not arrive in time.
Dec 14 09:00:07 mx1 kernel: drbd git mx2: PingAck did not arrive in time.
Dec 14 09:00:07 mx1 kernel: drbd bst mx2: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Dec 14 09:00:07 mx1 kernel: drbd bst/0 drbd8 mx2: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Dec 14 09:00:07 mx1 kernel: drbd git mx2: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Dec 14 09:00:07 mx1 kernel: drbd git/0 drbd1 mx2: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Dec 14 09:00:07 mx1 kernel: drbd git mx2: ack_receiver terminated
Dec 14 09:00:07 mx1 kernel: drbd git mx2: Terminating ack_recv thread
Dec 14 09:00:07 mx1 kernel: drbd bst mx2: ack_receiver terminated
Dec 14 09:00:07 mx1 kernel: drbd bst mx2: Terminating ack_recv thread
Dec 14 09:00:07 mx1 kernel: drbd git mx2: Aborting remote state change 0 commit not possible
Dec 14 09:00:07 mx1 kernel: drbd bst mx2: Aborting remote state change 0 commit not possible
Dec 14 09:00:07 mx1 kernel: drbd bst mx2: Restarting sender thread
Dec 14 09:00:07 mx1 kernel: drbd git mx2: Restarting sender thread
Dec 14 09:00:07 mx1 kernel: drbd git mx2: Connection closed
Dec 14 09:00:07 mx1 kernel: drbd bst mx2: Connection closed
Dec 14 09:00:07 mx1 kernel: drbd git mx2: conn( NetworkFailure -> Unconnected )
Dec 14 09:00:07 mx1 kernel: drbd bst mx2: conn( NetworkFailure -> Unconnected )
Dec 14 09:00:07 mx1 kernel: drbd git mx2: Restarting receiver thread
Dec 14 09:00:07 mx1 kernel: drbd bst mx2: Restarting receiver thread
Dec 14 09:00:07 mx1 kernel: drbd git mx2: conn( Unconnected -> Connecting )
Dec 14 09:00:07 mx1 kernel: drbd bst mx2: conn( Unconnected -> Connecting )