[Lustre-discuss] drbd(fail?)
Papp Tamás
tompos at martos.bme.hu
Thu Jan 24 10:26:26 PST 2008
Hello everybody!
I have a strange problem with my cluster.
Yesterday I saw that node3 of my Lustre cluster (the partner of node4 in
the heartbeat+drbd pair) had frozen, and node4 did not take over the OST.
After a reboot it always wrote 'System halted.' on the console, but it did
not actually go down. I disconnected node3, rebooted node4, and everything
worked fine.
Today I tried to get it working as before, on a fresh system with CentOS
4.4, drbd 0.7.25 and Lustre 1.6.4.1. The array drbd1, which is normally
primary on node4, came up fine.
node4:
0: cs:StandAlone st:Primary/Unknown ld:Consistent
ns:0 nr:0 dw:15404660 dr:88550854 al:11773 bm:11773 lo:0 pe:0 ua:0 ap:0
node3:
0: cs:WFConnection st:Secondary/Unknown ld:Consistent
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
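For anyone comparing the two states side by side, the status lines above can be split into their key:value fields. This is only a minimal Python sketch: the helper names parse_drbd_status and describe are made up for illustration, and it assumes the lines keep the "key:value" layout shown here.

```python
import re

def parse_drbd_status(line):
    """Split a DRBD status line into a dict of key -> value.

    Assumes whitespace-separated "key:value" tokens, as in the
    output quoted above. The leading "0:" (device number) is skipped
    because nothing follows its colon directly.
    """
    return dict(re.findall(r"(\w+):(\S+)", line))

def describe(name, status):
    """Return a one-line, human-readable summary of a parsed status."""
    local_role, peer_role = status["st"].split("/")
    return (f"{name}: connection={status['cs']} local={local_role} "
            f"peer={peer_role} disk={status['ld']}")

node4 = parse_drbd_status("0: cs:StandAlone st:Primary/Unknown ld:Consistent")
node3 = parse_drbd_status("0: cs:WFConnection st:Secondary/Unknown ld:Consistent")

print(describe("node4", node4))
print(describe("node3", node3))
```

Read this way, node3 is waiting for a connection while node4 is StandAlone, i.e. not trying to connect at all.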
# drbdadm --dry-run wait_connect ost-3
drbdsetup /dev/drbd0 wait_connect --wfc-timeout=120 --degr-wfc-timeout=120
It said: Aborting.
After running 'drbdadm connect ost-3' I saw this in the messages log:
Jan 24 09:31:37 node4 kernel: drbd0: drbdsetup [8135]: cstate StandAlone --> Unconnected
Jan 24 09:31:37 node4 kernel: drbd0: drbd0_receiver [8136]: cstate Unconnected --> WFConnection
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate WFConnection --> WFReportParams
Jan 24 09:31:39 node4 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Jan 24 09:31:39 node4 kernel: drbd0: Connection established.
Jan 24 09:31:39 node4 kernel: drbd0: I am(P): 1:00000003:00000003:00000053:00000003:10
Jan 24 09:31:39 node4 kernel: drbd0: Peer(S): 1:00000007:00000003:0000004a:00000004:00
Jan 24 09:31:39 node4 kernel: drbd0: Current Primary shall become sync TARGET! Aborting to prevent data corruption.
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate WFReportParams --> StandAlone
Jan 24 09:31:39 node4 kernel: drbd0: error receiving ReportParams, l: 72!
Jan 24 09:31:39 node4 kernel: drbd0: asender terminated
Jan 24 09:31:39 node4 kernel: drbd0: worker terminated
Jan 24 09:31:39 node4 kernel: drbd0: drbd0_receiver [8136]: cstate StandAlone --> StandAlone
Jan 24 09:31:39 node4 kernel: drbd0: Connection lost.
Jan 24 09:31:39 node4 kernel: drbd0: receiver terminated
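The 'I am(P)' and 'Peer(S)' lines in the log are easier to compare once they are split into fields. The Python sketch below does only that and reports the first field that differs. The field names in FIELDS are an assumption about how DRBD 0.7 lays out its generation counters (a consistency flag, four hexadecimal counters, then two flag characters), and so is the idea that the counters are compared left to right. Both should be checked against the DRBD 0.7 documentation or source before drawing conclusions.

```python
# Assumed field layout of the colon-separated counter string (not confirmed
# by the log itself): consistency flag, four hex counters, flag characters.
FIELDS = ["consistent", "human", "timeout", "connected", "arbitrary", "flags"]

def parse_counters(text):
    """Parse a string like '1:00000003:00000003:00000053:00000003:10'."""
    parts = text.strip().split(":")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    values = {"consistent": int(parts[0])}
    for name, raw in zip(FIELDS[1:5], parts[1:5]):
        values[name] = int(raw, 16)   # counters are printed in hexadecimal
    values["flags"] = parts[5]        # kept as text, not interpreted here
    return values

def first_difference(local, peer):
    """Return (field, local_value, peer_value) for the first differing field."""
    for name in FIELDS[:5]:
        if local[name] != peer[name]:
            return name, local[name], peer[name]
    return None

local = parse_counters("1:00000003:00000003:00000053:00000003:10")  # I am(P)
peer = parse_counters("1:00000007:00000003:0000004a:00000004:00")   # Peer(S)

print(first_difference(local, peer))
```

With the values from the log, the first differing field is the second one, where the peer (node3) has 7 and the local node (node4) has 3. If a higher counter means newer data, as assumed here, that would be consistent with the 'Current Primary shall become sync TARGET' message.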
Why didn't it work? I wanted node4 to be the SyncSource; node3 behaved
fine and was listening on the right port in cstate WFConnection.
Then I made a mistake: I disabled heartbeat and rebooted node4. Both
nodes then came up as Secondary and started to sync, with node3 as the
SyncSource. Why? What would have been the right command?
So they got synced. Some time after that (I don't know exactly when),
node4 started to behave like node3 did yesterday: it wrote 'System halted'
and everything stopped working. I stopped heartbeat, reset the node,
mounted the OST by hand, and now it looks fine, but who knows; I'm a bit
paranoid now.
I should also mention that node3 was running the 1.6.0.1 kernel with drbd
0.7.22 (but the 0.7.25 userland) until the last reboot mentioned above. I
don't know whether that could have caused a problem.
Does anybody have an idea what happened, and what I should have done at
any point in this story?
Thank you,
tamas