Hi Damiri,<br><br>If heartbeat is not able to start (mount) one of the OSTs, I would recommend stopping heartbeat on both servers and then mounting the troubled OST manually. You should then see why the OST is not mounting. To check the consistency of the filesystem, in your case I would first run fsck with the -n switch to see the extent of the damage; this also avoids damaging your filesystem even more if you have a faulty controller or links corrupting data. In a normal situation I use the following command: fsck -f -v /dev/<ost_dev> -C0<br>
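A minimal sketch of the read-only pass described above, assuming the OST is unmounted and that fsck on the OSS resolves to the e2fsprogs build that matches your Lustre/ldiskfs version. The device path and log file name are placeholders, and the FSCK override is a hypothetical hook for trying the wrapper without touching a disk:<br>

```shell
#!/bin/sh
# Read-only consistency check of one unmounted OST, with the output logged.
# -f forces a full check, -n answers "no" to every repair prompt (nothing is
# written to the device), -v is verbose.
# FSCK is a hypothetical override, e.g. FSCK=echo for a dry run.
check_ost_readonly() {
    dev="$1"    # placeholder, e.g. /dev/mpath/lun_11
    log="$2"    # placeholder, e.g. /root/fsck-lun_11.log
    "${FSCK:-fsck}" -f -n -v "$dev" 2>&1 | tee "$log"
}

# Example: check_ost_readonly /dev/mpath/lun_11 /root/fsck-lun_11.log
```

Piping through tee keeps the output on screen and in the log file at the same time, so the log can be posted for further troubleshooting.<br>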

Make sure that you log the output from fsck, which will be essential for further troubleshooting.<br><br>Best regards,<br><br>Wojciech<br><br><div class="gmail_quote">On 19 July 2011 16:58, Young, Damiri <span dir="ltr"><<a href="mailto:Damiri.Young@unt.edu">Damiri.Young@unt.edu</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div>

<div style="direction:ltr;font-family:Tahoma;color:#000000;font-size:10pt">Many thanks for the useful info, Turek. I mentioned the HA (heartbeat v2) issues because after the troubled I/O node got its paths back to the OSTs, it failed 4 of the 5 OSTs over to

 its sibling server, where they're now mounted. To me it seems the OSTs (we're using Lustre v1.6, btw) won't be released until the failed-over node is reset by its sibling.

<br>

<br>

The OSSs seem to have trouble connecting to the 1 OST I mentioned:<br>

-------------------------------- messages -------------------------------------<br>

Jul 19 10:29:02 IO-10 kernel: LustreError: 29429:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@ffff810315ab7000 x62666211/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311089442 ref 1 fl Interpret:/0/0 rc -19/0<br>

Jul 19 10:29:02 IO-10 kernel: LustreError: 29429:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 2987 previous similar messages<br>

Jul 19 10:39:02 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000a_UUID' is not available  for connect (stopping)<br>

Jul 19 10:39:02 IO-10 kernel: LustreError: Skipped 2988 previous similar messages<br>

---------------------------------------------------------------------------------------<br>

<br>

<br>

--------------------- lfs check osts output --------------------------------<br>

lfs check osts<br>

es1-OST0000-osc-ffff810c5dc61000 active.<br>

es1-OST0001-osc-ffff810c5dc61000 active.<br>

es1-OST0002-osc-ffff810c5dc61000 active.<br>

es1-OST0003-osc-ffff810c5dc61000 active.<br>

es1-OST0004-osc-ffff810c5dc61000 active.<br>

es1-OST0005-osc-ffff810c5dc61000 active.<br>

es1-OST0006-osc-ffff810c5dc61000 active.<br>

es1-OST0007-osc-ffff810c5dc61000 active.<br>

es1-OST0008-osc-ffff810c5dc61000 active.<br>

es1-OST0009-osc-ffff810c5dc61000 active.<br>

error: check 'es1-OST000a-osc-ffff810c5dc61000': Resource temporarily unavailable (11)<br>

es1-OST000b-osc-ffff810c5dc61000 active.<br>

es1-OST000c-osc-ffff810c5dc61000 active.<br>

es1-OST000d-osc-ffff810c5dc61000 active.<br>

es1-OST000e-osc-ffff810c5dc61000 active.<br>

es1-OST000f-osc-ffff810c5dc61000 active.<br>

es1-OST0010-osc-ffff810c5dc61000 active.<br>

es1-OST0011-osc-ffff810c5dc61000 active.<br>

es1-OST0012-osc-ffff810c5dc61000 active.<br>

es1-OST0013-osc-ffff810c5dc61000 active.<br>

---------------------------------------------------------------------------------------------------------------<br>
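As a rough sketch, the failing entry can be picked out of a long listing like the one above by dropping every line that ends in "active.". The function name is made up for illustration; it reads the command output on stdin so it also works on a saved log:<br>

```shell
#!/bin/sh
# Print only the lines of `lfs check osts` output that do not end in
# "active.", i.e. the OSCs that could not reach their OST.
inactive_osts() {
    grep -v ' active\.$'
}

# Example (on a client): lfs check osts 2>&1 | inactive_osts
```

The 2>&1 in the example is there in case the error line is written to stderr rather than stdout.<br>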

<br>

----------------- cat /proc/fs/lustre/devices output --------------------<br>

cat /proc/fs/lustre/devices <br>

  1 UP ost OSS OSS_uuid 3<br>

  2 ST obdfilter es1-OST000a es1-OST000a_UUID 3<br>

--------------------------------------------------------------------------------------<br>
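A small sketch for spotting devices that are not in the UP state, assuming the column layout shown in the listing above (index, state, type, name, UUID, refcount). The function name is hypothetical:<br>

```shell
#!/bin/sh
# List Lustre devices whose state column is not "UP".
# Expected columns: index state type name uuid refcount
not_up_devices() {
    awk 'NF >= 4 && $2 != "UP" { print $4, $2 }' "${1:-/proc/fs/lustre/devices}"
}

# Example (on the OSS): not_up_devices
```

For the listing above this would print the es1-OST000a entry with state ST, which matches the "(stopping)" reported in the kernel messages.<br>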

<br>

 Do you know the correct command(s) for running fsck?<br>

<div><br>

<div style="font-family:Tahoma;font-size:13px">

<div><font size="2">

<div>Best Regards,<br>

--<br>

DaMiri Young<div class="im"><br>

System Engineer<br>

High Performance Computing Team<br></div><div class="im">

CITC Academic Computing and User Services | UNT</div></div>

</font></div>

</div>

</div>

<div style="font-family:Times New Roman;color:rgb(0, 0, 0);font-size:16px">

<hr>

<div style="direction:ltr"><font color="#000000" face="Tahoma" size="2"><b>From:</b> <a href="mailto:turek.wojciech@gmail.com" target="_blank">turek.wojciech@gmail.com</a> [<a href="mailto:turek.wojciech@gmail.com" target="_blank">turek.wojciech@gmail.com</a>] on behalf of Wojciech Turek [<a href="mailto:wjt27@cam.ac.uk" target="_blank">wjt27@cam.ac.uk</a>]<br>

<b>Sent:</b> Monday, July 18, 2011 6:06 PM<br>

<b>To:</b> DaMiri Young<br>

<b>Cc:</b> Young, Damiri; <a href="mailto:lustre-discuss@lists.lustre.org" target="_blank">lustre-discuss@lists.lustre.org</a><br>

<b>Subject:</b> Re: [Lustre-discuss] IO-Node issue<br>

</font><br>

</div><div><div></div><div class="h5">

<div></div>

<div>OK, good, so at least you now have the OSTs mounting. I would have run fsck anyway, even though the log said that LDISKFS recovered correctly.<br>

<br>

As for your Heartbeat problems: if you use the old V1 heartbeat configuration with a haresources file, then I don't think that STONITH has anything to do with your filesystem resources not starting.<br>

From your logs it looks like the STONITH device is not configured properly, so first you need to test your STONITH config as follows:<br>

<br>

# stonith -t external/ipmi -n<br>

HOSTNAME  IP_ADDR  IPMI_USER  IPMI_PASSWD_FILE<br>

<br>

<br>

# stonith -t external/ipmi -p "oss02 10.145.245.2 root /etc/ha.d/ipmitool.passwd" -lS<br>

stonith: external/ipmi device OK.<br>

<br>

As you can see, with my config the stonith command returns OK, so you need to look at your config and tweak it until it also returns OK.<br>
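The check above can be wrapped in a small helper so it is easy to repeat after each config change. This is only a sketch: it assumes that stonith exits non-zero when the status check fails, and the STONITH override is a hypothetical hook for a dry run:<br>

```shell
#!/bin/sh
# Run the status check shown above and report the result.
# The parameter string uses the fields printed by `stonith -t external/ipmi -n`:
#   HOSTNAME IP_ADDR IPMI_USER IPMI_PASSWD_FILE
# Assumption: stonith returns a non-zero exit status when the check fails.
check_stonith() {
    params="$1"
    if "${STONITH:-stonith}" -t external/ipmi -p "$params" -lS; then
        echo "STONITH config OK: $params"
    else
        echo "STONITH config FAILED: $params" >&2
        return 1
    fi
}

# Example: check_stonith "oss02 10.145.245.2 root /etc/ha.d/ipmitool.passwd"
```

Run it on each node against its sibling's parameters, since each server has to be able to reset the other.<br>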

<br>

regards<br>

<br>

Wojciech<br>

<br>

<div class="gmail_quote">On 18 July 2011 23:02, DaMiri Young <span dir="ltr"><<a href="mailto:damiri@unt.edu" target="_blank">damiri@unt.edu</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">

So you were right about the I/O node losing contact with the OST. In short, after enabling Lustre debugging and restarting the opensmd and openibd services on the troubled node, the OSTs were remounted and Lustre entered recovery:<br>

--------------------------- messages ----------------------------------------<br>

Jul 18 10:02:56 IO-10 kernel: ib_ipath 0000:05:00.0: We got a lid: 0x75<br>

Jul 18 10:02:56 IO-10 kernel: ib_srp: ASYNC event= 11 on device= ipath0<br>

Jul 18 10:02:56 IO-10 kernel: ib_srp: ASYNC event= 13 on device= ipath0<br>

Jul 18 10:02:56 IO-10 kernel: ib_srp: ASYNC event= 17 on device= ipath0<br>

Jul 18 10:02:56 IO-10 kernel: ib_srp: ASYNC event= 9 on device= ipath0<br>

Jul 18 10:02:59 IO-10 kernel: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready<br>

Jul 18 10:03:01 IO-10 avahi-daemon[24939]: New relevant interface ib0.IPv6 for mDNS.<br>

Jul 18 10:03:01 IO-10 avahi-daemon[24939]: Joining mDNS multicast group on interface ib0.IPv6 with address fe80::211:7500:ff:7bf6.<br>

Jul 18 10:03:01 IO-10 avahi-daemon[24939]: Registering new address record for fe80::211:7500:ff:7bf6 on ib0.<br>

Jul 18 10:03:15 IO-10 ntpd[23084]: synchronized to 10.0.0.1, stratum 3<br>

Jul 18 11:41:40 IO-10 kernel: megasas: 00.00.03.15-RH1 Wed Nov. 21 10:29:45 PST 2007<br>

Jul 18 11:41:41 IO-10 kernel: Lustre: OBD class driver, <a href="http://www.lustre.org/" target="_blank">

http://www.lustre.org/</a><br>

Jul 18 11:41:41 IO-10 kernel:         Lustre Version: 1.6.6<br>

Jul 18 11:41:41 IO-10 kernel:         Build Version: 1.6.6-1.6.6-ddn3.1-20090527173746<br>

Jul 18 11:41:41 IO-10 kernel: Lustre: 28686:0:(o2iblnd_modparams.c:324:kiblnd_tunables_init()) Concurrent sends 7 is lower than message queue size: 8, performance may drop slightly.<br>

Jul 18 11:41:41 IO-10 kernel: Lustre: Added LNI 10.1.0.229@o2ib [8/64]<br>

Jul 18 11:41:41 IO-10 kernel: Lustre: Lustre Client File System; <a href="http://www.lustre.org/" target="_blank">

http://www.lustre.org/</a><br>

Jul 18 11:42:07 IO-10 kernel: kjournald starting.  Commit interval 5 seconds<br>

Jul 18 11:42:07 IO-10 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended<br>

Jul 18 11:42:07 IO-10 kernel: LDISKFS FS on dm-11, internal journal<br>

Jul 18 11:42:07 IO-10 kernel: LDISKFS-fs: recovery complete.<br>

Jul 18 11:42:07 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.<br>

Jul 18 11:42:07 IO-10 multipathd: dm-11: umount map (uevent)<br>

Jul 18 11:42:18 IO-10 kernel: kjournald starting.  Commit interval 5 seconds<br>

Jul 18 11:42:18 IO-10 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended<br>

Jul 18 11:42:18 IO-10 kernel: LDISKFS FS on dm-11, internal journal<br>

Jul 18 11:42:18 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.<br>

Jul 18 11:42:18 IO-10 kernel: LDISKFS-fs: file extents enabled<br>

Jul 18 11:42:18 IO-10 kernel: LDISKFS-fs: mballoc enabled<br>

Jul 18 11:42:18 IO-10 kernel: fsfilt_ldiskfs: no version for "ldiskfs_free_blocks" found: kernel tainted.<br>

Jul 18 11:42:18 IO-10 kernel: Lustre: Filtering OBD driver; <a href="http://www.lustre.org/" target="_blank">

http://www.lustre.org/</a><br>

Jul 18 11:42:18 IO-10 kernel: Lustre: 29999:0:(filter.c:868:filter_init_server_data()) RECOVERY: service es1-OST000a, 249 recoverable clients, last_rcvd 469628325<br>

Jul 18 11:42:18 IO-10 kernel: Lustre: OST es1-OST000a now serving dev (es1-OST000a/15fae56a-7dae-ba24-4423-347c0a118367), but will be in recovery for at least 5:00, or until 249 clients reconnect. During this time new clients will not be allowed to connect.

 Recovery progress can be monitored by watching /proc/fs/lustre/obdfilter/es1-OST000a/recovery_status.<br>

Jul 18 11:42:18 IO-10 kernel: Lustre: es1-OST000a.ost: set parameter quota_type=ug<br>

Jul 18 11:42:18 IO-10 kernel: Lustre: Server es1-OST000a on device /dev/mpath/lun_11 has started<br>

Jul 18 11:42:19 IO-10 kernel: Lustre: 28952:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000a: starting recovery timer<br>

Jul 18 11:42:19 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000c_UUID' is not available  for connect (no target)<br>

Jul 18 11:42:19 IO-10 kernel: LustreError: 28957:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@ffff810311f9f400 x36077513/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007439 ref 1 fl Interpret:/0/0 rc -19/0<br>

Jul 18 11:42:19 IO-10 kernel: LustreError: Skipped 3 previous similar messages<br>

Jul 18 11:42:19 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000b_UUID' is not available  for connect (no target)<br>

Jul 18 11:42:19 IO-10 kernel: LustreError: 28985:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@ffff8102f81ce800 x8649866/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007439 ref 1 fl Interpret:/0/0 rc -19/0<br>

Jul 18 11:42:19 IO-10 kernel: LustreError: 28985:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 3 previous similar messages<br>

Jul 18 11:42:19 IO-10 kernel: LustreError: Skipped 3 previous similar messages<br>

Jul 18 11:42:19 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000b_UUID' is not available  for connect (no target)<br>

Jul 18 11:42:19 IO-10 kernel: Lustre: 29068:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 248 recoverable clients remain<br>

Jul 18 11:42:19 IO-10 kernel: LustreError: 29010:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@ffff8102f81f2c00 x368697/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007439 ref 1 fl Interpret:/0/0 rc -19/0<br>

Jul 18 11:42:19 IO-10 kernel: LustreError: 29010:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 19 previous similar messages<br>

Jul 18 11:42:19 IO-10 kernel: LustreError: Skipped 19 previous similar messages<br>

Jul 18 11:42:19 IO-10 kernel: Lustre: 29012:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 247 recoverable clients remain<br>

Jul 18 11:42:20 IO-10 kernel: Lustre: 29106:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 240 recoverable clients remain<br>

Jul 18 11:42:20 IO-10 kernel: Lustre: 29106:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 6 previous similar messages<br>

Jul 18 11:42:20 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000b_UUID' is not available  for connect (no target)<br>

Jul 18 11:42:20 IO-10 kernel: LustreError: 29149:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@ffff81030eff2850 x68565826/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007440 ref 1 fl Interpret:/0/0 rc -19/0<br>

Jul 18 11:42:20 IO-10 kernel: LustreError: 29149:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 31 previous similar messages<br>

Jul 18 11:42:20 IO-10 kernel: LustreError: Skipped 31 previous similar messages<br>

Jul 18 11:42:21 IO-10 kernel: Lustre: 29196:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 232 recoverable clients remain<br>

Jul 18 11:42:21 IO-10 kernel: Lustre: 29196:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 7 previous similar messages<br>

Jul 18 11:42:22 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000b_UUID' is not available  for connect (no target)<br>

Jul 18 11:42:22 IO-10 kernel: LustreError: 29275:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@ffff810302713c50 x519337/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007442 ref 1 fl Interpret:/0/0 rc -19/0<br>

Jul 18 11:42:22 IO-10 kernel: LustreError: 29275:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 47 previous similar messages<br>

Jul 18 11:42:22 IO-10 kernel: LustreError: Skipped 47 previous similar messages<br>

Jul 18 11:42:23 IO-10 kernel: Lustre: 29320:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 221 recoverable clients remain<br>

Jul 18 11:42:23 IO-10 kernel: Lustre: 29320:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 10 previous similar messages<br>

Jul 18 11:42:27 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000c_UUID' is not available  for connect (no target)<br>

Jul 18 11:42:27 IO-10 kernel: LustreError: 29030:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@ffff8102f87bac00 x435304948/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007447 ref 1 fl Interpret:/0/0 rc -19/0<br>

Jul 18 11:42:27 IO-10 kernel: LustreError: 29030:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 91 previous similar messages<br>

Jul 18 11:42:27 IO-10 kernel: LustreError: Skipped 91 previous similar messages<br>

Jul 18 11:42:27 IO-10 kernel: Lustre: 29182:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000a: 196 recoverable clients remain<br>

Jul 18 11:42:27 IO-10 kernel: Lustre: 29182:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 24 previous similar messages<br>

Jul 18 11:42:46 IO-10 kernel: kjournald starting.  Commit interval 5 seconds<br>

Jul 18 11:42:46 IO-10 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended<br>

Jul 18 11:42:46 IO-10 kernel: LDISKFS FS on dm-10, internal journal<br>

Jul 18 11:42:46 IO-10 kernel: LDISKFS-fs: recovery complete.<br>

Jul 18 11:42:46 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.<br>

Jul 18 11:42:46 IO-10 multipathd: dm-10: umount map (uevent)<br>

Jul 18 11:42:58 IO-10 kernel: kjournald starting.  Commit interval 5 seconds<br>

Jul 18 11:42:58 IO-10 kernel: LDISKFS-fs warning: checktime reached, running e2fsck is recommended<br>

Jul 18 11:42:58 IO-10 kernel: LDISKFS FS on dm-10, internal journal<br>

Jul 18 11:42:58 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.<br>

Jul 18 11:42:58 IO-10 kernel: LDISKFS-fs: file extents enabled<br>

Jul 18 11:42:58 IO-10 kernel: LDISKFS-fs: mballoc enabled<br>

Jul 18 11:42:58 IO-10 kernel: Lustre: 30227:0:(filter.c:868:filter_init_server_data()) RECOVERY: service es1-OST000b, 249 recoverable clients, last_rcvd 608808684<br>

Jul 18 11:42:58 IO-10 kernel: Lustre: OST es1-OST000b now serving dev (es1-OST000b/1f38b48f-9a67-b3a6-4374-b25762e71391), but will be in recovery for at least 5:00, or until 249 clients reconnect. During this time new clients will not be allowed to connect.

 Recovery progress can be monitored by watching /proc/fs/lustre/obdfilter/es1-OST000b/recovery_status.<br>

Jul 18 11:42:58 IO-10 kernel: Lustre: es1-OST000b.ost: set parameter quota_type=ug<br>

Jul 18 11:42:58 IO-10 kernel: Lustre: Server es1-OST000b on device /dev/mpath/lun_12 has started<br>

Jul 18 11:43:09 IO-10 kernel: Lustre: 28975:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000b: starting recovery timer<br>

Jul 18 11:43:09 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000c_UUID' is not available  for connect (no target)<br>

Jul 18 11:43:09 IO-10 kernel: LustreError: Skipped 111 previous similar messages<br>

Jul 18 11:43:09 IO-10 kernel: LustreError: 29079:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@ffff8102eb3cb000 x36077574/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007489 ref 1 fl Interpret:/0/0 rc -19/0<br>

Jul 18 11:43:09 IO-10 kernel: LustreError: 29079:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 114 previous similar messages<br>

Jul 18 11:43:09 IO-10 kernel: Lustre: 28999:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000b: 248 recoverable clients remain<br>

Jul 18 11:43:09 IO-10 kernel: Lustre: 28999:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 25 previous similar messages<br>

Jul 18 11:43:21 IO-10 kernel: kjournald starting.  Commit interval 5 seconds<br>

Jul 18 11:43:21 IO-10 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended<br>

Jul 18 11:43:21 IO-10 kernel: LDISKFS FS on dm-12, internal journal<br>

Jul 18 11:43:21 IO-10 kernel: LDISKFS-fs: recovery complete.<br>

Jul 18 11:43:21 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.<br>

Jul 18 11:43:21 IO-10 multipathd: dm-12: umount map (uevent)<br>

Jul 18 11:43:32 IO-10 kernel: kjournald starting.  Commit interval 5 seconds<br>

Jul 18 11:43:32 IO-10 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended<br>

Jul 18 11:43:32 IO-10 kernel: LDISKFS FS on dm-12, internal journal<br>

Jul 18 11:43:32 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.<br>

Jul 18 11:43:32 IO-10 kernel: LDISKFS-fs: file extents enabled<br>

Jul 18 11:43:32 IO-10 kernel: LDISKFS-fs: mballoc enabled<br>

Jul 18 11:43:32 IO-10 kernel: Lustre: 30436:0:(filter.c:868:filter_init_server_data()) RECOVERY: service es1-OST000c, 249 recoverable clients, last_rcvd 370809064<br>

Jul 18 11:43:32 IO-10 kernel: Lustre: OST es1-OST000c now serving dev (es1-OST000c/f8c1bf77-11b3-88be-4438-016f059a91b5), but will be in recovery for at least 5:00, or until 249 clients reconnect. During this time new clients will not be allowed to connect.

 Recovery progress can be monitored by watching /proc/fs/lustre/obdfilter/es1-OST000c/recovery_status.<br>

Jul 18 11:43:32 IO-10 kernel: Lustre: es1-OST000c.ost: set parameter quota_type=ug<br>

Jul 18 11:43:32 IO-10 kernel: Lustre: Server es1-OST000c on device /dev/mpath/lun_13 has started<br>

Jul 18 11:43:46 IO-10 kernel: Lustre: 29050:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000c: starting recovery timer<br>

Jul 18 11:43:46 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000d_UUID' is not available  for connect (no target)<br>

Jul 18 11:43:46 IO-10 kernel: LustreError: Skipped 229 previous similar messages<br>

Jul 18 11:43:46 IO-10 kernel: LustreError: 29123:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@ffff8102f6e36000 x36721236/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007526 ref 1 fl Interpret:/0/0 rc -19/0<br>

Jul 18 11:43:46 IO-10 kernel: LustreError: 29123:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 229 previous similar messages<br>

Jul 18 11:43:46 IO-10 kernel: Lustre: 28982:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000b: 171 recoverable clients remain<br>

Jul 18 11:43:46 IO-10 kernel: Lustre: 28982:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 76 previous similar messages<br>

Jul 18 11:43:54 IO-10 kernel: kjournald starting.  Commit interval 5 seconds<br>

Jul 18 11:43:54 IO-10 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended<br>

Jul 18 11:43:54 IO-10 kernel: LDISKFS FS on dm-13, internal journal<br>

Jul 18 11:43:54 IO-10 kernel: LDISKFS-fs: recovery complete.<br>

Jul 18 11:43:54 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.<br>

Jul 18 11:43:55 IO-10 multipathd: dm-13: umount map (uevent)<br>

Jul 18 11:44:06 IO-10 kernel: kjournald starting.  Commit interval 5 seconds<br>

Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended<br>

Jul 18 11:44:06 IO-10 kernel: LDISKFS FS on dm-13, internal journal<br>

Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.<br>

Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs: file extents enabled<br>

Jul 18 11:44:06 IO-10 kernel: LDISKFS-fs: mballoc enabled<br>

Jul 18 11:44:06 IO-10 kernel: Lustre: 30686:0:(filter.c:868:filter_init_server_data()) RECOVERY: service es1-OST000d, 249 recoverable clients, last_rcvd 694562245<br>

Jul 18 11:44:06 IO-10 kernel: Lustre: OST es1-OST000d now serving dev (es1-OST000d/cf608dbd-accd-89b7-471a-f4487e9f8ba3), but will be in recovery for at least 5:00, or until 249 clients reconnect. During this time new clients will not be allowed to connect.

 Recovery progress can be monitored by watching /proc/fs/lustre/obdfilter/es1-OST000d/recovery_status.<br>

Jul 18 11:44:06 IO-10 kernel: Lustre: es1-OST000d.ost: set parameter quota_type=ug<br>

Jul 18 11:44:06 IO-10 kernel: Lustre: Server es1-OST000d on device /dev/mpath/lun_14 has started<br>

Jul 18 11:44:06 IO-10 kernel: Lustre: 29293:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000d: starting recovery timer<br>

Jul 18 11:44:18 IO-10 kernel: LustreError: 137-5: UUID 'es1-OST000e_UUID' is not available  for connect (no target)<br>

Jul 18 11:44:18 IO-10 kernel: LustreError: Skipped 199 previous similar messages<br>

Jul 18 11:44:18 IO-10 kernel: Lustre: 29068:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) es1-OST000d: 175 recoverable clients remain<br>

Jul 18 11:44:18 IO-10 kernel: LustreError: 29135:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)  req@ffff8102f4c1cc00 x56000488/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 1311007558 ref 1 fl Interpret:/0/0 rc -19/0<br>

Jul 18 11:44:18 IO-10 kernel: LustreError: 29135:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 199 previous similar messages<br>

Jul 18 11:44:18 IO-10 kernel: Lustre: 29068:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 331 previous similar messages<br>

Jul 18 11:44:28 IO-10 kernel: kjournald starting.  Commit interval 5 seconds<br>

Jul 18 11:44:28 IO-10 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended<br>

Jul 18 11:44:28 IO-10 kernel: LDISKFS FS on dm-14, internal journal<br>

Jul 18 11:44:28 IO-10 kernel: LDISKFS-fs: recovery complete.<br>

Jul 18 11:44:28 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.<br>

Jul 18 11:44:28 IO-10 multipathd: dm-14: umount map (uevent)<br>

Jul 18 11:44:39 IO-10 kernel: kjournald starting.  Commit interval 5 seconds<br>

Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended<br>

Jul 18 11:44:39 IO-10 kernel: LDISKFS FS on dm-14, internal journal<br>

Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.<br>

Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs: file extents enabled<br>

Jul 18 11:44:39 IO-10 kernel: LDISKFS-fs: mballoc enabled<br>

Jul 18 11:44:39 IO-10 kernel: Lustre: 30893:0:(filter.c:868:filter_init_server_data()) RECOVERY: service es1-OST000e, 249 recoverable clients, last_rcvd 613643608<br>

Jul 18 11:44:39 IO-10 kernel: Lustre: OST es1-OST000e now serving dev (es1-OST000e/478c7dc4-4936-bfe2-45ac-2fb7a2e69f62), but will be in recovery for at least 5:00, or until 249 clients reconnect. During this time new clients will not be allowed to connect.

 Recovery progress can be monitored by watching /proc/fs/lustre/obdfilter/es1-OST000e/recovery_status.<br>

Jul 18 11:44:39 IO-10 kernel: Lustre: es1-OST000e.ost: set parameter quota_type=ug<br>

Jul 18 11:44:39 IO-10 kernel: Lustre: Server es1-OST000e on device /dev/mpath/lun_15 has started<br>

Jul 18 11:44:40 IO-10 kernel: Lustre: 29214:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) es1-OST000e: starting recovery timer<br>

Jul 18 11:44:49 IO-10 kernel: Lustre: 29236:0:(service.c:939:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s  req@ffff8102f4419c00 x738214853/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0<br>

Jul 18 11:44:49 IO-10 kernel: Lustre: 28992:0:(service.c:939:ptlrpc_server_handle_req_in()) @@@ Slow req_in handling 6s  req@ffff8102f4419400 x738214855/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0<br>

---------------------- end messages -----------------------------<br>
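The kernel messages above point at a recovery_status file per OST. A sketch for summarising all of them at once follows; it assumes each file contains a line beginning with "status:", which is an assumption about the file format, and the function name is made up:<br>

```shell
#!/bin/sh
# Print the "status:" line of every OST's recovery_status file.
# The default directory is the one quoted in the kernel log; pass another
# directory as the first argument to read saved copies instead.
recovery_summary() {
    root="${1:-/proc/fs/lustre/obdfilter}"
    for f in "$root"/*/recovery_status; do
        [ -r "$f" ] || continue          # skip when nothing matches the glob
        ost=$(basename "$(dirname "$f")")
        printf '%s: %s\n' "$ost" "$(grep '^status:' "$f")"
    done
}

# Example (on the OSS): recovery_summary
```

Re-running it every few seconds shows when each OST leaves recovery.<br>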

<br>

The log reported that recovery completed, so I didn't bother running another fsck. Should I? The problem now seems to be that STONITH on the troubled node's failover partner can't reset the node. It tries and fails incessantly:<br>

------------------------ messages -------------------------------<br>

Jul 18 16:45:17 IO-11 heartbeat: [25037]: info: Resetting node io-10.internal.acs.unt.prv with [IPMI STONITH device ]<br>

Jul 18 16:45:18 IO-11 heartbeat: [25037]: info: glib: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset io-10.internal.acs.unt.prv' returned 256<br>

Jul 18 16:45:18 IO-11 heartbeat: [25037]: ERROR: glib: external_reset_req: 'ipmi reset' for host io-10.internal.acs.unt.prv failed with rc 256<br>

Jul 18 16:45:18 IO-11 heartbeat: [25037]: ERROR: Host io-10.internal.acs.unt.prv not reset!<br>

Jul 18 16:45:18 IO-11 heartbeat: [15803]: WARN: Managed STONITH io-10.internal.acs.unt.prv process 25037 exited with return code 1.<br>

Jul 18 16:45:18 IO-11 heartbeat: [15803]: ERROR: STONITH of io-10.internal.acs.unt.prv failed.  Retrying...<br>

---------------------- end messages ---------------------------------<br>

<br>

I've checked the logic in /usr/lib64/stonith/plugins/external/ipmi, which doesn't seem to be using the correct address for the BMC. Is it possible that the HA facilities are preventing the final OSTs from mounting?
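<br>
One way to test the BMC address independently of heartbeat is to query it directly with ipmitool. This is a sketch that assumes ipmitool is installed and the BMC accepts IPMI-over-LAN; the address, user and password file in the example are placeholders, and the IPMITOOL override is a hypothetical hook for a dry run:<br>

```shell
#!/bin/sh
# Ask a BMC for the chassis power state over IPMI-over-LAN.
# -I lan selects the LAN interface, -H is the BMC address, -U the IPMI user,
# -f a file holding the password.
# IPMITOOL is a hypothetical override, e.g. IPMITOOL=echo for a dry run.
bmc_power_status() {
    bmc="$1"
    user="$2"
    passfile="$3"
    "${IPMITOOL:-ipmitool}" -I lan -H "$bmc" -U "$user" -f "$passfile" chassis power status
}

# Example: bmc_power_status BMC_ADDR IPMI_USER /etc/ha.d/ipmitool.passwd
```

If this query fails from the sibling node, the reset issued by the STONITH plugin can be expected to fail as well.<br>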

<div>

<div></div>

<div><br>

<br>

<br>

Wojciech Turek wrote:<br>

<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204, 204, 204);padding-left:1ex">

Hi Damiri,<br>

<br>

 From the logs you have provided, it looks like you have a problem with your back-end storage. First of all, we can see that your SRP connection to the back-end storage reports aborts and resets (I guess your back-end storage hardware is connected via InfiniBand if you

 are using SRP). Then Lustre reports slow messages and eventually the kernel reports SCSI errors. Device mapper reports that both paths to the device have failed, and Lustre remounts the filesystem read-only due to an I/O error. All this means that your I/O node lost contact

 with the OST due to errors either on the IB network connecting your host to the storage hardware or on the storage hardware itself. From the first part of the log we can see that the device in trouble is OST es1-OST000b (dm-11). In the second part of

 your log I cannot see that device being mounted. From your log I can see that only OST es1-OST000a (dm-10) is mounted and enters recovery.<br>

</blockquote>

<br>

<br>

</div>

</div>

<font color="#888888">-- <br>

DaMiri Young<br>

HPC System Engineer<br>

High Performance Computing Team | ACUS/CITC | UNT<br>

</font></blockquote>

</div>

<br>

</div>

</div></div></div>

</div>

</div>

</blockquote></div><br>