[Lustre-discuss] lfs check servers hangs with LBUG

jrs botemout at gmail.com
Wed May 14 14:35:12 PDT 2008


Another artifact of this bug is

samba01:~ # lfs df
UUID                 1K-blocks      Used Available  Use% Mounted on
i3_lfs4-MDT0000_UUID 5127312276 293484496 4833827780    5% /mnt/lustre/i3_lfs4[MDT:0]
i3_lfs4-OST0000_UUID 5768202536 295289400 5472913136    5% /mnt/lustre/i3_lfs4[OST:0]
i3_lfs4-OST0001_UUID 5768202536 296678080 5471524456    5% /mnt/lustre/i3_lfs4[OST:1]
i3_lfs4-OST0002_UUID 5768201600 293605428 5474596172    5% /mnt/lustre/i3_lfs4[OST:2]
i3_lfs4-OST0003_UUID 5768201600 293605432 5474596168    5% /mnt/lustre/i3_lfs4[OST:3]
i3_lfs4-OST0004_UUID 5768201600 293477420 5474724180    5% /mnt/lustre/i3_lfs4[OST:4]
error: llapi_obd_statfs failed: Bad address (-14)

(-14 is -EFAULT, "Bad address".)  I have additional OSTs that come numerically after the bad one (OST0005); they're missing from the listing above.

Would rebooting the MDTs help?

The logs on the client say:
May 14 15:27:22 samba01 kernel: Lustre: setting import i3_lfs4-OST0005_UUID INACTIVE by administrator request
May 14 15:27:22 samba01 kernel: Lustre: i3_lfs4-OST0005-osc-ffff8101e2d6dc00.osc: set parameter active=0
May 14 15:27:22 samba01 kernel: LustreError: 4143:0:(lov_obd.c:140:lov_connect_obd()) not connecting OSC i3_lfs4-OST0005_UUID; administratively disabled

which seems normal.
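For what it's worth, the client's own view of each OSC can be cross-checked directly. This is only a sketch against the 1.6-era /proc layout; the `active` file name and paths are assumed from that layout, not verified on this system:

```shell
# Show whether the client considers each OSC active (1) or disabled (0).
# Directory names include the client instance suffix,
# e.g. i3_lfs4-OST0005-osc-ffff8101e2d6dc00 (layout assumed from Lustre 1.6).
for f in /proc/fs/lustre/osc/i3_lfs4-OST*/active; do
    echo "$f: $(cat "$f")"
done

# On releases that support it, the same via lctl:
lctl get_param osc.i3_lfs4-OST*.active
```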

Curiously, the MDS says:
May 14 15:29:02 mds01 kernel: Lustre: i3_lfs4-MDT0000: haven't heard from client 4dc7492d-7669-ecae-a4b5-bca2891c2dc0 (at 10.200.20.63@tcp) in 7087 seconds. I think it's dead, and I am evicting it.

This is the client above, which otherwise seems to be functioning properly ...

Both OSSs also log that message.  I've rebooted the client; no effect.

Thanks
John

jrs wrote:
> After disabling an OST with:
> 
>  lctl conf_param i3_lfs4-OST0005.osc.active=0
> 
> one of my clients now hangs when running:
> 
> samba01:~ # lfs check osts
> i3_lfs4-OST0000-osc-ffff8101da370800 active.
> i3_lfs4-OST0001-osc-ffff8101da370800 active.
> i3_lfs4-OST0002-osc-ffff8101da370800 active.
> i3_lfs4-OST0003-osc-ffff8101da370800 active.
> i3_lfs4-OST0004-osc-ffff8101da370800 active.
> 
> The above has been hanging for 10 minutes, and the load
> on the machine has been driven up to 1.0
> (it's a dual-core box).
> 
> In /var/log/messages I see:
> 
> May 14 10:13:09 samba01 kernel: LustreError: 4006:0:(client.c:504:ptlrpc_import_delay_req()) @@@ Uninitialized import.  req@ffff8101e8648400 x76/t0 o400->i3_lfs4-OST0005_UUID@<NULL>:6 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
> May 14 10:13:09 samba01 kernel: LustreError: 4006:0:(client.c:506:ptlrpc_import_delay_req()) LBUG
> May 14 10:13:09 samba01 kernel: Lustre: 4006:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 4006
> May 14 10:13:09 samba01 kernel: lfs           R  running task       0  4006   3872                     (NOTLB)
> May 14 10:13:09 samba01 kernel: ffff8101e7e046c0 0000000000000086 ffff8101da36a780 ffff8101dd8cdb80
> May 14 10:13:09 samba01 kernel:        0000000000000001 00007fffefa8395f ffffffff8837a29b 0000004b9300f2ed
> May 14 10:13:09 samba01 kernel:        ffff8101dd8cdb80 0000000000000001
> May 14 10:13:09 samba01 kernel: Call Trace: <ffffffff8837a29b>{:obdclass:lprocfs_fops_write+91}
> May 14 10:13:09 samba01 kernel:        <ffffffff80181803>{vfs_write+215} <ffffffff80181dca>{sys_write+69}
> May 14 10:13:09 samba01 kernel:        <ffffffff8010ad3e>{system_call+126}
> May 14 10:13:09 samba01 kernel: LustreError: dumping log to /tmp/lustre-log.1210781589.4006
> 
> 
> I've attached the log file referred to above.
> 
> This might be the same bug as:
> https://bugzilla.lustre.org/show_bug.cgi?id=12565
> 
> Is there any workaround?
> 
> Thanks,
> John
> 
