[Lustre-discuss] troubleshooting lustre

Tue Dec 13 15:36:34 PST 2011

All,

I am having some difficulties with my Lustre system. The problem seems to occur when someone is doing a lot of reads/writes.

My layout:

A large Volume Group off a DDN connected via InfiniBand.
This is broken into several Logical Volumes. Some are just regular ext3/4 filesystems. Quite a few are partitioned out (in 4TB chunks) for OSTs.

I have 3 lustre filesystems: home, scratch and work.
Home consists of a single OST
Scratch consists of 2 OSTs
Work consists of 10 OSTs

Each filesystem has its own combined MGS/MGT
Each OSS has 2 OSTs where possible
Each MGS will also serve one OST

I have 8 systems that are OSSes (The MGSes are also among those 8)

Now, ONE of my nodes (an OSS that is only serving 2 OSTs) has a helluva load:

[root@nas-0-3 ~]# uptime
15:34:06 up 77 days, 22:39,  1 user,  load average: 352.59, 339.80, 318.11

I see lots of:
Lustre: work-OST0004: slow commitrw commit 91s due to heavy IO load

And:
Dec 13 15:32:48 nas-0-3 kernel: LustreError: 6413:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107)  req@ffff8105c557ac00 x1381121762230130/t0 o400-><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1323819184 ref 1 fl Interpret:H/0/0 rc -107/0
Dec 13 15:32:48 nas-0-3 kernel: LustreError: 6413:0:(ldlm_lib.c:1919:target_send_reply_msg()) Skipped 1900 previous similar messages

Not sure what that one means, but it seems significant.
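
For anyone wanting to tally the two kinds of messages quoted above, here is a minimal Python sketch. It rests on several assumptions: the regular expressions are inferred only from the two sample lines in this post, the trailing "rc" field appears to be a negated Linux errno (107 is ENOTCONN on Linux), and the names summarize() and describe_rc() are hypothetical helpers, not part of any Lustre tool.

```python
import re
from collections import defaultdict

# Assumption: Linux errno numbering, hardcoded so the sketch behaves the
# same on any platform. Only the value seen in the log above is listed.
LINUX_ERRNO = {107: "ENOTCONN (Transport endpoint is not connected)"}

# Patterns inferred from the two sample lines in this post; other Lustre
# versions may format these messages differently.
SLOW_RE = re.compile(r"Lustre: (?P<target>\S+): slow (?P<op>.+?) (?P<secs>\d+)s\b")
RC_RE = re.compile(r"\brc (?P<rc>-?\d+)/")


def summarize(lines):
    """Return (slow, errors).

    slow   maps target name -> {"count": n, "max_secs": longest delay seen}
    errors maps rc value    -> number of lines carrying that rc
    """
    slow = defaultdict(lambda: {"count": 0, "max_secs": 0})
    errors = defaultdict(int)
    for line in lines:
        m = SLOW_RE.search(line)
        if m:
            entry = slow[m.group("target")]
            entry["count"] += 1
            entry["max_secs"] = max(entry["max_secs"], int(m.group("secs")))
            continue
        m = RC_RE.search(line)
        if m:
            errors[int(m.group("rc"))] += 1
    return dict(slow), dict(errors)


def describe_rc(rc):
    """Map a (negated) rc value to a Linux errno name, if it is listed."""
    return LINUX_ERRNO.get(abs(rc), "unknown errno %d" % abs(rc))
```

Feeding it the syslog file (for example, iterating over an open log file; the exact path depends on the syslog setup) should show which OSTs report slow I/O most often and which rc values dominate. Lines of the form "Skipped N previous similar messages" are not counted, so the error totals are a lower bound.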

Things get VERY slow and start timing out. Users see it as the system 'hanging'.

Could someone point me in the right direction for figuring out the culprit here?

Thanks in advance!

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
