[Lustre-discuss] OSS kind of hang prevents new clients from mounting

Wallior, Julien Julien.Wallior at sig.com
Thu Jun 5 19:08:55 PDT 2008


Hello everybody,
I have a problem that has happened a few times with my Lustre 1.6.4.3
setup, and I finally figured out how to explain it to you.

On one OSS, I was looking at the files in /proc/fs/lustre. I ended up
running:

cat /proc/fs/lustre/obdfilter/lbktst-OST002a/exports/1bb907ec-45de-b3c7-81f4-5b4e5ab0f7fe/brw_stats

which hung. In the output of ps axu, the process is in the D state
(uninterruptible sleep).
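
For anyone who wants to check an OSS for processes stuck this way, here is a minimal sketch in Python. It is only an illustration: the helper names (find_d_state, list_stuck_processes) are made up for this example, and it assumes a procps-style ps that accepts "-eo pid,stat,comm".

```python
import subprocess


def find_d_state(ps_output):
    """Return (pid, stat, command) tuples for processes whose state starts with 'D'.

    Expects the text produced by `ps -eo pid,stat,comm` (header line included).
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)         # pid, stat, rest of the line
        if len(fields) < 3:
            continue
        pid, stat, comm = fields
        if stat.startswith("D"):             # D = uninterruptible sleep
            stuck.append((int(pid), stat, comm.strip()))
    return stuck


def list_stuck_processes():
    """Run ps and return the processes currently in uninterruptible sleep."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_d_state(result.stdout)


if __name__ == "__main__":
    for pid, stat, comm in list_stuck_processes():
        print(pid, stat, comm)
```

A process that shows up here briefly is normal (it is usually just waiting on I/O). One that stays in the list for minutes, like the cat in this report, is the interesting case.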

At this point the OSS is still running fine, with nothing in /var/log/messages.

Then, when I try to mount the Lustre fs on a new client, the mount hangs
and I get the following error on the OSS:

Jun  5 17:58:48 lustrebal802 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 6178: it was inactive for 100s
Jun  5 17:58:48 lustrebal802 kernel: Lustre: 0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 6178
Jun  5 17:58:48 lustrebal802 kernel: ll_ost_io_116 D ffff81000500ffd0 0  6178      1          6179  6177 (L-TLB)
Jun  5 17:58:48 lustrebal802 kernel: ffff81006ebe9868 0000000000000046 00000000ffffffff 000000000000000a
Jun  5 17:58:48 lustrebal802 kernel:        ffff81006ebd12c8 ffff81006ebd1080 ffff810114798080 00004dd01bad3479
Jun  5 17:58:48 lustrebal802 kernel:        000000000000b9e6 000000003600cc43
Jun  5 17:58:48 lustrebal802 kernel: Call Trace: <ffffffff885eb478>{:libcfs:libcfs_debug_vmsg2+1608}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff802da779>{__down+232} <ffffffff8012af07>{default_wake_function+0}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff886288f3>{:lnet:lnet_ni_send+147} <ffffffff802da3ed>{__down_failed+53}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff88681d1d>{:obdclass:.text.lock.lprocfs_status+115}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff889442cc>{:obdfilter:filter_connect_internal+380}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff88942488>{:obdfilter:filter_export_stats_init+280}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff88673d30>{:obdclass:class_conn2export+592}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff88956d46>{:obdfilter:filter_connect+630} <ffffffff8866f60a>{:obdclass:lustre_hash_get_object_by_key+282}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff88717195>{:ptlrpc:lustre_swab_buf+197} <ffffffff886f0331>{:ptlrpc:target_handle_connect+7057}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff88915643>{:ost:ost_brw_read+7363} <ffffffff8862b81e>{:lnet:lnet_match_blocked_msg+958}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff88916695>{:ost:ost_msg_check_version+181} <ffffffff8891a72f>{:ost:ost_handle+2431}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff88714008>{:ptlrpc:lustre_unpack_msg_v1+280}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff8013adcb>{lock_timer_base+27} <ffffffff8013af05>{__mod_timer+173}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff88673d30>{:obdclass:class_conn2export+592}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff887212d7>{:ptlrpc:ptlrpc_main+4903} <ffffffff8012af07>{default_wake_function+0}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff8010bdce>{child_rip+8} <ffffffff8871ffb0>{:ptlrpc:ptlrpc_main+0}
Jun  5 17:58:48 lustrebal802 kernel: <ffffffff8010bdc6>{child_rip+0}
Jun  5 17:58:48 lustrebal802 kernel: LustreError: dumping log to /tmp/lustre-log.1212703128.6178
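
To make a trace like this easier to read or compare between incidents, the frames can be pulled out with a small script. The following is a rough sketch in Python, not a Lustre tool: the name parse_frames and the "kernel" label for frames without a module prefix are my own choices, and the pattern only assumes the <address>{:module:function+offset} layout seen in this dump. Mail clients may wrap long log lines and split a symbol name, so it is best run on the unwrapped text from /var/log/messages.

```python
import re

# Matches frames such as
#   <ffffffff802da779>{__down+232}
#   <ffffffff889442cc>{:obdfilter:filter_connect_internal+380}
FRAME_RE = re.compile(r"<([0-9a-f]+)>\{(?::(\w+):)?([\w.]+)\+(\d+)\}")


def parse_frames(log_text):
    """Return one dict per stack frame found in the log text, in order."""
    frames = []
    for addr, module, func, offset in FRAME_RE.findall(log_text):
        frames.append({
            "addr": int(addr, 16),
            # Frames with no ":module:" prefix come from the core kernel;
            # "kernel" is just a label chosen for this sketch.
            "module": module or "kernel",
            "func": func,
            "offset": int(offset),
        })
    return frames


if __name__ == "__main__":
    import sys
    for frame in parse_frames(sys.stdin.read()):
        print(f'{frame["module"]:>10}  {frame["func"]}+{frame["offset"]}')
```

Saved under a name of your choosing, it can be fed the log on standard input, for example: python frames.py < /var/log/messages. Keep in mind that stack dumps from kernels of this era may include stale addresses left on the stack, so not every frame listed is necessarily part of the live call chain.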

I'm not attaching the lustre-log file which is 18MB.

All the clients on which the fs is already mounted are running fine, but
I can't add new clients, and most of the files in /proc/fs/lustre can't
be read anymore.

I know one way of getting out of that state: unmount everything, reboot
the MDS/OSS, and restart the whole thing. Not very convenient.

Currently the cat program is still hanging.

Any idea why this happens? Should I not mess with the /proc files? I
checked the mailing list and Bugzilla and couldn't find anything.

Thanks for your help,

Julien

