[Lustre-discuss] OSS kind of hang prevents new clients from mounting

Wallior, Julien Julien.Wallior at sig.com
Fri Jun 6 07:14:06 PDT 2008


It fixed itself, and I figured out why.
I was running "llobdstat /proc/fs/lustre/obdfilter/lbktst-OST002a 15" in
the background to collect statistics and feed them into Ganglia. That
process restarts at midnight, and when it restarted, all the hanging
connections came back up.
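
For reference, here is roughly what that collector does (a minimal
sketch, not my exact script: the metric names are made up, the gmetric
flags are standard Ganglia, and it reads the per-OST stats file
directly instead of parsing llobdstat output):

  #!/bin/bash
  # Poll the obdfilter stats file every 15s and push the cumulative
  # read/write byte counters into Ganglia via gmetric.
  # NOTE: the path and metric names below are illustrative.
  STATS=/proc/fs/lustre/obdfilter/lbktst-OST002a/stats

  while true; do
      # The last field of the read_bytes/write_bytes lines is the byte sum.
      rb=$(awk '/^read_bytes/  {print $7}' "$STATS")
      wb=$(awk '/^write_bytes/ {print $7}' "$STATS")
      gmetric --name ost002a_read_bytes  --value "${rb:-0}" --type double --units bytes
      gmetric --name ost002a_write_bytes --value "${wb:-0}" --type double --units bytes
      sleep 15
  done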

So my question becomes: how do you monitor your Lustre setup? Is it
safe to use llobdstat? What am I doing wrong?

Thank you,
Julien

> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-
> bounces at lists.lustre.org] On Behalf Of Wallior, Julien
> Sent: Thursday, June 05, 2008 10:09 PM
> To: lustre-discuss at clusterfs.com
> Subject: [Lustre-discuss] OSS kind of hang prevents new clients from
> mounting
> 
> Hello everybody,
> I have a problem that has happened a few times with my Lustre 1.6.4.3
> setup, and I finally figured out how to explain it.
> 
> On one OSS, I was looking at the files in /proc/fs/lustre. I ended up
> doing:
> cat /proc/fs/lustre/obdfilter/lbktst-OST002a/exports/1bb907ec-45de-b3c7-81f4-5b4e5ab0f7fe/brw_stats
> which hung. If I run ps axu, the process is in the D state
> (uninterruptible sleep).
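> 
> In case it helps, here is a quick way to confirm where a D-state
> process is blocked (a sketch; 12345 stands in for the hung cat's pid,
> and the sysrq dump assumes kernel.sysrq is enabled):
> 
>   # which kernel function the process is sleeping in
>   cat /proc/12345/wchan
>   ps -o pid,stat,wchan:40,cmd -p 12345
> 
>   # dump all task stacks to the kernel log for the full picture
>   echo t > /proc/sysrq-trigger
>   dmesg | tail -n 200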
> 
> At this point the OSS is running fine, nothing in /var/log/messages.
> 
> Then, when I try to mount the Lustre fs on a new client, the mount
> hangs and I get the following error on the OSS:
> 
> Jun  5 17:58:48 lustrebal802 kernel: Lustre:
> 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 6178: it was
> inactive for 100s
> Jun  5 17:58:48 lustrebal802 kernel: Lustre:
> 0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for
> process 6178
> Jun  5 17:58:48 lustrebal802 kernel: ll_ost_io_116 D ffff81000500ffd0
> 0  6178      1          6179  6177 (L-TLB)
> Jun  5 17:58:48 lustrebal802 kernel: ffff81006ebe9868 0000000000000046
> 00000000ffffffff 000000000000000a
> Jun  5 17:58:48 lustrebal802 kernel:        ffff81006ebd12c8
> ffff81006ebd1080 ffff810114798080 00004dd01bad3479
> Jun  5 17:58:48 lustrebal802 kernel:        000000000000b9e6
> 000000003600cc43
> Jun  5 17:58:48 lustrebal802 kernel: Call Trace:
> <ffffffff885eb478>{:libcfs:libcfs_debug_vmsg2+1608}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff802da779>{__down+232}
> <ffffffff8012af07>{default_wake_function+0}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff886288f3>{:lnet:lnet_ni_send+147}
> <ffffffff802da3ed>{__down_failed+53}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff88681d1d>{:obdclass:.text.lock.lprocfs_status+115}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff889442cc>{:obdfilter:filter_connect_internal+380}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff88942488>{:obdfilter:filter_export_stats_init+280}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff88673d30>{:obdclass:class_conn2export+592}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff88956d46>{:obdfilter:filter_connect+630}
> <ffffffff8866f60a>{:obdclass:lustre_hash_get_object_by_key+282}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff88717195>{:ptlrpc:lustre_swab_buf+197}
> <ffffffff886f0331>{:ptlrpc:target_handle_connect+7057}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff88915643>{:ost:ost_brw_read+7363}
> <ffffffff8862b81e>{:lnet:lnet_match_blocked_msg+958}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff88916695>{:ost:ost_msg_check_version+181}
> <ffffffff8891a72f>{:ost:ost_handle+2431}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff88714008>{:ptlrpc:lustre_unpack_msg_v1+280}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff8013adcb>{lock_timer_base+27}
> <ffffffff8013af05>{__mod_timer+173}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff88673d30>{:obdclass:class_conn2export+592}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff887212d7>{:ptlrpc:ptlrpc_main+4903}
> <ffffffff8012af07>{default_wake_function+0}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff8010bdce>{child_rip+8}
> <ffffffff8871ffb0>{:ptlrpc:ptlrpc_main+0}
> Jun  5 17:58:48 lustrebal802 kernel:
> <ffffffff8010bdc6>{child_rip+0}
> Jun  5 17:58:48 lustrebal802 kernel: LustreError: dumping log to
> /tmp/lustre-log.1212703128.6178
> 
> I'm not attaching the lustre-log file, which is 18 MB.
> 
> All the clients on which the fs is already mounted are running fine,
> but I can't add new clients, and most of the files in /proc/fs/lustre
> can't be read anymore.
> 
> I know one way of getting out of that state: unmount everything,
> reboot the MDS/OSS nodes, and restart the whole thing. Not very
> convenient.
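> 
> For completeness, the teardown/restart order I follow is roughly (a
> sketch; the mount points are whatever you chose at setup time, not
> anything standard):
> 
>   umount /mnt/lustre    # on every client, first
>   umount /mnt/ost0      # on each OSS, for every OST
>   umount /mnt/mdt       # on the MDS, last
>   # then bring it back in reverse: MDS first, then OSSes, then clients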
> 
> Currently the cat program is still hanging.
> 
> Any idea why this happens? Should I not mess with the /proc files? I
> checked the mailing list archives and Bugzilla and couldn't find
> anything.
> 
> Thanks for your help,
> 
> Julien
> 
> 

