[Lustre-discuss] FW: Soft CPU Lockup

Hendelman, Rob Rob.Hendelman at magnetar.com
Tue Oct 6 06:18:17 PDT 2009


Correction below:  The stack dumps once a minute or so started at 12:40.
I rebooted client1 at 13:13.

Sorry for the confusion.

Rob

-----Original Message-----
From: Hendelman, Rob 
Sent: Tuesday, October 06, 2009 8:15 AM
To: 'lustre-discuss at lists.lustre.org'
Subject: Soft CPU Lockup

Hello Mr. Drokin,

Thank you for your prior response.

There was a client eviction just prior to the threads hanging and eating
100% CPU, but NOT prior to the OSS finally dropping CPU usage again.
 
Here is a basic timeline (in hours:min "military" time):

09:07->12:39:  Client "6", which was cloned from "client1", is being
worked on, rebooted, and connected/disconnected from the lustre servers.
No issues.
12:39: final OSS message, which says "haven't heard from <ip of client66>
in 240 seconds, I think it's dead and I'm evicting it."
12:40: what appear to be stack dumps on the OSS server for 2 i/o threads
(previously mentioned) 
12:44: client1 has lost its lustre mounts and is complaining in nagios.
All other clients are fine.
13:13:  "stack dumps" once a minute or so, but no LBUG.  I leave the
server up and finally reboot client1.  The other clients (2-5) are not
affected and seem to be working normally, so I don't touch the OSS.
14:10: Final messages on OSS before OSS calms down (no messages after
this)

Oct  5 14:10:56 maglustre04 kernel: 
Oct  5 14:10:59 maglustre04 kernel: Lustre:
13366:0:(service.c:1317:ptlrpc_server_handle_request()) @@@ Request
x6413848 took longer than estimated (100+5495s); client may timeout.
  req@ffff81009308c400 x6413848/t0
o101->1b9e4991-1d5e-814d-2607-8c52f432e68d@:0/0 lens 232/288 e 0 to 0 dl
1254764364 ref 1 fl Complete:/0/0 rc 301/301
Oct  5 14:10:59 maglustre04 kernel: Lustre:
13421:0:(watchdog.c:330:lcw_update_time()) Expired watchdog for pid
13421 disabled after 5595.8041s
Oct  5 14:10:59 maglustre04 kernel: Lustre:
13366:0:(service.c:1317:ptlrpc_server_handle_request()) Skipped 1
previous similar message
Oct  5 14:10:59 maglustre04 kernel: Lustre:
13366:0:(watchdog.c:330:lcw_update_time()) Expired watchdog for pid
13366 disabled after 5595.8059s
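
As a rough cross-check of the timeline, the duration in each "Expired
watchdog" line can be subtracted from the timestamp on that line to
estimate when the two threads got stuck. The sketch below does that
arithmetic. The year (2009) is an assumption taken from the date of this
message, since syslog timestamps omit it, and the two log lines have
been re-joined onto single lines. The function name and the regular
expression are illustrative only and are not part of Lustre.

```python
import re
from datetime import datetime, timedelta

# Assumption: syslog timestamps carry no year; 2009 is the year of this post.
LOG_YEAR = 2009

# The two watchdog lines from the excerpt above, re-joined onto single lines.
LOG_LINES = [
    "Oct  5 14:10:59 maglustre04 kernel: Lustre: "
    "13421:0:(watchdog.c:330:lcw_update_time()) Expired watchdog for pid "
    "13421 disabled after 5595.8041s",
    "Oct  5 14:10:59 maglustre04 kernel: Lustre: "
    "13366:0:(watchdog.c:330:lcw_update_time()) Expired watchdog for pid "
    "13366 disabled after 5595.8059s",
]

# Hypothetical pattern written for these two lines only.
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) .*"
    r"Expired watchdog for pid (?P<pid>\d+) disabled after (?P<secs>[\d.]+)s"
)


def watchdog_start_times(lines, year=LOG_YEAR):
    """Map each pid to (log timestamp - watchdog duration)."""
    starts = {}
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue
        # Collapse the double space syslog uses for single-digit days.
        stamp = " ".join(match["stamp"].split())
        logged = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        elapsed = timedelta(seconds=float(match["secs"]))
        starts[int(match["pid"])] = logged - elapsed
    return starts


if __name__ == "__main__":
    for pid, started in sorted(watchdog_start_times(LOG_LINES).items()):
        print(pid, started.strftime("%H:%M:%S"))
```

For both pids this works out to roughly 12:37:43, which lines up with
the 12:39 eviction message and the first stack dumps at 12:40.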

Should I file a new bug?  Is there enough info in /var/log/messages to
file a bug, or do I need to turn on some sort of more verbose debugging
in case this happens again?

Thanks,

Robert



