[Lustre-discuss] Lustre problem recovering from a hardware error

Jonathan Buch jonathan.buch at hs-karlsruhe.de
Thu Jun 17 08:43:40 PDT 2010


Hello everyone.

I hope someone on this list can advise me on what to do.

A few days back one of our SAN systems started to produce I/O read
errors.  This affected one of our Lustre partitions.

Our Setup:

Lustre server 1.8.3 with a RHEL5 kernel.  Clients are still 1.8.2.


client~# lfs df -h
UUID                     bytes      Used Available  Use% Mounted on
eu01-MDT0000_UUID        24.4G      3.3G     19.8G   13% /mnt/home.eu01[MDT:0]
eu01-OST0000_UUID         3.5T      3.1T    257.3G   87% /mnt/home.eu01[OST:0]
eu01-OST0001_UUID         3.7T      3.1T    397.6G   84% /mnt/home.eu01[OST:1]
eu01-OST0002_UUID         3.7T      3.3T    164.8G   90% /mnt/home.eu01[OST:2]
eu01-OST0003_UUID         3.7T      1.6T      1.9T   43% /mnt/home.eu01[OST:3]
eu01-OST0004_UUID       889.2G    474.6G    369.4G   53% /mnt/home.eu01[OST:4]
eu01-OST0005_UUID         6.3T      1.5T      4.4T   24% /mnt/home.eu01[OST:5]
eu01-OST0006_UUID         6.3T      1.6T      4.4T   25% /mnt/home.eu01[OST:6]
filesystem summary:      27.9T     14.7T     11.9T   52% /mnt/home.eu01

(eu01-OST0003_UUID is the "broken" one; on the server it is mounted as follows)

server~# df -h
/dev/sdd1             3,8T  1,7T  2,0T  46% /mnt/lustre-eu-ost3




What I did next:

 * the SAN system did not show any broken hard drives, but its
controller log did contain the following entry:
    "Log Number","Concern Level","Date","Time","Device","Message",
    "4052","2","06/15/10","10:46:31","Configuration WWN:
20000050CC204369 Controller: 0","An unrecoverable drive error has
occurred as a result of a command being issued. This may be due to a
drive error in a non-fault tolerant array, such as RAID 0, or when
the array is already in a degraded mode. The controller will pass
the status from the drive back to the host system to allow the host
recovery mechanisms to be used.  Details: Host Loop = 0, Host Loop
ID = 2, Mapped Logical Drive Requested = 2, Op Code = 0x88, Sense
Data = 03/11/01."
   That array is a RAID 5, so I would have preferred that the
controller simply fail the bad hard drive.  I also can't pull a
drive and let the array rebuild, because I can't deduce which
physical drive holds the logical block that had the I/O error.

 * next I ran a filesystem check on eu01-OST0003_UUID with
   e2fsck 1.41.10.sun2 (24-Feb-2010)
   I used -c to also check for bad blocks.  The check reported
quite a few filesystem errors, and some 2000 files were put into
lost+found.
   I reran the check just to be sure.

 * Now I remounted the partition (actually the server was so locked
up that I had to force a reboot) and I now have more problems than
before.  Some kernel threads (ll_ost_40 and llog_process_th) are
blocking CPUs and fill the syslog with kernel error messages about
them getting stuck.
   I attached relevant portions of my /var/log/messages file.
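
For anyone reading the controller log entry above: the "Sense Data" and
"Op Code" fields can be looked up in the standard SCSI tables.  The sketch
below assumes that "03/11/01" is written as sense key / ASC / ASCQ (the
controller log does not say so explicitly), and the lookup tables are
deliberately partial; the function names are only illustrative.

```python
# Decode the "Sense Data = 03/11/01" and "Op Code = 0x88" fields from the
# controller log.  Assumption: the sense data is sense key / ASC / ASCQ.
# The tables contain only a handful of standard SCSI values.

SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

ASC_ASCQ = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x11, 0x01): "READ RETRIES EXHAUSTED",
    (0x11, 0x04): "UNRECOVERED READ ERROR - AUTO REALLOCATE FAILED",
}

OPCODES = {
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
    0x88: "READ(16)",
    0x8A: "WRITE(16)",
}


def decode_sense(text):
    """Turn a 'key/asc/ascq' hex triple into two readable strings."""
    key, asc, ascq = (int(part, 16) for part in text.strip().rstrip(".").split("/"))
    return (
        SENSE_KEYS.get(key, "unknown sense key"),
        ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ"),
    )


def decode_opcode(text):
    """Turn a hex opcode such as '0x88' into a command name."""
    return OPCODES.get(int(text, 16), "unknown opcode")


if __name__ == "__main__":
    print(decode_sense("03/11/01"))
    print(decode_opcode("0x88"))
```

Read this way, the entry describes a medium error on a 16-byte read, which
would match the I/O read errors seen on the host.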

I also attached output of `lctl dk` in case that helps.
As a side note:  I can't unmount the OST, the unmount hangs
indefinitely, and I also can't reboot the system cleanly; `reboot`
hangs as well.
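
To see at a glance which threads are reported as stuck and where, the
kernel reports in the attached /var/log/messages excerpt can be condensed
to one line per (pid, thread name, function).  This is only a rough
sketch: it assumes each report has a "Pid: ..., comm: ..." line followed
by a "RIP: ..." line, as in the excerpt, and the function name
`summarize_stuck_threads` is made up for illustration.

```python
import re
from collections import Counter

# "Pid: 24314, comm: ll_ost_40 Tainted: ..."
PID_RE = re.compile(r"Pid: (\d+), comm: (\S+)")
# "RIP: 0010:[<addr>]  [<addr>] :ldiskfs:ldiskfs_find_entry+0x24f/0x5b0"
RIP_RE = re.compile(r"RIP:.*\s(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$")


def summarize_stuck_threads(lines):
    """Count reports per (pid, comm, symbol at RIP) in syslog lines."""
    counts = Counter()
    current = None  # (pid, comm) of the report being read, if any
    for line in lines:
        pid_match = PID_RE.search(line)
        if pid_match:
            current = (int(pid_match.group(1)), pid_match.group(2))
            continue
        rip_match = RIP_RE.search(line)
        if rip_match and current is not None:
            symbol = rip_match.group(1).lstrip(":")
            counts[current + (symbol,)] += 1
            current = None
    return counts
```

It could be run as `summarize_stuck_threads(open("/var/log/messages"))`
and the resulting counter printed; for the excerpt attached here, every
report points into ldiskfs_find_entry.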

Right now I can access part of the filesystem, but accessing certain
files/directories will lock up the client.  We do have backups
(around 2 weeks old) on tape, but I would prefer not to have to
replay them, as that would take around 5 days.

I can provide more information if needed.
Please advise on how I can resolve the situation.

Thank you very much.

Jonathan Buch

--
B.Sc. Jonathan Buch
Karlsruhe University of Applied Sciences
Institute of Materials and Processes (IMP)
CMSE - Systemadministration
Moltkestrasse 30
D-76133 Karlsruhe
Germany


jonathan.buch(at)hs-karlsruhe.de
Phone:   +49 721 925 1415
Fax:     +49 721 925 2348

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lctldk_20100617_2.log.gz
Type: application/x-gunzip
Size: 120677 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20100617/05b44497/attachment.bin>
-------------- next part --------------
jo@cmse-svr01:~$ sudo tail -f /var/log/messages | grep -v DHCP
Jun 17 16:20:00 cmse-svr01 kernel: Lustre: OBD class driver, http://www.lustre.org/
Jun 17 16:20:00 cmse-svr01 kernel: Lustre:     Lustre Version: 1.8.3
Jun 17 16:20:00 cmse-svr01 kernel: Lustre:     Build Version: 1.8.3-20100503161738-PRISTINE-2.6.18-164.11.1.el5lustre.1.8.2-0rc4
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: Added LNI 192.168.21.174 at tcp1 [8/256/0/180]
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: Added LNI 10.101.0.2 at tcp [8/256/0/180]
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: Accept secure, port 988
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: Lustre Client File System; http://www.lustre.org/
Jun 17 16:20:01 cmse-svr01 kernel: init dynlocks cache
Jun 17 16:20:01 cmse-svr01 kernel: ldiskfs created from ext3-2.6-rhel5
Jun 17 16:20:01 cmse-svr01 kernel: kjournald starting.  Commit interval 5 seconds
Jun 17 16:20:01 cmse-svr01 kernel: LDISKFS FS on sdc1, internal journal
Jun 17 16:20:01 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:20:01 cmse-svr01 kernel: kjournald starting.  Commit interval 5 seconds
Jun 17 16:20:01 cmse-svr01 kernel: LDISKFS FS on sdc1, internal journal
Jun 17 16:20:01 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: MGS MGS started
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: MGC192.168.21.174 at tcp1: Reactivating import
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: Enabling user_xattr
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: eu01-MDT0000: denying duplicate export for 324cd728-79b3-56ea-3a4a-b86fd985e0c7, -114
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: 24142:0:(mds_fs.c:673:mds_init_server_data()) RECOVERY: service eu01-MDT0000, 17 recoverable clients, 0 delayed clients, last_transno 21474848741
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: eu01-MDT0000: Now serving eu01-MDT0000 on /dev/sdc1 with recovery enabled
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: eu01-MDT0000: Will be in recovery for at least 5:00, or until 17 clients reconnect
Jun 17 16:20:01 cmse-svr01 kernel: Lustre: 24142:0:(mds_lov.c:1167:mds_notify()) MDS eu01-MDT0000: add target eu01-OST0000_UUID
Jun 17 16:20:02 cmse-svr01 kernel: Lustre: 24142:0:(mds_lov.c:1167:mds_notify()) MDS eu01-MDT0000: add target eu01-OST0001_UUID
Jun 17 16:20:02 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480062988 sent from eu01-OST0005-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (5s prior to deadline).
Jun 17 16:20:02 cmse-svr01 kernel:  req at ffff810249acac00 x1338805480062988/t0 o8->eu01-OST0005_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784407 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 17 16:20:02 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480062990 sent from eu01-OST0006-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (5s prior to deadline).
Jun 17 16:20:02 cmse-svr01 kernel:  req at ffff8101a65c9000 x1338805480062990/t0 o8->eu01-OST0006_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784407 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 17 16:20:07 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480062983 sent from eu01-OST0000-osc to NID 0 at lo 5s ago has timed out (5s prior to deadline).
Jun 17 16:20:07 cmse-svr01 kernel:  req at ffff81029c757400 x1338805480062983/t0 o8->eu01-OST0000_UUID at 192.168.21.174@tcp1:28/4 lens 368/584 e 0 to 1 dl 1276784407 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 17 16:20:08 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.126 at tcp1
Jun 17 16:20:10 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.133 at tcp1
Jun 17 16:20:15 cmse-svr01 kernel: Lustre: Failing over eu01-MDT0000
Jun 17 16:20:15 cmse-svr01 kernel: Lustre: *** setting obd eu01-MDT0000 device 'sdc1' read-only ***
Jun 17 16:20:15 cmse-svr01 kernel: Turning device sdc (0x800021) read-only
Jun 17 16:20:15 cmse-svr01 kernel: Lustre: Failing over eu01-OST0006-osc
Jun 17 16:20:15 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0000-osc: tried all connections, increasing latency to 1s
Jun 17 16:20:15 cmse-svr01 kernel: Lustre: MGS has stopped.
Jun 17 16:20:15 cmse-svr01 kernel: Lustre: eu01-MDT0000: shutting down for failover; client state will be preserved.
Jun 17 16:20:15 cmse-svr01 kernel: Lustre: MDT eu01-MDT0000 has stopped.
Jun 17 16:20:18 cmse-svr01 kernel: Removing read-only on unknown block (0x800021)
Jun 17 16:20:18 cmse-svr01 kernel: Lustre: server umount eu01-MDT0000 complete
Jun 17 16:20:46 cmse-svr01 kernel: kjournald starting.  Commit interval 5 seconds
Jun 17 16:20:46 cmse-svr01 kernel: LDISKFS FS on sdc1, internal journal
Jun 17 16:20:46 cmse-svr01 kernel: LDISKFS-fs: recovery complete.
Jun 17 16:20:46 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:20:46 cmse-svr01 kernel: kjournald starting.  Commit interval 5 seconds
Jun 17 16:20:46 cmse-svr01 kernel: LDISKFS FS on sdc1, internal journal
Jun 17 16:20:46 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:20:46 cmse-svr01 kernel: Lustre: MGS MGS started
Jun 17 16:20:46 cmse-svr01 kernel: Lustre: MGC192.168.21.174 at tcp1: Reactivating import
Jun 17 16:20:47 cmse-svr01 kernel: Lustre: Enabling user_xattr
Jun 17 16:20:47 cmse-svr01 kernel: Lustre: eu01-MDT0000: Now serving eu01-MDT0000 on /dev/sdc1 with recovery enabled
Jun 17 16:20:47 cmse-svr01 kernel: Lustre: 24240:0:(mds_lov.c:1167:mds_notify()) MDS eu01-MDT0000: add target eu01-OST0000_UUID
Jun 17 16:20:47 cmse-svr01 kernel: Lustre: 24240:0:(mds_lov.c:1167:mds_notify()) Skipped 5 previous similar messages
Jun 17 16:20:47 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063008 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (5s prior to deadline).
Jun 17 16:20:47 cmse-svr01 kernel:  req at ffff81015d47c000 x1338805480063008/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784452 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 17 16:20:47 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Jun 17 16:20:47 cmse-svr01 kernel: Lustre: eu01-MDT0000: Aborting recovery.
Jun 17 16:20:51 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.103 at tcp1
Jun 17 16:20:52 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063004 sent from eu01-OST0000-osc to NID 0 at lo 5s ago has timed out (5s prior to deadline).
Jun 17 16:20:52 cmse-svr01 kernel:  req at ffff810323104000 x1338805480063004/t0 o8->eu01-OST0000_UUID at 192.168.21.174@tcp1:28/4 lens 368/584 e 0 to 1 dl 1276784452 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 17 16:20:52 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Jun 17 16:20:53 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.118 at tcp1
Jun 17 16:20:53 cmse-svr01 kernel: Lustre: Skipped 1 previous similar message
Jun 17 16:20:59 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.107 at tcp1
Jun 17 16:20:59 cmse-svr01 kernel: Lustre: Skipped 1 previous similar message
Jun 17 16:21:03 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.105 at tcp1
Jun 17 16:21:03 cmse-svr01 kernel: Lustre: Skipped 5 previous similar messages
Jun 17 16:21:12 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.126 at tcp1
Jun 17 16:21:12 cmse-svr01 kernel: Lustre: Skipped 1 previous similar message
Jun 17 16:21:17 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0000-osc: tried all connections, increasing latency to 1s
Jun 17 16:21:17 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 3 previous similar messages
Jun 17 16:21:17 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063018 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (6s prior to deadline).
Jun 17 16:21:17 cmse-svr01 kernel:  req at ffff8102fd548400 x1338805480063018/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784483 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 17 16:21:17 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Jun 17 16:21:29 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.103 at tcp1
Jun 17 16:21:29 cmse-svr01 kernel: Lustre: Skipped 9 previous similar messages
Jun 17 16:21:32 cmse-svr01 kernel: kjournald starting.  Commit interval 5 seconds
Jun 17 16:21:32 cmse-svr01 kernel: LDISKFS FS on sdc4, internal journal
Jun 17 16:21:32 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:21:32 cmse-svr01 kernel: kjournald starting.  Commit interval 5 seconds
Jun 17 16:21:32 cmse-svr01 kernel: LDISKFS FS on sdc4, internal journal
Jun 17 16:21:32 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:21:32 cmse-svr01 kernel: LDISKFS-fs: file extents enabled
Jun 17 16:21:32 cmse-svr01 kernel: LDISKFS-fs: mballoc enabled
Jun 17 16:21:42 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0000-osc: tried all connections, increasing latency to 2s
Jun 17 16:21:42 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 6 previous similar messages
Jun 17 16:21:42 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063031 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (7s prior to deadline).
Jun 17 16:21:42 cmse-svr01 kernel:  req at ffff81030dcf1000 x1338805480063031/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784509 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 17 16:21:42 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Jun 17 16:21:46 cmse-svr01 kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
Jun 17 16:21:46 cmse-svr01 kernel: Lustre: eu01-OST0000: Now serving eu01-OST0000 on /dev/sdc4 with recovery enabled
Jun 17 16:22:01 cmse-svr01 kernel: Lustre: eu01-MDT0000: temporarily refusing client connection from 192.168.21.119 at tcp1
Jun 17 16:22:01 cmse-svr01 kernel: Lustre: Skipped 47 previous similar messages
Jun 17 16:22:07 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0000-osc: tried all connections, increasing latency to 3s
Jun 17 16:22:07 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 6 previous similar messages
Jun 17 16:22:07 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063039 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (8s prior to deadline).
Jun 17 16:22:07 cmse-svr01 kernel:  req at ffff81029e491400 x1338805480063039/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784535 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 17 16:22:07 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Jun 17 16:22:07 cmse-svr01 kernel: Lustre: 23941:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/7 OSTs are active, abort quota recovery
Jun 17 16:22:07 cmse-svr01 kernel: Lustre: eu01-OST0000: received MDS connection from 0 at lo
Jun 17 16:22:07 cmse-svr01 kernel: Lustre: MDS eu01-MDT0000: eu01-OST0000_UUID now active, resetting orphans
Jun 17 16:22:30 cmse-svr01 kernel: kjournald starting.  Commit interval 5 seconds
Jun 17 16:22:30 cmse-svr01 kernel: LDISKFS FS on sdb1, internal journal
Jun 17 16:22:30 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:22:30 cmse-svr01 kernel: kjournald starting.  Commit interval 5 seconds
Jun 17 16:22:30 cmse-svr01 kernel: LDISKFS FS on sdb1, internal journal
Jun 17 16:22:30 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:22:30 cmse-svr01 kernel: LDISKFS-fs: file extents enabled
Jun 17 16:22:30 cmse-svr01 kernel: LDISKFS-fs: mballoc enabled
Jun 17 16:22:30 cmse-svr01 kernel: Lustre: eu01-OST0001: Now serving eu01-OST0001 on /dev/sdb1 with recovery enabled
Jun 17 16:22:32 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0001-osc: tried all connections, increasing latency to 4s
Jun 17 16:22:32 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 6 previous similar messages
Jun 17 16:22:32 cmse-svr01 kernel: Lustre: 23941:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/7 OSTs are active, abort quota recovery
Jun 17 16:22:32 cmse-svr01 kernel: Lustre: eu01-OST0001: received MDS connection from 0 at lo
Jun 17 16:22:32 cmse-svr01 kernel: Lustre: MDS eu01-MDT0000: eu01-OST0001_UUID now active, resetting orphans
Jun 17 16:22:57 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0002-osc: tried all connections, increasing latency to 5s
Jun 17 16:22:57 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 5 previous similar messages
Jun 17 16:22:57 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063100 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (10s prior to deadline).
Jun 17 16:22:57 cmse-svr01 kernel:  req at ffff8102a0cc1800 x1338805480063100/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784587 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 17 16:22:57 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Jun 17 16:23:10 cmse-svr01 kernel: kjournald starting.  Commit interval 5 seconds
Jun 17 16:23:10 cmse-svr01 kernel: LDISKFS FS on sdb2, internal journal
Jun 17 16:23:10 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:23:11 cmse-svr01 kernel: kjournald starting.  Commit interval 5 seconds
Jun 17 16:23:11 cmse-svr01 kernel: LDISKFS FS on sdb2, internal journal
Jun 17 16:23:11 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:23:11 cmse-svr01 kernel: LDISKFS-fs: file extents enabled
Jun 17 16:23:11 cmse-svr01 kernel: LDISKFS-fs: mballoc enabled
Jun 17 16:23:22 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0002-osc: tried all connections, increasing latency to 6s
Jun 17 16:23:22 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 4 previous similar messages
Jun 17 16:23:27 cmse-svr01 kernel: Lustre: eu01-OST0002: Now serving eu01-OST0002 on /dev/sdb2 with recovery enabled
Jun 17 16:23:47 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0002-osc: tried all connections, increasing latency to 7s
Jun 17 16:23:47 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 4 previous similar messages
Jun 17 16:23:47 cmse-svr01 kernel: Lustre: 23941:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/7 OSTs are active, abort quota recovery
Jun 17 16:23:47 cmse-svr01 kernel: Lustre: eu01-OST0002: received MDS connection from 0 at lo
Jun 17 16:23:47 cmse-svr01 kernel: Lustre: MDS eu01-MDT0000: eu01-OST0002_UUID now active, resetting orphans
Jun 17 16:24:12 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063143 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (13s prior to deadline).
Jun 17 16:24:12 cmse-svr01 kernel:  req at ffff810153414000 x1338805480063143/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784665 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 17 16:24:12 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Jun 17 16:24:37 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0003-osc: tried all connections, increasing latency to 9s
Jun 17 16:24:37 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 8 previous similar messages
Jun 17 16:25:52 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) eu01-OST0003-osc: tried all connections, increasing latency to 12s
Jun 17 16:25:52 cmse-svr01 kernel: Lustre: 23942:0:(import.c:517:import_select_connection()) Skipped 11 previous similar messages
Jun 17 16:26:42 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1338805480063191 sent from eu01-OST0004-osc to NID 192.168.21.173 at tcp1 0s ago has failed due to network error (19s prior to deadline).
Jun 17 16:26:42 cmse-svr01 kernel:  req at ffff8102c094e400 x1338805480063191/t0 o8->eu01-OST0004_UUID at 10.101.0.1@tcp:28/4 lens 368/584 e 0 to 1 dl 1276784821 ref 1 fl Rpc:N/0/0 rc 0/0
Jun 17 16:26:42 cmse-svr01 kernel: Lustre: 23941:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 17 previous similar messages
Jun 17 16:26:44 cmse-svr01 kernel: kjournald starting.  Commit interval 5 seconds
Jun 17 16:26:44 cmse-svr01 kernel: LDISKFS FS on sdd1, internal journal
Jun 17 16:26:44 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:26:44 cmse-svr01 kernel: kjournald starting.  Commit interval 5 seconds
Jun 17 16:26:44 cmse-svr01 kernel: LDISKFS FS on sdd1, internal journal
Jun 17 16:26:44 cmse-svr01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 17 16:26:44 cmse-svr01 kernel: LDISKFS-fs: file extents enabled
Jun 17 16:26:44 cmse-svr01 kernel: LDISKFS-fs: mballoc enabled
Jun 17 16:26:45 cmse-svr01 kernel: Lustre: eu01-OST0003: Now serving eu01-OST0003 on /dev/sdd1 with recovery enabled
Jun 17 16:27:07 cmse-svr01 kernel: Lustre: 23941:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/7 OSTs are active, abort quota recovery
Jun 17 16:27:07 cmse-svr01 kernel: Lustre: eu01-OST0003: received MDS connection from 0 at lo
Jun 17 16:27:07 cmse-svr01 kernel: Lustre: MDS eu01-MDT0000: eu01-OST0003_UUID now active, resetting orphans
Jun 17 16:27:17 cmse-svr01 kernel: CPU 0:
Jun 17 16:27:17 cmse-svr01 kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ppdev(U) parport_pc(U) lp(U) parport(U) nfsd(U) exportfs(U) auth_rpcgss(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) fuse(U) loop(U) ixgbe(U) i2c_i801(U) pl2303(U) serio_raw(U) i2c_core(U) pcspkr(U) shpchp(U) usbserial(U) joydev(U) ext3(U) jbd(U) dm_mirror(U) dm_log(U) dm_snapshot(U) dm_mod(U) sg(U) sd_mod(U) st(U) ch(U) ide_cd(U) floppy(U) cdrom(U) 3w_9xxx(U) uhci_hcd(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) qla2xxx(U) scsi_transport_fc(U) ehci_hcd(U) scsi_mod(U) igb(U) 8021q(U)
Jun 17 16:27:17 cmse-svr01 kernel: Pid: 24646, comm: llog_process_th Tainted: G      2.6.18-164.11.1.el5lustre.1.8.2-0rc4 #3
Jun 17 16:27:17 cmse-svr01 kernel: RIP: 0010:[<ffffffff88950645>]  [<ffffffff88950645>] :ldiskfs:ldiskfs_find_entry+0x245/0x5b0
Jun 17 16:27:17 cmse-svr01 kernel: RSP: 0018:ffff810628abd970  EFLAGS: 00000202
Jun 17 16:27:17 cmse-svr01 kernel: RAX: 0000000000000000 RBX: 0000000000000007 RCX: 000000004c1a30bb
Jun 17 16:27:17 cmse-svr01 kernel: RDX: ffff81028f388000 RSI: ffff810628abd8e0 RDI: ffff81034a868118
Jun 17 16:27:17 cmse-svr01 kernel: RBP: ffff81063edcb100 R08: ffff81063bbdbff8 R09: ffff81063bbdb000
Jun 17 16:27:17 cmse-svr01 kernel: R10: ffff810628abd950 R11: 00000000000000e8 R12: 0000000000000000
Jun 17 16:27:17 cmse-svr01 kernel: R13: 0000000000000002 R14: ffff81063b511f10 R15: ffffffff80063adb
Jun 17 16:27:17 cmse-svr01 kernel: FS:  00002ac84db0f6e0(0000) GS:ffffffff803c2000(0000) knlGS:0000000000000000
Jun 17 16:27:17 cmse-svr01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 17 16:27:17 cmse-svr01 kernel: CR2: 0000000002829808 CR3: 000000063e705000 CR4: 00000000000006e0
Jun 17 16:27:17 cmse-svr01 kernel:
Jun 17 16:27:17 cmse-svr01 kernel: Call Trace:
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff80128a60>] avc_has_perm+0x46/0x58
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88952403>] :ldiskfs:ldiskfs_lookup+0x53/0x281
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff80036eb5>] __lookup_hash+0x10b/0x12f
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff800e73ba>] lookup_one_len+0x54/0x62
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88a9a35d>] :obdfilter:filter_fid2dentry+0x42d/0x740
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8873d6a4>] :ptlrpc:at_measured+0x114/0x320
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8873d6a4>] :ptlrpc:at_measured+0x114/0x320
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8000d3a5>] dput+0x2c/0x113
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88aa1fb4>] :obdfilter:filter_destroy+0x154/0x1fb0
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88723175>] :ptlrpc:lustre_msg_set_transno+0x45/0x120
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff887221f5>] :ptlrpc:lustre_msg_get_transno+0x35/0xf0
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8870f21a>] :ptlrpc:after_reply+0x97a/0xd00
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8002e3d3>] __wake_up+0x38/0x4f
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff887153c4>] :ptlrpc:ptlrpc_queue_wait+0x1654/0x16f0
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88722cb5>] :ptlrpc:lustre_msg_set_opc+0x45/0x120
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8870be35>] :ptlrpc:ptlrpc_at_set_req_timeout+0x85/0xd0
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8005c362>] cache_alloc_refill+0x106/0x186
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88ab1363>] :obdfilter:filter_recov_log_mds_ost_cb+0x5b3/0xf10
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88735f66>] :ptlrpc:llog_client_next_block+0x5a6/0x650
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8005c362>] cache_alloc_refill+0x106/0x186
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff88643d22>] :obdclass:llog_process_thread+0x882/0xc30
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff886434a0>] :obdclass:llog_process_thread+0x0/0xc30
Jun 17 16:27:17 cmse-svr01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 17 16:27:17 cmse-svr01 kernel:
Jun 17 16:27:19 cmse-svr01 kernel: CPU 1:
Jun 17 16:27:19 cmse-svr01 kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ppdev(U) parport_pc(U) lp(U) parport(U) nfsd(U) exportfs(U) auth_rpcgss(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) fuse(U) loop(U) ixgbe(U) i2c_i801(U) pl2303(U) serio_raw(U) i2c_core(U) pcspkr(U) shpchp(U) usbserial(U) joydev(U) ext3(U) jbd(U) dm_mirror(U) dm_log(U) dm_snapshot(U) dm_mod(U) sg(U) sd_mod(U) st(U) ch(U) ide_cd(U) floppy(U) cdrom(U) 3w_9xxx(U) uhci_hcd(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) qla2xxx(U) scsi_transport_fc(U) ehci_hcd(U) scsi_mod(U) igb(U) 8021q(U)
Jun 17 16:27:19 cmse-svr01 kernel: Pid: 24314, comm: ll_ost_40 Tainted: G      2.6.18-164.11.1.el5lustre.1.8.2-0rc4 #3
Jun 17 16:27:19 cmse-svr01 kernel: RIP: 0010:[<ffffffff889505d8>]  [<ffffffff889505d8>] :ldiskfs:ldiskfs_find_entry+0x1d8/0x5b0
Jun 17 16:27:19 cmse-svr01 kernel: RSP: 0018:ffff8102bc35d690  EFLAGS: 00000202
Jun 17 16:27:19 cmse-svr01 kernel: RAX: 0000000000000000 RBX: 0000000000000007 RCX: 000000004c1a30bd
Jun 17 16:27:19 cmse-svr01 kernel: RDX: ffff81028f388000 RSI: ffff8102bc35d600 RDI: ffff81010b644610
Jun 17 16:27:19 cmse-svr01 kernel: RBP: ffff8102a3c0f080 R08: ffff8103148dfff8 R09: ffff8103148df000
Jun 17 16:27:19 cmse-svr01 kernel: R10: ffff8102bc35d670 R11: ffff81010b6e9ee8 R12: 0000000000000000
Jun 17 16:27:19 cmse-svr01 kernel: R13: 0000000000000002 R14: ffff8102e58855b0 R15: ffffffff80063adb
Jun 17 16:27:19 cmse-svr01 kernel: FS:  00002ac84db0f6e0(0000) GS:ffff81010b699440(0000) knlGS:0000000000000000
Jun 17 16:27:19 cmse-svr01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 17 16:27:19 cmse-svr01 kernel: CR2: 00002b82d0a84ad8 CR3: 000000063e705000 CR4: 00000000000006e0
Jun 17 16:27:19 cmse-svr01 kernel:
Jun 17 16:27:19 cmse-svr01 kernel: Call Trace:
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff80128a60>] avc_has_perm+0x46/0x58
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88952403>] :ldiskfs:ldiskfs_lookup+0x53/0x281
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff80036eb5>] __lookup_hash+0x10b/0x12f
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff800e73ba>] lookup_one_len+0x54/0x62
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88a9a35d>] :obdfilter:filter_fid2dentry+0x42d/0x740
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8027e2f0>] __down_trylock+0x44/0x4e
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88ab454b>] :obdfilter:filter_lvbo_init+0x3bb/0x68b
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88607ca7>] :lnet:lnet_prep_send+0x67/0xb0
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff886e25be>] :ptlrpc:ldlm_resource_get+0x90e/0xa60
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88a77530>] :ost:ost_blocking_ast+0x0/0x610
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88701150>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff886d8efa>] :ptlrpc:ldlm_lock_create+0xba/0xa00
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8871de21>] :ptlrpc:lustre_swab_buf+0x81/0x170
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff886fbb90>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff886fbb90>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88701150>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88a77530>] :ost:ost_blocking_ast+0x0/0x610
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff886fe27f>] :ptlrpc:ldlm_handle_enqueue+0x66f/0x1210
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8871ca38>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88a7edb7>] :ost:ost_handle+0x4e17/0x53e0
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff88729c8d>] :ptlrpc:ptlrpc_server_handle_request+0xaad/0x1150
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8008c23d>] __activate_task+0x56/0x6d
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff80047205>] try_to_wake_up+0x473/0x485
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8003dbd8>] lock_timer_base+0x1b/0x3c
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8008ac3a>] __wake_up_common+0x3e/0x68
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8872d708>] :ptlrpc:ptlrpc_main+0x1258/0x1420
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8008c837>] default_wake_function+0x0/0xe
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8872c4b0>] :ptlrpc:ptlrpc_main+0x0/0x1420
Jun 17 16:27:19 cmse-svr01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 17 16:27:19 cmse-svr01 kernel:
Jun 17 16:27:29 cmse-svr01 kernel: CPU 1:
Jun 17 16:27:29 cmse-svr01 kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ppdev(U) parport_pc(U) lp(U) parport(U) nfsd(U) exportfs(U) auth_rpcgss(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) fuse(U) loop(U) ixgbe(U) i2c_i801(U) pl2303(U) serio_raw(U) i2c_core(U) pcspkr(U) shpchp(U) usbserial(U) joydev(U) ext3(U) jbd(U) dm_mirror(U) dm_log(U) dm_snapshot(U) dm_mod(U) sg(U) sd_mod(U) st(U) ch(U) ide_cd(U) floppy(U) cdrom(U) 3w_9xxx(U) uhci_hcd(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) qla2xxx(U) scsi_transport_fc(U) ehci_hcd(U) scsi_mod(U) igb(U) 8021q(U)
Jun 17 16:27:29 cmse-svr01 kernel: Pid: 24314, comm: ll_ost_40 Tainted: G      2.6.18-164.11.1.el5lustre.1.8.2-0rc4 #3
Jun 17 16:27:29 cmse-svr01 kernel: RIP: 0010:[<ffffffff8895064f>]  [<ffffffff8895064f>] :ldiskfs:ldiskfs_find_entry+0x24f/0x5b0
Jun 17 16:27:29 cmse-svr01 kernel: RSP: 0018:ffff8102bc35d690  EFLAGS: 00000202
Jun 17 16:27:29 cmse-svr01 kernel: RAX: 0000000000000000 RBX: 0000000000000007 RCX: 000000004c1a30bd
Jun 17 16:27:29 cmse-svr01 kernel: RDX: ffff81028f388000 RSI: ffff8102bc35d600 RDI: ffff81010b644610
Jun 17 16:27:29 cmse-svr01 kernel: RBP: ffff8102a3c0f080 R08: ffff8103148dfff8 R09: ffff8103148df000
Jun 17 16:27:29 cmse-svr01 kernel: R10: ffff8102bc35d670 R11: ffff81010b6e9ee8 R12: 0000000000000000
Jun 17 16:27:29 cmse-svr01 kernel: R13: 0000000000000002 R14: ffff8102e58855b0 R15: ffffffff80063adb
Jun 17 16:27:29 cmse-svr01 kernel: FS:  00002ac84db0f6e0(0000) GS:ffff81010b699440(0000) knlGS:0000000000000000
Jun 17 16:27:29 cmse-svr01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 17 16:27:29 cmse-svr01 kernel: CR2: 00002b82d0a84ad8 CR3: 000000063e705000 CR4: 00000000000006e0
Jun 17 16:27:29 cmse-svr01 kernel:
Jun 17 16:27:29 cmse-svr01 kernel: Call Trace:
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff80128a60>] avc_has_perm+0x46/0x58
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88952403>] :ldiskfs:ldiskfs_lookup+0x53/0x281
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff80036eb5>] __lookup_hash+0x10b/0x12f
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff800e73ba>] lookup_one_len+0x54/0x62
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88a9a35d>] :obdfilter:filter_fid2dentry+0x42d/0x740
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8027e2f0>] __down_trylock+0x44/0x4e
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88ab454b>] :obdfilter:filter_lvbo_init+0x3bb/0x68b
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88607ca7>] :lnet:lnet_prep_send+0x67/0xb0
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff886e25be>] :ptlrpc:ldlm_resource_get+0x90e/0xa60
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88a77530>] :ost:ost_blocking_ast+0x0/0x610
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88701150>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff886d8efa>] :ptlrpc:ldlm_lock_create+0xba/0xa00
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8871de21>] :ptlrpc:lustre_swab_buf+0x81/0x170
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff886fbb90>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff886fbb90>] :ptlrpc:ldlm_server_glimpse_ast+0x0/0x3b0
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88701150>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5d0
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88a77530>] :ost:ost_blocking_ast+0x0/0x610
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff886fe27f>] :ptlrpc:ldlm_handle_enqueue+0x66f/0x1210
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8871ca38>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88a7edb7>] :ost:ost_handle+0x4e17/0x53e0
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff88729c8d>] :ptlrpc:ptlrpc_server_handle_request+0xaad/0x1150
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8008c23d>] __activate_task+0x56/0x6d
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff80047205>] try_to_wake_up+0x473/0x485
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8003dbd8>] lock_timer_base+0x1b/0x3c
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8008ac3a>] __wake_up_common+0x3e/0x68
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8872d708>] :ptlrpc:ptlrpc_main+0x1258/0x1420
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8008c837>] default_wake_function+0x0/0xe
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8872c4b0>] :ptlrpc:ptlrpc_main+0x0/0x1420
Jun 17 16:27:29 cmse-svr01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 17 16:27:29 cmse-svr01 kernel:

(many more of these hung-process stack traces follow, about one every two seconds)

Jun 17 16:43:59 cmse-svr01 kernel: Lustre: 24299:0:(ldlm_lib.c:575:target_handle_reconnect()) eu01-OST0003: eu01-mdtlov_UUID reconnecting
Jun 17 16:43:59 cmse-svr01 kernel: Lustre: 24299:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 4 previous similar messages
Jun 17 16:43:59 cmse-svr01 kernel: Lustre: 24299:0:(ldlm_lib.c:875:target_handle_connect()) eu01-OST0003: refuse reconnection from eu01-mdtlov_UUID at 0@lo to 0xffff81034a921800; still busy with 1 active RPCs
Jun 17 16:43:59 cmse-svr01 kernel: Lustre: 24299:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 4 previous similar messages


Jun 17 17:02:41 cmse-svr01 kernel: Lustre: 24437:0:(niobuf.c:202:ptlrpc_abort_bulk()) Unexpectedly long timeout: desc ffff810324ad9480


