[Lustre-discuss] Fw: lustre 1.8.3 make OSS node crashed

xiaojunhua xiaojunhua at ict.ac.cn
Sun Jul 10 17:57:51 PDT 2011


Recently, our lab's Lustre file system has been failing frequently, and the failures have crashed one of the OSS nodes.

After a reboot the file system recovers. I suspect that some data in the file system is corrupted and want to run lfsck to fix it, but why does Lustre crash the OSS node? Does Lustre 1.8.3 have a bug that could cause this?
Occasionally, when we disable panic_on_oops, one OST becomes unavailable on that OSS, and we cannot unmount and remount the OST; we then have to reboot that OSS.
We have one MDS and three OSS nodes.
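
Below is the rough procedure I plan to try, following the Lustre 1.8 manual. The OST device names are taken from the log below; the MDS device, the /tmp database paths, and the client mount point are only placeholders for our setup, so please correct me if this is not the right approach:

  # check whether the node is set to panic on an oops (1 = panic, 0 = keep running)
  sysctl kernel.panic_on_oops

  # with the OSTs unmounted, read-only check of the backing ldiskfs file systems
  e2fsck -fn /dev/mapper/mpath0
  e2fsck -fn /dev/mapper/mpath1

  # build the databases for a distributed lfsck
  e2fsck -n -v --mdsdb /tmp/mdsdb /dev/MDSDEV                               # on the MDS
  e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb0 /dev/mapper/mpath0    # on this OSS, per OST
  e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb1 /dev/mapper/mpath1

  # run lfsck from a client with the file system mounted
  # (the mdsdb/ostdb files must be copied to, or visible from, the client)
  lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb0 /tmp/ostdb1 /mnt/lustre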

Has anyone else run into this problem?

File system: Lustre 1.8.3
OS: Red Hat EL 5.4, x86_64
Server: HP DL385 G7
Storage: HP EVA 4400

/var/log/messages:

Jun 22 16:47:16 node97 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
Jun 22 16:47:16 node97 kernel: LDISKFS FS on dm-0, internal journal
Jun 22 16:47:16 node97 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 22 16:47:16 node97 kernel: LDISKFS-fs: file extents enabled
Jun 22 16:47:16 node97 kernel: LDISKFS-fs: mballoc enabled
Jun 22 16:47:16 node97 kernel: Lustre: MGC192.168.111.100 at tcp: Reactivating import
Jun 22 16:47:26 node97 kernel: LustreError: 137-5: UUID 'lustre-OST0000_UUID' is not available  for connect (no target)
Jun 22 16:47:26 node97 kernel: LustreError: 7255:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-19)  req at ffff81012b888800 x1371312002115897/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1308732546 ref 1 fl Interpret:/0/0 rc -19/0
Jun 22 16:47:31 node97 kernel: LustreError: 137-5: UUID 'lustre-OST0003_UUID' is not available  for connect (no target)
Jun 22 16:47:31 node97 kernel: LustreError: 7256:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-19)  req at ffff810113247800 x1371312002115898/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1308732551 ref 1 fl Interpret:/0/0 rc -19/0
Jun 22 16:47:34 node97 kernel: LustreError: 137-5: UUID 'lustre-OST0000_UUID' is not available  for connect (no target)
Jun 22 16:47:34 node97 kernel: LustreError: 7257:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-19)  req at ffff81012503f000 x1371314868890750/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1308732554 ref 1 fl Interpret:/0/0 rc -19/0
Jun 22 16:47:36 node97 kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
Jun 22 16:47:36 node97 kernel: Lustre: 7538:0:(filter.c:990:filter_init_server_data()) RECOVERY: service lustre-OST0000, 37 recoverable clients, 0 delayed clients, last_rcvd 257712296235
Jun 22 16:47:36 node97 kernel: Lustre: lustre-OST0000: Now serving lustre-OST0000 on /dev/mapper/mpath0 with recovery enabled
Jun 22 16:47:36 node97 kernel: Lustre: lustre-OST0000: Will be in recovery for at least 5:00, or until 37 clients reconnect
Jun 22 16:47:36 node97 kernel: LustreError: 7538:0:(obd_config.c:1011:class_process_proc_param()) lustre-OST0000: unknown param mdt.quota_type=ug2
Jun 22 16:47:36 node97 kernel: Lustre: lustre-OST0000.ost: set parameter quota_type=ug2
Jun 22 16:47:39 node97 kernel: Lustre: 7348:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) lustre-OST0000: 36 recoverable clients remain
Jun 22 16:47:39 node97 kernel: kjournald starting.  Commit interval 5 seconds
Jun 22 16:47:39 node97 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
Jun 22 16:47:39 node97 kernel: LDISKFS FS on dm-1, internal journal
Jun 22 16:47:39 node97 kernel: LDISKFS-fs: recovery complete.
Jun 22 16:47:39 node97 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 22 16:47:39 node97 multipathd: dm-1: umount map (uevent) 
Jun 22 16:47:39 node97 kernel: kjournald starting.  Commit interval 5 seconds
Jun 22 16:47:39 node97 kernel: LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
Jun 22 16:47:39 node97 kernel: LDISKFS FS on dm-1, internal journal
Jun 22 16:47:39 node97 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
Jun 22 16:47:39 node97 kernel: LDISKFS-fs: file extents enabled
Jun 22 16:47:39 node97 kernel: LustreError: 137-5: UUID 'lustre-OST0003_UUID' is not available  for connect (no target)
Jun 22 16:47:39 node97 kernel: LustreError: 7855:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-19)  req at ffff81012b449800 x1371314791390292/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1308732559 ref 1 fl Interpret:/0/0 rc -19/0
Jun 22 16:47:39 node97 kernel: LDISKFS-fs: mballoc enabled
Jun 22 16:47:39 node97 kernel: Lustre: 7966:0:(filter.c:990:filter_init_server_data()) RECOVERY: service lustre-OST0003, 37 recoverable clients, 0 delayed clients, last_rcvd 253410481848
Jun 22 16:47:39 node97 kernel: Lustre: lustre-OST0003: Now serving lustre-OST0003 on /dev/mapper/mpath1 with recovery enabled
Jun 22 16:47:39 node97 kernel: Lustre: lustre-OST0003: Will be in recovery for at least 5:00, or until 37 clients reconnect
Jun 22 16:47:39 node97 kernel: Lustre: lustre-OST0003.ost: set parameter quota_type=ug2
Jun 22 16:47:39 node97 kernel: LustreError: 7966:0:(obd_config.c:1011:class_process_proc_param()) lustre-OST0003: unknown param mdt.quota_type=ug2
Jun 22 16:47:39 node97 pcscd: pcscdaemon.c:507:main() pcsc-lite 1.4.4 daemon ready.
Jun 22 16:47:40 node97 kernel: Lustre: 7684:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) lustre-OST0000: 35 recoverable clients remain
Jun 22 16:47:40 node97 pcscd: hotplug_libusb.c:402:HPEstablishUSBNotifications() Driver ifd-egate.bundle does not support IFD_GENERATE_HOTPLUG. Using active polling instead.
Jun 22 16:47:40 node97 pcscd: hotplug_libusb.c:411:HPEstablishUSBNotifications() Polling forced every 1 second(s)
Jun 22 16:47:41 node97 kernel: Lustre: 7265:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) lustre-OST0000: 34 recoverable clients remain
Jun 22 16:47:42 node97 kernel: Lustre: 7255:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) lustre-OST0003: 36 recoverable clients remain
Jun 22 16:47:42 node97 kernel: Lustre: 7255:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) Skipped 1 previous similar message
Jun 22 16:47:44 node97 kernel: Lustre: 7355:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) lustre-OST0000: 31 recoverable clients remain
Jun 22 16:47:44 node97 kernel: Lustre: 7355:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) Skipped 3 previous similar messages
Jun 22 16:47:44 node97 kernel: Bluetooth: Core ver 2.10
Jun 22 16:47:44 node97 kernel: NET: Registered protocol family 31
Jun 22 16:47:44 node97 kernel: Bluetooth: HCI device and connection manager initialized
Jun 22 16:47:44 node97 kernel: Bluetooth: HCI socket layer initialized
Jun 22 16:47:44 node97 kernel: Bluetooth: L2CAP ver 2.8
Jun 22 16:47:44 node97 kernel: Bluetooth: L2CAP socket layer initialized
Jun 22 16:47:44 node97 kernel: Bluetooth: HIDP (Human Interface Emulation) ver 1.1
Jun 22 16:47:44 node97 hidd[8099]: Bluetooth HID daemon
Jun 22 16:47:46 node97 automount[8136]: lookup_read_master: lookup(nisplus): couldn't locate nis+ table auto.master
Jun 22 16:47:46 node97 gpm[8188]: *** info [startup.c(95)]: 
Jun 22 16:47:46 node97 gpm[8188]: Started gpm successfully. Entered daemon mode.
Jun 22 16:47:46 node97 xinetd[8179]: xinetd Version 2.3.14 started with libwrap loadavg labeled-networking options compiled in.
Jun 22 16:47:46 node97 xinetd[8179]: Started working: 0 available services
Jun 22 16:47:47 node97 avahi-daemon[8269]: Found user 'avahi' (UID 70) and group 'avahi' (GID 70).
Jun 22 16:47:47 node97 avahi-daemon[8269]: Successfully dropped root privileges.
Jun 22 16:47:47 node97 avahi-daemon[8269]: avahi-daemon 0.6.16 starting up.
Jun 22 16:47:47 node97 avahi-daemon[8269]: WARNING: No NSS support for mDNS detected, consider installing nss-mdns!
Jun 22 16:47:47 node97 avahi-daemon[8269]: Successfully called chroot().
Jun 22 16:47:47 node97 avahi-daemon[8269]: Successfully dropped remaining capabilities.
Jun 22 16:47:47 node97 avahi-daemon[8269]: Loading service file /services/sftp-ssh.service.
Jun 22 16:47:47 node97 avahi-daemon[8269]: New relevant interface eth1.IPv6 for mDNS.
Jun 22 16:47:47 node97 avahi-daemon[8269]: Joining mDNS multicast group on interface eth1.IPv6 with address fe80::225:b3ff:fe21:f364.
Jun 22 16:47:47 node97 avahi-daemon[8269]: New relevant interface eth1.IPv4 for mDNS.
Jun 22 16:47:47 node97 avahi-daemon[8269]: Joining mDNS multicast group on interface eth1.IPv4 with address 192.168.111.97.
Jun 22 16:47:47 node97 avahi-daemon[8269]: Network interface enumeration completed.
Jun 22 16:47:47 node97 avahi-daemon[8269]: Registering new address record for fe80::225:b3ff:fe21:f364 on eth1.
Jun 22 16:47:47 node97 avahi-daemon[8269]: Registering new address record for 192.168.111.97 on eth1.
Jun 22 16:47:47 node97 avahi-daemon[8269]: Registering HINFO record with values 'X86_64'/'LINUX'.
Jun 22 16:47:48 node97 avahi-daemon[8269]: Server startup complete. Host name is node97.local. Local service cookie is 1151745486.
Jun 22 16:47:49 node97 avahi-daemon[8269]: Service "SFTP File Transfer on node97" (/services/sftp-ssh.service) successfully established.
Jun 22 16:47:57 node97 kernel: Lustre: 7706:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) lustre-OST0000: 28 recoverable clients remain
Jun 22 16:47:57 node97 kernel: Lustre: 7706:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) Skipped 3 previous similar messages
Jun 22 16:47:59 node97 kernel: Adding 8191992k swap on /swap.add.  Priority:-2 extents:2382 across:8856828k
Jun 22 16:47:59 node97 avahi-daemon[8269]: Registering new address record for 11.11.11.97 on eth1.
Jun 22 16:47:59 node97 avahi-daemon[8269]: Withdrawing address record for 11.11.11.97 on eth1.
Jun 22 16:47:59 node97 avahi-daemon[8269]: Registering new address record for 11.11.11.97 on eth1.
Jun 22 16:48:01 node97 pcscd: winscard.c:304:SCardConnect() Reader E-Gate 0 0 Not Found
Jun 22 16:48:01 node97 last message repeated 3 times
Jun 22 16:48:05 node97 kernel: Lustre: 7707:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) lustre-OST0000: 26 recoverable clients remain
Jun 22 16:48:05 node97 kernel: Lustre: 7707:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) Skipped 7 previous similar messages
Jun 22 16:48:15 node97 kernel: Lustre: 7838:0:(ldlm_lib.c:575:target_handle_reconnect()) lustre-OST0003: 993d4d04-1a19-dd6b-0220-163a6869cb64 reconnecting
Jun 22 16:48:15 node97 kernel: Lustre: 7838:0:(ldlm_lib.c:875:target_handle_connect()) lustre-OST0003: refuse reconnection from 993d4d04-1a19-dd6b-0220-163a6869cb64 at 11.11.11.15@tcp to 0xffff81012ca74000; still busy with 264 active RPCs
Jun 22 16:48:15 node97 kernel: LustreError: 7838:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16)  req at ffff810115eb8800 x1371314807096328/t0 o8->993d4d04-1a19-dd6b-0220-163a6869cb64 at NET_0x200000b0b0b0f_UUID:0/0 lens 368/264 e 0 to 0 dl 1308732592 ref 1 fl Interpret:/0/0 rc -16/0
Jun 22 16:48:15 node97 kernel: Lustre: 7606:0:(ldlm_lib.c:575:target_handle_reconnect()) lustre-OST0003: 993d4d04-1a19-dd6b-0220-163a6869cb64 reconnecting
Jun 22 16:48:15 node97 kernel: Lustre: 7606:0:(ldlm_lib.c:875:target_handle_connect()) lustre-OST0003: refuse reconnection from 993d4d04-1a19-dd6b-0220-163a6869cb64 at 11.11.11.15@tcp to 0xffff81012ca74000; still busy with 4 active RPCs
Jun 22 16:48:22 node97 kernel: Lustre: 7914:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) lustre-OST0000: 15 recoverable clients remain
Jun 22 16:48:22 node97 kernel: Lustre: 7914:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) Skipped 23 previous similar messages
Jun 22 16:48:23 node97 kernel: Lustre: 7260:0:(ldlm_lib.c:575:target_handle_reconnect()) lustre-OST0003: 993d4d04-1a19-dd6b-0220-163a6869cb64 reconnecting
Jun 22 16:48:30 node97 kernel: Lustre: 7273:0:(ldlm_lib.c:575:target_handle_reconnect()) lustre-OST0000: 993d4d04-1a19-dd6b-0220-163a6869cb64 reconnecting
Jun 22 16:48:30 node97 kernel: Lustre: 7273:0:(ldlm_lib.c:875:target_handle_connect()) lustre-OST0000: refuse reconnection from 993d4d04-1a19-dd6b-0220-163a6869cb64 at 11.11.11.15@tcp to 0xffff81013388aa00; still busy with 510 active RPCs
Jun 22 16:48:30 node97 kernel: LustreError: 7273:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16)  req at ffff810219738400 x1371314807119112/t0 o8->993d4d04-1a19-dd6b-0220-163a6869cb64 at NET_0x200000b0b0b0f_UUID:0/0 lens 368/264 e 0 to 0 dl 1308732608 ref 1 fl Interpret:/0/0 rc -16/0
Jun 22 16:48:30 node97 kernel: LustreError: 7273:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 1 previous similar message
Jun 22 16:48:37 node97 kernel: Lustre: 7874:0:(ldlm_lib.c:575:target_handle_reconnect()) lustre-OST0000: 1e209d42-9195-820e-f318-877a9e7f08dd reconnecting
Jun 22 16:48:37 node97 kernel: Lustre: 7874:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 16:48:37 node97 kernel: Lustre: 7808:0:(ldlm_lib.c:875:target_handle_connect()) lustre-OST0000: refuse reconnection from 993d4d04-1a19-dd6b-0220-163a6869cb64 at 11.11.11.15@tcp to 0xffff81013388aa00; still busy with 6 active RPCs
Jun 22 16:48:43 node97 kernel: Lustre: 7791:0:(ldlm_lib.c:575:target_handle_reconnect()) lustre-OST0000: 993d4d04-1a19-dd6b-0220-163a6869cb64 reconnecting
Jun 22 16:48:43 node97 kernel: Lustre: 7791:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 2 previous similar messages
Jun 22 16:48:55 node97 kernel: Lustre: lustre-OST0000: Recovery period over after 1:17, of 37 clients 37 recovered and 0 were evicted.
Jun 22 16:48:55 node97 kernel: Lustre: lustre-OST0000: sending delayed replies to recovered clients
Jun 22 16:48:55 node97 kernel: Lustre: lustre-OST0000: received MDS connection from 192.168.111.100 at tcp
Jun 22 16:48:55 node97 kernel: Lustre: lustre-OST0003: Recovery period over after 1:15, of 37 clients 37 recovered and 0 were evicted.
Jun 22 16:48:55 node97 kernel: Lustre: lustre-OST0003: sending delayed replies to recovered clients
Jun 22 16:48:55 node97 kernel: Lustre: lustre-OST0003: received MDS connection from 192.168.111.100 at tcp
Jun 22 16:49:15 node97 kernel: Lustre: 7688:0:(ldlm_lib.c:575:target_handle_reconnect()) lustre-OST0000: 632d5b96-c7de-a5c1-1d90-214ae7a2f32d reconnecting
Jun 22 16:49:15 node97 kernel: Lustre: 7688:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 16:49:15 node97 kernel: Lustre: 7873:0:(ldlm_lib.c:875:target_handle_connect()) lustre-OST0003: refuse reconnection from 632d5b96-c7de-a5c1-1d90-214ae7a2f32d at 11.11.11.43@tcp to 0xffff810110f65200; still busy with 1 active RPCs
Jun 22 16:49:15 node97 kernel: Lustre: 7328:0:(ldlm_lib.c:804:target_handle_connect()) lustre-OST0003: exp ffff810110f65200 already connecting
Jun 22 16:49:15 node97 kernel: LustreError: 7873:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16)  req at ffff8101ce611c00 x1371463404703431/t0 o8->632d5b96-c7de-a5c1-1d90-214ae7a2f32d at NET_0x200000b0b0b2b_UUID:0/0 lens 368/264 e 0 to 0 dl 1308732655 ref 1 fl Interpret:/0/0 rc -16/0
Jun 22 16:49:15 node97 kernel: LustreError: 7873:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 2 previous similar messages
Jun 22 16:49:15 node97 kernel: LustreError: 8552:0:(ost_handler.c:829:ost_brw_read()) @@@ Reconnect on bulk PUT  req at ffff81010af3e800 x1371463404703416/t0 o3->632d5b96-c7de-a5c1-1d90-214ae7a2f32d at NET_0x200000b0b0b2b_UUID:0/0 lens 448/400 e 1 to 0 dl 1308732567 ref 1 fl Interpret:/0/0 rc 0/0
Jun 22 16:49:15 node97 kernel: Lustre: 8552:0:(ost_handler.c:886:ost_brw_read()) lustre-OST0000: ignoring bulk IO comm error with 632d5b96-c7de-a5c1-1d90-214ae7a2f32d at NET_0x200000b0b0b2b_UUID id 12345-11.11.11.43 at tcp - client will retry
Jun 22 16:49:20 node97 kernel: Lustre: 7927:0:(ldlm_lib.c:875:target_handle_connect()) lustre-OST0000: refuse reconnection from 632d5b96-c7de-a5c1-1d90-214ae7a2f32d at 11.11.11.43@tcp to 0xffff81022ff20600; still busy with 4 active RPCs
Jun 22 16:49:20 node97 kernel: Lustre: 7927:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 1 previous similar message
Jun 22 16:49:42 node97 kernel: Lustre: lustre-OST0000: slow i_mutex 46s due to heavy IO load
Jun 22 16:49:42 node97 kernel: Lustre: lustre-OST0000: slow i_mutex 46s due to heavy IO load
Jun 22 16:49:42 node97 kernel: Lustre: lustre-OST0000: slow journal start 46s due to heavy IO load
Jun 22 16:49:42 node97 kernel: Lustre: lustre-OST0000: slow journal start 46s due to heavy IO load
Jun 22 16:49:42 node97 kernel: Lustre: lustre-OST0000: slow brw_start 46s due to heavy IO load
Jun 22 16:49:42 node97 kernel: Lustre: lustre-OST0000: slow parent lock 46s due to heavy IO load
Jun 22 16:49:42 node97 kernel: Lustre: lustre-OST0000: slow preprw_read setup 46s due to heavy IO load
Jun 22 16:49:42 node97 kernel: Lustre: lustre-OST0000: slow parent lock 46s due to heavy IO load
Jun 22 16:49:42 node97 kernel: Lustre: Skipped 2 previous similar messages
Jun 22 16:49:42 node97 kernel: LustreError: 8525:0:(ost_handler.c:844:ost_brw_read()) @@@ bulk PUT failed: rc -107  req at ffff8101cbc1ec00 x1371463404703377/t0 o3->632d5b96-c7de-a5c1-1d90-214ae7a2f32d at NET_0x200000b0b0b2b_UUID:0/0 lens 448/400 e 2 to 0 dl 1308732619 ref 1 fl Interpret:/0/0 rc 0/0
Jun 22 16:49:42 node97 kernel: LustreError: 8525:0:(ost_handler.c:844:ost_brw_read()) Skipped 1 previous similar message
Jun 22 16:49:42 node97 kernel: Lustre: lustre-OST0000: slow preprw_read setup 46s due to heavy IO load
Jun 22 16:49:42 node97 kernel: Lustre: Skipped 1 previous similar message
Jun 22 16:49:43 node97 kernel: Lustre: 7631:0:(ldlm_lib.c:575:target_handle_reconnect()) lustre-OST0000: 632d5b96-c7de-a5c1-1d90-214ae7a2f32d reconnecting
Jun 22 16:49:43 node97 kernel: Lustre: 7631:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 2 previous similar messages
Jun 22 16:49:43 node97 kernel: Lustre: 7631:0:(ldlm_lib.c:875:target_handle_connect()) lustre-OST0000: refuse reconnection from 632d5b96-c7de-a5c1-1d90-214ae7a2f32d at 11.11.11.43@tcp to 0xffff81022ff20600; still busy with 4 active RPCs
Jun 22 16:49:43 node97 kernel: LustreError: 7631:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16)  req at ffff8102288d3400 x1371463404703504/t0 o8->632d5b96-c7de-a5c1-1d90-214ae7a2f32d at NET_0x200000b0b0b2b_UUID:0/0 lens 368/264 e 0 to 0 dl 1308732683 ref 1 fl Interpret:/0/0 rc -16/0
Jun 22 16:49:43 node97 kernel: LustreError: 7631:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 1 previous similar message
Jun 22 16:49:44 node97 kernel: Lustre: lustre-OST0000: slow i_mutex 49s due to heavy IO load
Jun 22 16:49:44 node97 kernel: Lustre: Skipped 32 previous similar messages
Jun 22 16:49:44 node97 kernel: Lustre: 8513:0:(ost_handler.c:886:ost_brw_read()) lustre-OST0000: ignoring bulk IO comm error with 632d5b96-c7de-a5c1-1d90-214ae7a2f32d at NET_0x200000b0b0b2b_UUID id 12345-11.11.11.43 at tcp - client will retry
Jun 22 16:49:44 node97 kernel: Lustre: 8513:0:(ost_handler.c:886:ost_brw_read()) Skipped 1 previous similar message
Jun 22 16:49:46 node97 kernel: Lustre: lustre-OST0000: slow i_mutex 51s due to heavy IO load
Jun 22 16:49:46 node97 kernel: Lustre: lustre-OST0000: slow i_mutex 51s due to heavy IO load
Jun 22 16:49:46 node97 kernel: Lustre: Skipped 4 previous similar messages
Jun 22 16:50:06 node97 kernel: Lustre: 7842:0:(ldlm_lib.c:875:target_handle_connect()) lustre-OST0000: refuse reconnection from 632d5b96-c7de-a5c1-1d90-214ae7a2f32d at 11.11.11.43@tcp to 0xffff81022ff20600; still busy with 2 active RPCs
Jun 22 16:50:16 node97 kernel: Lustre: lustre-OST0000: slow i_mutex 81s due to heavy IO load
Jun 22 16:50:16 node97 kernel: Lustre: Skipped 28 previous similar messages
Jun 22 16:50:17 node97 kernel: Lustre: lustre-OST0000: slow journal start 30s due to heavy IO load
Jun 22 16:50:17 node97 kernel: Lustre: lustre-OST0000: slow brw_start 30s due to heavy IO load
Jun 22 16:50:17 node97 kernel: Lustre: Skipped 5 previous similar messages
Jun 22 16:50:17 node97 kernel: Lustre: Skipped 3 previous similar messages
Jun 22 16:50:17 node97 kernel: Lustre: lustre-OST0000: slow journal start 30s due to heavy IO load
Jun 22 16:50:17 node97 kernel: Lustre: Skipped 11 previous similar messages
Jun 22 16:50:17 node97 kernel: Lustre: lustre-OST0000: slow direct_io 81s due to heavy IO load
Jun 22 16:50:17 node97 kernel: Lustre: lustre-OST0000: slow parent lock 32s due to heavy IO load
Jun 22 16:50:17 node97 kernel: Lustre: Skipped 10 previous similar messages
Jun 22 16:50:26 node97 kernel: Lustre: lustre-OST0000: slow i_mutex 90s due to heavy IO load
Jun 22 16:50:26 node97 kernel: Lustre: Skipped 9 previous similar messages
Jun 22 16:50:29 node97 kernel: Lustre: lustre-OST0000: slow i_mutex 93s due to heavy IO load
Jun 22 16:50:29 node97 kernel: Lustre: Skipped 1 previous similar message
Jun 22 16:50:29 node97 kernel: Lustre: lustre-OST0000: slow direct_io 93s due to heavy IO load
Jun 22 16:50:29 node97 kernel: Lustre: Skipped 2 previous similar messages
Jun 22 16:50:29 node97 kernel: Lustre: lustre-OST0000: slow direct_io 46s due to heavy IO load
Jun 22 16:50:29 node97 kernel: Lustre: lustre-OST0000: slow parent lock 42s due to heavy IO load
Jun 22 16:50:29 node97 kernel: Lustre: Skipped 6 previous similar messages
Jun 22 16:50:29 node97 kernel: Lustre: lustre-OST0000: slow direct_io 47s due to heavy IO load
Jun 22 16:50:29 node97 kernel: Lustre: Skipped 6 previous similar messages
Jun 22 16:50:29 node97 kernel: Lustre: lustre-OST0000: slow i_mutex 44s due to heavy IO load
Jun 22 16:50:29 node97 kernel: Lustre: Skipped 10 previous similar messages
Jun 22 16:50:29 node97 kernel: Lustre: 7597:0:(ldlm_lib.c:575:target_handle_reconnect()) lustre-OST0000: 632d5b96-c7de-a5c1-1d90-214ae7a2f32d reconnecting
Jun 22 16:50:29 node97 kernel: Lustre: 7597:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 16:50:29 node97 kernel: LustreError: 7597:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error (-16)  req at ffff8101cfc78c00 x1371463404703526/t0 o8->632d5b96-c7de-a5c1-1d90-214ae7a2f32d at NET_0x200000b0b0b2b_UUID:0/0 lens 368/264 e 0 to 0 dl 1308732729 ref 1 fl Interpret:/0/0 rc -16/0
Jun 22 16:50:29 node97 kernel: LustreError: 7597:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 1 previous similar message
Jun 22 16:51:40 node97 kernel: Lustre: lustre-OST0000: slow i_mutex 40s due to heavy IO load
Jun 22 16:51:40 node97 kernel: Lustre: lustre-OST0000: slow i_mutex 64s due to heavy IO load
Jun 22 16:51:40 node97 kernel: Lustre: Skipped 28 previous similar messages
Jun 22 16:51:40 node97 kernel: Lustre: lustre-OST0000: slow direct_io 70s due to heavy IO load
Jun 22 16:51:40 node97 kernel: Lustre: lustre-OST0000: slow journal start 70s due to heavy IO load
Jun 22 16:51:40 node97 kernel: Lustre: lustre-OST0000: slow journal start 70s due to heavy IO load
Jun 22 16:51:40 node97 kernel: Lustre: Skipped 1 previous similar message
Jun 22 16:51:40 node97 kernel: Lustre: lustre-OST0000: slow commitrw commit 70s due to heavy IO load
Jun 22 16:51:40 node97 kernel: Lustre: Skipped 1 previous similar message
Jun 22 16:51:40 node97 kernel: Lustre: lustre-OST0000: slow parent lock 70s due to heavy IO load
Jun 22 16:51:40 node97 kernel: Lustre: lustre-OST0000: slow parent lock 70s due to heavy IO load
Jun 22 16:51:40 node97 kernel: Lustre: lustre-OST0000: slow preprw_write setup 70s due to heavy IO load
Jun 22 16:51:40 node97 kernel: Lustre: lustre-OST0000: slow parent lock 40s due to heavy IO load
Jun 22 16:51:40 node97 kernel: Lustre: lustre-OST0000: slow preprw_write setup 40s due to heavy IO load
2011-07-09 



xiaojunhua 

