[Lustre-discuss] MDS Crash Urgent Help Need

Mag Gam magawake at gmail.com
Sat Feb 7 09:23:22 PST 2009


Can you please provide some details?

What OS? What kernel version? Did you patch the kernel yourself, or are
you using the prebuilt RPMs?
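
If it helps, here is a minimal sketch for collecting those details in one go. It assumes Python 3.7 or later is available on the MDS and that the node is RPM-based; the function name collect_server_details is only illustrative and is not part of any Lustre tooling.

```python
import platform
import subprocess


def collect_server_details():
    """Gather the basics asked for above: OS, kernel, installed Lustre RPMs."""
    details = {
        "os": platform.platform(),      # distribution / platform string
        "kernel": platform.release(),   # same value as `uname -r`
    }
    try:
        # `rpm -qa` lists installed packages; only meaningful on RPM-based systems.
        result = subprocess.run(
            ["rpm", "-qa"], capture_output=True, text=True, check=False
        )
        details["lustre_rpms"] = sorted(
            line for line in result.stdout.splitlines()
            if "lustre" in line.lower()
        )
    except OSError:
        # rpm is not installed or cannot be executed on this host.
        details["lustre_rpms"] = None
    return details


if __name__ == "__main__":
    for key, value in collect_server_details().items():
        print(f"{key}: {value}")
```

An empty package list together with a kernel release that does not look like a Lustre build would suggest a hand-patched kernel rather than the RPMs.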



On Fri, Feb 6, 2009 at 9:12 PM, Mahmoud Hanafi <trek5200trek at yahoo.com> wrote:
> We had an MDS crash, and the subsequent reboot results in a panic. Any help
> would be greatly appreciated.
>
> This error appears to be the key event.
>
> Feb  6 13:51:58 service100 kernel: LustreError:
> 6976:0:(llog_obd.c:211:llog_add()) No ctxt
>
> Thanks,
> Mahmoud Hanafi
>
> Feb  6 13:39:14 service100 kernel: Lustre: m45_nb1-MDT0000: recovery
> complete: rc 0
> Feb  6 13:39:15 service100 kernel: LustreError:
> 6597:0:(llog_obd.c:211:llog_add()) No ctxt
> Feb  6 13:39:15 service100 kernel: LustreError:
> 6597:0:(llog_obd.c:211:llog_add()) Skipped 909 previous similar messages
> Feb  6 13:39:15 service100 kernel: Lustre: MDS m45_nb1-MDT0000:
> m45_nb1-OST0000_UUID now active, resetting orphans
> Feb  6 13:39:15 service100 kernel: Lustre: MDS m45_nb1-MDT0000:
> m45_nb1-OST0001_UUID now active, resetting orphans
> Feb  6 13:39:15 service100 kernel: LustreError:
> 6496:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile
> 0x11b80054:0x10703925: rc -2
> Feb  6 13:39:15 service100 kernel: LustreError:
> 6496:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id
> 0x11b80054:10703925: rc -2
> Feb  6 13:39:15 service100 kernel: LustreError:
> 6496:0:(llog_cat.c:330:llog_cat_cancel_records()) Cannot find log 0x11b80054
> Feb  6 13:39:15 service100 kernel: LustreError:
> 6497:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile
> 0x11b8004f:0x10703922: rc -2
> Feb  6 13:39:15 service100 kernel: LustreError:
> 6497:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id
> 0x11b8004f:10703922: rc -2
> Feb  6 13:39:15 service100 kernel: LustreError:
> 6497:0:(llog_cat.c:330:llog_cat_cancel_records()) Cannot find log 0x11b8004f
> Feb  6 13:39:15 service100 kernel: LustreError:
> 6497:0:(llog_server.c:447:llog_origin_handle_cancel()) cancel 124
> llog-records failed: -22
> Feb  6 13:39:15 service100 kernel: LustreError:
> 6496:0:(llog_server.c:447:llog_origin_handle_cancel()) cancel 124
> llog-records failed: -22
> Feb  6 13:39:15 service100 kernel: Lustre: MDS m45_nb1-MDT0000:
> m45_nb1-OST0007_UUID now active, resetting orphans
> Feb  6 13:39:15 service100 kernel: Lustre: Skipped 5 previous similar
> messages
> Feb  6 13:39:16 service100 kernel: LustreError:
> 6497:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile
> 0x11b80052:0x1070392a: rc -2
> Feb  6 13:39:16 service100 kernel: LustreError:
> 6497:0:(llog_lvfs.c:612:llog_lvfs_create()) Skipped 6 previous similar
> messages
> Feb  6 13:39:16 service100 kernel: LustreError:
> 6497:0:(llog_cat.c:176:llog_cat_id2handle()) error opening log id
> 0x11b80052:1070392a: rc -2
> Feb  6 13:39:16 service100 kernel: LustreError:
> 6497:0:(llog_cat.c:176:llog_cat_id2handle()) Skipped 6 previous similar
> messages
> Feb  6 13:39:16 service100 kernel: LustreError:
> 6497:0:(llog_cat.c:330:llog_cat_cancel_records()) Cannot find log 0x11b80052
> Feb  6 13:39:16 service100 kernel: LustreError:
> 6497:0:(llog_cat.c:330:llog_cat_cancel_records()) Skipped 6 previous similar
> messages
> Feb  6 13:39:16 service100 kernel: LustreError:
> 6499:0:(llog_server.c:447:llog_origin_handle_cancel()) cancel 124
> llog-records failed: -22
>
>
> Feb  6 13:51:51 service100 kernel: LDISKFS-fs warning: maximal mount count
> reached, running e2fsck is recommended
> Feb  6 13:51:51 service100 kernel: LDISKFS FS on sde1, internal journal
> Feb  6 13:51:51 service100 kernel: LDISKFS-fs: recovery complete.
> Feb  6 13:51:51 service100 kernel: LDISKFS-fs: mounted filesystem with
> ordered data mode.
> Feb  6 13:51:51 service100 kernel: kjournald starting.  Commit interval 5
> seconds
> Feb  6 13:51:51 service100 kernel: LDISKFS-fs warning: maximal mount count
> reached, running e2fsck is recommended
> Feb  6 13:51:51 service100 kernel: LDISKFS FS on sde1, internal journal
> Feb  6 13:51:51 service100 kernel: LDISKFS-fs: mounted filesystem with
> ordered data mode.
> Feb  6 13:51:51 service100 kernel: Lustre: Added LNI 10.151.25.163@o2ib
> [8/64]
> Feb  6 13:51:51 service100 kernel: LustreError: 137-5: UUID 'MGS' is not
> available  for connect (not set up)
> Feb  6 13:51:51 service100 kernel: LustreError:
> 6798:0:(mgs_handler.c:647:mgs_handle()) MGS handle cmd=250 rc=-19
> Feb  6 13:51:51 service100 kernel: LustreError:
> 6798:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)
> req@ffff8107fb3c3050 x4876961/t0 o250-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl
> 1233957211 ref 1 fl Interpret:/0/0 rc -19/0
> Feb  6 13:51:51 service100 kernel: Lustre: MGS MGS started
> Feb  6 13:51:51 service100 kernel: Lustre: Server MGS on device /dev/sde1
> has started
> Feb  6 13:51:56 service100 kernel: (fs/jbd/recovery.c, 255):
> journal_recover: JBD: recovery, exit status 0, recovered transactions
> 2765219 to 2765244
> Feb  6 13:51:56 service100 kernel: (fs/jbd/recovery.c, 257):
> journal_recover: JBD: Replayed 17611 and revoked 0/15 blocks
> Feb  6 13:51:56 service100 kernel: kjournald starting.  Commit interval 5
> seconds
> Feb  6 13:51:57 service100 kernel: LDISKFS FS on sde2, internal journal
> Feb  6 13:51:57 service100 kernel: LDISKFS-fs: recovery complete.
> Feb  6 13:51:57 service100 kernel: LDISKFS-fs: mounted filesystem with
> ordered data mode.
> Feb  6 13:51:57 service100 kernel: kjournald starting.  Commit interval 5
> seconds
> Feb  6 13:51:57 service100 kernel: LDISKFS FS on sde2, internal journal
> Feb  6 13:51:57 service100 kernel: LDISKFS-fs: mounted filesystem with
> ordered data mode.
> Feb  6 13:51:57 service100 kernel: LustreError: 137-5: UUID
> 'm45_nb1-MDT0000_UUID' is not available  for connect (no target)
> Feb  6 13:51:57 service100 kernel: LustreError:
> 6854:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-19)
> req@ffff8107db2d5400 x6027981/t0 o38-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl
> 1233957217 ref 1 fl Interpret:/0/0 rc -19/0
> Feb  6 13:51:58 service100 kernel: Lustre: Enabling user_xattr
> Feb  6 13:51:58 service100 kernel: Lustre: Enabling ACL
> Feb  6 13:51:58 service100 kernel: Lustre:
> 6923:0:(mds_fs.c:493:mds_init_server_data()) RECOVERY: service
> m45_nb1-MDT0000, 5893 recoverable clients, last_transno 5429096891
> Feb  6 13:51:58 service100 kernel: Lustre:
> 6923:0:(mds_lov.c:1070:mds_notify()) MDS m45_nb1-MDT0000: in recovery, not
> resetting orphans on m45_nb1-OST0000_UUID
> Feb  6 13:51:58 service100 kernel: LustreError:
> 6923:0:(obd_class.h:339:obd_get_info()) obd_get_info: NULL export
> Feb  6 13:51:58 service100 kernel: LustreError:
> 6923:0:(lov_obd.c:455:lov_connect()) m45_nb1-mdtlov error sending notify -19
> Feb  6 13:51:58 service100 kernel: Lustre:
> 6923:0:(mds_lov.c:1070:mds_notify()) MDS m45_nb1-MDT0000: in recovery, not
> resetting orphans on m45_nb1-OST0003_UUID
> Feb  6 13:51:58 service100 kernel: Lustre:
> 6923:0:(mds_lov.c:1070:mds_notify()) Skipped 2 previous similar messages
> Feb  6 13:51:58 service100 kernel: LustreError:
> 6923:0:(obd_class.h:339:obd_get_info()) obd_get_info: NULL export
> Feb  6 13:51:58 service100 kernel: LustreError:
> 6923:0:(obd_class.h:339:obd_get_info()) Skipped 2 previous similar messages
> Feb  6 13:51:58 service100 kernel: LustreError:
> 6923:0:(lov_obd.c:455:lov_connect()) m45_nb1-mdtlov error sending notify -19
> Feb  6 13:51:58 service100 kernel: LustreError:
> 6923:0:(lov_obd.c:455:lov_connect()) Skipped 2 previous similar messages
> Feb  6 13:51:58 service100 kernel: Lustre: MDT m45_nb1-MDT0000 now serving
> dev (m45_nb1-MDT0000/c528a9db-4b84-a59c-41b6-ad3a6ec11fbf), but will be in
> recovery for at least 5:00, or until 5893 clients reconnect. During this
> time new clients will not be allowed to connect. Recovery progress can be
> monitored by watching /proc/fs/lustre/mds/m45_nb1-MDT0000/recovery_status.
> Feb  6 13:51:58 service100 kernel: Lustre:
> 6923:0:(lproc_mds.c:273:lprocfs_wr_group_upcall()) m45_nb1-MDT0000: group
> upcall set to NONE
> Feb  6 13:51:58 service100 kernel: Lustre: m45_nb1-MDT0000.mdt: set
> parameter group_upcall=NONE
> Feb  6 13:51:58 service100 kernel: Lustre: m45_nb1-MDT0000: temporarily
> refusing client connection from 10.151.9.169@o2ib
> Feb  6 13:51:58 service100 kernel: Lustre: m45_nb1-MDT0000: temporarily
> refusing client connection from 10.151.6.241@o2ib
> Feb  6 13:51:58 service100 kernel: Lustre: m45_nb1-MDT0000.mdt: set
> parameter quota_type=u2
> Feb  6 13:51:58 service100 kernel: Lustre:
> 6861:0:(ldlm_lib.c:1226:check_and_start_recovery_timer()) m45_nb1-MDT0000:
> starting recovery timer
> Feb  6 13:51:58 service100 kernel: Lustre:
> 6882:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) m45_nb1-MDT0000:
> 5892 recoverable clients remain
> Feb  6 13:51:58 service100 kernel: Lustre:
> 6868:0:(mds_open.c:835:mds_open_by_fid()) Orphan 53f26ea:0f8c9a49 found and
> opened in PENDING directory
> Feb  6 13:51:58 service100 kernel: Lustre:
> 6870:0:(mds_open.c:835:mds_open_by_fid()) Orphan 5482886:0fa65d97 found and
> opened in PENDING directory
> Feb  6 13:51:58 service100 kernel: Lustre:
> 6869:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) m45_nb1-MDT0000:
> 5891 recoverable clients remain
> Feb  6 13:51:58 service100 kernel: Lustre:
> 7002:0:(mds_open.c:835:mds_open_by_fid()) Orphan 54820e2:0fa5e637 found and
> opened in PENDING directory
> Feb  6 13:51:58 service100 kernel: Lustre:
> 7002:0:(mds_open.c:835:mds_open_by_fid()) Skipped 137 previous similar
> messages
> Feb  6 13:51:58 service100 kernel: LustreError:
> 6976:0:(llog_obd.c:211:llog_add()) No ctxt
> Feb  6 13:51:58 service100 kernel: Lustre:
> 6885:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) m45_nb1-MDT0000:
> 5861 recoverable clients remain
> Feb  6 13:51:58 service100 kernel: Lustre:
> 6885:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 29
> previous similar messages
> Feb  6 13:51:59 service100 kernel: Lustre:
> 6875:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) m45_nb1-MDT0000:
> 5565 recoverable clients remain
> Feb  6 13:51:59 service100 kernel: Lustre:
> 6875:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 295
> previous similar messages
> Feb  6 13:51:59 service100 kernel: Lustre:
> 6974:0:(mds_open.c:835:mds_open_by_fid()) Orphan 530890c:0fad2c72 found and
> opened in PENDING directory
> Feb  6 13:51:59 service100 kernel: Lustre:
> 6974:0:(mds_open.c:835:mds_open_by_fid()) Skipped 713 previous similar
> messages
> Feb  6 13:52:01 service100 kernel: Lustre:
> 6881:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) m45_nb1-MDT0000:
> 4755 recoverable clients remain
> Feb  6 13:52:01 service100 kernel: Lustre:
> 6881:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 809
> previous similar messages
> Feb  6 13:52:01 service100 kernel: Lustre:
> 6866:0:(mds_open.c:835:mds_open_by_fid()) Orphan 54c028f:0fad585d found and
> opened in PENDING directory
> Feb  6 13:52:01 service100 kernel: Lustre:
> 6866:0:(mds_open.c:835:mds_open_by_fid()) Skipped 1691 previous similar
> messages
> Feb  6 13:52:05 service100 kernel: Lustre:
> 6865:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) m45_nb1-MDT0000:
> 3930 recoverable clients remain
> Feb  6 13:52:05 service100 kernel: Lustre:
> 6865:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 824
> previous similar messages
> Feb  6 13:52:05 service100 kernel: Lustre:
> 6968:0:(mds_open.c:835:mds_open_by_fid()) Orphan 54cd214:0fabedc6 found and
> opened in PENDING directory
> Feb  6 13:52:05 service100 kernel: Lustre:
> 6968:0:(mds_open.c:835:mds_open_by_fid()) Skipped 2113 previous similar
> messages
> Feb  6 13:52:13 service100 kernel: Lustre:
> 6879:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) m45_nb1-MDT0000:
> 2153 recoverable clients remain
> Feb  6 13:52:13 service100 kernel: Lustre:
> 6879:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 1775
> previous similar messages
> Feb  6 13:52:13 service100 kernel: Lustre:
> 6872:0:(mds_open.c:835:mds_open_by_fid()) Orphan 52f9aea:0f799bf6 found and
> opened in PENDING directory
> Feb  6 13:52:13 service100 kernel: Lustre:
> 6872:0:(mds_open.c:835:mds_open_by_fid()) Skipped 3299 previous similar
> messages
> Feb  6 13:53:31 service100 kernel: Lustre:
> 7002:0:(mds_open.c:835:mds_open_by_fid()) Orphan 52f9af7:0f7b104d found and
> opened in PENDING directory
> Feb  6 13:53:31 service100 kernel: Lustre:
> 7002:0:(mds_open.c:835:mds_open_by_fid()) Skipped 1232 previous similar
> messages
> Feb  6 13:53:31 service100 kernel: Lustre:
> 6983:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) m45_nb1-MDT0000:
> 1295 recoverable clients remain
> Feb  6 13:53:31 service100 kernel: Lustre:
> 6983:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 856
> previous similar messages
> Feb  6 13:53:38 service100 kernel: Lustre:
> 7006:0:(ldlm_lib.c:538:target_handle_reconnect()) m45_nb1-MDT0000:
> 9236b2bf-92ee-fc8b-c7f2-e3563a377de0 reconnecting
> Feb  6 13:53:38 service100 kernel: Lustre:
> 7006:0:(ldlm_lib.c:773:target_handle_connect()) m45_nb1-MDT0000: refuse
> reconnection from 9236b2bf-92ee-fc8b-c7f2-e3563a377de0@10.151.81.216@o2ib to
> 0xffff8107cd932000; still busy with 2 active RPCs
> Feb  6 13:53:38 service100 kernel: LustreError:
> 7006:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16)
> req@ffff810725625800 x5375830/t0
> o38->9236b2bf-92ee-fc8b-c7f2-e3563a377de0@NET_0x500000a9751d8_UUID:0/0 lens
> 304/200 e 0 to 0 dl 1233957318 ref 1 fl Interpret:/0/0 rc -16/0
> Feb  6 13:53:38 service100 kernel: LustreError:
> 7006:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 38 previous similar
> messages
> Feb  6 13:53:38 service100 kernel: Lustre:
> 6971:0:(ldlm_lib.c:538:target_handle_reconnect()) m45_nb1-MDT0000:
> a432bfb8-6afb-cc67-d49e-8e1ba23de270 reconnecting
> Feb  6 13:53:38 service100 kernel: LustreError:
> 6982:0:(ldlm_lib.c:1434:target_queue_recovery_request()) @@@ dropping resent
> queued req  req@ffff81072508f400 x5066410/t0
> o101->a432bfb8-6afb-cc67-d49e-8e1ba23de270@NET_0x500000a97482f_UUID:0/0 lens
> 512/0 e 0 to 0 dl 1233957318 ref 1 fl Interpret:/6/0 rc 0/0
> Feb  6 13:53:39 service100 kernel: Lustre:
> 6986:0:(ldlm_lib.c:538:target_handle_reconnect()) m45_nb1-MDT0000:
> bacd25ef-2f62-e88e-b080-d129171a0666 reconnecting
> Feb  6 13:53:39 service100 kernel: LustreError:
> 7008:0:(ldlm_lib.c:1434:target_queue_recovery_request()) @@@ dropping resent
> queued req  req@ffff8107257f1000 x32603191/t0
> o36->bacd25ef-2f62-e88e-b080-d129171a0666@NET_0x500000a97055a_UUID:0/0 lens
> 336/0 e 0 to 0 dl 1233957319 ref 1 fl Interpret:/6/0 rc 0/0
> Feb  6 13:53:41 service100 kernel: Lustre:
> 6861:0:(ldlm_lib.c:538:target_handle_reconnect()) m45_nb1-MDT0000:
> 9e70be4b-f534-5f36-39ca-cbd3f398981f reconnecting
> Feb  6 13:53:41 service100 kernel: LustreError:
> 6919:0:(ldlm_lib.c:1434:target_queue_recovery_request()) @@@ dropping resent
> queued req  req@ffff81072562ca00 x5621783/t0
> o35->9e70be4b-f534-5f36-39ca-cbd3f398981f@NET_0x500000a97131b_UUID:0/0 lens
> 296/0 e 0 to 0 dl 1233957321 ref 1 fl Interpret:/6/0 rc 0/0
> Feb  6 13:53:43 service100 kernel: Lustre:
> 6869:0:(ldlm_lib.c:538:target_handle_reconnect()) m45_nb1-MDT0000:
> 668fb888-f573-8a5d-656d-f0f6943b261d reconnecting
> Feb  6 13:53:43 service100 kernel: LustreError:
> 6994:0:(ldlm_lib.c:1434:target_queue_recovery_request()) @@@ dropping resent
> queued req  req@ffff810725028600 x5087532/t0
> o36->668fb888-f573-8a5d-656d-f0f6943b261d@NET_0x500000a970477_UUID:0/0 lens
> 360/0 e 0 to 0 dl 1233957323 ref 1 fl Interpret:/6/0 rc 0/0
> Feb  6 13:53:48 service100 kernel: Lustre:
> 6881:0:(ldlm_lib.c:538:target_handle_reconnect()) m45_nb1-MDT0000:
> 37479c6c-952d-1e5b-f28b-08a886b21994 reconnecting
> Feb  6 13:53:48 service100 kernel: LustreError:
> 6918:0:(ldlm_lib.c:1434:target_queue_recovery_request()) @@@ dropping resent
> queued req  req@ffff81072504ca00 x2467781/t0
> o35->37479c6c-952d-1e5b-f28b-08a886b21994@NET_0x500000a970bba_UUID:0/0 lens
> 296/0 e 0 to 0 dl 1233957328 ref 1 fl Interpret:/6/0 rc 0/0
> Feb  6 13:53:53 service100 kernel: LustreError:
> 6859:0:(ldlm_lib.c:1434:target_queue_recovery_request()) @@@ dropping resent
> queued req  req@ffff81072504ca00 x5025647/t0
> o101->d5ff86c7-b54a-57cf-1948-928fac
> Feb  6 13:54:02 service100 kernel: LustreError:
> 6965:0:(llog_obd.c:211:llog_add()) No ctxt
> Feb  6 13:54:28 service100 kernel: Lustre:
> 6870:0:(ldlm_lib.c:538:target_handle_reconnect()) m45_nb1-MDT0000:
> 9236b2bf-92ee-fc8b-c7f2-e3563a377de0 reconnecting
> Feb  6 13:54:28 service100 kernel: Lustre:
> 6870:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 2 previous similar
> messages
> Feb  6 13:54:28 service100 kernel: Lustre:
> 6870:0:(ldlm_lib.c:773:target_handle_connect()) m45_nb1-MDT0000: refuse
> reconnection from 9236b2bf-92ee-fc8b-c7f2-e3563a377de0@10.151.81.216@o2ib to
> 0xffff8107cd932000; still busy with 2 active RPCs
> Feb  6 13:54:28 service100 kernel: LustreError:
> 6870:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16)
> req@ffff810724675a00 x5375927/t0
> o38->9236b2bf-92ee-fc8b-c7f2-e3563a377de0@NET_0x500000a9751d8_UUID:0/0 lens
> 304/200 e 0 to 0 dl 1233957368 ref 1 fl Interpret:/0/0 rc -16/0
> Feb  6 13:54:53 service100 kernel: Lustre:
> 6994:0:(ldlm_lib.c:538:target_handle_reconnect()) m45_nb1-MDT0000:
> 9236b2bf-92ee-fc8b-c7f2-e3563a377de0 reconnecting
> Feb  6 13:54:53 service100 kernel: Lustre:
> 6994:0:(ldlm_lib.c:773:target_handle_connect()) m45_nb1-MDT0000: refuse
> reconnection from 9236b2bf-92ee-fc8b-c7f2-e3563a377de0@10.151.81.216@o2ib to
> 0xffff8107cd932000; still busy with 2 active RPCs
> Feb  6 13:54:53 service100 kernel: LustreError:
> 6994:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16)
> req@ffff81072458fe00 x5376006/t0
> o38->9236b2bf-92ee-fc8b-c7f2-e3563a377de0@NET_0x500000a9751d8_UUID:0/0 lens
> 304/200 e 0 to 0 dl 1233957393 ref 1 fl Interpret:/0/0 rc -16/0
> Feb  6 13:55:18 service100 kernel: Lustre:
> 6968:0:(ldlm_lib.c:773:target_handle_connect()) m45_nb1-MDT0000: refuse
> reconnection from 9236b2bf-92ee-fc8b-c7f2-e3563a377de0@10.151.81.216@o2ib to
> 0xffff8107cd932000; still busy with 2 active RPCs
> Feb  6 13:55:18 service100 kernel: LustreError:
> 6968:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16)
> req@ffff8107247fee00 x5376085/t0
> o38->9236b2bf-92ee-fc8b-c7f2-e3563a377de0@NET_0x500000a9751d8_UUID:0/0 lens
> 304/200 e 0 to 0 dl 1233957418 ref 1 fl Interpret:/0/0 rc -16/0
> Feb  6 13:55:18 service100 kernel: Lustre: 0:0:(watchdog.c:148:lcw_cb())
> Watchdog triggered for pid 6965: it was inactive for 200s
> Feb  6 13:55:18 service100 kernel: Lustre:
> 0:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process
> 6965
> Feb  6 13:55:18 service100 kernel: ll_mdt_33     S ffffffffffffffff     0
> 6965      1          6966  6964 (L-TLB)
> Feb  6 13:55:18 service100 kernel: ffff8107cabf3b28 0000000000000046
> 0000000000001705 000000000000000a
> Feb  6 13:55:18 service100 kernel:        ffff8108134f8a48 ffff8108134f87f0
> ffff810009059800 0000005b41b8f6c4
> Feb  6 13:55:18 service100 kernel:        0000000000001735 0000000300000000
> Feb  6 13:55:18 service100 kernel: Call Trace:
> <ffffffff885fe428>{:ptlrpc:target_queue_recovery_request+2792}
> Feb  6 13:55:18 service100 kernel:
> <ffffffff8012c8a9>{default_wake_function+0}
> <ffffffff8873ad91>{:mds:mds_handle+2273}
> Feb  6 13:55:18 service100 kernel:
> <ffffffff8833aa71>{:lnet:lnet_match_blocked_msg+961}
> Feb  6 13:55:18 service100 kernel:
> <ffffffff80305642>{thread_return+0}
> <ffffffff88393995>{:obdclass:class_handle2object+213}
> Feb  6 13:55:18 service100 kernel:
> <ffffffff8862e765>{:ptlrpc:lustre_msg_get_conn_cnt+53}
> Feb  6 13:55:18 service100 kernel:
> <ffffffff8012bac9>{find_busiest_group+360}
> <ffffffff8863860a>{:ptlrpc:ptlrpc_check_req+26}
> Feb  6 13:55:18 service100 kernel:
> <ffffffff8863a867>{:ptlrpc:ptlrpc_server_handle_request+2503}
> Feb  6 13:55:18 service100 kernel:
> <ffffffff8010f239>{do_gettimeofday+92}
> <ffffffff882fa3d6>{:libcfs:lcw_update_time+38}
> Feb  6 13:55:19 service100 kernel:
> <ffffffff8013d49d>{__mod_timer+173}
> <ffffffff8863d9d1>{:ptlrpc:ptlrpc_main+3745}
> Feb  6 13:55:19 service100 kernel:
> <ffffffff8012c8a9>{default_wake_function+0} <ffffffff8010bfc2>{child_rip+8}
> Feb  6 13:55:19 service100 kernel:
> <ffffffff8863cb30>{:ptlrpc:ptlrpc_main+0} <ffffffff8010bfba>{child_rip+0}
> Feb  6 13:55:19 service100 kernel: LustreError: dumping log to
> /tmp/lustre-log.1233957318.6965
> Feb  6 13:55:21 service100 kernel: LustreError:
> 6919:0:(ldlm_lib.c:1434:target_queue_recovery_request()) @@@ dropping resent
> queued req  req@ffff810724035200 x5621783/t0
> o35->9e70be4b-f534-5f36-39ca-cbd3f398981f@NET_0x500000a97131b_UUID:0/0 lens
> 296/0 e 0 to 0 dl 1233957421 ref 1 fl Interpret:/6/0 rc 0/0
> Feb  6 13:55:21 service100 kernel: LustreError:
> 6919:0:(ldlm_lib.c:1434:target_queue_recovery_request()) Skipped 1 previous
> similar message
> Feb  6 13:55:43 service100 kernel: Lustre:
> 6861:0:(ldlm_lib.c:538:target_handle_reconnect()) m45_nb1-MDT0000:
> 9236b2bf-92ee-fc8b-c7f2-e3563a377de0 reconnecting
> Feb  6 13:55:43 service100 kernel: Lustre:
> 6861:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 2 previous similar
> messages
> Feb  6 13:55:43 service100 kernel: Lustre:
> 6861:0:(ldlm_lib.c:773:target_handle_connect()) m45_nb1-MDT0000: refuse
> reconnection from 9236b2bf-92ee-fc8b-c7f2-e3563a377de0@10.151.81.216@o2ib to
> 0xffff8107cd932000; still busy with 2 active RPCs
> Feb  6 13:55:43 service100 kernel: LustreError:
> 6861:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16)
> req@ffff8107246d9600 x5376164/t0
> o38->9236b2bf-92ee-fc8b-c7f2-e3563a377de0@NET_0x500000a9751d8_UUID:0/0 lens
> 304/200 e 0 to 0 dl 1233957443 ref 1 fl Interpret:/0/0 rc -16/0
> Feb  6 13:56:08 service100 kernel: Lustre:
> 7009:0:(ldlm_lib.c:773:target_handle_connect()) m45_nb1-MDT0000: refuse
> reconnection from 9236b2bf-92ee-fc8b-c7f2-e3563a377de0@10.151.81.216@o2ib to
> 0xffff8107cd932000; still busy with 2 active RPCs
> Feb  6 13:56:08 service100 kernel: LustreError:
> 7009:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16)
> req@ffff8107247a0400 x5376243/t0
> o38->9236b2bf-92ee-fc8b-c7f2-e3563a377de0@NET_0x500000a9751d8_UUID:0/0 lens
> 304/200 e 0 to 0 dl 1233957468 ref 1 fl Interpret:/0/0 rc -16/0
> Feb  6 13:56:33 service100 kernel: Lustre:
> 6855:0:(ldlm_lib.c:773:target_handle_connect()) m45_nb1-MDT0000: refuse
> reconnection from 9236b2bf-92ee-fc8b-c7f2-e3563a377de0@10.151.81.216@o2ib to
> 0xffff8107cd932000; still busy with 2 active RPCs
> Feb  6 13:56:58 service100 kernel: Lustre:
> 7008:0:(ldlm_lib.c:538:target_handle_reconnect()) m45_nb1-MDT0000:
> 6c3e80bd-92fb-8a7c-5bd9-72bc744956fc reconnecting
> Feb  6 13:56:58 service100 kernel: Lustre:
> 7008:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 2 previous similar
> messages
> Feb  6 13:56:58 service100 kernel: Lustre:
> 6973:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) m45_nb1-MDT0000: 3
> recoverable clients remain
> Feb  6 13:56:58 service100 kernel: Lustre:
> 6973:0:(ldlm_lib.c:1567:target_queue_last_replay_reply()) Skipped 1292
> previous similar messages
> Feb  6 13:56:58 service100 kernel: Lustre: Parent 87005581/3805388507 lookup
> error -2. Evicting client 7a197206-3055-fbec-480a-93bdd6753834 with export
> 10.151.77.211@o2ib.
> Feb  6 13:56:58 service100 kernel: LustreError:
> 6983:0:(handler.c:1590:mds_handle()) operation 101 on unconnected MDS from
> 12345-10.151.77.211@o2ib
> Feb  6 13:56:58 service100 kernel: LustreError:
> 6983:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error
> (-107)  req@ffff8107243ece00 x5257230/t0 o101-><?>@<?>:0/0 lens 232/0 e 0 to
> 0 dl 1233957518 ref 1 fl Interpret:/0/0 rc -107/0
> Feb  6 13:56:58 service100 kernel: LustreError:
> 6983:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 1 previous similar
> message
>
>
>
>
>


