[Lustre-discuss] ost lock up

Papp Tamas tompos at martos.bme.hu
Tue Jan 6 05:45:44 PST 2009


hi All,


I recently set up a new small cluster for backup purposes; 
it has only 2 client nodes.
It's based on CentOS 5.1 and Lustre 1.6.6, and the metadata is on DRBD. The frozen 
OST node is the secondary node for DRBD.

I'm fairly sure this is not the problem, since this scenario was 
working for months with Lustre 1.6.4.3.


While I was making the initial backup (somewhat larger, continuous 
traffic), one of the OST nodes froze up. Actually not the whole node, 
just the 2 OST mounts:

/dev/sdc1 on /mnt/ost-0031 type lustre (rw)
/dev/sdd1 on /mnt/ost-0032 type lustre (rw)


The mounts cannot be unmounted, even with 'umount -f'.

In the logs I see:

Jan  3 11:15:36 backup3 kernel: LustreError: 3271:0:(socklnd.c:1631:ksocknal_destroy_conn()) Completing partial receive from 12345-192.168.0.24@tcp, ip 192.168.0.24:1021, with error
Jan  3 11:15:36 backup3 kernel: LustreError: 3271:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff810026250000
Jan  3 11:15:36 backup3 kernel: LustreError: 11080:0:(ost_handler.c:1153:ost_brw_write()) @@@ network error on bulk GET 0(1048576)  req@ffff81007ad24450 x22232997/t0 o4->32625da1-9d50-a532-53d9-cdd307d13ed4@NET_0x20000c0a80018_UUID:0/0 lens 384/352 e 0 to 0 dl 1230977776 ref 1 fl Interpret:/0/0 rc 0/0
Jan  3 11:15:36 backup3 kernel: Lustre: 11080:0:(ost_handler.c:1284:ost_brw_write()) backup-OST0031: ignoring bulk IO comm error with 32625da1-9d50-a532-53d9-cdd307d13ed4@NET_0x20000c0a80018_UUID id 12345-192.168.0.24@tcp - client will retry
Jan  3 11:15:41 backup3 kernel: Lustre: 11034:0:(ldlm_lib.c:538:target_handle_reconnect()) backup-OST0031: 32625da1-9d50-a532-53d9-cdd307d13ed4 reconnecting
Jan  3 11:15:41 backup3 kernel: Lustre: 11034:0:(ldlm_lib.c:773:target_handle_connect()) backup-OST0031: refuse reconnection from 32625da1-9d50-a532-53d9-cdd307d13ed4@192.168.0.24@tcp to 0xffff810041e7c000; still busy with 2 active RPCs
Jan  3 11:15:41 backup3 kernel: LustreError: 11034:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16)  req@ffff810040ece600 x22233043/t0 o8->32625da1-9d50-a532-53d9-cdd307d13ed4@NET_0x20000c0a80018_UUID:0/0 lens 304/200 e 0 to 0 dl 1230977841 ref 1 fl Interpret:/0/0 rc -16/0
Jan  3 11:16:06 backup3 kernel: Lustre: 11038:0:(ldlm_lib.c:538:target_handle_reconnect()) backup-OST0031: 32625da1-9d50-a532-53d9-cdd307d13ed4 reconnecting
Jan  3 11:16:06 backup3 kernel: Lustre: 11038:0:(ldlm_lib.c:773:target_handle_connect()) backup-OST0031: refuse reconnection from 32625da1-9d50-a532-53d9-cdd307d13ed4@192.168.0.24@tcp to 0xffff810041e7c000; still busy with 2 active RPCs
Jan  3 11:16:06 backup3 kernel: LustreError: 11038:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16)  req@ffff81001aed8e00 x22233117/t0 o8->32625da1-9d50-a532-53d9-cdd307d13ed4@NET_0x20000c0a80018_UUID:0/0 lens 304/200 e 0 to 0 dl 1230977866 ref 1 fl Interpret:/0/0 rc -16/0
Jan  3 11:16:16 backup3 kernel: LustreError: 11083:0:(ost_handler.c:1142:ost_brw_write()) @@@ timeout on bulk GET after 100+0s  req@ffff810041edd800 x22232986/t0 o4->32625da1-9d50-a532-53d9-cdd307d13ed4@NET_0x20000c0a80018_UUID:0/0 lens 384/352 e 0 to 0 dl 1230977776 ref 1 fl Interpret:/0/0 rc 0/0
Jan  3 11:16:16 backup3 kernel: Lustre: 11083:0:(ost_handler.c:1284:ost_brw_write()) backup-OST0031: ignoring bulk IO comm error with 32625da1-9d50-a532-53d9-cdd307d13ed4@NET_0x20000c0a80018_UUID id 12345-192.168.0.24@tcp - client will retry
Jan  3 11:16:40 backup3 kernel: Lustre: 11037:0:(ldlm_lib.c:538:target_handle_reconnect()) backup-OST0031: 32625da1-9d50-a532-53d9-cdd307d13ed4 reconnecting
Jan  3 11:16:40 backup3 kernel: Lustre: 11037:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 1 previous similar message
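(For what it's worth, the reconnect loop above is easy to pull out of the syslog. A minimal sketch, with two abbreviated copies of the messages inlined as sample input; on the real OST node you would grep /var/log/messages instead:)

```shell
# Count how many times the OST refused a client reconnection.
# The here-doc below is a shortened sample of the log lines above;
# substitute /var/log/messages on the actual server.
grep -c 'refuse reconnection' <<'EOF'
Jan  3 11:15:41 backup3 kernel: Lustre: backup-OST0031: refuse reconnection from 32625da1; still busy with 2 active RPCs
Jan  3 11:16:06 backup3 kernel: Lustre: backup-OST0031: refuse reconnection from 32625da1; still busy with 2 active RPCs
EOF
# prints: 2
```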




And after that, many of these:

Jan  3 23:11:08 backup3 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [ll_ost_io_08:11074]
Jan  3 23:11:08 backup3 kernel: CPU 0:
Jan  3 23:11:08 backup3 kernel: Modules linked in: obdfilter(FU) fsfilt_ldiskfs(FU) ost(FU) mgc(FU) ldiskfs(FU) crc16(U) lustre(FU) lov(FU) mdc(FU) lquota(FU) osc(FU) ksocklnd(FU) ptlrpc(FU) obdclass(FU) lnet(FU) lvfs(FU) libcfs(FU) drbd(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U) auth_rpcgss(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) sunrpc(U) dm_mirror(U) dm_multipath(U) dm_mod(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) lp(U) snd_intel8x0(U) snd_ac97_codec(U) ac97_bus(U) snd_seq_dummy(U) snd_seq_oss(U) snd_seq_midi_event(U) snd_seq(U) snd_seq_device(U) snd_pcm_oss(U) snd_mixer_oss(U) k8temp(U) hwmon(U) shpchp(U) snd_pcm(U) k8_edac(U) snd_timer(U) edac_mc(U) snd(U) i2c_nforce2(U) soundcore(U) snd_page_alloc(U) sg(U) forcedeth(U) floppy(U) i2c_core(U) e1000e(U) pcspkr(U) parport_pc(U) parport(U) aacraid(U) pata_acpi(U) sata_nv(U) libata(U) sd_mod(U) scsi_mod(U) raid1(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Jan  3 23:11:08 backup3 kernel: Pid: 11074, comm: ll_ost_io_08 Tainted: GF     2.6.18-92.1.10.el5_lustre.1.6.6custom #2
Jan  3 23:11:08 backup3 kernel: RIP: 0010:[<ffffffff80064a36>]  [<ffffffff80064a36>] _spin_lock+0x9/0xa
Jan  3 23:11:08 backup3 kernel: RSP: 0018:ffff81007db91308  EFLAGS: 00000247
Jan  3 23:11:08 backup3 kernel: RAX: ffff810015153758 RBX: ffff81003fd15258 RCX: 0000000000000007
Jan  3 23:11:08 backup3 kernel: RDX: ffff81003fd15268 RSI: ffff81001bb38000 RDI: ffff810015153778
Jan  3 23:11:08 backup3 kernel: RBP: 0000000000000000 R08: ffff810001578980 R09: 0000000000001000
Jan  3 23:11:08 backup3 kernel: R10: ffff81001bb38000 R11: 0000000000000000 R12: ffff81001fa33ba0
Jan  3 23:11:08 backup3 kernel: R13: 0000000000000000 R14: ffff81002a1164f8 R15: ffffffff88033cfe
Jan  3 23:11:08 backup3 kernel: FS:  00002adcf73f8230(0000) GS:ffffffff803a0000(0000) knlGS:0000000000000000
Jan  3 23:11:08 backup3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan  3 23:11:08 backup3 kernel: CR2: 00002b5de33a4018 CR3: 0000000000201000 CR4: 00000000000006e0
Jan  3 23:11:08 backup3 kernel:
Jan  3 23:11:08 backup3 kernel: Call Trace:
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8891f709>] :ldiskfs:ldiskfs_mb_discard_group_preallocations+0x199/0x3c0
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8891f962>] :ldiskfs:ldiskfs_mb_discard_preallocations+0x32/0x70
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8891fd9b>] :ldiskfs:ldiskfs_mb_new_blocks+0x1cb/0x280
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8897bc11>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x301/0x620
Jan  3 23:11:08 backup3 kernel:  [<ffffffff800194f6>] __getblk+0x25/0x235
Jan  3 23:11:08 backup3 kernel:  [<ffffffff88918857>] :ldiskfs:ldiskfs_ext_walk_space+0x1b7/0x250
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8897b910>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x620
Jan  3 23:11:08 backup3 kernel:  [<ffffffff88977d1e>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0xfe/0x150
Jan  3 23:11:08 backup3 kernel:  [<ffffffff889a9cf8>] :obdfilter:filter_direct_io+0x478/0xcd0
Jan  3 23:11:08 backup3 kernel:  [<ffffffff889ac690>] :obdfilter:filter_commitrw_write+0x1450/0x23d0
Jan  3 23:11:08 backup3 kernel:  [<ffffffff88955621>] :ost:ost_checksum_bulk+0x351/0x590
Jan  3 23:11:08 backup3 kernel:  [<ffffffff889553ac>] :ost:ost_checksum_bulk+0xdc/0x590
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8895bda4>] :ost:ost_brw_write+0x2004/0x2760
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8008ad7e>] default_wake_function+0x0/0xe
Jan  3 23:11:08 backup3 kernel:  [<ffffffff88712148>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8895f3ee>] :ost:ost_handle+0x2eee/0x5bf0
Jan  3 23:11:08 backup3 kernel:  [<ffffffff801437a4>] __next_cpu+0x19/0x28
Jan  3 23:11:08 backup3 kernel:  [<ffffffff80089aa8>] find_busiest_group+0x20d/0x621
Jan  3 23:11:08 backup3 kernel:  [<ffffffff88673031>] :obdclass:class_handle2object+0xd1/0x160
Jan  3 23:11:08 backup3 kernel:  [<ffffffff88711705>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8871b0da>] :ptlrpc:ptlrpc_check_req+0x1a/0x110
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8871d2c2>] :ptlrpc:ptlrpc_server_handle_request+0x992/0x1040
Jan  3 23:11:08 backup3 kernel:  [<ffffffff80062f4b>] thread_return+0x0/0xdf
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8006d940>] do_gettimeofday+0x50/0x92
Jan  3 23:11:08 backup3 kernel:  [<ffffffff885d1476>] :libcfs:lcw_update_time+0x16/0x100
Jan  3 23:11:08 backup3 kernel:  [<ffffffff800893bb>] __wake_up_common+0x3e/0x68
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8872022c>] :ptlrpc:ptlrpc_main+0xe0c/0xf90
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8008ad7e>] default_wake_function+0x0/0xe
Jan  3 23:11:08 backup3 kernel:  [<ffffffff800b4610>] audit_syscall_exit+0x31b/0x336
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8871f420>] :ptlrpc:ptlrpc_main+0x0/0xf90
Jan  3 23:11:08 backup3 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jan  3 23:11:08 backup3 kernel:



$ uname -a
Linux backup3 2.6.18-92.1.10.el5_lustre.1.6.6custom #2 SMP Fri Dec 26 20:27:40 CET 2008 x86_64 x86_64 x86_64 GNU/Linux

DRBD is version 8.3.0, but I think it's OK.
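(Since the frozen node is the DRBD secondary, one quick sanity check is the role and disk state in /proc/drbd. A minimal sketch; the excerpt below is an illustrative sample, and on the node itself you would run the same filter on /proc/drbd directly:)

```shell
# Extract the DRBD role (ro:) and disk state (ds:) fields.
# Sample status line inlined for illustration; on the real node:
#   grep -Eo 'ro:[A-Za-z/]+|ds:[A-Za-z/]+' /proc/drbd
drbd_status=' 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----'
printf '%s\n' "$drbd_status" | grep -Eo 'ro:[A-Za-z/]+|ds:[A-Za-z/]+'
# prints:
#   ro:Secondary/Primary
#   ds:UpToDate/UpToDate
```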

Is this a known bug, or something else?

Thank you,

tamas




