[Lustre-discuss] lustre + nfs + alphas

Aaron S. Knister aaron at iges.org
Tue Dec 11 15:51:14 PST 2007


This is the strangest problem I have seen. I have a Lustre filesystem mounted on a Linux server, and it's being exported over NFS to various Alpha systems. The Alphas mount it just fine; however, under heavy load the NFS server stops responding, as does the Lustre mount on the export server. The weird thing is that if I mount the NFS export on another machine and run the same benchmark (bonnie), everything is fine. The Lustre mount on the export server can take a real pounding (I've seen it push 300MB/sec), so I don't know why NFS is crashing it.
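
For reference, the setup looks roughly like this (mount points, export options, and the assumption that the MGS lives on the MDS node at 192.168.64.70@o2ib are illustrative, not my exact config):

    # On the Linux export server: mount Lustre as a client...
    mount -t lustre 192.168.64.70@o2ib:/data /mnt/data

    # ...and re-export it. /etc/exports needs an explicit fsid,
    # since Lustre has no stable local device number:
    /mnt/data  *(rw,sync,fsid=1,no_subtree_check)

    # On each Alpha, a plain NFS mount:
    mount -t nfs cpu3:/mnt/data /mnt/data

    # The load that triggers the hang:
    bonnie -d /mnt/data -s 1024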

On the NFS export server I see these messages--


Lustre: 4224:0:(o2iblnd_cb.c:412:kiblnd_handle_rx()) PUT_NACK from 192.168.64.70@o2ib
LustreError: 4400:0:(client.c:969:ptlrpc_expire_one_request()) @@@ timeout (sent at 1197415542, 100s ago)  req@ffff810827bfbc00 x38827/t0 o36->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 14256/672 ref 1 fl Rpc:/0/0 rc 0/-22
Lustre: data-MDT0000-mdc-ffff81082d702000: Connection to service data-MDT0000 via nid 192.168.64.70@o2ib was lost; in progress operations using this service will wait for recovery to complete.
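
The timeout means an MDS RPC (opcode 36 is an MDS reint; per the trace below, an unlink) sat for 100 seconds with no reply. When it happens, these standard lctl checks can be run from the export server (the NID is the MDS from the log above):

    # Is the MDS still reachable over the o2ib LNET?
    lctl ping 192.168.64.70@o2ib

    # List Lustre devices and their states:
    lctl dl

    # The RPC timeout currently in effect (matches the 100s above):
    cat /proc/sys/lustre/timeout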

A trace of the hung nfs daemons reveals the following--

Dec 11 18:46:33 cpu3 kernel: nfsd          S ffff8108246ff008     0  4729      1          4730  4728 (L-TLB)
Dec 11 18:46:33 cpu3 kernel:  ffff81082be0daa0 0000000000000046 ffff810824710740 000064b0886cfdc4
Dec 11 18:46:33 cpu3 kernel:  0000000000000009 ffff81082fc6f7e0 ffffffff802dcae0 000000814fbeae1f
Dec 11 18:46:33 cpu3 kernel:  0000000003d51554 ffff81082fc6f9c8 0000000000000000 ffff8108246ff000
Dec 11 18:46:33 cpu3 kernel: Call Trace:
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff80061839>] schedule_timeout+0x8a/0xad
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff80092b26>] process_timeout+0x0/0x5
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff88700a3d>] :ptlrpc:ptlrpc_queue_wait+0xa9d/0x1250
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff886d67a1>] :ptlrpc:ldlm_resource_putref+0x331/0x3b0
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8870a2c5>] :ptlrpc:lustre_msg_set_flags+0x45/0x120
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff800884f8>] default_wake_function+0x0/0xe
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff887a37d0>] :mdc:mdc_reint+0xc0/0x240
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff887a5c77>] :mdc:mdc_unlink_pack+0x117/0x140
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff887a4ab7>] :mdc:mdc_unlink+0x307/0x3d0
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff801405f7>] __next_cpu+0x19/0x28
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff80087090>] find_busiest_group+0x20d/0x621
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff80009499>] __d_lookup+0xb0/0xff
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8886ced6>] :lustre:ll_unlink+0x1d6/0x370
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8883b791>] :lustre:ll_inode_permission+0xa1/0xc0
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff80047fc8>] vfs_unlink+0xc2/0x108
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8857c57a>] :nfsd:nfsd_unlink+0x1de/0x24b
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff88583e9a>] :nfsd:nfsd3_proc_remove+0xa8/0xb5
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff885791c4>] :nfsd:nfsd_dispatch+0xd7/0x198
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff88488514>] :sunrpc:svc_process+0x44d/0x70b
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff800625bf>] __down_read+0x12/0x92
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8857954d>] :nfsd:nfsd+0x0/0x2db
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff885796fb>] :nfsd:nfsd+0x1ae/0x2db
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8005bfb1>] child_rip+0xa/0x11
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8857954d>] :nfsd:nfsd+0x0/0x2db
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8857954d>] :nfsd:nfsd+0x0/0x2db
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8005bfa7>] child_rip+0x0/0x11
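
Reading the trace bottom-up: svc_process -> nfsd_dispatch -> nfsd_unlink -> vfs_unlink -> ll_unlink -> mdc_unlink -> mdc_reint -> ptlrpc_queue_wait, i.e. the nfsd thread is parked waiting on the same MDS reply that timed out above. These look like sysrq-t output; assuming sysrq is enabled, the same dump can be captured with:

    # Allow sysrq, then dump every task's stack to the kernel log:
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger

    # Then pull the nfsd threads out of the log:
    grep -A 30 'nfsd' /var/log/messages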



