[Lustre-discuss] client randomly evicted

Aaron Knister aaron at iges.org
Wed Apr 30 08:40:04 PDT 2008


Some more information that might be helpful: one of our users runs a
particular code. Personally, after the trouble this code has caused
us, we'd like to hand him a calculator and disable his accounts, but
sadly that's not an option. Since the hang, there seems to be one
Lustre-related process running under the problem user's userid:
"ll_sa_15530". A trace of this process in its current state shows:

Apr 30 11:29:30 cola10 kernel: ll_sa_15530   S 0000000000000000     0 15531      1         17700 18228 (L-TLB)
Apr 30 11:29:30 cola10 kernel:  ffff810116c31c10 0000000000000046 ffff81013e7747a0 ffffffff80087d0e
Apr 30 11:29:30 cola10 kernel:  0000000000000007 ffff81003a76b040 ffff81012f11f0c0 000fcb5175eba398
Apr 30 11:29:30 cola10 kernel:  0000000000001407 ffff81003a76b228 0000000000000001 0000000000000068
Apr 30 11:29:30 cola10 kernel: Call Trace:
Apr 30 11:29:30 cola10 kernel:  [<ffffffff80087d0e>] enqueue_task+0x41/0x56
Apr 30 11:29:30 cola10 kernel:  [<ffffffff8862b7e4>] :ptlrpc:ldlm_prep_enqueue_req+0x1b4/0x2e0
Apr 30 11:29:30 cola10 kernel:  [<ffffffff886e528c>] :mdc:mdc_req_avail+0x6c/0xf0
Apr 30 11:29:30 cola10 kernel:  [<ffffffff886e6275>] :mdc:mdc_enter_request+0x145/0x1e0
Apr 30 11:29:30 cola10 kernel:  [<ffffffff800884ed>] default_wake_function+0x0/0xe
Apr 30 11:29:30 cola10 kernel:  [<ffffffff886e6410>] :mdc:mdc_intent_lookup_pack+0xd0/0xf0
Apr 30 11:29:30 cola10 kernel:  [<ffffffff886e6644>] :mdc:mdc_intent_getattr_async+0x214/0x420
Apr 30 11:29:30 cola10 kernel:  [<ffffffff887ae63d>] :lustre:ll_i2gids+0x5d/0x150
Apr 30 11:29:30 cola10 kernel:  [<ffffffff887b94c5>] :lustre:ll_statahead_thread+0xf75/0x1810
Apr 30 11:29:30 cola10 kernel:  [<ffffffff800884ed>] default_wake_function+0x0/0xe
Apr 30 11:29:30 cola10 kernel:  [<ffffffff8005bfb1>] child_rip+0xa/0x11
Apr 30 11:29:30 cola10 kernel:  [<ffffffff887b8550>] :lustre:ll_statahead_thread+0x0/0x1810
Apr 30 11:29:30 cola10 kernel:  [<ffffffff8005bfa7>] child_rip+0x0/0x11

Is this a problem with the Lustre statahead code? If so, would this
fix it? "echo 0 > /proc/fs/lustre/llite/*/statahead_count"
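
A side note on that echo: if more than one Lustre filesystem is mounted,
the glob expands to several paths and bash rejects the redirect as
ambiguous, so each file has to be written on its own. Below is a rough
sketch of that loop in Python. It assumes the tunable really is named
statahead_count on this Lustre version (not confirmed here);
disable_statahead and its root/name arguments are names made up for
illustration.

```python
import glob
import os


def disable_statahead(root="/proc/fs/lustre/llite", name="statahead_count"):
    """Write 0 to the statahead tunable of every mounted Lustre client.

    `name` is an assumption copied from the echo command above and may
    differ between Lustre versions. Returns the list of files written.
    """
    written = []
    # Each mounted filesystem has its own directory under `root`.
    for path in sorted(glob.glob(os.path.join(root, "*", name))):
        with open(path, "w") as f:
            f.write("0\n")
        written.append(path)
    return written
```

Run as root, disable_statahead() returns the paths it touched; an empty
list means the glob matched nothing, which would suggest the tunable has
a different name on that client.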

Thank you so much for all your help.

-Aaron

On Apr 30, 2008, at 11:16 AM, Aaron S. Knister wrote:

> I have a lustre client that was randomly evicted early this morning.  
> The errors from the dmesg are below. It's running infiniband. There  
> were no infiniband errors that I could tell and all the mds/mgs and  
> oss's said was "haven't heard from client xyz in 2277 seconds.  
> Evicting". The client has halfway come back and now shows this -
>
>
> aaron@cola10:~ $ lfs df -h
> UUID                     bytes      Used Available  Use% Mounted on
> data-MDT0000_UUID        87.5G      6.4G     81.1G    7% /data[MDT:0]
> data-OST0000_UUID         5.4T      4.9T    439.6G   92% /data[OST:0]
> data-OST0001_UUID   : inactive device
> data-OST0002_UUID   : inactive device
> data-OST0003_UUID   : inactive device
> data-OST0004_UUID   : inactive device
> data-OST0005_UUID   : inactive device
> data-OST0006_UUID   : inactive device
> data-OST0007_UUID   : inactive device
> data-OST0008_UUID   : inactive device
> data-OST0009_UUID   : inactive device
>
> filesystem summary:       5.4T      4.9T    439.6G   92% /data
>
> so it's reconnected to only one of the 10 OSTs. I tried to do an
> "lctl --device {device} reconnect" and it said "Error: Operation in
> progress". I have no idea what went wrong, and I'm confident a reboot
> would fix it, but I'd like to avoid one if possible.
>
>
> Thanks in advance.
>
> LustreError: 11-0: an error occurred while communicating with 192.168.64.70@o2ib. The mds_statfs operation failed with -107
> Lustre: data-MDT0000-mdc-ffff81013037b800: Connection to service data-MDT0000 via nid 192.168.64.70@o2ib was lost; in progress operations using this service will wait for recovery to complete.
> LustreError: 167-0: This client was evicted by data-MDT0000; in progress operations using this service will fail.
> LustreError: 22345:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -5
> LustreError: 22396:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff810136334400 x81717113/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 22396:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
> LustreError: 22454:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8101136d2000 x81717114/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 22454:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
> LustreError: 22463:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff810024ee4c00 x81717115/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 22463:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
> LustreError: 22734:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8101316c8200 x81717138/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 22734:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
> LustreError: 22736:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8101136d2c00 x81717139/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 22736:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
> LustreError: 22912:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8101136d2c00 x81717140/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 22912:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
> LustreError: 22971:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81012cebb000 x81717143/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 22971:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
> LustreError: 22971:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
> LustreError: 22971:0:(llite_lib.c:1508:ll_statfs_internal()) Skipped 2 previous similar messages
> LustreError: 23781:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81012bd02000 x81717144/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 23781:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
> LustreError: 23796:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81006c776000 x81717156/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 23827:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81013cbae400 x81717157/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 23827:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
> LustreError: 23827:0:(llite_lib.c:1508:ll_statfs_internal()) Skipped 1 previous similar message
> LustreError: 22346:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8100a5f3d400 x81717169/t0 o35->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 296/896 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 22346:0:(file.c:97:ll_close_inode_openhandle()) inode 21601226 mdc close failed: rc = -108
> Lustre: data-MDT0000-mdc-ffff81013037b800: Connection restored to service data-MDT0000 using nid 192.168.64.70@o2ib.
> LustreError: 11-0: an error occurred while communicating with 192.168.64.71@o2ib. The ost_statfs operation failed with -107
> Lustre: data-OST0001-osc-ffff81013037b800: Connection to service data-OST0001 via nid 192.168.64.71@o2ib was lost; in progress operations using this service will wait for recovery to complete.
> LustreError: 11-0: an error occurred while communicating with 192.168.64.71@o2ib. The ost_statfs operation failed with -107
> LustreError: 167-0: This client was evicted by data-OST0001; in progress operations using this service will fail.
> LustreError: 167-0: This client was evicted by data-OST0002; in progress operations using this service will fail.
> LustreError: 24093:0:(llite_lib.c:1520:ll_statfs_internal()) obd_statfs fails: rc = -5
> Lustre: data-OST0000-osc-ffff81013037b800: Connection restored to service data-OST0000 using nid 192.168.64.71@o2ib.

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron at iges.org
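
P.S. For anyone reading the quoted log: the rc values are negated Linux
errno numbers. Here is a small sketch for decoding them. The table holds
only the three codes that show up in this log, with their values as
defined on Linux, and explain_rc is a helper name made up for
illustration.

```python
import re

# Linux errno numbers for the codes that appear in the quoted log.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    108: ("ESHUTDOWN", "Cannot send after transport endpoint shutdown"),
}

# Matches "rc = -108" and "failed with -107"; "rc 0/0" is ignored.
RC_PATTERN = re.compile(r"(?:rc = |failed with )-(\d+)")


def explain_rc(line):
    """Return (number, name, description) for the first negative rc
    found in a log line, or None if the line carries none."""
    m = RC_PATTERN.search(line)
    if m is None:
        return None
    num = int(m.group(1))
    name, text = LINUX_ERRNO.get(num, ("UNKNOWN", "not in this small table"))
    return num, name, text
```

Read this way, -107 says the client was no longer connected to the
server, and the later -108 results come from requests issued while the
import was still invalid after the eviction.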



