[Lustre-discuss] evicted clients with 1.6.5.1

Christopher Walker cwalker@fas.harvard.edu
Tue Aug 19 16:42:36 PDT 2008


Hello,

Occasionally, when a client (typically a head node) comes under very 
heavy load, all operations on its Lustre mount freeze, and the client 
requires a hard reboot before the mount is usable again.  The symptoms 
look similar to the statahead problem observed by others, but I was 
under the impression that this wouldn't be an issue in 1.6.5.1, the 
version that we're running.  On the client, the messages in the log 
file are:

Aug 19 12:42:47 herologin1 kernel: LustreError: 11-0: an error occurred 
while communicating with 10.242.42.204@tcp. The mds_statfs operation 
failed with -107
Aug 19 12:42:47 herologin1 kernel: Lustre: 
circelfs-MDT0000-mdc-ffff81021eabdc00: Connection to service 
circelfs-MDT0000 via nid 10.242.42.204@tcp was lost; in progress 
operations using this service will wait for recovery to complete.
Aug 19 12:42:47 herologin1 kernel: LustreError: 167-0: This client was 
evicted by circelfs-MDT0000; in progress operations using this service 
will fail.
Aug 19 12:42:47 herologin1 kernel: LustreError: 
7067:0:(llite_lib.c:1549:ll_statfs_internal()) mdc_statfs fails: rc = -5

while on the MGS/MDS the messages are:
Aug 19 12:41:11 circe1 kernel: Lustre: MGS: haven't heard from client 
1be0f382-ff65-f231-d348-9d2523654fbb (at 10.242.40.14@tcp) in 1127 
seconds. I think it's dead, and I am evicting it.
Aug 19 12:41:11 circe1 kernel: Lustre: Skipped 2 previous similar messages
Aug 19 12:41:50 circe1 kernel: Lustre: circelfs-OST001f: haven't heard 
from client b7bbe482-a8c5-0d21-4f44-713aa2aa4f81 (at 10.242.40.14@tcp) 
in 1127 seconds. I think it's dead, and I am evicting it.
Aug 19 12:41:51 circe1 kernel: Lustre: circelfs-OST0019: haven't heard 
from client b7bbe482-a8c5-0d21-4f44-713aa2aa4f81 (at 10.242.40.14@tcp) 
in 1127 seconds. I think it's dead, and I am evicting it.
Aug 19 12:41:52 circe1 kernel: Lustre: circelfs-OST001a: haven't heard 
from client b7bbe482-a8c5-0d21-4f44-713aa2aa4f81 (at 10.242.40.14@tcp) 
in 1127 seconds. I think it's dead, and I am evicting it.
Aug 19 12:42:15 circe1 kernel: Lustre: circelfs-MDT0000: haven't heard 
from client b7bbe482-a8c5-0d21-4f44-713aa2aa4f81 (at 10.242.40.14@tcp) 
in 1127 seconds. I think it's dead, and I am evicting it.
Aug 19 12:42:15 circe1 kernel: Lustre: Skipped 4 previous similar messages
Aug 19 12:42:47 circe1 kernel: LustreError: 
7735:0:(handler.c:1515:mds_handle()) operation 41 on unconnected MDS 
from 12345-10.242.40.14@tcp
Aug 19 12:42:47 circe1 kernel: LustreError: 
7735:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error 
(-107)  req@ffff810038817a00 x12738961/t0 o41-><?>@<?>:0/0 lens 128/0 e 
0 to 0 dl 1219164667 ref 1 fl Interpret:/0/0 rc -107/0
Aug 19 12:42:47 circe1 kernel: LustreError: 
7735:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 41 previous 
similar messages


(In the logs above 10.242.40.14 = herologin1).  Should I try the

echo 0 > /proc/fs/lustre/llite/*/statahead_max

solution that fixed the statahead problem?
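If so, I assume the full form would be something like the sketch below 
(paths per the 1.6.x client /proc layout; as I understand it the setting 
is per-mount and does not survive a remount, so persistent use would need 
an init script):

```shell
# Show the current statahead setting for each mounted Lustre filesystem
cat /proc/fs/lustre/llite/*/statahead_max

# Disable statahead on every Lustre mount on this client
# (0 turns the feature off; the change lasts only until the
# filesystem is unmounted)
for f in /proc/fs/lustre/llite/*/statahead_max; do
    echo 0 > "$f"
done
```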

Many thanks,
Chris



