[Lustre-discuss] Lustre filesystem hangs when reading large files

Kevin Van Maren kevin.van.maren at oracle.com
Wed Apr 20 03:42:21 PDT 2011


Chris Exton wrote:
>
> Hello,
>
> We are currently using lustre 1.8.1.1 and using kernel version 
> 2.6.18_128.7.1.el5_lustre.
>
> We are experiencing problems when reading large files from our
> Lustre filesystem; small reads are not affected.
>
> The read process hangs and the following message is reported in 
> /var/log/messages:
>
> Feb 22 15:59:38 leopard kernel: LustreError: 11-0: an error occurred 
> while communicating with 192.168.13.200@o2ib. The obd_ping operation 
> failed with -107
>
> Feb 22 15:59:38 leopard kernel: Lustre: 
> lustre-OST0000-osc-ffff81067e0eac00: Connection to service 
> lustre-OST0000 via nid 192.168.13.200@o2ib was lost; in progress 
> operations using this service will wait for recovery to complete.
>
> Feb 22 15:59:38 leopard kernel: LustreError: 
> 6811:0:(import.c:939:ptlrpc_connect_interpret()) lustre-OST0000_UUID 
> went back in time (transno 476754140074 was previously committed, 
> server now claims 0)! See 
> https://bugzilla.lustre.org/show_bug.cgi?id=9646
>
> Feb 22 15:59:38 leopard kernel: LustreError: 167-0: This client was 
> evicted by lustre-OST0000; in progress operations using this service 
> will fail.
>
> Feb 22 15:59:38 leopard kernel: Lustre: 
> lustre-OST0000-osc-ffff81067e0eac00: Connection restored to service 
> lustre-OST0000 using nid 192.168.13.200@o2ib.
>
> Feb 22 15:59:38 leopard kernel: LustreError: 
> 17592:0:(lov_request.c:196:lov_update_enqueue_set()) enqueue objid 
> 0x18f87222 subobj 0x4d0c9f on OST idx 0: rc -5
>
> I have checked the bugzilla report but we have not had a disk crash 
> and the system was not restarted. Could this be an underlying hardware 
> problem that’s not getting logged?
>

This could be a hardware problem with your network, but it does not look
like a disk problem. It looks like a network failure led to the client
being evicted (the server could not contact the client, so it evicted
it). That in turn produced the "back in time" message when the client
reconnected and could not complete its outstanding I/Os: pending writes,
i.e. those still in the client cache, are dropped when a client is
evicted. See https://bugzilla.lustre.org/show_bug.cgi?id=21681
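
For reference, the two return codes in the quoted log are negated Linux
errno values: -107 is ENOTCONN and -5 is EIO. Below is a rough Python
sketch of how one might pull the eviction sequence and those codes out
of /var/log/messages. The function names and event labels are made up
for illustration, the substrings are simply copied from the messages
quoted above, and it assumes each log message sits on a single line (as
it would in the log file, unlike the wrapped text in this mail).

```python
import re

# Linux errno numbers for the two codes in the quoted log (hard-coded so the
# result does not depend on the platform running this script).
LINUX_ERRNO = {5: "EIO", 107: "ENOTCONN"}

# Matches the "failed with -107" and "rc -5" forms seen in the log.
RC_PATTERN = re.compile(r"(?:failed with|rc) -(\d+)")

# Substrings copied from the quoted messages; the labels are hypothetical.
MARKERS = [
    ("was lost", "connection_lost"),
    ("went back in time", "back_in_time"),
    ("was evicted by", "evicted"),
    ("Connection restored", "connection_restored"),
]


def decode_rc(line):
    """Return the errno name for a negative return code in line, or None."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return LINUX_ERRNO.get(code, "errno %d" % code)


def classify(line):
    """Return the label of the first marker found in line, or None."""
    for needle, label in MARKERS:
        if needle in line:
            return label
    return None


def summarize(lines):
    """Collect (label, errno_name) pairs for lines that matched something."""
    events = []
    for line in lines:
        label, rc = classify(line), decode_rc(line)
        if label is not None or rc is not None:
            events.append((label, rc))
    return events
```

Running summarize() over the lines of the log and looking for an
"evicted" event near a "connection_lost" event would be one way to spot
this pattern again, but treat it as a starting point rather than a
diagnostic tool.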
>
> Any additional help on this matter would be much appreciated.
>
> Kind Regards
>
> Chris
>



