[Lustre-discuss] Frequent OSS Crashes with heavy load

wanglu wanglu at ihep.ac.cn
Thu Nov 13 03:32:42 PST 2008


Dear all,
    Here is an excerpt from the error log:
Nov 13 18:25:26 boss02 kernel: Lustre: 27228:0:(filter_io_26.c:700:filter_commitrw_write()) Skipped 56 previous similar messages
Nov 13 18:25:26 boss02 kernel: Lustre: 27176:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) besfs-OST0004: slow journal start 47s
Nov 13 18:25:26 boss02 kernel: Lustre: 27231:0:(filter_io_26.c:713:filter_commitrw_write()) besfs-OST0004: slow brw_start 47s
Nov 13 18:25:26 boss02 kernel: Lustre: 27231:0:(filter_io_26.c:713:filter_commitrw_write()) Skipped 8 previous similar messages
Nov 13 18:25:26 boss02 kernel: Lustre: 27176:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) Skipped 10 previous similar messages
Nov 13 18:25:26 boss02 kernel: Lustre: 27278:0:(filter_io_26.c:765:filter_commitrw_write()) besfs-OST0004: slow direct_io 47s
Nov 13 18:25:26 boss02 kernel: Lustre: 27235:0:(lustre_fsfilt.h:302:fsfilt_commit_wait()) besfs-OST0004: slow journal start 47s
Nov 13 18:25:26 boss02 kernel: Lustre: 27235:0:(filter_io_26.c:778:filter_commitrw_write()) besfs-OST0004: slow commitrw commit 47s
Nov 13 18:25:47 boss02 sshd[18062]: Accepted password for root from 192.168.50.33 port 32796
Nov 13 10:25:47 boss02 sshd[18063]: Accepted password for root from 192.168.50.33 port 32796  

<----I could not log in over SSH at this point, so I went to the console-->
<---This is what I saw there--->
Nov 13 18:25:47 boss02 sshd(pam_unix)[18064]: session opened for user root by root(uid=0)
Nov 13 18:29:00 boss02 kernel: Lustre: 27501:0:(ldlm_lib.c:525:target_handle_reconnect()) besfs-OST0004: f8e1ba7f-1faf-9b85-b04b-cbf89fe80640 reconnecting
Nov 13 18:29:00 boss02 kernel: Lustre: 27501:0:(ldlm_lib.c:525:target_handle_reconnect()) Skipped 1 previous similar message
Nov 13 18:29:00 boss02 kernel: Lustre: 27359:0:(ldlm_lib.c:525:target_handle_reconnect()) besfs-OST0001: f819d104-ee19-f011-d6d6-bde44a19a8df reconnecting
Nov 13 18:29:00 boss02 kernel: Lustre: 27359:0:(ldlm_lib.c:525:target_handle_reconnect()) Skipped 4 previous similar messages
Nov 13 18:29:51 boss02 kernel: Lustre: 27074:0:(ldlm_lib.c:525:target_handle_reconnect()) besfs-OST0001: 1cd07c52-94c0-3a1c-dcd4-390daf0f0d10 reconnecting
Nov 13 18:29:51 boss02 kernel: Lustre: 27074:0:(ldlm_lib.c:525:target_handle_reconnect()) Skipped 2 previous similar messages
Nov 13 18:35:02 boss02 kernel: LustreError: 26928:0:(socklnd.c:1613:ksocknal_destroy_conn()) Completing partial receive from 12345-192.168.52.79 at tcp, ip 192.168.52.79:1021, with error
Nov 13 18:35:02 boss02 kernel: LustreError: 26928:0:(events.c:361:server_bulk_callback()) event type 2, status -5, desc e1c24000
Nov 13 18:35:02 boss02 kernel: LustreError: 17941:0:(ost_handler.c:1139:ost_brw_write()) @@@ network error on bulk GET 0(1048576)  req at ea8cd200 x10376088/t0 o4->b99b0138-d1de-93db-0418-c08eeb8c4b57 at NET_0x20000c0a8344f_UUID:0/0 lens 384/352 e 0 to 0 dl 1226573467 ref 1 fl Interpret:/0/0 rc 0/0
Nov 13 18:35:02 boss02 kernel: Lustre: 17941:0:(ost_handler.c:1270:ost_brw_write()) besfs-OST0001: ignoring bulk IO comm error with b99b0138-d1de-93db-0418-c08eeb8c4b57 at NET_0x20000c0a8344f_UUID id 12345-192.168.52.79 at tcp - client will retry
Nov 13 18:35:04 boss02 kernel: LustreError: 26928:0:(socklnd.c:1613:ksocknal_destroy_conn()) Completing partial receive from 12345-192.168.52.94 at tcp, ip 192.168.52.94:1021, with error
Nov 13 18:35:04 boss02 kernel: LustreError: 26928:0:(events.c:361:server_bulk_callback()) event type 2, status -5, desc f6ce6000
Nov 13 18:35:04 boss02 kernel: LustreError: 18214:0:(ost_handler.c:1139:ost_brw_write()) @@@ network error on bulk GET 0(1048576)  req at e3d07a00 x12379068/t0 o4->7dfc3e78-2411-0625-f276-26756a033f22 at NET_0x20000c0a8345e_UUID:0/0 lens 384/352 e 0 to 0 dl 1226573468 ref 1 fl Interpret:/0/0 rc 0/0
Nov 13 18:35:04 boss02 kernel: Lustre: 18214:0:(ost_handler.c:1270:ost_brw_write()) besfs-OST0000: ignoring bulk IO comm error with 7dfc3e78-2411-0625-f276-26756a033f22 at NET_0x20000c0a8345e_UUID id 12345-192.168.52.94 at tcp - client will retry
Nov 13 18:35:13 boss02 kernel: LustreError: 26928:0:(socklnd.c:1613:ksocknal_destroy_conn()) Completing partial receive from 12345-192.168.52.70 at tcp, ip 192.168.52.70:1021, with error
Nov 13 18:35:13 boss02 kernel: LustreError: 26928:0:(events.c:361:server_bulk_callback()) event type 2, status -5, desc d9f6b000
Nov 13 18:35:13 boss02 kernel: LustreError: 27177:0:(ost_handler.c:1139:ost_brw_write()) @@@ network error on bulk GET 0(1048576)  req at f5164800 x600925/t0 o4->53e9e602-8258-51f4-c7f9-4b9ded4efc27 at NET_0x20000c0a83446_UUID:0/0 lens 384/352 e 0 to 0 dl 1226573467 ref 1 fl Interpret:/0/0 rc 0/0
Nov 13 18:35:13 boss02 kernel: Lustre: 27177:0:(ost_handler.c:1270:ost_brw_write()) besfs-OST0000: ignoring bulk IO comm error with 53e9e602-8258-51f4-c7f9-4b9ded4efc27 at NET_0x20000c0a83446_UUID id 12345-192.168.52.70 at tcp - client will retry
Nov 13 18:35:15 boss02 kernel: LustreError: 26928:0:(socklnd.c:1613:ksocknal_destroy_conn()) Completing partial receive from 12345-192.168.52.81 at tcp, ip 192.168.52.81:1021, with error
Nov 13 18:35:15 boss02 kernel: LustreError: 26928:0:(events.c:361:server_bulk_callback()) event type 2, status -5, desc d34d2000
Nov 13 18:35:15 boss02 kernel: LustreError: 27237:0:(ost_handler.c:1139:ost_brw_write()) @@@ network error on bulk GET 0(1048576)  req at c5a8da2c x12883457/t0 o4->dce502fc-79fb-9a4e-5e97-90a58a814569 at NET_0x20000c0a83451_UUID:0/0 lens 384/352 e 0 to 0 dl 1226573467 ref 1 fl Interpret:/0/0 rc 0/0
Nov 13 18:35:15 boss02 kernel: Lustre: 27237:0:(ost_handler.c:1270:ost_brw_write()) besfs-OST0003: ignoring bulk IO comm error with dce502fc-79fb-9a4e-5e97-90a58a814569 at NET_0x20000c0a83451_UUID id 12345-192.168.52.81 at tcp - client will retry
Nov 13 18:35:17 boss02 kernel: LustreError: 26928:0:(events.c:361:server_bulk_callback()) event type 2, status -5, desc da5d7000
Nov 13 18:35:18 boss02 kernel: LustreError: 26928:0:(socklnd.c:1613:ksocknal_destroy_conn()) Completing partial receive from 12345-192.168.52.108 at tcp, ip 192.168.52.108:1021, with error
Nov 13 18:35:18 boss02 kernel: LustreError: 26928:0:(socklnd.c:1613:ksocknal_destroy_conn()) Skipped 1 previous similar message
Nov 13 18:35:18 boss02 kernel: LustreError: 26928:0:(events.c:361:server_bulk_callback()) event type 2, status -5, desc c71c0000
Nov 13 18:35:18 boss02 kernel: LustreError: 27215:0:(ost_handler.c:1139:ost_brw_write()) @@@ network error on bulk GET 0(1048576)  req at e3767800 x7236800/t0 o4->66fc5c30-3666-f0e6-d005-a39f58eb4be2 at NET_0x20000c0a8346c_UUID:0/0 lens 384/352 e 0 to 0 dl 1226573468 ref 1 fl Interpret:/0/0 rc 0/0
Nov 13 18:35:18 boss02 kernel: LustreError: 27215:0:(ost_handler.c:1139:ost_brw_write()) Skipped 1 previous similar message
Nov 13 18:35:18 boss02 kernel: Lustre: 27215:0:(ost_handler.c:1270:ost_brw_write()) besfs-OST0000: ignoring bulk IO comm error with 66fc5c30-3666-f0e6-d005-a39f58eb4be2 at NET_0x20000c0a8346c_UUID id 12345-192.168.52.108 at tcp - client will retry
Nov 13 18:35:18 boss02 kernel: Lustre: 27215:0:(ost_handler.c:1270:ost_brw_write()) Skipped 1 previous similar message


<---At that time the network was down; I could not ping the gateway-->
<--I tried restarting the network service, but the gateway was still unreachable afterwards--->
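To make an excerpt like the one above easier to read, the "slow ..." messages and the bulk IO errors can be tallied per OST and per client. The following is only a rough sketch, not a Lustre tool: the function name summarize and both regular expressions are assumptions derived from the message formats shown in this log, and the patterns accept both "@tcp" and the " at tcp" form that the list archive produces.

```python
import re
from collections import Counter

# Matches messages such as "besfs-OST0004: slow journal start 47s".
# Pattern is an assumption based on the log lines quoted in this message.
SLOW_RE = re.compile(
    r"(?P<ost>\S+-OST[0-9a-fA-F]{4}): slow (?P<op>.+?) (?P<secs>\d+)s\s*$"
)

# Matches the client address in
# "ignoring bulk IO comm error with ... id 12345-192.168.52.79 at tcp".
BULK_RE = re.compile(
    r"ignoring bulk IO comm error .* id \d+-(?P<ip>\d+\.\d+\.\d+\.\d+)(?:@| at )tcp"
)


def summarize(lines):
    """Return (slow, bulk) for an iterable of syslog lines.

    slow maps (ost, operation) to the longest delay seen, in seconds.
    bulk counts ignored bulk IO communication errors per client IP.
    """
    slow = {}
    bulk = Counter()
    for line in lines:
        m = SLOW_RE.search(line)
        if m:
            key = (m.group("ost"), m.group("op"))
            slow[key] = max(slow.get(key, 0), int(m.group("secs")))
            continue
        m = BULK_RE.search(line)
        if m:
            bulk[m.group("ip")] += 1
    return slow, bulk
```

Passing an open log file (or any list of lines) to summarize() gives a quick view of which OSTs report slow journal or I/O operations and which clients hit bulk transfer errors.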







------------------				 
wanglu
2008-11-13

-------------------------------------------------------------
From: Andreas Dilger
Sent: 2008-11-13 01:36:57
To: Wang lu
Cc: Brian J. Murrell; lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] Frequent OSS Crashes with heavy load

On Nov 12, 2008  13:48 +0000, Wang lu wrote:
> May I ask where I can run the PIOS command? I think that, to determine the max
> thread number of the OSS, it should be run on the OSS; however, the OST
> directories are unwritable. Can I write to /dev/sdaX? I am confused.

Running PIOS directly on /dev/sdX will overwrite all data there.  It should
only be run on the disk devices before the filesystem is formatted.  You
can run PIOS against the filesystem itself (e.g. /mnt/lustre) to just create
regular files in the filesystem.

> Brian J. Murrell wrote:
> 
> > On Mon, 2008-11-10 at 16:42 +0000, Wang lu wrote:
> >> I already have 512 (the max number) IO threads running. Some of them are in
> >> "Dead" status. Is it safe to draw the conclusion that the OSS is oversubscribed?
> > 
> > Until you do some analysis of your storage with the iokit, one cannot
> > really draw any conclusions, however if you are already at the maximum
> > value of OST threads, it would not be difficult to believe that perhaps
> > this is a possibility.
> > 
> > Try a simple experiment and halve the number to 256 and see if you have
> > any drop off in throughput to the storage devices.  If not, then you can
> > easily assume that 512 was either too much or not necessary.  You can
> > try doing this again if you wish.  If you get to a value of OST threads
> > where your throughput is lower than it should be, you've gone too low.
> > 
> > But really, the iokit is the more efficient and accurate way to
> > determine this.
> > 
> > b.
> > 
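The halving experiment Brian describes can be sketched roughly as follows. This is an illustration only, not part of Lustre or the iokit: find_thread_count and measure_throughput are hypothetical names, measure_throughput stands for whatever procedure you use to set the OST thread count and benchmark the storage, and the 5% tolerance is an assumed threshold for what counts as a "drop off".

```python
def find_thread_count(measure_throughput, start=512, tolerance=0.05, floor=1):
    """Halve the thread count while throughput stays close to the baseline.

    measure_throughput(threads) is a hypothetical callback that returns the
    throughput observed with that many OST threads (any consistent unit).
    Returns the smallest tested count whose throughput stayed within
    `tolerance` of the throughput measured at `start`.
    """
    baseline = measure_throughput(start)
    best = start
    threads = start // 2
    while threads >= floor:
        if measure_throughput(threads) < baseline * (1 - tolerance):
            # Throughput dropped: this count is too low, keep the previous one.
            break
        best = threads
        threads //= 2
    return best


if __name__ == "__main__":
    # Toy model (an assumption for demonstration): storage saturates at 128 threads.
    def fake_measure(threads):
        return min(threads, 128) * 5.0

    print(find_thread_count(fake_measure))
```

With the toy model the search stops at 128, the last count before throughput falls. On a real OSS each measurement means reconfiguring the thread count and rerunning the workload, which is why the iokit survey remains the more efficient route.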

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



