[Lustre-discuss] bug or transport problem?

Daniel Mayfield dmayfield at rgmadvisors.com
Wed Mar 9 11:44:21 PST 2011


I have a small cluster here to test the viability of Lustre for our purposes.  I have 56 client nodes, an active/standby MDS pair, and 14 OSSes.  One of the users started up a job on the client nodes, and the cluster promptly went nuts (I ended up having to reboot a bunch of nodes).  

All 72 machines are connected via InfiniBand.  In the logs, odfs001 is the MDS and o5056 is a client node.

Do I have a misconfiguration here?  Or did I break something at the transport layer?

daniel
--

[dmayfield@zdmayfield lustre_dmesg]$ grep Lustre * | grep Dropping | cut -f 1 -d : | uniq -c
      2 o5056.dmesg
     18 odfs001.dmesg
    590 odfs002.dmesg
     78 odfs003.dmesg
     94 odfs004.dmesg
     92 odfs005.dmesg
     98 odfs006.dmesg
     97 odfs007.dmesg
     98 odfs008.dmesg
    110 odfs010.dmesg
     97 odfs011.dmesg
    145 odfs012.dmesg
    107 odfs013.dmesg
    103 odfs014.dmesg
    113 odfs015.dmesg
    632 odfs016.dmesg
[dmayfield@zdmayfield lustre_dmesg]$ grep Lustre * | grep Dropping | head
o5056.dmesg:Lustre: 25671:0:(lib-move.c:1028:lnet_post_send_locked()) Dropping message for 12345-192.168.50.238@tcp: peer not alive
o5056.dmesg:Lustre: 25671:0:(lib-move.c:1028:lnet_post_send_locked()) Dropping message for 12345-192.168.50.238@tcp: peer not alive
odfs001.dmesg:Lustre: 18202:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-0@lo portal 26 match 1359131017745740 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6822:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-192.168.50.55@o2ib portal 26 match 1359226700706635 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6846:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-10.4.2.55@tcp portal 12 match 1359226700706636 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6818:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-192.168.50.55@o2ib portal 26 match 1359226700706643 offset 0 length 368: 2
odfs001.dmesg:Lustre: 6846:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-10.4.2.55@tcp portal 12 match 1359226700706652 offset 0 length 368: 2
odfs001.dmesg:Lustre: 6820:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-192.168.50.56@o2ib portal 26 match 1359136284362079 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6843:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-10.4.2.56@tcp portal 12 match 1359136284362080 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6810:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-192.168.50.56@o2ib portal 26 match 1359136284362088 offset 0 length 368: 2
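
The sample above shows peers on tcp, o2ib, and lo networks. A breakdown of the dropped messages by LNet network type might help narrow down which transport is involved. The following is only a rough sketch, run against the same dmesg files: count_drops_by_net is a made-up helper name (not a Lustre tool), and the pattern assumes each peer is logged as 12345-<address>@<network>, or with " at " in place of "@" as some list archives render it.

```shell
# Hypothetical helper: tally "Dropping" log lines by LNet network type
# (tcp, o2ib, lo, ...). Takes one or more dmesg files as arguments.
count_drops_by_net() {
    # -h suppresses the filename prefix so only the log text is parsed.
    grep -h 'Dropping' "$@" \
        | sed -n -E 's/.*12345-[^ @]+(@| at )([a-z0-9]+).*/\2/p' \
        | sort \
        | uniq -c
}
```

It could be run per host, for example `count_drops_by_net odfs001.dmesg`, or across everything with `count_drops_by_net *.dmesg`. Lines that do not match the assumed peer format are silently skipped, so the totals may come out lower than the plain grep counts.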


