[Lustre-discuss] bug or transport problem?
Daniel Mayfield
dmayfield at rgmadvisors.com
Wed Mar 9 11:44:21 PST 2011
I have a small cluster here to test the viability of Lustre for our purposes. I have 56 client nodes, an active/standby MDS pair, and 14 OSSes. One of the users started up a job on the client nodes, and the cluster promptly went nuts (I ended up having to reboot a bunch of nodes).
All 72 machines are connected via InfiniBand. In the logs, odfs001 is the MDS and o5056 is a client node.
Do I have a misconfiguration here? Or did I break something at the transport layer?
daniel
--
[dmayfield@zdmayfield lustre_dmesg]$ grep Lustre * | grep Dropping | cut -f 1 -d : | uniq -c
2 o5056.dmesg
18 odfs001.dmesg
590 odfs002.dmesg
78 odfs003.dmesg
94 odfs004.dmesg
92 odfs005.dmesg
98 odfs006.dmesg
97 odfs007.dmesg
98 odfs008.dmesg
110 odfs010.dmesg
97 odfs011.dmesg
145 odfs012.dmesg
107 odfs013.dmesg
103 odfs014.dmesg
113 odfs015.dmesg
632 odfs016.dmesg
[dmayfield@zdmayfield lustre_dmesg]$ grep Lustre * | grep Dropping | head
o5056.dmesg:Lustre: 25671:0:(lib-move.c:1028:lnet_post_send_locked()) Dropping message for 12345-192.168.50.238@tcp: peer not alive
o5056.dmesg:Lustre: 25671:0:(lib-move.c:1028:lnet_post_send_locked()) Dropping message for 12345-192.168.50.238@tcp: peer not alive
odfs001.dmesg:Lustre: 18202:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-0@lo portal 26 match 1359131017745740 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6822:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-192.168.50.55@o2ib portal 26 match 1359226700706635 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6846:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-10.4.2.55@tcp portal 12 match 1359226700706636 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6818:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-192.168.50.55@o2ib portal 26 match 1359226700706643 offset 0 length 368: 2
odfs001.dmesg:Lustre: 6846:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-10.4.2.55@tcp portal 12 match 1359226700706652 offset 0 length 368: 2
odfs001.dmesg:Lustre: 6820:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-192.168.50.56@o2ib portal 26 match 1359136284362079 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6843:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-10.4.2.56@tcp portal 12 match 1359136284362080 offset 0 length 192: 2
odfs001.dmesg:Lustre: 6810:0:(lib-move.c:1826:lnet_parse_put()) Dropping PUT from 12345-192.168.50.56@o2ib portal 26 match 1359136284362088 offset 0 length 368: 2
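
The per-file counts above do not show which LNet network the dropped messages belong to. A rough sketch of a further breakdown follows, grouping the "Dropping" lines by the function that logged them and by the network name at the end of the peer NID. The function name drops_by_network is made up for illustration, and the pattern assumes the log line layout shown in the excerpt above; the separator between address and network is matched as either "@" or " at ", since list archives sometimes rewrite "@".

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): tally dropped LNet messages
# by logging function and by the network part of the peer NID.
# Arguments: one or more dmesg capture files.
drops_by_network() {
    # -h suppresses file names so that lines from all files are pooled.
    grep -h 'Lustre.*Dropping' "$@" |
        # Keep only "<function> <network>", e.g. "lnet_parse_put o2ib".
        # Assumes the layout "...:function()) Dropping ... 12345-ADDR@NET ...".
        sed -E 's/.*:([A-Za-z0-9_]+)\(\)\) Dropping [^-]*-[^ @]+(@| at )([a-z0-9]+).*/\1 \3/' |
        sort | uniq -c
}
```

Run as "drops_by_network *.dmesg" in the directory holding the captures, this prints one line per (function, network) pair with its count. Lines that do not match the assumed layout pass through sed unchanged and show up as their own entries, which makes unexpected formats easy to spot.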