[Lustre-discuss] o2ib possible network problems??

Kevin Van Maren Kevin.Vanmaren at Sun.COM
Tue Sep 23 10:17:27 PDT 2008


Ms. Megan Larko wrote:
> I changed the IB cable on the problem box using the same IB card, PCI
> slot and slot on the IB SilverStorm switch.
> The errors I now see on the clients are the same but the server OSS
> for crew8-OST0000 thru crew8-OST0011 are:
> ib0: multicast join failed for
> ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22
> LustreError: 4346:0:(filter.c:2674:filter_destroy_precreated())
> LustreError:4486:0:(ldmlm_lib.c:1442:target_send_reply_msg(()@@processing
> error -107
>
> Perhaps it could be the IB card?   It is a Mellanox Technologies
> MT25204 [InfiniHost III Lx HCA].  This is the same card in many, but
> not all, of our other systems.  I can try a new IB card on Monday.
>   


Which subnet manager are you using?  You should look at the log files to
see why you are getting these "multicast join failed" messages, which
indicate that something is seriously wrong with the InfiniBand fabric.

For some reason (for example, the nodes do not support the speed used by
the multicast group), they could not join the group.  This is especially
critical because this particular multicast group carries all IPv4
broadcast traffic (e.g., IPv4 ARP requests).
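
To see why that MGID is the IPv4 broadcast group, it can be picked apart
by hand.  The sketch below assumes the IPoIB MGID layout from RFC 4391
(0xFF, flags, scope, a 16-bit signature, the P_Key, then the group ID);
the function name is made up for illustration.

```python
import ipaddress

def decode_ipoib_mgid(mgid_text):
    """Split an IPoIB multicast GID into its fields (layout per RFC 4391)."""
    raw = ipaddress.IPv6Address(mgid_text).packed  # 16 bytes, network order
    if raw[0] != 0xFF:
        raise ValueError("not a multicast GID")
    signature = int.from_bytes(raw[2:4], "big")
    return {
        "flags": raw[1] >> 4,        # 1 = transient (not a well-known group)
        "scope": raw[1] & 0x0F,      # 2 = link-local
        "protocol": {0x401B: "IPv4", 0x601B: "IPv6"}.get(signature, "unknown"),
        "pkey": int.from_bytes(raw[4:6], "big"),
        "group_id": int.from_bytes(raw[6:16], "big"),
    }

# The MGID from the "multicast join failed" message above.
fields = decode_ipoib_mgid("ff12:401b:ffff:0000:0000:0000:ffff:ffff")

# An all-ones 32-bit group ID under the IPv4 signature is the broadcast group.
is_ipv4_broadcast = (fields["protocol"] == "IPv4"
                     and fields["group_id"] == 0xFFFFFFFF)
```

A node that cannot join this group cannot send or receive IPv4 ARP over
ib0, which fits the connection errors seen on the clients.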


Since infiniband multicast is not well understood, let me summarize:

The SM assigns a multicast LID to each multicast group.  Most switches
support only 1024 multicast LIDs, and some SMs cannot map more than one
group to the same LID, so multicast sometimes breaks when you get too
many groups (i.e., more than ~900 nodes with just link-local IPv6
addresses - see below).
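
A rough way to sanity-check a fabric against that limit is sketched
below.  It only applies when the SM cannot share one LID between groups,
and the per-node and shared group counts are placeholder assumptions,
not measured values; plug in what your own fabric shows.

```python
MLID_LIMIT = 1024  # multicast LIDs supported by most switches (see above)

def estimated_groups(nodes, per_node_groups=1, shared_groups=10):
    """Estimate the multicast group count for a fabric.

    per_node_groups: groups unique to each node (assumed: one per
                     link-local IPv6 address).
    shared_groups:   groups common to all nodes, such as IPv4 broadcast
                     (assumed placeholder value).
    """
    return shared_groups + nodes * per_node_groups

def fits_in_mlid_table(nodes, **kwargs):
    """True if every group can get its own multicast LID."""
    return estimated_groups(nodes, **kwargs) <= MLID_LIMIT

small_ok = fits_in_mlid_table(200)
large_ok = fits_in_mlid_table(1100)
```

With these placeholder numbers a 200-node fabric fits and an 1100-node
fabric does not; nodes with more than one IPv6 address each hit the
limit much sooner.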

The first node to join a multicast group selects the group speed
(typically SDR 4x or DDR 4x).  Nodes that do not support at least that
speed are not allowed to join later, as all multicast messages for that
LID are sent at that speed (i.e., an SDR node cannot join a DDR mcast
group, as it could not keep up).
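
The join rule amounts to a simple rate comparison.  This is only a
sketch of the idea, not how any particular SM implements it; the
function names are invented, and the per-lane figures are the usual
signalling rates (2.5 Gb/s for SDR, 5 Gb/s for DDR).

```python
# Signalling rate per lane in Gb/s.
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0}

def link_rate_gbps(speed, width):
    """Aggregate signalling rate of a link, e.g. ("DDR", 4) -> 20.0."""
    return LANE_GBPS[speed] * width

def can_join(node_link, group_link):
    """A node may join only if it supports at least the group rate."""
    return link_rate_gbps(*node_link) >= link_rate_gbps(*group_link)

sdr_into_ddr = can_join(("SDR", 4), ("DDR", 4))  # slower node, faster group
ddr_into_sdr = can_join(("DDR", 4), ("SDR", 4))  # faster node, slower group
```

So if a DDR node happened to create the broadcast group, every SDR node
on the fabric would be refused when it tried to join later.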

With IPv6, address resolution (the equivalent of ARP) is done using
multicast.  That is perfect for broadcast LANs, where only the target
node takes an interrupt to process the request, but it can lead to a
multicast group being created per IPv6 address.  Note that IPv4 also
uses a few multicast groups.
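
The per-address group comes from the IPv6 solicited-node multicast
address, which is ff02::1:ff00:0/104 plus the low 24 bits of the unicast
address.  A small sketch; the link-local addresses are made-up examples.

```python
import ipaddress

SOLICITED_NODE_BASE = int(ipaddress.IPv6Address("ff02::1:ff00:0"))

def solicited_node_group(unicast_addr):
    """Return the solicited-node multicast address for an IPv6 address."""
    low24 = int(ipaddress.IPv6Address(unicast_addr)) & 0xFFFFFF
    return ipaddress.IPv6Address(SOLICITED_NODE_BASE | low24)

# Two hypothetical nodes: different low 24 bits, so different groups.
group_a = solicited_node_group("fe80::202:c903:1:2a3b")
group_b = solicited_node_group("fe80::202:c903:1:2a3c")
```

Addresses that share their low 24 bits share a group, but link-local
addresses derived from distinct port GUIDs usually differ there, so in
practice the group count grows with the node count.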

With InfiniBand it is a little messier: the node has to query the
multicast group list from the SM to learn which LID to use when sending
the multicast ARP.

Try checking the link speeds, and look at the output of "saquery -g".

Kevin Van Maren



