[Lustre-discuss] MDS hangs with OFED

Cliff White cliffw at whamcloud.com
Thu Mar 17 12:26:00 PDT 2011


Unfortunately, we've had lots of reports of IB instability. It happens quite
a bit, and generally it is not a Lustre problem at all.
- Check all mechanical connections, cables, etc. - replace if need be - many
issues have been cable-related.
- Check the firmware version on every IB card, and find the best version for
your card model.
- Make sure your IB cards are in the proper (best performing) slots in your
backplane.
- If you have an IB switch with monitoring/error reporting you may be able
to get more data.
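
The firmware and error-data checks above can be sketched in a few lines of
Python. This is only a rough illustration, assuming a Linux host whose IB
driver exposes fw_ver and the per-port counters directory under
/sys/class/infiniband; the survey() helper and the choice of counters are
hypothetical, not anything from Lustre or OFED. It is meant to be run
periodically while the fabric is healthy, since reads may block once the IB
stack has wedged.

```python
from pathlib import Path

# Counters that often point at cable or link trouble (an assumed selection).
ERROR_COUNTERS = ("symbol_error", "link_downed",
                  "link_error_recovery", "port_rcv_errors")


def read_text(path):
    """Return stripped file contents, or None if the file cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def survey(root="/sys/class/infiniband"):
    """Collect the firmware version and nonzero error counters per HCA."""
    report = {}
    root = Path(root)
    if not root.is_dir():
        return report  # no IB devices visible on this host
    for dev in sorted(root.iterdir()):
        ports = {}
        ports_dir = dev / "ports"
        port_dirs = sorted(ports_dir.iterdir()) if ports_dir.is_dir() else []
        for port in port_dirs:
            errors = {}
            for name in ERROR_COUNTERS:
                raw = read_text(port / "counters" / name)
                if raw is not None and raw.isdigit() and int(raw) > 0:
                    errors[name] = int(raw)
            ports[port.name] = errors
        report[dev.name] = {"fw_ver": read_text(dev / "fw_ver"),
                            "ports": ports}
    return report


if __name__ == "__main__":
    for dev, info in survey().items():
        print(f"{dev}: firmware {info['fw_ver']}")
        for port, errors in info["ports"].items():
            print(f"  port {port}: {errors or 'no errors counted'}")
```

Comparing the output across the MDS and the OSSes should make a mismatched
firmware version, or a port whose error counters keep climbing, easier to
spot.
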
cliffw


On Thu, Mar 17, 2011 at 10:54 AM, Kevin Hildebrand <kevin at umd.edu> wrote:

>
> We've been seeing occasional hangs on our MDS and I'd like to see if
> anyone else is seeing this or can provide suggestions on where to look.
> This might not even be a Lustre problem at all.
>
> We're running Lustre 1.8.4 with OFED 1.5.2, and kernel version
> 2.6.18-194.3.1.el5_lustre.1.8.4.
>
> The problem is that at some point something in the IB stack appears to
> go out to lunch: pings to the IPoIB interface time out, and anything
> that touches IB (perfquery, etc.) goes into a hard hang and cannot be
> killed.
>
> The only solution to the problem once it occurs is to power-cycle the
> machine, as shutdown/reboot hang as well.
>
> From what I can see, the first abnormal entries in the system logs on
> the MDS are messages showing that connections to the OSSes are timing out.
>
> Any insight would be appreciated.
>
> Thanks,
>
> Kevin



-- 
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com