[Lustre-discuss] OFED 1.5.1 on Clients

Andreas Dilger andreas.dilger at oracle.com
Fri Jun 18 12:45:01 PDT 2010


Since the event is unknown, it is hard to know in advance whether it
can be ignored or not. Some protocols encode in the message type
whether it is 'mandatory' or 'optional' to handle; others, as Lustre
does, negotiate in advance which operations are understood and never
send unknown requests to peers.  I have no idea whether IB does this
or not.
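The two approaches above can be sketched in C. This is a hypothetical illustration, not Lustre or InfiniBand code: the bit layout (MSG_MANDATORY in bit 15), the message types and the FEAT_* feature bits are all invented for the example. receive() shows a sender marking each message as mandatory or optional; negotiate() and may_send() show up-front negotiation, where a sender only uses operations both peers advertised.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wire format: bit 15 of the message type marks a message
 * the receiver must understand; the low 15 bits are the type itself. */
#define MSG_MANDATORY 0x8000u
#define MSG_TYPE_MASK 0x7fffu

/* Hypothetical message types this receiver implements. */
enum { MSG_PING = 1, MSG_STATS = 2 };

enum verdict { HANDLED, SKIPPED, FAILED };

/* Approach 1: each message says whether it may be ignored. */
enum verdict receive(uint16_t type)
{
    switch (type & MSG_TYPE_MASK) {
    case MSG_PING:
    case MSG_STATS:
        return HANDLED;
    default:
        /* Unknown type: skip it only if the sender marked it optional. */
        return (type & MSG_MANDATORY) ? FAILED : SKIPPED;
    }
}

/* Approach 2: hypothetical feature bits exchanged at connect time. */
#define FEAT_PING  0x1u
#define FEAT_STATS 0x2u
#define FEAT_BULK  0x4u

/* Keep only the features both sides claim to understand. */
uint32_t negotiate(uint32_t local, uint32_t peer)
{
    return local & peer;
}

/* The sender checks the agreed mask and never sends anything else. */
bool may_send(uint32_t agreed, uint32_t feature)
{
    return (agreed & feature) != 0;
}
```

With the first scheme an old receiver can skip a new optional message but refuses a new mandatory one; with the second, an unknown request never reaches the peer in the first place.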

In the absence of such information, the safest behaviour (if not the
most robust) is to fail, since the unknown event may be critical to the
correct behaviour of the system.
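A fail-safe default of this kind might look like the sketch below. It is hypothetical user-space C: the event names and cm_callback() are invented here and are not the rdma_cm or o2iblnd API. Instead of the LBUG() in the quoted patch, an unknown event is logged and turned into an error return, on the assumption that the caller then tears down the connection rather than carrying on.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical connection-manager events; a newer stack may deliver
 * values this code was never built to know about. */
enum cm_event {
    EV_CONNECT_REQUEST,
    EV_ESTABLISHED,
    EV_DISCONNECTED,
};

/* Returns 0 if the event was handled, a negative errno otherwise. */
int cm_callback(int event, int status)
{
    switch (event) {
    case EV_CONNECT_REQUEST:
    case EV_ESTABLISHED:
    case EV_DISCONNECTED:
        /* Real handling for known events would go here. */
        return 0;
    default:
        /* Fail closed: we cannot tell whether the unknown event matters,
         * so report it and refuse it rather than silently ignoring it. */
        fprintf(stderr, "Unexpected event: %d, status: %d\n", event, status);
        return -EPROTO;
    }
}
```

Ignoring the event instead would mean returning 0 from the default case; that is more robust against harmless new events but risks continuing after one that actually mattered.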

Cheers, Andreas

On 2010-06-18, at 12:48, Roger Spellman <Roger.Spellman at terascala.com> wrote:

> Jason (or anyone else),
>
> Patch 23498 ( https://bugzilla.lustre.org/attachment.cgi?id=23498 )
> says:
>
> Index: ./lnet/klnds/o2iblnd/o2iblnd_cb.c
> ===================================================================
> RCS file: /cvsroot/cfs/lnet/klnds/o2iblnd/o2iblnd_cb.c,v
> retrieving revision 1.12.6.1.2.5
> diff -u -p -u -p -r1.12.6.1.2.5 o2iblnd_cb.c
> --- ./lnet/klnds/o2iblnd/o2iblnd_cb.c    20 Nov 2008 09:29:34 -0000      1.12.6.1.2.5
> +++ ./lnet/klnds/o2iblnd/o2iblnd_cb.c    15 May 2009 12:26:07 -0000
> @@ -2654,6 +2654,8 @@ kiblnd_cm_callback(struct rdma_cm_id *cm
>
>    switch (event->event) {
>    default:
> +                CERROR("Unexpected event: %d, status: %d\n",
> +                       event->event, event->status);
>                 LBUG();
>
> Why should we LBUG just for an unexpected event?  Couldn't it just be
> ignored?
>
> -Roger
>
>> -----Original Message-----
>> From: Jason Rappleye [mailto:jason.rappleye at nasa.gov]
>> Sent: Friday, June 18, 2010 2:16 PM
>> To: Roger Spellman
>> Cc: lustre-discuss at lists.lustre.org
>> Subject: Re: [Lustre-discuss] OFED 1.5.1 on Clients
>>
>>
>> On Jun 18, 2010, at 7:49 AM, Roger Spellman wrote:
>>
>>> Jason,
>>> Thanks for this response.  This brings up another question:
>>
>> np
>>
>>> The bug number you referred to mentions an LBUG in OFED 1.4.1.  Are you
>>> saying that the same LBUG would occur with OFED 1.5.1 too without the
>>> patch?
>>
>> Yes. The patch handles new RDMA CM events that appear in OFED 1.4(.1?).
>> They are also in 1.5.1. Without the patch, receipt of one of
>> those events will result in an LBUG.
>>
>> Jason
>>
>>>
>>> -Roger
>>>
>>>> -----Original Message-----
>>>> From: Jason Rappleye [mailto:jason.rappleye at nasa.gov]
>>>> Sent: Thursday, June 17, 2010 5:02 PM
>>>> To: Roger Spellman
>>>> Cc: lustre-discuss at lists.lustre.org
>>>> Subject: Re: [Lustre-discuss] OFED 1.5.1 on Clients
>>>>
>>>>
>>>> On Jun 17, 2010, at 1:23 PM, Roger Spellman wrote:
>>>>
>>>>> Hi,
>>>>> Can anyone share their experiences using OFED 1.5.1 on Lustre
>>>>> Clients?  This is needed because RHAT 5.5 does not support OFED 1.4.2.
>>>>
>>>> We're using it with ~9300 clients running Lustre 1.6.6 and haven't
>>>> identified any OFED 1.5.1-specific issues. If you're using 1.6.x and
>>>> haven't done so already, you'll want to apply bug 19520, attachment 23498.
>>>>
>>>> We just deployed 1.8.2 on a separate cluster with ~130 clients and
>>>> haven't seen any OFED-specific issues there, either. While we did see
>>>> some failures when running acc-sm with the stack of software we use
>>>> here, none of those had anything to do with the version of OFED we
>>>> were running.
>>>>
>>>
>>
>> --
>> Jason Rappleye
>> System Administrator
>> NASA Advanced Supercomputing Division
>> NASA Ames Research Center
>> Moffett Field, CA 94035
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss