[Lustre-discuss] Lustre NOT HEALTHY

Brock Palen brockp at umich.edu
Wed Jan 14 07:07:45 PST 2009


Ok thanks,

It happened again last night, sooner than normal.  I will send a new
message with the details.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985



On Jan 13, 2009, at 11:09 PM, Cliff White wrote:

> Brock Palen wrote:
>> How common is it for servers to go NOT HEALTHY?  I feel it is
>> happening much more often than it should with us: a few times a
>> month.
> It should not happen at all in the normal case. It indicates a
> problem.
>
>> If this happens, we reboot the servers.  Should we do something
>> else?  Maybe it depends on what the problem was?
>
> Well, determining the actual problem that caused the NOT HEALTHY
> would be quite useful, yes. I would not just reboot.
>
> - Examine consoles of _all_ servers for any error indications.
> - Examine syslogs of _all_ servers for any LustreErrors or LBUG.
> - Check network and hardware health. Are your disks happy?
>   Is your network dropping packets?
>
> Try to figure out what was happening on the cluster. Does this relate
> to a specific user workload or system load condition? Can you reproduce
> the situation? Does it happen at a specific time of day or time of
> month?
>
>> If we should not be getting NOT HEALTHY that often, what information
>> should I collect to report to CFS?
>
> The lustre-diagnostics package is a good start for general system
> config. Beyond that, most of what we would need is listed above.
> cliffw
>
>
>
>
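The syslog check suggested in the quoted reply (look on every server for
LustreError or LBUG lines, and see whether they cluster at a particular
time of day) could be sketched roughly as below. This is only an
illustration, not part of any Lustre tooling: the function name
scan_syslog, the regular expression, and the assumed classic syslog line
layout ("Mon DD HH:MM:SS host message") are all assumptions and may need
adjusting for a given site's log format.

```python
import re
from collections import Counter

# Assumption: classic syslog layout, e.g.
#   "Jan 13 23:41:07 oss1 kernel: LustreError: ..."
# Adjust the pattern if your syslog daemon writes a different timestamp.
SYSLOG_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+"
    r"(?P<hour>\d{2}):\d{2}:\d{2}\s+(?P<host>\S+)\s+(?P<msg>.*)$"
)

# The two strings the reply says to look for.
MARKERS = ("LustreError", "LBUG")


def scan_syslog(lines):
    """Return (hits, by_hour) for lines mentioning LustreError or LBUG.

    hits    -- list of (host, message) tuples, in input order
    by_hour -- Counter keyed by (host, hour of day)
    Lines that do not match the assumed layout are skipped.
    """
    hits = []
    by_hour = Counter()
    for line in lines:
        if not any(marker in line for marker in MARKERS):
            continue
        match = SYSLOG_RE.match(line.rstrip("\n"))
        if match is None:
            continue
        host = match.group("host")
        hits.append((host, match.group("msg")))
        by_hour[(host, int(match.group("hour")))] += 1
    return hits, by_hour
```

To use it, open each server's syslog file and pass the file object to
scan_syslog(); comparing by_hour across servers gives a first hint as to
whether the errors line up with a time of day or a particular job.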



