[Lustre-discuss] Lustre NOT HEALTHY

Cliff White Cliff.White at Sun.COM
Tue Jan 13 20:09:27 PST 2009

Brock Palen wrote:
> How common is it for servers to go NOT HEALTHY?  I feel it is  
> happening much more often than it should be with us.  A few times a  
> month.
It should not happen at all, in the normal case. It indicates a problem.

> If this happens, we reboot the servers.  Should we do something  
> else?  Maybe it depends on what the problem was?

Well, determining the actual problem that caused the NOT HEALTHY state
would be quite useful, yes. I would not just reboot.

- Examine consoles of _all_ servers for any error indications
- Examine syslogs of _all_ servers for any LustreErrors or LBUG
- Check network and hardware health. Are your disks happy?
  Is your network dropping packets?
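
The syslog check in the list above can be sketched roughly as follows. This is only an illustration, not an official Lustre tool: the function names are made up, the log path in the usage note is an assumption, and it does nothing smarter than a substring match on the two strings mentioned above.

```python
import re

# The two strings named in the checklist; this is a plain text match,
# not a parser for any particular Lustre message format.
PATTERN = re.compile(r"LustreError|LBUG")


def find_lustre_errors(lines):
    """Return (line_number, text) for each line mentioning LustreError or LBUG."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as handle:
        return find_lustre_errors(handle)
```

Something like `scan_file("/var/log/messages")` on each server would be the usual starting point, assuming that is where your syslog daemon writes kernel messages. Run it on every server, not only the one reporting NOT HEALTHY.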

Try to figure out what was happening on the cluster. Does this relate to
a specific user workload or system load condition? Can you reproduce
the situation? Does it happen at a specific time of day, time of month?
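
To see whether the errors cluster at a particular time of day, one option is to bucket the matching syslog lines by hour. This is a minimal sketch that assumes the traditional syslog timestamp at the start of each line (for example "Jan 13 20:09:27"); the function name is hypothetical, and lines in any other format are silently skipped.

```python
import re
from collections import Counter

# Traditional syslog timestamp: month abbreviation, day, HH:MM:SS.
# Assumption: logs using another timestamp format will not match.
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+(\d{2}):\d{2}:\d{2}\s")


def errors_by_hour(lines):
    """Count lines mentioning LustreError or LBUG, keyed by hour of day (0-23)."""
    counts = Counter()
    for line in lines:
        if "LustreError" not in line and "LBUG" not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

A spike in one hour is worth comparing against cron jobs, backups, and the batch scheduler's records for the same window.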

> If we should not be getting NOT HEALTHY that often, what information  
> should I collect to report to CFS?

The lustre-diagnostics package is a good start for general system config.
Beyond that, most of what we would need is listed above.

> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985