[Lustre-discuss] Lustre NOT HEALTHY
Cliff White
Cliff.White at Sun.COM
Tue Jan 13 20:09:27 PST 2009
Brock Palen wrote:
> How common is it for servers to go NOT HEALTHY? I feel it is
> happening much more often than it should be with us. A few times a
> month.
>
It should not happen at all, in the normal case. It indicates a problem.
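For catching the state when it first appears, rather than some time later, a periodic check can help. This is only a minimal sketch: it assumes the server exposes its health state in /proc/fs/lustre/health_check and that the file reads "healthy" when all is well. Verify both on your release. The function names are made up for illustration.

```python
# Sketch only: the path and the "healthy" string are assumptions to
# verify against your Lustre release.
HEALTH_FILE = "/proc/fs/lustre/health_check"


def read_health(path=HEALTH_FILE):
    """Return the raw contents of the health file."""
    with open(path) as f:
        return f.read()


def is_healthy(text):
    """True only when the file contains exactly the word 'healthy'."""
    return text.strip() == "healthy"
```

Run from cron on each server, something like is_healthy(read_health()) gives you a timestamp for when the server went unhealthy, which you can then line up against the logs.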
> If this happens, we reboot the servers. Should we do something
> else? Maybe it depends on what the problem was?
Well, determining the actual problem that caused the NOT HEALTHY
state would be quite useful, yes. I would not just reboot.
- Examine consoles of _all_ servers for any error indications
- Examine syslogs of _all_ servers for any LustreErrors or LBUG
- Check network and hardware health. Are your disks happy?
  Is your network dropping packets?
Try to figure out what was happening on the cluster. Does this relate to
a specific user workload or system load condition? Can you reproduce
the situation? Does it happen at a specific time of day, time of month?
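The syslog check and the time-of-day question can be combined in a rough script. This is a sketch, not a supported tool: it assumes traditional syslog lines of the form "Mon DD HH:MM:SS host message", and the names scan, SYSLOG_LINE and PATTERN are invented here. It counts lines mentioning LustreError or LBUG per hour of day and per host.

```python
import re
import sys
from collections import Counter

# Assumed line format: "Jan 13 20:09:27 oss1 kernel: LustreError: ..."
SYSLOG_LINE = re.compile(r"^\w{3}\s+\d{1,2}\s+(\d{2}):\d{2}:\d{2}\s+(\S+)\s")
PATTERN = re.compile(r"LustreError|LBUG")


def scan(lines):
    """Count LustreError/LBUG lines by hour of day and by host."""
    by_hour = Counter()
    by_host = Counter()
    for line in lines:
        if not PATTERN.search(line):
            continue
        m = SYSLOG_LINE.match(line)
        if m is None:
            continue  # line does not follow the assumed syslog format
        by_hour[int(m.group(1))] += 1
        by_host[m.group(2)] += 1
    return by_hour, by_host


if __name__ == "__main__":
    # Usage: python scan.py /var/log/messages [more log files...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as f:
            hours, hosts = scan(f)
        print(path)
        for hour, count in sorted(hours.items()):
            print("  hour %02d: %d" % (hour, count))
        for host, count in hosts.most_common():
            print("  host %s: %d" % (host, count))
```

If the counts pile up in one hour or on one server, that is a hint to look at what jobs or cron tasks run then, or at that server's disks and network links.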
>
> If we should not be getting NOT HEALTHY that often, what information
> should I collect to report to CFS?
The lustre-diagnostics package is a good start for general system config.
Beyond that, most of what we would need is listed above.
cliffw
>
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985