[Lustre-discuss] Lustre crashes periodically

Dilger, Andreas andreas.dilger at intel.com
Wed Oct 9 23:38:36 PDT 2013


On 2013/10/09 6:25 PM, "Abraham.Alawi at csiro.au" <Abraham.Alawi at csiro.au> wrote:
>Did you run lfsck against it?

To be honest, I can't think of any reason why a crashing server would be
fixed by running lfsck.  In some rare cases it might be that running
e2fsck could fix a crash (if there is incorrect error handling in the
ldiskfs or Lustre code).

> 
>No kernel crash dumps?

The first thing to do is to connect a serial console, or enable
netconsole/netdump.  It is possible to cross-cable two serial ports
between failover server pairs and use mgetty or maybe conman to capture
the kernel oops messages.  Without that, it is almost impossible to
figure out what the problem is.
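
As a rough illustration of the netconsole option: netconsole sends kernel
log messages as plain UDP datagrams to another host, so the receiving side
only has to listen on a UDP port and write whatever arrives to a file.  The
sketch below is a minimal receiver under stated assumptions, not a
recommended tool (netcat or a syslog daemon is what is normally used).  The
function name, its parameters, and the log file name are hypothetical, and
port 6666 is assumed to match whatever target port the crashing server's
netconsole is configured with.

```python
import socket
import time


def capture_netconsole(bind_addr="0.0.0.0", port=6666,
                       log_path="netconsole.log",
                       max_packets=None, sock=None):
    """Append each UDP datagram received to log_path, one line per datagram.

    Hypothetical helper for illustration only. `port` is an assumption and
    must match the target port configured on the sending server.
    `max_packets=None` means run until interrupted.
    """
    own_sock = sock is None
    if own_sock:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind((bind_addr, port))
    received = 0
    try:
        with open(log_path, "ab") as log:
            while max_packets is None or received < max_packets:
                data, (src_ip, _src_port) = sock.recvfrom(65535)
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                prefix = f"{stamp} {src_ip} ".encode("ascii")
                log.write(prefix + data.rstrip(b"\n") + b"\n")
                # Flush after every message so the last lines before a
                # crash are not left sitting in a buffer.
                log.flush()
                received += 1
    finally:
        if own_sock:
            sock.close()
    return received
```

Calling capture_netconsole() with no arguments on a second machine would
listen until interrupted; the last lines in the log before the server goes
silent are the ones most likely to contain the oops.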

Cheers, Andreas

> 
>Maybe it's not a Lustre-related problem? If you have no Active/Passive
>MDS setup, the Lustre file system will be unusable if the MDS server
>crashes for whatever reason.

> 
>Abraham Alawi
>Linux/UNIX Systems and Storage Specialist |
> STACC Project |
> Information Management & Technology (IMT) |
>CSIRO
> 
>From:
> lustre-discuss-bounces at lists.lustre.org
>[mailto:lustre-discuss-bounces at lists.lustre.org]
>On Behalf Of Arya Mazaheri
>Sent: Wednesday, 9 October 2013 6:52 PM
>To: lustre-discuss at lists.lustre.org
>Subject: [Lustre-discuss] Lustre crashes periodically
>
> 
>Hi everyone, 
>
>I have been having a problem lately with our Lustre 1.8 deployment. It
>crashes periodically: the nodes can't mount the storage, and I can't
>access the Lustre server machine either. So I have to restart the
>machine manually every time to get everything back to normal. I checked
>the logs, memory usage and lock counts to see whether any of them might
>be the cause, but I don't think they account for this issue.
>
>An interesting symptom I see every time this happens is that the
>network activity lights on the InfiniBand switch blink very fast. I
>think heavy traffic on the InfiniBand network to the Lustre server may
>be causing the server to crash. Does this connection seem plausible?
>
> 
>
>Anyway, I hope some of you have experienced this problem before and
>could help me understand what is happening and how to keep the server
>from crashing again!
>
> 
>
>Thanks,
>
>
>


Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division




