[lustre-discuss] Filesystem started crashing recently

Marion Hakanson hakansom at ohsu.edu
Wed Jan 23 18:36:03 PST 2019


This sounds similar to LU-11613:
  https://jira.whamcloud.com/browse/LU-11613

If so, it was fixed for us by upgrading to 2.10.6.  You may be able
to work around it by disabling quotas.
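A minimal sketch of that workaround, assuming an ldiskfs backend and a
hypothetical filesystem name "lustrefs" (run on the MGS node; check the
Lustre manual for your version's exact syntax):

```shell
# Disable quota enforcement on both the metadata and object storage
# targets ("lustrefs" is a placeholder; substitute your fsname):
lctl conf_param lustrefs.quota.mdt=none
lctl conf_param lustrefs.quota.ost=none

# Confirm on the MDS/OSS nodes that enforcement is off:
lctl get_param osd-*.*.quota_slave.info
```

Quota accounting still runs in the background; you can re-enable
enforcement later (e.g. `lctl conf_param lustrefs.quota.mdt=ug`) once
you are on a fixed release.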

Regards,

Marion


> From: Steve Barnet <barnet at icecube.wisc.edu>
> To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
> Date: Wed, 23 Jan 2019 13:48:29 -0600
> Subject: [lustre-discuss] Filesystem started crashing recently
> 
> Hi all,
> 
>     Since early last summer, we have been running a 2.10.4 filesystem
> pretty much without incident. Then about 2 weeks ago, it started
> crashing for no immediately obvious reason. There are no indications
> of hardware related problems in the logs, and the work loads have
> not changed significantly as far as we can tell.
> 
>     I can't rule out hardware or system performance problems, but if
> that is the case, there are no obvious pointers as to what those
> would be. We had one workload that seemed to trigger the problem
> (a couple dozen jobs running du on parts of the filesystem), but
> that had been running for months, and even after we killed it
> we still had a couple of crashes.
> 
>     Since the first crash (on 7 January) we have experienced
> these crashes sporadically: sometimes days between crashes,
> other times hours.
> 
>     The symptoms are the filesystem becoming unresponsive, and a
> load spike on the MDS and one OSS (we have 8x OSS). The OSS
> affected seems to be somewhat random. In the system logs, we see
> hung_task timeouts and stack traces, followed shortly by lustre-log
> dumps. The only real commonality I have seen is that on the MDS,
> the first hung task is in jbd2_journal_commit_transaction.
> 
>     To recover the filesystem, I have run e2fsck on the MDT
> and on any affected OSTs. They have come back clean every time.
> 
>     I have attached snippets of the log files at the time of
> the most recent crash.
> 
> A high level summary of our system:
> 
> MDS (1x) & OSS (8x)
>     OS: CentOS Linux release 7.6.1810 (Core)
>     kernel: 3.10.0-862.2.3.el7_lustre.x86_64
>     Lustre: 2.10.4 (ldiskfs)
> 
> Clients: a mix, but predominantly CentOS 7.x running 2.10.4
> 
> Any insights would be greatly appreciated. There are lots of
> logs, so if they would be helpful, I can certainly make them
> available. In particular, that first lustre-log is pretty big,
> so I just grabbed the lines in closest proximity to the crash.
> 
> Also, if there's a way to get more debugging level
> information from lustre, I'm happy to try that as well.
> 
> And I realize this is all at a very high level, so I'll be
> happy to provide any additional info needed to help me figure
> this out.
> 
> Thanks much for taking the time!
> 
> Best,
> 
> ---Steve
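On the debug-level question above: a minimal sketch of raising Lustre's
debug mask and capturing the kernel debug log with lctl (the flags and
buffer size below are examples, not a prescription):

```shell
# Add RPC and lock tracing to the current debug mask (example flags):
lctl set_param debug="+rpctrace +dlmtrace"

# Grow the in-memory debug buffer so traces survive until the dump
# (size in MB; 1024 is an example value):
lctl set_param debug_mb=1024

# After reproducing the hang, dump the buffer to a file for inspection:
lctl debug_kernel /tmp/lustre-debug.log
```

Remember to drop the mask back down afterwards (e.g. `lctl set_param
debug=0`), since heavy tracing carries a noticeable overhead.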
