[lustre-discuss] Filesystem started crashing recently

Marion Hakanson hakansom at ohsu.edu
Wed Jan 23 18:36:03 PST 2019

This sounds similar to LU-11613.

If so, it was fixed for us by upgrading to 2.10.6.  You may be able
to work around it by disabling quotas.
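
To compare crashes against each other, it may help to pull the hung-task
reports out of the system logs and see which task is reported first each
time. Below is a minimal Python sketch of that idea. It assumes the
standard kernel message "INFO: task NAME:PID blocked for more than N
seconds." The function name hung_tasks and the sample lines (host name,
task names, PIDs, timestamps) are made up for illustration and are not
taken from the attached logs.

```python
import re

# Kernel hung-task detector lines look like:
#   INFO: task <name>:<pid> blocked for more than <N> seconds.
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

def hung_tasks(lines):
    """Return (task name, pid, seconds) for each hung-task report, in log order."""
    found = []
    for line in lines:
        m = HUNG_RE.search(line)
        if m:
            found.append((m.group("comm"), int(m.group("pid")), int(m.group("secs"))))
    return found

# Illustrative sample only (hypothetical host, tasks, and PIDs).
sample = [
    "Jan 23 10:00:01 mds kernel: INFO: task jbd2/dm-2-8:1234 blocked for more than 120 seconds.",
    "Jan 23 10:00:01 mds kernel: Call Trace:",
    "Jan 23 10:00:02 mds kernel: INFO: task mdt00_003:5678 blocked for more than 120 seconds.",
]

reports = hung_tasks(sample)
first = reports[0] if reports else None
print(first)
```

To run it on a real log, pass an open file instead of the sample list,
for example hung_tasks(open("/var/log/messages")), adjusting the path to
wherever your syslog writes kernel messages.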



> From: Steve Barnet <barnet at icecube.wisc.edu>
> To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
> Date: Wed, 23 Jan 2019 13:48:29 -0600
> Subject: [lustre-discuss] Filesystem started crashing recently
> Hi all,
>     Since early last summer, we have been running a 2.10.4 filesystem
> pretty much without incident. Then about 2 weeks ago, it started
> crashing for no immediately obvious reason. There are no indications
> of hardware related problems in the logs, and the work loads have
> not changed significantly as far as we can tell.
>     I can't rule out hardware or system performance problems, but if
> that is the case, there are no obvious pointers as to what those
> would be. We had one workload that seemed to trigger the problem
> (a couple dozen jobs running du on parts of the filesystem), but
> that had been running for months, and even after we killed that
> we had a couple crashes.
>     Since the first crash (on 7 January) we have experienced
> these crashes sporadically. Sometimes days between crashes,
> other times, hours.
>     The symptoms are the filesystem becoming unresponsive, and a
> load spike on the MDS and one OSS (we have 8x OSS). The OSS
> affected seems to be somewhat random. In the system logs, we see
> hung_task timeouts and stack traces, followed shortly by lustre-log
> dumps. The only real commonality I have seen is that on the MDS,
> the first hung task is in jbd2_journal_commit_transaction.
>     To recover the filesystems, I have done e2fsck on the MDT,
> and any affected OSTs. They have come back cleanly every time.
>     I have attached snippets of the log files at the time of
> the most recent crash.
> A high level summary of our system:
> MDS (1x) & OSS (8x)
>     OS: CentOS Linux release 7.6.1810 (Core)
>     kernel: 3.10.0-862.2.3.el7_lustre.x86_64
>     Lustre: 2.10.4 (ldiskfs)
> Clients: a mix, but predominantly CentOS 7.x running 2.10.4
> Any insights would be greatly appreciated. There are lots of
> logs, so if they would be helpful, I can certainly make them
> available. In particular, that first lustre-log is pretty big,
> so I just grabbed the lines in closest proximity to the crash.
> Also, if there's a way to get more debugging level
> information from lustre, I'm happy to try that as well.
> And I realize this is all at a very high level, so I'll be
> happy to provide any additional info needed to help me figure
> this out.
> Thanks much for taking the time!
> Best,
> ---Steve
