[lustre-discuss] Filesystem started crashing recently

Stephane Thiell sthiell at stanford.edu
Thu Jan 24 08:13:05 PST 2019


Hi Steve,

This could be LU-5152 (https://jira.whamcloud.com/browse/LU-5152), which attempted to fix unprivileged chgrp -R. The patch introduced some kind of dependency between servers in the quota handling. It was reverted in 2.10.6; however, it's not clear to me what the plan for chgrp -R is at this point. Perhaps someone at Whamcloud could clarify. We definitely have users doing chgrp -R occasionally.

In your case, I would recommend upgrading to 2.10.6. In my experience, upgrading between 2.10.x releases is painless: we do it as a rolling upgrade, failing over targets to avoid any significant downtime.
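
As a rough aid for tracking a rolling upgrade, the sketch below flags which nodes are still on a 2.10.x release older than 2.10.6. It is only an illustration: the host names and the inventory dictionary are hypothetical, and how the version strings are collected from each node is left open.

```python
import re

# Per the message above, the LU-5152 patch is reverted in 2.10.6.
REVERT_RELEASE = (2, 10, 6)


def parse_version(text):
    """Extract (major, minor, patch) from a string such as '2.10.4'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no x.y.z version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(text, target=REVERT_RELEASE):
    """Return True if the version is in the target's x.y series but older."""
    version = parse_version(text)
    return version[:2] == target[:2] and version < target


if __name__ == "__main__":
    # Hypothetical inventory: host names are made up for illustration.
    nodes = {"mds1": "2.10.4", "oss1": "2.10.4", "oss2": "2.10.6"}
    for host, version in sorted(nodes.items()):
        status = "upgrade" if needs_upgrade(version) else "ok"
        print(host, version, status)
```

Versions outside the 2.10.x series are deliberately reported as "ok" here, since the advice above only concerns upgrades within 2.10.x.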

Stephane



> On Jan 23, 2019, at 11:48 AM, Steve Barnet <barnet at icecube.wisc.edu> wrote:
> 
> Hi all,
> 
>   Since early last summer, we have been running a 2.10.4 filesystem
> pretty much without incident. Then about 2 weeks ago, it started
> crashing for no immediately obvious reason. There are no indications
> of hardware-related problems in the logs, and the workloads have
> not changed significantly as far as we can tell.
> 
>   I can't rule out hardware or system performance problems, but if
> that is the case, there are no obvious pointers as to what those
> would be. We had one workload that seemed to trigger the problem
> (a couple dozen jobs running du on parts of the filesystem), but
> that had been running for months, and even after we killed that
> we had a couple crashes.
> 
>   Since the first crash (on 7 January) we have experienced
> these crashes sporadically. Sometimes days between crashes,
> other times, hours.
> 
>   The symptoms are the filesystem becoming unresponsive, and a
> load spike on the MDS and one OSS (we have 8x OSS). The OSS
> affected seems to be somewhat random. In the system logs, we see
> hung_task timeouts and stack traces, followed shortly by lustre-log
> dumps. The only real commonality I have seen is that on the MDS,
> the first hung task is in jbd2_journal_commit_transaction.
> 
>   To recover the filesystems, I have done e2fsck on the MDT,
> and any affected OSTs. They have come back cleanly every time.
> 
>   I have attached snippets of the log files at the time of
> the most recent crash.
> 
> A high level summary of our system:
> 
> MDS (1x) & OSS (8x)
>   OS: CentOS Linux release 7.6.1810 (Core)
>   kernel: 3.10.0-862.2.3.el7_lustre.x86_64
>   Lustre: 2.10.4 (ldiskfs)
> 
> Clients: a mix, but predominantly CentOS 7.x running 2.10.4
> 
> Any insights would be greatly appreciated. There are lots of
> logs, so if they would be helpful, I can certainly make them
> available. In particular, that first lustre-log is pretty big,
> so I just grabbed the lines in closest proximity to the crash.
> 
> Also, if there's a way to get more debugging level
> information from lustre, I'm happy to try that as well.
> 
> And I realize this is all at a very high level, so I'll be
> happy to provide any additional info needed to help me figure
> this out.
> 
> Thanks much for taking the time!
> 
> Best,
> 
> ---Steve
> 
> 
> 
> <oss-messages.txt><mds-messages.txt><lustre-log.1547807192.63647-snippet.gz>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


