[lustre-discuss] MDT LBUG after every restart

Thomas Roth t.roth at gsi.de
Mon Mar 26 05:09:37 PDT 2018


For the record:

This was triggered by one job script that used a working directory in a directory tree under MDT0 while
redirecting the stderr of its gzip commands to a directory in a tree under MDT1.
Once the user moved everything into one tree or the other, the problem disappeared.
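For anyone hitting the same symptom: you can check which MDT each of the directories involved lives on with lfs. A sketch (the paths are placeholders; on our 2.5 release the option is 'lfs getstripe -M', later releases also accept '--mdt-index'):

```shell
# Print the MDT index of each directory the job touches.
# Differing indices mean the job performs cross-MDT operations.
lfs getstripe -M /lustre/nyx/workdir    # directory tree under MDT0
lfs getstripe -M /lustre/nyx/logdir     # directory tree under MDT1
```

These commands only run against a mounted Lustre client, so adjust the paths to your own setup.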

Thomas

On 03/12/2018 09:54 AM, Thomas Roth wrote:
> Hi all,
> 
> our production system running Lustre 2.5.3 has broken down, and I'm quite clueless.
> 
> The second (of two) MDTs crashed, and after reboot + recovery it LBUGs again with:
> 
> 
> Mar 11 20:02:37 lxmds15 kernel: Lustre: nyx-MDT0001: Recovery over after 1:36, of 720 clients 720 
> recovered and 0 were evicted.
> 
> Mar 11 20:02:37 lxmds15 kernel: LustreError: 
> 6705:0:(osp_precreate.c:719:osp_precreate_cleanup_orphans()) nyx-OST0001-osc-MDT0001: cannot cleanup
> orphans: rc = -108
> 
> Mar 11 20:02:37 lxmds15 kernel: LustreError: 
> 6705:0:(osp_precreate.c:719:osp_precreate_cleanup_orphans()) Skipped 74 previous similar messages
> 
> Mar 11 20:02:37 lxmds15 kernel: LustreError: 6574:0:(mdt_handler.c:2706:mdt_object_lock0()) ASSERTION( 
> !(ibits & (MDS_INODELOCK_UPDATE |
> MDS_INODELOCK_PERM)) ) failed: nyx-MDT0001: wrong bit 0x2 for remote obj [0x5100027c70:0x17484:0x0]
> 
> Mar 11 20:02:37 lxmds15 kernel: LustreError: 6574:0:(mdt_handler.c:2706:mdt_object_lock0()) LBUG
> 
> 
> 
> This seems to be LU-6071, but I am wondering what actually causes it - there should be no ongoing 
> attempts from a client to create a directory on the second MDT.
> 
> 
> After doing an e2fsck on the MDT, it mounts and then crashes with a different FID each time. (If 
> mounted without fsck, the crashing FID remains the same.)
> 
> 
> Is there any way we can find out more about the cause?
> 
> If it is a finite number of troubling inodes, is there a trick to manipulate/clear these?
> 
> 
> Regards,
> Thomas
> 
> 

