[Lustre-discuss] clients hanging -- found existing inode...

Andreas Dilger adilger at sun.com
Mon Oct 27 12:44:58 PDT 2008


On Oct 25, 2008  11:03 -0400, cwalker at fas.harvard.edu wrote:
> We're having problems with clients hanging with the following messages on the
> client:
> 
> Oct 25 07:22:55 herologin1 kernel: LustreError: 16556:0:(osc_request.c:2866:osc_set_data_with_check()) ### inconsistent l_ast_data found ns: circelfs-OST0017-osc-ffff81021d4dbc00 lock: ffff81017dc9a600/0x45756ae33592b057 lrc: 3/1,0 mode: PR/PR res: 690850/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 100000 remote: 0x310e24cdd40141c8 expref: -99 pid: 16385
> Oct 25 07:22:55 herologin1 kernel: LustreError: 16556:0:(osc_request.c:2872:osc_set_data_with_check()) ASSERTION(old_inode->i_state & I_FREEING) failed:Found existing inode ffff810171540638/130588871/1546649987 state 1 in lock: setting data to ffff81015f5275f8/130588869/1546649985
> Oct 25 07:22:55 herologin1 kernel: LustreError: 16556:0:(osc_request.c:2872:osc_set_data_with_check()) LB
> 
> 
> followed by the client hanging.  Nothing appears on the MDS or on the OSS in
> question.  These symptoms were reported by another user, but there was no
> resolution or workaround.  We're getting this about once per day on our head
> nodes.  Has anyone had any luck with this issue?

This means you may have a corrupted back-end filesystem: inodes
130588871 and 130588869 are both using the same OST object ID.
If the user knows which files they are accessing, then the easiest
solution is to just delete those two files.  Failing that, you can
find the pathnames for these files on the MDS with:

	debugfs -c -R "ncheck 130588871 130588869" /dev/{mdsdev}

and strip the leading "/ROOT" from each pathname it reports; the
remainder is the path relative to the client's Lustre mount point.
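
For example, the ncheck output looks something like this (the
pathnames here are invented for illustration; only the inode numbers
come from your log):

	Inode		Pathname
	130588871	/ROOT/home/user/run1.dat
	130588869	/ROOT/home/user/run2.dat

so on the clients these would be {mountpoint}/home/user/run1.dat and
{mountpoint}/home/user/run2.dat.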

If you want to keep a copy of the file (one of them is likely
corrupted, or a duplicate of the other), then make a copy of ONE file
on one node and the OTHER file on another node, and then delete both
files on their respective nodes.  Accessing both files on a single
node will trigger this assertion again.
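
A minimal sketch of that procedure, assuming the two invented
pathnames from the ncheck example above, a mount point of /lustre,
and two client nodes node1 and node2 (all placeholder names):

	# On client node 1, touch only the FIRST file
	node1$ cp /lustre/home/user/run1.dat /lustre/home/user/run1.dat.saved
	node1$ rm /lustre/home/user/run1.dat

	# On a DIFFERENT client node, touch only the SECOND file
	node2$ cp /lustre/home/user/run2.dat /lustre/home/user/run2.dat.saved
	node2$ rm /lustre/home/user/run2.dat

Inspect the two saved copies afterward to decide which one to keep.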

The "lfsck" tool will also detect and fix this, but it is much slower
than doing it by hand unless there is a large amount of corruption.
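
If you do go the lfsck route, the procedure is roughly as follows
(device names, database paths, and the client mount point are
placeholders; the --mdsdb/--ostdb options require the Lustre-patched
e2fsprogs, so check the manual for your release before running it):

	# build databases from the MDS and each OST device (-n = read-only)
	mds# e2fsck -n -v --mdsdb /tmp/mdsdb /dev/{mdsdev}
	oss# e2fsck -n -v --ostdb /tmp/ostdb /dev/{ostdev}
	# then check MDS<->OST consistency from a client with the fs mounted
	client# lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb /mnt/lustre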

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



