[Lustre-discuss] file system instability after fsck and lfsck

Dan dan at nerp.net
Mon Oct 26 09:24:42 PDT 2009


Hi all,

I'm running Lustre 1.6.7.2 on RHEL 4. I ran fsck and lfsck after
several hard shutdowns caused by power failures in the server room. Before
the repairs I was getting a few of the ASSERTION errors listed below on
some clients when certain files were accessed. This almost always locks
up the client. How can I find these "bad" files? Even running ls can lock
up a client. Unsurprisingly, ps shows ls hanging in state
D or D+.
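
One possible way to hunt for such files without tying up an interactive shell is to probe each path from a throwaway child process and give up on it after a timeout. The following is only a minimal sketch: the names probe and scan and the 10 second timeout are hypothetical, it assumes that listing a directory does not itself hang, and every hung probe leaves behind a child stuck in state D until the I/O returns, so it is best tried on a client that can be rebooted.

```python
import os
import subprocess
import sys
import time

# Child command: lstat a single path and exit (hypothetical helper, not a Lustre tool).
CHILD = "import os, sys; os.lstat(sys.argv[1])"


def probe(path, timeout=10.0, poll=0.05):
    """Return 'ok', 'error', or 'hung' for one path."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rc = child.poll()
        if rc is not None:
            return "ok" if rc == 0 else "error"
        time.sleep(poll)
    # Deliberately no wait() here: a child blocked in state D cannot be
    # reaped until its I/O completes, and waiting would block this process too.
    child.kill()
    return "hung"


def scan(root, timeout=10.0):
    """Yield (path, status) for every entry under root whose probe is not 'ok'."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            status = probe(path, timeout)
            if status != "ok":
                yield path, status
```

Something like `for path, status in scan("/lustre/some/dir"): print(status, path)` would then list the suspect entries. The stat happens in a child so that the scanning process itself never blocks on a bad file.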

After the repairs, 51,833 orphaned files were found and placed in
/lustre/lost+found. Also, lfsck reported 414,000 duplicate files when run
with -n. I stopped lfsck while it was creating the duplicates in
lost+found/duplicates because I didn't have enough space on the FS to
hold them all!

Some users have started reporting that newly created files sometimes
appear without any data: the permissions, size, and owner info all show
as ????. Many other files are created and accessed successfully, and
existing files can be read OK. The filesystem is currently unusable
because nearly all jobs hang the client. How do I fix this?


I typically get this error on clients:

Oct 20 16:11:49 node05 kernel:
LustreError:26409:0:(osc_request.c:2974:osc_set_data_with_check())
ASSERTION(old_inode->i_state & I_FREEING) failed: Found existing inode
0000010051c5a278/6590499/4091507727 state 1 in lock: setting data to
0000010051c5acf8/13570878/674587622

LustreError: 5842:0:(lib-move.c:110:lnet_try_match_md()) Matching packet
from 12345-192.168.0.27@tcp, match 162853 length 1456 too big: 1360
allowed
Lustre: Request x162853 sent from
filesystem-MDT0000-mdc-000001007dc6d400 to NID 192.168.0.27@tcp 100s ago
has timed out (limit 100s).
Lustre: filesystem-MDT0000-mdc-000001007dc6d400: Connection to service
filesystem-MDT0000 via nid 192.168.0.26@tcp was lost; in progress
operations using this service will wait for recovery to complete.


I see a lot of this on the OSSs:



Oct 20 16:25:44 OSS2 kernel: LustreError:
7857:0:(osc_request.c:2898:osc_set_data_with_check()) ### inconsistent
l_ast_data found ns: oss2-OST0004-osc-----1041da90c00 lock:
00000103ef8b1040/0x205ca9465c341c75 lrc:3/1,0 mode PR/PR res: 2/0 rrc:2
type: EXT [0->18446744073709551615] (req 0-> 18446744073709551615)
flags: 100000 remote:0xdcf241da6ca3e60a expref: -99 pid:5289

Oct 20 16:26:22 OSS2 kernel: LustreError:
4991:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for
resource 482767: rc -2
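
To make messages like the ones above easier to sift through in bulk, the numbers can be pulled out with a short script. This is a minimal sketch, not a Lustre tool: the function names are hypothetical, it assumes the three slash-separated fields after "inode" are pointer/inode number/generation, and it assumes rc is a negated errno value (so rc -2 would be ENOENT, "no such file or directory").

```python
import errno
import re

# Assumption: the fields are <pointer>/<inode number>/<generation>.
ASSERT_RE = re.compile(
    r"Found existing inode\s+[0-9a-f]+/(?P<old_ino>\d+)/\d+"
    r".*?setting data to\s+[0-9a-f]+/(?P<new_ino>\d+)/\d+",
    re.DOTALL,
)
LVBO_RE = re.compile(
    r"lvbo_init failed for\s+resource (?P<res>\d+): rc (?P<rc>-?\d+)"
)


def parse_assertion(text):
    """Return (old_inode_number, new_inode_number), or None if no match."""
    m = ASSERT_RE.search(text)
    if m is None:
        return None
    return int(m.group("old_ino")), int(m.group("new_ino"))


def parse_lvbo(text):
    """Return (resource_id, errno_name), or None if no match."""
    m = LVBO_RE.search(text)
    if m is None:
        return None
    rc = int(m.group("rc"))
    # Assumption: rc is a negated errno value, as is usual in kernel code.
    return int(m.group("res")), errno.errorcode.get(-rc, "unknown")
```

Applied to the client assertion above, parse_assertion would pick out 6590499 and 13570878. Applied to the OSS message, parse_lvbo would give resource 482767 with ENOENT.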


Thank you,

Dan



