[Lustre-discuss] Client complaining about duplicate inode entry after lustre recovery

Wojciech Turek wjt27 at cam.ac.uk
Fri Oct 9 17:40:15 PDT 2009


Hi,

Did you get to the bottom of this?

We are having exactly the same problem with our lustre-1.6.6 (rhel4) file
systems. Recently it has got worse and the MDS crashes quite frequently. When
we run e2fsck it finds errors and fixes them, but after some time we are still
seeing the same errors in the logs about missing objects, and files get
corrupted (?-----------). Clients also LBUG quite frequently with this message:
(osc_request.c:2904:osc_set_data_with_check()) LBUG.
This looks like a serious Lustre problem, but so far I have found no clues on
it even after a long search through the Lustre bugzilla.
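
Would a forced read-only e2fsck pass on the stopped MDT, something like the
line below (the device path is only a placeholder), be a sensible way to
confirm that the corruption really does come back between repairs?

  e2fsck -f -n /dev/mds_device    # force a full check, report problems, change nothing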

Our MDSs and OSSs are on UPS power, the RAID arrays look healthy, and we do not
see any errors in the syslog.

I would be grateful for any hints on this one.

Wojciech

2009/8/24 rishi pathak <mailmaverick666 at gmail.com>

> Hi,
>
> Our Lustre fs comprises 15 OST/OSS and 1 MDS with no failover. Clients as
> well as servers run lustre-1.6 and kernel 2.6.9-18.
>
> Doing an ls -ltr on a directory in the Lustre fs produces the following
> errors (taken from the Lustre logs) on the client:
>
> 00000008:00020000:0:1251099455.304622:0:724:0:(osc_request.c:2898:osc_set_data_with_check()) ### inconsistent l_ast_data found ns: scratch-OST0005-osc-ffff81201e8dd800 lock: ffff811f9af04000/0xec0d1c36da6992fd lrc: 3/1,0 mode: PR/PR res: 570622/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 100000 remote: 0xb79b445e381bc9e6 expref: -99 pid: 22878
> 00000008:00040000:0:1251099455.337868:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) ASSERTION(old_inode->i_state & I_FREEING) failed:Found existing inode ffff811f2cf693b8/197272544/1895600178 state 0 in lock: setting data to ffff8118ef8ed5f8/207519777/1771835328
> 00000000:00040000:0:1251099455.360090:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) LBUG
>
>
> On the scratch-OST0005 OST, the logs show:
>
> Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569204: rc -2
> Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) Skipped 19 previous similar messages
> Aug 24 12:40:43 yn266 kernel: LustreError: 2737:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569195: rc -2
> Aug 24 12:44:59 yn266 kernel: LustreError: 2835:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569198: rc -2
>
> We are getting these kinds of errors for many clients.
>
> ## History ##
> Prior to these occurrences, our MDS showed signs of failure: CPU load was
> shooting above 100 (on a quad-core, quad-socket system) and users were
> complaining about slow storage performance. We took the filesystem offline and
> ran fsck on the unmounted MDS and OSTs. fsck on the OSTs went fine, but it
> reported some errors, which were fixed. For a data integrity check, the mdsdb
> and ostdb databases were built and lfsck was run on a client (the client was
> mounted with abort_recov).
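>
> Roughly, the databases were built with the Lustre-patched e2fsprogs along the
> following lines (device names, db paths and the mount point below are only
> placeholders):
>
>   e2fsck -n -v --mdsdb /tmp/mdsdb /dev/mds_device                        # on the MDS, read-only db build
>   e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostNdb /dev/ostN_device   # on each OSS, once per OST
>   mount -t lustre -o abort_recov mgs_node@tcp0:/scratch /mnt/scratch     # client mount used for lfsck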
>
> lfsck was run in the following order (roughly as sketched below):
> lfsck with no fix - reported dangling inodes and orphaned objects
> lfsck with -l (back up orphaned objects)
> lfsck with -d and -c (delete orphaned objects and create missing OST
> objects referenced by the MDS)
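>
> The actual invocations were approximately as follows, run against the client
> mount point (the db paths, the list of OST dbs and the mount point are
> placeholders):
>
>   lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ost0db ... /tmp/ost14db /mnt/scratch     # report only, no fixes
>   lfsck -l -v --mdsdb /tmp/mdsdb --ostdb /tmp/ost0db ... /tmp/ost14db /mnt/scratch     # back up orphaned objects
>   lfsck -c -d -v --mdsdb /tmp/mdsdb --ostdb /tmp/ost0db ... /tmp/ost14db /mnt/scratch  # create missing objects, delete orphans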
>
> After the above operations, we were seeing files on the clients displayed in
> red and blinking in ls output. Doing a stat on them returned the error 'No
> such file or directory'.
>
> My question is whether the order in which lfsck was run (and whether lfsck
> should be run multiple times) is related to the errors we are now getting.
>
>
>
>
> --
> Regards--
> Rishi Pathak
> National PARAM Supercomputing Facility
> Center for Development of Advanced Computing(C-DAC)
> Pune University Campus,Ganesh Khind Road
> Pune-Maharastra
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>


-- 
Wojciech Turek

Assistant System Manager

High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517

