[Lustre-discuss] Client complaining about duplicate inode entry after Lustre recovery

rishi pathak mailmaverick666 at gmail.com
Mon Aug 24 01:54:25 PDT 2009


Hi,

Our Lustre filesystem comprises 15 OSTs/OSSes and 1 MDS with no failover. Clients
as well as servers run Lustre 1.6 on kernel 2.6.9-18.

       Doing an ls -ltr on a directory in the Lustre filesystem throws the following
errors (as captured in the Lustre logs) on the client:

00000008:00020000:0:1251099455.304622:0:724:0:(osc_request.c:2898:osc_set_data_with_check()) ### inconsistent l_ast_data found ns: scratch-OST0005-osc-ffff81201e8dd800 lock: ffff811f9af04000/0xec0d1c36da6992fd lrc: 3/1,0 mode: PR/PR res: 570622/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 100000 remote: 0xb79b445e381bc9e6 expref: -99 pid: 22878
00000008:00040000:0:1251099455.337868:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) ASSERTION(old_inode->i_state & I_FREEING) failed:Found existing inode ffff811f2cf693b8/197272544/1895600178 state 0 in lock: setting data to ffff8118ef8ed5f8/207519777/1771835328
00000000:00040000:0:1251099455.360090:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) LBUG


On the OSS serving scratch-OST0005, the logs show:

Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569204: rc -2
Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) Skipped 19 previous similar messages
Aug 24 12:40:43 yn266 kernel: LustreError: 2737:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569195: rc -2
Aug 24 12:44:59 yn266 kernel: LustreError: 2835:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569198: rc -2

We are getting these kinds of errors on many clients.

## History ##
Prior to these occurrences, our MDS showed signs of failure: CPU load was
shooting above 100 (on a quad-core, quad-socket system) and users were
complaining about slow storage performance. We took the filesystem offline and
ran fsck on the unmounted MDS and OSTs. fsck on the OSTs went fine, but the
fsck of the MDS showed some errors, which were fixed. For a data integrity
check, the mdsdb and ostdb databases were built and lfsck was run on a client
(the client was mounted with abort_recov).
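For reference, this is roughly how the databases were built and the client was
mounted, following the usual Lustre 1.6 lfsck procedure with the Lustre-patched
e2fsprogs; the device paths, NID, filesystem name, and database filenames below
are placeholders rather than our exact invocations:

# on the unmounted MDS: read-only pass that writes the MDS database
e2fsck -n -v --mdsdb /tmp/mdsdb /dev/mds_device

# on each OSS (with a copy of the mdsdb available): build one database per OST
e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-OST0005 /dev/ost_device

# mount the client, skipping recovery, before running lfsck
mount -t lustre -o abort_recov mds_nid@tcp0:/scratch /mnt/scratch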

lfsck was run in the following order (a rough sketch of the invocations is below):
1. lfsck with no fix options - reported dangling inodes and orphaned objects
2. lfsck with -l (back up orphaned objects)
3. lfsck with -d and -c (delete orphaned objects and create missing OST objects
referenced by the MDS)
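The three passes looked roughly like the following; the mount point and database
paths are placeholders, and the --ostdb list should name one database per OST
(elided here with "..."):

# pass 1: read-only check, reports problems but fixes nothing
lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-OST0000 ... /mnt/scratch

# pass 2: save orphaned objects to lost+found before touching them
lfsck -l -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-OST0000 ... /mnt/scratch

# pass 3: delete orphaned objects and create missing OST objects referenced by the MDS
lfsck -d -c -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-OST0000 ... /mnt/scratch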

After the above operations, on the clients we were seeing files shown in red and
blinking (the way ls --color typically marks entries whose targets are missing).
Doing a stat on them returned the error 'No such file or directory'.

My question is whether the order in which lfsck was run (and whether lfsck
should have been run multiple times) is related to the errors we are now getting.




-- 
Regards--
Rishi Pathak
National PARAM Supercomputing Facility
Center for Development of Advanced Computing(C-DAC)
Pune University Campus,Ganesh Khind Road
Pune, Maharashtra

