[Lustre-discuss] Problem replacing an OST in 1.6.7

Nirmal Seenu nirmal at fnal.gov
Wed Mar 4 10:36:17 PST 2009



Andreas Dilger wrote:
> On Mar 03, 2009  17:15 -0600, Nirmal Seenu wrote:
>> mkfs.lustre --fsname=lqcdproj --ost --mgsnode=iblustre1@tcp1 \
>>     --mkfsoptions="-m 0" --index=0000 --reformat /dev/md2
>>
>> I received these error messages when I tried to mount it for the first time:
>>
>> Mar  3 16:19:53 lustre1 kernel: Lustre: OST lqcdproj-OST0000 now serving 
>> dev (lqcdproj-OST0000/a968f0cc-a66b-bbf7-458f-9b8759c60ef5) with 
>> recovery enabled
> 
> So, the new OST has started up after being reformatted.
> 
>> Mar  3 16:19:56 lustre1 kernel: Lustre: MDS lqcdproj-MDT0000: 
>> lqcdproj-OST0000_UUID now active, resetting orphans
> 
> Here, the MDS (which doesn't know that the OST was reformatted)
> is trying to recreate the objects that are missing from the OST
> (this might be several million objects, because it doesn't know you
> reformatted the filesystem).
> 
>> Mar  3 16:19:58 lustre1 kernel: LustreError: 
>> 6359:0:(filter.c:3138:filter_precreate()) create failed rc = -28
> 
> Here, the OST has run out of inodes, because it was trying to
> create some millions of objects.
> 
> 
> This is probably a situation that Lustre could handle more gracefully,
> by just refusing to recreate those missing objects if the count is too
> high and accepting the MDS's word for it that those objects were
> previously used.  It isn't ideal, but the number of times an OST is
> reformatted like this is very small.
> 
> Can you please file a bug at bugzilla.lustre.org with the detailed
> procedure you followed?
> 
> In the meantime I suggest you just format your new OST and add it
> without specifying an OST index, and permanently mark OST0000 inactive
> (steps to do so were recently discussed on the list).
> 
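
(For anyone following along: formatting the replacement OST without a fixed
index should just mean dropping --index from the format command above,
something like

mkfs.lustre --fsname=lqcdproj --ost --mgsnode=iblustre1@tcp1 \
    --mkfsoptions="-m 0" --reformat /dev/md2

so that the MGS picks the next free index itself.)
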
This was the first thing that I tried. I attempted to permanently
deactivate OST0000 using the command:

lctl --device 5 conf_param lqcdproj-OST0000.osc.active=0
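
(The --device number there is just the index shown in the local device
list, which can be checked with

lctl dl

on the node where the command is run; the exact number will differ from
system to system.)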

The problem I ran into at that point was that running "lfs check
servers" basically hosed the worker nodes, and I had to reboot all of
them. The worker nodes are still running the 1.6.6 patchless client on
RHEL4 with a 2.6.21 kernel.org kernel.

Has this issue been fixed in 1.6.7? It will be at least a couple of
weeks before I can upgrade all the clients to 1.6.7.

In the meantime I am going to try reformatting my OST with more inodes
and see if that fixes the problem.
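
Something along these lines is what I have in mind (the -i value below is
just an illustration; it gets passed straight through to mke2fs as the
bytes-per-inode ratio, so a smaller value means more inodes):

mkfs.lustre --fsname=lqcdproj --ost --mgsnode=iblustre1@tcp1 \
    --mkfsoptions="-i 4096 -m 0" --index=0000 --reformat /dev/md2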

Thanks
Nirmal


