[Lustre-discuss] Problem replacing an OST in 1.6.7
Nirmal Seenu
nirmal at fnal.gov
Wed Mar 4 10:36:17 PST 2009
Andreas Dilger wrote:
> On Mar 03, 2009 17:15 -0600, Nirmal Seenu wrote:
>> mkfs.lustre --fsname=lqcdproj --ost --mgsnode=iblustre1@tcp1
>> --mkfsoptions="-m 0" --index=0000 --reformat /dev/md2
>>
>> I received these error messages when I tried to mount it for the first time:
>>
>> Mar 3 16:19:53 lustre1 kernel: Lustre: OST lqcdproj-OST0000 now serving
>> dev (lqcdproj-OST0000/a968f0cc-a66b-bbf7-458f-9b8759c60ef5) with
>> recovery enabled
>
> So, the new OST has started up after being reformatted.
>
>> Mar 3 16:19:56 lustre1 kernel: Lustre: MDS lqcdproj-MDT0000:
>> lqcdproj-OST0000_UUID now active, resetting orphans
>
> Here, the MDS (which doesn't know that the OST was reformatted)
> is trying to recreate the objects that are missing from the OST
> (this might be several million objects, because it doesn't know you
> reformatted the filesystem).
>
>> Mar 3 16:19:58 lustre1 kernel: LustreError:
>> 6359:0:(filter.c:3138:filter_precreate()) create failed rc = -28
>
> Here, the OST has run out of inodes, because it was trying to
> create several million objects.
>
>
> This is probably a situation that Lustre could handle more gracefully,
> by refusing to recreate those missing objects if the count is too
> high and accepting the MDS's word for it that those objects were
> previously used. It isn't ideal, but the number of times an OST is
> reformatted like this is very small.
>
> Can you please file a bug at bugzilla.lustre.org with the detailed
> procedure you followed?
>
> In the meantime I suggest you just format your new OST and add it
> without specifying an OST index, and permanently mark OST0000 inactive
> (steps to do so were recently discussed on the list).
>
This was the first thing that I tried: permanently deactivating OST0000
with the command:

lctl --device 5 conf_param lqcdproj-OST0000.osc.active=0

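(For completeness, my understanding of the rest of the suggested
procedure is to format the replacement OST without an --index, so that
the MGS assigns the next free index, along the lines of:

mkfs.lustre --fsname=lqcdproj --ost --mgsnode=iblustre1@tcp1 \
    --mkfsoptions="-m 0" --reformat /dev/md2
mount -t lustre /dev/md2 /mnt/ost

where /mnt/ost just stands in for the real mount point; I haven't
verified this exact sequence.)
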
The problem that I ran into at that point was that running "lfs check
servers" basically hosed the worker nodes, and I had to reboot all of
them. The worker nodes are still running the 1.6.6 patchless client on
RHEL4 with a 2.6.21 kernel.org kernel.
Has this issue been fixed in 1.6.7? It will be at least a couple of
weeks before I can upgrade all the clients to 1.6.7.
In the meantime I am going to try reformatting my OST with more inodes
and see if that fixes the problem.
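Something along these lines, where -N passes a total inode count through
to the underlying ldiskfs mke2fs (the 4000000 is just a guess on my part
and would need to be comfortably larger than the number of objects the
MDS tries to recreate):

mkfs.lustre --fsname=lqcdproj --ost --mgsnode=iblustre1@tcp1 \
    --mkfsoptions="-m 0 -N 4000000" --index=0000 --reformat /dev/md2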
Thanks
Nirmal