[Lustre-discuss] Lustre client error

Jagga Soorma jagga13 at gmail.com
Tue Feb 15 16:09:07 PST 2011


This OST is at 100% now with only 12GB remaining, and something is actively
writing to this volume.  What would be the appropriate thing to do in this
scenario?  If I set this OST to read-only on the MDS then some of my clients
start hanging.

Should I be running "lfs find -O OST_UID /lustre" and then moving the files
out of this filesystem and copying them back in?  But then there is no
guarantee that they will not be written to this specific OST again.
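
(A rough sketch of that approach, assuming the client mount point is
/reshpcfs and the full target is reshpcfs-OST0007 as in the lfs df output
quoted below; the file-list path is just an example, and this only helps for
files that are not being actively written:)
--
# on a client: list the files that have objects on the full OST
lfs find --obd reshpcfs-OST0007_UUID /reshpcfs > /tmp/ost7-files

# copy each file and rename it over the original so its objects are
# re-allocated on the other OSTs (the lfs_migrate helper automates this
# loop, if your Lustre version ships it)
while IFS= read -r f; do
    cp -a "$f" "$f.tmp" && mv "$f.tmp" "$f"
done < /tmp/ost7-files
--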

Any help would be greatly appreciated.

Thanks,
-J

On Tue, Feb 15, 2011 at 3:05 PM, Jagga Soorma <jagga13 at gmail.com> wrote:

> I might be looking at the wrong OST.  What is the best way to map an
> actual /dev/mapper/mpath[X] device to the OST ID used for that volume?
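>
> (A rough sketch of two ways to do that on the OSS; the device name is just
> an example from the df output quoted below, and the /proc path assumes the
> 1.8 layout:)
> --
> # read the ldiskfs label, which mkfs.lustre sets to the target name
> e2label /dev/mapper/mpath7                # prints e.g. reshpcfs-OST0007
>
> # or map every running target to its backing device in one go
> grep . /proc/fs/lustre/obdfilter/*/mntdev
> --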
>
> Thanks,
> -J
>
>
> On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>
>> Also, it looks like the client is reporting a different %used than the
>> OSS server itself:
>>
>> client:
>> reshpc101:~ # lfs df -h | grep -i 0007
>> reshpcfs-OST0007_UUID      2.0T      1.7T    202.7G   84% /reshpcfs[OST:7]
>>
>> oss:
>> /dev/mapper/mpath7    2.0T  1.9T   40G  98% /gnet/lustre/oss02/mpath7
>>
>> Here is how the data seems to be distributed on one of the OSSes:
>> --
>> /dev/mapper/mpath5    2.0T  1.2T  688G  65% /gnet/lustre/oss02/mpath5
>> /dev/mapper/mpath6    2.0T  1.7T  224G  89% /gnet/lustre/oss02/mpath6
>> /dev/mapper/mpath7    2.0T  1.9T   41G  98% /gnet/lustre/oss02/mpath7
>> /dev/mapper/mpath8    2.0T  1.3T  671G  65% /gnet/lustre/oss02/mpath8
>> /dev/mapper/mpath9    2.0T  1.3T  634G  67% /gnet/lustre/oss02/mpath9
>> --
>>
>> -J
>>
>>
>> On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>
>>> I did deactivate this OST on the MDS server.  So how would I deal with an
>>> OST filling up?  The OSTs don't seem to be filling up evenly either.  How
>>> does Lustre handle an OST that is at 100%?  Would it not use this specific
>>> OST for writes if there are other OSTs available with capacity?
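>>>
>>> (For reference, a sketch of the deactivate/activate cycle on the MDS; the
>>> device number is whatever "lctl dl" reports for the OSC that points at the
>>> full OST, e.g. reshpcfs-OST0007 here:)
>>> --
>>> # on the MDS: find the osc device for the full OST
>>> lctl dl | grep reshpcfs-OST0007
>>> # stop new object allocation on it; existing objects stay readable
>>> lctl --device <devno> deactivate
>>> # once space has been freed up, put the OST back into rotation
>>> lctl --device <devno> activate
>>> --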
>>>
>>> Thanks,
>>> -J
>>>
>>>
>>> On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger <adilger at whamcloud.com> wrote:
>>>
>>>> On 2011-02-15, at 12:20, Cliff White wrote:
>>>> > Client situation depends on where you deactivated the OST - if you
>>>> deactivate on the MDS only, clients should be able to read.
>>>> >
>>>> > What is best to do when an OST fills up really depends on what else
>>>> you are doing at the time, how much control you have over what the
>>>> clients are doing, and other things.  If you can solve the space issue with
>>>> a quick rm -rf, it is best to leave it online; likewise, if all your clients
>>>> are banging on it and failing, it is best to turn things off.  YMMV.
>>>>
>>>> In theory, with 1.8 the full OST should be skipped for new object
>>>> allocations, but this is not robust in the face of e.g. a single very large
>>>> file being written to the OST that takes it from "average" usage to being
>>>> full.
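>>>>
>>>> (If it helps, the space-weighted allocator's tunables can be checked on
>>>> the MDS; the parameter names below assume the 1.8 lov proc interface, so
>>>> treat this as a sketch:)
>>>> --
>>>> lctl get_param lov.*.qos_prio_free lov.*.qos_threshold_rr
>>>> --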
>>>>
>>>> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma <jagga13 at gmail.com>
>>>> wrote:
>>>> > Hi Guys,
>>>> >
>>>> > One of my clients got a hung lustre mount this morning and I saw the
>>>> following errors in my logs:
>>>> >
>>>> > --
>>>> > ..snip..
>>>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred
>>>> while communicating with 10.0.250.47 at o2ib3. The ost_write operation
>>>> failed with -28
>>>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836
>>>> previous similar messages
>>>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred
>>>> while communicating with 10.0.250.47 at o2ib3. The ost_write operation
>>>> failed with -28
>>>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141
>>>> previous similar messages
>>>> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
>>>> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>>>> x1360125198261945 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID
>>>> 10.0.250.47 at o2ib3 1344s ago has timed out (1344s prior to deadline).
>>>> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
>>>> reshpcfs-OST0005-osc-ffff8830175c8400: Connection to service
>>>> reshpcfs-OST0005 via nid 10.0.250.47 at o2ib3 was lost; in progress
>>>> operations using this service will wait for recovery to complete.
>>>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred
>>>> while communicating with 10.0.250.47 at o2ib3. The ost_connect operation
>>>> failed with -16
>>>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779
>>>> previous similar messages
>>>> > Feb 15 10:16:55 reshpc116 kernel: Lustre:
>>>> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>>>> x1360125198261947 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID
>>>> 10.0.250.47 at o2ib3 1344s ago has timed out (1344s prior to deadline).
>>>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred
>>>> while communicating with 10.0.250.47 at o2ib3. The ost_connect operation
>>>> failed with -16
>>>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous
>>>> similar messages
>>>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred
>>>> while communicating with 10.0.250.47 at o2ib3. The ost_connect operation
>>>> failed with -16
>>>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous
>>>> similar messages
>>>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred
>>>> while communicating with 10.0.250.47 at o2ib3. The ost_connect operation
>>>> failed with -16
>>>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous
>>>> similar messages
>>>> > Feb 15 10:31:43 reshpc116 kernel: Lustre:
>>>> reshpcfs-OST0005-osc-ffff8830175c8400: Connection restored to service
>>>> reshpcfs-OST0005 using nid 10.0.250.47 at o2ib3.
>>>> > --
>>>> >
>>>> > Due to disk space issues on my Lustre filesystem, one of the OSTs was
>>>> full and I deactivated that OST this morning.  I thought that operation
>>>> just puts it in a read-only state and that clients can still access the
>>>> data on that OST.  After activating this OST again, the client reconnected
>>>> and was okay.  How else would you deal with an OST that is close to 100%
>>>> full?  Is it okay to leave the OST active, and will the clients know not
>>>> to write data to that OST?
>>>> >
>>>> > Thanks,
>>>> > -J
>>>> >
>>>>
>>>>
>>>> Cheers, Andreas
>>>> --
>>>> Andreas Dilger
>>>> Principal Engineer
>>>> Whamcloud, Inc.
>>>>
>>>>
>>>>
>>>>
>>>
>>
>