[Lustre-discuss] Lustre client error

Jagga Soorma jagga13 at gmail.com
Wed Feb 16 09:39:02 PST 2011


Another thing that I just noticed is that after deactivating an OST on the
MDS, I am no longer able to check the quotas for users.  Here is the
message I receive:

--
Disk quotas for user testuser (uid 17229):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
      /lustre     [0]     [0]     [0]             [0]     [0]     [0]

Some errors happened when getting quota info. Some devices may be not
working or deactivated. The data in "[]" is inaccurate.
--

Is this normal and expected?  Or am I missing something here?
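
The report above comes from something like the following (a rough sketch; the
lctl listing assumes it is run on the MDS):

--
# Per-user quota report on the client; bracketed values mean some OSTs
# could not be queried, e.g. because they are deactivated:
lfs quota -u testuser /lustre

# On the MDS, list configured devices and their state; the OSC for a
# deactivated OST is typically no longer shown as UP:
lctl dl
--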

Thanks for all your support.  It is much appreciated.

Regards,
-J

On Tue, Feb 15, 2011 at 4:25 PM, Cliff White <cliffw at whamcloud.com> wrote:

> You can use lfs find or lfs getstripe to identify where the files are.
> If you move the files out and move them back, the QOS policy should
> re-distribute them evenly, but it very much depends. If you have clients
> using a stripe count of 1, a single large file can fill up one OST.
> df on the client reports space for the entire filesystem, while df on the
> OSS reports space only for the targets attached to that server, so yes,
> the results will be different.
> cliffw
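
A minimal sketch of those two commands, using this thread's filesystem name
and OST index 7 purely as placeholders:

--
# List files that have at least one object on OST index 7:
lfs find /reshpcfs --obd reshpcfs-OST0007_UUID

# Show which OSTs an individual file is striped over:
lfs getstripe /reshpcfs/path/to/some/file
--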
>
>
> On Tue, Feb 15, 2011 at 4:09 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>
>> This OST is now at 100% with only 12GB remaining, and something is actively
>> writing to this volume.  What would be the appropriate thing to do in this
>> scenario?  If I set this to read-only on the MDS then some of my clients
>> start hanging.
>>
>> Should I be running "lfs find -O OST_UUID /lustre" and then moving the files
>> out of this filesystem and adding them back?  But then there is no guarantee
>> that they will not be written to this specific OST.
>>
>> Any help would be greatly appreciated.
>>
>> Thanks,
>> -J
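
One rough way to drain a full OST (a sketch only; it assumes the affected
files are not being actively written, and the temporary suffix is arbitrary)
is to copy each file so its objects are re-allocated under the current QOS
policy, then rename the copy over the original.  Newer releases also ship an
lfs_migrate helper script that automates roughly this, if your version has it:

--
# Find files with objects on the full OST:
lfs find /reshpcfs --obd reshpcfs-OST0007_UUID > /tmp/on-ost7.list

# Re-copy each one; the new copy gets fresh objects, which should land on
# the emptier OSTs, then swap it into place:
while read f; do
    cp -p "$f" "$f.migrate.tmp" && mv "$f.migrate.tmp" "$f"
done < /tmp/on-ost7.list
--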
>>
>>
>> On Tue, Feb 15, 2011 at 3:05 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>
>>> I might be looking at the wrong OST.  What is the best way to map an
>>> actual /dev/mapper/mpath[X] device to the OST ID that is used for that volume?
>>>
>>> Thanks,
>>> -J
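
A couple of ways to read the OST name straight off the block device on the
OSS (a sketch; mpath7 is just the example device from this thread):

--
# The ldiskfs label on the target carries the OST name, e.g. reshpcfs-OST0007:
e2label /dev/mapper/mpath7

# tunefs.lustre can print the same information without changing anything:
tunefs.lustre --dryrun /dev/mapper/mpath7 | grep -i target

# On the OSS, lctl dl lists the local obdfilter (OST) devices by name:
lctl dl | grep obdfilter
--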
>>>
>>>
>>> On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>
>>>> Also, it looks like the client is reporting a different %used compared
>>>> to the OSS itself:
>>>>
>>>> client:
>>>> reshpc101:~ # lfs df -h | grep -i 0007
>>>> reshpcfs-OST0007_UUID      2.0T      1.7T    202.7G   84%
>>>> /reshpcfs[OST:7]
>>>>
>>>> oss:
>>>> /dev/mapper/mpath7    2.0T  1.9T   40G  98% /gnet/lustre/oss02/mpath7
>>>>
>>>> Here is how the data seems to be distributed on one of the OSS nodes:
>>>> --
>>>> /dev/mapper/mpath5    2.0T  1.2T  688G  65% /gnet/lustre/oss02/mpath5
>>>> /dev/mapper/mpath6    2.0T  1.7T  224G  89% /gnet/lustre/oss02/mpath6
>>>> /dev/mapper/mpath7    2.0T  1.9T   41G  98% /gnet/lustre/oss02/mpath7
>>>> /dev/mapper/mpath8    2.0T  1.3T  671G  65% /gnet/lustre/oss02/mpath8
>>>> /dev/mapper/mpath9    2.0T  1.3T  634G  67% /gnet/lustre/oss02/mpath9
>>>> --
>>>>
>>>> -J
>>>>
>>>>
>>>> On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>
>>>>> I did deactivate this OST on the MDS server.  So how would I deal with
>>>>> an OST filling up?  The OSTs don't seem to be filling up evenly either.  How
>>>>> does Lustre handle an OST that is at 100%?  Would it not use this specific
>>>>> OST for writes if there are other OSTs available with capacity?
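
For reference, taking an OST out of (and back into) new-object allocation on
the MDS looks roughly like this (a sketch; <devno> stands for whatever device
number lctl dl reports for the OSC pointing at the full OST):

--
# On the MDS, find the device number of the OSC for the full OST:
lctl dl | grep OST0007

# Stop allocating new objects there; existing data stays readable:
lctl --device <devno> deactivate

# Re-enable it once space has been freed:
lctl --device <devno> activate
--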
>>>>>
>>>>> Thanks,
>>>>> -J
>>>>>
>>>>>
>>>>> On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger <
>>>>> adilger at whamcloud.com> wrote:
>>>>>
>>>>>> On 2011-02-15, at 12:20, Cliff White wrote:
>>>>>> > The client situation depends on where you deactivated the OST - if you
>>>>>> deactivated it on the MDS only, clients should still be able to read.
>>>>>> >
>>>>>> > What is best to do when an OST fills up really depends on what else
>>>>>> you are doing at the time, and how much control you have over what the
>>>>>> clients are doing, among other things.  If you can solve the space issue with
>>>>>> a quick rm -rf, it is best to leave it online; likewise, if all your clients
>>>>>> are trying to bang on it and failing, it is best to turn things off. YMMV
>>>>>>
>>>>>> In theory, with 1.8 the full OST should be skipped for new object
>>>>>> allocations, but this is not robust in the face of e.g. a single very large
>>>>>> file being written to the OST that takes it from "average" usage to being
>>>>>> full.
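
A rough way to keep an eye on this is to watch per-OST usage from any client;
on 1.8 the allocator's preference for free space is also tunable, though the
exact /proc path below is an assumption and may differ between releases:

--
# Per-OST usage as the clients see it:
lfs df -h /reshpcfs

# On the MDS, the QOS allocator's free-space weighting (1.8-era tunable;
# path may vary):
cat /proc/fs/lustre/lov/*-mdtlov/qos_prio_free
--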
>>>>>>
>>>>>> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma <jagga13 at gmail.com>
>>>>>> wrote:
>>>>>> > Hi Guys,
>>>>>> >
>>>>>> > One of my clients got a hung Lustre mount this morning, and I saw the
>>>>>> following errors in my logs:
>>>>>> >
>>>>>> > --
>>>>>> > ..snip..
>>>>>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error
>>>>>> occurred while communicating with 10.0.250.47 at o2ib3. The ost_write
>>>>>> operation failed with -28
>>>>>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836
>>>>>> previous similar messages
>>>>>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error
>>>>>> occurred while communicating with 10.0.250.47 at o2ib3. The ost_write
>>>>>> operation failed with -28
>>>>>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141
>>>>>> previous similar messages
>>>>>> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
>>>>>> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>>>>>> x1360125198261945 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID
>>>>>> 10.0.250.47 at o2ib3 1344s ago has timed out (1344s prior to deadline).
>>>>>> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
>>>>>> reshpcfs-OST0005-osc-ffff8830175c8400: Connection to service
>>>>>> reshpcfs-OST0005 via nid 10.0.250.47 at o2ib3 was lost; in progress
>>>>>> operations using this service will wait for recovery to complete.
>>>>>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error
>>>>>> occurred while communicating with 10.0.250.47 at o2ib3. The ost_connect
>>>>>> operation failed with -16
>>>>>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779
>>>>>> previous similar messages
>>>>>> > Feb 15 10:16:55 reshpc116 kernel: Lustre:
>>>>>> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>>>>>> x1360125198261947 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID
>>>>>> 10.0.250.47 at o2ib3 1344s ago has timed out (1344s prior to deadline).
>>>>>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error
>>>>>> occurred while communicating with 10.0.250.47 at o2ib3. The ost_connect
>>>>>> operation failed with -16
>>>>>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous
>>>>>> similar messages
>>>>>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error
>>>>>> occurred while communicating with 10.0.250.47 at o2ib3. The ost_connect
>>>>>> operation failed with -16
>>>>>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous
>>>>>> similar messages
>>>>>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error
>>>>>> occurred while communicating with 10.0.250.47 at o2ib3. The ost_connect
>>>>>> operation failed with -16
>>>>>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous
>>>>>> similar messages
>>>>>> > Feb 15 10:31:43 reshpc116 kernel: Lustre:
>>>>>> reshpcfs-OST0005-osc-ffff8830175c8400: Connection restored to service
>>>>>> reshpcfs-OST0005 using nid 10.0.250.47 at o2ib3.
>>>>>> > --
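
The return codes in those messages are ordinary errno values: -28 is ENOSPC
("No space left on device") and -16 is EBUSY, which fits a full OST.  A quick
way to decode them, assuming a Python 2 interpreter is available on the node:

--
python -c 'import os; print os.strerror(28); print os.strerror(16)'
# No space left on device
# Device or resource busy
--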
>>>>>> >
>>>>>> > Due to disk space issues on my Lustre filesystem, one of the OSTs
>>>>>> was full and I deactivated that OST this morning.  I thought that operation
>>>>>> just puts it in a read-only state and that clients can still access the data
>>>>>> on that OST.  After reactivating this OST, the client reconnected and was
>>>>>> okay.  How else would you deal with an OST that is close to 100% full?
>>>>>> Is it okay to leave the OST active, and will the clients know not to write
>>>>>> data to that OST?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > -J
>>>>>> >
>>>>>> > _______________________________________________
>>>>>> > Lustre-discuss mailing list
>>>>>> > Lustre-discuss at lists.lustre.org
>>>>>> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>>>> >
>>>>>>
>>>>>>
>>>>>> Cheers, Andreas
>>>>>> --
>>>>>> Andreas Dilger
>>>>>> Principal Engineer
>>>>>> Whamcloud, Inc.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>