[Lustre-discuss] Lustre client error
Jagga Soorma
jagga13 at gmail.com
Wed Feb 16 09:39:02 PST 2011
Another thing that I just noticed is that after deactivating an OST on the
MDS, I am no longer able to check quotas for users. Here is the
message I receive:
--
Disk quotas for user testuser (uid 17229):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
        /lustre     [0]     [0]     [0]             [0]     [0]     [0]
Some errors happened when getting quota info. Some devices may be not
working or deactivated. The data in "[]" is inaccurate.
--
Is this normal and expected? Or am I missing something here?
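The bracketed values indicate that lfs could not collect usage from every
target, which is expected while an OST is deactivated. A minimal sketch of the
queries involved (the OST UUID below is an assumption for illustration; list
yours with "lfs df"):

```shell
# Aggregate quota for a user; deactivated targets cannot report their
# usage, so the totals are printed in brackets as inaccurate:
lfs quota -u testuser /lustre

# Query a single target directly to see which device is unreachable
# (the UUID here is illustrative):
lfs quota -o reshpcfs-OST0007_UUID -u testuser /lustre
```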
Thanks for all your support. It is much appreciated.
Regards,
-J
On Tue, Feb 15, 2011 at 4:25 PM, Cliff White <cliffw at whamcloud.com> wrote:
> you can use lfs find or lfs getstripe to identify where files are.
> If you move the files out and move them back, the QOS policy should
> re-distribute them evenly, but it very much depends. If you have clients
> using a stripe count of 1,
> a single large file can fill up one OST.
> df on the client reports space for the entire filesystem, df on the OSS
> reports space for the targets
> attached to that server, so yes the results will be different.
> cliffw
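Cliff's suggestion can be sketched roughly as follows (the OST UUID and paths
are illustrative, not from this thread):

```shell
# List files that have at least one object on the full OST:
lfs find /lustre --obd reshpcfs-OST0007_UUID

# Inspect the striping of a suspect file -- shows which OSTs hold
# its objects, so a stripe-count-1 file on the full OST stands out:
lfs getstripe /lustre/some/large/file
```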
>
>
> On Tue, Feb 15, 2011 at 4:09 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>
>> This OST is at 100% now with only 12GB remaining, and something is actively
>> writing to this volume. What would be the appropriate thing to do in this
>> scenario? If I set this OST to read-only on the MDS then some of my clients
>> start hanging.
>>
>> Should I be running "lfs find -O OST_UID /lustre" and then moving the files
>> out of this filesystem and re-adding them? But then there is no guarantee
>> that they will not be written to this specific OST.
>>
>> Any help would be greatly appreciated.
>>
>> Thanks,
>> -J
>>
>>
>> On Tue, Feb 15, 2011 at 3:05 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>
>>> I might be looking at the wrong OST. What is the best way to map the
>>> actual /dev/mapper/mpath[X] to what OST ID is used for that volume?
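A sketch, assuming ldiskfs-backed targets: the filesystem label on the block
device carries the OST name, which encodes the index:

```shell
# Either of these, run on the OSS, prints a label like "reshpcfs-OST0007",
# whose hex suffix is the OST index shown by "lfs df":
e2label /dev/mapper/mpath7
tunefs.lustre --dryrun /dev/mapper/mpath7 2>&1 | grep -i 'Target:'
```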
>>>
>>> Thanks,
>>> -J
>>>
>>>
>>> On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>
>>>> Also, it looks like the client is reporting a different %used compared
>>>> to the oss server itself:
>>>>
>>>> client:
>>>> reshpc101:~ # lfs df -h | grep -i 0007
>>>> reshpcfs-OST0007_UUID 2.0T 1.7T 202.7G 84%
>>>> /reshpcfs[OST:7]
>>>>
>>>> oss:
>>>> /dev/mapper/mpath7 2.0T 1.9T 40G 98% /gnet/lustre/oss02/mpath7
>>>>
>>>> Here is how the data seems to be distributed on one of the OSSes:
>>>> --
>>>> /dev/mapper/mpath5 2.0T 1.2T 688G 65% /gnet/lustre/oss02/mpath5
>>>> /dev/mapper/mpath6 2.0T 1.7T 224G 89% /gnet/lustre/oss02/mpath6
>>>> /dev/mapper/mpath7 2.0T 1.9T 41G 98% /gnet/lustre/oss02/mpath7
>>>> /dev/mapper/mpath8 2.0T 1.3T 671G 65% /gnet/lustre/oss02/mpath8
>>>> /dev/mapper/mpath9 2.0T 1.3T 634G 67% /gnet/lustre/oss02/mpath9
>>>> --
>>>>
>>>> -J
>>>>
>>>>
>>>> On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>>>
>>>>> I did deactivate this OST on the MDS server. So how would I deal with
>>>>> an OST filling up? The OSTs don't seem to be filling up evenly either. How
>>>>> does Lustre handle an OST that is at 100%? Would it not use this specific
>>>>> OST for writes if there are other OSTs available with capacity?
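For reference, deactivating on the MDS only stops new object allocations on
that OST while reads continue to work; a sketch (the device index is a
placeholder to be read from the "lctl dl" output):

```shell
# On the MDS, find the osc device for the full OST and note the
# leading device index in the output:
lctl dl | grep OST0007
# Deactivate it so no new objects are placed there (reads still work):
lctl --device <index> deactivate
# Re-enable the target once space has been freed:
lctl --device <index> activate
```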
>>>>>
>>>>> Thanks,
>>>>> -J
>>>>>
>>>>>
>>>>> On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger <
>>>>> adilger at whamcloud.com> wrote:
>>>>>
>>>>>> On 2011-02-15, at 12:20, Cliff White wrote:
>>>>>> > Client situation depends on where you deactivated the OST - if you
>>>>>> deactivate on the MDS only, clients should be able to read.
>>>>>> >
>>>>>> > What is best to do when an OST fills up really depends on what else
>>>>>> you are doing at the time, and how much control you have over what the
>>>>>> clients are doing and other things. If you can solve the space issue with a
>>>>>> quick rm -rf, best to leave it online, likewise if all your clients are
>>>>>> trying to bang on it and failing, best to turn things off. YMMV
>>>>>>
>>>>>> In theory, with 1.8 the full OST should be skipped for new object
>>>>>> allocations, but this is not robust in the face of e.g. a single very large
>>>>>> file being written to the OST that takes it from "average" usage to being
>>>>>> full.
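The allocator behaviour Andreas describes is tunable; on 1.8-era releases the
knobs live under the lov on the MDS (parameter names as of 1.8, worth
verifying against your installed version):

```shell
# Weight given to free space vs. plain round-robin when choosing OSTs:
lctl get_param lov.*.qos_prio_free
# How imbalanced OST usage must be before QOS allocation replaces
# round-robin placement:
lctl get_param lov.*.qos_threshold_rr
```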
>>>>>>
>>>>>> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma <jagga13 at gmail.com>
>>>>>> wrote:
>>>>>> > Hi Guys,
>>>>>> >
>>>>>> > One of my clients got a hung lustre mount this morning and I saw the
>>>>>> following errors in my logs:
>>>>>> >
>>>>>> > --
>>>>>> > ..snip..
>>>>>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error
>>>>>> occurred while communicating with 10.0.250.47@o2ib3. The ost_write
>>>>>> operation failed with -28
>>>>>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836
>>>>>> previous similar messages
>>>>>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error
>>>>>> occurred while communicating with 10.0.250.47@o2ib3. The ost_write
>>>>>> operation failed with -28
>>>>>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141
>>>>>> previous similar messages
>>>>>> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
>>>>>> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>>>>>> x1360125198261945 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID
>>>>>> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>>>>>> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
>>>>>> reshpcfs-OST0005-osc-ffff8830175c8400: Connection to service
>>>>>> reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress
>>>>>> operations using this service will wait for recovery to complete.
>>>>>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error
>>>>>> occurred while communicating with 10.0.250.47@o2ib3. The ost_connect
>>>>>> operation failed with -16
>>>>>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779
>>>>>> previous similar messages
>>>>>> > Feb 15 10:16:55 reshpc116 kernel: Lustre:
>>>>>> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>>>>>> x1360125198261947 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID
>>>>>> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>>>>>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error
>>>>>> occurred while communicating with 10.0.250.47@o2ib3. The ost_connect
>>>>>> operation failed with -16
>>>>>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous
>>>>>> similar messages
>>>>>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error
>>>>>> occurred while communicating with 10.0.250.47@o2ib3. The ost_connect
>>>>>> operation failed with -16
>>>>>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous
>>>>>> similar messages
>>>>>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error
>>>>>> occurred while communicating with 10.0.250.47@o2ib3. The ost_connect
>>>>>> operation failed with -16
>>>>>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous
>>>>>> similar messages
>>>>>> > Feb 15 10:31:43 reshpc116 kernel: Lustre:
>>>>>> reshpcfs-OST0005-osc-ffff8830175c8400: Connection restored to service
>>>>>> reshpcfs-OST0005 using nid 10.0.250.47@o2ib3.
>>>>>> > --
>>>>>> >
>>>>>> > Due to disk space issues on my Lustre filesystem, one of the OSTs
>>>>>> was full and I deactivated that OST this morning. I thought that operation
>>>>>> just puts it in a read-only state and that clients can still access the data
>>>>>> from that OST. After activating this OST again, the client reconnected and
>>>>>> was okay. How else would you deal with an OST that is close to 100% full?
>>>>>> Is it okay to leave the OST active, and will the clients know not to write
>>>>>> data to that OST?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > -J
>>>>>> >
>>>>>> > _______________________________________________
>>>>>> > Lustre-discuss mailing list
>>>>>> > Lustre-discuss at lists.lustre.org
>>>>>> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>>>> >
>>>>>> >
>>>>>> > _______________________________________________
>>>>>> > Lustre-discuss mailing list
>>>>>> > Lustre-discuss at lists.lustre.org
>>>>>> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>>>>
>>>>>>
>>>>>> Cheers, Andreas
>>>>>> --
>>>>>> Andreas Dilger
>>>>>> Principal Engineer
>>>>>> Whamcloud, Inc.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>