[Lustre-discuss] Lustre client error

Tue Feb 15 15:05:21 PST 2011

I might be looking at the wrong OST.  What is the best way to map an actual
/dev/mapper/mpath[X] device to the OST ID used for that volume?

Thanks,
-J

On Tue, Feb 15, 2011 at 3:01 PM, Jagga Soorma <jagga13 at gmail.com> wrote:

> Also, it looks like the client is reporting a different %used than the
> OSS server itself:
>
> client:
> reshpc101:~ # lfs df -h | grep -i 0007
> reshpcfs-OST0007_UUID      2.0T      1.7T    202.7G   84% /reshpcfs[OST:7]
>
> oss:
> /dev/mapper/mpath7    2.0T  1.9T   40G  98% /gnet/lustre/oss02/mpath7
>
> Here is how the data seems to be distributed on one of the OSSs:
> --
> /dev/mapper/mpath5    2.0T  1.2T  688G  65% /gnet/lustre/oss02/mpath5
> /dev/mapper/mpath6    2.0T  1.7T  224G  89% /gnet/lustre/oss02/mpath6
> /dev/mapper/mpath7    2.0T  1.9T   41G  98% /gnet/lustre/oss02/mpath7
> /dev/mapper/mpath8    2.0T  1.3T  671G  65% /gnet/lustre/oss02/mpath8
> /dev/mapper/mpath9    2.0T  1.3T  634G  67% /gnet/lustre/oss02/mpath9
> --
>
> -J
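
The uneven fill in the listing above can be checked with a few lines of Python. This is a minimal sketch, assuming the `df -h` lines are pasted in as text. The helper names `parse_df` and `nearly_full` and the 90% threshold are illustrative choices, not Lustre settings.

```python
# Illustrative only: parse `df -h`-style lines and flag nearly full volumes.
# The 90% threshold and the helper names are assumptions, not Lustre defaults.

DF_LINES = """\
/dev/mapper/mpath5    2.0T  1.2T  688G  65% /gnet/lustre/oss02/mpath5
/dev/mapper/mpath6    2.0T  1.7T  224G  89% /gnet/lustre/oss02/mpath6
/dev/mapper/mpath7    2.0T  1.9T   41G  98% /gnet/lustre/oss02/mpath7
/dev/mapper/mpath8    2.0T  1.3T  671G  65% /gnet/lustre/oss02/mpath8
/dev/mapper/mpath9    2.0T  1.3T  634G  67% /gnet/lustre/oss02/mpath9
"""


def parse_df(text):
    """Return a list of (device, percent_used) tuples from df-style output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Expect: device, size, used, avail, use%, mount point.
        if len(fields) != 6 or not fields[4].endswith("%"):
            continue  # skip headers or wrapped lines
        rows.append((fields[0], int(fields[4].rstrip("%"))))
    return rows


def nearly_full(rows, threshold=90):
    """Return the devices whose usage is at or above the threshold."""
    return [dev for dev, pct in rows if pct >= threshold]


rows = parse_df(DF_LINES)
full = nearly_full(rows)
spread = max(pct for _, pct in rows) - min(pct for _, pct in rows)
print("nearly full:", full)
print("spread between fullest and emptiest (percentage points):", spread)
```

With the figures quoted here, only mpath7 crosses the threshold, while the emptiest volumes on the same OSS sit at 65%.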
>
>
> On Tue, Feb 15, 2011 at 2:37 PM, Jagga Soorma <jagga13 at gmail.com> wrote:
>
>> I did deactivate this OST on the MDS server.  So how would I deal with an
>> OST filling up?  The OSTs don't seem to be filling up evenly either.  How
>> does Lustre handle an OST that is at 100%?  Would it not use this specific
>> OST for writes if there are other OSTs available with capacity?
>>
>> Thanks,
>> -J
>>
>>
>> On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger <adilger at whamcloud.com> wrote:
>>
>>> On 2011-02-15, at 12:20, Cliff White wrote:
>>> > Client situation depends on where you deactivated the OST - if you
>>> > deactivate on the MDS only, clients should be able to read.
>>> >
>>> > What is best to do when an OST fills up really depends on what else you
>>> > are doing at the time, and how much control you have over what the clients
>>> > are doing and other things.  If you can solve the space issue with a quick
>>> > rm -rf, best to leave it online, likewise if all your clients are trying to
>>> > bang on it and failing, best to turn things off. YMMV
>>>
>>> In theory, with 1.8 the full OST should be skipped for new object
>>> allocations, but this is not robust in the face of e.g. a single very large
>>> file being written to the OST that takes it from "average" usage to being
>>> full.
>>>
>>> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma <jagga13 at gmail.com> wrote:
>>> > Hi Guys,
>>> >
>>> > One of my clients got a hung lustre mount this morning and I saw the
>>> > following errors in my logs:
>>> >
>>> > --
>>> > ..snip..
>>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_write operation failed with -28
>>> > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 previous similar messages
>>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_write operation failed with -28
>>> > Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141 previous similar messages
>>> > Feb 15 10:16:54 reshpc116 kernel: Lustre: 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1360125198261945 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>>> > Feb 15 10:16:54 reshpc116 kernel: Lustre: reshpcfs-OST0005-osc-ffff8830175c8400: Connection to service reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress operations using this service will wait for recovery to complete.
>>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>>> > Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779 previous similar messages
>>> > Feb 15 10:16:55 reshpc116 kernel: Lustre: 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1360125198261947 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
>>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>>> > Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous similar messages
>>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>>> > Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous similar messages
>>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.47@o2ib3. The ost_connect operation failed with -16
>>> > Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous similar messages
>>> > Feb 15 10:31:43 reshpc116 kernel: Lustre: reshpcfs-OST0005-osc-ffff8830175c8400: Connection restored to service reshpcfs-OST0005 using nid 10.0.250.47@o2ib3.
>>> > --
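
The numeric codes at the end of the LustreError lines above (-28 and -16) are negated Linux errno values. A minimal sketch for decoding them, assuming it runs on a Linux host so the numbering matches the kernel's; `describe` is a hypothetical helper name, not part of any Lustre tool.

```python
import errno
import os


def describe(neg_code):
    """Map a negative kernel-style return code to (symbol, message)."""
    code = -neg_code
    # errno.errorcode maps a number to its symbolic name, e.g. 28 -> "ENOSPC".
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


# Codes taken from the log excerpt in this thread.
for rc in (-28, -16):
    name, message = describe(rc)
    print(rc, name, message)
```

On Linux, -28 decodes to ENOSPC (no space left on device), which is consistent with the full OST described in this thread, and -16 decodes to EBUSY.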
>>> >
>>> > Due to disk space issues on my Lustre filesystem, one of the OSTs was
>>> > full and I deactivated that OST this morning.  I thought that operation just
>>> > puts it in a read-only state and that clients can still access the data on
>>> > that OST.  After I activated this OST again, the client reconnected and
>>> > was okay.  How else would you deal with an OST that is close to
>>> > 100% full?  Is it okay to leave the OST active, and will the clients know not
>>> > to write data to that OST?
>>> >
>>> > Thanks,
>>> > -J
>>> >
>>> > _______________________________________________
>>> > Lustre-discuss mailing list
>>> > Lustre-discuss at lists.lustre.org
>>> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>> >
>>>
>>>
>>> Cheers, Andreas
>>> --
>>> Andreas Dilger
>>> Principal Engineer
>>> Whamcloud, Inc.
>>>
>>>
>>>
>>>
>>
>