[Lustre-discuss] Lustre client error

Jagga Soorma jagga13 at gmail.com
Tue Feb 15 14:37:50 PST 2011


I did deactivate this OST on the MDS server.  So how would I deal with an OST
filling up?  The OSTs don't seem to be filling up evenly either.  How does
Lustre handle an OST that is at 100%?  Would it not skip this specific OST for
writes if there are other OSTs available with capacity?
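
For reference, the numeric codes in the client log quoted below are negative
Linux errno values: -28 is ENOSPC (no space left on device) and -16 is EBUSY
(device or resource busy).  A minimal Python sketch for decoding them follows.
The regular expression and the decode() helper are illustrative assumptions,
not part of any Lustre tool, and the sample strings are only the tails of the
logged messages.

```python
import errno
import os
import re

# Message tails taken from the client log quoted in this thread.
SAMPLES = [
    "The ost_write operation failed with -28",
    "The ost_connect operation failed with -16",
]

# Hypothetical pattern: captures the operation name and the positive errno.
PATTERN = re.compile(r"The (\w+) operation failed with -(\d+)")


def decode(line):
    """Return (operation, errno name, description), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(2))
    name = errno.errorcode.get(code, "UNKNOWN")
    return match.group(1), name, os.strerror(code)


if __name__ == "__main__":
    for sample in SAMPLES:
        print(decode(sample))
```

Read this way, the ost_write failures are the full OST rejecting writes, and
the later ost_connect failures are the client being refused while it tries to
reconnect.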

Thanks,
-J

On Tue, Feb 15, 2011 at 11:45 AM, Andreas Dilger <adilger at whamcloud.com> wrote:

> On 2011-02-15, at 12:20, Cliff White wrote:
> > Client situation depends on where you deactivated the OST - if you
> deactivate on the MDS only, clients should be able to read.
> >
> > What is best to do when an OST fills up really depends on what else you
> are doing at the time, and how much control you have over what the clients
> are doing and other things.  If you can solve the space issue with a quick
> rm -rf, best to leave it online, likewise if all your clients are trying to
> bang on it and failing, best to turn things off. YMMV
>
> In theory, with 1.8 the full OST should be skipped for new object
> allocations, but this is not robust in the face of e.g. a single very large
> file being written to the OST that takes it from "average" usage to being
> full.
>
> > On Tue, Feb 15, 2011 at 10:57 AM, Jagga Soorma <jagga13 at gmail.com>
> wrote:
> > Hi Guys,
> >
> > One of my clients got a hung lustre mount this morning and I saw the
> following errors in my logs:
> >
> > --
> > ..snip..
> > Feb 15 09:38:07 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_write operation failed
> with -28
> > Feb 15 09:38:07 reshpc116 kernel: LustreError: Skipped 4755836 previous
> similar messages
> > Feb 15 09:48:07 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_write operation failed
> with -28
> > Feb 15 09:48:07 reshpc116 kernel: LustreError: Skipped 4649141 previous
> similar messages
> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1360125198261945 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID
> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
> > Feb 15 10:16:54 reshpc116 kernel: Lustre:
> reshpcfs-OST0005-osc-ffff8830175c8400: Connection to service
> reshpcfs-OST0005 via nid 10.0.250.47@o2ib3 was lost; in progress
> operations using this service will wait for recovery to complete.
> > Feb 15 10:16:54 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
> failed with -16
> > Feb 15 10:16:54 reshpc116 kernel: LustreError: Skipped 2888779 previous
> similar messages
> > Feb 15 10:16:55 reshpc116 kernel: Lustre:
> 6254:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1360125198261947 sent from reshpcfs-OST0005-osc-ffff8830175c8400 to NID
> 10.0.250.47@o2ib3 1344s ago has timed out (1344s prior to deadline).
> > Feb 15 10:18:11 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
> failed with -16
> > Feb 15 10:18:11 reshpc116 kernel: LustreError: Skipped 10 previous
> similar messages
> > Feb 15 10:20:45 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
> failed with -16
> > Feb 15 10:20:45 reshpc116 kernel: LustreError: Skipped 21 previous
> similar messages
> > Feb 15 10:25:46 reshpc116 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.0.250.47@o2ib3. The ost_connect operation
> failed with -16
> > Feb 15 10:25:46 reshpc116 kernel: LustreError: Skipped 42 previous
> similar messages
> > Feb 15 10:31:43 reshpc116 kernel: Lustre:
> reshpcfs-OST0005-osc-ffff8830175c8400: Connection restored to service
> reshpcfs-OST0005 using nid 10.0.250.47@o2ib3.
> > --
> >
> > Due to disk space issues on my Lustre filesystem, one of the OSTs was
> full and I deactivated that OST this morning.  I thought that operation just
> puts it in a read-only state and that clients can still access the data from
> that OST.  After activating this OST again the client reconnected and
> was okay.  How else would you deal with an OST that is close to
> 100% full?  Is it okay to leave the OST active, and will the clients know not
> to write data to that OST?
> >
> > Thanks,
> > -J
> >
> > _______________________________________________
> > Lustre-discuss mailing list
> > Lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
> >
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Engineer
> Whamcloud, Inc.
>
>
>
>