[lustre-discuss] "Not on preferred path" error

Tue Sep 20 10:39:14 PDT 2016

Thanks very much for the suggestions. dmesg output is here:
http://pastebin.com/jCafCZiZ
We don't see any disk-related stuff there, and also our GUI shows all the RAID 
arrays as being fine.

If anything in there jumps out at you, I'd really appreciate your thoughts! We 
are almost certainly going to reboot the affected OSS later today to see how 
that goes.

We're a fairly small team (12 people or so), so I have a good feel for what 
everyone is doing, and they should not be abusing it too badly... We did 
recently ask people to delete any small files they may have. Do you think 
deleting a lot of small files could trigger such issues? Thanks again!

-lewis

On 9/20/16 12:29 PM, Joe Landman wrote:
> On 09/20/2016 12:21 PM, Lewis Hyatt wrote:
>
>> We do not know if it's related, but this same OSS is in a very bad
>> state, with very high load average (200), very high I/O wait time, and
>> taking many seconds to respond to each read request, making the array
>> more or less unusable. That's the problem we are trying to fix.
>
> This sounds like a storage system failure.  Queuing up of IOs to drive the load
> to 200 usually means something is broken elsewhere in the stack at a lower
> level.  Not always ... sometimes you have users who like to write several
> million/billion small ( < 100 byte ) files.
>
> What does dmesg report?  Try to do a pastebin/gist of it, and point it to the
> list.
>
> Things that come to mind are
>
> a) offlined RAID (most likely):  This would explain the user load, and all
> sorts of strange messages about block devices and file systems in the logs
>
> b) A user DoS against the storage: usually someone writing many tiny files.
>
> There are other possibilities, but these seem more likely.
>
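
For possibility (a) in the quoted list, one rough way to double-check a saved
dmesg dump is to filter it for lines that commonly accompany block-device
trouble. This is only a sketch: the function name and the pattern list are
illustrative assumptions, not an official or complete set of kernel messages,
and an empty result does not prove the storage is healthy.

```python
import re

# Illustrative patterns only (an assumption, not an exhaustive list).
# Adjust them for the kernel, RAID controller, and drivers actually in use.
SUSPECT_PATTERNS = [
    r"I/O error",
    r"offline",
    r"medium error",
    r"blocked for more than \d+ seconds",
]


def find_suspect_lines(log_text, patterns=SUSPECT_PATTERNS):
    """Return the lines of log_text that match any pattern, ignoring case."""
    combined = re.compile(
        "|".join(f"(?:{p})" for p in patterns), re.IGNORECASE
    )
    return [line for line in log_text.splitlines() if combined.search(line)]
```

To use it, save the dmesg output to a file, read the file into a string, and
pass the string to find_suspect_lines(); whatever comes back is just a
shortlist of lines worth reading by eye.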