[Lustre-discuss] What's the human translation for: ost_write operation failed with -28
tguthmann at iseek.com.au
Sun Dec 4 23:30:13 PST 2011
Over the weekend, on the client side, we started to see a lot of:
LustreError: 11-0: an error occurred while communicating with
192.168.1.32@tcp. The ost_write operation failed with -28
LustreError: Skipped 23528 previous similar messages
What do these messages mean or imply? I guess -28 is a specific error code?
The host 192.168.1.32 is up and serves other Lustre filesystems which
don't have this problem (fingers crossed :). Once this error happened, I
couldn't write to the filesystem anymore despite 250GB free (lfs df -h).
Unmounting and remounting the Lustre filesystem fixed the issue, but then
the error came back later. As usual, everything went fine for 2 years
until today ;)
Any ideas or leads I can follow to investigate this issue further?
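On the question of what -28 means: Lustre error messages generally carry negated Linux errno values, so the number can be looked up in the standard errno table. The snippet below is a minimal sketch of that lookup, assuming standard Linux errno numbering; the helper name describe_lustre_rc is made up for illustration and is not part of any Lustre tool.

```python
import errno
import os


def describe_lustre_rc(rc: int) -> str:
    """Translate a negative return code, as printed in a LustreError line,
    into its symbolic errno name and human-readable description.

    Hypothetical helper for illustration; assumes rc is a negated errno.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{rc}: {name} ({os.strerror(code)})"


# On Linux this prints: -28: ENOSPC (No space left on device)
print(describe_lustre_rc(-28))
```

ENOSPC while the filesystem as a whole still shows free space may point to a single OST being full or out of inodes rather than the whole filesystem. Comparing the per-OST rows of `lfs df -h` and `lfs df -i` is one way to check that.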
BTW, on the server side we 'only' have the usual messages we had before
the disaster, where xxxx is the Lustre filesystem having the above issue:
Lustre: Skipped 2 previous similar messages
Lustre: xxxx-OST0004: slow direct_io 73s due to heavy IO load
Lustre: xxxx-OST0004: slow journal start 72s due to heavy IO load
Lustre: xxxx-OST0004: slow commitrw commit 72s due to heavy IO load
Lustre: xxxx-OST0003: slow journal start 146s due to heavy IO load
Lustre: xxxx-OST0003: slow brw_start 163s due to heavy IO load
Lustre: Skipped 1 previous similar message
Lustre: xxxx-OST0003: slow journal start 164s due to heavy IO load
Lustre: xxxx-OST0003: slow commitrw commit 164s due to heavy IO load
Lustre: xxxx-OST0003: slow direct_io 164s due to heavy IO load
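To see at a glance which OSTs are stalling the longest, the "slow ... due to heavy IO load" lines above can be summarised with a short script. This is only a sketch: the regular expression is inferred from the log lines quoted in this message (not from any Lustre specification), and the function name worst_delays is hypothetical.

```python
import re

# Pattern inferred from the messages quoted above; an assumption, not an official format.
SLOW_RE = re.compile(
    r"Lustre: (\S+-OST[0-9a-fA-F]+): slow (.+?) (\d+)s due to heavy IO load"
)

LOG = """\
Lustre: xxxx-OST0004: slow direct_io 73s due to heavy IO load
Lustre: xxxx-OST0004: slow journal start 72s due to heavy IO load
Lustre: xxxx-OST0004: slow commitrw commit 72s due to heavy IO load
Lustre: xxxx-OST0003: slow journal start 146s due to heavy IO load
Lustre: xxxx-OST0003: slow brw_start 163s due to heavy IO load
Lustre: xxxx-OST0003: slow journal start 164s due to heavy IO load
Lustre: xxxx-OST0003: slow commitrw commit 164s due to heavy IO load
Lustre: xxxx-OST0003: slow direct_io 164s due to heavy IO load
"""


def worst_delays(text: str) -> dict:
    """Return the longest reported stall, in seconds, for each OST in the log text."""
    worst = {}
    for match in SLOW_RE.finditer(text):
        ost, seconds = match.group(1), int(match.group(3))
        worst[ost] = max(worst.get(ost, 0), seconds)
    return worst


for ost, seconds in sorted(worst_delays(LOG).items()):
    print(f"{ost}: worst stall {seconds}s")
```

For the excerpt above this reports 164s for xxxx-OST0003 and 73s for xxxx-OST0004.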
centos5# rpm -qa |grep lustre