[Lustre-discuss] What's the human translation for: ost_write operation failed with -28

Thomas Guthmann tguthmann at iseek.com.au
Sun Dec 4 23:30:13 PST 2011


Hi,

Over the weekend, on the client side, we started to see a lot of:

LustreError: 11-0: an error occurred while communicating with 192.168.1.32@tcp. The ost_write operation failed with -28
LustreError: Skipped 23528 previous similar messages

What do they mean or imply? I guess -28 refers to a specific error code?
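
For reference, a minimal sketch of how the number could be translated, 
assuming (as is conventional for Linux kernel code) that the value in the 
log is a negated errno. The helper name describe_lustre_rc is hypothetical 
and only for illustration:

```python
import errno
import os


def describe_lustre_rc(rc):
    """Map a negative return code from a LustreError line to (name, text)."""
    # Assumption: the log prints the errno negated, e.g. -28 for errno 28.
    code = abs(rc)
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    name, text = describe_lustre_rc(-28)
    print(f"-28 -> {name}: {text}")
```

On Linux, errno 28 is ENOSPC ("No space left on device"), which would 
suggest the server refused the write for lack of space. Whether that is 
what actually happened here is not confirmed by anything in the logs.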

The host 192.168.1.32 is up and serves other Lustre filesystems which 
don't have this problem (fingers crossed :). Once this error happened, I 
couldn't write to the filesystem anymore despite 250GB free (lfs df -h). 
Unmounting and remounting the Lustre filesystem fixed the issue, but the 
error came back later. As usual, everything went fine for 2 years until 
today ;)

Any ideas or leads I can follow to investigate this issue further?

BTW, on the server side we 'only' have the usual messages we had before 
the disaster, where xxxx is the Lustre filesystem with the above issue.

Lustre: Skipped 2 previous similar messages
Lustre: xxxx-OST0004: slow direct_io 73s due to heavy IO load
Lustre: xxxx-OST0004: slow journal start 72s due to heavy IO load
Lustre: xxxx-OST0004: slow commitrw commit 72s due to heavy IO load
Lustre: xxxx-OST0003: slow journal start 146s due to heavy IO load
Lustre: xxxx-OST0003: slow brw_start 163s due to heavy IO load
Lustre: Skipped 1 previous similar message
Lustre: xxxx-OST0003: slow journal start 164s due to heavy IO load
Lustre: xxxx-OST0003: slow commitrw commit 164s due to heavy IO load
Lustre: xxxx-OST0003: slow direct_io 164s due to heavy IO load

centos5# rpm -qa |grep lustre
lustre-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5
lustre-ldiskfs-3.1.4-2.6.18_194.17.1.el5_lustre.1.8.5
kernel-devel-2.6.18-194.17.1.el5_lustre.1.8.5
kernel-2.6.18-194.17.1.el5_lustre.1.8.5
lustre-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5

Cheers,
Thomas


