[Lustre-discuss] OSTs hanging while running IOR
Rafael David Tinoco
Rafael.Tinoco at Sun.COM
Wed Sep 9 15:30:18 PDT 2009
I'm attaching the messages file (only the error part) so we don't have these mail formatting problems.
------
Can you provide a bit more of the log before the above so we can see what the stack trace is in reference to? Also, try to
eliminate the white-space between lines. Are you getting any other errors or messages from Lustre prior to that?
Perhaps you are getting some messages saying that various operations are "slow"?
>> Even if it is slow, the OST should still respond, right? It "hangs".
Have you tuned these OSSes with respect to the number of OST threads needed to drive (and not over-drive) your disks? The lustre
iokit is useful for that tuning.
>> OK, tuning for performance is fine, but hanging with 20 nodes (IOR over MPI) is strange, right?
b.
-----
I'm using 3 RAID 5 arrays with 8 disks each and 256 OST threads on each OSS.
root@a02n00:~# cat /etc/mdadm.conf
ARRAY /dev/md10 level=raid5 num-devices=8 devices=/dev/dm-0,/dev/dm-1,/dev/dm-2,/dev/dm-3,/dev/dm-4,/dev/dm-5,/dev/dm-6,/dev/dm-7
ARRAY /dev/md11 level=raid5 num-devices=8 devices=/dev/dm-8,/dev/dm-9,/dev/dm-10,/dev/dm-11,/dev/dm-12,/dev/dm-13,/dev/dm-14,/dev/dm-15
ARRAY /dev/md12 level=raid5 num-devices=8 devices=/dev/dm-16,/dev/dm-17,/dev/dm-18,/dev/dm-19,/dev/dm-20,/dev/dm-21,/dev/dm-22,/dev/dm-23
All my OSTs were created with an internal journal (for test purposes).
mkfs.lustre --r --ost --fsname=work --mkfsoptions="-b 4096 -E stride=32,stripe-width=224 -m 0" --mgsnid=a03n00@o2ib --mgsnid=b03n00@o2ib /dev/md[10|11|12]
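As a sanity check on the -E values above, here is a small sketch of the arithmetic. The function names are just for illustration, and the 128 KiB chunk size is inferred from stride=32 with 4096-byte blocks (it is not stated in mdadm.conf), so treat it as an assumption:

```shell
# stripe-width (in filesystem blocks) = stride * number of data disks.
# RAID 5 keeps one disk's worth of parity, so data disks = total disks - 1.
raid5_stripe_width() {
    stride=$1
    disks=$2
    echo $(( stride * (disks - 1) ))
}

# RAID chunk size in KiB implied by a given stride and block size in bytes.
chunk_kib() {
    stride=$1
    block_bytes=$2
    echo $(( stride * block_bytes / 1024 ))
}

raid5_stripe_width 32 8    # stride=32, 8 disks per array
chunk_kib 32 4096          # stride=32, -b 4096
```

With stride=32 and 8 disks per array this gives 224 blocks, which matches stripe-width=224, and an implied chunk of 128 KiB per disk.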
I'm using a separate MDT and MGS:
# MGS
mkfs.lustre --fsname=work --r --mgs --mkfsoptions="-b 4096 -E stride=4,stripe-width=4 -m 0" --mountfsoptions=acl --failnode=b03n00@o2ib /dev/sdb1
# MDT
mkfs.lustre --fsname=work --r --mgsnid=a03n00@o2ib --mgsnid=b03n00@o2ib --mdt --mkfsoptions="-b 4096 -E stride=4,stripe-width=40 -m 0" --mountfsoptions=acl --failnode=b03n00@o2ib /dev/sdc1
I'm using these packages on the servers:
----------
root@a03n00:~# rpm -aq | grep -i lustre
lustre-modules-1.8.1-2.6.18_128.1.14.el5_lustre.1.8.1
lustre-client-modules-1.8.1-2.6.18_128.1.14.el5_lustre.1.8.1
lustre-ldiskfs-3.0.9-2.6.18_128.1.14.el5_lustre.1.8.1
kernel-lustre-headers-2.6.18-128.1.14.el5_lustre.1.8.1
kernel-lustre-2.6.18-128.1.14.el5_lustre.1.8.1
lustre-client-1.8.1-2.6.18_128.1.14.el5_lustre.1.8.1
kernel-lustre-devel-2.6.18-128.1.14.el5_lustre.1.8.1
lustre-1.8.1-2.6.18_128.1.14.el5_lustre.1.8.1
kernel-ib-1.4.1-2.6.18_128.1.14.el5_lustre.1.8.1
----------
On the client I've compiled kernel 2.6.18-128.el5 without INFINIBAND support.
Then I compiled OFED 1.4.1 and, after that, the patchless client.
The patchless client was compiled with:
--ofa-kernel=/usr/src/ofa_kernel
----------
* THE ERROR
Running, for example:
root@b00n00:~# mpirun -hostfile ./lustre.hosts -np 20 /hpc/IOR -w -r -C -i 2 -b 1G -t 512k -F -o /work/stripe12/teste
makes the OSTs start "hanging", and then the whole filesystem "hangs".
Any attempt to rm or read a file (or to run df -kh) hangs forever (not even kill -9 helps).
After that, I cannot umount my OSTs on the OSSs.
I have to reboot the server, and then my RAIDs start resyncing.
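For a sense of scale, this is roughly what the IOR command line above asks for. It is only arithmetic on the -np, -b and -t values; the function names are made up for illustration:

```shell
# Total data per write (or read) pass in GiB: tasks * block size per task.
# With -F each task uses its own file, so this is also the number of files.
ior_total_gib() {
    tasks=$1
    block_gib=$2
    echo $(( tasks * block_gib ))
}

# Number of transfers each task issues per pass: block size / transfer size.
ior_transfers_per_task() {
    block_mib=$1
    transfer_kib=$2
    echo $(( block_mib * 1024 / transfer_kib ))
}

ior_total_gib 20 1               # -np 20, -b 1G
ior_transfers_per_task 1024 512  # -b 1G (1024 MiB), -t 512k
```

So each pass moves 20 GiB in total, as 2048 transfers of 512 KiB per task, and -i 2 repeats the write and read passes twice.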
Tinoco