[Lustre-discuss] Lustre FS Corruption
Charles Taylor
taylor at hpc.ufl.edu
Wed Oct 3 18:54:13 PDT 2007
We have a 4-way SMP server (dual Opteron 275s) configured as a
combined MGS/MDS and OSS, as follows:
/dev/sda                205G  1.8G  192G   1%  /lustre/mri/mdt0
/dev/f2c0l0/lv-f2c0l0   3.4T  2.3T  1.2T  67%  /lustre/mri/ost0
/dev/f2c0l1/lv-f2c0l1   3.4T  3.0T  460G  87%  /lustre/mri/ost1
/dev/f2c1l0/lv-f2c1l0   3.4T  3.0T  399G  89%  /lustre/mri/ost2
/dev/f2c1l1/lv-f2c1l1   3.4T  3.0T  418G  88%  /lustre/mri/ost3
/dev/f3c0l0/lv-f3c0l0   3.4T  3.0T  430G  88%  /lustre/mri/ost4
/dev/f3c0l1/lv-f3c0l1   3.4T  3.0T  431G  88%  /lustre/mri/ost5
/dev/f3c1l0/lv-f3c1l0   3.4T  3.0T  378G  90%  /lustre/mri/ost6
/dev/f3c1l1/lv-f3c1l1   3.4T  3.0T  417G  88%  /lustre/mri/ost7
Under heavy load our server has gone down several times (we believe due
to bug 13438). Although we have successfully run e2fsck locally on
the MDS and each OSS, and then run lfsck according to the documentation,
we still appear to be missing about 9 TB of our storage. That is,
"du -s -h *" finds about 14 TB, but "df -h" reports the file
system as practically full:
[root at submit mri]# df -h .
Filesystem Size Used Avail Use% Mounted on
/mri/scratch 27T 23T 4.0T 86% /scratch/mri
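For reference, the check we ran followed the documented distributed
fsck procedure, roughly as sketched below (device names and the mount
point are the ones from the listings above; the database paths are
illustrative, and each e2fsck runs against an unmounted device):

    # On the MDS: read-only pass that builds the MDS database
    e2fsck -n -v --mdsdb /tmp/mdsdb /dev/sda

    # On each OSS, one pass per OST device, building an OST database
    e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-ost0 /dev/f2c0l0/lv-f2c0l0

    # On a client with the filesystem mounted: cross-check MDS and OST
    # databases (all eight ostdb files would be listed)
    lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-ost0 /tmp/ostdb-ost1 /scratch/mri

Note that with -n both e2fsck and lfsck only report inconsistencies;
dropping -n is what actually repairs or deletes orphan objects, which
is where the "missing" space would normally be recovered.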
Fortunately, we are in a position to wipe it out and reinitialize the
FS, but this is still a bit disconcerting. We have also incorporated
the patch suggested in bug report 13438 into our source and rebuilt,
but we don't yet know whether this will resolve the crashes. Is anyone
else having stability and corruption issues with 1.6.2 on CentOS 4.5
(2.6.9-55.ELsmp) with the tcp and o2ib (OFED 1.2) LNET modules?
I suppose the next thing to try (if the patch does not work) would be
to upgrade to the CentOS 4.5 kernel matching the prebuilt RPMs
(2.6.9-55.0.2), but since we had no problems building from source
against our patched kernel, I'm skeptical that this would make much
difference.
Thanks,
Charlie Taylor
UF HPC Center
On Oct 3, 2007, at 6:17 PM, Andreas Dilger wrote:
> On Oct 01, 2007 12:30 +0100, Wojciech Turek wrote:
>> I have one server for the MGS/MDS function and 4 servers for OSS. All
>> machines are identical. The MDS is connected to back-end storage that
>> is serving two data LUNs. The OSSs are connected to back-end storage
>> that is serving 24 data LUNs. Each server has two network interfaces,
>> configured as follows:
>> OSS1(hostname=storage07) 10.143.245.7 at tcp0
>> OSS1(hostname=storage07) 10.142.10.7 at tcp1
>>
>> tcp0 is 10GbE
>> tcp1 is 1GbE
>>
>>
>> I would like to configure Lustre in such a way that if the tcp0
>> interface fails on an OSS or the MDS, Lustre will be able to use the
>> secondary network to keep communication alive, so that at least some
>> of the clients can keep working. The primary network should be 10GbE
>> and the secondary network 1GbE.
>
> This will work as you want if tcp0 is listed first in modprobe.conf.
> LNET will only use tcp0 unless that fails, at which point it will use
> tcp1.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
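Andreas's suggestion corresponds to a modprobe.conf fragment along
these lines on the servers (a sketch: the interface names eth2 for the
10GbE and eth0 for the 1GbE device are assumptions, so substitute the
actual devices on each machine):

    # tcp0 (10GbE) listed first, so LNET prefers it; tcp1 (1GbE) is
    # only used if tcp0 fails
    options lnet networks="tcp0(eth2),tcp1(eth0)"

Clients that should be able to fall back need reachability on both
networks as well, i.e. a matching networks line listing both their
interfaces.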