[Lustre-discuss] rpm problem on Lustre fs?

Craig Prescott prescott at hpc.ufl.edu
Mon Oct 29 06:21:13 PDT 2007


Hi,

We are running Lustre 1.6.3 using o2ib (OFED 1.2) and tcp networks.
The clients are CentOS 4.5 patchless clients, and the single server
(MGS/MDS/OSS combo) is a CentOS 5.0 machine with a patched kernel (which
includes the proposed fix for bug 13438).  All nodes are x86_64.

We have run into a problem on the clients when one of our users
tries to install an rpm package into an rpm database that lives
on the Lustre filesystem.

The rpm install command hangs in uninterruptible I/O wait (D state); the
client is using o2ib.  Attempts to access the rpm database directory from
other processes, such as ls, also hang in D state.  ldlm_poold and pdflush
are stuck in D state as well:

[root@iogw1 ~]# ps auxww | grep ' D ' | grep -v grep
root        79  0.0  0.0     0    0 ?        D    Oct26   0:01 [pdflush]
uscms01   7155  0.0  0.0 16912 1684 ?        D    Oct26   0:00 ls -al /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm
uscms01  11712  0.0  0.0 16912 1684 ?        D    Oct27   0:00 ls -al /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm
uscms01  15363  0.0  0.0  2820  820 ?        D    Oct27   0:00 /usr/sbin/lsof /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib
uscms01  16589  0.0  0.1 12700 5564 ?        D    Oct26   0:02 rpm -Uvh --define _rpmlock_path /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm/__db.0 -r /scratch/mri/osg/app/cmssoft/cms --dbpath /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm --rcfile /scratch/mri/osg/app/cmssoft/cms/tmp/BOOTSTRAP/build/eulisse/rpm/PKGTOOLS/slc3_ia32_gcc323/external/rpm/4.4.2.1-CMS3/lib/rpm/rpmrc --nodeps --prefix /scratch/mri/osg/app/cmssoft/cms --ignoreos --ignorearch /scratch/mri/osg/app/cmssoft/cms/tmp/BOOTSTRAP/external+elfutils+0.128-CMS3-1-1007.slc3_ia32_gcc323.rpm
uscms01  21531  0.0  0.0 16912 1684 ?        D    Oct27   0:00 ls -al /scratch/mri/osg/app/cmssoft/cms/slc3_ia32_gcc323/var/lib/rpm
root     23827  0.0  0.0     0    0 ?        D    Oct26   0:02 [ldlm_poold]
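
A side note on the filter itself: grep ' D ' only matches a STAT value of
exactly "D", so states such as "D+" or "Ds" would be missed.  A minimal
sketch that checks the STAT column instead is below.  It assumes the usual
11-column `ps aux` layout (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME
COMMAND), and the function name d_state_processes is only illustrative.

```python
def d_state_processes(ps_output):
    """Return (pid, stat, command) for every process whose STAT starts with 'D'.

    Assumes `ps aux`-style text: ten whitespace-separated columns followed
    by the command line, which may itself contain spaces.
    """
    hits = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so the command line stays in one piece.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[1] == "PID":
            continue  # skip blank, truncated, and header lines
        stat = fields[7]
        if stat.startswith("D"):
            hits.append((int(fields[1]), stat, fields[10]))
    return hits
```

It can be fed the text of `ps auxww`, either pasted from a terminal or
captured with Python's subprocess module.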

Rebooting the client and remounting restores access to the rpm
database directory (ls works), but if the user starts their commands
again, the problem repeats.

We tried adding the 'flock' mount option, and the user was able to
complete the software installation once, but then the problem returned.

In the system logs on the client, I see some LustreErrors, but I am
unsure whether they correspond to the user's activity (see appended).
They mention bug 11742, which does not appear to have been resolved.

Has anyone seen this before, or does anyone know of a fix?  Any help
would be appreciated.

Thanks,
Craig

From the client:

Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 10.13.24.85@o2ib inum 18200061/3052028908 object 3065594/0 extent [446464-450559]
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 947c387a, server csum bdfdb0f6, client csum now 279c1071
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error  req@000001011f910e00 x390550/t31236588 o4->mri-OST0004_UUID@10.13.24.85@o2ib:28 lens 384/352 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 10.13.24.85@o2ib inum 18200059/1461467594 object 3307767/0 extent [20480-24575]
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 38323908, server csum 8ecf8abb, client csum now 8e60616c
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error  req@000001011e530a00 x390536/t34503458 o4->mri-OST0000_UUID@10.13.24.85@o2ib:28 lens 384/352 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 10.13.24.85@o2ib inum 18200060/1314702654 object 3279245/0 extent [1204224-1318911]
Oct 26 12:56:33 iogw1 kernel: LustreError: Skipped 3 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 18528fa8, server csum 4cda3c4e, client csum now 7b2e3f37
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) Skipped 3 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error  req@00000100bfda0e00 x390556/t29531809 o4->mri-OST0001_UUID@10.13.24.85@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) Skipped 3 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 10.13.24.85@o2ib inum 18200060/1314702654 object 3279245/0 extent [1204224-1318911]
Oct 26 12:56:33 iogw1 kernel: LustreError: Skipped 4 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum 6fc1808f, server csum f9c02df2, client csum now 271aff8c
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) Skipped 4 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error  req@000001011e189e00 x390559/t29531810 o4->mri-OST0001_UUID@10.13.24.85@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) Skipped 4 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742): from 10.13.24.85@o2ib inum 18200061/3052028908 object 3065594/0 extent [446464-450559]
Oct 26 12:56:33 iogw1 kernel: LustreError: Skipped 2 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) original client csum be7ec5be, server csum 952e539d, client csum now 7c80e126
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1087:check_write_checksum()) Skipped 2 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1277:osc_brw_redo_request()) too many checksum retries, returning error
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) @@@ redo for checksum error  req@000001012673d000 x390567/t29531812 o4->mri-OST0001_UUID@10.13.24.85@o2ib:28 lens 432/360 ref 2 fl Complete:R/0/0 rc 0/0
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) Skipped 3 previous similar messages
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1277:osc_brw_redo_request()) too many checksum retries, returning error
Oct 26 12:56:33 iogw1 kernel: LustreError: 23766:0:(osc_request.c:1281:osc_brw_redo_request()) Skipped 1 previous similar message
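
To get a quick overview of which OSTs the checksum retries were sent to,
the log above can be tallied with a short script.  This is only a sketch:
it assumes the message format shown above ("o4->mri-OST0001_UUID ..." for a
retried request, "too many checksum retries" when the client gives up), and
the name summarize_checksum_errors is made up for illustration.  Because the
kernel rate-limits these messages ("Skipped N previous similar messages"),
the counts are lower bounds.

```python
import re
from collections import Counter

# Matches the target in a request dump such as "o4->mri-OST0001_UUID".
TARGET_RE = re.compile(r"o\d+->([\w-]+?)_UUID")
GIVE_UP = "too many checksum retries"


def summarize_checksum_errors(log_text):
    """Return (logged redo requests per target, number of give-up messages)."""
    # Searching the whole text (not line by line) tolerates wrapped log lines.
    redo_per_target = Counter(TARGET_RE.findall(log_text))
    gave_up = sum(1 for line in log_text.splitlines() if GIVE_UP in line)
    return redo_per_target, gave_up
```

Calling it on the text of /var/log/messages (or on the excerpt above) returns
a Counter keyed by target name plus the number of times the client gave up
on a write.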
