[Lustre-discuss] Nodes claim error with files, then say everything is fine.

Chris Worley worleys at gmail.com
Wed Aug 6 06:26:27 PDT 2008


This seems to be tied to a specific file or directory.  On some client
nodes, a file in this directory cycles through three states: first "ls"
fails with an error, moments later it shows the file with the wrong
size, and then it shows the file correctly.  For example (<user> and
<dir> replace the actual username and directory), the three "ls"
commands below were run in quick succession.  At first
et[06-09,11,13,15] report an I/O error; then et[07-09,13] see the file
at 0 length while et[06,11,15] report "Interrupted system call"; then
all of them see it correctly except et11, which still shows an I/O
error that doesn't seem to be going away:

[root at ln <dir>]# pdsh -w et[01-18] -x et[10,12] ls -l /lfs/<user>/<dir>/zre.sh | dshbak -c
et07: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh at ln: et07: ssh exited with exit code 2
et11: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh at ln: et11: ssh exited with exit code 2
et06: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh at ln: et06: ssh exited with exit code 2
et09: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh at ln: et09: ssh exited with exit code 2
et15: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh at ln: et15: ssh exited with exit code 2
et08: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh at ln: et08: ssh exited with exit code 2
et13: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh at ln: et13: ssh exited with exit code 2
----------------
et[01-05,14,16-18]
----------------
 -rwxr-xr-x 1 <user> <user> 69 Jul 25 06:07 /lfs/<user>/<dir>/zre.sh
[root at ln <dir>]# pdsh -w et[01-18] -x et[10,12] ls -l /lfs/<user>/<dir>/zre.sh | dshbak -c
et06: ls: /lfs/<user>/<dir>/zre.sh: Interrupted system call
pdsh at ln: et06: ssh exited with exit code 2
et15: ls: /lfs/<user>/<dir>/zre.sh: Interrupted system call
pdsh at ln: et15: ssh exited with exit code 2
et11: ls: /lfs/<user>/<dir>/zre.sh: Interrupted system call
pdsh at ln: et11: ssh exited with exit code 2
----------------
et[07-09,13]
----------------
 -rwxr-xr-x 1 <user> <user> 0 Jul 25 06:07 /lfs/<user>/<dir>/zre.sh
----------------
et[01-05,14,16-18]
----------------
 -rwxr-xr-x 1 <user> <user> 69 Jul 25 06:07 /lfs/<user>/<dir>/zre.sh
[root at ln <dir>]# pdsh -w et[01-18] -x et[10,12] ls -l /lfs/<user>/<dir>/zre.sh | dshbak -c
et11: ls: /lfs/<user>/<dir>/zre.sh: Input/output error
pdsh at ln: et11: ssh exited with exit code 2
----------------
et[01-09,13-18]
----------------
 -rwxr-xr-x 1 <user> <user> 69 Jul 25 06:07 /lfs/<user>/<dir>/zre.sh
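
To follow these transitions run by run rather than by eye, something
like the sketch below could tally each node's state.  It assumes the
raw pdsh output (no dshbak) with stderr merged in; classify_ls is a
hypothetical helper name, not part of pdsh or Lustre.

```shell
#!/bin/sh
# Sketch only: read raw "pdsh ... ls -l FILE 2>&1" output on stdin and
# print one "node state" pair per node, where state is ERROR (ls
# failed), EMPTY (size reported as 0) or OK.
classify_ls() {
    awk '
        # Skip the pdsh status lines ("... ssh exited with exit code 2").
        /^pdsh/ { next }
        {
            node = $1
            sub(/:$/, "", node)            # "et07:" -> "et07"
            if ($2 == "ls:")   state = "ERROR"
            else if ($6 == 0)  state = "EMPTY"   # 6th field is the size
            else               state = "OK"
            print node, state
        }'
}

# Hypothetical usage, with FILE set to the affected path:
#   pdsh -w 'et[01-18]' -x 'et[10,12]' ls -l "$FILE" 2>&1 | classify_ls | sort
```

Running this in a loop with a timestamp before each pass would show
how long each node stays in each state.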

Looking at the directory, I still see errors:

[root at ln <dir>]# pdsh -w et[01-18] -x et[10,12] ls -l /lfs/<user>/<dir>/ | dshbak -c
et09: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh at ln: et09: ssh exited with exit code 2
et08: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh at ln: et08: ssh exited with exit code 2
et11: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh at ln: et11: ssh exited with exit code 2
et06: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh at ln: et06: ssh exited with exit code 2
et07: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh at ln: et07: ssh exited with exit code 2
et15: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh at ln: et15: ssh exited with exit code 2
et13: ls: /lfs/<user>/<dir>/: Cannot send after transport endpoint shutdown
pdsh at ln: et13: ssh exited with exit code 2
----------------
et[01-05,14,16-18]
----------------
 total 20
 drwxrwx--- 4 <user> <user> 4096 Aug  5 09:47 Gaussian-E.01
 -rwxr-x--- 1 <user> <user> 9920 Aug  5 09:25 run_benchmarks
 -rwxr-xr-x 1 <user> <user>   69 Jul 25 06:07 zre.sh

Any ideas?

The affected clients run Lustre 1.6.5.1 on RHEL5 (kernel
2.6.18-92.1.6); the OSSs run 1.6.4.3.  I'm not seeing this on RHEL4
clients (kernel 2.6.9-67.0.22.ELsmp-lfs-1.6.5.1), which also run
1.6.5.1.

Chris


