[Lustre-discuss] Whats wrong with this client?

David Noriega tsk133 at my.utsa.edu
Fri Feb 11 11:01:05 PST 2011


I know the subject line isn't the best, but I don't know what to say
other than that one Lustre client is acting up while the others are fine.
This client is our 'file' server. It runs an NFS and a Samba server on
top of the Lustre mount.

/etc/fstab
192.168.5.104@tcp0:192.168.5.105@tcp0:/lustre   /lustre lustre defaults,localflock,_netdev 0 0
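
For reference, here is a minimal Python sketch of how I read the device field of that entry: one or more MGS NIDs separated by colons, followed by ":/" and the filesystem name. The function name parse_lustre_device is made up for illustration, and the first address is taken as 192.168.5.104 to match the MGC entry in the lctl dl output. It accepts either "@" or the " at " spelling that the list archive produces.

```python
def parse_lustre_device(device):
    """Split '<nid>[:<nid>...]:/<fsname>' into (list of MGS NIDs, fsname).

    Hypothetical helper for illustration only; it is not part of Lustre.
    """
    # The list archive rewrites '@' as ' at ', so undo that first.
    device = device.replace(" at ", "@")
    # Everything before the last ':/' is the colon-separated NID list.
    nid_part, sep, fsname = device.rpartition(":/")
    if not sep or not nid_part or not fsname:
        raise ValueError(f"not a Lustre device spec: {device!r}")
    return nid_part.split(":"), fsname


mgs_nids, fsname = parse_lustre_device(
    "192.168.5.104@tcp0:192.168.5.105@tcp0:/lustre"
)
print("MGS NIDs:", mgs_nids)
print("fsname:", fsname)
```

So the client should be trying 192.168.5.104 first and 192.168.5.105 as the failover MGS.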

Right now lfs df -h shows all the OSSs as "resource unavailable", yet
lctl dl says they are up:
lctl dl
  0 UP mgc MGC192.168.5.104@tcp adc80ed6-e9a1-6791-e3aa-9a699e11275d 5
  1 UP lov lustre-clilov-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 4
  2 UP mdc lustre-MDT0000-mdc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  3 UP osc lustre-OST0000-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  4 UP osc lustre-OST0001-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  5 UP osc lustre-OST0002-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  6 UP osc lustre-OST0003-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5

On the cluster, all the other nodes are connected just fine, so the
problem seems to be limited to this client. This is what I'm seeing in
dmesg:

A lot of these messages:
LustreError: 4462:0:(llite_nfs.c:96:search_inode_for_lustre()) failure
-2 inode 560441703

Then these messages when the 'disconnect' happens:

Lustre: 13877:0:(client.c:1476:ptlrpc_expire_one_request()) @@@
Request x1353138276259258 sent from
lustre-OST0003-osc-ffff81032f9a0400 to NID 192.168.5.101@tcp 7s ago
has timed out (7s prior to deadline).
  req@ffff8101f18cd000 x1353138276259258/t0
o101->lustre-OST0003_UUID@192.168.5.101@tcp:28/4 lens 296/544 e 0 to 1
dl 1297448442 ref 1 fl Rpc:/0/0 rc 0/0
Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection to service
lustre-OST0003 via nid 192.168.5.101@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection restored to
service lustre-OST0003 using nid 192.168.5.101@tcp.
Lustre: 24416:0:(client.c:1476:ptlrpc_expire_one_request()) @@@
Request x1353138276259591 sent from
lustre-OST0002-osc-ffff81032f9a0400 to NID 192.168.5.101@tcp 8s ago
has timed out (7s prior to deadline).
  req@ffff810292c74c00 x1353138276259591/t0
o101->lustre-OST0002_UUID@192.168.5.101@tcp:28/4 lens 296/544 e 0 to 1
dl 1297448442 ref 1 fl Rpc:/0/0 rc 0/0
Lustre: 24416:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 1
previous similar message
Lustre: lustre-OST0002-osc-ffff81032f9a0400: Connection to service
lustre-OST0002 via nid 192.168.5.101@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0002-osc-ffff81032f9a0400: Connection restored to
service lustre-OST0002 using nid 192.168.5.101@tcp.
Lustre: Skipped 1 previous similar message
Lustre: 13877:0:(client.c:1476:ptlrpc_expire_one_request()) @@@
Request x1353138276259258 sent from
lustre-OST0003-osc-ffff81032f9a0400 to NID 192.168.5.101@tcp 7s ago
has timed out (7s prior to deadline).
  req@ffff8101f18cd000 x1353138276259258/t0
o101->lustre-OST0003_UUID@192.168.5.101@tcp:28/4 lens 296/544 e 0 to 1
dl 1297448449 ref 1 fl Rpc:/2/0 rc 0/0
Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection to service
lustre-OST0003 via nid 192.168.5.101@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection restored to
service lustre-OST0003 using nid 192.168.5.101@tcp.
Lustre: 13877:0:(client.c:1476:ptlrpc_expire_one_request()) @@@
Request x1353138276318758 sent from
lustre-OST0003-osc-ffff81032f9a0400 to NID 192.168.5.101@tcp 0s ago
has failed due to network error (7s prior to deadline).
  req@ffff810321140800 x1353138276318758/t0
o101->lustre-OST0003_UUID@192.168.5.101@tcp:28/4 lens 296/544 e 0 to 1
dl 1297448467 ref 1 fl Rpc:/0/0 rc 0/0
Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection to service
lustre-OST0003 via nid 192.168.5.101@tcp was lost; in progress
operations using this service will wait for recovery to complete.
LustreError: 3897:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error
-104 reading HELLO from 192.168.5.100
LustreError: 11b-b: Connection to 192.168.5.100@tcp at host
192.168.5.100 on port 988 was reset: is it running a compatible
version of Lustre and is 192.168.5.100@tcp one of its NIDs?
Lustre: 3904:0:(import.c:517:import_select_connection())
lustre-OST0002-osc-ffff81032f9a0400: tried all connections, increasing
latency to 2s
Lustre: 3903:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1353138276318825 sent from lustre-OST0000-osc-ffff81032f9a0400 to NID
192.168.5.101@tcp 0s ago has failed due to network error (6s prior to
deadline).
  req@ffff8102e8277000 x1353138276318825/t0
o8->lustre-OST0000_UUID@192.168.5.101@tcp:28/4 lens 368/584 e 0 to 1
dl 1297448473 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 3903:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 7
previous similar messages
LustreError: 3899:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error
-104 reading HELLO from 192.168.5.101
LustreError: 3899:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped
1 previous similar message
LustreError: 11b-b: Connection to 192.168.5.101@tcp at host
192.168.5.101 on port 988 was reset: is it running a compatible
version of Lustre and is 192.168.5.101@tcp one of its NIDs?
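
In case it helps anyone reading the log, here is a rough Python sketch that tallies the messages above per target and per peer. It is only a sketch: the function name summarize and the small errno table are my own, the regular expressions are written against the exact wording shown in this post, and the errno names are the standard Linux values (2 is ENOENT, 104 is ECONNRESET). The sample text is a short excerpt of the log, wrapped the same way as in this mail.

```python
import re
from collections import Counter

# Linux errno values for the two codes that appear in the log.
LINUX_ERRNO = {2: "ENOENT", 104: "ECONNRESET"}


def summarize(log_text):
    """Count lost/restored connections per target and HELLO errors per peer."""
    # Undo the mail wrapping so each message can be matched as one string.
    text = " ".join(log_text.split())
    # The list archive rewrites 'x.x.x.x@tcp' as 'x.x.x.x at tcp'.
    text = re.sub(r"(\d) at (tcp)", r"\1@\2", text)
    lost = Counter(
        re.findall(r"Connection to service (\S+) via nid \S+ was lost", text)
    )
    restored = Counter(
        re.findall(r"Connection restored to service (\S+) using nid", text)
    )
    hello_errors = Counter(
        (int(code), peer)
        for code, peer in re.findall(r"Error (-\d+) reading HELLO from (\S+)", text)
    )
    return lost, restored, hello_errors


sample = """
Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection to service
lustre-OST0003 via nid 192.168.5.101 at tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection restored to
service lustre-OST0003 using nid 192.168.5.101 at tcp.
LustreError: 3897:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error
-104 reading HELLO from 192.168.5.100
"""

lost, restored, hello_errors = summarize(sample)
for target, count in lost.items():
    print(f"{target}: lost {count}, restored {restored[target]}")
for (code, peer), count in hello_errors.items():
    name = LINUX_ERRNO.get(-code, "unknown")
    print(f"{peer}: HELLO error {code} ({name}) x{count}")
```

Read that way, the -104 from both 192.168.5.100 and 192.168.5.101 is a connection reset by the peer, and the -2 from search_inode_for_lustre() is "no such file or directory".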

This sequence now just repeats. How can I get this client reconnected?

-- 
Personally, I liked the university. They gave us money and facilities,
we didn't have to produce anything! You've never been out of college!
You don't know what it's like out there! I've worked in the private
sector. They expect results. -Ray Ghostbusters


