[Lustre-discuss] Client errors then reboot

David Noriega tsk133 at my.utsa.edu
Mon Nov 22 11:04:27 PST 2010


We're running the latest Lustre (1.8.4) on kernel
2.6.18-194.3.1.el5. The machine in question is what I call our primary
client, as it exposes the file system for others to use via NFS/Samba.
Today it seemingly rebooted on its own, and checking the logs I see
these messages:


Nov 22 12:25:52 cajal kernel: LustreError:
3909:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped 8 previous
similar messages
Nov 22 12:25:52 cajal kernel: LustreError: 11b-b: Connection to
192.168.5.101@tcp at host 192.168.5.101 on port 988 was reset: is it
running a compatible version of Lustre and is 192.168.5.101@tcp one of
its NIDs?
Nov 22 12:25:52 cajal kernel: LustreError: Skipped 8 previous similar messages
Nov 22 12:31:22 cajal kernel: LustreError:
5870:0:(llite_nfs.c:96:search_inode_for_lustre()) failure -2 inode
565846402
Nov 22 12:31:22 cajal kernel: LustreError:
5870:0:(llite_nfs.c:96:search_inode_for_lustre()) Skipped 490 previous
similar messages
Nov 22 12:33:40 cajal mountd[5959]: /lustre/home and /home have same
filehandle for 10.0.0.0/255.0.0.0, using first
Nov 22 12:36:31 cajal kernel: LustreError:
3908:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error -104 reading
HELLO from 192.168.5.101
Nov 22 12:36:31 cajal kernel: LustreError:
3908:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped 9 previous
similar messages
Nov 22 12:36:31 cajal kernel: LustreError: 11b-b: Connection to
192.168.5.101@tcp at host 192.168.5.101 on port 988 was reset: is it
running a compatible version of Lustre and is 192.168.5.101@tcp one of
its NIDs?
Nov 22 12:36:31 cajal kernel: LustreError: Skipped 9 previous similar messages
Nov 22 12:37:34 cajal mountd[5959]: authenticated mount request from
129.115.117.22:723 for /lustre/home/qyu926 (/lustre/home)
Nov 22 12:38:38 cajal mountd[5959]: /lustre/home and /home have same
filehandle for 129.115.0.0/255.255.0.0, using first
Nov 22 12:40:20 cajal rpc.idmapd[3669]: nss_getpwnam: name '500' does
not map into domain 'cbi.utsa.edu'
Nov 22 12:41:23 cajal kernel: LustreError:
5466:0:(llite_nfs.c:96:search_inode_for_lustre()) failure -2 inode
565846402
Nov 22 12:41:23 cajal kernel: LustreError:
5466:0:(llite_nfs.c:96:search_inode_for_lustre()) Skipped 503 previous
similar messages

This is the last entry before the system reboots and the normal
kernel boot messages begin.

This is what I see on 192.168.5.101

Nov 22 12:25:22 data2 kernel: LustreError:
4726:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error -104 reading
HELLO from 129.115.117.8
Nov 22 12:25:22 data2 kernel: LustreError:
4726:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped 8 previous
similar messages
Nov 22 12:36:02 data2 kernel: LustreError:
4725:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error -104 reading
HELLO from 129.115.117.8
Nov 22 12:36:02 data2 kernel: LustreError:
4725:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped 9 previous
similar messages
Nov 22 12:43:39 data2 kernel: Lustre:
23762:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1351344462337868 sent from lustre-OST0002 to NID 129.115.117.8@tcp 7s
ago has timed out (7s prior to deadline).
Nov 22 12:43:39 data2 kernel:   req@ffff81004c42d800
x1351344462337868/t0 o104->@NET_0x2000081737508_UUID:15/16 lens
296/384 e 0 to 1 dl 1290451419 ref 1 fl Rpc:N/0/0 rc 0/0
Nov 22 12:43:39 data2 kernel: LustreError: 138-a: lustre-OST0002: A
client on nid 129.115.117.8@tcp was evicted due to a lock blocking
callback to 129.115.117.8@tcp timed out: rc -107
Nov 22 12:44:38 data2 kernel: Lustre:
23569:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1351344462337882 sent from lustre-OST0003 to NID 129.115.117.8@tcp 0s
ago has failed due to network error (7s prior to deadline).
Nov 22 12:44:38 data2 kernel:   req@ffff810111b38400
x1351344462337882/t0 o104->@NET_0x2000081737508_UUID:15/16 lens
296/384 e 0 to 1 dl 1290451485 ref 1 fl Rpc:N/0/0 rc 0/0
Nov 22 12:44:38 data2 kernel: LustreError: 138-a: lustre-OST0003: A
client on nid 129.115.117.8@tcp was evicted due to a lock blocking
callback to 129.115.117.8@tcp timed out: rc -107
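
For reference, the negative numbers in these messages (-2, -104, -107)
look like ordinary Linux errno values reported with a minus sign. That
is an assumption on my part, not something the logs state. A minimal
sketch for decoding them on a Linux box; decode_rc is just a helper
name made up for this example:

```python
import errno
import os


def decode_rc(rc):
    """Return the symbolic errno name for a kernel-style return code."""
    # Assumption: the code is a standard errno reported as a negative
    # number, so only its magnitude is looked up.
    return errno.errorcode.get(abs(rc), "UNKNOWN")


if __name__ == "__main__":
    # Codes copied from the log excerpts above.
    for rc in (-2, -104, -107):
        print(rc, decode_rc(rc), os.strerror(abs(rc)))
```

The errno numbering is platform specific, so this should be run on the
same kind of system that produced the logs.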


What's going on?
Thanks
David
-- 
Personally, I liked the university. They gave us money and facilities,
we didn't have to produce anything! You've never been out of college!
You don't know what it's like out there! I've worked in the private
sector. They expect results. -Ray Ghostbusters


