[Lustre-discuss] Lustre clients unable to communicate with server over IB (lustre 1.6.6)
Greg Mason
gmason at msu.edu
Tue Jul 27 15:04:54 PDT 2010
We are in the middle of upgrading our servers to Lustre 1.8.3, but are still running Lustre 1.6.6 in production. The upgrade is on track for the next couple of months, but it has to happen during our next center-wide outage, so in the meantime I need to fix an issue with this older version of Lustre.
We have 6 Lustre clients that recently stopped being able to communicate with an OSS via the o2ib lustre interface. I can tcp ping and ibping in both directions, but "lctl ping nid" returns an Input/Output error. The OSTs that are on this OSS show as inactive.
The clients can talk to the 'inactive' OSTs via the tcp interface if the o2ib lustre interface is disabled on the clients.
scw-045:~ # lfs df -h
UUID bytes Used Available Use% Mounted on
ls09-MDT0000_UUID 61.7G 2.5G 48.9G 4% /mnt/lustre_scratch_2009[MDT:0]
ls09-OST0000_UUID 1.8T 1.1T 622.7G 60% /mnt/lustre_scratch_2009[OST:0]
ls09-OST0001_UUID 1.8T 1.1T 608.9G 61% /mnt/lustre_scratch_2009[OST:1]
ls09-OST0002_UUID 1.8T 1.1T 649.5G 59% /mnt/lustre_scratch_2009[OST:2]
ls09-OST0003_UUID 1.8T 1000.8G 739.9G 54% /mnt/lustre_scratch_2009[OST:3]
ls09-OST0004_UUID 1.8T 1.1T 602.0G 62% /mnt/lustre_scratch_2009[OST:4]
ls09-OST0005_UUID 1.8T 960.2G 780.5G 52% /mnt/lustre_scratch_2009[OST:5]
ls09-OST0006_UUID 1.8T 1.1T 570.0G 63% /mnt/lustre_scratch_2009[OST:6]
ls09-OST0007_UUID 1.8T 1.2T 519.4G 66% /mnt/lustre_scratch_2009[OST:7]
ls09-OST0008_UUID 1.8T 888.3G 852.4G 48% /mnt/lustre_scratch_2009[OST:8]
ls09-OST0009_UUID 1.8T 951.3G 789.4G 51% /mnt/lustre_scratch_2009[OST:9]
ls09-OST000a_UUID 1.8T 1.0T 688.8G 57% /mnt/lustre_scratch_2009[OST:10]
ls09-OST000b_UUID 1.8T 969.9G 770.8G 52% /mnt/lustre_scratch_2009[OST:11]
ls09-OST000c_UUID 1.8T 1.0T 695.4G 56% /mnt/lustre_scratch_2009[OST:12]
ls09-OST000d_UUID 1.8T 1.0T 680.1G 57% /mnt/lustre_scratch_2009[OST:13]
ls09-OST000e_UUID 1.8T 901.8G 838.8G 49% /mnt/lustre_scratch_2009[OST:14]
ls09-OST000f_UUID 1.8T 1.0T 695.4G 56% /mnt/lustre_scratch_2009[OST:15]
ls09-OST0010_UUID 1.8T 995.2G 745.4G 54% /mnt/lustre_scratch_2009[OST:16]
ls09-OST0011_UUID 1.8T 919.9G 820.7G 50% /mnt/lustre_scratch_2009[OST:17]
ls09-OST0012_UUID : inactive device
ls09-OST0013_UUID : inactive device
ls09-OST0014_UUID : inactive device
ls09-OST0015_UUID : inactive device
ls09-OST0016_UUID : inactive device
ls09-OST0017_UUID : inactive device
ls09-OST0018_UUID : inactive device
ls09-OST0019_UUID : inactive device
ls09-OST001a_UUID : inactive device
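To run the same check across several clients, the inactive targets can be pulled out of `lfs df` output like the listing above. This is a minimal sketch that only parses text in the format shown. The function names `inactive_targets` and `ost_index` are hypothetical, and it assumes the four characters after "OST" are the hexadecimal OST index, as the OST000a / [OST:10] pairing in the listing suggests.

```python
import re

# Matches lines such as "ls09-OST0012_UUID : inactive device".
INACTIVE_RE = re.compile(r"^(?P<uuid>\S+_UUID)\s*:\s*inactive device\s*$")

# Assumption: the 4 characters after "OST" are the hex OST index.
OST_RE = re.compile(r"-OST([0-9a-fA-F]{4})_UUID$")


def inactive_targets(lfs_df_output):
    """Return the UUIDs reported as inactive, in the order they appear."""
    found = []
    for line in lfs_df_output.splitlines():
        match = INACTIVE_RE.match(line.strip())
        if match:
            found.append(match.group("uuid"))
    return found


def ost_index(uuid):
    """Convert e.g. 'ls09-OST001a_UUID' to its decimal OST index."""
    match = OST_RE.search(uuid)
    if match is None:
        raise ValueError("not an OST UUID: %r" % uuid)
    return int(match.group(1), 16)


# A short excerpt of the output above, used as sample input.
SAMPLE = """\
ls09-OST0011_UUID 1.8T 919.9G 820.7G 50% /mnt/lustre_scratch_2009[OST:17]
ls09-OST0012_UUID : inactive device
ls09-OST001a_UUID : inactive device
"""

if __name__ == "__main__":
    for uuid in inactive_targets(SAMPLE):
        print(uuid, ost_index(uuid))
```

Fed the full listing, this would report the nine targets from OST0012 through OST001a, all of which sit on the affected OSS.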
On the client, I'm seeing these log messages when attempting to mount the filesystem:
Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:(o2iblnd_cb.c:2468:kiblnd_rejected()) 10.0.0.45@o2ib rejected: consumer defined fatal error
Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:(o2iblnd_cb.c:2468:kiblnd_rejected()) Skipped 48 previous similar messages
Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req@ffff81021d0aa800 x2157/t0 o8->ls09-OST001a_UUID@10.0.0.45@o2ib:6/4 lens 240/400 e 0 to 1 dl 1280243863 ref 2 fl Rpc:N/0/0 rc 0/0
Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:(events.c:66:request_out_callback()) Skipped 223 previous similar messages
Jul 27 11:17:38 scw-045 kernel: Lustre: Request x2157 sent from ls09-OST001a-osc-ffff810219166800 to NID 10.0.0.45@o2ib 0s ago has timed out (limit 5s).
Jul 27 11:17:38 scw-045 kernel: Lustre: Skipped 223 previous similar messages
Jul 27 11:18:53 scw-045 kernel: Lustre: 4746:0:(import.c:507:import_select_connection()) ls09-OST0012-osc-ffff810219166800: tried all connections, increasing latency to 50s
Jul 27 11:18:53 scw-045 kernel: Lustre: 4746:0:(import.c:507:import_select_connection()) Skipped 224 previous similar messages
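To confirm that every rejection on a client points at the same server NID, the kiblnd_rejected() lines can be tallied per peer. The following is a rough sketch that only matches the message format quoted above. The function name `rejected_peers` is hypothetical, and the pattern accepts both the normal `addr@net` NID form and the " at " form that mail archives substitute for "@".

```python
import re
from collections import Counter

# Matches e.g. "...kiblnd_rejected()) 10.0.0.45@o2ib rejected: <reason>"
# and the archive-obfuscated "10.0.0.45 at o2ib rejected: <reason>".
REJECT_RE = re.compile(
    r"kiblnd_rejected\(\)\) "
    r"(?P<addr>[^\s@]+)(?:@| at )(?P<net>\S+) rejected: (?P<reason>.+)$"
)


def rejected_peers(log_text):
    """Count connection rejections per peer NID in syslog text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REJECT_RE.search(line)
        if match:
            nid = "%s@%s" % (match.group("addr"), match.group("net"))
            counts[nid] += 1
    return counts


# Two copies of the rejection message above, one in each NID spelling.
SAMPLE_LOG = (
    "Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:"
    "(o2iblnd_cb.c:2468:kiblnd_rejected()) 10.0.0.45 at o2ib rejected: "
    "consumer defined fatal error\n"
    "Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:"
    "(o2iblnd_cb.c:2468:kiblnd_rejected()) 10.0.0.45@o2ib rejected: "
    "consumer defined fatal error\n"
)

if __name__ == "__main__":
    for nid, count in rejected_peers(SAMPLE_LOG).items():
        print(nid, count)
```

The "Skipped N previous similar messages" lines are not counted here, so the totals are a lower bound on the real number of rejections.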
Rebooting an affected client has not resolved the issue. All other clients and servers are fully functional, and I was not able to find any notable errors in our IB network.
My next step is most likely rebooting the OSS, but I'd like to avoid that if I can, since some of our users run rather unique applications that would be quite sensitive to a 15-minute IO pause.
Thanks,
-Greg
--
Greg Mason
HPC Administrator
Michigan State University
High Performance Computing Center
web: www.hpcc.msu.edu
email: gmason at msu.edu