[Lustre-discuss] Lustre clients unable to communicate with server over IB (lustre 1.6.6)

Tue Jul 27 15:04:54 PDT 2010

We are in the middle of upgrading our servers to Lustre 1.8.3, but production is still running Lustre 1.6.6. The upgrade is on track for the next couple of months, but it has to happen during our next center-wide outage, so in the meantime I need to fix an issue with this older version of Lustre.

We have 6 Lustre clients that recently stopped being able to communicate with one OSS via the o2ib Lustre interface. I am able to tcp ping and ibping in both directions, but "lctl ping nid" returns an Input/Output error. The OSTs on this OSS show as inactive.

The clients can talk to the 'inactive' OSTs via the tcp interface if the o2ib lustre interface is disabled on the clients.
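
For anyone who wants to repeat the LNET-level check across several NIDs, here is a minimal sketch. It only wraps the "lctl ping nid" command mentioned above; the function names lctl_ping and check_nids are made up for illustration, and the NIDs to test have to be supplied by the caller (the tcp NID of the OSS is not shown in this message).

```python
import subprocess


def lctl_ping(nid, run=subprocess.run):
    """Return True if `lctl ping <nid>` exits with status 0.

    `run` is injectable so the function can be exercised without lctl
    installed; by default it is subprocess.run.
    """
    result = run(["lctl", "ping", nid], capture_output=True, text=True)
    return result.returncode == 0


def check_nids(nids, run=subprocess.run):
    """Map each NID (e.g. '10.0.0.45@o2ib') to True/False reachability."""
    return {nid: lctl_ping(nid, run) for nid in nids}
```

On an affected client, check_nids(["10.0.0.45@o2ib"]) would be expected to report False for the o2ib NID while the same host still answers on its tcp NID.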

scw-045:~ # lfs df -h
UUID                     bytes      Used Available  Use% Mounted on
ls09-MDT0000_UUID        61.7G      2.5G     48.9G    4% /mnt/lustre_scratch_2009[MDT:0]
ls09-OST0000_UUID         1.8T      1.1T    622.7G   60% /mnt/lustre_scratch_2009[OST:0]
ls09-OST0001_UUID         1.8T      1.1T    608.9G   61% /mnt/lustre_scratch_2009[OST:1]
ls09-OST0002_UUID         1.8T      1.1T    649.5G   59% /mnt/lustre_scratch_2009[OST:2]
ls09-OST0003_UUID         1.8T   1000.8G    739.9G   54% /mnt/lustre_scratch_2009[OST:3]
ls09-OST0004_UUID         1.8T      1.1T    602.0G   62% /mnt/lustre_scratch_2009[OST:4]
ls09-OST0005_UUID         1.8T    960.2G    780.5G   52% /mnt/lustre_scratch_2009[OST:5]
ls09-OST0006_UUID         1.8T      1.1T    570.0G   63% /mnt/lustre_scratch_2009[OST:6]
ls09-OST0007_UUID         1.8T      1.2T    519.4G   66% /mnt/lustre_scratch_2009[OST:7]
ls09-OST0008_UUID         1.8T    888.3G    852.4G   48% /mnt/lustre_scratch_2009[OST:8]
ls09-OST0009_UUID         1.8T    951.3G    789.4G   51% /mnt/lustre_scratch_2009[OST:9]
ls09-OST000a_UUID         1.8T      1.0T    688.8G   57% /mnt/lustre_scratch_2009[OST:10]
ls09-OST000b_UUID         1.8T    969.9G    770.8G   52% /mnt/lustre_scratch_2009[OST:11]
ls09-OST000c_UUID         1.8T      1.0T    695.4G   56% /mnt/lustre_scratch_2009[OST:12]
ls09-OST000d_UUID         1.8T      1.0T    680.1G   57% /mnt/lustre_scratch_2009[OST:13]
ls09-OST000e_UUID         1.8T    901.8G    838.8G   49% /mnt/lustre_scratch_2009[OST:14]
ls09-OST000f_UUID         1.8T      1.0T    695.4G   56% /mnt/lustre_scratch_2009[OST:15]
ls09-OST0010_UUID         1.8T    995.2G    745.4G   54% /mnt/lustre_scratch_2009[OST:16]
ls09-OST0011_UUID         1.8T    919.9G    820.7G   50% /mnt/lustre_scratch_2009[OST:17]
ls09-OST0012_UUID   : inactive device
ls09-OST0013_UUID   : inactive device
ls09-OST0014_UUID   : inactive device
ls09-OST0015_UUID   : inactive device
ls09-OST0016_UUID   : inactive device
ls09-OST0017_UUID   : inactive device
ls09-OST0018_UUID   : inactive device
ls09-OST0019_UUID   : inactive device
ls09-OST001a_UUID   : inactive device
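
To pull the inactive targets out of output like the above without reading it by eye, something along these lines should work. It is a rough sketch that assumes the "UUID : inactive device" layout shown here; the function name inactive_osts is hypothetical.

```python
def inactive_osts(lfs_df_output):
    """Return the UUIDs that `lfs df` reports as 'inactive device'."""
    inactive = []
    for line in lfs_df_output.splitlines():
        # Inactive targets look like "ls09-OST0012_UUID   : inactive device".
        uuid, sep, state = line.partition(":")
        if sep and state.strip() == "inactive device":
            inactive.append(uuid.strip())
    return inactive
```

Fed the listing above, this returns the nine UUIDs from ls09-OST0012_UUID through ls09-OST001a_UUID, which are the OSTs served by the affected OSS.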

On the client, I'm seeing these log messages when attempting to mount the filesystem:
Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:(o2iblnd_cb.c:2468:kiblnd_rejected()) 10.0.0.45@o2ib rejected: consumer defined fatal error
Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:(o2iblnd_cb.c:2468:kiblnd_rejected()) Skipped 48 previous similar messages
Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req@ffff81021d0aa800 x2157/t0 o8->ls09-OST001a_UUID@10.0.0.45@o2ib:6/4 lens 240/400 e 0 to 1 dl 1280243863 ref 2 fl Rpc:N/0/0 rc 0/0
Jul 27 11:17:38 scw-045 kernel: LustreError: 2947:0:(events.c:66:request_out_callback()) Skipped 223 previous similar messages
Jul 27 11:17:38 scw-045 kernel: Lustre: Request x2157 sent from ls09-OST001a-osc-ffff810219166800 to NID 10.0.0.45@o2ib 0s ago has timed out (limit 5s).
Jul 27 11:17:38 scw-045 kernel: Lustre: Skipped 223 previous similar messages
Jul 27 11:18:53 scw-045 kernel: Lustre: 4746:0:(import.c:507:import_select_connection()) ls09-OST0012-osc-ffff810219166800: tried all connections, increasing latency to 50s
Jul 27 11:18:53 scw-045 kernel: Lustre: 4746:0:(import.c:507:import_select_connection()) Skipped 224 previous similar messages
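
For what it's worth, status -113 in the request_out_callback() line is errno 113, which is EHOSTUNREACH ("No route to host") on Linux. A small sketch for tallying these messages from a syslog excerpt follows; the regular expressions are my own guess at the layout of the lines above rather than any official format, and the names summarize, REJECT_RE, STATUS_RE and ERRNO_NAMES are hypothetical.

```python
import re
from collections import Counter

# Matches "<addr>@<net> rejected: <reason>"; also accepts the " at " form
# that mailing-list archives substitute for "@".
REJECT_RE = re.compile(
    r"kiblnd_rejected\(\)\) (\S+?)(?:@| at )(\S+) rejected: (.+)$"
)
STATUS_RE = re.compile(r"status (-?\d+)")

# Only the errno seen in this thread; value is the Linux definition.
ERRNO_NAMES = {113: "EHOSTUNREACH"}


def summarize(lines):
    """Count o2iblnd rejections per (NID, reason) and RPC send statuses."""
    rejects = Counter()
    statuses = Counter()
    for line in lines:
        m = REJECT_RE.search(line)
        if m:
            nid = f"{m.group(1)}@{m.group(2)}"
            rejects[(nid, m.group(3).strip())] += 1
        s = STATUS_RE.search(line)
        if s and "request_out_callback" in line:
            code = abs(int(s.group(1)))
            statuses[ERRNO_NAMES.get(code, str(code))] += 1
    return rejects, statuses
```

Note that the counts only cover lines actually present in the log; the "Skipped N previous similar messages" lines are not added in.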

Rebooting an affected client has not resolved the issue. All other clients and servers are fully functional, and I was not able to find any notable errors in our IB network.

My next step is most likely rebooting the OSS, but I'd like to avoid that if I can: some of our users run rather unusual applications that would be quite sensitive to a 15-minute I/O pause.

Thanks,

-Greg

--
Greg Mason
HPC Administrator
Michigan State University
High Performance Computing Center

web: www.hpcc.msu.edu
email: gmason at msu.edu