[Lustre-discuss] Multirail IB Configuration Issue

mages, brian brian.mages at emc.com
Tue Feb 26 10:04:06 PST 2013


Hi,

It appears that I've resolved the issue and therefore wanted to provide an update to this list.  As I noted in the description of my configuration, the client only has a single IB interface.  After changing the options for lnet in "/etc/modprobe.conf" (on the client) from "options lnet networks=o2ib0(ib0)" to "options lnet networks=o2ib0(ib0),o2ib1(ib0)", things started working.
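
For anyone checking a similar change: the lnet options are only read when the module loads, so the modules have to be reloaded before the new "networks" setting applies. A minimal sketch of a follow-up check is below. It assumes "lctl list_nids" prints one NID per line in address@network form, with o2ib0 shown as "o2ib" (as in the logs further down). The name missing_networks is a hypothetical helper, not part of Lustre.

```shell
# Hypothetical helper: reads NIDs on stdin (one per line, as printed by
# `lctl list_nids`) and prints each expected LNet network, given as an
# argument, for which no local NID was found.
missing_networks() {
    nids=$(cat)
    for net in "$@"; do
        # A NID has the form <address>@<network>; anchor the match at the
        # end of the line so that "o2ib" does not also match "o2ib1".
        printf '%s\n' "$nids" | grep "@${net}\$" >/dev/null || printf '%s\n' "$net"
    done
}

# Intended use on the client, after the Lustre modules have been reloaded:
#   lctl list_nids | missing_networks o2ib o2ib1
```

No output means every listed network has a local NID; any network name printed is one the client did not bring up.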

Now, I said "appears" above because I am seeing an issue that I've not seen in the past.  Occasionally, while testing workloads with 8 concurrent clients, I see a client being evicted.  The stack trace is not always the same.  Here's an excerpt from "/var/log/messages":

Feb 26 11:26:05 bmr2-s14 kernel: Lustre: 7648:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for sent delay: [sent 1361895936/real 0]  req@ffff81013fe3d800 x1428048654366757/t0(0) o4->lustre2-OST0015-osc-ffff810229235c00@192.168.1.31@o2ib1:6/4 lens 456/416 e 0 to 1 dl 1361895943 ref 3 fl Rpc:X/0/ffffffff rc 0/-1
Feb 26 11:26:05 bmr2-s14 kernel: Lustre: 7648:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 10 previous similar messages
Feb 26 11:26:05 bmr2-s14 kernel: Lustre: lustre2-OST0010-osc-ffff810229235c00: Connection to lustre2-OST0010 (at 192.168.1.20@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Feb 26 11:26:05 bmr2-s14 kernel: Lustre: Skipped 2 previous similar messages
Feb 26 11:26:21 bmr2-s14 kernel: Lustre: 7647:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for sent delay: [sent 1361895964/real 0]  req@ffff8102438a2800 x1428048654378315/t0(0) o400->lustre2-OST0010-osc-ffff810229235c00@192.168.1.20@o2ib:28/4 lens 192/192 e 0 to 1 dl 1361895981 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Feb 26 11:26:21 bmr2-s14 kernel: Lustre: 7647:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 14 previous similar messages
Feb 26 11:26:30 bmr2-s14 kernel: Lustre: lustre2-OST0013-osc-ffff810229235c00: Connection restored to lustre2-OST0013 (at 192.168.1.31@o2ib1)
Feb 26 11:26:30 bmr2-s14 kernel: Lustre: Skipped 8 previous similar messages
Feb 26 11:26:32 bmr2-s14 kernel: LNetError: 7580:0:(o2iblnd_cb.c:2989:kiblnd_check_txs_locked()) Timed out tx: active_txs, 3 seconds
Feb 26 11:26:32 bmr2-s14 kernel: LNetError: 7580:0:(o2iblnd_cb.c:3052:kiblnd_check_conns()) Timed out RDMA with 192.168.1.20@o2ib (55): c: 8, oc: 0, rc: 16
Feb 26 11:27:21 bmr2-s14 kernel: Lustre: 7644:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for sent delay: [sent 1361896015/real 0]  req@ffff810082199800 x1428048654380582/t0(0) o8->lustre2-OST0010-osc-ffff810229235c00@192.168.1.20@o2ib:28/4 lens 368/512 e 0 to 1 dl 1361896041 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Feb 26 11:27:21 bmr2-s14 kernel: Lustre: 7644:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Feb 26 11:29:11 bmr2-s14 kernel: Lustre: 7644:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for sent delay: [sent 1361896115/real 0]  req@ffff81009d25ec00 x1428048654380680/t0(0) o8->lustre2-OST0010-osc-ffff810229235c00@192.168.1.20@o2ib:28/4 lens 368/512 e 0 to 1 dl 1361896151 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Feb 26 11:29:11 bmr2-s14 kernel: Lustre: 7644:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
Feb 26 11:29:15 bmr2-s14 kernel: INFO: task iozone:9201 blocked for more than 120 seconds.
Feb 26 11:29:15 bmr2-s14 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 26 11:29:15 bmr2-s14 kernel: iozone        D ffffffff801546d1     0  9201      1  9202    9269  7846 (NOTLB)
Feb 26 11:29:15 bmr2-s14 kernel:  ffff8101278f5aa8 0000000000000082 ffff8101278f5ab8 ffffffff80062ff2
Feb 26 11:29:15 bmr2-s14 kernel:  ffff81021dbaddf0 0000000000000007 ffff81014f521820 ffff810108617100
Feb 26 11:29:15 bmr2-s14 kernel:  00003976bb915dac 0000000000001fbe ffff81014f521a08 000000018006ec8f
Feb 26 11:29:15 bmr2-s14 kernel: Call Trace:
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff80062ff2>] thread_return+0x62/0xfe
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff8006ec8f>] do_gettimeofday+0x40/0x90
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff80028d0e>] sync_page+0x0/0x43
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff800637ce>] io_schedule+0x3f/0x67
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff80028d4c>] sync_page+0x3e/0x43
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff800639fa>] __wait_on_bit+0x40/0x6e
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff800350d9>] wait_on_page_bit+0x6c/0x72
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff800a2e8b>] wake_bit_function+0x0/0x23
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff80047cae>] pagevec_lookup_tag+0x1a/0x21
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff8001d19f>] mpage_writepages+0x18d/0x37d
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff88e7f850>] :lustre:ll_writepage+0x0/0x430
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff8005a8a6>] do_writepages+0x20/0x2f
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff8004f767>] __filemap_fdatawrite_range+0x50/0x5b
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff800c8cf4>] sync_page_range+0x3d/0xa0
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff800c8ff2>] generic_file_writev+0x8a/0xa3
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff88ea430d>] :lustre:vvp_io_write_start+0xfd/0x1b0
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff88aaea50>] :obdclass:cl_io_start+0x90/0xf0
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff88ab1718>] :obdclass:cl_io_loop+0x88/0x130
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff88e5d16e>] :lustre:ll_file_io_generic+0x43e/0x480
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff88e5d335>] :lustre:ll_file_writev+0x185/0x1f0
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff88e66a71>] :lustre:ll_file_write+0x121/0x190
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff80016b92>] vfs_write+0xce/0x174
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff8001745b>] sys_write+0x45/0x6e
Feb 26 11:29:15 bmr2-s14 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
Feb 26 11:29:15 bmr2-s14 kernel:

Here's some additional info showing loss of connection to 3 of the 6 OSTs located on this OSS (on the .20@o2ib interface):

[root@bmr2-s14 ~]# cat /proc/fs/lustre/osc/lustre2-OST*-osc-ffff810229235c00/ost_conn_uuid
192.168.1.25@o2ib
192.168.1.35@o2ib1
192.168.1.25@o2ib
192.168.1.35@o2ib1
192.168.1.25@o2ib
192.168.1.35@o2ib1
192.168.1.26@o2ib
192.168.1.36@o2ib1
192.168.1.26@o2ib
192.168.1.36@o2ib1
192.168.1.26@o2ib
192.168.1.36@o2ib1
192.168.1.20@o2ib
192.168.1.30@o2ib1
192.168.1.20@o2ib
192.168.1.30@o2ib1
192.168.1.20@o2ib
192.168.1.30@o2ib1
192.168.1.21@o2ib
192.168.1.31@o2ib1
192.168.1.21@o2ib
192.168.1.31@o2ib1
192.168.1.21@o2ib
192.168.1.31@o2ib1
[root@bmr2-s14 ~]# cat /proc/fs/lustre/osc/lustre2-OST*-osc-ffff810229235c00/ost_server_uuid
lustre2-OST0000_UUID    FULL
lustre2-OST0001_UUID    FULL
lustre2-OST0002_UUID    FULL
lustre2-OST0003_UUID    FULL
lustre2-OST0004_UUID    FULL
lustre2-OST0005_UUID    FULL
lustre2-OST0006_UUID    FULL
lustre2-OST0007_UUID    FULL
lustre2-OST0008_UUID    FULL
lustre2-OST0009_UUID    FULL
lustre2-OST000a_UUID    FULL
lustre2-OST000b_UUID    FULL
lustre2-OST000c_UUID    CONNECTING
lustre2-OST000d_UUID    FULL
lustre2-OST000e_UUID    CONNECTING
lustre2-OST000f_UUID    FULL
lustre2-OST0010_UUID    CONNECTING
lustre2-OST0011_UUID    FULL
lustre2-OST0012_UUID    FULL
lustre2-OST0013_UUID    FULL
lustre2-OST0014_UUID    FULL
lustre2-OST0015_UUID    FULL
lustre2-OST0016_UUID    FULL
lustre2-OST0017_UUID    FULL
[root@bmr2-s14 ~]#
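
With 24 targets in the list it is easy to miss an entry, so a small filter along these lines can pick out the imports that are not in the FULL state. This is only a sketch: not_full is a hypothetical helper name, and it assumes the two-column "target state" layout shown above.

```shell
# Hypothetical helper: reads "<target>_UUID <state>" lines on stdin, in the
# layout of the ost_server_uuid files, and prints only the targets whose
# state is something other than FULL.
not_full() {
    # NF >= 2 skips blank or malformed lines.
    awk 'NF >= 2 && $2 != "FULL" { print $1, $2 }'
}

# Intended use on the client:
#   cat /proc/fs/lustre/osc/lustre2-OST*-osc-*/ost_server_uuid | not_full
```

Run against the listing above, this would leave just the three CONNECTING targets (OST000c, OST000e and OST0010).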

Based on some research, I've experimented with setting "options ko2iblnd peer_credits=16 concurrent_sends=16" in /etc/modprobe.conf and this has made the issue occur less frequently.  However, it is still occurring.  I'm not sure if this has something to do with both server interfaces being located on the same network or something else.
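
One way to confirm that the ko2iblnd options were actually picked up after a module reload is to read the values back from sysfs. The sketch below assumes the module exposes its parameters under /sys/module/ko2iblnd/parameters, which may depend on the kernel and Lustre version. The name show_params is a hypothetical helper.

```shell
# Hypothetical helper: prints "name=value" for each named parameter found in
# a module's sysfs parameters directory (given as the first argument).
show_params() {
    dir=$1
    shift
    for p in "$@"; do
        if [ -r "$dir/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$dir/$p")"
        else
            # The parameter file is missing or unreadable on this system.
            printf '%s=<not exposed>\n' "$p"
        fi
    done
}

# Intended use (the sysfs path is an assumption, see above):
#   show_params /sys/module/ko2iblnd/parameters peer_credits concurrent_sends
```

If the values printed do not match /etc/modprobe.conf, the modules were most likely not reloaded after the edit.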

Any input would be appreciated.

Thanks,
...Brian

From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of mages, brian
Sent: Tuesday, February 12, 2013 3:34 PM
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] Multirail IB Configuration Issue

Hi,

I'm having difficulty getting one of my clients to work with a multirail IB configuration.  Here's what I've got:

     Host     OS Version  Lustre Version  Function      Storage                     Interface ib0     Interface ib1

 1.  bmr1-s7  CentOS 5.7  2.1.1           MGS,MDS,OSS1  mdt,mdt2,ost1->6,ost13->18  192.168.1.25/24   192.168.1.35/24
 2.  bmr1-s8  CentOS 5.7  2.1.1           OSS2          ost7->12,ost19->24          192.168.1.26/24   192.168.1.36/24
 3.  bmr1-s5  CentOS 5.7  2.1.1           OSS3          ost25->30                   192.168.1.20/24   192.168.1.30/24
 4.  bmr1-s6  CentOS 5.7  2.1.1           OSS4          ost31->36                   192.168.1.21/24   192.168.1.31/24
 5.  bmr2-s9  CentOS 5.7  2.1.1           Client        n/a                         192.168.1.209/24

The "/lustre" filesystem consists of mdt and ost1->12 (using bmr1-s7 and bmr1-s8).
The "/lustre2" filesystem consists of mdt2 and ost13->36 (using bmr1-s7, bmr1-s8, bmr1-s5, and bmr1-s6).
On each OSS, half the OSTs are available only on ib0 and the other half only on ib1.

From bmr1-s5 and bmr1-s6 (using them as clients), I can successfully mount and access "/lustre".  I can also successfully mount "/lustre2".

From bmr2-s9, I can mount neither "/lustre" nor "/lustre2".  Originally, the issue with bmr2-s9 was that it was running 1.8.6-wc1 (server on CentOS 5.6).  Since this config (i.e., multirail) wasn't supported on that version, I upgraded to 2.1.1.  I first tried installing and testing the 2.1.1 client, without success.  Then, since mounting had worked with the 2.1.1 server install on both bmr1-s5 and bmr1-s6, I tried the 2.1.1 server install on bmr2-s9 next.  Unfortunately, that still didn't work.

1a) Here's what I see on the client when I try to mount "/lustre":

[root@bmr2-s9 ~]# mount -t lustre 192.168.1.25@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.25@o2ib:/lustre at /mnt/lustre failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root@bmr2-s9 ~]#

1b) Here's an excerpt from "/var/log/messages" on the client (after executing the above command):

Feb 12 15:00:54 bmr2-s9 kernel: Lustre: 5512:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC192.168.1.25@o2ib->MGC192.168.1.25@o2ib_0 netid 50000: select flavor null
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: MGC192.168.1.25@o2ib: Reactivating import
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5523:0:(ldlm_lib.c:357:client_obd_setup()) can't add initial connection
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5523:0:(obd_config.c:522:class_setup()) setup lustre-OST0001-osc-ffff81045d783c00 failed (-2)
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5523:0:(obd_config.c:1361:class_config_llog_handler()) Err -2 on cfg command:
Feb 12 15:00:54 bmr2-s9 kernel: Lustre:    cmd=cf003 0:lustre-OST0001-osc  1:lustre-OST0001_UUID  2:192.168.1.35@o2ib1
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 15c-8: MGC192.168.1.25@o2ib: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(llite_lib.c:950:ll_fill_super()) Unable to process log: -2
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 4923:0:(lov_obd.c:927:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(obd_config.c:567:class_cleanup()) Device 5 not setup
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Feb 12 15:00:54 bmr2-s9 kernel: Lustre: client ffff81045d783c00 umount complete
Feb 12 15:00:54 bmr2-s9 kernel: LustreError: 5512:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount  (-2)

1c) Here's an excerpt from "/var/log/messages" on the server (after executing the above command):

Feb 12 15:00:54 bmr1-s7 kernel: Lustre: 25911:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection from 2e13dea0-ec9c-0fbd-0f95-7b16246f2626@192.168.1.209@o2ib t0 exp 0000000000000000 cur 1360699254 last 0
Feb 12 15:00:54 bmr1-s7 kernel: Lustre: 25911:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGS->NET_0x50000c0a801d1_UUID netid 50000: select flavor null

2a) Here's what I see on the client when I try to mount "/lustre" (using the other interface):

[root@bmr2-s9 ~]# mount -t lustre 192.168.1.25@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.25@o2ib:/lustre at /mnt/lustre failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
[root@bmr2-s9 ~]# mount -t lustre 192.168.1.35@o2ib:/lustre /mnt/lustre
mount.lustre: mount 192.168.1.35@o2ib:/lustre at /mnt/lustre failed: Invalid argument
This may have multiple causes.
Is 'lustre' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.
[root@bmr2-s9 ~]#

2b) Here's an excerpt from "/var/log/messages" on the client (after executing the above command):

Feb 12 15:06:57 bmr2-s9 kernel: Lustre: 5580:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGC192.168.1.35@o2ib->MGC192.168.1.35@o2ib_0 netid 50000: select flavor null
Feb 12 15:06:57 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721863 sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to network error: [sent 1360699617] [real_sent 1360699617] [current 1360699617] [deadline 5s] [delay -5s]  req@ffff81043b76e400 x1426793186721863/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699622 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:06:57 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Feb 12 15:06:57 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected: o2iblnd fatal error
Feb 12 15:06:57 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) Skipped 1 previous similar message
Feb 12 15:07:03 bmr2-s9 kernel: LustreError: 5580:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff81043b76e000 x1426793186721864/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:03 bmr2-s9 kernel: LustreError: 5580:0:(client.c:1049:ptlrpc_import_delay_req()) Skipped 6 previous similar messages
Feb 12 15:07:22 bmr2-s9 kernel: Lustre: 4949:0:(import.c:526:import_select_connection()) MGC192.168.1.35@o2ib: tried all connections, increasing latency to 5s
Feb 12 15:07:22 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721868 sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to network error: [sent 1360699642] [real_sent 1360699642] [current 1360699642] [deadline 10s] [delay -10s]  req@ffff810430e30800 x1426793186721868/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699652 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:22 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected: o2iblnd fatal error
Feb 12 15:07:24 bmr2-s9 kernel: LustreError: 5591:0:(client.c:1049:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff81045d7ce800 x1426793186721867/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 296/352 e 0 to 0 dl 0 ref 2 fl Rpc:W/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:24 bmr2-s9 kernel: LustreError: 5591:0:(client.c:1049:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Feb 12 15:07:47 bmr2-s9 kernel: Lustre: 4949:0:(import.c:526:import_select_connection()) MGC192.168.1.35@o2ib: tried all connections, increasing latency to 10s
Feb 12 15:07:47 bmr2-s9 kernel: Lustre: 4948:0:(client.c:1778:ptlrpc_expire_one_request()) @@@ Request x1426793186721872 sent from MGC192.168.1.35@o2ib to NID 192.168.1.35@o2ib has failed due to network error: [sent 1360699667] [real_sent 1360699667] [current 1360699667] [deadline 15s] [delay -15s]  req@ffff810444576c00 x1426793186721872/t0(0) o-1->MGS@MGC192.168.1.35@o2ib_0:26/25 lens 368/512 e 0 to 1 dl 1360699682 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
Feb 12 15:07:47 bmr2-s9 kernel: LustreError: 3074:0:(o2iblnd_cb.c:2615:kiblnd_rejected()) 192.168.1.35@o2ib rejected: o2iblnd fatal error
Feb 12 15:07:54 bmr2-s9 kernel: LustreError: 156-2: The client profile 'lustre-client' could not be read from the MGS.  Does that filesystem exist?
Feb 12 15:07:54 bmr2-s9 kernel: Lustre: client ffff81045f465800 umount complete
Feb 12 15:07:54 bmr2-s9 kernel: LustreError: 5580:0:(obd_mount.c:2164:lustre_fill_super()) Unable to mount  (-22)

2c) Here's an excerpt from "/var/log/messages" on the server (after executing the above command):

Feb 12 15:06:57 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Can't accept 192.168.1.209@o2ib on 192.168.1.25@o2ib (ib1:1:192.168.1.35): bad dst nid 192.168.1.35@o2ib
Feb 12 15:06:57 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Skipped 2 previous similar messages
Feb 12 15:07:22 bmr1-s7 kernel: LustreError: 9274:0:(o2iblnd_cb.c:2247:kiblnd_passive_connect()) Can't accept 192.168.1.209@o2ib on 192.168.1.25@o2ib (ib1:1:192.168.1.35): bad dst nid 192.168.1.35@o2ib

3) Here's what one of the MDTs looks like (the other is similarly configured):

[root@bmr1-s7 ~]# tunefs.lustre --dryrun --writeconf /dev/sdp
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     lustre-MDT0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x5
              (MDT MGS )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1


   Permanent disk data:
Target:     lustre-MDT0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x105
              (MDT MGS writeconf )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1

exiting before disk write.
[root@bmr1-s7 ~]#

4) Here's what one of the OSTs looks like (the others are similarly configured):

[root@bmr1-s7 ~]# tunefs.lustre --dryrun --writeconf /dev/sdf
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     lustre-OST0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
              (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1 network=o2ib0


   Permanent disk data:
Target:     lustre-OST0000
Index:      0
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x102
              (OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.1.25@o2ib,192.168.1.35@o2ib1,10.244.78.88@tcp,192.168.1.25@tcp1 network=o2ib0

exiting before disk write.
[root@bmr1-s7 ~]#

I'd appreciate any help or direction on a potential resolution.  Let me know what additional information is needed, if any.  Hopefully, I'm just missing something simple.

Thanks in advance,
...Brian

