[Lustre-discuss] OSS misconfig and client connect

James Robnett jrobnett at aoc.nrao.edu
Wed Jul 31 09:14:57 PDT 2013


A bit more information.

The clients panicked when the first OST on the new OSS, OST0028,
was added.

They now complain about reaching OST0028 when remounting Lustre.

You can see in the logs below the OSS still thinks recovery for this
client should be done over IB.  Specifically the line:

Jul 31 10:08:20 apathy kernel: Lustre:    cmd=cf003 0:lustre-OST0028-osc  1:lustre-OST0028_UUID  2:<ibaddr>@o2ib

That's the IB interface for that OST.  I suspect I have to unmount
that OST, clear its logs and remount.  Unfortunately I only have
the most basic understanding of that procedure.

If that's the right procedure and somebody has the proper syntax
I'm all ears.
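
A rough sketch for listing which NID the client config log hands out
for each target, to see what is still mapped to o2ib only.  It assumes
each syslog message sits on one line (not wrapped as in this mail) and
that the field layout is the 0:/1:/2: form seen in the line above.  The
function name list_cfg_nids is made up for illustration; it is not a
Lustre tool.

```shell
#!/bin/sh
# Hypothetical helper: print "<target UUID> <NID>" for every
# "cmd=cf003" config line found in the given files (or on stdin).
# Assumption: field layout is 0:<osc name> 1:<target UUID> 2:<NID>,
# inferred only from the single log line quoted in this thread.
list_cfg_nids() {
    sed -n 's/.*cmd=cf003 *0:\([^ ]*\) *1:\([^ ]*\) *2:\([^ ]*\).*/\2 \3/p' "$@"
}
```

Something like `list_cfg_nids /var/log/messages` on a failing client
would then show one line per target, with the log path being whatever
the local syslog setup uses.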

James


Jul 31 10:08:20 apathy kernel: Lustre: MGC10.64.1.161@tcp: Reactivating import
Jul 31 10:08:20 apathy kernel: LustreError: 5023:0:(ldlm_lib.c:331:client_obd_setup()) can't add initial connection
Jul 31 10:08:20 apathy kernel: LustreError: 5023:0:(obd_config.c:372:class_setup()) setup lustre-OST0028-osc-ffff88022dcc4000 failed (-2)
Jul 31 10:08:20 apathy kernel: LustreError: 5023:0:(obd_config.c:1199:class_config_llog_handler()) Err -2 on cfg command:
Jul 31 10:08:20 apathy kernel: Lustre:    cmd=cf003 0:lustre-OST0028-osc  1:lustre-OST0028_UUID  2:<ibaddr>@o2ib
Jul 31 10:08:20 apathy kernel: LustreError: 15c-8: MGC<gbitaddr>@tcp: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(llite_lib.c:1095:ll_fill_super()) Unable to process log: -2
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(lov_obd.c:1009:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, lovrc=1
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(lov_obd.c:1009:lov_cleanup()) Skipped 39 previous similar messages
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(mdc_request.c:1498:mdc_precleanup()) client import never connected
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(obd_config.c:443:class_cleanup()) Device 43 not setup
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Jul 31 10:08:20 apathy kernel: Lustre: client lustre-client(ffff88022dcc4000) umount complete
Jul 31 10:08:20 apathy kernel: LustreError: 5013:0:(obd_mount.c:2065:lustre_fill_super()) Unable to mount  (-2)
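
The return codes in the log above read like negated Linux errno
values: -2 would be ENOENT, which lines up with the "No such file or
directory" that mount.lustre prints, and -108 would be ESHUTDOWN.  A
small lookup sketch covering only the two codes seen here; the
function name lustre_rc_name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map the return codes seen in this log to their
# Linux errno names. Only the two codes from this thread are covered.
lustre_rc_name() {
    case "$1" in
        -2)   echo "ENOENT (No such file or directory)" ;;
        -108) echo "ESHUTDOWN (Cannot send after transport endpoint shutdown)" ;;
        *)    echo "not covered by this sketch" ;;
    esac
}
```

For example, `lustre_rc_name -108` names the code reported by the
cancel RPC during the failed mount's cleanup.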

On 07/31/2013 09:55 AM, James Robnett wrote:
>
> We're running Lustre 1.8.7 on clients and servers.
>
> We recently added an 11th OSS with 4 OSTs to our Lustre filesystem.
> Unfortunately its modprobe.conf LNET line only listed an o2ib0(ib0)
> entry left over from testing.  Normally the line would look like:
>
> options lnet networks="o2ib0(ib0),tcp0(eth0),tcp1(eth2)"
>
> for IB, Gbit and 10Gbit respectively.
>
> As soon as the new OSTs on the 11th OSS were mounted and activated,
> our 1gbit and 10gbit clients kernel panicked; IB clients were fine.
> The 1gbit and 10gbit clients refused to mount Lustre after that
> since they couldn't reach the OSS.
>
> I unmounted the OSTs on that OSS, fixed the modprobe.conf line,
> rebooted, and ran
>
> tunefs.lustre --erase-param
> --mgsnode=<ibaddr>@o2ib0,<gbitaddr>@tcp0,<10gbitaddr>@tcp1 --writeconf
> /dev/sd{b,c,d,e}
>
> Where <xxxaddr> is the appropriate IP address.
>
> That seemed to complete without issue and tunefs reports:
>
> Parameters:
> mgsnode=<ibaddr>@o2ib0,<gbitaddr>@tcp0,<10gbitaddr>@tcp1
>
> as expected.
>
> Unfortunately 1gbit and 10gbit clients still refuse to mount lustre.
>
> mount.lustre: mount <ipaddr>@tcp0:/lustre at /.lustre/mountpoint failed:
> No such file or directory
> Is the MGS specification correct?
> Is the filesystem name correct?
> If upgrading, is the copied client log valid? (see upgrade docs)
>
> The OSS can ping clients on the 1gbit and 10gbit networks, so routing
> and networking are fine.
>
> I'm sure I'm simply panicked and missing something obvious.  What
> is the proper procedure to fix this mess?  I thought the tunefs.lustre
> run would do it, but it has not.
>
> James Robnett
> NRAO/AOC
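
The modprobe.conf slip described in the quoted message could be caught
before mounting any OSTs with a check along these lines.  This is only
a sketch: both function names are hypothetical, and it assumes the
"options lnet networks=..." syntax shown in the quoted message, all on
one line.

```shell
#!/bin/sh
# Hypothetical helper: print one "net(interface)" entry per line from
# the "options lnet networks=..." line of a modprobe config file.
lnet_networks() {
    sed -n 's/^options lnet .*networks="*\([^" ]*\).*/\1/p' "$1" | tr ',' '\n'
}

# Hypothetical helper: report each expected LNET network that the
# config file does not list. Usage: missing_lnet_networks FILE NET...
missing_lnet_networks() {
    conf=$1
    shift
    for net in "$@"; do
        lnet_networks "$conf" | grep -q "^$net(" || echo "missing: $net"
    done
}
```

Running `missing_lnet_networks /etc/modprobe.conf o2ib0 tcp0 tcp1` on
the new OSS prints nothing when all three networks are listed.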


