[Lustre-discuss] Fw: Re: Unable to activate OST

Dusty Marks dustynmarks at gmail.com
Sat Jan 16 11:02:06 PST 2010


Got it working. The firewall was blocking lustre traffic. :( After disabling
it, it works.

Thanks all for the help!
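
For the archives: the "Error -113" in the log quoted below is a negated
Linux errno, EHOSTUNREACH ("No route to host"), and the "-5" is EIO. A
firewall that rejects connections can cause that same "unreachable"
message even though the network settings are fine. The sketch below is
only an illustration, not part of Lustre: both function names are made up,
and the printed rule assumes an iptables firewall and the acceptor port
988 shown in the log. It opens the port instead of disabling the firewall.

```shell
#!/bin/sh
# Hypothetical helpers (not Lustre tools): decode the negated errno values
# seen in the log, and print a firewall rule for review.

# Map an errno such as "-113" or "113" to its Linux symbolic name.
lustre_errno_name() {
    case "${1#-}" in
        5)   echo "EIO" ;;            # Input/output error
        110) echo "ETIMEDOUT" ;;      # Connection timed out
        111) echo "ECONNREFUSED" ;;   # Connection refused
        113) echo "EHOSTUNREACH" ;;   # No route to host
        *)   echo "UNKNOWN" ;;
    esac
}

# Print (do not execute) an iptables rule that accepts inbound TCP on the
# LNET acceptor port. The default of 988 is taken from the log; pass another
# port as the first argument if your setup differs.
lnet_accept_rule() {
    port="${1:-988}"
    echo "iptables -I INPUT -p tcp --dport ${port} -j ACCEPT"
}
```

Running "lustre_errno_name -113" prints EHOSTUNREACH. The output of
"lnet_accept_rule" is meant to be read and adapted before running it as
root on the MGS/MDS node, which is the node the OSS could not reach.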

On Sat, Jan 16, 2010 at 9:57 AM, Dusty Marks <dustynmarks at gmail.com> wrote:

> I've posted my /var/log/messages here before, but here it is again:
>
>
> --------------------------------------- /var/log/messages
> -----------------------------------------------------------
> Jan 14 22:41:05 oss kernel: Lustre: OBD class driver,
> http://www.lustre.org/
> Jan 14 22:41:05 oss kernel: Lustre:     Lustre Version: 1.8.1.1
> Jan 14 22:41:05 oss kernel: Lustre:     Build Version:
> 1.8.1.1-20091009075116-PRISTINE-2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007
> Jan 14 22:41:06 oss kernel: Lustre: Added LNI 192.168.0.3@tcp [8/256/0/0]
> Jan 14 22:41:06 oss kernel: Lustre: Accept secure, port 988
> Jan 14 22:41:06 oss kernel: Lustre: Lustre Client File System;
> http://www.lustre.org/
> Jan 14 22:41:07 oss kernel: kjournald starting.  Commit interval 5 seconds
> Jan 14 22:41:07 oss kernel: LDISKFS FS on dm-2, internal journal
> Jan 14 22:41:07 oss kernel: LDISKFS-fs: mounted filesystem with ordered
> data mode.
> Jan 14 22:41:07 oss kernel: kjournald starting.  Commit interval 5 seconds
> Jan 14 22:41:07 oss kernel: LDISKFS FS on dm-2, internal journal
> Jan 14 22:41:07 oss kernel: LDISKFS-fs: mounted filesystem with ordered
> data mode.
> Jan 14 22:41:07 oss kernel: LDISKFS-fs: file extents enabled
> Jan 14 22:41:07 oss kernel: LDISKFS-fs: mballoc enabled
>
> Jan 14 22:41:07 oss kernel: Lustre:
> 2846:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting
> 0.0.0.0/1023 -> 192.168.0.2/988
> Jan 14 22:41:07 oss kernel: Lustre:
> 2846:0:(acceptor.c:95:lnet_connect_console_error()) Connection to
> 192.168.0.2@tcp at host 192.168.0.2 was unreachable: the network or that
> node may be down, or Lustre may be misconfigured.
> Jan 14 22:41:07 oss kernel: Lustre:
> 2846:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len
> 368 192.168.0.3@tcp->192.168.0.2@tcp
> Jan 14 22:41:12 oss kernel: Lustre:
> 2853:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request
> x1324907721916417 sent from MGC192.168.0.2@tcp to NID 192.168.0.2@tcp 5s
> ago has timed out (limit 5s).
> Jan 14 22:41:12 oss kernel:   req@f5d7fe00 x1324907721916417/t0
> o250->MGS@MGC192.168.0.2@tcp_0:26/25 lens 368/584 e 0 to 1 dl 1263530472
> ref 1 fl Rpc:N/0/0 rc 0/0
> Jan 14 22:41:12 oss kernel: LustreError:
> 2819:0:(obd_mount.c:1085:server_start_targets()) Required registration
> failed for datafs-OSTffff: -5
> Jan 14 22:41:12 oss kernel: LustreError: 15f-b: Communication error with
> the MGS.  Is the MGS running?
> Jan 14 22:41:12 oss kernel: LustreError:
> 2819:0:(obd_mount.c:1629:server_fill_super()) Unable to start targets: -5
> Jan 14 22:41:12 oss kernel: LustreError:
> 2819:0:(obd_mount.c:1412:server_put_super()) no obd datafs-OSTffff
> Jan 14 22:41:12 oss kernel: LustreError:
> 2819:0:(obd_mount.c:136:server_deregister_mount()) datafs-OSTffff not
> registered
> Jan 14 22:41:12 oss kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0
> success)
> Jan 14 22:41:12 oss kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal
> hits, 0 2^N hits, 0 breaks, 0 lost
> Jan 14 22:41:12 oss kernel: LDISKFS-fs: mballoc: 0 generated and it took 0
> Jan 14 22:41:12 oss kernel: LDISKFS-fs: mballoc: 0 preallocated, 0
> discarded
> Jan 14 22:41:12 oss kernel: Lustre: server umount datafs-OSTffff complete
> Jan 14 22:41:12 oss kernel: LustreError:
> 2819:0:(obd_mount.c:1997:lustre_fill_super()) Unable to mount  (-5)
>
>
> On Sat, Jan 16, 2010 at 5:14 AM, Christopher J. Walker <
> C.J.Walker at qmul.ac.uk> wrote:
>
>> Wojciech Turek wrote:
>>
>>> Hi,
>>>
>>> Could you please post output of the 'lctl list_nids' command on OSS
>>> system and on MDS system. This will show us which network was
>>> configured to work with lustre.
>>>
>>> Regarding entries in the modprobe.conf, they tell lnet module which
>>> NIC or multiple NICs will be configured to work with lustre.
>>>
>>
>> There's a gotcha here which I've been meaning to write up. We have a 10Gig
>> card as eth2 assigned a different IP address on the same subnet as eth0, a
>> 1Gig card. Whilst lustre correctly bound to the ip address of eth2, the
>> kernel decided (correctly) it could route packets via eth0. This worked, but
>> gave poor performance (partly due to a bottleneck on that part of the
>> network). The solution was to ensure that packets from eth2's IP address
>> were routed out of eth2.
>>
>> Chris
>>
>>
>>  If your
>>> modprobe.conf doesn't have lnet options line,  by default Lustre will
>>> configure the first NIC which is usually eth0.
>>> Below is a modprobe.conf entry from my lustre setup.
>>> My OSS(s) and MDS(s) have 2 NICs eth0 and eth1 and an Infiniband NIC
>>> ib0. The IB is set to work as IPoIB so lustre treats it as an ordinary
>>> Ethernet NIC
>>> options lnet networks=tcp0(ib0),tcp1(eth1),tcp2(eth1:0)
>>> So the line above means that:
>>>   first lustre network tcp0 is configured on interface ib0
>>>   second lustre network tcp1 is configured on interface eth1
>>>   third lustre network tcp2 is configured on alias interface eth1:0
>>>
>>> eth0 is not mentioned on this line because I have chosen not to
>>> configure it to work with lustre.
>>>
>>>
>>> Once lnet module is loaded you can check which network or networks are
>>> configured to work with Lustre using 'lctl list_nids' command
>>>
>>> Cheers
>>>
>>> Wojciech
>>> 2010/1/15 Dusty Marks <dustynmarks at gmail.com>:
>>>
>>>> I did some googling and I found the command lctl ping. So I went on
>>>> the oss and typed in "lctl ping 192.168.0.2@tcp". This errored out
>>>> with an I/O error.
>>>>
>>>> It is quite obvious that i've simply misconfigured the network. Could
>>>> someone explain how to properly configure it?
>>>>
>>>> I don't understand what the entry in modprobe.conf actually means, so
>>>> I cannot say what should be entered.
>>>>
>>>> Each one of my machines has one NIC (eth0). What do I enter in
>>>> modprobe.conf to make this work correctly? If I update the entry in
>>>> modprobe.conf, do I have to redo anything, or does Lustre pick up the
>>>> changes without restarting anything?
>>>>
>>>> Thanks all for the help so far.
>>>>
>>>> - Dusty
>>>>
>>>> On Fri, Jan 15, 2010 at 10:36 AM, Dusty Marks <dustynmarks at gmail.com>
>>>> wrote:
>>>>
>>>>> I searched through the manual, and the only section I could find
>>>>> dealing with networking configuration is section 4.1.0.2, titled
>>>>> "Module Setup", in the Lustre 1.8 operations manual.
>>>>>
>>>>> It tells me to run the command modprobe -v lustre "networks=tcp0(eth0)",
>>>>> and I did so on the MDS; however, it errored out with:
>>>>>
>>>>> [root@mds ~]# modprobe -v lustre "networks=tcp0(eth0)"
>>>>> insmod
>>>>>
>>>>> /lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007/kernel/fs/lustre/lustre.ko
>>>>> networks=tcp0(eth0)
>>>>> FATAL: Error inserting lustre
>>>>>
>>>>> (/lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007/kernel/fs/lustre/lustre.ko):
>>>>> Unknown symbol in module, or unknown parameter (see dmesg)
>>>>>
>>>>> dmesg says nothing, but /var/log/messages says this:
>>>>> Jan 15 10:27:48 mds kernel: lustre: Unknown parameter `networks'
>>>>>
>>>>> I even tried adding "options lnet networks=tcp0(eth0)"; however, that
>>>>> didn't work either.
>>>>>
>>>>> I'm terribly sorry for my incompetence, but I'm having a difficult time
>>>>> understanding Lustre's abstractions.
>>>>>
>>>>> Each one of my nodes has a single Ethernet card (eth0).
>>>>>
>>>>>
>>>>> On Thu, Jan 14, 2010 at 11:32 PM, Andreas Dilger <adilger at sun.com>
>>>>> wrote:
>>>>>
>>>>>> On 2010-01-15, at 00:21, Arden Wiebe wrote:
>>>>>>
>>>>>>> Your mount command is wrong - try this format.
>>>>>>>
>>>>>>> mount -t lustre 192.168.0.7@tcp0:/ioio /mnt/ioio
>>>>>>>
>>>>>>> So, substituting the values you supplied, your mount line should
>>>>>>> read:
>>>>>>>
>>>>>>> mount -t lustre 192.168.0.2@tcp0:/datafs /mnt/datafs
>>>>>>>
>>>>>> No, that isn't correct.  You are showing the mount command for a
>>>>>> client.  It is the OST that is failing to mount, likely because
>>>>>> the network is not configured correctly, and the OST needs to
>>>>>> contact the MGS node always on the first mount in order to join
>>>>>> the filesystem.
>>>>>>
>>>>>>  Enjoy the required reading and testing.  I found by
>>>>>>> naming things uniquely helped me clarify what was actually
>>>>>>> required.  Try calling your filesystem "Dusty" or
>>>>>>> "Mark" and that should make things clearer for you.
>>>>>>>
>>>>>>> --- On Thu, 1/14/10, Andreas Dilger <adilger at sun.com> wrote:
>>>>>>>
>>>>>>>> On 2010-01-14, at 23:51, Dusty Marks wrote:
>>>>>>>>
>>>>>>>>> You are correct, there is information in messages.  Following are
>>>>>>>>> the entries related to Lustre. The line that says 192.168.0.2@tcp is
>>>>>>>>> unreachable makes sense, but what exactly is the problem? I entered
>>>>>>>>> the line "options lnet networks=tcp" in modprobe.conf on the oss
>>>>>>>>> and
>>>>>>>>> mds. The only difference was, i entered that line AFTER i setup
>>>>>>>>> lustre on the OSS. Could that be the problem? I don't see why that
>>>>>>>>> would be the problem, as the oss is trying to reach the MDS/MGS,
>>>>>>>>> which is 192.168.0.2.
>>>>>>>>>
>>>>>>>>> ---------------------------------------
>>>>>>>>> Jan 14 22:41:07 oss kernel: Lustre:
>>>>>>>>> 2846:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113
>>>>>>>>> connecting 0.0.0.0/1023 -> 192.168.0.2/988
>>>>>>>>> Jan 14 22:41:07 oss kernel: Lustre:
>>>>>>>>> 2846:0:(acceptor.c:95:lnet_connect_console_error()) Connection to
>>>>>>>>> 192.168.0.2@tcp at host 192.168.0.2 was unreachable: the network or
>>>>>>>>> that node may be down, or Lustre may be misconfigured.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Please read the chapter in the manual about network configuration.
>>>>>>>> I suspect the .0.2 network is not your eth0 network interface, and
>>>>>>>> your modprobe.conf needs to be fixed.
>>>>>>>>
>>>>>>>
>>>>>> Cheers, Andreas
>>>>>> --
>>>>>> Andreas Dilger
>>>>>> Sr. Staff Engineer, Lustre Group
>>>>>> Sun Microsystems of Canada, Inc.
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> The graduate with a Science degree asks, "Why does it work?" The
>>>>> graduate
>>>>> with an Engineering degree asks, "How does it work?" The graduate with
>>>>> an
>>>>> Accounting degree asks, "How much will it cost?" The graduate with an
>>>>> Arts
>>>>> degree asks, "Do you want fries with that?"
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>



-- 
The graduate with a Science degree asks, "Why does it work?" The graduate
with an Engineering degree asks, "How does it work?" The graduate with an
Accounting degree asks, "How much will it cost?" The graduate with an Arts
degree asks, "Do you want fries with that?"