[Lustre-discuss] Fw: Re: Unable to activate OST

Wojciech Turek wjt27 at cam.ac.uk
Fri Jan 15 17:19:33 PST 2010


Could you also post the syslog messages from the OSS here?

2010/1/16 Wojciech Turek <wjt27 at cam.ac.uk>:
> Can you check whether you can ping the MDS and the OSS using the normal ping command?
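A plain ping only shows that the host answers ICMP. The log further down this thread shows LNet connecting to TCP port 988 on the MDS/MGS, so it can also help to test that port directly. A minimal sketch, assuming Python is available on the OSS; the function name can_connect is made up for illustration and is not part of Lustre:

```python
import socket


def can_connect(host, port=988, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and "no route to host".
        return False


# Example call (192.168.0.2 is the MDS/MGS address quoted in this thread):
#   can_connect("192.168.0.2")
```

If this returns False while a normal ping works, a firewall or a wrong interface choice is a more likely cause than a dead link.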
>
>
> 2010/1/16 Dusty Marks <dustynmarks at gmail.com>:
>> The output of lctl list_nids on the OSS is
>>
>> [root@oss ~]# lctl list_nids
>> 192.168.0.3@tcp
>>
>> and from the MDS
>>
>> [root@mds ~]# lctl list_nids
>> 192.168.0.2@tcp
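Each line of list_nids output is a NID of the form address@network. To compare the two nodes, one could split the NIDs and check that both sit on the same LNet network. This is only a sketch; parse_nid and same_network are hypothetical helper names, not Lustre tools:

```python
def parse_nid(nid):
    """Split an LNet NID such as '192.168.0.2@tcp' into (address, network)."""
    # Assumption: also accept the ' at ' spelling that list archives
    # substitute for '@'.
    address, sep, network = nid.strip().replace(" at ", "@").partition("@")
    if not sep or not address or not network:
        raise ValueError(f"not a NID: {nid!r}")
    # A network name without a number means network 0, so 'tcp' is 'tcp0'.
    if not network[-1].isdigit():
        network += "0"
    return address, network


def same_network(nid_a, nid_b):
    """Return True if both NIDs belong to the same LNet network."""
    return parse_nid(nid_a)[1] == parse_nid(nid_b)[1]
```

For the two NIDs above, both parse to network tcp0, so the LNet network names agree and the problem is more likely at the IP or interface level.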
>>
>> Thanks,
>> Dusty
>>
>> On Fri, Jan 15, 2010 at 5:39 PM, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
>>>
>>> Hi,
>>>
>>> Could you please post the output of the 'lctl list_nids' command on the
>>> OSS and on the MDS? This will show us which network is configured to
>>> work with Lustre.
>>>
>>> Regarding the entries in modprobe.conf, they tell the lnet module which
>>> NIC or NICs to configure for Lustre. If your modprobe.conf doesn't have
>>> an lnet options line, by default Lustre will configure the first NIC,
>>> which is usually eth0.
>>> Below is a modprobe.conf entry from my Lustre setup.
>>> My OSS(s) and MDS(s) have 2 NICs, eth0 and eth1, and an InfiniBand NIC,
>>> ib0. The IB is set to work as IPoIB, so Lustre treats it as an ordinary
>>> Ethernet NIC.
>>> options lnet networks=tcp0(ib0),tcp1(eth1),tcp2(eth1:0)
>>> So the line above means that:
>>>   first Lustre network tcp0 is configured on interface ib0
>>>   second Lustre network tcp1 is configured on interface eth1
>>>   third Lustre network tcp2 is configured on alias interface eth1:0
>>>
>>> eth0 is not mentioned on this line because I have chosen not to
>>> configure it to work with lustre.
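To make the network-to-interface mapping explicit, here is a small sketch that reads an "options lnet networks=..." line and returns which interfaces belong to which LNet network. The function name parse_lnet_networks is made up for illustration, and the parser only handles the simple network(interface,...) syntax shown above, not every form the lnet module accepts:

```python
import re


def parse_lnet_networks(line):
    """Parse an 'options lnet networks=...' line into {network: [interfaces]}."""
    match = re.search(r"networks=(\S+)", line)
    if match is None:
        return {}
    value = match.group(1).strip("\"'")
    mapping = {}
    # Each entry looks like tcp0(ib0); the parenthesised interface list is
    # optional, in which case the list of interfaces is left empty.
    for net, ifaces in re.findall(r"([A-Za-z0-9]+)(?:\(([^)]*)\))?", value):
        mapping[net] = [name for name in ifaces.split(",") if name]
    return mapping


entry = "options lnet networks=tcp0(ib0),tcp1(eth1),tcp2(eth1:0)"
print(parse_lnet_networks(entry))
```

For a node with a single NIC, the same reading applied to "options lnet networks=tcp0(eth0)" gives one network, tcp0, on eth0.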
>>>
>>>
>>> Once the lnet module is loaded, you can check which network or networks
>>> are configured to work with Lustre using the 'lctl list_nids' command.
>>>
>>> Cheers
>>>
>>> Wojciech
>>> 2010/1/15 Dusty Marks <dustynmarks at gmail.com>:
>>> > I did some googling and found the command lctl ping. So I went on the
>>> > OSS and typed in "lctl ping 192.168.0.2@tcp". This errored out with an
>>> > I/O error.
>>> >
>>> > It is quite obvious that I've simply misconfigured the network. Could
>>> > someone explain how to properly configure it?
>>> >
>>> > I don't understand what the entry in modprobe.conf actually means, so I
>>> > cannot say what should be entered.
>>> >
>>> > Each one of my machines has one NIC (eth0). What do I enter in
>>> > modprobe.conf to make this work correctly? If I update the entry in
>>> > modprobe.conf, do I have to redo anything, or does Lustre pick up the
>>> > changes without restarting anything?
>>> >
>>> > Thanks all for the help so far.
>>> >
>>> > - Dusty
>>> >
>>> > On Fri, Jan 15, 2010 at 10:36 AM, Dusty Marks <dustynmarks at gmail.com>
>>> > wrote:
>>> >>
>>> >> I searched through the manual, and the only section I could find
>>> >> dealing with networking configuration is section 4.1.0.2, titled
>>> >> "Module Setup", in the Lustre 1.8 operations manual.
>>> >>
>>> >> It tells me to run the command modprobe -v lustre "networks=tcp0(eth0)",
>>> >> and I did so on the MDS; however, it errored out with:
>>> >>
>>> >> [root@mds ~]# modprobe -v lustre "networks=tcp0(eth0)"
>>> >> insmod /lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007/kernel/fs/lustre/lustre.ko networks=tcp0(eth0)
>>> >> FATAL: Error inserting lustre (/lib/modules/2.6.18-128.7.1.el5_lustre.1.8.1.1.20091003130007/kernel/fs/lustre/lustre.ko): Unknown symbol in module, or unknown parameter (see dmesg)
>>> >>
>>> >> dmesg says nothing, but messages says this:
>>> >> Jan 15 10:27:48 mds kernel: lustre: Unknown parameter `networks'
>>> >>
>>> >> I even tried adding "options lnet networks=tcp0(eth0)"; however, that
>>> >> didn't work either.
>>> >>
>>> >> I'm terribly sorry for my incompetence, but I'm having a difficult time
>>> >> understanding Lustre's abstractions.
>>> >>
>>> >> Each one of my nodes has a single Ethernet card (eth0).
>>> >>
>>> >>
>>> >> On Thu, Jan 14, 2010 at 11:32 PM, Andreas Dilger <adilger at sun.com>
>>> >> wrote:
>>> >>>
>>> >>> On 2010-01-15, at 00:21, Arden Wiebe wrote:
>>> >>>>
>>> >>>> Your mount command is wrong - try this format.
>>> >>>>
>>> >>>> mount -t lustre 192.168.0.7@tcp0:/ioio /mnt/ioio
>>> >>>>
>>> >>>> So, substituting your values, your mount line should
>>> >>>> read:
>>> >>>>
>>> >>>> mount -t lustre 192.168.0.2@tcp0:/datafs /mnt/datafs
>>> >>>
>>> >>> No, that isn't correct.  You are showing the mount command for a
>>> >>> client.  It is the OST that is failing to mount, likely because
>>> >>> the network is not configured correctly, and the OST always needs
>>> >>> to contact the MGS node on its first mount in order to join the
>>> >>> filesystem.
>>> >>>
>>> >>>> Enjoy the required reading and testing.  I found that
>>> >>>> naming things uniquely helped me clarify what was actually
>>> >>>> required.  Try calling your filesystem "Dusty" or
>>> >>>> "Mark" and that should make things clearer for you.
>>> >>>>
>>> >>>> --- On Thu, 1/14/10, Andreas Dilger <adilger at sun.com> wrote:
>>> >>>>>
>>> >>>>> On 2010-01-14, at 23:51, Dusty Marks wrote:
>>> >>>>>>
>>> >>>>>> You are correct, there is information in messages. Following are
>>> >>>>>> the entries related to Lustre. The line that says 192.168.0.2@tcp
>>> >>>>>> is unreachable makes sense, but what exactly is the problem? I
>>> >>>>>> entered the line "options lnet networks=tcp" in modprobe.conf on
>>> >>>>>> the OSS and MDS. The only difference was, I entered that line AFTER
>>> >>>>>> I set up Lustre on the OSS. Could that be the problem? I don't see
>>> >>>>>> why that would be the problem, as the OSS is trying to reach the
>>> >>>>>> MDS/MGS, which is 192.168.0.2.
>>> >>>>>>
>>> >>>>>> ---------------------------------------
>>> >>>>>> Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 0.0.0.0/1023 -> 192.168.0.2/988
>>> >>>>>> Jan 14 22:41:07 oss kernel: Lustre: 2846:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 192.168.0.2@tcp at host 192.168.0.2 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
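The number in "Error -113" is a negated kernel errno; on Linux, 113 is EHOSTUNREACH ("No route to host"). A rough sketch for pulling the error code and the target address out of such a line follows. The function name explain_connect_error is invented for this example, the regular expression is only matched against the message format shown above, and it expects the log entry on a single line:

```python
import errno
import os
import re

# Matches e.g. "Error -113 connecting 0.0.0.0/1023 -> 192.168.0.2/988"
CONNECT_RE = re.compile(r"Error (-?\d+) connecting (\S+)/(\d+) -> (\S+)/(\d+)")


def explain_connect_error(line):
    """Return the errno and peer from a libcfs_sock_connect() log line, or None."""
    match = CONNECT_RE.search(line)
    if match is None:
        return None
    code = abs(int(match.group(1)))  # the kernel reports errno values negated
    return {
        "errno": code,
        # Name and text come from the local platform's errno table, so they
        # are only meaningful when this runs on Linux like the logging node.
        "name": errno.errorcode.get(code, "unknown"),
        "meaning": os.strerror(code),
        "peer": match.group(4),
        "port": int(match.group(5)),
    }
```

An unreachable-host error points at IP routing or interface selection on the OSS rather than at the Lustre targets themselves.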
>>> >>>>>
>>> >>>>>
>>> >>>>> Please read the chapter in the manual about network configuration.
>>> >>>>> I suspect the .0.2 network is not your eth0 network interface, and
>>> >>>>> your modprobe.conf needs to be fixed.
>>> >>>
>>> >>>
>>> >>> Cheers, Andreas
>>> >>> --
>>> >>> Andreas Dilger
>>> >>> Sr. Staff Engineer, Lustre Group
>>> >>> Sun Microsystems of Canada, Inc.
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> The graduate with a Science degree asks, "Why does it work?" The
>>> >> graduate
>>> >> with an Engineering degree asks, "How does it work?" The graduate with
>>> >> an
>>> >> Accounting degree asks, "How much will it cost?" The graduate with an
>>> >> Arts
>>> >> degree asks, "Do you want fries with that?"
>>> >
>>>
>>>
>>>
>>> --
>>> --
>>> Wojciech Turek
>>>
>>> Assistant System Manager
>>>
>>> High Performance Computing Service
>>> University of Cambridge
>>> Email: wjt27 at cam.ac.uk
>>> Tel: (+)44 1223 763517
>>
>>
>
>





