[Lustre-discuss] lustre OSS IP change

Wojciech Turek wjt27 at cam.ac.uk
Thu Jan 13 10:40:58 PST 2011


The kernel panics were due to bugs in the Lustre version you are running. If you
want to avoid this sort of trouble in the future and keep your filesystem
stable, you should upgrade to 1.8.5.
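Before planning an upgrade, it helps to confirm exactly which Lustre release is baked into the running kernel. A minimal sketch, parsing the kernel release string quoted later in this thread (the string itself is taken from the `uname -a` output below; adjust the pattern for differently named builds):

```shell
# Extract the Lustre version embedded in an el5_lustre kernel release name.
kernel="2.6.18-53.1.14.el5_lustre.1.6.5.1smp"
lustre_ver=$(echo "$kernel" | sed -n 's/.*_lustre\.\([0-9.]*[0-9]\).*/\1/p')
echo "$lustre_ver"
```

On a live node you would feed `uname -r` into the same pattern instead of a hard-coded string.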

On 13 January 2011 18:04, Brendon <b at brendon.com> wrote:

> Wojciech-
>
> Before this, I did read a bit about lustre, but not much. Just some
> high-level stuff. It was definitely a "crash course"
>
> It looks like version 1.6.5. I don't have stack traces. The kernel
> panicked each time, and I don't have a console server.
>
> # uname -a
> Linux jupiter.nanostellar.com 2.6.18-53.1.14.el5_lustre.1.6.5.1smp #1
> SMP Wed Jun 18 19:45:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> -Brendon
>
> On Thu, Jan 13, 2011 at 10:01 AM, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> > Hi Brendon,
> >
> > So it looks like your Lustre was just stuck in the recovery process after all.
> > It is a bit concerning that you had kernel panics on MDS during recovery.
> > Which Lustre version are you using? Do you have stack traces from the
> > kernel panics?
> >
> > Wojciech
> >
> > On 13 January 2011 17:41, Brendon <b at brendon.com> wrote:
> >>
> >> On Tue, Jan 11, 2011 at 3:35 PM, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> >> > Hi Brendon,
> >> >
> >> > Can you please provide the following:
> >> > 1) output of ifconfig run on each OSS, the MDS, and at least one client
> >> > 2) output of lctl list_nids run on each OSS, the MDS, and at least one client
> >> > 3) output of tunefs.lustre --print --dryrun /dev/<OST_block_device> from each OSS
> >> >
> >> > Wojciech
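The three-item checklist above is easy to script across a small cluster. A hedged sketch, assuming ssh access; the node names and OST device path are hypothetical placeholders, not taken from this thread, and `DRY_RUN=1` (the default here) prints the commands instead of executing them:

```shell
#!/bin/sh
# Collect the three requested diagnostics from every node over ssh.
DRY_RUN=${DRY_RUN:-1}
NODES="mds oss0 oss1 oss2 client"   # assumption: adjust to your node names
OST_DEV=/dev/sda5                   # assumption: OST block device path

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run on $1: $2"
    else
        ssh "$1" "$2"
    fi
}

for node in $NODES; do
    run "$node" "ifconfig"
    run "$node" "lctl list_nids"
done
for oss in oss0 oss1 oss2; do
    run "$oss" "tunefs.lustre --print --dryrun $OST_DEV"
done
```

Running it with `DRY_RUN=0` executes the commands for real; capturing the output per node makes threads like this one much easier to answer.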
> >>
> >> After someone saw the emails I sent out, they grabbed me on IRC.
> >> We had a discussion, and their reading of the situation was that
> >> everything should be working; I just needed to wait for recovery to
> >> run and complete. What I then learned is that, first, a client has to
> >> connect for recovery to initiate. Secondly, the code isn't perfect:
> >> the MDS kernel oops'ed twice before recovery finally completed. I was
> >> in the process of disabling panic-on-oops when it finished. Once that
> >> was done, I got a clean bill of health.
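Rather than waiting blind, recovery progress can be watched on the MDS. A minimal sketch: the proc path in the comment matches 1.6.x-era layouts (newer releases expose `lctl get_param mdt.*.recovery_status` instead), and the sample text below is a hypothetical illustration of the file's field format, not output captured from this cluster:

```shell
# Parse the status field out of a recovery_status report.
sample='status: COMPLETE
recovery_start: 1294940000
recovery_duration: 310
completed_clients: 4/4'

status=$(printf '%s\n' "$sample" | awk '/^status:/ {print $2}')
echo "recovery status: $status"

# On a live 1.6.x MDS the equivalent would be something like:
#   cat /proc/fs/lustre/mds/*/recovery_status
```

A status of RECOVERING with a shrinking `time_remaining` means waiting is the right move; COMPLETE means clients should be served normally again.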
> >>
> >> Just to complete this discussion, I have listed the requested output.
> >> I might still learn something :)
> >>
> >> ...Looks like I did learn something. OSS0 has an issue with its root
> >> FS and was remounted read-only, which I discovered when running
> >> tunefs.lustre --print --dryrun /dev/sda5.
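Read-only remounts like the one on OSS0 can be caught proactively by scanning /proc/mounts for an `ro` mount option. A small sketch; the two sample lines are hypothetical, and on a real node you would read /proc/mounts directly:

```shell
# List mount points whose option string contains a standalone "ro" flag.
sample='/dev/sda1 / ext3 ro,data=ordered 0 0
/dev/sda5 /mnt/ost ldiskfs rw,errors=remount-ro 0 0'

ro_mounts=$(printf '%s\n' "$sample" | awk '$4 ~ /(^|,)ro(,|$)/ {print $2}')
echo "read-only: $ro_mounts"
```

Note the regex deliberately ignores `errors=remount-ro`, which is a normal option on rw ldiskfs mounts, and only flags filesystems actually mounted read-only.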
> >>
> >> The fun never ends :)
> >> -Brendon
> >>
> >> 1) ifconfig info
> >> MDS: # ifconfig
> >> eth0      Link encap:Ethernet  HWaddr 00:15:17:5E:46:64
> >>          inet addr:10.1.1.1  Bcast:10.1.1.255  Mask:255.255.255.0
> >>          inet6 addr: fe80::215:17ff:fe5e:4664/64 Scope:Link
> >>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >>          RX packets:49140546 errors:0 dropped:0 overruns:0 frame:0
> >>          TX packets:63644404 errors:0 dropped:0 overruns:0 carrier:0
> >>          collisions:0 txqueuelen:1000
> >>          RX bytes:18963170801 (17.6 GiB)  TX bytes:65261762295 (60.7 GiB)
> >>          Base address:0xcc00 Memory:f58e0000-f5900000
> >>
> >> eth1      Link encap:Ethernet  HWaddr 00:15:17:5E:46:65
> >>          inet addr:192.168.0.181  Bcast:192.168.0.255  Mask:255.255.255.0
> >>          inet6 addr: fe80::215:17ff:fe5e:4665/64 Scope:Link
> >>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >>          RX packets:236738842 errors:0 dropped:0 overruns:0 frame:0
> >>          TX packets:458503163 errors:0 dropped:0 overruns:0 carrier:0
> >>          collisions:0 txqueuelen:100
> >>          RX bytes:15562858193 (14.4 GiB)  TX bytes:686167422947 (639.0 GiB)
> >>          Base address:0xc880 Memory:f5880000-f58a0000
> >>
> >> OSS : # ifconfig
> >> eth0      Link encap:Ethernet  HWaddr 00:1D:60:E0:5B:B2
> >>          inet addr:10.1.1.2  Bcast:10.1.1.255  Mask:255.255.255.0
> >>          inet6 addr: fe80::21d:60ff:fee0:5bb2/64 Scope:Link
> >>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >>          RX packets:3092588 errors:0 dropped:0 overruns:0 frame:0
> >>          TX packets:3547204 errors:0 dropped:0 overruns:0 carrier:0
> >>          collisions:0 txqueuelen:1000
> >>          RX bytes:1320521551 (1.2 GiB)  TX bytes:2670089148 (2.4 GiB)
> >>          Interrupt:233
> >>
> >> client: # ifconfig
> >> eth0      Link encap:Ethernet  HWaddr 00:1E:8C:39:E4:69
> >>          inet addr:10.1.1.5  Bcast:10.1.1.255  Mask:255.255.255.0
> >>          inet6 addr: fe80::21e:8cff:fe39:e469/64 Scope:Link
> >>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >>          RX packets:727922 errors:0 dropped:0 overruns:0 frame:0
> >>          TX packets:884188 errors:0 dropped:0 overruns:0 carrier:0
> >>          collisions:0 txqueuelen:1000
> >>          RX bytes:433349006 (413.2 MiB)  TX bytes:231985578 (221.2 MiB)
> >>          Interrupt:50
> >>
> >>
> >>
> >> 2) lctl list_nids
> >>
> >> client: lctl list_nids
> >> 10.1.1.5@tcp
> >>
> >> MDS: lctl list_nids
> >> 10.1.1.1@tcp
> >>
> >> OSS: lctl list_nids
> >> 10.1.1.2@tcp
> >>
> >> 3) tunefs.lustre --print --dryrun /dev/sda5
> >> OSS0: # tunefs.lustre --print --dryrun /dev/sda5
> >> checking for existing Lustre data: found CONFIGS/mountdata
> >> tunefs.lustre: Can't create temporary directory /tmp/dirCZXt3k:
> >> Read-only file system
> >>
> >> tunefs.lustre FATAL: Failed to read previous Lustre data from /dev/sda5
> >> (30)
> >> tunefs.lustre: exiting with 30 (Read-only file system)
> >>
> >> OSS1: # tunefs.lustre --print --dryrun /dev/sda5
> >> checking for existing Lustre data: found CONFIGS/mountdata
> >> Reading CONFIGS/mountdata
> >>
> >>   Read previous values:
> >> Target:     mylustre-OST0001
> >> Index:      1
> >> Lustre FS:  mylustre
> >> Mount type: ldiskfs
> >> Flags:      0x2
> >>              (OST )
> >> Persistent mount opts: errors=remount-ro,extents,mballoc
> >> Parameters: mgsnode=10.1.1.1@tcp
> >>
> >>
> >>   Permanent disk data:
> >> Target:     mylustre-OST0001
> >> Index:      1
> >> Lustre FS:  mylustre
> >> Mount type: ldiskfs
> >> Flags:      0x2
> >>              (OST )
> >> Persistent mount opts: errors=remount-ro,extents,mballoc
> >> Parameters: mgsnode=10.1.1.1@tcp
> >>
> >> exiting before disk write.
> >>
> >>
> >> OSS2: # tunefs.lustre --print --dryrun /dev/sda5
> >> checking for existing Lustre data: found CONFIGS/mountdata
> >> Reading CONFIGS/mountdata
> >>
> >>   Read previous values:
> >> Target:     mylustre-OST0002
> >> Index:      2
> >> Lustre FS:  mylustre
> >> Mount type: ldiskfs
> >> Flags:      0x2
> >>              (OST )
> >> Persistent mount opts: errors=remount-ro,extents,mballoc
> >> Parameters: mgsnode=10.1.1.1@tcp
> >>
> >>
> >>   Permanent disk data:
> >> Target:     mylustre-OST0002
> >> Index:      2
> >> Lustre FS:  mylustre
> >> Mount type: ldiskfs
> >> Flags:      0x2
> >>              (OST )
> >> Persistent mount opts: errors=remount-ro,extents,mballoc
> >> Parameters: mgsnode=10.1.1.1@tcp
> >>
> >> exiting before disk write.
> >> _______________________________________________
> >> Lustre-discuss mailing list
> >> Lustre-discuss at lists.lustre.org
> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> >
> >
> >
> > --
> > Wojciech Turek
> >
> >
> >
>

