[Lustre-discuss] Kernel panic on mounting an OST

Wojciech Turek wjt27 at cam.ac.uk
Thu Dec 13 03:39:47 PST 2007


Hi,
On 13 Dec 2007, at 10:55, Ludovic Francois wrote:

> On 12 Dec, 17:51, Oleg Drokin <Oleg.Dro... at Sun.COM> wrote:
>> Hello!
>>
>> On Dec 12, 2007, at 11:39 AM, Franck Martinaux wrote:
>>
>>> After a power outage, I am having difficulty mounting an OST.
>>> I am running Lustre 1.6.3 and I get a panic on the OSS when I
>>> try to mount an OST.
>>
>> It would greatly help us if you could show us the panic message and
>> possibly a stack trace.
>
>
> Hi,
>
> Please find below all information we got this morning
>
> Environment
> ===========
>
> ,----
> | [root at oss01 ~]# uname -a
> | Linux oss01.data.cluster 2.6.9-55.0.9.EL_lustre.1.6.3smp #1 SMP Sun
> Oct 7 20:08:31 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
> | [root at oss01 ~]#
> `----
>
> Mount of this specific OST
> ==========================
>
> ,----
> | [root at oss01 ~]# mount -t lustre /dev/mpath/mpath0 /mnt/lustre/ost1
> | Read from remote host oss01: Connection reset by peer
> | Connection to oss01 closed.
> | [ddn at admin01 ~]$
> `----
>
> /var/log/messages during the operation
> ======================================
>
> --8<---------------cut here---------------start------------->8---
> Dec 13 08:36:04 oss01 sshd(pam_unix)[13469]: session opened for user root by root(uid=0)
> Dec 13 08:36:20 oss01 kernel: kjournald starting.  Commit interval 5 seconds
> Dec 13 08:36:20 oss01 kernel: LDISKFS FS on dm-1, internal journal
> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: recovery complete.
> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Dec 13 08:36:20 oss01 kernel: kjournald starting.  Commit interval 5 seconds
> Dec 13 08:36:20 oss01 kernel: LDISKFS FS on dm-1, internal journal
> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: file extents enabled
> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mballoc enabled
> Dec 13 08:36:20 oss01 kernel: Lustre: OST lustre-OST0002 now serving
> dev (lustre-OST0002/0258906d-8eca-ba98-4e3d-19adfa472914) with
> recovery enabled

OK, because this is the only device I can see being mounted in this log,
I assume that at this moment /dev/mpath0 = lustre-OST0002.
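A quick, read-only way to double-check which target a given multipath
device actually holds is to read its volume label (mkfs.lustre sets the
ldiskfs label to the target name; e2fsck also prints it, which is why your
fsck output below says "lustre-OST0030"). For example, with illustrative
device names:

   e2label /dev/mpath/mpath0     # prints the target name, e.g. lustre-OST0030
   e2label /dev/mpath/mpath1     # e.g. lustre-OST0002

If your tunefs.lustre supports --print it will report the same target name
from CONFIGS/mountdata without rewriting anything; otherwise the plain
tunefs.lustre run you already did shows it under "Read previous values".
Running this on both OSS nodes after each reboot tells you immediately
whether the mpathN numbering has moved around.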

> Dec 13 08:36:20 oss01 kernel: Lustre: Server lustre-OST0002 on device /dev/mpath/mpath1 has started
> Dec 13 08:36:21 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
> Dec 13 08:36:21 oss01 kernel: LustreError: Skipped 4 previous similar messages
> Dec 13 08:36:21 oss01 kernel: LustreError: 13664:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@00000102244efe00 x146203/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
> Dec 13 08:36:21 oss01 kernel: LustreError: 13664:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 4 previous similar messages
> Dec 13 08:36:41 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
> Dec 13 08:36:41 oss01 kernel: LustreError: 13665:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@000001021d1d6800 x146233/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
> Dec 13 08:37:01 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
> Dec 13 08:37:01 oss01 kernel: LustreError: 13666:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@0000010006b95c00 x146264/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
> Dec 13 08:37:21 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
> Dec 13 08:37:21 oss01 kernel: LustreError: 13667:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@00000100cfe9ba00 x146300/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
> Dec 13 08:37:41 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
> Dec 13 08:37:41 oss01 kernel: LustreError: Skipped 5 previous similar messages
> Dec 13 08:37:41 oss01 kernel: LustreError: 13668:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req@0000010037e88e00 x146373/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
> Dec 13 08:37:41 oss01 kernel: LustreError: 13668:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 5 previous similar messages
> Dec 13 08:37:47 oss01 kernel: Lustre: Failing over lustre-OST0002
> Dec 13 08:37:47 oss01 kernel: Lustre: *** setting obd lustre-OST0002
> device 'unknown-block(253,1)' read-only ***
> Dec 13 08:37:47 oss01 kernel: Turning device dm-1 (0xfd00001) read-
> only
> Dec 13 08:37:47 oss01 kernel: Lustre: lustre-OST0002: shutting down
> for failover; client state will be preserved.
> Dec 13 08:37:47 oss01 kernel: Lustre: OST lustre-OST0002 has stopped.
> Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 blocks 1 reqs (0
> success)
> Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 extents scanned,
> 1 goal hits, 0 2^N hits, 0 breaks, 0 lost
> Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 generated and it
> took 12560
> Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 256 preallocated, 0
> discarded
> Dec 13 08:37:47 oss01 kernel: Removing read-only on dm-1 (0xfd00001)
> Dec 13 08:37:47 oss01 kernel: Lustre: server umount lustre-OST0002
> complete
> Dec 13 08:37:57 oss01 sshd(pam_unix)[13946]: session opened for user
> root by root(uid=0)
> Dec 13 08:38:18 oss01 kernel: kjournald starting.  Commit interval 5
> seconds
> Dec 13 08:38:18 oss01 kernel: LDISKFS FS on dm-0, internal journal
> Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mounted filesystem with
> ordered data mode.
> Dec 13 08:38:18 oss01 kernel: kjournald starting.  Commit interval 5
> seconds
> Dec 13 08:38:18 oss01 kernel: LDISKFS FS on dm-0, internal journal
> Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mounted filesystem with
> ordered data mode.
> Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: file extents enabled
> Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mballoc enabled
> Dec 13 08:43:52 oss01 syslogd 1.4.1: restart.
> Dec 13 08:43:52 oss01 syslog: syslogd startup succeeded
> Dec 13 08:43:52 oss01 kernel: klogd 1.4.1, log source = /proc/kmsg
> started.
> --8<---------------cut here---------------end--------------->8---
>
> We have to do a power cycle to connect again
> ============================================
>
> ,----
> | # ipmitool -I lan -H 192.168.99.101 -U $login -P $password power
> cycle
> `----
>
>
> The OST fsck seems correct
> ==========================
>
> ,----
> | [root at oss01 log]# fsck.ext2 /dev/mpath/mpath0
> | e2fsck 1.40.2.cfs1 (12-Jul-2007)
> | lustre-OST0030: recovering journal
> | lustre-OST0030: clean, 227/244195328 files, 15614685/976760320
> blocks
> | [root at oss01 log]#
> `----
>
This is after the power cycle, right? And now mpath0 on the same
server claims that it is lustre-OST0030.
Isn't that strange? My first guess would be that your multipath
devices are being mixed up every time you reboot the server. Make
sure that your multipath bindings file is the same on all servers, or
create your own aliases based on the WWID of each LUN in
/etc/multipath.conf.
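For example, something along these lines in /etc/multipath.conf (the WWID
is the one your multipath -l reports for mpath0; the alias name is just an
illustration) gives the LUN the same name on every boot and on every
server:

   multipaths {
           multipath {
                   wwid   360001ff00fd4922302000800001d1c17
                   alias  lustre_ost0030
           }
   }

After updating the configuration and re-running multipath, the LUN should
always show up as /dev/mapper/lustre_ost0030, so a reboot can no longer
swap mpath0 and mpath1 underneath your mount commands.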

> tunefs.lustre correctly reads mpath0 information
> ================================================
>
> ,----
> | [root at oss01 log]# tunefs.lustre /dev/mpath/mpath0
> | checking for existing Lustre data: found CONFIGS/mountdata
> | Reading CONFIGS/mountdata
> |
> |    Read previous values:
> | Target:     lustre-OST0030
> | Index:      48
> | Lustre FS:  lustre
> | Mount type: ldiskfs
> | Flags:      0x142
> |               (OST update writeconf )
> | Persistent mount opts: errors=remount-ro,extents,mballoc
> | Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80
> |
> |
> |    Permanent disk data:
> | Target:     lustre-OST0030
> | Index:      48
> | Lustre FS:  lustre
> | Mount type: ldiskfs
> | Flags:      0x142
> |               (OST update writeconf )
> | Persistent mount opts: errors=remount-ro,extents,mballoc
> | Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80
> |
> | Writing CONFIGS/mountdata
> | [root at oss01 log]#
> `----
>
>
> DDN lun is ready and working correctly
> ======================================
>
> ,----[ OSS view ]
> | [root at oss01 log]# multipath -l | grep mpath0
> | mpath0 (360001ff00fd4922302000800001d1c17)
> | [root at oss01 log]#
> `----
>
> ,----[ S2A9550 view ]
> | [ddn at admin01 ~]$ s2a -h 10.141.0.92 -e "lun list" | grep -i
> 0fd492230200
> |   2                     1    Ready          3815470 0FD492230200
> | [ddn at admin01 ~]$
> `----
>
> Stack trace (we got it from OSS02 via the serial line during a mount attempt)
> ==============================================================================
>
> --8<---------------cut here---------------start------------->8---
> LDISKFS-fs: file extents enabled
> LDISKFS-fs: mballoc enabled
> LustreError: 134-6: Trying to start OBD lustre-OST0030_UUID using the wrong disk lustre-OST0000_UUID. Were the /dev/ assignments rearranged?
This seems to confirm my theory about mixed-up block devices.
> LustreError: 10203:0:(filter.c:1022:filter_prep()) cannot read
> last_rcvd: rc = -22
> LustreError: 10203:0:(obd_config.c:325:class_setup()) setup lustre-OST0030 failed (-22)
> LustreError: 10203:0:(obd_config.c:1062:class_config_llog_handler()) Err -22 on cfg command:
> Lustre:    cmd=cf003 0:lustre-OST0030  1:dev  2:type
> LustreError: 15b-f: MGC10.143.0.5@tcp: The configuration from log 'lustre-OST0030' failed (-22). Make sure this client and the MGS are running compatible versions of Lustre.
> LustreError: 15c-8: MGC10.143.0.5@tcp: The configuration from log 'lustre-OST0030' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> LustreError: 10203:0:(obd_mount.c:1082:server_start_targets()) failed to start server lustre-OST0030: -22
> LustreError: 10203:0:(obd_mount.c:1573:server_fill_super()) Unable to start targets: -22
> LustreError: 10203:0:(obd_config.c:392:class_cleanup()) Device 2 not setup
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at spinlock:119
> invalid operand: 0000 [1] SMP
> CPU 3
> Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) lquota(U) mdc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U)
>  lvfs(U) libcfs(U) md5(U) ipv6(U) parport_pc(U) lp(U) parport(U)
> autofs4(U) i2c_dev(U) i2c_core(U) sunrpc(U) ds(U) yenta_socket(U)
> pcmcia_c
> ore(U) dm_mirror(U) dm_round_robin(U) dm_multipath(U) joydev(U)
> button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) hw_random(U)
> myri10ge(U)
>  bnx2(U) ext3(U) jbd(U) dm_mod(U) qla2400(U) ata_piix(U)
> megaraid_sas(U) qla2xxx(U) scsi_transport_fc(U) sd_mod(U)
> multipath(U)
> Pid: 10286, comm: ptlrpcd Tainted: GF     2.6.9-55.0.9.EL_lustre.
> 1.6.3smp
> RIP: 0010:[<ffffffff80321465>] <ffffffff80321465>{__lock_text_start
> +32}
> RSP: 0018:0000010218cd9bc8  EFLAGS: 00010216
> RAX: 0000000000000016 RBX: 000001021654e4bc RCX: 0000000000020000
> RDX: 000000000000baa7 RSI: 0000000000000246 RDI: ffffffff80396fc0
> RBP: 000001021654e4a0 R08: 00000000fffffffe R09: 000001021654e4bc
> R10: 0000000000000000 R11: 0000000000000000 R12: 00000102196e6058
> R13: 00000102196e6000 R14: 0000010218cd9eb8 R15: 0000010218cd9e58
> FS:  0000002a9557ab00(0000) GS:ffffffff804a6880(0000) knlGS:
> 0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000002a95557000 CR3: 0000000228514000 CR4: 00000000000006e0
> Process ptlrpcd (pid: 10286, threadinfo 0000010218cd8000, task
> 00000102170b4030)
> Stack: 000001021654e4bc ffffffffa03a2121 000001021a99304e
> ffffffffa03b32a0
>        000001021654e0b0 ffffffffa04d6510 0000008000000000
> 0000000000000000
>        0000000000000000 00000102203920c0
> Call Trace:<ffffffffa03a2121>{:lquota:filter_quota_clearinfo+49}
>        <ffffffffa04d6510>{:obdfilter:filter_destroy_export+560}
>        <ffffffff80131923>{recalc_task_prio+337}
> <ffffffffa02586fd>{:obdclass:class_export_destroy+381}
>        <ffffffffa025c336>{:obdclass:obd_zombie_impexp_cull+150}
>        <ffffffffa0318345>{:ptlrpc:ptlrpcd_check+229}
> <ffffffffa031883a>{:ptlrpc:ptlrpcd+874}
>        <ffffffff80133566>{default_wake_function+0}
> <ffffffffa02eb450>{:ptlrpc:ptlrpc_expired_set+0}
>        <ffffffffa02eb450>{:ptlrpc:ptlrpc_expired_set+0}
> <ffffffff80133566>{default_wake_function+0}
>        <ffffffff80110de3>{child_rip+8}
> <ffffffffa03184d0>{:ptlrpc:ptlrpcd+0}
>        <ffffffff80110ddb>{child_rip+0}
>
> Code: 0f 0b 04 c2 33 80 ff ff ff ff 77 00 f0 ff 0b 0f 88 8b 03 00
> RIP <ffffffff80321465>{__lock_text_start+32} RSP <0000010218cd9bc8>
> Kernel panic - not syncing: Oops
> LDISKFS-fs: mballoc: 1 blocks 1 reqs (0 success)
> --8<---------------cut here---------------end--------------->8---
>
> If you need more information or debugging output, feel free to ask us. The
> problem occurs only with this OST.
>
> Thanks, Ludo

I hope this helps.

Cheers,

Wojciech
>
> --
> Ludovic Francois                 +33 (0)6 14 77 26 93
> System Engineer                  DataDirect Networks
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517


