[Lustre-discuss] Kernel panic on mounting an OST

Wojciech Turek wjt27 at cam.ac.uk
Thu Dec 13 03:53:45 PST 2007


Hi,

I would just like to add that you can run a very simple test to check
whether mpath is working correctly. On your server oss01, run
tunefs.lustre --print /dev/<mpath_device> for each mpath device and
write down the target name reported for each one. Then reboot the
server, repeat the command, and compare whether the mpath -> target
map is the same as it was before the reboot.
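That check is easy to script. Here is a minimal sketch (assumptions: tunefs.lustre is in PATH, the multipath devices appear under /dev/mpath, and /root is writable; adjust the paths for your site):

```shell
#!/bin/sh
# Record the mpath-device -> Lustre target mapping so it can be diffed
# across reboots.  The output path below is illustrative, not site-specific.
OUT="/root/mpath-map.$(date +%Y%m%d-%H%M%S)"
for dev in /dev/mpath/*; do
    # tunefs.lustre --print reports a "Target:" line naming the OST/MDT
    # stored on the device.
    target=$(tunefs.lustre --print "$dev" 2>/dev/null |
             awk '/^[ \t]*Target:/ {print $2; exit}')
    printf '%s %s\n' "$dev" "${target:-UNKNOWN}"
done > "$OUT"
echo "Wrote $OUT; after the next reboot, rerun this and diff the two files."
```

If the diff shows a device mapping to a different target after a reboot, the /dev assignments have been rearranged; binding each alias to its LUN WWID in /etc/multipath.conf keeps the mapping stable across reboots.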

Cheers

Wojciech
On 13 Dec 2007, at 10:55, Ludovic Francois wrote:

> On 12 Dec, 17:51, Oleg Drokin <Oleg.Dro... at Sun.COM> wrote:
>> Hello!
>>
>> On Dec 12, 2007, at 11:39 AM, Franck Martinaux wrote:
>>
>>> After a power outage, I am having difficulty mounting an OST. I am
>>> running Lustre 1.6.3, and I get a panic on the OSS when I try to
>>> mount the OST.
>>
>> It would greatly help us if you showed us the panic message and, if
>> possible, a stack trace.
>
>
> Hi,
>
> Please find below all the information we gathered this morning.
>
> Environment
> ===========
>
> ,----
> | [root at oss01 ~]# uname -a
> | Linux oss01.data.cluster 2.6.9-55.0.9.EL_lustre.1.6.3smp #1 SMP Sun
> Oct 7 20:08:31 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
> | [root at oss01 ~]#
> `----
>
> Mount of this specific OST
> ==========================
>
> ,----
> | [root at oss01 ~]# mount -t lustre /dev/mpath/mpath0 /mnt/lustre/ost1
> | Read from remote host oss01: Connection reset by peer
> | Connection to oss01 closed.
> | [ddn at admin01 ~]$
> `----
>
> /var/log/messages during the operation
> ======================================
>
> --8<---------------cut here---------------start------------->8---
> Dec 13 08:36:04 oss01 sshd(pam_unix)[13469]: session opened for user root by root(uid=0)
> Dec 13 08:36:20 oss01 kernel: kjournald starting.  Commit interval 5 seconds
> Dec 13 08:36:20 oss01 kernel: LDISKFS FS on dm-1, internal journal
> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: recovery complete.
> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Dec 13 08:36:20 oss01 kernel: kjournald starting.  Commit interval 5 seconds
> Dec 13 08:36:20 oss01 kernel: LDISKFS FS on dm-1, internal journal
> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: file extents enabled
> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mballoc enabled
> Dec 13 08:36:20 oss01 kernel: Lustre: OST lustre-OST0002 now serving dev (lustre-OST0002/0258906d-8eca-ba98-4e3d-19adfa472914) with recovery enabled
> Dec 13 08:36:20 oss01 kernel: Lustre: Server lustre-OST0002 on device /dev/mpath/mpath1 has started
> Dec 13 08:36:21 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
> Dec 13 08:36:21 oss01 kernel: LustreError: Skipped 4 previous similar messages
> Dec 13 08:36:21 oss01 kernel: LustreError: 13664:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req at 00000102244efe00 x146203/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
> Dec 13 08:36:21 oss01 kernel: LustreError: 13664:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 4 previous similar messages
> Dec 13 08:36:41 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
> Dec 13 08:36:41 oss01 kernel: LustreError: 13665:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req at 000001021d1d6800 x146233/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
> Dec 13 08:37:01 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
> Dec 13 08:37:01 oss01 kernel: LustreError: 13666:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req at 0000010006b95c00 x146264/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
> Dec 13 08:37:21 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
> Dec 13 08:37:21 oss01 kernel: LustreError: 13667:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req at 00000100cfe9ba00 x146300/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
> Dec 13 08:37:41 oss01 kernel: LustreError: 137-5: UUID 'lustre-OST0030_UUID' is not available for connect (no target)
> Dec 13 08:37:41 oss01 kernel: LustreError: Skipped 5 previous similar messages
> Dec 13 08:37:41 oss01 kernel: LustreError: 13668:0:(ldlm_lib.c:1437:target_send_reply_msg()) @@@ processing error (-19) req at 0000010037e88e00 x146373/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
> Dec 13 08:37:41 oss01 kernel: LustreError: 13668:0:(ldlm_lib.c:1437:target_send_reply_msg()) Skipped 5 previous similar messages
> Dec 13 08:37:47 oss01 kernel: Lustre: Failing over lustre-OST0002
> Dec 13 08:37:47 oss01 kernel: Lustre: *** setting obd lustre-OST0002 device 'unknown-block(253,1)' read-only ***
> Dec 13 08:37:47 oss01 kernel: Turning device dm-1 (0xfd00001) read-only
> Dec 13 08:37:47 oss01 kernel: Lustre: lustre-OST0002: shutting down for failover; client state will be preserved.
> Dec 13 08:37:47 oss01 kernel: Lustre: OST lustre-OST0002 has stopped.
> Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 blocks 1 reqs (0 success)
> Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 extents scanned, 1 goal hits, 0 2^N hits, 0 breaks, 0 lost
> Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 generated and it took 12560
> Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 256 preallocated, 0 discarded
> Dec 13 08:37:47 oss01 kernel: Removing read-only on dm-1 (0xfd00001)
> Dec 13 08:37:47 oss01 kernel: Lustre: server umount lustre-OST0002 complete
> Dec 13 08:37:57 oss01 sshd(pam_unix)[13946]: session opened for user root by root(uid=0)
> Dec 13 08:38:18 oss01 kernel: kjournald starting.  Commit interval 5 seconds
> Dec 13 08:38:18 oss01 kernel: LDISKFS FS on dm-0, internal journal
> Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Dec 13 08:38:18 oss01 kernel: kjournald starting.  Commit interval 5 seconds
> Dec 13 08:38:18 oss01 kernel: LDISKFS FS on dm-0, internal journal
> Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
> Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: file extents enabled
> Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mballoc enabled
> Dec 13 08:43:52 oss01 syslogd 1.4.1: restart.
> Dec 13 08:43:52 oss01 syslog: syslogd startup succeeded
> Dec 13 08:43:52 oss01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
> --8<---------------cut here---------------end--------------->8---
>
> We had to power-cycle the node to connect again
> ===============================================
>
> ,----
> | # ipmitool -I lan -H 192.168.99.101 -U $login -P $password power cycle
> `----
>
>
> The OST fsck seems correct
> ==========================
>
> ,----
> | [root at oss01 log]# fsck.ext2 /dev/mpath/mpath0
> | e2fsck 1.40.2.cfs1 (12-Jul-2007)
> | lustre-OST0030: recovering journal
> | lustre-OST0030: clean, 227/244195328 files, 15614685/976760320
> blocks
> | [root at oss01 log]#
> `----
>
> tunefs.lustre correctly reads mpath0 information
> ================================================
>
> ,----
> | [root at oss01 log]# tunefs.lustre /dev/mpath/mpath0
> | checking for existing Lustre data: found CONFIGS/mountdata
> | Reading CONFIGS/mountdata
> |
> |    Read previous values:
> | Target:     lustre-OST0030
> | Index:      48
> | Lustre FS:  lustre
> | Mount type: ldiskfs
> | Flags:      0x142
> |               (OST update writeconf )
> | Persistent mount opts: errors=remount-ro,extents,mballoc
> | Parameters: mgsnode=10.143.0.5 at tcp mgsnode=10.143.0.6 at tcp
> failover.node=10.143.0.2 at tcp sys.timeout=80 mgsnode=10.143.0.5 at tcp
> mgsnode=10.143.0.6 at tcp failover.node=10.143.0.2 at tcp sys.timeout=80
> |
> |
> |    Permanent disk data:
> | Target:     lustre-OST0030
> | Index:      48
> | Lustre FS:  lustre
> | Mount type: ldiskfs
> | Flags:      0x142
> |               (OST update writeconf )
> | Persistent mount opts: errors=remount-ro,extents,mballoc
> | Parameters: mgsnode=10.143.0.5 at tcp mgsnode=10.143.0.6 at tcp
> failover.node=10.143.0.2 at tcp sys.timeout=80 mgsnode=10.143.0.5 at tcp
> mgsnode=10.143.0.6 at tcp failover.node=10.143.0.2 at tcp sys.timeout=80
> |
> | Writing CONFIGS/mountdata
> | [root at oss01 log]#
> `----
>
>
> The DDN LUN is ready and working correctly
> ==========================================
>
> ,----[ OSS view ]
> | [root at oss01 log]# multipath -l | grep mpath0
> | mpath0 (360001ff00fd4922302000800001d1c17)
> | [root at oss01 log]#
> `----
>
> ,----[ S2A9550 view ]
> | [ddn at admin01 ~]$ s2a -h 10.141.0.92 -e "lun list" | grep -i
> 0fd492230200
> |   2                     1    Ready          3815470 0FD492230200
> | [ddn at admin01 ~]$
> `----
>
> Stack trace (captured from OSS02 via the serial line during a mount
> attempt)
> ======================================================================
>
> --8<---------------cut here---------------start------------->8---
> LDISKFS-fs: file extents enabled
> LDISKFS-fs: mballoc enabled
> LustreError: 134-6: Trying to start OBD lustre-OST0030_UUID using the wrong disk lustre-OST0000_UUID. Were the /dev/ assignments rearranged?
> LustreError: 10203:0:(filter.c:1022:filter_prep()) cannot read last_rcvd: rc = -22
> LustreError: 10203:0:(obd_config.c:325:class_setup()) setup lustre-OST0030 failed (-22)
> LustreError: 10203:0:(obd_config.c:1062:class_config_llog_handler()) Err -22 on cfg command:
> Lustre:    cmd=cf003 0:lustre-OST0030  1:dev  2:type
> LustreError: 15b-f: MGC10.143.0.5 at tcp: The configuration from log 'lustre-OST0030' failed (-22). Make sure this client and the MGS are running compatible versions of Lustre.
> LustreError: 15c-8: MGC10.143.0.5 at tcp: The configuration from log 'lustre-OST0030' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> LustreError: 10203:0:(obd_mount.c:1082:server_start_targets()) failed to start server lustre-OST0030: -22
> LustreError: 10203:0:(obd_mount.c:1573:server_fill_super()) Unable to start targets: -22
> LustreError: 10203:0:(obd_config.c:392:class_cleanup()) Device 2 not setup
>
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at spinlock:119
> invalid operand: 0000 [1] SMP
> CPU 3
> Modules linked in: obdfilter(U) ost(U) fsfilt_ldiskfs(U) ldiskfs(U) lquota(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) autofs4(U) i2c_dev(U) i2c_core(U) sunrpc(U) ds(U) yenta_socket(U) pcmcia_core(U) dm_mirror(U) dm_round_robin(U) dm_multipath(U) joydev(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) hw_random(U) myri10ge(U) bnx2(U) ext3(U) jbd(U) dm_mod(U) qla2400(U) ata_piix(U) megaraid_sas(U) qla2xxx(U) scsi_transport_fc(U) sd_mod(U) multipath(U)
> Pid: 10286, comm: ptlrpcd Tainted: GF     2.6.9-55.0.9.EL_lustre.1.6.3smp
> RIP: 0010:[<ffffffff80321465>] <ffffffff80321465>{__lock_text_start
> +32}
> RSP: 0018:0000010218cd9bc8  EFLAGS: 00010216
> RAX: 0000000000000016 RBX: 000001021654e4bc RCX: 0000000000020000
> RDX: 000000000000baa7 RSI: 0000000000000246 RDI: ffffffff80396fc0
> RBP: 000001021654e4a0 R08: 00000000fffffffe R09: 000001021654e4bc
> R10: 0000000000000000 R11: 0000000000000000 R12: 00000102196e6058
> R13: 00000102196e6000 R14: 0000010218cd9eb8 R15: 0000010218cd9e58
> FS:  0000002a9557ab00(0000) GS:ffffffff804a6880(0000) knlGS:
> 0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000002a95557000 CR3: 0000000228514000 CR4: 00000000000006e0
> Process ptlrpcd (pid: 10286, threadinfo 0000010218cd8000, task
> 00000102170b4030)
> Stack: 000001021654e4bc ffffffffa03a2121 000001021a99304e
> ffffffffa03b32a0
>        000001021654e0b0 ffffffffa04d6510 0000008000000000
> 0000000000000000
>        0000000000000000 00000102203920c0
> Call Trace:<ffffffffa03a2121>{:lquota:filter_quota_clearinfo+49}
>        <ffffffffa04d6510>{:obdfilter:filter_destroy_export+560}
>        <ffffffff80131923>{recalc_task_prio+337}
> <ffffffffa02586fd>{:obdclass:class_export_destroy+381}
>        <ffffffffa025c336>{:obdclass:obd_zombie_impexp_cull+150}
>        <ffffffffa0318345>{:ptlrpc:ptlrpcd_check+229}
> <ffffffffa031883a>{:ptlrpc:ptlrpcd+874}
>        <ffffffff80133566>{default_wake_function+0}
> <ffffffffa02eb450>{:ptlrpc:ptlrpc_expired_set+0}
>        <ffffffffa02eb450>{:ptlrpc:ptlrpc_expired_set+0}
> <ffffffff80133566>{default_wake_function+0}
>        <ffffffff80110de3>{child_rip+8}
> <ffffffffa03184d0>{:ptlrpc:ptlrpcd+0}
>        <ffffffff80110ddb>{child_rip+0}
>
> Code: 0f 0b 04 c2 33 80 ff ff ff ff 77 00 f0 ff 0b 0f 88 8b 03 00
> RIP <ffffffff80321465>{__lock_text_start+32} RSP <0000010218cd9bc8>
> <0>Kernel panic - not syncing: Oops
> <4>LDISKFS-fs: mballoc: 1 blocks 1 reqs (0 success)
> --8<---------------cut here---------------end--------------->8---
>
> If you need more information or further debugging, feel free to ask.
> The problem occurs only with this OST.
>
> Thanks, Ludo
>
> --
> Ludovic Francois                 +33 (0)6 14 77 26 93
> System Engineer                  DataDirect Networks
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517


