[Lustre-discuss] Kernel panic on mounting an OST

Franck Martinaux fmartinaux83 at gmail.com
Thu Dec 13 04:17:40 PST 2007


Hi Wojciech,

Here is some more information:


[root@oss01 ~]# multipath -l mpath0
mpath0 (360001ff00fd4922302000800001d1c17)
[size=3726 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [active]
  \_ 2:0:0:2  sdaa 65:160 [active]
\_ round-robin 0 [enabled]
  \_ 1:0:0:2  sdc  8:32   [active]
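
For what it's worth, one quick way to double-check that sdaa and sdc really
are two paths to the same LUN is to compare the SCSI IDs directly. This is
only a sketch; the scsi_id options are the RHEL4-era syntax used by the
default multipath getuid_callout and may need adjusting on other udev
versions:

# compare the SCSI WWID reported by each path of mpath0
# -g: treat the device as whitelisted, -u: replace whitespace with
# underscores, -s: sysfs path relative to /sys
for dev in sdaa sdc; do
    echo -n "$dev: "
    /sbin/scsi_id -g -u -s /block/$dev
done
# both lines should print 360001ff00fd4922302000800001d1c17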


[root@oss01 ~]# tunefs.lustre --print /dev/sdaa
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

    Read previous values:
Target:     lustre-OST0030
Index:      48
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
               (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp
failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp
mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80


    Permanent disk data:
Target:     lustre-OST0030
Index:      48
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
               (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp
failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp
mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80

exiting before disk write.


[root@oss01 ~]# tunefs.lustre --print /dev/sdc
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

    Read previous values:
Target:     lustre-OST0030
Index:      48
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
               (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp
failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp
mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80


    Permanent disk data:
Target:     lustre-OST0030
Index:      48
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x2
               (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp
failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp
mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80

exiting before disk write.
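
To repeat your mpath -> target check across every multipath device before
and after a reboot, we could script it roughly like this (untested sketch;
it assumes the devices live under /dev/mpath/ as on this node):

# record the mpath device -> Lustre target mapping so it can be
# diffed against a second run taken after the reboot
OUT=/tmp/mpath-target-map.$(date +%Y%m%d-%H%M%S)
for dev in /dev/mpath/mpath*; do
    # take the first "Target:" line from the tunefs.lustre output
    target=$(tunefs.lustre --print "$dev" 2>/dev/null \
             | awk '/^Target:/ {print $2; exit}')
    echo "$dev ${target:-unknown}"
done | sort > "$OUT"
echo "wrote $OUT"

Running it once now and once after the next reboot, then diffing the two
files, should show whether the /dev assignments moved.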


Any ideas?

Regards

Franck


On 13 Dec 2007, at 12:53, Wojciech Turek wrote:

> Hi,
>
> I would just like to add that you could do a very simple test to see
> if mpath is working correctly. On your server oss01, run
> tunefs.lustre --print /dev/<all_mpath_devices> and write down the
> target name for each mpath device. Reboot the server, do the same
> again, and check whether the mpath -> target map is the same as it
> was before the reboot.
>
> Cheers
>
> Wojciech
> On 13 Dec 2007, at 10:55, Ludovic Francois wrote:
>
>> On 12 Dec, 17:51, Oleg Drokin <Oleg.Dro... at Sun.COM> wrote:
>>> Hello!
>>>
>>> On Dec 12, 2007, at 11:39 AM, Franck Martinaux wrote:
>>>
>>>> After a power outage, I am having difficulty mounting an OST.
>>>> I am running Lustre 1.6.3 and I get a panic on the OSS when I
>>>> try to mount the OST.
>>>
>>> It would greatly help us if you could show us the panic message and,
>>> if possible, a stack trace.
>>
>>
>> Hi,
>>
>> Please find below all the information we gathered this morning.
>>
>> Environment
>> ===========
>>
>> ,----
>> | [root@oss01 ~]# uname -a
>> | Linux oss01.data.cluster 2.6.9-55.0.9.EL_lustre.1.6.3smp #1 SMP Sun
>> Oct 7 20:08:31 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
>> | [root@oss01 ~]#
>> `----
>>
>> Mount of this specific OST
>> ==========================
>>
>> ,----
>> | [root@oss01 ~]# mount -t lustre /dev/mpath/mpath0 /mnt/lustre/ost1
>> | Read from remote host oss01: Connection reset by peer
>> | Connection to oss01 closed.
>> | [ddn@admin01 ~]$
>> `----
>>
>> /var/log/messages during the operation
>> ======================================
>>
>> --8<---------------cut here---------------start------------->8---
>> Dec 13 08:36:04 oss01 sshd(pam_unix)[13469]: session opened for user
>> root by root(uid=0)
>> Dec 13 08:36:20 oss01 kernel: kjournald starting.  Commit interval 5
>> seconds
>> Dec 13 08:36:20 oss01 kernel: LDISKFS FS on dm-1, internal journal
>> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: recovery complete.
>> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mounted filesystem with
>> ordered data mode.
>> Dec 13 08:36:20 oss01 kernel: kjournald starting.  Commit interval 5
>> seconds
>> Dec 13 08:36:20 oss01 kernel: LDISKFS FS on dm-1, internal journal
>> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mounted filesystem with
>> ordered data mode.
>> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: file extents enabled
>> Dec 13 08:36:20 oss01 kernel: LDISKFS-fs: mballoc enabled
>> Dec 13 08:36:20 oss01 kernel: Lustre: OST lustre-OST0002 now serving
>> dev (lustre-OST0002/0258906d-8eca-ba98-4e3d-19adfa472914) with
>> recovery enabled
>> Dec 13 08:36:20 oss01 kernel: Lustre: Server lustre-OST0002 on  
>> device /
>> dev/mpath/mpath1 has started
>> Dec 13 08:36:21 oss01 kernel: LustreError: 137-5: UUID 'lustre-
>> OST0030_UUID' is not available  for connect (no target)
>> Dec 13 08:36:21 oss01 kernel: LustreError: Skipped 4 previous similar
>> messages
>> Dec 13 08:36:21 oss01 kernel: LustreError: 13664:0:(ldlm_lib.c:
>> 1437:target_send_reply_msg()) @@@ processing error (-19)
>> req at 00000102244efe00 x146203/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl
>> Interpret:/0/0 rc -19/0
>> Dec 13 08:36:21 oss01 kernel: LustreError: 13664:0:(ldlm_lib.c:
>> 1437:target_send_reply_msg()) Skipped 4 previous similar messages
>> Dec 13 08:36:41 oss01 kernel: LustreError: 137-5: UUID 'lustre-
>> OST0030_UUID' is not available  for connect (no target)
>> Dec 13 08:36:41 oss01 kernel: LustreError: 13665:0:(ldlm_lib.c:
>> 1437:target_send_reply_msg()) @@@ processing error (-19)
>> req at 000001021d1d68
>> 00 x146233/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc
>> -19/0
>> Dec 13 08:37:01 oss01 kernel: LustreError: 137-5: UUID 'lustre-
>> OST0030_UUID' is not available  for connect (no target)
>> Dec 13 08:37:01 oss01 kernel: LustreError: 13666:0:(ldlm_lib.c:
>> 1437:target_send_reply_msg()) @@@ processing error (-19)
>> req at 0000010006b95c
>> 00 x146264/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc
>> -19/0
>> Dec 13 08:37:21 oss01 kernel: LustreError: 137-5: UUID 'lustre-
>> OST0030_UUID' is not available  for connect (no target)
>> Dec 13 08:37:21 oss01 kernel: LustreError: 13667:0:(ldlm_lib.c:
>> 1437:target_send_reply_msg()) @@@ processing error (-19)
>> req at 00000100cfe9ba
>> 00 x146300/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc
>> -19/0
>> Dec 13 08:37:41 oss01 kernel: LustreError: 137-5: UUID 'lustre-
>> OST0030_UUID' is not available  for connect (no target)
>> Dec 13 08:37:41 oss01 kernel: LustreError: Skipped 5 previous similar
>> messages
>> Dec 13 08:37:41 oss01 kernel: LustreError: 13668:0:(ldlm_lib.c:
>> 1437:target_send_reply_msg()) @@@ processing error (-19)
>> req at 0000010037e88e
>> 00 x146373/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc
>> -19/0
>> Dec 13 08:37:41 oss01 kernel: LustreError: 13668:0:(ldlm_lib.c:
>> 1437:target_send_reply_msg()) Skipped 5 previous similar messages
>> Dec 13 08:37:47 oss01 kernel: Lustre: Failing over lustre-OST0002
>> Dec 13 08:37:47 oss01 kernel: Lustre: *** setting obd lustre-OST0002
>> device 'unknown-block(253,1)' read-only ***
>> Dec 13 08:37:47 oss01 kernel: Turning device dm-1 (0xfd00001) read-
>> only
>> Dec 13 08:37:47 oss01 kernel: Lustre: lustre-OST0002: shutting down
>> for failover; client state will be preserved.
>> Dec 13 08:37:47 oss01 kernel: Lustre: OST lustre-OST0002 has stopped.
>> Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 blocks 1 reqs (0
>> success)
>> Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 extents scanned,
>> 1 goal hits, 0 2^N hits, 0 breaks, 0 lost
>> Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 1 generated and it
>> took 12560
>> Dec 13 08:37:47 oss01 kernel: LDISKFS-fs: mballoc: 256  
>> preallocated, 0
>> discarded
>> Dec 13 08:37:47 oss01 kernel: Removing read-only on dm-1 (0xfd00001)
>> Dec 13 08:37:47 oss01 kernel: Lustre: server umount lustre-OST0002
>> complete
>> Dec 13 08:37:57 oss01 sshd(pam_unix)[13946]: session opened for user
>> root by root(uid=0)
>> Dec 13 08:38:18 oss01 kernel: kjournald starting.  Commit interval 5
>> seconds
>> Dec 13 08:38:18 oss01 kernel: LDISKFS FS on dm-0, internal journal
>> Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mounted filesystem with
>> ordered data mode.
>> Dec 13 08:38:18 oss01 kernel: kjournald starting.  Commit interval 5
>> seconds
>> Dec 13 08:38:18 oss01 kernel: LDISKFS FS on dm-0, internal journal
>> Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mounted filesystem with
>> ordered data mode.
>> Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: file extents enabled
>> Dec 13 08:38:18 oss01 kernel: LDISKFS-fs: mballoc enabled
>> Dec 13 08:43:52 oss01 syslogd 1.4.1: restart.
>> Dec 13 08:43:52 oss01 syslog: syslogd startup succeeded
>> Dec 13 08:43:52 oss01 kernel: klogd 1.4.1, log source = /proc/kmsg
>> started.
>> --8<---------------cut here---------------end--------------->8---
>>
>> We have to do a power cycle to connect again
>> ============================================
>>
>> ,----
>> | # ipmitool -I lan -H 192.168.99.101 -U $login -P $password power
>> cycle
>> `----
>>
>>
>> The OST fsck seems correct
>> ==========================
>>
>> ,----
>> | [root@oss01 log]# fsck.ext2 /dev/mpath/mpath0
>> | e2fsck 1.40.2.cfs1 (12-Jul-2007)
>> | lustre-OST0030: recovering journal
>> | lustre-OST0030: clean, 227/244195328 files, 15614685/976760320
>> blocks
>> | [root@oss01 log]#
>> `----
>>
>> tunefs.lustre correctly reads mpath0 information
>> ================================================
>>
>> ,----
>> | [root@oss01 log]# tunefs.lustre /dev/mpath/mpath0
>> | checking for existing Lustre data: found CONFIGS/mountdata
>> | Reading CONFIGS/mountdata
>> |
>> |    Read previous values:
>> | Target:     lustre-OST0030
>> | Index:      48
>> | Lustre FS:  lustre
>> | Mount type: ldiskfs
>> | Flags:      0x142
>> |               (OST update writeconf )
>> | Persistent mount opts: errors=remount-ro,extents,mballoc
>> | Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp
>> failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp
>> mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80
>> |
>> |
>> |    Permanent disk data:
>> | Target:     lustre-OST0030
>> | Index:      48
>> | Lustre FS:  lustre
>> | Mount type: ldiskfs
>> | Flags:      0x142
>> |               (OST update writeconf )
>> | Persistent mount opts: errors=remount-ro,extents,mballoc
>> | Parameters: mgsnode=10.143.0.5@tcp mgsnode=10.143.0.6@tcp
>> failover.node=10.143.0.2@tcp sys.timeout=80 mgsnode=10.143.0.5@tcp
>> mgsnode=10.143.0.6@tcp failover.node=10.143.0.2@tcp sys.timeout=80
>> |
>> | Writing CONFIGS/mountdata
>> | [root@oss01 log]#
>> `----
>>
>>
>> DDN lun is ready and working correctly
>> ======================================
>>
>> ,----[ OSS view ]
>> | [root@oss01 log]# multipath -l | grep mpath0
>> | mpath0 (360001ff00fd4922302000800001d1c17)
>> | [root@oss01 log]#
>> `----
>>
>> ,----[ S2A9550 view ]
>> | [ddn@admin01 ~]$ s2a -h 10.141.0.92 -e "lun list" | grep -i
>> 0fd492230200
>> |   2                     1    Ready          3815470 0FD492230200
>> | [ddn@admin01 ~]$
>> `----
>>
>> Stack trace (we got it from OSS02 via the serial line during a
>> mount attempt)
>> ================================================================
>>
>> --8<---------------cut here---------------start------------->8---
>> LDISKFS-fs: file extents enabled
>> LDISKFS-fs: mballoc enabled
>> LustreError: 134-6: Trying to start OBD lustre-OST0030_UUID using the
>> wrong disk lustre-OST0000_UUID. Were the /dev/ assignments rearranged
>> ?
>> LustreError: 10203:0:(filter.c:1022:filter_prep()) cannot read
>> last_rcvd: rc = -22
>> LustreError: 10203:0:(obd_config.c:325:class_setup()) setup lustre-
>> OST0030 failed (-22)
>> LustreError: 10203:0:(obd_config.c:1062:class_config_llog_handler())
>> Err -22 on cfg command:
>> Lustre:    cmd=cf003 0:lustre-OST0030  1:dev  2:type
>> LustreError: 15b-f: MGC10.143.0.5@tcp: The configuration from log
>> 'lustre-OST0030' failed (-22). Make sure this client and the MGS are
>> running compatible versions of Lustre.
>> LustreError: 15c-8: MGC10.143.0.5@tcp: The configuration from log
>> 'lustre-OST0030' failed (-22). This may be the result of
>> communication errors between this node and the MGS, a bad
>> configuration, or other errors. See the syslog for more information.
>> LustreError: 10203:0:(obd_mount.c:1082:server_start_targets())
>> failed to start server lustre-OST0030: -22
>> LustreError: 10203:0:(obd_mount.c:1573:server_fill_super()) Unable
>> to start targets: -22
>> LustreError: 10203:0:(obd_config.c:392:class_cleanup()) Device 2 not
>> setup
>> ----------- [cut here ] --------- [please bite here ] ---------
>> Kernel BUG at spinlock:119
>> invalid operand: 0000 [1] SMP
>> CPU 3
>> Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U)
>> ldiskfs(U) lustre(U) lov(U) lquota(U) mdc(U) ksocklnd(U) ptlrpc(U)
>> obdclass(U) lnet(U) lvfs(U) libcfs(U) md5(U) ipv6(U) parport_pc(U)
>> lp(U) parport(U) autofs4(U) i2c_dev(U) i2c_core(U) sunrpc(U) ds(U)
>> yenta_socket(U) pcmcia_core(U) dm_mirror(U) dm_round_robin(U)
>> dm_multipath(U) joydev(U) button(U) battery(U) ac(U) uhci_hcd(U)
>> ehci_hcd(U) hw_random(U) myri10ge(U) bnx2(U) ext3(U) jbd(U)
>> dm_mod(U) qla2400(U) ata_piix(U) megaraid_sas(U) qla2xxx(U)
>> scsi_transport_fc(U) sd_mod(U) multipath(U)
>> Pid: 10286, comm: ptlrpcd Tainted: GF     2.6.9-55.0.9.EL_lustre.
>> 1.6.3smp
>> RIP: 0010:[<ffffffff80321465>] <ffffffff80321465>{__lock_text_start
>> +32}
>> RSP: 0018:0000010218cd9bc8  EFLAGS: 00010216
>> RAX: 0000000000000016 RBX: 000001021654e4bc RCX: 0000000000020000
>> RDX: 000000000000baa7 RSI: 0000000000000246 RDI: ffffffff80396fc0
>> RBP: 000001021654e4a0 R08: 00000000fffffffe R09: 000001021654e4bc
>> R10: 0000000000000000 R11: 0000000000000000 R12: 00000102196e6058
>> R13: 00000102196e6000 R14: 0000010218cd9eb8 R15: 0000010218cd9e58
>> FS:  0000002a9557ab00(0000) GS:ffffffff804a6880(0000) knlGS:
>> 0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> CR2: 0000002a95557000 CR3: 0000000228514000 CR4: 00000000000006e0
>> Process ptlrpcd (pid: 10286, threadinfo 0000010218cd8000, task
>> 00000102170b4030)
>> Stack: 000001021654e4bc ffffffffa03a2121 000001021a99304e
>> ffffffffa03b32a0
>>        000001021654e0b0 ffffffffa04d6510 0000008000000000
>> 0000000000000000
>>        0000000000000000 00000102203920c0
>> Call Trace:<ffffffffa03a2121>{:lquota:filter_quota_clearinfo+49}
>>        <ffffffffa04d6510>{:obdfilter:filter_destroy_export+560}
>>        <ffffffff80131923>{recalc_task_prio+337}
>> <ffffffffa02586fd>{:obdclass:class_export_destroy+381}
>>        <ffffffffa025c336>{:obdclass:obd_zombie_impexp_cull+150}
>>        <ffffffffa0318345>{:ptlrpc:ptlrpcd_check+229}
>> <ffffffffa031883a>{:ptlrpc:ptlrpcd+874}
>>        <ffffffff80133566>{default_wake_function+0}
>> <ffffffffa02eb450>{:ptlrpc:ptlrpc_expired_set+0}
>>        <ffffffffa02eb450>{:ptlrpc:ptlrpc_expired_set+0}
>> <ffffffff80133566>{default_wake_function+0}
>>        <ffffffff80110de3>{child_rip+8}
>> <ffffffffa03184d0>{:ptlrpc:ptlrpcd+0}
>>        <ffffffff80110ddb>{child_rip+0}
>>
>> Code: 0f 0b 04 c2 33 80 ff ff ff ff 77 00 f0 ff 0b 0f 88 8b 03 00
>> RIP <ffffffff80321465>{__lock_text_start+32} RSP <0000010218cd9bc8>
>> Kernel panic - not syncing: Oops
>> LDISKFS-fs: mballoc: 1 blocks 1 reqs (0 success)
>> --8<---------------cut here---------------end--------------->8---
>>
>> If you need more information or debugging output, feel free to ask.
>> The problem occurs only with this OST.
>>
>> Thanks, Ludo
>>
>> --
>> Ludovic Francois                 +33 (0)6 14 77 26 93
>> System Engineer                  DataDirect Networks
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at clusterfs.com
>> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>
> Mr Wojciech Turek
> Assistant System Manager
> University of Cambridge
> High Performance Computing service
> email: wjt27 at cam.ac.uk
> tel. +441223763517
>
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss




More information about the lustre-discuss mailing list