[Lustre-discuss] recovering formatted OST

Tue Oct 26 18:40:32 PDT 2010

On 2010-10-27, at 03:27, Wojciech Turek wrote:
> After the recovery the OST has around 95000 objects left but LAST_ID is set to 2490599 which is the highest object number left on that OST
> 
> What is worrying me now is that the old OST's LAST_ID value is quite high

Object IDs on an OST are allocated sequentially and each one is used only once, so a LAST_ID value higher than the number of existing objects is completely normal.

> [root at mds03 ~]# lctl get_param osc.*.prealloc_last_id | grep OST0010
> osc.ddn_data-OST0010-osc-ffff8101dc723c00.prealloc_last_id=1

This is the "client filesystem mount" OSC, so the value here is irrelevant.

> osc.scratch2-OST0010-osc.prealloc_last_id=2490631

This is the MDS OSC, and it looks correct.

> Is this going to affect the operation of that OST, or is this OK and the OST will carry on from that number with no problems?

Yes, this is OK; the OST appears to be working correctly.
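
For reference, the LAST_ID versus lov_objid comparison made in the quoted messages below can be sketched in Python. This is only an illustration under stated assumptions: that both files hold little-endian 64-bit values (as the od -td8 output from an x86 host suggests), that the lov_objid slot is selected by OST index (0x10 for OST0010), and that the 20000 limit is simply the figure Bernd quotes in this thread rather than something checked against the Lustre source. The function and constant names are made up for this sketch.

```python
import struct

# Limit quoted by Bernd in this thread; treated here as an assumption.
MDS_MAX_GAP = 20000


def read_u64_le(data):
    """Decode bytes as consecutive little-endian unsigned 64-bit integers."""
    count = len(data) // 8
    return list(struct.unpack("<%dQ" % count, data[:count * 8]))


def last_id_gap(last_id_bytes, lov_objid_bytes, ost_index):
    """Return (MDS value in lov_objid) minus (OST value in LAST_ID)."""
    ost_last_id = read_u64_le(last_id_bytes)[0]
    mds_last_id = read_u64_le(lov_objid_bytes)[ost_index]
    return mds_last_id - ost_last_id


def gap_within_limit(gap, limit=MDS_MAX_GAP):
    """True if the absolute gap does not exceed the assumed limit."""
    return abs(gap) <= limit


# Values copied from the od output quoted in this thread.
LOV_OBJID = [
    2073842, 2100049, 2115247, 2038471, 2119821, 2190996, 2029234,
    2354424, 2160856, 2167105, 1970351, 2059045, 2706486, 2571655,
    2662262, 2628346, 2490688, 2668926, 2631587, 2643791,
]
LAST_ID = 2490599

if __name__ == "__main__":
    lov_bytes = struct.pack("<%dQ" % len(LOV_OBJID), *LOV_OBJID)
    last_bytes = struct.pack("<Q", LAST_ID)
    gap = last_id_gap(last_bytes, lov_bytes, 0x10)
    print(gap, gap_within_limit(gap))  # prints: 89 True
```

On a real system the two byte strings would come from reading copies of the files, such as /tmp/LAST_ID and /tmp/lov_objid in the quoted messages.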

> On 26 October 2010 19:55, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> In that case LAST_ID seems to be fine, as the OST shows 2490599 and the MDT shows 2490688, so the difference is 89. I don't understand why you said that the difference is over 100000.
> 
> 
>  [root at oss09 ~]# od -Ax -td8 /tmp/LAST_ID
>   000000              2490599
>   000008
> 
>  [root at mds03 ~]# od -Ax -td8 /tmp/lov_objid
>   000000              2073842              2100049
>   000010              2115247              2038471
>   000020              2119821              2190996
>   000030              2029234              2354424
>   000040              2160856              2167105
>   000050              1970351              2059045
>   000060              2706486              2571655
>   000070              2662262              2628346
>   000080              2490688              2668926
>   000090              2631587              2643791
>   0000a0
> 
> What I don't understand is why lctl reports last_id=1 for that OST 
> 
> lctl get_param osc.*.prealloc_last_id | grep OST0010
> osc.scratch2-OST0010-osc.prealloc_last_id=1
> 
> Unless this is because that OST is deactivated on the MDT ? 
> 
> On 26 October 2010 19:49, Bernd Schubert <bs_lists at aakef.fastmail.fm> wrote:
> That is the value in the lov_objid.
> 
> Cheers,
> Bernd
> 
> On Tuesday, October 26, 2010, Wojciech Turek wrote:
> > I can not find where MDT stores that LAST_ID value for the OST?
> >
> > On 26 October 2010 19:10, Bernd Schubert <bs_lists at aakef.fastmail.fm> wrote:
> > > I think the difference is quite huge (over 100000 files). But the MDS has
> > > a sanity check and will refuse to activate this OST if the difference
> > > is larger than 20000 files.
> > >
> > > So one way or the other you need to correct it (either increase the
> > > LAST_ID value on the OST or on the MDS).
> > >
> > >
> > > Cheers,
> > > Bernd
> > >
> > > On Tuesday, October 26, 2010, Wojciech Turek wrote:
> > > > Ok, I have created a filesystem on a loopback device. I mounted it as
> > > > ldiskfs and copied the CONFIGS directory back to my old OST. Now
> > > > tunefs.lustre returns correct info.
> > > >
> > > > The last_id on the OST is smaller than the number in the MDT lov_objid,
> > > > which is good.
> > > >
> > > > Can I ignore this?
> > > >
> > > > lctl get_param osc.*.prealloc_last_id | grep OST0010
> > > > osc.scratch2-OST0010-osc.prealloc_last_id=1
> > > >
> > > > I guess when I restart the whole filesystem after writeconf the MDT
> > > > should correct that?
> > > >
> > > > best regards,
> > > >
> > > > Wojciech
> > > >
> > > > On 26 October 2010 18:05, Bernd Schubert <bs_lists at aakef.fastmail.fm> wrote:
> > > > > Hello Wojciech,
> > > > >
> > > > > tunefs.lustre has to complain as the files are missing. If you copy
> > > > > over the files from the loop back device (yes, same index and label),
> > > > > tunefs.lustre should work.
> > > > >
> > > > > Cheers,
> > > > > Bernd
> > > > >
> > > > > On Tuesday, October 26, 2010, Wojciech Turek wrote:
> > > > > > Hi Bernd,
> > > > > >
> > > > > > I am not quite clear how creating a new OST on a loopback device
> > > > > > would help:
> > > > > > Shall I create a new OST on a loopback device, formatting it with the
> > > > > > old index and label, and then copy recovered objects to that OST and
> > > > > > mount it to the filesystem?
> > > > > >
> > > > > > I think I need to reformat the old OST before mounting it as a lustre
> > > > > > type filesystem, as although fsck recovered some objects (and I can
> > > > > > access them mounting the OST as ldiskfs), if you run tunefs.lustre on
> > > > > > that OST device, tunefs.lustre complains that it doesn't find any
> > > > > > lustre filesystem.
> > > > > >
> > > > > > As for the EAs, I have created a backup of the recovered objects
> > > > > > preserving EAs.
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Wojciech
> > > > > >
> > > > > > On 26 October 2010 16:35, Bernd Schubert
> > > > > > <bernd.schubert at fastmail.fm> wrote:
> > > > > > > Hello Wojciech,
> > > > > > >
> > > > > > > I think both would work, but why not just create a small OST
> > > > > > > with mkfs.lustre on a loopback device? And then copy over those
> > > > > > > files to your recovered filesystem.
> > > > > > > Hmm, well, e2fsck might not have fixed all issues and then a
> > > > > > > reformat indeed might be helpful.
> > > > > > >
> > > > > > > Also note: EAs on OST objects are a nice to have, but not
> > > > > > > absolutely required.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Bernd
> > > > > > >
> > > > > > > On Tuesday, October 26, 2010, Wojciech Turek wrote:
> > > > > > > > Bernd, I would like to clarify if I understood your suggestion
> > > > > > > > correctly:
> > > > > > > >
> > > > > > > > 1) create a new OST but using old index and old label
> > > > > > > > 2) mount it as ldiskfs and copy recovered objects (using tar or
> > > > > > > > rsync with xattrs support) from the old OST to the new OST
> > > > > > > > 3) run --writeconf on MDT and OST of that filesystem
> > > > > > > > 4) mount MDT and all OSTs
> > > > > > > >
> > > > > > > >
> > > > > > > > I guess I could do it also that way:
> > > > > > > >
> > > > > > > > 1) backup restored object using tar or rsync with xattrs
> > > > > > > > support
> > > > > > > > 2) format old OST with old index and old label
> > > > > > > > 3) restore Objects from the backup
> > > > > > > >
> > > > > > > > Do you think that would work?
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > > Wojciech
> > > > > > > >
> > > > > > > > On 22 October 2010 18:52, Bernd Schubert
> > > > > > > > <bernd.schubert at fastmail.fm> wrote:
> > > > > > > > > Hmm, I would probably format a small fake device on a ramdisk
> > > > > > > > > and copy files over, run tunefs --writeconf /mdt and then start
> > > > > > > > > everything (including all OSTs) again.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > >
> > > > > > > > > On Friday, October 22, 2010, Wojciech Turek wrote:
> > > > > > > > > > I have tried Bernd's suggestion and it seems to have worked:
> > > > > > > > > > after running e2fsck -D, ll_recover_lost_found_objs didn't
> > > > > > > > > > cause a kernel panic but moved a number of objects to the O
> > > > > > > > > > directory. The problem is that I do not have the last_rcvd
> > > > > > > > > > file, so the OST has no index at the moment. What would be
> > > > > > > > > > the next step to enable access to those files in the
> > > > > > > > > > filesystem?
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > >
> > > > > > > > > > Wojciech
> > > > > > > > > >
> > > > > > > > > > On 22 October 2010 17:15, Andreas Dilger
> > > > > > > > > > <andreas.dilger at oracle.com> wrote:
> > > > > > > > > > > On 2010-10-22, at 5:42, Bernd Schubert
> > > > > > > > > > > <bernd.schubert at fastmail.fm> wrote:
> > > > > > > > > > > > Hmm, e2fsck didn't catch that? rec_len is the length of
> > > > > > > > > > > > a directory entry, so after how many bytes the next
> > > > > > > > > > > > entry follows.
> > > > > > > > > > >
> > > > > > > > > > > I agree that e2fsck should have caught that.
> > > > > > > > > > >
> > > > > > > > > > > > You can try to force e2fsck to do
> > > > > > > > > > > > something about that: e2fsck -D
> > > > > > > > > > >
> > > > > > > > > > > No, I would recommend against using -D at this point.
> > > > > > > > > > > That will cause it to re-write the directory contents, and
> > > > > > > > > > > given that the filesystem was previously corrupted I would
> > > > > > > > > > > prefer making as few changes as possible before the data is
> > > > > > > > > > > estranged.
> > > > > > > > > > >
> > > > > > > > > > > Wojciech,
> > > > > > > > > > > note that if you are able to mount the filesystem you could
> > > > > > > > > > > just copy all of the objects (with xattrs!) from lost+found
> > > > > > > > > > > on the bad filesystem, along with the last_rcvd file (if you
> > > > > > > > > > > can find it) into a new ldiskfs filesystem and then run
> > > > > > > > > > > ll_recover_lost_found_objs on that.
> > > > > > > > > > >
> > > > > > > > > > > > On Friday, October 22, 2010, Wojciech Turek wrote:
> > > > > > > > > > > >> Ok, removing and recreating the journal fixed that problem and
> > > > > > > > > > > >> I am able to mount the device as an ldiskfs filesystem. Now I
> > > > > > > > > > > >> hit another wall when trying to run ll_recover_lost_found_objs.
> > > > > > > > > > > >> The first time I run ll_recover_lost_found_objs -d
> > > > > > > > > > > >> /mnt/ost/lost+found it only creates the O dir and exits. When I
> > > > > > > > > > > >> repeat this command the kernel panics. Any idea what could be
> > > > > > > > > > > >> the problem here?
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> LDISKFS-fs error (device dm-4): ldiskfs_readdir: bad entry in
> > > > > > > > > > > >> directory #6831: rec_len is smaller than minimal - offset=0, inode=0,
> > > > > > > > > > > >> rec_len=0, name_len=0
> > > > > > > > > > > >> Aborting journal on device dm-4.
> > > > > > > > > > > >> Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> > > > > > > > > > > >> [<ffffffff88033448>] :jbd:journal_commit_transaction+0xc5b/0x12db
> > > > > > > > > > > >> PGD 1a118d067 PUD 1ce7e7067 PMD 0
> > > > > > > > > > > >> Oops: 0002 [1] SMP
> > > > > > > > > > > >> last sysfs file: /class/infiniband_mad/umad0/port
> > > > > > > > > > > >> CPU 3
> > > > > > > > > > > >> Modules linked in: ldiskfs(U) crc16(U) autofs4(U) hidp(U) l2cap(U)
> > > > > > > > > > > >> bluetooth(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U)
> > > > > > > > > > > >> ipoib_helper(U) ib_cm(U) ipv6(U) xfrm_nalgo(U) crypto_api(U)
> > > > > > > > > > > >> ib_uverbs(U) ib_umad(U) mlx4_vnic(U) mlx4_vnic_helper(U) ib_sa(U)
> > > > > > > > > > > >> ib_mthca(U) mptctl(U) dm_mirror(U) video(U) backlight(U) sbs(U)
> > > > > > > > > > > >> power_meter(U) hwmon(U) i2c_ec(U) i2c_core(U) dell_wmi(U) wmi(U)
> > > > > > > > > > > >> button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U)
> > > > > > > > > > > >> parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) mlx4_ib(U)
> > > > > > > > > > > >> ib_mad(U) ib_core(U) joydev(U) mlx4_core(U) usb_storage(U) pcspkr(U)
> > > > > > > > > > > >> shpchp(U) serio_raw(U) i5000_edac(U) edac_mc(U) dm_raid45(U)
> > > > > > > > > > > >> dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U)
> > > > > > > > > > > >> nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) mptsas(U) mptscsih(U)
> > > > > > > > > > > >> mptbase(U) scsi_transport_sas(U) mppVhba(U) megaraid_sas(U)
> > > > > > > > > > > >> mppUpper(U) sg(U) sd_mod(U) scsi_mod(U) bnx2(U) ext3(U) jbd(U)
> > > > > > > > > > > >> uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
> > > > > > > > > > > >> Pid: 11360, comm: kjournald Tainted: G 2.6.18-194.3.1.el5_lustre.1.8.4 #1
> > > > > > > > > > > >> RIP: 0010:[<ffffffff88033448>]  [<ffffffff88033448>] :jbd:journal_commit_transaction+0xc5b/0x12db
> > > > > > > > > > > >> RSP: 0018:ffff8101c6481d90  EFLAGS: 00010246
> > > > > > > > > > > >> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000ffffffff
> > > > > > > > > > > >> RDX: 0000000000000000 RSI: ffff8101e9dab0c0 RDI: ffff81022fa46000
> > > > > > > > > > > >> RBP: ffff81022fa46000 R08: ffff81022fa46068 R09: 0000000000000000
> > > > > > > > > > > >> R10: ffff810105925b20 R11: 00000000fffffffa R12: 0000000000000000
> > > > > > > > > > > >> R13: 0000000000000000 R14: ffff8101e9dab0c0 R15: 0000000000000000
> > > > > > > > > > > >> FS:  0000000000000000(0000) GS:ffff810107b9a4c0(0000) knlGS:0000000000000000
> > > > > > > > > > > >> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > > > > > > > > > > >> CR2: 0000000000000000 CR3: 00000001eaffb000 CR4: 00000000000006e0
> > > > > > > > > > > >> Process kjournald (pid: 11360, threadinfo ffff8101c6480000, task ffff81021c14c0c0)
> > > > > > > > > > > >> Stack:  ffff8101a61b9000 000000002b8263c0 ffffffff00000000 0000000000000000
> > > > > > > > > > > >> 0000113b00000001 0000000000000013 0000000000000000 0000000000000111
> > > > > > > > > > > >> 0000000000000000 0000000000000000 0000000001282dd7 00000000000020dd
> > > > > > > > > > > >> Call Trace:
> > > > > > > > > > > >> [<ffffffff8003da91>] lock_timer_base+0x1b/0x3c
> > > > > > > > > > > >> [<ffffffff8004b347>] try_to_del_timer_sync+0x7f/0x88
> > > > > > > > > > > >> [<ffffffff88037386>] :jbd:kjournald+0xc1/0x213
> > > > > > > > > > > >> [<ffffffff800a0ab2>] autoremove_wake_function+0x0/0x2e
> > > > > > > > > > > >> [<ffffffff800a089a>] keventd_create_kthread+0x0/0xc4
> > > > > > > > > > > >> [<ffffffff880372c5>] :jbd:kjournald+0x0/0x213
> > > > > > > > > > > >> [<ffffffff800a089a>] keventd_create_kthread+0x0/0xc4
> > > > > > > > > > > >> [<ffffffff80032890>] kthread+0xfe/0x132
> > > > > > > > > > > >> [<ffffffff8005dfb1>] child_rip+0xa/0x11
> > > > > > > > > > > >> [<ffffffff800a089a>] keventd_create_kthread+0x0/0xc4
> > > > > > > > > > > >> [<ffffffff8014bcf4>] deadline_queue_empty+0x0/0x23
> > > > > > > > > > > >> [<ffffffff80032792>] kthread+0x0/0x132
> > > > > > > > > > > >> [<ffffffff8005dfa7>] child_rip+0x0/0x11
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> Code: f0 0f ba 33 01 e8 42 fc 02 f8 8b 03 a8 04 75 07 8b 43 58 85
> > > > > > > > > > > >> RIP  [<ffffffff88033448>] :jbd:journal_commit_transaction+0xc5b/0x12db
> > > > > > > > > > > >> RSP <ffff8101c6481d90>
> > > > > > > > > > > >> CR2: 0000000000000000
> > > > > > > > > > > >> <0>Kernel panic - not syncing: Fatal exception
> > > > > > > > > > > >>
> > > > > > > > > > > >> On 22 October 2010 03:09, Andreas Dilger
> > > > > > > > > > > >> <andreas.dilger at oracle.com> wrote:
> > > > > > > > > > > >>> On 2010-10-21, at 18:44, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> > > > > > > > > > > >>> fsck has finished and does not find any more errors
> > > > > > > > > > > >>> to correct. However when I try to mount the device
> > > > > > > > > > > >>> as ldiskfs kernel panics with following message:
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Assertion failure in cleanup_journal_tail() at
> > > > > > > > > > > >>> fs/jbd/checkpoint.c:459: "blocknr != 0"
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Hmm, not sure, maybe your journal is broken?  You can delete
> > > > > > > > > > > >>> it with "tune2fs -O ^has_journal" (maybe after running e2fsck
> > > > > > > > > > > >>> again to clear the journal), then re-create it with "tune2fs -j".
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> ----------- [cut here ] --------- [please bite here ] ---------
> > > > > > > > > > > >>> Kernel BUG at fs/jbd/checkpoint.c:459
> > > > > > > > > > > >>> invalid opcode: 0000 [1] SMP
> > > > > > > > > > > >>> last sysfs file: /class/infiniband_mad/umad0/port
> > > > > > > > > > > >>> CPU 2
> > > > > > > > > > > >>> Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U)
> > > > > > > > > > > >>> ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U)
> > > > > > > > > > > >>> ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U)
> > > > > > > > > > > >>> libcfs(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) rdma_ucm(U)
> > > > > > > > > > > >>> rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U)
> > > > > > > > > > > >>> ib_cm(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U)
> > > > > > > > > > > >>> ib_umad(U) mlx4_vnic(U) mlx4_vnic_helper(U) ib_sa(U) ib_mthca(U)
> > > > > > > > > > > >>> mptctl(U) dm_mirror(U) video(U) backlight(U) sbs(U)
> > > > > > > > > > > >>> power_meter(U) hwmon(U) i2c_ec(U) i2c_core(U) dell_wmi(U) wmi(U)
> > > > > > > > > > > >>> button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U)
> > > > > > > > > > > >>> parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) mlx4_ib(U)
> > > > > > > > > > > >>> ib_mad(U) ib_core(U) joydev(U) mlx4_core(U) usb_storage(U)
> > > > > > > > > > > >>> shpchp(U) i5000_edac(U) edac_mc(U) serio_raw(U) pcspkr(U)
> > > > > > > > > > > >>> dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U)
> > > > > > > > > > > >>> dm_mem_cache(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U)
> > > > > > > > > > > >>> mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) mppVhba(U)
> > > > > > > > > > > >>> megaraid_sas(U) mppUpper(U) sg(U) sd_mod(U) scsi_mod(U) bnx2(U)
> > > > > > > > > > > >>> ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
> > > > > > > > > > > >>> Pid: 13891, comm: mount Tainted: G 2.6.18-194.3.1.el5_lustre.1.8.4 #1
> > > > > > > > > > > >>> RIP: 0010:[<ffffffff88034a95>]  [<ffffffff88034a95>] :jbd:cleanup_journal_tail+0x9d/0x118
> > > > > > > > > > > >>> RSP: 0018:ffff81016f00da68  EFLAGS: 00010286
> > > > > > > > > > > >>> RAX: 000000000000005a RBX: ffff81012ca12c00 RCX: ffffffff80311da8
> > > > > > > > > > > >>> RDX: ffffffff80311da8 RSI: 0000000000000000 RDI: ffffffff80311da0
> > > > > > > > > > > >>> RBP: 0000000000000000 R08: ffffffff80311da8 R09: 0000000000000001
> > > > > > > > > > > >>> R10: 0000000000000000 R11: 0000000000000080 R12: 0000000000000002
> > > > > > > > > > > >>> R13: ffff81012ca12d4c R14: ffff81012ca12c24 R15: ffff81017a8d7400
> > > > > > > > > > > >>> FS:  00002abd7cef1f70(0000) GS:ffff810107b9acc0(0000) knlGS:0000000000000000
> > > > > > > > > > > >>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > > > > > > > > >>> CR2: 000000000042b000 CR3: 000000012813f000 CR4: 00000000000006e0
> > > > > > > > > > > >>> Process mount (pid: 13891, threadinfo ffff81016f00c000, task ffff81022e1b7820)
> > > > > > > > > > > >>> Stack:  0000000000000000 ffff81012ca12c00 ffff81017a8d7400 ffffffff88037690
> > > > > > > > > > > >>> ffff81012ca12c00 ffff8102034ff000 ffff81017a8d7400 0000000000000000
> > > > > > > > > > > >>> ffff8102034ff000 ffffffff88a9be56 0000000001000000 ffff8101bf788000
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Call Trace:
> > > > > > > > > > > >>> [<ffffffff88037690>] :jbd:journal_flush+0xbe/0x248
> > > > > > > > > > > >>> [<ffffffff88a9be56>] :ldiskfs:ldiskfs_mark_recovery_complete+0x36/0x90
> > > > > > > > > > > >>> [<ffffffff88aa02e0>] :ldiskfs:ldiskfs_fill_super+0x1790/0x1950
> > > > > > > > > > > >>> [<ffffffff800eccd2>] get_filesystem+0x12/0x3b
> > > > > > > > > > > >>> [<ffffffff800e343e>] test_bdev_super+0x0/0xd
> > > > > > > > > > > >>> [<ffffffff88a9eb50>] :ldiskfs:ldiskfs_fill_super+0x0/0x1950
> > > > > > > > > > > >>> [<ffffffff800e43fd>] get_sb_bdev+0x10a/0x16c
> > > > > > > > > > > >>> [<ffffffff800e3d9a>] vfs_kern_mount+0x93/0x11a
> > > > > > > > > > > >>> [<ffffffff800e3e63>] do_kern_mount+0x36/0x4d
> > > > > > > > > > > >>> [<ffffffff800ee601>] do_mount+0x6a9/0x719
> > > > > > > > > > > >>> [<ffffffff800090d2>] __handle_mm_fault+0x96f/0xfaa
> > > > > > > > > > > >>> [<ffffffff8002c9e0>] mntput_no_expire+0x19/0x89
> > > > > > > > > > > >>> [<ffffffff8000a72a>] __link_path_walk+0xf1e/0xf42
> > > > > > > > > > > >>> [<ffffffff800220ce>] __up_read+0x19/0x7f
> > > > > > > > > > > >>> [<ffffffff80066b88>] do_page_fault+0x4fe/0x874
> > > > > > > > > > > >>> [<ffffffff8002c9e0>] mntput_no_expire+0x19/0x89
> > > > > > > > > > > >>> [<ffffffff8000ea45>] link_path_walk+0xa6/0xb2
> > > > > > > > > > > >>> [<ffffffff800cc329>] zone_statistics+0x3e/0x6d
> > > > > > > > > > > >>> [<ffffffff8000f2cf>] __alloc_pages+0x78/0x308
> > > > > > > > > > > >>> [<ffffffff8004c68e>] sys_mount+0x8a/0xcd
> > > > > > > > > > > >>> [<ffffffff8005d28d>] tracesys+0xd5/0xe0
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Code: 0f 0b 68 3a 94 03 88 c2 cb 01 44 39 a3 58 01 00 00 75 0e c7
> > > > > > > > > > > >>> RIP  [<ffffffff88034a95>] :jbd:cleanup_journal_tail+0x9d/0x118
> > > > > > > > > > > >>> RSP <ffff81016f00da68>
> > > > > > > > > > > >>> <0>Kernel panic - not syncing: Fatal exception
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Any idea how to fix this?
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Many thanks
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Wojciech
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> On 21 October 2010 17:54, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> > > > > > > > > > > >>>> Thanks Ken, that worked.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> On 21 October 2010 17:39, Ken Hornstein <kenh at cmf.nrl.navy.mil> wrote:
> > > > > > > > > > > >>>>>> Now I have another problem. After last segfault I can not
> > > > > > > > > > > >>>>>> restart the fsck due to MMP.
> > > > > > > > > > > >>>>>> [...]
> > > > > > > > > > > >>>>>> Also when I try to access filesystem via debugfs it fails:
> > > > > > > > > > > >>>>>> debugfs -c -R 'ls' /dev/scratch2_ost16vg/ost16lv
> > > > > > > > > > > >>>>>> debugfs 1.41.10.sun2 (24-Feb-2010)
> > > > > > > > > > > >>>>>> /dev/scratch2_ost16vg/ost16lv: MMP: fsck being run while opening filesystem
> > > > > > > > > > > >>>>>> ls: Filesystem not open
> > > > > > > > > > > >>>>>>
> > > > > > > > > > > >>>>>> Is there a way to clear the MMP flag so it allows fsck to run?
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> You want tune2fs -f -E clear-mmp
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> --Ken
> > > > >
> > > > > --
> > > > > Bernd Schubert
> > > > > DataDirect Networks
> > >
> > > --
> > > Bernd Schubert
> > > DataDirect Networks
> 
> 
> --
> Bernd Schubert
> DataDirect Networks

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.