[Lustre-discuss] recovering formatted OST

Andreas Dilger andreas.dilger at oracle.com
Fri Oct 22 11:40:18 PDT 2010


On 2010-10-22, at 12:25, Wojciech Turek wrote:
> Actually, I remember now: Andreas wrote some time ago that when one adds an OST into the same slot as the old one, the MDS will assume the OST has objects up to the last ID the old OST had, and when the new OST starts it will recreate those objects, which may use a lot of inodes and space. So a loop device or ramdisk may not be enough for that?

The ll_recover_lost_found_objs tool will at least recreate the O/0/LAST_ID file with the highest available object ID, but given the corruption of the filesystem this may not cover all of the objects previously created.  I would suggest reading the last_id for this OST from the MDS:

mds> lctl get_param osc.*.prealloc_last_id

and then use a binary editor to set the LAST_ID on the recovered OST, if it is significantly different.
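
If a binary editor is not to hand, the same edit can be sketched in a few lines of Python. This is only a sketch under stated assumptions: it assumes O/0/LAST_ID holds a single 64-bit little-endian object ID (as on x86 servers), and the helper names read_last_id and write_last_id are made up for illustration; they are not part of any Lustre tool.

```python
import struct
from pathlib import Path

# Assumption: LAST_ID stores one unsigned 64-bit little-endian object ID.
LAST_ID_FORMAT = "<Q"
LAST_ID_SIZE = struct.calcsize(LAST_ID_FORMAT)


def read_last_id(path):
    """Return the object ID stored in the first 8 bytes of a LAST_ID file."""
    data = Path(path).read_bytes()
    if len(data) < LAST_ID_SIZE:
        raise ValueError(f"{path}: expected at least {LAST_ID_SIZE} bytes, got {len(data)}")
    return struct.unpack_from(LAST_ID_FORMAT, data)[0]


def write_last_id(path, new_id, backup=True):
    """Overwrite a LAST_ID file with new_id, optionally keeping a .bak copy."""
    target = Path(path)
    if backup and target.exists():
        # Keep the previous contents next to the file so the edit can be undone.
        target.with_name(target.name + ".bak").write_bytes(target.read_bytes())
    target.write_bytes(struct.pack(LAST_ID_FORMAT, new_id))
```

With the OST mounted as ldiskfs (for example on /mnt/ost, and not in use by Lustre), read_last_id("/mnt/ost/O/0/LAST_ID") would show the current value for comparison with the prealloc_last_id reported by the MDS, and write_last_id would only be called if the two differ significantly.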

> On 22 October 2010 19:11, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
>> Thanks Bernd, I will give it a go, for some reason I thought that this --index parameter didn't work in lustre.
>> 
>> 
>> On 22 October 2010 19:05, Bernd Schubert <bernd.schubert at fastmail.fm> wrote:
> Er no, mkfs.lustre --index=${the_right_index}.
> 
> 
> Cheers,
> Bernd
> 
> On Friday, October 22, 2010, Wojciech Turek wrote:
> > Ok, but this means that the new OST will come up with a new index (next
> > available). Maybe this is a stupid question, but how will the MDS know
> > that the missing files now reside on a new OST?
> >
> > On 22 October 2010 18:52, Bernd Schubert <bernd.schubert at fastmail.fm> wrote:
> > > Hmm, I would probably format a small fake device on a ramdisk and copy
> > > files over, run tunefs.lustre --writeconf /mdt and then start everything
> > > (including all OSTs) again.
> > >
> > >
> > > Cheers,
> > >
> > > On Friday, October 22, 2010, Wojciech Turek wrote:
> > > > I have tried Bernd's suggestion and it seems to have worked: after
> > > > running e2fsck -D, ll_recover_lost_found_objs didn't cause a kernel
> > > > panic but moved a number of objects to the O directory. The problem is
> > > > that I do not have a last_rcvd file, so the OST has no index at the
> > > > moment. What would be the next step to enable access to those files
> > > > in the filesystem?
> > > >
> > > > Best regards,
> > > >
> > > > Wojciech
> > > >
> > > > On 22 October 2010 17:15, Andreas Dilger <andreas.dilger at oracle.com> wrote:
> > > > > On 2010-10-22, at 5:42, Bernd Schubert <bernd.schubert at fastmail.fm> wrote:
> > > > > > Hmm, e2fsck didn't catch that? rec_len is the length of a directory
> > > > > > entry, i.e. the number of bytes after which the next entry follows.
> > > > >
> > > > > I agree that e2fsck should have caught that.
> > > > >
> > > > > > You can try to force e2fsck to do
> > > > > > something about that: e2fsck -D
> > > > >
> > > > > No, I would recommend against using -D at this point. That will cause
> > > > > it to re-write the directory contents, and given that the filesystem was
> > > > > previously corrupted I would prefer making as few changes as possible
> > > > > before the data is estranged.
> > > > >
> > > > > Wojciech,
> > > > > note that if you are able to mount the filesystem you could just copy
> > > > > all of the objects (with xattrs!) from lost+found on the bad filesystem,
> > > > > along with the last_rcvd file (if you can find it) into a new ldiskfs
> > > > > filesystem and then run ll_recover_lost_found_objs on that.
> > > > >
> > > > > > On Friday, October 22, 2010, Wojciech Turek wrote:
> > > > > > >> Ok, removing and recreating the journal fixed that problem and I
> > > > > > >> am able to mount the device as an ldiskfs filesystem. Now I hit
> > > > > > >> another wall when trying to run ll_recover_lost_found_objs.
> > > > > > >> The first time I run ll_recover_lost_found_objs -d
> > > > > > >> /mnt/ost/lost+found it only creates the O dir and exits. When I
> > > > > > >> repeat this command the kernel panics. Any idea what could be the
> > > > > > >> problem here?
> > > > > >>
> > > > > >>
> > > > > > >> LDISKFS-fs error (device dm-4): ldiskfs_readdir: bad entry in
> > > > > > >> directory #6831: rec_len is smaller than minimal - offset=0,
> > > > > > >> inode=0, rec_len=0, name_len=0
> > > > > > >> Aborting journal on device dm-4.
> > > > > > >> Unable to handle kernel NULL pointer dereference at
> > > > > > >> 0000000000000000 RIP:
> > > > > > >> [<ffffffff88033448>] :jbd:journal_commit_transaction+0xc5b/0x12db
> > > > > >> PGD 1a118d067 PUD 1ce7e7067 PMD 0
> > > > > >> Oops: 0002 [1] SMP
> > > > > >> last sysfs file: /class/infiniband_mad/umad0/port
> > > > > >> CPU 3
> > > > > > >> Modules linked in: ldiskfs(U) crc16(U) autofs4(U) hidp(U) l2cap(U)
> > > > > > >> bluetooth(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U)
> > > > > > >> ib_ipoib(U) ipoib_helper(U) ib_cm(U) ipv6(U) xfrm_nalgo(U)
> > > > > > >> crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_vnic(U)
> > > > > > >> mlx4_vnic_helper(U) ib_sa(U) ib_mthca(U) mptctl(U) dm_mirror(U)
> > > > > > >> video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U)
> > > > > > >> i2c_core(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U)
> > > > > > >> acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sr_mod(U)
> > > > > > >> cdrom(U) mlx4_ib(U) ib_mad(U) ib_core(U) joydev(U) mlx4_core(U)
> > > > > > >> usb_storage(U) pcspkr(U) shpchp(U) serio_raw(U) i5000_edac(U)
> > > > > > >> edac_mc(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U)
> > > > > > >> dm_mod(U) dm_mem_cache(U) nfs(U) lockd(U) fscache(U) nfs_acl(U)
> > > > > > >> sunrpc(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U)
> > > > > > >> mppVhba(U) megaraid_sas(U) mppUpper(U) sg(U) sd_mod(U) scsi_mod(U)
> > > > > > >> bnx2(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
> > > > > > >> Pid: 11360, comm: kjournald Tainted: G      2.6.18-194.3.1.el5_lustre.1.8.4 #1
> > > > > > >> RIP: 0010:[<ffffffff88033448>]  [<ffffffff88033448>] :jbd:journal_commit_transaction+0xc5b/0x12db
> > > > > >>
> > > > > >> RSP: 0018:ffff8101c6481d90  EFLAGS: 00010246
> > > > > >> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000ffffffff
> > > > > >> RDX: 0000000000000000 RSI: ffff8101e9dab0c0 RDI: ffff81022fa46000
> > > > > >> RBP: ffff81022fa46000 R08: ffff81022fa46068 R09: 0000000000000000
> > > > > >> R10: ffff810105925b20 R11: 00000000fffffffa R12: 0000000000000000
> > > > > >> R13: 0000000000000000 R14: ffff8101e9dab0c0 R15: 0000000000000000
> > > > > > >> FS:  0000000000000000(0000) GS:ffff810107b9a4c0(0000) knlGS:0000000000000000
> > > > > > >> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > > > > > >> CR2: 0000000000000000 CR3: 00000001eaffb000 CR4: 00000000000006e0
> > > > > > >> Process kjournald (pid: 11360, threadinfo ffff8101c6480000, task ffff81021c14c0c0)
> > > > > > >> Stack:  ffff8101a61b9000 000000002b8263c0 ffffffff00000000 0000000000000000
> > > > > > >>  0000113b00000001 0000000000000013 0000000000000000 0000000000000111
> > > > > > >>  0000000000000000 0000000000000000 0000000001282dd7 00000000000020dd
> > > > > > >> Call Trace:
> > > > > >> [<ffffffff8003da91>] lock_timer_base+0x1b/0x3c
> > > > > >> [<ffffffff8004b347>] try_to_del_timer_sync+0x7f/0x88
> > > > > >> [<ffffffff88037386>] :jbd:kjournald+0xc1/0x213
> > > > > >> [<ffffffff800a0ab2>] autoremove_wake_function+0x0/0x2e
> > > > > >> [<ffffffff800a089a>] keventd_create_kthread+0x0/0xc4
> > > > > >> [<ffffffff880372c5>] :jbd:kjournald+0x0/0x213
> > > > > >> [<ffffffff800a089a>] keventd_create_kthread+0x0/0xc4
> > > > > >> [<ffffffff80032890>] kthread+0xfe/0x132
> > > > > >> [<ffffffff8005dfb1>] child_rip+0xa/0x11
> > > > > >> [<ffffffff800a089a>] keventd_create_kthread+0x0/0xc4
> > > > > >> [<ffffffff8014bcf4>] deadline_queue_empty+0x0/0x23
> > > > > >> [<ffffffff80032792>] kthread+0x0/0x132
> > > > > >> [<ffffffff8005dfa7>] child_rip+0x0/0x11
> > > > > >>
> > > > > >>
> > > > > >> Code: f0 0f ba 33 01 e8 42 fc 02 f8 8b 03 a8 04 75 07 8b 43 58 85
> > > > > > >> RIP  [<ffffffff88033448>] :jbd:journal_commit_transaction+0xc5b/0x12db
> > > > > > >> RSP <ffff8101c6481d90>
> > > > > >> CR2: 0000000000000000
> > > > > >> <0>Kernel panic - not syncing: Fatal exception
> > > > > >>
> > > > > > >> On 22 October 2010 03:09, Andreas Dilger <andreas.dilger at oracle.com> wrote:
> > > > > > >>> On 2010-10-21, at 18:44, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> > > > > >>>
> > > > > > >>> fsck has finished and does not find any more errors to correct.
> > > > > > >>> However, when I try to mount the device as ldiskfs the kernel
> > > > > > >>> panics with the following message:
> > > > > >>>
> > > > > >>> Assertion failure in cleanup_journal_tail() at
> > > > > >>> fs/jbd/checkpoint.c:459: "blocknr != 0"
> > > > > >>>
> > > > > >>>
> > > > > > >>> Hmm, not sure, maybe your journal is broken?  You can delete it
> > > > > > >>> with "tune2fs -O ^has_journal" (maybe after running e2fsck again
> > > > > > >>> to clear the journal), then re-create it with "tune2fs -j".
> > > > > >>>
> > > > > >>> ----------- [cut here ] --------- [please bite here ] ---------
> > > > > >>> Kernel BUG at fs/jbd/checkpoint.c:459
> > > > > >>> invalid opcode: 0000 [1] SMP
> > > > > > >>> last sysfs file: /class/infiniband_mad/umad0/port
> > > > > >>> CPU 2
> > > > > > >>> Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U)
> > > > > > >>> ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U)
> > > > > > >>> ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U)
> > > > > > >>> libcfs(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) rdma_ucm(U)
> > > > > > >>> rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U)
> > > > > > >>> ib_cm(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U)
> > > > > > >>> ib_umad(U) mlx4_vnic(U) mlx4_vnic_helper(U) ib_sa(U) ib_mthca(U)
> > > > > > >>> mptctl(U) dm_mirror(U) video(U) backlight(U) sbs(U)
> > > > > > >>> power_meter(U) hwmon(U) i2c_ec(U) i2c_core(U) dell_wmi(U) wmi(U)
> > > > > > >>> button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U)
> > > > > > >>> parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) mlx4_ib(U)
> > > > > > >>> ib_mad(U) ib_core(U) joydev(U) mlx4_core(U) usb_storage(U)
> > > > > > >>> shpchp(U) i5000_edac(U) edac_mc(U) serio_raw(U) pcspkr(U)
> > > > > > >>> dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U)
> > > > > > >>> dm_mem_cache(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U)
> > > > > > >>> mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) mppVhba(U)
> > > > > > >>> megaraid_sas(U) mppUpper(U) sg(U) sd_mod(U) scsi_mod(U) bnx2(U)
> > > > > > >>> ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
> > > > > > >>> Pid: 13891, comm: mount Tainted: G      2.6.18-194.3.1.el5_lustre.1.8.4 #1
> > > > > > >>> RIP: 0010:[<ffffffff88034a95>]  [<ffffffff88034a95>] :jbd:cleanup_journal_tail+0x9d/0x118
> > > > > >>>
> > > > > >>> RSP: 0018:ffff81016f00da68  EFLAGS: 00010286
> > > > > >>> RAX: 000000000000005a RBX: ffff81012ca12c00 RCX: ffffffff80311da8
> > > > > >>> RDX: ffffffff80311da8 RSI: 0000000000000000 RDI: ffffffff80311da0
> > > > > >>> RBP: 0000000000000000 R08: ffffffff80311da8 R09: 0000000000000001
> > > > > >>> R10: 0000000000000000 R11: 0000000000000080 R12: 0000000000000002
> > > > > >>> R13: ffff81012ca12d4c R14: ffff81012ca12c24 R15: ffff81017a8d7400
> > > > > >>> FS:  00002abd7cef1f70(0000) GS:ffff810107b9acc0(0000)
> > > > > >>> knlGS:0000000000000000
> > > > > >>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > > >>> CR2: 000000000042b000 CR3: 000000012813f000 CR4: 00000000000006e0
> > > > > >>> Process mount (pid: 13891, threadinfo ffff81016f00c000, task
> > > > > >>> ffff81022e1b7820)
> > > > > > >>> Stack:  0000000000000000 ffff81012ca12c00 ffff81017a8d7400 ffffffff88037690
> > > > > > >>>  ffff81012ca12c00 ffff8102034ff000 ffff81017a8d7400 0000000000000000
> > > > > > >>>  ffff8102034ff000 ffffffff88a9be56 0000000001000000 ffff8101bf788000
> > > > > > >>>
> > > > > > >>> Call Trace:
> > > > > > >>> [<ffffffff88037690>] :jbd:journal_flush+0xbe/0x248
> > > > > > >>> [<ffffffff88a9be56>] :ldiskfs:ldiskfs_mark_recovery_complete+0x36/0x90
> > > > > >>> [<ffffffff88aa02e0>] :ldiskfs:ldiskfs_fill_super+0x1790/0x1950
> > > > > >>> [<ffffffff800eccd2>] get_filesystem+0x12/0x3b
> > > > > >>> [<ffffffff800e343e>] test_bdev_super+0x0/0xd
> > > > > >>> [<ffffffff88a9eb50>] :ldiskfs:ldiskfs_fill_super+0x0/0x1950
> > > > > >>> [<ffffffff800e43fd>] get_sb_bdev+0x10a/0x16c
> > > > > >>> [<ffffffff800e3d9a>] vfs_kern_mount+0x93/0x11a
> > > > > >>> [<ffffffff800e3e63>] do_kern_mount+0x36/0x4d
> > > > > >>> [<ffffffff800ee601>] do_mount+0x6a9/0x719
> > > > > >>> [<ffffffff800090d2>] __handle_mm_fault+0x96f/0xfaa
> > > > > >>> [<ffffffff8002c9e0>] mntput_no_expire+0x19/0x89
> > > > > >>> [<ffffffff8000a72a>] __link_path_walk+0xf1e/0xf42
> > > > > >>> [<ffffffff800220ce>] __up_read+0x19/0x7f
> > > > > >>> [<ffffffff80066b88>] do_page_fault+0x4fe/0x874
> > > > > >>> [<ffffffff8002c9e0>] mntput_no_expire+0x19/0x89
> > > > > >>> [<ffffffff8000ea45>] link_path_walk+0xa6/0xb2
> > > > > >>> [<ffffffff800cc329>] zone_statistics+0x3e/0x6d
> > > > > >>> [<ffffffff8000f2cf>] __alloc_pages+0x78/0x308
> > > > > >>> [<ffffffff8004c68e>] sys_mount+0x8a/0xcd
> > > > > >>> [<ffffffff8005d28d>] tracesys+0xd5/0xe0
> > > > > >>>
> > > > > >>> Code: 0f 0b 68 3a 94 03 88 c2 cb 01 44 39 a3 58 01 00 00 75 0e c7
> > > > > >>> RIP  [<ffffffff88034a95>] :jbd:cleanup_journal_tail+0x9d/0x118
> > > > > >>>
> > > > > >>> RSP <ffff81016f00da68>
> > > > > >>> <0>Kernel panic - not syncing: Fatal exception
> > > > > >>>
> > > > > >>> Any idea how to fix this?
> > > > > >>>
> > > > > >>> Many thanks
> > > > > >>>
> > > > > >>> Wojciech
> > > > > >>>
> > > > > >>>
> > > > > > >>> On 21 October 2010 17:54, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> > > > > >>>> Thanks Ken, that worked.
> > > > > >>>>
> > > > > >>>>
> > > > > > >>>> On 21 October 2010 17:39, Ken Hornstein <kenh at cmf.nrl.navy.mil> wrote:
> > > > > > >>>>>> Now I have another problem. After the last segfault I cannot
> > > > > > >>>>>> restart the fsck due to MMP.
> > > > > >>>>>> [...]
> > > > > >>>>>> Also when I try to access filesystem via debugfs it fails:
> > > > > >>>>>>
> > > > > >>>>>> debugfs -c -R 'ls' /dev/scratch2_ost16vg/ost16lv
> > > > > >>>>>> debugfs 1.41.10.sun2 (24-Feb-2010)
> > > > > > >>>>>> /dev/scratch2_ost16vg/ost16lv: MMP: fsck being run while
> > > > > > >>>>>> opening filesystem
> > > > > > >>>>>> ls: Filesystem not open
> > > > > > >>>>>>
> > > > > > >>>>>> Is there a way to clear the MMP flag so it allows fsck to run?
> > > > > >>>>>
> > > > > >>>>> You want tune2fs -f -E clear-mmp
> > > > > >>>>>
> > > > > >>>>> --Ken
> 


Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
