[lustre-discuss] (LFSCK) LBUG: ASSERTION( get_current()->journal_info == ((void *)0) ) failed - (ungracefully) SOLVED

Cédric Dufour - Idiap Research Institute cedric.dufour at idiap.ch
Thu Sep 15 07:08:37 PDT 2016


Hello,


After looking at the LFSCK source code - specifically
lustre/lfsck/lfsck_namespace.c - I was led to believe that deleting the
"lfsck_namespace" file from the MDT's underlying LDISKFS should be safe
enough (NB: I had a backup of the MDT on a disconnected DRBD peer for
the worst-case scenario):

# mount -t ldiskfs -o rw /dev/mdt /mnt/mdt.ldiskfs
# rm -f /mnt/mdt.ldiskfs/lfsck_namespace
# umount /mnt/mdt.ldiskfs
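
A lighter-weight safety net would presumably have been to dump just
that file with debugfs before deleting it - an untested sketch, using
the same device and an arbitrary destination path:

# debugfs -c -R 'dump lfsck_namespace /root/lfsck_namespace.bak' /dev/mdt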


This allowed me to start the MDT again. The "lfsck_namespace" file
reappeared with a "blank" payload:

# mount.lustre -o rw /dev/mdt /lustre/mdt

# debugfs -c -R 'stat lfsck_namespace' /dev/drbd3 2>/dev/null
Inode: 156   Type: regular    Mode:  0644   Flags: 0x0
Generation: 1420451772    Version: 0x00000000:00000000
User:     0   Group:     0   Size: 8192
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 16
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
 atime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
 mtime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
crtime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
Size of extra inode fields: 28
Extended attributes stored in inode body:
  lma = "00 00 00 00 00 00 00 00 03 00 00 00 02 00 00 00 0a 00 00 00 00
00 00 00 " (24)
  lma: fid=[0x200000003:0xa:0x0] compat=0 incompat=0
  lfsck_namespace = "03 9d 62 a0 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (256)
BLOCKS:
(0-1):178240-178241
TOTAL: 2

# cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: init
flags:
param: <NULL>
time_since_last_completed: N/A
time_since_latest_start: N/A
time_since_last_checkpoint: N/A
latest_start_position: N/A, N/A, N/A
last_checkpoint_position: N/A, N/A, N/A
first_failure_position: N/A, N/A, N/A
checked_phase1: 0
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 0
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 0
run_time_phase1: 0 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 0 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A


Rather ungraceful - and I don't know where this would have led me if I
hadn't used the "--dryrun on" flag on the LFSCK command (or was that
flag actually ignored? It did not appear as a "param: ..." entry in the
lfsck_namespace output while LFSCK was running) - but it got me out of
the present deadlock.
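
Should I dare re-run the namespace scan later on, it would presumably
be along the same lines as before, watching the proc file from a second
terminal (same command as in my original post below):

# lctl lfsck_start --device lustre-1-MDT0000 --dryrun on --type namespace
# watch -n 1 cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace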

Forgot to mention: (still...) running Lustre 2.5.2


Best,


Cédric



On 15/09/16 08:05, Cédric Dufour - Idiap Research Institute wrote:
> Hello,
>
> On 14/09/16 20:58, Bernd Schubert wrote:
>> Hi Cédric,
>>
>> I'm by no means familiar with the Lustre code anymore, but based on the stack
>> trace and function names, it seems to be a problem with the journal. Maybe try
>> to do an 'e2fsck -f', which would replay the journal and possibly clean up the
>> file it has a problem with.
> Thanks for the tip.
>
> Unfortunately, I had already performed a filesystem check as part of my recovery attempts (and even ran a dry-run afterwards, to make sure no errors remained).
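>
> For the record, the check and the subsequent dry-run were presumably along
> these lines (options quoted from memory, so illustrative only):
>
> # e2fsck -f /dev/mdt    # forced full check; replays the journal first
> # e2fsck -fn /dev/mdt   # read-only re-check, to confirm nothing remains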
>
> Cédric
>
>
>>
>> Cheers,
>> Bernd
>>
>>
>> On Wednesday, September 14, 2016 9:28:38 AM CEST Cédric Dufour - Idiap 
>> Research Institute wrote:
>>> Hello,
>>>
>>> Last Friday, during normal operations, our MDS froze with the following
>>> LBUG, which recurs as soon as the MDT is mounted again:
>>>
>>> Sep 13 15:10:28 n00a kernel: [ 8414.600584] LustreError:
>>> 11696:0:(osd_handler.c:936:osd_trans_start()) ASSERTION(
>>> get_current()->journal_info == ((void *)0) ) failed: Sep 13 15:10:28
>>> n00a kernel: [ 8414.612825] LustreError:
>>> 11696:0:(osd_handler.c:936:osd_trans_start()) LBUG
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619833] Pid: 11696, comm: lfsck
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619835] Sep 13 15:10:28 n00a kernel:
>>> [ 8414.619835] Call Trace:
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619850]  [<ffffffffa0224822>]
>>> libcfs_debug_dumpstack+0x52/0x80 [libcfs]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619857]  [<ffffffffa0224db2>]
>>> lbug_with_loc+0x42/0xa0 [libcfs]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619864]  [<ffffffffa0b11890>]
>>> osd_trans_start+0x250/0x630 [osd_ldiskfs]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619870]  [<ffffffffa0b0e748>] ?
>>> osd_declare_xattr_set+0x58/0x230 [osd_ldiskfs]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619876]  [<ffffffffa0c6ffc7>]
>>> lod_trans_start+0x177/0x200 [lod]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619881]  [<ffffffffa0cbd752>]
>>> lfsck_namespace_double_scan+0x1122/0x1e50 [lfsck]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619888]  [<ffffffff8136741b>] ?
>>> thread_return+0x3e/0x10c
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619894]  [<ffffffff81038b87>] ?
>>> enqueue_task_fair+0x58/0x5d
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619899]  [<ffffffffa0cb68ea>]
>>> lfsck_double_scan+0x5a/0x70 [lfsck]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619904]  [<ffffffffa0cb7dfd>]
>>> lfsck_master_engine+0x50d/0x650 [lfsck]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619909]  [<ffffffffa0cb78f0>] ?
>>> lfsck_master_engine+0x0/0x650 [lfsck]
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619915]  [<ffffffff810534c4>]
>>> kthread+0x7b/0x83
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619918]  [<ffffffff810369d3>] ?
>>> finish_task_switch+0x48/0xb9
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619924]  [<ffffffff8101092a>]
>>> child_rip+0xa/0x20
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619928]  [<ffffffff81053449>] ?
>>> kthread+0x0/0x83
>>> Sep 13 15:10:28 n00a kernel: [ 8414.619931]  [<ffffffff81010920>] ?
>>> child_rip+0x0/0x20
>>>
>>>
>>> I originally had the LFSCK launched in "dry-run" mode:
>>>
>>> lctl lfsck_start --device lustre-1-MDT0000 --dryrun on --type namespace
>>>
>>> The LFSCK was reported completed (I was 'watch[ing] -n 1' on a terminal)
>>> before the LBUG popped up; now, I can't even get any output:
>>>
>>> cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace  # just hangs
>>> there indefinitely
>>>
>>>
>>> I remember seeing an lfsck_namespace file in the MDT's underlying LDISKFS;
>>> is there anything sensible I can do with it (e.g. would deleting it
>>> resolve the situation)?
>>> What else could I do?
>>>
>>>
>>> Thanks for your answers and best regards,
>>>
>>> Cédric D.
>>>
>>>
>>> PS: I had originally posted this message on the HPDD-discuss mailing list
>>> and just realized it was the wrong place; sorry for the cross-posting.


