[lustre-discuss] (LFSCK) LBUG: ASSERTION( get_current()->journal_info == ((void *)0) ) failed - (ungracefully) SOLVED

Cédric Dufour - Idiap Research Institute cedric.dufour at idiap.ch
Mon Sep 19 05:52:52 PDT 2016


Hello Nasf,

Below the entire ouput of LFSCK start/stop, using the '--dryrun' option
and its '-n' shortcut:

root at n00b:~ # lctl lfsck_start --device lustre-2-MDT0000 --dryrun on
--type namespace
Started LFSCK on the device lustre-2-MDT0000: namespace.
# 2016-09-19 14:47:14

root at n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase1
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937135 seconds
time_since_latest_start: 11 seconds
time_since_last_checkpoint: N/A
latest_start_position: 16, N/A, N/A
last_checkpoint_position: N/A, N/A, N/A
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 147222
checked_phase2: 0
updated_phase1: 1585
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 2923
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 11 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 13383 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 12842 items/sec
real-time_speed_phase2: N/A
current_position: 81972, [0x200014ba1:0xc1a:0x0], 3037912543715234557
# 2016-09-19 14:47:25

root at n00b:~ # lctl lfsck_stop --device lustre-2-MDT0000
Stopped LFSCK on the device lustre-2-MDT0000.
# 2016-09-19 14:47:34

root at n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: stopped
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937147 seconds
time_since_latest_start: 23 seconds
time_since_last_checkpoint: 3 seconds
latest_start_position: 16, N/A, N/A
last_checkpoint_position: 562059, [0x2000014f0:0x12ea5:0x0],
955442147668871112
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 305529
checked_phase2: 0
updated_phase1: 9172
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 5339
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 20 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 15276 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A
# 2016-09-19 14:47:37

root at n00b:~ # lctl lfsck_start --device lustre-2-MDT0000 -n on --type
namespace
Started LFSCK on the device lustre-2-MDT0000: namespace.
# 2016-09-19 14:47:50

root at n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: scanning-phase1
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937163 seconds
time_since_latest_start: 3 seconds
time_since_last_checkpoint: 19 seconds
latest_start_position: 562060, [0x2000014f0:0x12ea5:0x0], 955476103740388659
last_checkpoint_position: 562059, [0x2000014f0:0x12ea5:0x0],
955442147668871112
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 419903
checked_phase2: 0
updated_phase1: 59322
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 5506
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 23 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 18256 items/sec
average_speed_phase2: N/A
real-time_speed_phase1: 44125 items/sec
real-time_speed_phase2: N/A
current_position: 622857, N/A, N/A
# 2016-09-19 14:47:53

root at n00b:~ # lctl lfsck_stop --device lustre-2-MDT0000
Stopped LFSCK on the device lustre-2-MDT0000.
# 2016-09-19 14:47:56

root at n00b:~ # cat /proc/fs/lustre/mdd/lustre-2-MDT0000/lfsck_namespace
name: lfsck_namespace
magic: 0xa0629d03
version: 2
status: stopped
flags: inconsistent,upgrade
param: <NULL>
time_since_last_completed: 937169 seconds
time_since_latest_start: 9 seconds
time_since_last_checkpoint: 3 seconds
latest_start_position: 562060, [0x2000014f0:0x12ea5:0x0], 955476103740388659
last_checkpoint_position: 646716, N/A, N/A
first_failure_position: 4957, [0x135d:0x395d90f2:0x0], 1835421993583260877
checked_phase1: 456954
checked_phase2: 0
updated_phase1: 59322
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
dirs: 5510
M-linked: 0
nlinks_repaired: 0
lost_found: 0
success_count: 1
run_time_phase1: 26 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 17575 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A
# 2016-09-19 14:47:59


Given my previous mileage, I'd rather not let LFSCK run to completion again.

Best,

Cédric


On 18/09/16 03:05, Yong, Fan wrote:
> To be clear, what are the console and proc output if "-n on" specified when start the namespace LFSCK?
>
> --
> Regards,
> Nasf
>
> -----Original Message-----
> From: lustre-discuss [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Cédric Dufour - Idiap Research Institute
> Sent: Thursday, September 15, 2016 10:09 PM
> To: lustre-discuss at lists.lustre.org
> Subject: Re: [lustre-discuss] (LFSCK) LBUG: ASSERTION( get_current()->journal_info == ((void *)0) ) failed - (ungracefully) SOLVED
>
> Hello,
>
>
> After looking at the LFSCK source code - specifically lustre/lfsck/lfsck_namespace.c - I was led to believe that suppressing the "lfsck_namespace" file in the MDT LDISKFS should be safe enough (NB:
> I had a backup of the MDT on a disconnected DRBD peer for the worst case
> scenario):
>
> # mount -t ldiskfs -o rw /dev/mdt /mnt/mdt.ldiskfs # rm -f /mnt/mdt.ldiskfs/lfsck_namespace # umount /mnt/mdt.ldiskfs
>
>
> This allowed to start the MDT again. The "lfsck_namespace" reappeared with a "blank" payload:
>
> # mount.lustre -o rw /dev/mdt /lustre/mdt
>
> # debugfs -c -R 'stat lfsck_namespace' /dev/drbd3 2>/dev/null
> Inode: 156   Type: regular    Mode:  0644   Flags: 0x0
> Generation: 1420451772    Version: 0x00000000:00000000
> User:     0   Group:     0   Size: 8192
> File ACL: 0    Directory ACL: 0
> Links: 1   Blockcount: 16
> Fragment:  Address: 0    Number: 0    Size: 0
>  ctime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
>  atime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
>  mtime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016
> crtime: 0x57da4e9f:7e34390c -- Thu Sep 15 09:32:47 2016 Size of extra inode fields: 28 Extended attributes stored in inode body:
>   lma = "00 00 00 00 00 00 00 00 03 00 00 00 02 00 00 00 0a 00 00 00 00
> 00 00 00 " (24)
>   lma: fid=[0x200000003:0xa:0x0] compat=0 incompat=0
>   lfsck_namespace = "03 9d 62 a0 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 " (256)
> BLOCKS:
> (0-1):178240-178241
> TOTAL: 2
>
> # cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace
> name: lfsck_namespace
> magic: 0xa0629d03
> version: 2
> status: init
> flags:
> param: <NULL>
> time_since_last_completed: N/A
> time_since_latest_start: N/A
> time_since_last_checkpoint: N/A
> latest_start_position: N/A, N/A, N/A
> last_checkpoint_position: N/A, N/A, N/A
> first_failure_position: N/A, N/A, N/A
> checked_phase1: 0
> checked_phase2: 0
> updated_phase1: 0
> updated_phase2: 0
> failed_phase1: 0
> failed_phase2: 0
> dirs: 0
> M-linked: 0
> nlinks_repaired: 0
> lost_found: 0
> success_count: 0
> run_time_phase1: 0 seconds
> run_time_phase2: 0 seconds
> average_speed_phase1: 0 items/sec
> average_speed_phase2: 0 objs/sec
> real-time_speed_phase1: N/A
> real-time_speed_phase2: N/A
> current_position: N/A
>
>
> Rather ungraceful - I don't know where that would have led me if I hadn't use the "--dryrun on" flag on the LFSCK command (or was that flag actually ignored? for it did not appear as a "param: ..." in the lfsck_namespace output when LFSCK was running) - but got me out of the present deadlock.
>
> Forgot to mention: (still...) running Lustre 2.5.2
>
>
> Best,
>
>
> Cédric
>
>
>
> On 15/09/16 08:05, Cédric Dufour - Idiap Research Institute wrote:
>> Hello,
>>
>> On 14/09/16 20:58, Bernd Schubert wrote:
>>> Hi Cédric,
>>>
>>> I'm by no means familiar with Lustre code anymore, but based on the 
>>> stack trace and function names, it seems to be a problem with the 
>>> journal. Maybe try to do an 'efsck -f' which would replay the journal 
>>> and possibly clean up the file it has problem with.
>> Thanks for the tip.
>>
>> Unfortunately, I did perform a filesystem check as part of my attempts for recovery (and even ran a dry-run afterwards, to make sure no errors were dangling).
>>
>> Cédric
>>
>>
>>> Cheers,
>>> Bernd
>>>
>>>
>>> On Wednesday, September 14, 2016 9:28:38 AM CEST Cédric Dufour - 
>>> Idiap Research Institute wrote:
>>>> Hello,
>>>>
>>>> Last Friday, during normal operations, our MDS froze with the 
>>>> following LBUG, which happens again as soon as one mounts the MDT again:
>>>>
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.600584] LustreError:
>>>> 11696:0:(osd_handler.c:936:osd_trans_start()) ASSERTION( 
>>>> get_current()->journal_info == ((void *)0) ) failed: Sep 13 15:10:28 
>>>> n00a kernel: [ 8414.612825] LustreError:
>>>> 11696:0:(osd_handler.c:936:osd_trans_start()) LBUG Sep 13 15:10:28 
>>>> n00a kernel: [ 8414.619833] Pid: 11696, comm: lfsck Sep 13 15:10:28 
>>>> n00a kernel: [ 8414.619835] Sep 13 15:10:28 n00a kernel:
>>>> [ 8414.619835] Call Trace:
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619850]  [<ffffffffa0224822>]
>>>> libcfs_debug_dumpstack+0x52/0x80 [libcfs] Sep 13 15:10:28 n00a 
>>>> kernel: [ 8414.619857]  [<ffffffffa0224db2>]
>>>> lbug_with_loc+0x42/0xa0 [libcfs]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619864]  [<ffffffffa0b11890>]
>>>> osd_trans_start+0x250/0x630 [osd_ldiskfs] Sep 13 15:10:28 n00a 
>>>> kernel: [ 8414.619870]  [<ffffffffa0b0e748>] ?
>>>> osd_declare_xattr_set+0x58/0x230 [osd_ldiskfs] Sep 13 15:10:28 n00a 
>>>> kernel: [ 8414.619876]  [<ffffffffa0c6ffc7>]
>>>> lod_trans_start+0x177/0x200 [lod]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619881]  [<ffffffffa0cbd752>]
>>>> lfsck_namespace_double_scan+0x1122/0x1e50 [lfsck] Sep 13 15:10:28 
>>>> n00a kernel: [ 8414.619888]  [<ffffffff8136741b>] ?
>>>> thread_return+0x3e/0x10c
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619894]  [<ffffffff81038b87>] ?
>>>> enqueue_task_fair+0x58/0x5d
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619899]  [<ffffffffa0cb68ea>]
>>>> lfsck_double_scan+0x5a/0x70 [lfsck]
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619904]  [<ffffffffa0cb7dfd>]
>>>> lfsck_master_engine+0x50d/0x650 [lfsck] Sep 13 15:10:28 n00a kernel: 
>>>> [ 8414.619909]  [<ffffffffa0cb78f0>] ?
>>>> lfsck_master_engine+0x0/0x650 [lfsck] Sep 13 15:10:28 n00a kernel: [ 
>>>> 8414.619915]  [<ffffffff810534c4>]
>>>> kthread+0x7b/0x83
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619918]  [<ffffffff810369d3>] ?
>>>> finish_task_switch+0x48/0xb9
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619924]  [<ffffffff8101092a>]
>>>> child_rip+0xa/0x20
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619928]  [<ffffffff81053449>] ?
>>>> kthread+0x0/0x83
>>>> Sep 13 15:10:28 n00a kernel: [ 8414.619931]  [<ffffffff81010920>] ?
>>>> child_rip+0x0/0x20
>>>>
>>>>
>>>> I originally had the LFSCK launched in "dry-run" mode:
>>>>
>>>> lctl lfsck_start --device lustre-1-MDT0000 --dryrun on --type 
>>>> namespace
>>>>
>>>> The LFSCK was reported completed (I was 'watch[ing] -n 1' on a 
>>>> terminal) before the LBUG popped-up; now, I can't even get any 
>>>> output
>>>>
>>>> cat /proc/fs/lustre/mdd/lustre-1-MDT0000/lfsck_namespace  # just 
>>>> hang there indefinitely
>>>>
>>>>
>>>> I remember seing a lfsck_namespace file in the MDT underlyding 
>>>> LDISKFS; is there anything sensible I can do with it (e.g. would 
>>>> deleting it solve the situation) ?
>>>> What else could I do ?
>>>>
>>>>
>>>> Thanks for your answers and best regards,
>>>>
>>>> Cédric D.
>>>>
>>>>
>>>> PS: I had this message originally posted on HPDD-discuss mailing 
>>>> list and just realized it was the wrong place; sorry for any 
>>>> crossposting case _______________________________________________
>>>> lustre-discuss mailing list
>>>> lustre-discuss at lists.lustre.org
>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



More information about the lustre-discuss mailing list