[lustre-discuss] "ls" hangs for certain files AND lfsck_namespace gets stuck in scanning-phase1, same position

Sternberg, Michael G. sternberg at anl.gov
Fri Jun 14 17:45:15 PDT 2019


Hello,
	
On a Lustre-2.12.2 system, running "ls" on a client hangs for certain files.

This has the increasingly troublesome ramification of preventing backups from completing (unless I manage to establish an exclusion list by trial-and-error). The "ls" process cannot be killed; there are no log entries on the client or the servers; and neither clients nor servers hang as a whole.

- More precisely, "ls -f" does *not* hang; "ls -l" and other invocations that request metadata do trigger the hang.

- I can run "ls -l", "cat", and "lfs getstripe" on one file in the problematic dir, but not on the single other file in that dir.

- In another dir, "ls" works but "ls -a" doesn't, hinting that the issue is tied to specific *files* rather than to the directories themselves.
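To build the exclusion list mentioned above, I'm considering a probe along these lines (a sketch; the directory path, the 2-second grace period, and the function name are my own choices). stat is backgrounded because a hung stat sits in uninterruptible sleep (D state) and can be neither killed nor waited out:

```shell
# Probe a directory for entries whose stat() hangs, without hanging the
# probing shell itself.  Deliberately no "[ -e ... ]" guard: that test
# would itself stat() the file and hang.
check_dir() {
    for f in "$1"/* "$1"/.[!.]*; do
        stat "$f" >/dev/null 2>&1 &
        pid=$!
        sleep 2
        # Field 3 of /proc/PID/stat is the process state; "D" means
        # uninterruptible sleep, i.e. the stat() is stuck in the kernel.
        state=$(awk '{print $3}' "/proc/$pid/stat" 2>/dev/null)
        if [ "$state" = "D" ]; then
            echo "stat stuck: $f"    # candidate for the exclusion list
        fi
    done
}
```

Note that each stuck stat leaves an unkillable background process behind, so this is for occasional diagnostics, not for a cron job.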


I'll describe my diagnostic steps so far below. What could I try next?


* This looked at first similar to LU-8696 (https://jira.whamcloud.com/browse/LU-8696, ""ls" hangs on a particular directory on production system"), and I tried the approach suggested there,

	mds# lctl set_param fail_loc=0x1505

    but that did not change the "ls" behavior, even with the client rebooted.


* My file system has 2 MDS (forming 1 HA pair) and 6 OSS (forming 3 HA pairs).

	- For diagnostics, I run with HA inactive and all OSDs instead mounted manually, one node per HA pair.

	- "lfs df" outputs are below, for both blocks and inodes.

	- All OSDs are connected over multipath.  (*)

	- 2 OSTs are "large", at 31 TB (28.9 TiB). (**)

	(*) and (**):  It appears I hit LU-10510 ("blk_cloned_rq_check_limits: over max size limit"), which went away after I issued the following for each "large" OSD (> 16 TB):

	    echo 16384 > /sys/block/$i/queue/max_sectors_kb

	I did that manually because "l_tunedisk /dev/foo", called from /etc/udev/rules.d/99-lustre-server.rules, did not appear to have any effect.
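Scripted, the manual workaround looks roughly like this (a sketch: the dm-* device names are placeholders for the multipath devices backing the large OSTs, and the sysfs root is parameterized only so the helper can be exercised outside /sys):

```shell
# Raise max_sectors_kb for one block device, per the LU-10510 workaround.
# SYSFS defaults to the real sysfs tree; override it only for testing.
SYSFS=${SYSFS:-/sys/block}

raise_max_sectors() {
    # $1 = block device name, e.g. dm-4 (placeholder)
    echo 16384 > "$SYSFS/$1/queue/max_sectors_kb"
}

# Example application to two hypothetical large OSDs:
# for dev in dm-4 dm-5; do raise_max_sectors "$dev"; done
```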


* The FS was formatted under 2.10.7.  Based on LU-8696/LU-10237, I hoped that upgrading the servers, a test client, and thus "lfsck" to 2.12 would help, but it evidently didn't.

* I ran the following diagnostics so far:

  (1) e2fsck (from e2fsprogs-1.44.5.wc1-0.el7.x86_64) on all OSDs, which found and fixed some errors of the following (trivial?) kinds, about a dozen per OSD:

	[QUOTA WARNING] Usage inconsistent for ID 22xxxx:actual (68884811776, 21451) != expected (68884819968, 21451)
	Inode 10119785 extent tree (at level 1) could be shorter.  Optimize? yes
	Inode 10121783 extent tree (at level 1) could be narrower.  Optimize? yes
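The e2fsck runs were along these lines (printed rather than executed here; the /dev/mapper names are placeholders, and -f/-y force a full check and answer yes to all prompts):

```shell
# Dry run: print the per-OSD e2fsck invocation instead of executing it.
e2fsck_cmds() {
    for dev in "$@"; do
        echo "e2fsck -f -y /dev/mapper/$dev"
    done
}
e2fsck_cmds ost0 ost1
```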

  (2) lfsck.  This, to confound matters, gets stuck as well:

    While in scanning-phase1, "last_checkpoint_time" and "current_position" stop advancing for both MDTs of the file system, after 4 and 9 minutes respectively. They get stuck:

    for MDT0000 at:
	current_position: 27846673, [0x200000419:0x157df:0x0], 0x7e6aec6f046e890d

    for MDT0001 at:
	current_position: 20457226, [0x240000418:0xcf75:0x0], 0x9dc22912f913dbf

    I append the full "lctl get_param" outputs.
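The run was started and monitored along these lines (printed rather than executed, since the commands need a live MDS; the -A/-c/-C flags correspond to the "param: all_targets,create_ostobj,create_mdtobj" line in the outputs below):

```shell
# Dry run: print the lfsck start/monitor/abort sequence for this system.
lfsck_cmds() {
    cat <<'EOF'
lctl lfsck_start -M carbonfs-MDT0000 -t namespace -A -c on -C on
lctl get_param -n mdd.carbonfs-MDT0000.lfsck_namespace
lctl lfsck_stop -M carbonfs-MDT0000 -A
EOF
}
lfsck_cmds
```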


With best regards,
-- 
Michael Sternberg, Ph.D.
Principal Scientific Computing Administrator
Center for Nanoscale Materials
Argonne National Laboratory


Details:

client# lfs df /home
UUID                   1K-blocks        Used   Available Use% Mounted on
carbonfs-MDT0000_UUID    28426640     4888312    20423852  20% /home[MDT:0]
carbonfs-MDT0001_UUID    28426640     4476696    20835468  18% /home[MDT:1]
carbonfs-OST0000_UUID 15498310640  5746106888  8970807784  40% /home[OST:0]
carbonfs-OST0001_UUID 15498310640  6418661584  8298253088  44% /home[OST:1]
carbonfs-OST0002_UUID 15498310640  6094074632  8622840040  42% /home[OST:2]
carbonfs-OST0003_UUID 15498310640  6561987336  8154927336  45% /home[OST:3]
carbonfs-OST0004_UUID 15498310640  6708444352  8008470320  46% /home[OST:4]
carbonfs-OST0005_UUID 15498310640  6346698352  8370216320  44% /home[OST:5]
carbonfs-OST0006_UUID 15498310640  6407076140  8309838532  44% /home[OST:6]
carbonfs-OST0007_UUID 15498310640  6085008472  8631906200  42% /home[OST:7]
carbonfs-OST0008_UUID 30996858600 11440083316 17993977716  39% /home[OST:8]
carbonfs-OST0009_UUID 30996858600 12218084364 17215976668  42% /home[OST:9]

filesystem_summary:  185980202320 74026225436 102577214004  42% /home

client# lfs df -i /home
UUID                      Inodes       IUsed       IFree IUse% Mounted on
carbonfs-MDT0000_UUID    31154760    12356024    18798736  40% /home[MDT:0]
carbonfs-MDT0001_UUID    31154760    10709228    20445532  35% /home[MDT:1]
carbonfs-OST0000_UUID    12399920     2502317     9897603  21% /home[OST:0]
carbonfs-OST0001_UUID    12399920     2437151     9962769  20% /home[OST:1]
carbonfs-OST0002_UUID    12399920     2479428     9920492  20% /home[OST:2]
carbonfs-OST0003_UUID    12399920     2424332     9975588  20% /home[OST:3]
carbonfs-OST0004_UUID    12399920     2405313     9994607  20% /home[OST:4]
carbonfs-OST0005_UUID    12399920     2461958     9937962  20% /home[OST:5]
carbonfs-OST0006_UUID    12399920     2422940     9976980  20% /home[OST:6]
carbonfs-OST0007_UUID    12399920     2471340     9928580  20% /home[OST:7]
carbonfs-OST0008_UUID    24800048     4637459    20162589  19% /home[OST:8]
carbonfs-OST0009_UUID    24800048     4574730    20225318  19% /home[OST:9]

filesystem_summary:     62309520    23065252    39244268  38% /home



mds# lctl get_param -n mdd.carbonfs-MDT0000.lfsck_namespace 
name: lfsck_namespace
magic: 0xa06249ff
version: 2
status: scanning-phase1
flags:
param: all_targets,create_ostobj,create_mdtobj
last_completed_time: N/A
time_since_last_completed: N/A
latest_start_time: 1560551417
time_since_latest_start: 2926 seconds
last_checkpoint_time: 1560551957
time_since_last_checkpoint: 2386 seconds
latest_start_position: 77, N/A, N/A
last_checkpoint_position: 27483017, N/A, N/A
first_failure_position: N/A, N/A, N/A
checked_phase1: 12194065
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 782291
dirent_repaired: 0
linkea_repaired: 0
nlinks_repaired: 0
multiple_linked_checked: 1486515
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
linkea_overflow_cleared: 0
agent_entries_repaired: 0
success_count: 0
run_time_phase1: 2925 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 4168 items/sec
average_speed_phase2: N/A
average_speed_total: 4168 items/sec
real_time_speed_phase1: 46 items/sec
real_time_speed_phase2: N/A
current_position: 27846673, [0x200000419:0x157df:0x0], 0x7e6aec6f046e890d

mds# lctl get_param -n mdd.carbonfs-MDT0001.lfsck_namespace
name: lfsck_namespace
magic: 0xa06249ff
version: 2
status: scanning-phase1
flags:
param: all_targets,create_ostobj,create_mdtobj
last_completed_time: N/A
time_since_last_completed: N/A
latest_start_time: 1560551417
time_since_latest_start: 3947 seconds
last_checkpoint_time: 1560551661
time_since_last_checkpoint: 3703 seconds
latest_start_position: 77, N/A, N/A
last_checkpoint_position: 19924068, [0x240000418:0xbff:0x0], 0xac6ef906919f08
first_failure_position: N/A, N/A, N/A
checked_phase1: 7205570
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 508466
dirent_repaired: 0
linkea_repaired: 0
nlinks_repaired: 0
multiple_linked_checked: 302727
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
linkea_overflow_cleared: 0
agent_entries_repaired: 0
success_count: 0
run_time_phase1: 3946 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 1826 items/sec
average_speed_phase2: N/A
average_speed_total: 1826 items/sec
real_time_speed_phase1: 45 items/sec
real_time_speed_phase2: N/A
current_position: 20457226, [0x240000418:0xcf75:0x0], 0x9dc22912f913dbf


