[lustre-discuss] "ls" hangs for certain files AND lfsck_namespace gets stuck in scanning-phase1, same position
Sternberg, Michael G.
sternberg at anl.gov
Fri Jun 14 17:45:15 PDT 2019
Hello,
On a Lustre-2.12.2 system, running "ls" on a client hangs for certain files.
This has the increasingly troublesome side effect of preventing backups from completing (unless I were to build up an exclusion list by trial-and-error increments). The "ls" process cannot be killed; there are no log entries on the client or the servers; neither the clients as a whole nor the servers hang.
- More precisely, "ls -f" does *not* hang; "ls -l" and other options calling for metadata do trigger the hang.
- I can do "ls -l", "cat", and "lfs getstripe" on one file from the problematic dir, but not on the single other file in that dir.
- In another dir, "ls" works, but "ls -a" doesn't, hinting that the issue is related to *files*, not so much to the directories themselves.
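To narrow down which entries trigger the hang without wedging a whole "ls", one could probe each file with a bounded stat(2). This is a hedged sketch, not something from my actual runs; the directory argument and the 10-second timeout are placeholders:

```shell
#!/bin/sh
# Sketch: probe each directory entry with a time-limited stat(2) so a
# single unresponsive file does not hang the whole listing.
# DIR defaults to the current directory; pass the suspect dir as $1.
DIR=${1:-.}
for f in "$DIR"/*; do
    # timeout(1) kills stat if the MDS never answers the getattr
    if ! timeout 10 stat "$f" >/dev/null 2>&1; then
        echo "stat blocked or failed: $f"
    fi
done
```

On a healthy directory this prints nothing; any file it flags is a candidate for the backup exclusion list.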
I'll describe my diag steps so far in the following. -- What could I do next?
* This looked at first similar to LU-8696 {https://jira.whamcloud.com/browse/LU-8696 "ls" hangs on a particular directory on production system}, and I tried the approach suggested there,
mds# lctl set_param fail_loc=0x1505
but that did not change the "ls" behavior, even with the client rebooted.
* My file system has 2 MDS (forming 1 HA pair) and 6 OSS (forming 3 HA pairs).
- For diagnostics, I run with HA inactive and instead mount all OSDs manually, on one node per HA pair, respectively.
- "lfs df" outputs are below, for both blocks and inodes.
- All OSDs are connected over multipath. (*)
- 2 OSTs are "large", at 31 TB (28.9 TiB). (**)
(*) and (**): It appears I hit "LU-10510 blk_cloned_rq_check_limits: over max size limit.", which went away after I issued the following for each "large" OSD (> 16 TB):
echo 16384 > /sys/block/$i/queue/max_sectors_kb
I did that manually because "l_tunedisk /dev/foo" from /etc/udev/rules.d/99-lustre-server.rules did not appear to have an effect.
* The FS was formatted under 2.10.7. I hoped from LU-8696/LU-10237 that upgrading the servers, a test client, and thus "lfsck" to 2.12 would help, but it evidently didn't.
* I ran the following diagnostics so far:
(1) e2fsck (from e2fsprogs-1.44.5.wc1-0.el7.x86_64) on all OSDs, which found and fixed some errors of the following (trivial?) kind, about a dozen per OSD:
[QUOTA WARNING] Usage inconsistent for ID 22xxxx: actual (68884811776, 21451) != expected (68884819968, 21451)
Inode 10119785 extent tree (at level 1) could be shorter. Optimize? yes
Inode 10121783 extent tree (at level 1) could be narrower. Optimize? yes
(2) lfsck. This, to confound matters, gets stuck as well:
While in scanning-phase1, after 4 and 9 minutes respectively, "last_checkpoint_time" and "current_position" stop advancing on both MDTs of the file system. They get stuck:
for MDT0000 at:
current_position: 27846673, [0x200000419:0x157df:0x0], 0x7e6aec6f046e890d
for MDT0001 at:
current_position: 20457226, [0x240000418:0xcf75:0x0], 0x9dc22912f913dbf
I append the full "lctl get_param" outputs.
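One thing I have considered (a hedged sketch only, not yet tried as written) is stopping the stuck scan and restarting just the namespace component via the documented lctl lfsck_stop/lfsck_start interface; the MDT names match this filesystem, and the function is guarded so it is a no-op on hosts without lctl:

```shell
#!/bin/sh
# Sketch: stop the stuck namespace lfsck on each MDT and restart only the
# namespace component. Harmless no-op where lctl is not installed.
run_lfsck_restart() {
    command -v lctl >/dev/null 2>&1 || { echo "lctl not found; skipping"; return 0; }
    for mdt in carbonfs-MDT0000 carbonfs-MDT0001; do
        lctl lfsck_stop -M "$mdt"
        # -t namespace restricts the new run to the namespace component
        lctl lfsck_start -M "$mdt" -t namespace
    done
}
run_lfsck_restart
```

Whether a restarted scan would stall at the same position again is exactly the open question.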
With best regards,
--
Michael Sternberg, Ph.D.
Principal Scientific Computing Administrator
Center for Nanoscale Materials
Argonne National Laboratory
Details:
client# lfs df /home
UUID 1K-blocks Used Available Use% Mounted on
carbonfs-MDT0000_UUID 28426640 4888312 20423852 20% /home[MDT:0]
carbonfs-MDT0001_UUID 28426640 4476696 20835468 18% /home[MDT:1]
carbonfs-OST0000_UUID 15498310640 5746106888 8970807784 40% /home[OST:0]
carbonfs-OST0001_UUID 15498310640 6418661584 8298253088 44% /home[OST:1]
carbonfs-OST0002_UUID 15498310640 6094074632 8622840040 42% /home[OST:2]
carbonfs-OST0003_UUID 15498310640 6561987336 8154927336 45% /home[OST:3]
carbonfs-OST0004_UUID 15498310640 6708444352 8008470320 46% /home[OST:4]
carbonfs-OST0005_UUID 15498310640 6346698352 8370216320 44% /home[OST:5]
carbonfs-OST0006_UUID 15498310640 6407076140 8309838532 44% /home[OST:6]
carbonfs-OST0007_UUID 15498310640 6085008472 8631906200 42% /home[OST:7]
carbonfs-OST0008_UUID 30996858600 11440083316 17993977716 39% /home[OST:8]
carbonfs-OST0009_UUID 30996858600 12218084364 17215976668 42% /home[OST:9]
filesystem_summary: 185980202320 74026225436 102577214004 42% /home
client# lfs df -i /home
UUID Inodes IUsed IFree IUse% Mounted on
carbonfs-MDT0000_UUID 31154760 12356024 18798736 40% /home[MDT:0]
carbonfs-MDT0001_UUID 31154760 10709228 20445532 35% /home[MDT:1]
carbonfs-OST0000_UUID 12399920 2502317 9897603 21% /home[OST:0]
carbonfs-OST0001_UUID 12399920 2437151 9962769 20% /home[OST:1]
carbonfs-OST0002_UUID 12399920 2479428 9920492 20% /home[OST:2]
carbonfs-OST0003_UUID 12399920 2424332 9975588 20% /home[OST:3]
carbonfs-OST0004_UUID 12399920 2405313 9994607 20% /home[OST:4]
carbonfs-OST0005_UUID 12399920 2461958 9937962 20% /home[OST:5]
carbonfs-OST0006_UUID 12399920 2422940 9976980 20% /home[OST:6]
carbonfs-OST0007_UUID 12399920 2471340 9928580 20% /home[OST:7]
carbonfs-OST0008_UUID 24800048 4637459 20162589 19% /home[OST:8]
carbonfs-OST0009_UUID 24800048 4574730 20225318 19% /home[OST:9]
filesystem_summary: 62309520 23065252 39244268 38% /home
mds# lctl get_param -n mdd.carbonfs-MDT0000.lfsck_namespace
name: lfsck_namespace
magic: 0xa06249ff
version: 2
status: scanning-phase1
flags:
param: all_targets,create_ostobj,create_mdtobj
last_completed_time: N/A
time_since_last_completed: N/A
latest_start_time: 1560551417
time_since_latest_start: 2926 seconds
last_checkpoint_time: 1560551957
time_since_last_checkpoint: 2386 seconds
latest_start_position: 77, N/A, N/A
last_checkpoint_position: 27483017, N/A, N/A
first_failure_position: N/A, N/A, N/A
checked_phase1: 12194065
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 782291
dirent_repaired: 0
linkea_repaired: 0
nlinks_repaired: 0
multiple_linked_checked: 1486515
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
linkea_overflow_cleared: 0
agent_entries_repaired: 0
success_count: 0
run_time_phase1: 2925 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 4168 items/sec
average_speed_phase2: N/A
average_speed_total: 4168 items/sec
real_time_speed_phase1: 46 items/sec
real_time_speed_phase2: N/A
current_position: 27846673, [0x200000419:0x157df:0x0], 0x7e6aec6f046e890d
mds# lctl get_param -n mdd.carbonfs-MDT0001.lfsck_namespace
name: lfsck_namespace
magic: 0xa06249ff
version: 2
status: scanning-phase1
flags:
param: all_targets,create_ostobj,create_mdtobj
last_completed_time: N/A
time_since_last_completed: N/A
latest_start_time: 1560551417
time_since_latest_start: 3947 seconds
last_checkpoint_time: 1560551661
time_since_last_checkpoint: 3703 seconds
latest_start_position: 77, N/A, N/A
last_checkpoint_position: 19924068, [0x240000418:0xbff:0x0], 0xac6ef906919f08
first_failure_position: N/A, N/A, N/A
checked_phase1: 7205570
checked_phase2: 0
updated_phase1: 0
updated_phase2: 0
failed_phase1: 0
failed_phase2: 0
directories: 508466
dirent_repaired: 0
linkea_repaired: 0
nlinks_repaired: 0
multiple_linked_checked: 302727
multiple_linked_repaired: 0
unknown_inconsistency: 0
unmatched_pairs_repaired: 0
dangling_repaired: 0
multiple_referenced_repaired: 0
bad_file_type_repaired: 0
lost_dirent_repaired: 0
local_lost_found_scanned: 0
local_lost_found_moved: 0
local_lost_found_skipped: 0
local_lost_found_failed: 0
striped_dirs_scanned: 0
striped_dirs_repaired: 0
striped_dirs_failed: 0
striped_dirs_disabled: 0
striped_dirs_skipped: 0
striped_shards_scanned: 0
striped_shards_repaired: 0
striped_shards_failed: 0
striped_shards_skipped: 0
name_hash_repaired: 0
linkea_overflow_cleared: 0
agent_entries_repaired: 0
success_count: 0
run_time_phase1: 3946 seconds
run_time_phase2: 0 seconds
average_speed_phase1: 1826 items/sec
average_speed_phase2: N/A
average_speed_total: 1826 items/sec
real_time_speed_phase1: 45 items/sec
real_time_speed_phase2: N/A
current_position: 20457226, [0x240000418:0xcf75:0x0], 0x9dc22912f913dbf