[lustre-discuss] wrong cache result

K. Scott Rowe krowe at nrao.edu
Tue Apr 11 07:41:54 PDT 2023


We have a cluster of diskless nodes called nmpost running RHEL 7.8,
kernel 3.10.0-1127.13.1.el7.x86_64, and Lustre 2.12.5.  Checking the
md5sum of a specific file on Lustre shows that most hosts get the
correct result,

  nmpost061 010585dfa7a66ae60b887a843056a4ec  /lustre/aoc/cluster/pipeline/dsoc-prod/workspaces/sbin/casa_envoy

but a few get different results

  nmpost073 e4c7c2eceec068ab061151866e2a0d64  /lustre/aoc/cluster/pipeline/dsoc-prod/workspaces/sbin/casa_envoy
  nmpost077 61d1d2bc7a86b53334d005a72603d8a1  /lustre/aoc/cluster/pipeline/dsoc-prod/workspaces/sbin/casa_envoy
  nmpost081 e4c7c2eceec068ab061151866e2a0d64  /lustre/aoc/cluster/pipeline/dsoc-prod/workspaces/sbin/casa_envoy
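With more than a handful of hosts, it can help to group the results by digest so stale hosts stand out. Below is a minimal Python sketch. It assumes the per-host md5sum output has already been collected into a dict (the collection step, e.g. over ssh, is not shown). `group_by_digest`, `stale_hosts`, `REPORTS` and `EXPECTED` are hypothetical names used only for illustration.

```python
from collections import defaultdict

# md5sum results quoted in this message (hostname -> digest of casa_envoy).
REPORTS = {
    "nmpost061": "010585dfa7a66ae60b887a843056a4ec",
    "nmpost073": "e4c7c2eceec068ab061151866e2a0d64",
    "nmpost077": "61d1d2bc7a86b53334d005a72603d8a1",
    "nmpost081": "e4c7c2eceec068ab061151866e2a0d64",
}

# Known-good digest (the one hosts report after a reboot or drop_caches).
EXPECTED = "010585dfa7a66ae60b887a843056a4ec"


def group_by_digest(reports: dict[str, str]) -> dict[str, list[str]]:
    """Map each distinct digest to the sorted list of hosts reporting it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for host in sorted(reports):
        groups[reports[host]].append(host)
    return dict(groups)


def stale_hosts(reports: dict[str, str], expected: str) -> list[str]:
    """Return, sorted, the hosts whose digest differs from the known-good one."""
    return sorted(host for host, digest in reports.items() if digest != expected)


if __name__ == "__main__":
    for digest, hosts in group_by_digest(REPORTS).items():
        print(digest, " ".join(hosts))
    print("stale:", " ".join(stale_hosts(REPORTS, EXPECTED)))
```

Comparing against a known-good digest, rather than taking a majority vote, matters with this sample. Only one good host is listed, so a vote over these four entries would pick the wrong digest. Grouping also shows that the stale hosts do not all agree: nmpost077 reports a different digest from nmpost073 and nmpost081.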

This casa_envoy file was updated about 10 days ago and it looks like
the hosts that see the wrong md5sum are seeing a previous version of
this file.

Either rebooting one of these hosts or running
"echo 3 > /proc/sys/vm/drop_caches" on it makes it see the correct md5sum
(010585dfa7a66ae60b887a843056a4ec).  So it seems that the Linux page
cache on these hosts is still holding the old version of this file and
is not being refreshed, even after running md5sum on it multiple times.
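If the stale data really is in the page cache, dropping every cache on the node may not be necessary for testing. Below is a minimal Python sketch, assuming a Linux client. It calls `os.posix_fadvise` with `POSIX_FADV_DONTNEED` to ask the kernel to evict only this file's clean cached pages, then recomputes the MD5. The advice is not binding, and it has not been tested here whether the Lustre 2.12 client honours it in this situation. `md5_of_file` and `md5_after_evicting` are hypothetical helper names.

```python
import hashlib
import os


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def md5_after_evicting(path: str) -> str:
    """Advise the kernel to drop this file's cached pages, then re-hash it.

    Assumption: on Linux, POSIX_FADV_DONTNEED evicts clean pages of the
    file from the page cache; dirty pages are not dropped, and the kernel
    is free to ignore the advice.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, len=0 means "the whole file". posix_fadvise is not
        # available on every platform, so skip it where it is missing.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return md5_of_file(path)
```

Unlike writing to /proc/sys/vm/drop_caches, this needs no root privileges and leaves the rest of the node's cache alone. On an affected host such as nmpost073, comparing `md5_of_file` with `md5_after_evicting` for casa_envoy may help show whether evicting that one file is enough to pick up the new version.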

Any ideas on why this is?  Is there a known issue between Lustre and
the Linux page cache?

Thanks
