[lustre-discuss] Report Strange Problem on 2.15.5 with changelog_mask

Aurelien Degremont adegremont at nvidia.com
Thu Nov 21 02:36:00 PST 2024


The changelog_mask has a default value. If you do

 changelog_mask='MARK MTIME CTIME'

you are setting the mask to this exact value, whereas

changelog_mask='+SATTR'

keeps all the default flags and adds SATTR. Hence the difference in output.
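To make the difference concrete, here is a minimal sketch of the three syntax forms (run on the MDS; the MDT name matches your test setup below, and '-XATTR' is only an illustrative flag to remove):

# set the mask to exactly these flags, dropping every other default flag
lctl set_param mdd.lustre-MDT0000.changelog_mask='MARK MTIME CTIME'
# keep the current mask and add one flag
lctl set_param mdd.lustre-MDT0000.changelog_mask='+SATTR'
# keep the current mask and remove one flag
lctl set_param mdd.lustre-MDT0000.changelog_mask='-XATTR'
# display the resulting effective mask
lctl get_param mdd.lustre-MDT0000.changelog_mask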

AFAIK, both commands should work, so it feels like a bug. It looks like a missing flag in the first case is causing the errors, whereas in your second case almost all flags are enabled. You can try to bisect the flags to find a smaller set that still works, and report that on jira.whamcloud.com.
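As a rough sketch of what such a bisection could look like (the flag combination below is only an example, and the client mount point /mnt/lustre is assumed; repeat the same set_param -P / remount sequence you describe below after each change):

# start from the minimal failing mask plus one candidate flag
lctl set_param -P mdd.lustre-MDT0000.changelog_mask='MARK MTIME CTIME SATTR'
# remount the MDT, then from a client check whether setattr still fails:
#   touch /mnt/lustre/aeffacer
# if it now succeeds, SATTR is a good suspect; otherwise add or remove other
# flags from the default list one at a time and retest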

Aurélien

________________________________
From: Philippe Dos Santos <philippe.dos-santos at ipsl.fr>
Sent: Thursday, 21 November 2024 11:18
To: Aurelien Degremont <adegremont at nvidia.com>
Cc: lustre-discuss at lists.lustre.org <lustre-discuss at lists.lustre.org>; Philippe Weill <Philippe.Weill at latmos.ipsl.fr>
Subject: Re: [lustre-discuss] Report Strange Problem on 2.15.5 with changelog_mask



Hello Aurelien,

I'm working with Philippe WEILL and I'm Philippe too ;o)

We first encountered the problem a few months ago, and it happened again yesterday after the maintenance window.
In production, all servers and clients are now running Lustre 2.15.5.

We reproduced the problem with 3 Rocky Linux 8.10 VMs running Lustre 2.15.5 (1x MDS/MGS, 2x OSS and 1x client).
We wonder if it could be related to a misuse of the changelog mask (='MARK MTIME CTIME' vs ='+MTIME +CTIME')?

## Making the problem happen:

[root@test-mds-mgs ~]# lctl set_param -P mdd.lustre-MDT0000.changelog_mask='MARK MTIME CTIME'
[root@test-mds-mgs ~]# reboot
[root@test-mds-mgs ~]# mount -t lustre /dev/sdb /mnt/mgt/
[root@test-mds-mgs ~]# mount -t lustre /dev/sdc /mnt/mdt/
[root@test-mds-mgs ~]# lctl get_param mdd.lustre-MDT0000.changelog_mask
mdd.lustre-MDT0000.changelog_mask=MARK MTIME CTIME

[root@test-rbh-cl-215 lustre]# LANG=C touch aeffacer
touch: setting times of 'aeffacer': Input/output error

[root@test-mds-mgs ~]# LANG=C dmesg -T
...
[Thu Nov 21 10:54:24 2024] Lustre: Lustre: Build Version: 2.15.5
[Thu Nov 21 10:54:24 2024] LNet: Added LNI 172.20.240.172@tcp [8/256/0/180]
[Thu Nov 21 10:54:24 2024] LNet: Accept secure, port 988
[Thu Nov 21 10:54:24 2024] LDISKFS-fs (sdb): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Thu Nov 21 10:54:35 2024] LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Thu Nov 21 10:54:35 2024] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 172.20.240.171@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[Thu Nov 21 10:54:35 2024] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[Thu Nov 21 10:54:35 2024] Lustre: lustre-MDD0000: changelog on
[Thu Nov 21 10:55:26 2024] Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 1 client reconnects
[Thu Nov 21 10:55:26 2024] Lustre: lustre-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
[Thu Nov 21 10:55:26 2024] LustreError: 1907:0:(llog_cat.c:543:llog_cat_current_log()) lustre-MDD0000: next log does not exist!
...

## "Solving" the problem:

[root@test-mds-mgs ~]# lctl set_param -P mdd.lustre-MDT0000.changelog_mask='+SATTR'
[root@test-mds-mgs ~]# reboot
[root@test-mds-mgs ~]# mount -t lustre /dev/sdb /mnt/mgt/
[root@test-mds-mgs ~]# mount -t lustre /dev/sdc /mnt/mdt/
[root@test-mds-mgs ~]# lctl get_param mdd.lustre-MDT0000.changelog_mask
mdd.lustre-MDT0000.changelog_mask=
MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RENME RNMTO CLOSE LYOUT TRUNC SATTR XATTR HSM MTIME CTIME MIGRT FLRW RESYNC

[root@test-rbh-cl-215 lustre]# touch aeffacer
[root@test-rbh-cl-215 lustre]# ll aeffacer
-rw-r--r-- 1 root root 0 21 nov.  11:03 aeffacer

[root@test-mds-mgs ~]# LANG=C dmesg -T
...
[Thu Nov 21 11:02:52 2024] Lustre: Lustre: Build Version: 2.15.5
[Thu Nov 21 11:02:52 2024] LNet: Added LNI 172.20.240.172@tcp [8/256/0/180]
[Thu Nov 21 11:02:52 2024] LNet: Accept secure, port 988
[Thu Nov 21 11:02:53 2024] LDISKFS-fs (sdb): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Thu Nov 21 11:02:57 2024] LDISKFS-fs (sdc): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
[Thu Nov 21 11:02:57 2024] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[Thu Nov 21 11:02:57 2024] Lustre: lustre-MDD0000: changelog on
[Thu Nov 21 11:03:27 2024] Lustre: lustre-MDT0000: Will be in recovery for at least 5:00, or until 1 client reconnects
[Thu Nov 21 11:03:27 2024] Lustre: lustre-MDT0000: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
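For what it's worth, one way to double-check which record types are actually emitted after changing the mask is to register a changelog reader and dump a few records. This is only a sketch; the reader id printed by changelog_register (e.g. cl1) and the record contents will differ on your system:

# on the MDS: register a changelog consumer
lctl --device lustre-MDT0000 changelog_register
# on a client (same directory as above): generate a setattr event and dump the latest records
touch aeffacer
lfs changelog lustre-MDT0000 | tail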

Philippe


----- Original Message -----
From: "Philippe Weill" <Philippe.Weill at latmos.ipsl.fr>
To: "Aurelien Degremont" <adegremont at nvidia.com>, lustre-discuss at lists.lustre.org
Sent: Wednesday, 20 November 2024 17:44:16
Subject: Re: [lustre-discuss] Report Strange Problem on 2.15.5 with changelog_mask

On 20/11/2024 16:24, Aurelien Degremont wrote:
> Hello Philippe,
>
> I do not see why changing the changelog mask would cause I/O errors, especially as this seems transient.
> Did you happen to have any errors on your client hosts or MDS hosts at the time of your testing? (see dmesg)


Hello,

No, we did not see any, and we have reproduced the problem with 3 Rocky Linux 8.10 VMs running a fresh 2.15.5 install (1 MDS, 1 OSS, 1 client).


>
>
> Aurélien
> ------------------------------------------------------------------------------------------------------------------------------------
> *From:* lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Philippe Weill <Philippe.Weill at latmos.ipsl.fr>
> *Sent:* Wednesday, 20 November 2024 07:11
> *To:* lustre-discuss at lists.lustre.org <lustre-discuss at lists.lustre.org>
> *Subject:* [lustre-discuss] Report Strange Problem on 2.15.5 with changelog_mask
>
>
> Hello
>
> after running the following command on our Lustre MDS,
>
> lctl set_param -P mdd.*-MDT0000.changelog_mask='MARK MTIME CTIME'
>
> and unmounting and remounting the MDT on the MDS,
>
> we had errors on touch, chmod, and chgrp of existing files:
>
> root@host:~# echo foobar > /scratch/root/foobar
> root@host:~# cat /scratch/root/foobar
> foobar
> root@host:~# echo foobar2 >> /scratch/root/foobar
> root@host:~# cat /scratch/root/foobar
> foobar
> foobar2
> root@host:~# touch /scratch/root/foobar
> touch: setting times of '/scratch/root/foobar': Input/output error
> root@host:~# chgrp group /scratch/root/foobar
> chgrp: changing group of '/scratch/root/foobar': Input/output error
> root@host:~# chmod 666 /scratch/root/foobar
> chmod: changing permissions of '/scratch/root/foobar': Input/output error
>
>
> After running the following command
>
> lctl set_param -P mdd.*-MDT0000.changelog_mask='-MARK -MTIME -CTIME'
>
>
> and only activating it non-persistently for our Robinhood instance,
>
> lctl set_param  mdd.*-MDT0000.changelog_mask='MARK MTIME CTIME'
>
>
> [root@mds ~]# lctl get_param mdd.scratch-MDT0000.changelog_mask
> mdd.scratch-MDT0000.changelog_mask=MARK MTIME CTIME
>
>
> everything started to work again
>
> Bug, or bad usage on our part?

--
Weill Philippe -  Administrateur Systeme et Reseaux
CNRS/UPMC/IPSL   LATMOS (UMR 8190)
Tour 45/46 3e Etage B302|4 Place Jussieu|75252 Paris Cedex 05 -  FRANCE
Email: philippe.weill at latmos.ipsl.fr | Tel: +33 0144274759
_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org