[lustre-discuss] network error on bulk WRITE/bad log

Stepan Nassyr s.nassyr at fz-juelich.de
Tue Aug 16 06:25:39 PDT 2022


Hello Peter,

Thank you for the reply. I have upgraded Lustre to 2.15.1. The errors 
persist, however, and I am now also seeing a new error on io02:

[ 1749.396942] LustreError: 9216:0:(mdt_handler.c:7499:mdt_iocontrol()) storage-MDT0001: Not supported cmd = 1074292357, rc = -95

I'm not entirely sure how to look up the cmd code, and rc -95 seems to 
just be EOPNOTSUPP, so it gives no additional information.

Is there a way to look up what the cmd value means?
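
One way to narrow this down, as a minimal sketch: the cmd value is an 
ioctl request number, so it can be split into the direction, type, 
number and size fields that the kernel's _IO/_IOR/_IOW/_IOWR macros 
pack into it. The bit layout below is assumed to be the generic Linux 
one (asm-generic/ioctl.h, used on x86 and aarch64). The function name 
decode_ioctl is made up for this example.

```python
# Decode a Linux ioctl request number into its _IOC fields.
# Assumption: generic bit layout from asm-generic/ioctl.h
# (nr: bits 0-7, type: bits 8-15, size: bits 16-29, dir: bits 30-31).
_IOC_NRBITS, _IOC_TYPEBITS, _IOC_SIZEBITS, _IOC_DIRBITS = 8, 8, 14, 2
_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = _IOC_NRSHIFT + _IOC_NRBITS
_IOC_SIZESHIFT = _IOC_TYPESHIFT + _IOC_TYPEBITS
_IOC_DIRSHIFT = _IOC_SIZESHIFT + _IOC_SIZEBITS

# Direction bits: _IOC_NONE = 0, _IOC_WRITE = 1, _IOC_READ = 2.
_DIR_MACROS = {0: "_IO", 1: "_IOW", 2: "_IOR", 3: "_IOWR"}


def decode_ioctl(cmd):
    """Return the fields packed into an ioctl request number."""
    cmd &= 0xFFFFFFFF  # treat the value as an unsigned 32-bit number
    nr = (cmd >> _IOC_NRSHIFT) & ((1 << _IOC_NRBITS) - 1)
    type_code = (cmd >> _IOC_TYPESHIFT) & ((1 << _IOC_TYPEBITS) - 1)
    size = (cmd >> _IOC_SIZESHIFT) & ((1 << _IOC_SIZEBITS) - 1)
    direction = (cmd >> _IOC_DIRSHIFT) & ((1 << _IOC_DIRBITS) - 1)
    return {
        "hex": hex(cmd),
        "macro": _DIR_MACROS[direction],
        "type": chr(type_code),  # the "magic" character of the ioctl group
        "nr": nr,
        "size": size,            # size in bytes of the argument type
    }


if __name__ == "__main__":
    # cmd value taken from the mdt_iocontrol() message above
    print(decode_ioctl(1074292357))
```

For the value in the log this gives 0x40086685, i.e. an _IOW request 
with type 'f', number 133 and an 8-byte argument. That combination can 
then be searched for among the ioctl definitions in the Lustre headers 
of the installed version.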

On 15.08.22 14:50, Peter Jones wrote:
>
> Stepan
>
> 2.14.56 is not a release version of Lustre – it is an interim dev 
> build. Even if it does not resolve this specific issue, I would 
> strongly recommend switching to the recently released Lustre 2.15.1
>
> Peter
>
> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on 
> behalf of Stepan Nassyr via lustre-discuss 
> <lustre-discuss at lists.lustre.org>
> Reply-To: Stepan Nassyr <s.nassyr at fz-juelich.de>
> Date: Monday, August 15, 2022 at 1:35 AM
> To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
> Subject: [lustre-discuss] network error on bulk WRITE/bad log
>
> Hi all,
>
> In May I had a failure on a small cluster and asked here 
> (http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2022-May/018073.html). 
> Due to time constraints I just recreated the filesystem back then.
>
> Now the failure has happened again. This time I have more time to 
> investigate, and I haven't done anything destructive yet.
>
> I use the following versions:
>
>   * lustre 2.14.56
>   * zfs 2.0.7 (previously used 2.1.2, but was told that 2.1.x is not
>     well tested with Lustre)
>   * Nodes are running Rocky Linux 8.6
>   * uname -r: 4.18.0-372.19.1.el8_6.aarch64
>
> There are 2 IO nodes (io01 and io02). Both of them are MDS and OSS, 
> and io01 is also the MGS. Here are the devices:
>
> [snassyr@io02 ~]$ sudo lctl dl
>   0 UP osd-zfs storage-MDT0001-osd storage-MDT0001-osd_UUID 8
>   1 UP mgc MGC10.31.7.61@o2ib a087e05e-d57c-4561-ad75-6827d4428f54 4
>   2 UP mds MDS MDS_uuid 2
>   3 UP lod storage-MDT0001-mdtlov storage-MDT0001-mdtlov_UUID 3
>   4 UP mdt storage-MDT0001 storage-MDT0001_UUID 8
>   5 UP mdd storage-MDD0001 storage-MDD0001_UUID 3
>   6 UP osp storage-MDT0000-osp-MDT0001 storage-MDT0001-mdtlov_UUID 4
>   7 UP osp storage-OST0000-osc-MDT0001 storage-MDT0001-mdtlov_UUID 4
>   8 UP osp storage-OST0001-osc-MDT0001 storage-MDT0001-mdtlov_UUID 4
>   9 UP lwp storage-MDT0000-lwp-MDT0001 storage-MDT0000-lwp-MDT0001_UUID 4
>  10 UP osd-zfs storage-OST0001-osd storage-OST0001-osd_UUID 4
>  11 UP ost OSS OSS_uuid 2
>  12 UP obdfilter storage-OST0001 storage-OST0001_UUID 6
>  13 UP lwp storage-MDT0000-lwp-OST0001 storage-MDT0000-lwp-OST0001_UUID 4
>  14 UP lwp storage-MDT0001-lwp-OST0001 storage-MDT0001-lwp-OST0001_UUID 4
>
> [snassyr@io01 ~]$ sudo lctl dl
>   0 UP osd-zfs MGS-osd MGS-osd_UUID 4
>   1 UP mgs MGS MGS 6
>   2 UP mgc MGC10.31.7.61@o2ib 9f351a51-0232-4306-a66d-cecee8629329 4
>   3 UP osd-zfs storage-MDT0000-osd storage-MDT0000-osd_UUID 9
>   4 UP mds MDS MDS_uuid 2
>   5 UP lod storage-MDT0000-mdtlov storage-MDT0000-mdtlov_UUID 3
>   6 UP mdt storage-MDT0000 storage-MDT0000_UUID 12
>   7 UP mdd storage-MDD0000 storage-MDD0000_UUID 3
>   8 UP qmt storage-QMT0000 storage-QMT0000_UUID 3
>   9 UP osp storage-MDT0001-osp-MDT0000 storage-MDT0000-mdtlov_UUID 4
>  10 UP osp storage-OST0000-osc-MDT0000 storage-MDT0000-mdtlov_UUID 4
>  11 UP osp storage-OST0001-osc-MDT0000 storage-MDT0000-mdtlov_UUID 4
>  12 UP lwp storage-MDT0000-lwp-MDT0000 storage-MDT0000-lwp-MDT0000_UUID 4
>  13 UP osd-zfs storage-OST0000-osd storage-OST0000-osd_UUID 4
>  14 UP ost OSS OSS_uuid 2
>  15 UP obdfilter storage-OST0000 storage-OST0000_UUID 6
>  16 UP lwp storage-MDT0000-lwp-OST0000 storage-MDT0000-lwp-OST0000_UUID 4
>  17 UP lwp storage-MDT0001-lwp-OST0000 storage-MDT0001-lwp-OST0000_UUID 4
>
> On io01 I see repeating errors mentioning a network error:
>
> [65922.582578] LustreError: 20017:0:(ldlm_lib.c:3540:target_bulk_io()) Skipped 11 previous similar messages
> [66494.575431] LNetError: 20017:0:(o2iblnd.c:1880:kiblnd_fmr_pool_map()) Failed to map mr 1/8 elements
> [66494.575442] LNetError: 20017:0:(o2iblnd.c:1880:kiblnd_fmr_pool_map()) Skipped 11 previous similar messages
> [66494.575446] LNetError: 20017:0:(o2iblnd_cb.c:613:kiblnd_fmr_map_tx()) Can't map 32768 bytes (8/8)s: -22
> [66494.575448] LNetError: 20017:0:(o2iblnd_cb.c:613:kiblnd_fmr_map_tx()) Skipped 11 previous similar messages
> [66494.575452] LNetError: 20017:0:(o2iblnd_cb.c:1725:kiblnd_send()) Can't setup PUT src for 10.31.7.62@o2ib: -22
> [66494.575454] LNetError: 20017:0:(o2iblnd_cb.c:1725:kiblnd_send()) Skipped 11 previous similar messages
> [66494.575458] LustreError: 20017:0:(events.c:477:server_bulk_callback()) event type 5, status -5, desc 00000000cdd4e797
> [66494.575460] LustreError: 20017:0:(events.c:477:server_bulk_callback()) Skipped 11 previous similar messages
> [66546.574314] LustreError: 20017:0:(ldlm_lib.c:3540:target_bulk_io()) @@@ network error on bulk WRITE  req@0000000070b8f1ab x1740960836990720/t0(0) o1000->storage-MDT0001-mdtlov_UUID@10.31.7.62@o2ib:522/0 lens 336/33016 e 0 to 0 dl 1660376137 ref 1 fl Interpret:/0/0 rc 0/0 job:''
>
> On io02 I see repeating errors mentioning a bad log:
>
> [66582.856444] LustreError: 14905:0:(llog_osd.c:264:llog_osd_read_header()) storage-MDT0000-osp-MDT0001: bad log  [0x200000401:0x1:0x0] header magic: 0x0 (expected 0x10645539)
> [66582.856450] LustreError: 14905:0:(llog_osd.c:264:llog_osd_read_header()) Skipped 11 previous similar messages
>
> I can't make sense of these error messages. How can I recover?
>
> (I have the full dmesg and lctl dk logs, but they are too big to 
> attach. Is it OK to upload them somewhere and put a link in a reply?)
>
> Thank you and best regards,
> Stepan