[lustre-discuss] some clients dmesg filled up with "dirty page discard"

肖正刚 guru.novice at gmail.com
Tue Aug 25 16:42:47 PDT 2020


No. On the OSS, the only client being evicted is the one that reported
"dirty page discard".
We hit this again last night, and on the OSS we can see logs like:
"
[Tue Aug 25 23:40:12 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
expired after 100s: evicting client at 10.10.3.223 at o2ib  ns:
filter-public1-OST0000_UUID lock: ffff9f1f91cba880/0x3fcc67dad1c65842 lrc:
3/0,0 mode: PR/PR res: [0xde2db83:0x0:0x0].0x0 rrc: 3 type: EXT
[0->18446744073709551615] (req 0->270335) flags: 0x60000400020020 nid:
10.10.3.223 at o2ib remote: 0xd713b7b417045252 expref: 7081 pid: 25923
timeout: 21386699 lvb_type: 0
[Tue Aug 25 23:40:12 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 2 previous similar
messages
[Tue Aug 25 23:40:14 2020] LustreError:
26000:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
req at ffff9f13259a6300 x1653628454261296/t0(0)
o106->public1-OST0000 at 10.10.3.223@o2ib:15/16 lens 296/280 e 0 to 0 dl 0 ref
1 fl Rpc:/0/ffffffff rc 0/-1
[Tue Aug 25 23:40:14 2020] LustreError:
26000:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 14 previous
similar messages
[Tue Aug 25 23:40:26 2020] LustreError:
25917:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
req at ffff9f1339a5c800 x1653628454263632/t0(0)
o106->public1-OST0002 at 10.10.3.223@o2ib:15/16 lens 296/280 e 0 to 0 dl 0 ref
1 fl Rpc:/0/ffffffff rc 0/-1
[Tue Aug 25 23:40:26 2020] LustreError:
25917:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 2 previous
similar messages
[Tue Aug 25 23:44:59 2020] LustreError:
32485:0:(tgt_grant.c:750:tgt_grant_check()) public1-OST0000: cli
3a021350-bbe4-b05e-7ddf-95009f8dff7b claims 28672 GRANT, real grant 0
[Tue Aug 25 23:44:59 2020] LustreError:
32485:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 5755 previous similar
messages
[Tue Aug 25 23:49:18 2020] Lustre: public1-OST0002: Connection restored to
87ca2182-98a3-25dd-7d30-989d822381c6 (at 10.10.5.6 at o2ib)
[Tue Aug 25 23:49:18 2020] Lustre: Skipped 102 previous similar messages
[Tue Aug 25 23:55:00 2020] LustreError:
32485:0:(tgt_grant.c:750:tgt_grant_check()) public1-OST0004: cli
3a021350-bbe4-b05e-7ddf-95009f8dff7b claims 577536 GRANT, real grant 0
[Tue Aug 25 23:55:00 2020] LustreError:
32485:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 1121 previous similar
messages
[Tue Aug 25 23:59:25 2020] Lustre: public1-OST0000: Connection restored to
d45ad9f4-8903-7c80-7b35-bd32037de660 (at 10.10.7.131 at o2ib)
[Tue Aug 25 23:59:25 2020] Lustre: Skipped 50 previous similar messages
[Tue Aug 25 23:59:49 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
expired after 156s: evicting client at 10.10.3.223 at o2ib  ns:
filter-public1-OST0000_UUID lock: ffff9f130863a880/0x3fcc67dad1cff1d5 lrc:
3/0,0 mode: PR/PR res: [0xde2db83:0x0:0x0].0x0 rrc: 4 type: EXT
[0->18446744073709551615] (req 3911680->4173823) flags: 0x60000000020020
nid: 10.10.3.223 at o2ib remote: 0xd713b7b417354237 expref: 11891 pid: 26099
timeout: 21387847 lvb_type: 0
[Tue Aug 25 23:59:49 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 2 previous similar
messages
[Wed Aug 26 00:00:40 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
expired after 100s: evicting client at 10.10.3.223 at o2ib  ns:
filter-public1-OST0004_UUID lock: ffff9f2df4a10d80/0x3fcc67dad1d50925 lrc:
3/0,0 mode: PR/PR res: [0xdc95179:0x0:0x0].0x0 rrc: 3 type: EXT
[0->18446744073709551615] (req 0->266239) flags: 0x60000400000020 nid:
10.10.3.223 at o2ib remote: 0xd713b7b417549c43 expref: 14594 pid: 26181
timeout: 21387927 lvb_type: 0
[Wed Aug 26 00:00:40 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 1 previous similar
message
[Wed Aug 26 00:02:37 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
expired after 100s: evicting client at 10.10.3.223 at o2ib  ns:
filter-public1-OST0000_UUID lock: ffff9f1359e94a40/0x3fcc67dad1dacd8b lrc:
3/0,0 mode: PR/PR res: [0xde609f1:0x0:0x0].0x0 rrc: 4 type: EXT
[0->18446744073709551615] (req 1941504->2097151) flags: 0x60000400020020
nid: 10.10.3.223 at o2ib remote: 0xd713b7b417780209 expref: 5626 pid: 26134
timeout: 21388044 lvb_type: 0
[Wed Aug 26 00:02:37 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 1 previous similar
message
[Wed Aug 26 00:05:00 2020] LustreError:
26199:0:(tgt_grant.c:750:tgt_grant_check()) public1-OST0004: cli
3a021350-bbe4-b05e-7ddf-95009f8dff7b claims 28672 GRANT, real grant 0
[Wed Aug 26 00:05:00 2020] LustreError:
26199:0:(tgt_grant.c:750:tgt_grant_check()) Skipped 14028 previous similar
messages
[Wed Aug 26 00:09:30 2020] Lustre: public1-OST0000: Connection restored to
956559c4-4e7c-e6a5-3867-83ab85699688 (at 10.10.6.91 at o2ib)
[Wed Aug 26 00:09:30 2020] Lustre: Skipped 39 previous similar messages
[Wed Aug 26 00:10:27 2020] LustreError:
14278:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer
expired after 147s: evicting client at 10.10.3.223 at o2ib  ns:
filter-public1-OST0002_UUID lock: ffff9f16e6f95c40/0x3fcc67dad1dea822 lrc:
3/0,0 mode: PR/PR res: [0xdd5d4bb:0x0:0x0].0x0 rrc: 3 type: EXT
[0->18446744073709551615] (req 0->24575) flags: 0x60000400020020 nid:
10.10.3.223 at o2ib remote: 0xd713b7b417900639 expref: 8633 pid: 25993
timeout: 21388514 lvb_type: 0
"
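
To tally evictions like the ones above per client, a minimal Python sketch
could look like the following. It is only an illustration, not something we
ran on the servers: the function names are hypothetical, and it assumes the
log text has the wrapped form shown above (it also accepts the unobfuscated
"addr@net" NID form that a real dmesg would print).

```python
import re
from collections import Counter

# Matches the "lock callback timer expired" eviction message after the
# wrapped log text has been collapsed to single spaces. The NID separator
# is either "@" (real dmesg) or " at " (list-archive obfuscation).
EVICT_RE = re.compile(
    r"lock callback timer expired after (\d+)s: "
    r"evicting client at (\S+?)(?:@| at )(\S+) ns: (\S+)"
)

def parse_evictions(log_text):
    """Return a list of (timeout_seconds, client_nid, namespace) tuples."""
    # Collapse line wrapping and repeated spaces so one regex is enough.
    flat = " ".join(log_text.split())
    return [
        (int(secs), f"{addr}@{net}", ns)
        for secs, addr, net, ns in EVICT_RE.findall(flat)
    ]

def evictions_per_client(log_text):
    """Count eviction messages per client NID."""
    return Counter(nid for _, nid, _ in parse_evictions(log_text))
```

Feeding it the OSS dmesg text and printing evictions_per_client(...) gives one
count per NID. The "Skipped N previous similar messages" lines are not
counted, so the result is a lower bound.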

In addition, we ran lfsck on all servers; the result is:
"
layout_mdts_init: 0
layout_mdts_scanning-phase1: 0
layout_mdts_scanning-phase2: 0
layout_mdts_completed: 1
layout_mdts_failed: 0
layout_mdts_stopped: 0
layout_mdts_paused: 0
layout_mdts_crashed: 0
layout_mdts_partial: 0
layout_mdts_co-failed: 0
layout_mdts_co-stopped: 0
layout_mdts_co-paused: 0
layout_mdts_unknown: 0
layout_osts_init: 0
layout_osts_scanning-phase1: 0
layout_osts_scanning-phase2: 0
layout_osts_completed: 8
layout_osts_failed: 0
layout_osts_stopped: 0
layout_osts_paused: 0
layout_osts_crashed: 0
layout_osts_partial: 0
layout_osts_co-failed: 0
layout_osts_co-stopped: 0
layout_osts_co-paused: 0
layout_osts_unknown: 0
layout_repaired: 2253861
namespace_mdts_init: 0
namespace_mdts_scanning-phase1: 0
namespace_mdts_scanning-phase2: 0
namespace_mdts_completed: 1
namespace_mdts_failed: 0
namespace_mdts_stopped: 0
namespace_mdts_paused: 0
namespace_mdts_crashed: 0
namespace_mdts_partial: 0
namespace_mdts_co-failed: 0
namespace_mdts_co-stopped: 0
namespace_mdts_co-paused: 0
namespace_mdts_unknown: 0
namespace_osts_init: 0
namespace_osts_scanning-phase1: 0
namespace_osts_scanning-phase2: 0
namespace_osts_completed: 0
namespace_osts_failed: 0
namespace_osts_stopped: 0
namespace_osts_paused: 0
namespace_osts_crashed: 0
namespace_osts_partial: 0
namespace_osts_co-failed: 0
namespace_osts_co-stopped: 0
namespace_osts_co-paused: 0
namespace_osts_unknown: 0
namespace_repaired: 0
"
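
Since almost every counter above is zero, a small Python sketch like this one
can pick out the non-zero ones. It is only an illustration under the
assumption that the status text is in the "key: value" form pasted above; the
function names are hypothetical.

```python
def parse_lfsck_counters(text):
    """Parse 'name: integer' lines into a dict, ignoring anything else."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Keep only lines that look like '<name>: <non-negative integer>'.
        if sep and value.isdigit():
            counters[key.strip()] = int(value)
    return counters

def nonzero_counters(counters):
    """Return only the counters whose value is not zero."""
    return {name: count for name, count in counters.items() if count != 0}
```

On the output above this would leave only the layout_mdts_completed,
layout_osts_completed, namespace_mdts_completed and layout_repaired entries.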

On Wed, Aug 26, 2020 at 12:17 AM Colin Faber <cfaber at gmail.com> wrote:

> The I/O was not fully committed after close() from the client. Are you
> experiencing high numbers of evictions?
>
> On Tue, Aug 25, 2020 at 9:12 AM 肖正刚 <guru.novice at gmail.com> wrote:
>
>> Hi, all
>>
>> We found that dmesg on some clients has filled up with messages like
>> "
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13565:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x1680f:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13547:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x14246:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13545:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12018:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13567:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12c86:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13566:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12c76:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13550:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12c8e:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13568:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12c66:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13569:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12c7e:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13548:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12c6e:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13570:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12ca6:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13549:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12cbe:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13571:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12cb6:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13551:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12cae:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13572:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12cce:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13573:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12cc6:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13574:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12d56:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13575:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x12d36:0x0]/ may get corrupted (rc -108)
>> Aug 24 19:54:34 ln5 kernel: Lustre:
>> 13576:0:(llite_lib.c:2759:ll_dirty_page_discard_warn()) public1: dirty page
>> discard: 10.10.2.11 at o2ib:10.10.2.12 at o2ib:/public1/fid:
>> [0x200007a82:0x1429e:0x0]/ may get corrupted (rc -108)
>>
>> "
>> We then checked the disk array, SAS links, and multipath, but found no errors.
>> Has anyone run into the same problem?
>> Any suggestions would help!
>>
>> Regards.