[lustre-discuss] [EXTERNAL] MDTs will only mount read only

Mike Mosley Mike.Mosley at charlotte.edu
Wed Jun 21 11:20:52 PDT 2023


Jeff,

At this point we have the OSS nodes shut down. We are coming back from a
full outage, so we are trying to get the MDS up before starting to bring up
the OSS nodes.

Mike

On Wed, Jun 21, 2023 at 2:15 PM Jeff Johnson <jeff.johnson at aeoncomputing.com>
wrote:

> Mike,
>
> Have you made sure that the o2ib interfaces on all of your Lustre servers
> (MDS & OSS) are functioning properly? Are you able to `lctl ping
> x.x.x.x at o2ib` successfully between MDS and OSS nodes?
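>
> For example, a quick sanity check on each server (just a sketch, the exact
> output will differ on your systems):
>
> # lctl list_nids          (every server should list an @o2ib NID)
> # ibstat                  (the IB port should report State: Active)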
>
> --Jeff
>
>
> On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss <
> lustre-discuss at lists.lustre.org> wrote:
>
>> Rick,
>> 172.16.100.4 is the IB address of one of the OSS servers. I believe the
>> mgt and mdt0 are the same target. My understanding is that we have a
>> single instance of the MGT, which is on the first MDT server, i.e. it was
>> created via a command similar to:
>>
>> # mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb
>>
>> Does that make sense?
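>>
>> I assume we could double-check that non-destructively with something like
>> the following (device path taken from the mkfs example above; the real
>> path may differ on our MDS):
>>
>> # tunefs.lustre --dryrun /dev/sdb
>>
>> For a combined target, the flags it reports should include both MDT and MGS.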
>>
>> On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick <mohrrf at ornl.gov> wrote:
>>
>>> Which host is 172.16.100.4?  Also, are the mgt and mdt0 on the same
>>> target or are they two separate targets just on the same host?
>>>
>>> --Rick
>>>
>>>
>>> On 6/21/23, 12:52 PM, "Mike Mosley" <Mike.Mosley at charlotte.edu> wrote:
>>>
>>>
>>> Hi Rick,
>>>
>>> The MGS/MDS are combined. The output I posted is from the primary.
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>> On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <mohrrf at ornl.gov> wrote:
>>>
>>>
>>> Mike,
>>>
>>>
>>> It looks like the mds server is having a problem contacting the mgs
>>> server. I'm guessing the mgs is a separate host? I would start by looking
>>> for possible network problems that might explain the LNet timeouts. You can
>>> try using "lctl ping" to test the LNet connection between nodes, and you
>>> can also try regular "ping" between the IP addresses on the IB interfaces.
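>>>
>>> As a concrete sketch of those two checks from the MDS (the address below is
>>> the OSS NID that shows up in the LNet timeouts further down; substitute the
>>> addresses of your own IB interfaces):
>>>
>>> # lctl ping 172.16.100.4@o2ib     (LNet-level ping over o2ib)
>>> # ping -c 3 172.16.100.4          (plain IP ping of the IPoIB address)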
>>>
>>>
>>> --Rick
>>>
>>> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
>>> lustre-discuss" <lustre-discuss-bounces at lists.lustre.org> on behalf of
>>> lustre-discuss at lists.lustre.org wrote:
>>>
>>> Greetings,
>>>
>>> We have experienced some type of issue that is causing both of our MDS
>>> servers to only be able to mount the MDT device in read-only mode. Some of
>>> the error messages we are seeing in the log files are included below. We
>>> lost our Lustre expert a while back, and we are not sure how to
>>> troubleshoot this issue. Can anybody provide us with guidance on how to
>>> proceed?
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>> Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked
>>> for more than 120 seconds.
>>> Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D ffff9f27a3bc5230 0 4123
>>> 1 0x00000086
>>> Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb585da9>] schedule+0x29/0x70
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb5838b1>]
>>> schedule_timeout+0x221/0x2d0
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf6b8e5>] ?
>>> tracing_is_on+0x15/0x30
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf6f5bd>] ?
>>> tracing_record_cmdline+0x1d/0x120
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaf77d9b>] ?
>>> probe_sched_wakeup+0x2b/0xa0
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaed7d15>] ?
>>> ttwu_do_wakeup+0xb5/0xe0
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb58615d>]
>>> wait_for_completion+0xfd/0x140
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbaedb990>] ?
>>> wake_up_state+0x20/0x20
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f529a4>]
>>> llog_process_or_fork+0x244/0x450 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f52bc4>]
>>> llog_process+0x14/0x20 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f85d05>]
>>> class_config_parse_llog+0x125/0x350 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a69fc0>]
>>> mgc_process_cfg_log+0x790/0xc40 [mgc]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6d4cc>]
>>> mgc_process_log+0x3dc/0x8f0 [mgc]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6e15f>] ?
>>> config_recover_log_add+0x13f/0x280 [mgc]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8df40>] ?
>>> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0a6eb2b>]
>>> mgc_process_config+0x88b/0x13f0 [mgc]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f91b58>]
>>> lustre_process_log+0x2d8/0xad0 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0e5a177>] ?
>>> libcfs_debug_msg+0x57/0x80 [libcfs]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f7c8b9>] ?
>>> lprocfs_counter_add+0xf9/0x160 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0fc08f4>]
>>> server_start_targets+0x13a4/0x2a20 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f94bb0>] ?
>>> lustre_start_mgc+0x260/0x2510 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8df40>] ?
>>> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0fc303c>]
>>> server_fill_super+0x10cc/0x1890 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f97a08>]
>>> lustre_fill_super+0x468/0x960 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f975a0>] ?
>>> lustre_common_put_super+0x270/0x270 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0510cf>]
>>> mount_nodev+0x4f/0xb0
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffc0f8f9a8>]
>>> lustre_mount+0x38/0x60 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb051c4e>] mount_fs+0x3e/0x1b0
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0707a7>]
>>> vfs_kern_mount+0x67/0x110
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb072edf>]
>>> do_mount+0x1ef/0xd00
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb049d7a>] ?
>>> __check_object_size+0x1ca/0x250
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb0288ec>] ?
>>> kmem_cache_alloc_trace+0x3c/0x200
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb073d33>] SyS_mount+0x83/0xd0
>>> Jun 20 15:12:14 hyd-mds1 kernel: [<ffffffffbb592ed2>]
>>> system_call_fastpath+0x25/0x2a
>>> Jun 20 15:13:14 hyd-mds1 kernel: LNet:
>>> 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for
>>> 172.16.100.4 at o2ib: 9 seconds
>>> Jun 20 15:13:14 hyd-mds1 kernel: LNet:
>>> 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 239 previous
>>> similar messages
>>> Jun 20 15:14:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked
>>> for more than 120 seconds.
>>> Jun 20 15:14:14 hyd-mds1 kernel: "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> Jun 20 15:14:14 hyd-mds1 kernel: mount.lustre D ffff9f27a3bc5230 0 4123
>>> 1 0x00000086
>>>
>>> dumpe2fs seems to show that the file systems are clean, i.e.:
>>>
>>> dumpe2fs 1.45.6.wc1 (20-Mar-2020)
>>> Filesystem volume name: hydra-MDT0000
>>> Last mounted on: /
>>> Filesystem UUID: 3ae09231-7f2a-43b3-a4ee-7f36080b5a66
>>> Filesystem magic number: 0xEF53
>>> Filesystem revision #: 1 (dynamic)
>>> Filesystem features: has_journal ext_attr resize_inode dir_index
>>> filetype mmp flex_bg dirdata sparse_super large_file huge_file uninit_bg
>>> dir_nlink quota
>>> Filesystem flags: signed_directory_hash
>>> Default mount options: user_xattr acl
>>> Filesystem state: clean
>>> Errors behavior: Continue
>>> Filesystem OS type: Linux
>>> Inode count: 2247671504
>>> Block count: 1404931944
>>> Reserved block count: 70246597
>>> Free blocks: 807627552
>>> Free inodes: 2100036536
>>> First block: 0
>>> Block size: 4096
>>> Fragment size: 4096
>>> Reserved GDT blocks: 1024
>>> Blocks per group: 20472
>>> Fragments per group: 20472
>>> Inodes per group: 32752
>>> Inode blocks per group: 8188
>>> Flex block group size: 16
>>> Filesystem created: Thu Aug 8 14:21:01 2019
>>> Last mount time: Tue Jun 20 15:19:03 2023
>>> Last write time: Wed Jun 21 10:43:51 2023
>>> Mount count: 38
>>> Maximum mount count: -1
>>> Last checked: Thu Aug 8 14:21:01 2019
>>> Check interval: 0 (<none>)
>>> Lifetime writes: 219 TB
>>> Reserved blocks uid: 0 (user root)
>>> Reserved blocks gid: 0 (group root)
>>> First inode: 11
>>> Inode size: 1024
>>> Required extra isize: 32
>>> Desired extra isize: 32
>>> Journal inode: 8
>>> Default directory hash: half_md4
>>> Directory Hash Seed: 2e518531-82d9-4652-9acd-9cf9ca09c399
>>> Journal backup: inode blocks
>>> MMP block number: 1851467
>>> MMP update interval: 5
>>> User quota inode: 3
>>> Group quota inode: 4
>>> Journal features: journal_incompat_revoke
>>> Journal size: 4096M
>>> Journal length: 1048576
>>> Journal sequence: 0x0a280713
>>> Journal start: 0
>>> MMP_block:
>>> mmp_magic: 0x4d4d50
>>> mmp_check_interval: 6
>>> mmp_sequence: 0xff4d4d50
>>> mmp_update_date: Wed Jun 21 10:43:51 2023
>>> mmp_update_time: 1687358631
>>> mmp_node_name: hyd-mds1.uncc.edu
>>> mmp_device_name: dm-0
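>>>
>>> (If it is useful to re-check, the header and MMP state above can be re-read
>>> at any time without modifying the device, along the lines of the command
>>> below; the path is the dm device reported in mmp_device_name and may differ:)
>>>
>>> # dumpe2fs -h /dev/dm-0 | grep -Ei 'state|mmp'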
>>>
>>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>
>
> --
> ------------------------------
> Jeff Johnson
> Co-Founder
> Aeon Computing
>
> jeff.johnson at aeoncomputing.com
> www.aeoncomputing.com
> t: 858-412-3810 x1001   f: 858-412-3845
> m: 619-204-9061
>
> 4170 Morena Boulevard, Suite C - San Diego, CA 92117
>
> High-Performance Computing / Lustre Filesystems / Scale-out Storage
>