[lustre-discuss] MDS/MGS has a block storage device mounted and it does not have any permissions (no read, no write, no execute)

Pinkesh Valdria pinkesh.valdria@oracle.com
Wed Feb 6 10:17:10 PST 2019


Thanks, Andreas.  Given below is the output of the commands you asked me to run:

> [root@lustre-mds-server-1 opc]#
> 	• Assuming the above is not an issue: after setting up the OSS/OST and client node, when my client tries to mount, I get the below error:
> [root@lustre-client-1 opc]# mount -t lustre 10.0.2.4@tcp:/lustrewt /mnt
> mount.lustre: mount 10.0.2.4@tcp:/lustrewt at /mnt failed: Input/output error
> Is the MGS running?
> [root@lustre-client-1 opc]#

Andreas:  Can you do "lctl ping" from the client to the MGS node?  Most commonly this happens because the client still has a firewall configured, or it is defined to have "127.0.0.1" as the local node address.

Pinkesh response:  
[root@lustre-client-1 opc]# lctl ping 10.0.2.6@tcp
12345-0@lo
12345-10.0.2.6@tcp

So there is an "lo" entry in the output; could that be a problem?


I also ran the mount command on the client node to capture logs on both the client node and the MDS node.

(ran the command at 18:11)
[root@lustre-client-1 opc]# mount -t lustre 10.0.2.6@tcp:/lustrewt /mnt
mount.lustre: mount 10.0.2.6@tcp:/lustrewt at /mnt failed: Input/output error
Is the MGS running?
[root@lustre-client-1 opc]#


[root@lustre-mds-server-1 opc]# tail -f /var/log/messages
Feb  6 18:11:38 lustre-mds-server-1 kernel: Lustre: MGS: Connection restored to 88e1c321-1eaa-6914-5a37-4fff2063b526 (at 10.0.0.2@tcp)
Feb  6 18:11:38 lustre-mds-server-1 kernel: Lustre: Skipped 1 previous similar message
Feb  6 18:11:45 lustre-mds-server-1 kernel: Lustre: MGS: Received new LWP connection from 10.0.0.2@tcp, removing former export from same NID
Feb  6 18:11:45 lustre-mds-server-1 kernel: Lustre: MGS: Connection restored to 88e1c321-1eaa-6914-5a37-4fff2063b526 (at 10.0.0.2@tcp)


[root@lustre-client-1 opc]# less /var/log/messages
Feb  6 18:10:01 lustre-client-1 systemd: Removed slice User Slice of root.
Feb  6 18:10:01 lustre-client-1 systemd: Stopping User Slice of root.
Feb  6 18:11:45 lustre-client-1 kernel: Lustre: 10376:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1549476698/real 1549476698]  req@ffff9259bb42a100 x1624614953288736/t0(0) o503->MGC10.0.2.6@tcp@10.0.2.6@tcp:26/25 lens 272/8416 e 0 to 1 dl 1549476705 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Feb  6 18:11:45 lustre-client-1 kernel: Lustre: 10376:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Feb  6 18:11:45 lustre-client-1 kernel: LustreError: 166-1: MGC10.0.2.6@tcp: Connection to MGS (at 10.0.2.6@tcp) was lost; in progress operations using this service will fail
Feb  6 18:11:45 lustre-client-1 kernel: LustreError: 15c-8: MGC10.0.2.6@tcp: The configuration from log 'lustrewt-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Feb  6 18:11:45 lustre-client-1 kernel: Lustre: MGC10.0.2.6@tcp: Connection restored to MGC10.0.2.6@tcp_0 (at 10.0.2.6@tcp)
Feb  6 18:11:45 lustre-client-1 kernel: Lustre: Unmounted lustrewt-client
Feb  6 18:11:45 lustre-client-1 kernel: LustreError: 10376:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-5)
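
Since the MDS log above shows a connection being restored at the same timestamp, basic connectivity looks alive; still, a raw TCP check against the LNet listener can rule out transport issues. A minimal sketch, assuming the default LNet TCP port of 988 and that nc is installed on the client:

    # from the client: does the MGS accept TCP connections on the LNet port?
    nc -zv 10.0.2.6 988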

Thanks,
Pinkesh Valdria
OCI – Big Data
Principal Solutions Architect 
m: +1-206-234-4314
pinkesh.valdria@oracle.com


-----Original Message-----
From: Andreas Dilger <adilger@whamcloud.com>
Sent: Wednesday, February 6, 2019 2:28 AM
To: Pinkesh Valdria <pinkesh.valdria@oracle.com>
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] MDS/MGS has a block storage device mounted and it does not have any permissions (no read, no write, no execute)

On Feb 5, 2019, at 15:39, Pinkesh Valdria <pinkesh.valdria@oracle.com> wrote:
> 
> Hello All,
>  
> I am new to Lustre.  I started by using the docs on this page to deploy Lustre on virtual machines running CentOS 7.x (CentOS-7-2018.08.15-0).  Included below are the contents of the scripts I used and the error I get.
> I have not done any setup for “o2ib0(ib0)”, and LNet is using tcp.  All the nodes are on the same network & subnet and can communicate on any protocol and port #.
>  
> Thanks for your help.  I am completely blocked and looking for ideas (already did a Google search ☹).
>  
> I have 2 questions:  
> 	• The MDT mounted on the MDS has no permissions (no read, no write, no execute), even for the root user on the MDS/MGS node.  Is that expected?  See the “MGS/MDS node setup” section for more details on what I did.
> [root@lustre-mds-server-1 opc]# mount -t lustre /dev/sdb /mnt/mdt
>  
> [root@lustre-mds-server-1 opc]# ll /mnt
> total 0
> d---------. 1 root root 0 Jan  1  1970 mdt

The mountpoint on the MDS is just there for "df" to work and to manage the block device. It does not provide access to the filesystem.  You need to do a client mount for that (typically on another node, but it also works on the MDS), like:

    mount -t lustre lustre-mds-server-1:/lustrewt /mnt/lustrewt

or similar.
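
Using the MGS NID that "lctl list_nids" reports later in this thread, the equivalent mount by NID would look something like this (the mountpoint name here is just an example):

    mkdir -p /mnt/lustrewt
    mount -t lustre 10.0.2.4@tcp:/lustrewt /mnt/lustrewt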

> [root@lustre-mds-server-1 opc]#
> 	• Assuming the above is not an issue: after setting up the OSS/OST and client node, when my client tries to mount, I get the below error:
> [root@lustre-client-1 opc]# mount -t lustre 10.0.2.4@tcp:/lustrewt /mnt
> mount.lustre: mount 10.0.2.4@tcp:/lustrewt at /mnt failed: Input/output error
> Is the MGS running?
> [root@lustre-client-1 opc]#

Can you do "lctl ping" from the client to the MGS node?  Most commonly this happens because the client still has a firewall configured, or it is defined to have "127.0.0.1" as the local node address.
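
A minimal sketch of those checks from the client, using the MGS NID from this thread:

    # can LNet reach the MGS?
    lctl ping 10.0.2.4@tcp

    # which NIDs does this client advertise? (should be the Ethernet NID, not 127.0.0.1)
    lctl list_nids

    # double-check that the firewall really is stopped
    systemctl status firewalld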

> dmesg shows the below error on the client node:  
> [root@lustre-client-1 opc]# dmesg
> [35639.535862] Lustre: 11730:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1549386846/real 1549386846]  req@ffff9259bb518c00 x1624614953288208/t0(0) o250->MGC10.0.2.4@tcp@10.0.2.4@tcp:26/25 lens 520/544 e 0 to 1 dl 1549386851 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> [35640.535877] LustreError: 7718:0:(mgc_request.c:251:do_config_log_add()) MGC10.0.2.4@tcp: failed processing log, type 1: rc = -5
> [35669.535028] Lustre: 11730:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1549386871/real 1549386871]  req@ffff9259bb428f00 x1624614953288256/t0(0) o250->MGC10.0.2.4@tcp@10.0.2.4@tcp:26/25 lens 520/544 e 0 to 1 dl 1549386881 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> [35670.546671] LustreError: 15c-8: MGC10.0.2.4@tcp: The configuration from log 'lustrewt-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> [35670.557472] Lustre: Unmounted lustrewt-client
> [35670.560432] LustreError: 7718:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-5)
> [root@lustre-client-1 opc]#

Nothing here except that the client can't communicate with the MGS.  There might be some more useful messages earlier on in the logs.
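
For example, to pull the earlier Lustre/LNet messages out of the client logs (the exact grep pattern here is just a suggestion):

    dmesg | grep -iE 'lustre|lnet' | less
    grep -iE 'lustre|lnet' /var/log/messages | less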
 
> I have the firewall turned off on all nodes (client, MDS/MGS, OSS), and SELinux is disabled (setenforce 0).  I can telnet to the MDS/MGS node from the client machine.
>  
>  
> Given below is the setup I have on different nodes: 
>  
> MGS/MDS node setup
> #!/bin/bash
> service firewalld stop
> chkconfig firewalld off
>  
> cat > /etc/yum.repos.d/lustre.repo << EOF
> [hpddLustreserver]
> name=CentOS- - Lustre
> baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el7.6.1810/server/
> gpgcheck=0
>  
> [e2fsprogs]
> name=CentOS- - Ldiskfs
> baseurl=https://downloads.whamcloud.com/public/e2fsprogs/latest/el7/
> gpgcheck=0
>  
> [hpddLustreclient]
> name=CentOS- - Lustre
> baseurl=https://downloads.whamcloud.com/public/lustre/latest-release/el7.6.1810/client/
> gpgcheck=0
> EOF
>  
> sudo yum install lustre-tests -y
>  
> cp /etc/selinux/config /etc/selinux/config.backup
> sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
>  
> setenforce 0
>  
> echo "complete.  rebooting now"
> reboot
>  
>  
>  
> After the reboot is complete, I log in to the MGS/MDS node as root and run the following steps:
>  
> The node has a block storage device attached: /dev/sdb.  I run the below commands:
> pvcreate -y /dev/sdb
> mkfs.xfs -f /dev/sdb

The "mkfs.xfs" command here is useless.  Lustre uses only ext4 or ZFS for the MDT and OSTs, and reformats the filesystem with mkfs.lustre in any case.
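
Since mkfs.lustre invokes the backing mke2fs itself (see the mkfs_cmd line in the output below), the pvcreate and mkfs.xfs steps can simply be dropped; on a fresh device the whole format step reduces to the single command already used here:

    mkfs.lustre --fsname=lustrewt --index=0 --mgs --mdt /dev/sdb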

> [root@lustre-mds-server-1 opc]# setenforce 0
> [root@lustre-mds-server-1 opc]# mkfs.lustre --fsname=lustrewt --index=0 --mgs --mdt /dev/sdb
>    Permanent disk data:
> Target:     lustrewt:MDT0000
> Index:      0
> Lustre FS:  lustrewt
> Mount type: ldiskfs
> Flags:      0x65
>               (MDT MGS first_time update )
> Persistent mount opts: user_xattr,errors=remount-ro
> Parameters:
>  
> checking for existing Lustre data: not found
> device size = 51200MB
> formatting backing filesystem ldiskfs on /dev/sdb
>         target name   lustrewt:MDT0000
>         4k blocks     13107200
>         options        -J size=2048 -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F
> mkfs_cmd = mke2fs -j -b 4096 -L lustrewt:MDT0000  -J size=2048 -I 1024 -i 2560 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_journal_init -F /dev/sdb 13107200
>  
>  
> [root@lustre-mds-server-1 opc]# mkdir -p /mnt/mdt
> [root@lustre-mds-server-1 opc]# mount -t lustre /dev/sdb /mnt/mdt
> [root@lustre-mds-server-1 opc]# modprobe lnet
> [root@lustre-mds-server-1 opc]# lctl network up
> LNET configured
> [root@lustre-mds-server-1 opc]# lctl list_nids
> 10.0.2.4@tcp
>  
> [root@lustre-mds-server-1 opc]# ll /mnt
> total 0
> d---------. 1 root root 0 Jan  1  1970 mdt
> [root@lustre-mds-server-1 opc]#
>  
>  
> OSS/OST node
> 1 OSS node with 1 block device for the OST (/dev/sdb).  The setup to update the kernel was the same as for the MGS/MDS node (described above); then I ran the below commands:
>  
>  
> mkfs.lustre --ost --fsname=lustrewt --index=0 --mgsnode=10.0.2.4@tcp /dev/sdb
> mkdir -p /ostoss_mount
> mount -t lustre /dev/sdb /ostoss_mount
>  
>  
> Client node
> 1 client node.  The setup to update the kernel was the same as for the MGS/MDS node (described above); then I ran the below commands:
>  
> [root@lustre-client-1 opc]# modprobe lustre
> [root@lustre-client-1 opc]# mount -t lustre 10.0.2.3@tcp:/lustrewt /mnt   (this fails with the below error)
> mount.lustre: mount 10.0.2.4@tcp:/lustrewt at /mnt failed: Input/output error
> Is the MGS running?

You shouldn't need the "modprobe" command; it should be enough just to run "mount -t lustre ...", which auto-loads the modules.
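
If you want to confirm that the modules were auto-loaded after the mount, a quick check is something like:

    lsmod | grep -E 'lustre|lnet'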

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud
