[Lustre-discuss] Client hangs on 'simple' lustre setup

Jon Yeargers yeargers@ohsu.edu
Mon Sep 17 07:57:05 PDT 2012


Issue: I'm trying to assess the possible use of Lustre for our group. To that end I've been trying to create a simple system to explore the nuances. I can't seem to get much beyond the 'llmount.sh' test with any degree of success.

What I've done: Each system (a throwaway PC with a 70 GB HD and 2 GB RAM) is installed with CentOS 6.2. I then update everything, install the Lustre kernel from downloads.whamcloud.com, and add the matching lustre and e2fsprogs RPMs. Systems are rebooted and tested with 'llmount.sh' (and then cleaned up with 'llmountcleanup.sh'). All is well to this point.
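
For reference, the per-node install sequence was roughly the following (package names abbreviated; the llmount.sh path is where the lustre-tests RPM lands on my systems):

    # install the Lustre-patched kernel plus the matching lustre and
    # e2fsprogs RPMs, then reboot into the new kernel
    rpm -ivh kernel-2.6.32-279.2.1.el6_lustre.*.rpm \
             lustre-2.1.3-*.rpm lustre-modules-2.1.3-*.rpm \
             lustre-ldiskfs-3.3.0-*.rpm lustre-tests-2.1.3-*.rpm \
             e2fsprogs-1.41.90.wc2-*.rpm e2fsprogs-libs-1.41.90.wc2-*.rpm
    reboot

    # after reboot: single-node smoke test, then tear it back down
    /usr/lib64/lustre/tests/llmount.sh
    /usr/lib64/lustre/tests/llmountcleanup.sh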

First I create an MDS/MDT system via:

    /usr/sbin/mkfs.lustre --mgs --mdt --fsname=lustre --device-size=2000000 --param sys.timeout=20 --mountfsoptions=errors=remount-ro,user_xattr,acl --param lov.stripesize=1048576 --param lov.stripecount=0 --param mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype ldiskfs --reformat /tmp/lustre-mdt1

and then

    mkdir -p /mnt/mds1
    mount -t lustre -o loop,user_xattr,acl  /tmp/lustre-mdt1 /mnt/mds1
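
(As a sanity check at this point I verify that the MGS and MDT devices both come up:)

    # on the MDS: the MGS and MDT devices should both be listed as UP
    lctl dl | grep -Ei 'mgs|mdt'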

Next I take 3 systems and create a 2 GB loopback OST on each via:

    /usr/sbin/mkfs.lustre --ost --fsname=lustre --device-size=2000000 --param sys.timeout=20 --mgsnode=lustre_MDS0@tcp --backfstype ldiskfs --reformat /tmp/lustre-ost1


    mkdir -p /mnt/ost1
    mount -t lustre -o loop  /tmp/lustre-ost1 /mnt/ost1
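
(My assumption of a healthy result here: the OSS's own device list gains an UP obdfilter entry for the new OST:)

    # on each OSS after mounting: expect an UP obdfilter device for the OST
    lctl dl | grep obdfilter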

The logs on the MDT box show the OSS boxes connecting up. All appears ok.
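
(The messages I'm going by are in the kernel log on the MDS; roughly what I check:)

    # on the MDS: the kernel log should show each OST registering/connecting
    dmesg | grep -i lustre | tail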

Last I create a client and attach to the MDT box:

    mkdir -p /mnt/lustre
    mount -t lustre -o user_xattr,acl,flock lustre_MDS0@tcp:/lustre /mnt/lustre

Again, the log on the MDT box shows the client connection. Appears to be successful.

Here's where the issues (appear to) start. If I do a 'df -h' on the client, it hangs after showing the local system drives. If I attempt to create files on the Lustre mount (via 'dd'), the session hangs and the job can't be killed. Rebooting the client is the only solution.
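
I haven't found a way to get diagnostics out of the hung session itself; what I can still run from a second shell on the client (the 'import' parameter is my best guess at the right thing to query in this release) is:

    # from a second shell on the client while 'df' is hung:
    dmesg | tail                  # look for timeout/eviction messages
    lctl get_param osc.*.import   # connection state of each OST import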

I can create and use a client on the MDS/MGS box. Doing so from any other machine will hang.

From the MDS box:

[root@lustre_mds0 lustre]# lctl dl
  0 UP mgs MGS MGS 13
  1 UP mgc MGC10.127.24.42@tcp 7923c008-a0de-1c87-f21a-4a5ab48abb96 5
  2 UP lov lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
  3 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 7
  4 UP mds mdd_obd-lustre-MDT0000 mdd_obd_uuid-lustre-MDT0000 3
  5 UP osc lustre-OST0000-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
  6 UP osc lustre-OST0001-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
  7 UP lov lustre-clilov-ffff8800631c8000 b6b66579-1f44-90e5-ae63-e778d4ed6ac5 4
  8 UP lmv lustre-clilmv-ffff8800631c8000 b6b66579-1f44-90e5-ae63-e778d4ed6ac5 4
  9 UP mdc lustre-MDT0000-mdc-ffff8800631c8000 b6b66579-1f44-90e5-ae63-e778d4ed6ac5 5
10 UP osc lustre-OST0000-osc-ffff8800631c8000 b6b66579-1f44-90e5-ae63-e778d4ed6ac5 5
11 UP osc lustre-OST0001-osc-ffff8800631c8000 b6b66579-1f44-90e5-ae63-e778d4ed6ac5 5
12 UP osc lustre-OST0002-osc-ffff8800631c8000 b6b66579-1f44-90e5-ae63-e778d4ed6ac5 5
13 UP osc lustre-OST0002-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5

[root@lustre_mds0 lustre]# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
lustre-MDT0000_UUID         1.4G       83.9M        1.3G   6% /mnt/lustre[MDT:0]
lustre-OST0000_UUID         1.9G        1.1G      716.5M  61% /mnt/lustre[OST:0]
lustre-OST0001_UUID         1.9G        1.1G      728.5M  60% /mnt/lustre[OST:1]
lustre-OST0002_UUID         1.9G        1.1G      728.5M  60% /mnt/lustre[OST:2]

filesystem summary:         5.6G        3.2G        2.1G  60% /mnt/lustre

All appears normal.


Doing this from another (identical) client:

[root@lfstest0 lustre]# lctl dl
  0 UP mgc MGC10.127.24.42@tcp 272a8405-8512-e9de-f532-feb5b7d6f9b1 5
  1 UP lov lustre-clilov-ffff880070eee400 0cb7fd2e-ade0-dab3-c4b9-6b7956ef9720 4
  2 UP lmv lustre-clilmv-ffff880070eee400 0cb7fd2e-ade0-dab3-c4b9-6b7956ef9720 4
  3 UP mdc lustre-MDT0000-mdc-ffff880070eee400 0cb7fd2e-ade0-dab3-c4b9-6b7956ef9720 5
  4 UP osc lustre-OST0000-osc-ffff880070eee400 0cb7fd2e-ade0-dab3-c4b9-6b7956ef9720 5
  5 UP osc lustre-OST0001-osc-ffff880070eee400 0cb7fd2e-ade0-dab3-c4b9-6b7956ef9720 5
  6 UP osc lustre-OST0002-osc-ffff880070eee400 0cb7fd2e-ade0-dab3-c4b9-6b7956ef9720 5

[root@lfstest0 lustre]# lfs df
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      1499596       85888     1313708   6% /mnt/lustre[MDT:0]
OST0000             : inactive device
lustre-OST0001_UUID      1968528     1122468      745996  60% /mnt/lustre[OST:1]
OST0002             : inactive device

filesystem summary:      1968528     1122468      745996  60% /mnt/lustre

Doing a 'dd' or 'touch' or even 'df' from this machine will hang it.
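
The 'inactive device' lines above presumably mean this client never completed a connection to OST0000 and OST0002. A lower-level check I can run from the client is an LNET ping to each server NID (the OSS address below is a placeholder for my hosts):

    # LNET reachability from the client; a failure here would point at the
    # network or firewall rather than Lustre itself
    lctl ping 10.127.24.42@tcp    # MGS/MDS
    lctl ping <oss_ip>@tcp        # repeat for each OSS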


EDIT: each system has all the other systems defined in /etc/hosts, plus iptables entries to allow access between them.
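
(For what it's worth, the iptables entries amount to opening the default LNET TCP port, 988, between the nodes; roughly the following, with the test subnet assumed from the MGS address:)

    # allow LNET traffic (default acceptor port 988/tcp) from the test subnet
    iptables -I INPUT -p tcp --dport 988 -s 10.127.24.0/24 -j ACCEPT
    service iptables save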

All systems have an identical setup:

[root@lfstest0 lustre]# rpm -qa | grep lustre
lustre-ldiskfs-3.3.0-2.6.32_279.2.1.el6_lustre.gc46c389.x86_64.x86_64
lustre-2.1.3-2.6.32_279.2.1.el6_lustre.gc46c389.x86_64.x86_64
kernel-2.6.32-279.2.1.el6_lustre.gc46c389.x86_64
lustre-modules-2.1.3-2.6.32_279.2.1.el6_lustre.gc46c389.x86_64.x86_64
lustre-tests-2.1.3-2.6.32_279.2.1.el6_lustre.gc46c389.x86_64.x86_64

[root@lfstest0 lustre]# uname -a
Linux lfstest0 2.6.32-279.2.1.el6_lustre.gc46c389.x86_64 #1 SMP Mon Aug 13 11:00:10 PDT 2012 x86_64 x86_64 x86_64 GNU/Linux

[root@lfstest0 lustre]# rpm -qa | grep e2fs
e2fsprogs-libs-1.41.90.wc2-7.el6.x86_64
e2fsprogs-1.41.90.wc2-7.el6.x86_64


SO: I'm clearly making several mistakes. Any pointers on where to start correcting them?