[Lustre-discuss] Client hangs on 'simple' lustre setup

Jon Yeargers yeargers at ohsu.edu
Mon Sep 17 08:51:21 PDT 2012


This is from the MDS machine. I rebooted the client, mounted the Lustre filesystem, and ran "ls /mnt/lustre". The client hung and remains so.

(NOTE: much of this log is from earlier today - you can see previous attempts to make this process work)

Sep 17 06:16:52 lustre_mds0 kernel: Lustre: Lustre: Build Version: RC3--PRISTINE-2.6.32-279.2.1.el6_lustre.gc46c389.x86_64
Sep 17 06:16:53 lustre_mds0 kernel: Lustre: Added LNI 10.127.24.42@tcp [8/256/0/180]
Sep 17 06:16:53 lustre_mds0 kernel: Lustre: Accept secure, port 988
Sep 17 06:16:54 lustre_mds0 kernel: Lustre: Lustre OSC module (ffffffffa07f2b20).
Sep 17 06:16:54 lustre_mds0 kernel: Lustre: Lustre LOV module (ffffffffa08714c0).
Sep 17 06:16:54 lustre_mds0 kernel: Lustre: Lustre client module (ffffffffa095e260).
Sep 17 06:17:17 lustre_mds0 kernel: LDISKFS-fs warning (device loop0): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
Sep 17 06:17:17 lustre_mds0 kernel: LDISKFS-fs (loop0): warning: maximal mount count reached, running e2fsck is recommended
Sep 17 06:17:17 lustre_mds0 kernel: LDISKFS-fs (loop0): mounted filesystem with ordered data mode. Opts:
Sep 17 06:18:12 lustre_mds0 kernel: LDISKFS-fs warning (device loop0): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
Sep 17 06:18:12 lustre_mds0 kernel: LDISKFS-fs (loop0): mounted filesystem with ordered data mode. Opts:
Sep 17 06:20:49 lustre_mds0 kernel: LDISKFS-fs warning (device loop0): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
Sep 17 06:20:49 lustre_mds0 kernel: LDISKFS-fs (loop0): mounted filesystem with ordered data mode. Opts:
Sep 17 06:20:49 lustre_mds0 kernel: LDISKFS-fs warning (device loop0): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
Sep 17 06:20:49 lustre_mds0 kernel: LDISKFS-fs (loop0): mounted filesystem with ordered data mode. Opts:
Sep 17 06:20:49 lustre_mds0 kernel: Lustre: MGS MGS started
Sep 17 06:20:49 lustre_mds0 kernel: Lustre: 1228:0:(ldlm_lib.c:946:target_handle_connect()) MGS: connection from 7923c008-a0de-1c87-f21a-4a5ab48abb96@0@lo t0 exp (null) cur 1347888049 last 0
Sep 17 06:20:49 lustre_mds0 kernel: Lustre: MGC10.127.24.42@tcp: Reactivating import
Sep 17 06:20:49 lustre_mds0 kernel: Lustre: MGS: Regenerating lustre-MDTffff log by user request.
Sep 17 06:20:49 lustre_mds0 kernel: Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
Sep 17 06:20:49 lustre_mds0 kernel: Lustre: Enabling ACL
Sep 17 06:20:49 lustre_mds0 kernel: Lustre: Enabling user_xattr
Sep 17 06:20:50 lustre_mds0 kernel: Lustre: lustre-MDT0000: new disk, initializing
Sep 17 06:20:50 lustre_mds0 kernel: Lustre: 1240:0:(mdt_lproc.c:416:lprocfs_wr_identity_upcall()) lustre-MDT0000: identity upcall set to /usr/sbin/l_getidentity
Sep 17 06:21:04 lustre_mds0 kernel: Lustre: 1228:0:(ldlm_lib.c:946:target_handle_connect()) MGS: connection from b05bf03e-633e-604d-54ba-9cc47b0b160d@10.127.56.72@tcp t0 exp (null) cur 1347888064 last 0
Sep 17 06:21:04 lustre_mds0 kernel: Lustre: MGS: Regenerating lustre-OSTffff log by user request.
Sep 17 06:21:10 lustre_mds0 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans
Sep 17 06:22:44 lustre_mds0 kernel: Lustre: 1228:0:(ldlm_lib.c:946:target_handle_connect()) MGS: connection from c72ffd03-3f98-3fc6-c8bc-648b4c266970@10.127.56.87@tcp t0 exp (null) cur 1347888164 last 0
Sep 17 06:22:44 lustre_mds0 kernel: Lustre: MGS: Regenerating lustre-OSTffff log by user request.
Sep 17 06:22:47 lustre_mds0 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0001_UUID now active, resetting orphans
Sep 17 06:24:42 lustre_mds0 kernel: LustreError: 152-6: Ignoring deprecated mount option 'acl'.
Sep 17 06:24:42 lustre_mds0 kernel: Lustre: 1253:0:(ldlm_lib.c:946:target_handle_connect()) lustre-MDT0000: connection from 8c097db6-d726-216f-0c84-77cbead03807@0@lo t0 exp (null) cur 1347888282 last 0
Sep 17 06:24:42 lustre_mds0 kernel: Lustre: Mounted lustre-client
Sep 17 06:27:21 lustre_mds0 kernel: Lustre: Modifying parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
Sep 17 06:27:21 lustre_mds0 kernel: Lustre: Skipped 4 previous similar messages
Sep 17 06:27:45 lustre_mds0 kernel: Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:0
Sep 17 06:34:24 lustre_mds0 kernel: Lustre: 1227:0:(ldlm_lib.c:946:target_handle_connect()) MGS: connection from 9077ab07-a949-368f-d946-b4d8dc6addea@10.127.56.88@tcp t0 exp (null) cur 1347888864 last 0
Sep 17 06:34:24 lustre_mds0 kernel: Lustre: MGS: Regenerating lustre-OSTffff log by user request.
Sep 17 06:34:28 lustre_mds0 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0002_UUID now active, resetting orphans
Sep 17 06:35:01 lustre_mds0 kernel: LustreError: 1352:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Sep 17 06:35:01 lustre_mds0 kernel: LustreError: 1352:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Sep 17 06:35:01 lustre_mds0 kernel: Lustre: Unmounted lustre-client
Sep 17 06:35:27 lustre_mds0 kernel: LustreError: 152-6: Ignoring deprecated mount option 'acl'.
Sep 17 06:35:27 lustre_mds0 kernel: Lustre: 1252:0:(ldlm_lib.c:946:target_handle_connect()) lustre-MDT0000: connection from b6b66579-1f44-90e5-ae63-e778d4ed6ac5@0@lo t0 exp (null) cur 1347888927 last 0
Sep 17 06:35:27 lustre_mds0 kernel: Lustre: Mounted lustre-client
Sep 17 06:57:08 lustre_mds0 kernel: Lustre: 1227:0:(ldlm_lib.c:946:target_handle_connect()) MGS: connection from 35c69b86-f9d2-e274-7ac5-5380a17f9c32@10.127.24.35@tcp t0 exp (null) cur 1347890228 last 0
Sep 17 07:31:54 lustre_mds0 kernel: hrtimer: interrupt took 10250792 ns
Sep 17 07:34:23 lustre_mds0 kernel: Lustre: 1328:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1347892452/real 1347892452]  req@ffff880078a04800 x1413362811414993/t0(0) o104->lustre-MDT0000@10.127.24.35@tcp:15/16 lens 296/192 e 0 to 1 dl 1347892463 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Sep 17 07:34:23 lustre_mds0 kernel: LustreError: 138-a: lustre-MDT0000: A client on nid 10.127.24.35@tcp was evicted due to a lock blocking callback time out: rc -107
Sep 17 07:34:39 lustre_mds0 kernel: Lustre: MGS: haven't heard from client 35c69b86-f9d2-e274-7ac5-5380a17f9c32 (at 10.127.24.35@tcp) in 50 seconds. I think it's dead, and I am evicting it. exp ffff88007bad8000, cur 1347892479 expire 1347892449 last 1347892429
Sep 17 07:39:56 lustre_mds0 kernel: Lustre: 1228:0:(ldlm_lib.c:946:target_handle_connect()) MGS: connection from 94b20a97-ffc1-174f-b2fa-84612be87730@10.127.24.35@tcp t0 exp (null) cur 1347892796 last 0
Sep 17 07:39:56 lustre_mds0 kernel: Lustre: 1228:0:(ldlm_lib.c:946:target_handle_connect()) Skipped 1 previous similar message
Sep 17 07:41:09 lustre_mds0 kernel: Lustre: MGS: haven't heard from client 94b20a97-ffc1-174f-b2fa-84612be87730 (at 10.127.24.35@tcp) in 48 seconds. I think it's dead, and I am evicting it. exp ffff88002f159c00, cur 1347892869 expire 1347892839 last 1347892821
Sep 17 07:41:13 lustre_mds0 kernel: Lustre: lustre-MDT0000: haven't heard from client 98c74908-5348-2392-0f61-6a7f6d3a0376 (at 10.127.24.35@tcp) in 52 seconds. I think it's dead, and I am evicting it. exp ffff8800765d6000, cur 1347892873 expire 1347892843 last 1347892821
Sep 17 07:43:41 lustre_mds0 kernel: Lustre: 1228:0:(ldlm_lib.c:946:target_handle_connect()) MGS: connection from 272a8405-8512-e9de-f532-feb5b7d6f9b1@10.127.56.82@tcp t0 exp (null) cur 1347893021 last 0
Sep 17 07:43:41 lustre_mds0 kernel: Lustre: 1228:0:(ldlm_lib.c:946:target_handle_connect()) Skipped 1 previous similar message
Sep 17 08:45:08 lustre_mds0 kernel: Lustre: 1228:0:(ldlm_lib.c:946:target_handle_connect()) MGS: connection from 5b114849-9a41-b016-4de2-10b780351206@10.127.24.35@tcp t0 exp (null) cur 1347896708 last 0
Sep 17 08:45:08 lustre_mds0 kernel: Lustre: 1228:0:(ldlm_lib.c:946:target_handle_connect()) Skipped 1 previous similar message
Sep 17 08:45:08 lustre_mds0 kernel: Lustre: 1252:0:(ldlm_lib.c:946:target_handle_connect()) lustre-MDT0000: connection from 46730388-647a-16f4-b349-df0a8df6a759@10.127.24.35@tcp t0 exp (null) cur 1347896708 last 0
[root at lustre_mds0 lustre]#
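
Those repeated "haven't heard from client ... I think it's dead, and I am evicting it" messages and the lock blocking callback timeout for 10.127.24.35@tcp usually mean the MDS cannot reach back to that client. A minimal two-way LNET check (assuming the NIDs shown in the log: 10.127.24.42@tcp for the MDS, 10.127.24.35@tcp for the client) would be:

# on each node, confirm which NID(s) LNet is actually using
lctl list_nids
# from the client, ping the MDS NID
lctl ping 10.127.24.42@tcp
# from the MDS, ping the client NID - this is the direction that breaks
# when a client-side firewall drops incoming connections
lctl ping 10.127.24.35@tcp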

-----Original Message-----
From: Colin Faber [mailto:colin_faber at xyratex.com]
Sent: Monday, September 17, 2012 8:38 AM
To: Jon Yeargers
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] Client hangs on 'simple' lustre setup

Hi,

Ideally just post the logging directly around the event and not the entire syslog =)

/var/log/messages right around the time you try and mount / ls

and

dmesg | tail -n 75

or so. If it's larger, pastebin is a great resource to reference so you're not polluting people's inboxes with large logs.
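
For example (a rough sketch; the paths assume a stock CentOS syslog setup):

# on the client, right after the mount / ls hangs
dmesg | tail -n 75 > /tmp/lustre-client-dmesg.txt
tail -n 200 /var/log/messages > /tmp/lustre-client-messages.txt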

-cf

On 09/17/2012 09:33 AM, Jon Yeargers wrote:
> What's the preferred method for posting large files to this list?
>
> Does Lustre have specific log files that I should look for or just anything in /var/log/messages that seems appropriate?
>
> -----Original Message-----
> From: Colin Faber [mailto:cfaber at gmail.com]
> Sent: Monday, September 17, 2012 8:05 AM
> To: Jon Yeargers
> Cc: lustre-discuss at lists.lustre.org
> Subject: Re: [Lustre-discuss] Client hangs on 'simple' lustre setup
>
> Hi Jon,
>
> If you could provide the logging from your clients and MDS (as well as possibly the OSSs), that'll help in determining the problem.
>
> -cf
>
> On 09/17/2012 08:57 AM, Jon Yeargers wrote:
>> Issue: I'm trying to assess the (possible) use of Lustre for our
>> group. To this end I've been trying to create a simple system to
>> explore the nuances. I can't seem to get past the 'llmount.sh' test
>> with any degree of success.
>>
>> What I've done: Each system (throwaway PCs with 70GB HD, 2GB RAM) is
>> formatted with CentOS 6.2. I then update everything and install the
>> Lustre kernel from downloads.whamcloud.com and add on the various
>> (appropriate) lustre and e2fs RPM files. Systems are rebooted and
>> tested with 'llmount.sh' (and then cleared with 'llmountcleanup.sh').
>> All is well to this point.
>>
>> First I create an MDS/MDT system via:
>>
>> /usr/sbin/mkfs.lustre --mgs --mdt --fsname=lustre
>> --device-size=2000000 --param sys.timeout=20
>> --mountfsoptions=errors=remount-ro,user_xattr,acl --param
>> lov.stripesize=1048576 --param lov.stripecount=0 --param
>> mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype ldiskfs
>> --reformat /tmp/lustre-mdt1
>>
>> and then
>>
>> mkdir -p /mnt/mds1
>>
>> mount -t lustre -o loop,user_xattr,acl /tmp/lustre-mdt1 /mnt/mds1
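>>
>> (A quick sanity check at this point - assuming the mount above returned
>> cleanly - is to confirm the MGS/MDT devices show up:)
>>
>> lctl dl
>> mount -t lustre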
>>
>> Next I take 3 systems and create a 2Gb loop mount via:
>>
>> /usr/sbin/mkfs.lustre --ost --fsname=lustre --device-size=2000000
>> --param sys.timeout=20 --mgsnode=lustre_MDS0@tcp --backfstype ldiskfs
>> --reformat /tmp/lustre-ost1
>>
>> mkdir -p /mnt/ost1
>>
>> mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1
>>
>> The logs on the MDT box show the OSS boxes connecting up. All appears ok.
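>>
>> (To double-check that each OST actually registered, the MDS should show
>> one osc device per OST:)
>>
>> lctl dl | grep osc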
>>
>> Last I create a client and attach to the MDT box:
>>
>> mkdir -p /mnt/lustre
>>
>> mount -t lustre -o user_xattr,acl,flock lustre_MDS0@tcp:/lustre
>> /mnt/lustre
>>
>> Again, the log on the MDT box shows the client connection. Appears to
>> be successful.
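>>
>> (Assuming the mount command itself returned, a quick client-side check
>> that the MDS and OST connections are actually healthy is:)
>>
>> lfs check servers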
>>
>> Here's where the issues (appear to) start. If I do a 'df -h' on the
>> client it hangs after showing the system drives. If I attempt to
>> create files (via 'dd') on the lustre mount the session hangs and the
>> job can't be killed. Rebooting the client is the only solution.
>>
>> I can create and use a client on the MDS/MGS box. Doing so from any
>> other machine will hang.
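>>
>> (When df or dd hangs like that, one rough way to see what it is blocked
>> on, assuming this kernel exposes /proc/<pid>/stack, is:)
>>
>> # kernel stack of the hung df (assumes a single df process)
>> cat /proc/$(pgrep -x df)/stack
>> # any new Lustre errors logged at the moment of the hang
>> dmesg | tail -n 30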
>>
>>  From the MDS box:
>>
>> [root at lustre_mds0 lustre]# lctl dl
>>
>> 0 UP mgs MGS MGS 13
>>
>> 1 UP mgc MGC10.127.24.42@tcp 7923c008-a0de-1c87-f21a-4a5ab48abb96 5
>>
>> 2 UP lov lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
>>
>> 3 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 7
>>
>> 4 UP mds mdd_obd-lustre-MDT0000 mdd_obd_uuid-lustre-MDT0000 3
>>
>> 5 UP osc lustre-OST0000-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
>>
>> 6 UP osc lustre-OST0001-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
>>
>> 7 UP lov lustre-clilov-ffff8800631c8000
>> b6b66579-1f44-90e5-ae63-e778d4ed6ac5 4
>>
>> 8 UP lmv lustre-clilmv-ffff8800631c8000
>> b6b66579-1f44-90e5-ae63-e778d4ed6ac5 4
>>
>> 9 UP mdc lustre-MDT0000-mdc-ffff8800631c8000
>> b6b66579-1f44-90e5-ae63-e778d4ed6ac5 5
>>
>> 10 UP osc lustre-OST0000-osc-ffff8800631c8000
>> b6b66579-1f44-90e5-ae63-e778d4ed6ac5 5
>>
>> 11 UP osc lustre-OST0001-osc-ffff8800631c8000
>> b6b66579-1f44-90e5-ae63-e778d4ed6ac5 5
>>
>> 12 UP osc lustre-OST0002-osc-ffff8800631c8000
>> b6b66579-1f44-90e5-ae63-e778d4ed6ac5 5
>>
>> 13 UP osc lustre-OST0002-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
>>
>> [root at lustre_mds0 lustre]# lfs df -h
>>
>> UUID bytes Used Available Use% Mounted on
>>
>> lustre-MDT0000_UUID 1.4G 83.9M 1.3G 6% /mnt/lustre[MDT:0]
>>
>> lustre-OST0000_UUID 1.9G 1.1G 716.5M 61% /mnt/lustre[OST:0]
>>
>> lustre-OST0001_UUID 1.9G 1.1G 728.5M 60% /mnt/lustre[OST:1]
>>
>> lustre-OST0002_UUID 1.9G 1.1G 728.5M 60% /mnt/lustre[OST:2]
>>
>> filesystem summary: 5.6G 3.2G 2.1G 60% /mnt/lustre
>>
>> All appears normal.
>>
>> Doing this from another (identical) client:
>>
>> [root at lfstest0 lustre]# lctl dl
>>
>> 0 UP mgc MGC10.127.24.42@tcp 272a8405-8512-e9de-f532-feb5b7d6f9b1 5
>>
>> 1 UP lov lustre-clilov-ffff880070eee400
>> 0cb7fd2e-ade0-dab3-c4b9-6b7956ef9720 4
>>
>> 2 UP lmv lustre-clilmv-ffff880070eee400
>> 0cb7fd2e-ade0-dab3-c4b9-6b7956ef9720 4
>>
>> 3 UP mdc lustre-MDT0000-mdc-ffff880070eee400
>> 0cb7fd2e-ade0-dab3-c4b9-6b7956ef9720 5
>>
>> 4 UP osc lustre-OST0000-osc-ffff880070eee400
>> 0cb7fd2e-ade0-dab3-c4b9-6b7956ef9720 5
>>
>> 5 UP osc lustre-OST0001-osc-ffff880070eee400
>> 0cb7fd2e-ade0-dab3-c4b9-6b7956ef9720 5
>>
>> 6 UP osc lustre-OST0002-osc-ffff880070eee400
>> 0cb7fd2e-ade0-dab3-c4b9-6b7956ef9720 5
>>
>> [root at lfstest0 lustre]# lfs df
>>
>> UUID 1K-blocks Used Available Use% Mounted on
>>
>> lustre-MDT0000_UUID 1499596 85888 1313708 6% /mnt/lustre[MDT:0]
>>
>> OST0000 : inactive device
>>
>> lustre-OST0001_UUID 1968528 1122468 745996 60% /mnt/lustre[OST:1]
>>
>> OST0002 : inactive device
>>
>> filesystem summary: 1968528 1122468 745996 60% /mnt/lustre
>>
>> Doing a 'dd' or 'touch' or even 'df' from this machine will hang it.
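>>
>> (The 'inactive device' lines above suggest this client never connected
>> to OST0000 and OST0002. A minimal connectivity check from this client -
>> assuming the OSS NIDs seen elsewhere in this thread - would be:)
>>
>> lctl ping 10.127.56.72@tcp
>> lctl ping 10.127.56.87@tcp
>> lctl ping 10.127.56.88@tcp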
>>
>> EDIT: each system has all other systems defined in /etc/hosts and
>> entries in iptables to provide access.
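>>
>> (For reference, a minimal iptables rule set for Lustre - it listens on
>> TCP port 988 by default, per the "Accept secure, port 988" line in the
>> MDS log - would look something like this on every node:)
>>
>> # allow incoming Lustre connections on the acceptor port
>> iptables -I INPUT -p tcp --dport 988 -j ACCEPT
>> # do not drop traffic on connections the node itself initiated
>> iptables -I INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
>> service iptables save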
>>
>> All systems have identical setup:
>>
>> [root at lfstest0 lustre]# rpm -qa | grep lustre
>>
>> lustre-ldiskfs-3.3.0-2.6.32_279.2.1.el6_lustre.gc46c389.x86_64.x86_64
>>
>> lustre-2.1.3-2.6.32_279.2.1.el6_lustre.gc46c389.x86_64.x86_64
>>
>> kernel-2.6.32-279.2.1.el6_lustre.gc46c389.x86_64
>>
>> lustre-modules-2.1.3-2.6.32_279.2.1.el6_lustre.gc46c389.x86_64.x86_64
>>
>> lustre-tests-2.1.3-2.6.32_279.2.1.el6_lustre.gc46c389.x86_64.x86_64
>>
>> [root at lfstest0 lustre]# uname -a
>>
>> Linux lfstest0 2.6.32-279.2.1.el6_lustre.gc46c389.x86_64 #1 SMP Mon
>> Aug 13 11:00:10 PDT 2012 x86_64 x86_64 x86_64 GNU/Linux
>>
>> [root at lfstest0 lustre]# rpm -qa | grep e2fs
>>
>> e2fsprogs-libs-1.41.90.wc2-7.el6.x86_64
>>
>> e2fsprogs-1.41.90.wc2-7.el6.x86_64
>>
>> SO: I'm clearly making several mistakes. Any pointers as to where to
>> start correcting them?
>>
>>
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
