[Lustre-discuss] An odd problem in my lustre 1.8.0
Larry
tsrjzq at gmail.com
Sun Dec 19 05:20:10 PST 2010
Dear all,
I ran into an odd problem today on my Lustre 1.8.0 system. All of the
OSSes and the MDS appear healthy, but one client has a problem: when I
create a file on OST5 (one of my OSTs) and then dd or echo something
into it, the process hangs and never completes. For example:
client1:/home # lfs setstripe -o 5 test.txt
client1:/home # lfs getstripe test.txt
OBDS:
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID ACTIVE
3: lustre-OST0003_UUID ACTIVE
4: lustre-OST0004_UUID ACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
7: lustre-OST0007_UUID ACTIVE
8: lustre-OST0008_UUID ACTIVE
9: lustre-OST0009_UUID ACTIVE
10: lustre-OST000a_UUID ACTIVE
11: lustre-OST000b_UUID ACTIVE
12: lustre-OST000c_UUID ACTIVE
13: lustre-OST000d_UUID ACTIVE
14: lustre-OST000e_UUID ACTIVE
15: lustre-OST000f_UUID ACTIVE
16: lustre-OST0010_UUID ACTIVE
test.txt
obdidx objid objid group
5 158029029 0x96b54e5 0
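(As a side note, the two objid columns printed by lfs getstripe are the
same value in decimal and hex, which is a quick way to cross-check them:)

```shell
# The object id is printed twice by lfs getstripe: decimal, then hex.
printf '0x%x\n' 158029029   # the objid shown above
```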
client1:/home # dd if=/dev/zero of=test.txt bs=1M count=100
The dd process hangs and never returns. If I edit the file and save it,
its location changes to another OST, no longer OST5. For example:
client1:/home # dd if=/dev/zero of=test.txt bs=1M count=100 #(ctrl-C)
1+0 records in
0+0 records out
0 bytes (0 B) copied, 173.488 seconds, 0.0 kB/s
client1:/home # vi test.txt #add something
client1:/home # lfs getstripe test.txt
OBDS:
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID ACTIVE
3: lustre-OST0003_UUID ACTIVE
4: lustre-OST0004_UUID ACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
7: lustre-OST0007_UUID ACTIVE
8: lustre-OST0008_UUID ACTIVE
9: lustre-OST0009_UUID ACTIVE
10: lustre-OST000a_UUID ACTIVE
11: lustre-OST000b_UUID ACTIVE
12: lustre-OST000c_UUID ACTIVE
13: lustre-OST000d_UUID ACTIVE
14: lustre-OST000e_UUID ACTIVE
15: lustre-OST000f_UUID ACTIVE
16: lustre-OST0010_UUID ACTIVE
test.txt
obdidx objid objid group
6 159122026 0x97c026a 0
Yet both the client and the OSS seem fine, and other clients and OSSes
do not have this problem:
client1:/home # lfs check servers
lustre-MDT0000-mdc-ffff810438d12c00 active.
lustre-OST000a-osc-ffff810438d12c00 active.
lustre-OST000f-osc-ffff810438d12c00 active.
lustre-OST000c-osc-ffff810438d12c00 active.
lustre-OST0006-osc-ffff810438d12c00 active.
lustre-OST000e-osc-ffff810438d12c00 active.
lustre-OST0009-osc-ffff810438d12c00 active.
lustre-OST0000-osc-ffff810438d12c00 active.
lustre-OST000d-osc-ffff810438d12c00 active.
lustre-OST0003-osc-ffff810438d12c00 active.
lustre-OST0002-osc-ffff810438d12c00 active.
lustre-OST0008-osc-ffff810438d12c00 active.
lustre-OST000b-osc-ffff810438d12c00 active.
lustre-OST0004-osc-ffff810438d12c00 active.
lustre-OST0007-osc-ffff810438d12c00 active.
lustre-OST0005-osc-ffff810438d12c00 active.
lustre-OST0010-osc-ffff810438d12c00 active.
lustre-OST0001-osc-ffff810438d12c00 active.
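(For completeness, a few read-only checks on the affected client; this
is only a sketch, assuming Lustre 1.8 client tools, with the OSS NID
taken from the eviction messages below:)

```shell
# Read-only client-side diagnostics (Lustre 1.8 assumed; the OSS NID
# comes from the log messages below). Guarded so it degrades gracefully
# on a host without the Lustre tools installed.
if command -v lctl >/dev/null 2>&1; then
    lctl ping 12.12.71.106@o2ib    # is the OSS reachable over LNET?
    lctl dl | grep -i ost0005      # state of the OST0005 OSC device
else
    echo "lctl not available on this host"
fi
```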
I have tried this many times, but the logs reported error messages only
once.
On the client:
Dec 19 18:28:57 client1 kernel: LustreError: 11-0: an error occurred
while communicating with 12.12.71.106@o2ib. The ost_punch operation
failed with -107
Dec 19 18:28:57 client1 kernel: LustreError: Skipped 1 previous similar message
Dec 19 18:28:57 client1 kernel: Lustre:
lustre-OST0005-osc-ffff810438d12c00: Connection to service
lustre-OST0005 via nid 12.12.71.106@o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Dec 19 18:28:57 client1 kernel: LustreError:
4570:0:(import.c:909:ptlrpc_connect_interpret()) lustre-OST0005_UUID
went back in time (transno 189979771521 was
previously committed, server now claims 0)! See
https://bugzilla.lustre.org/show_bug.cgi?id=9646
Dec 19 18:28:57 client1 kernel: LustreError: 167-0: This client was
evicted by lustre-OST0005; in progress operations using this service
will fail.
Dec 19 18:28:57 client1 kernel: LustreError:
7128:0:(rw.c:192:ll_file_punch()) obd_truncate fails (-5) ino 41729130
Dec 19 18:28:57 client1 kernel: Lustre:
lustre-OST0005-osc-ffff810438d12c00: Connection restored to service
lustre-OST0005 using nid 12.12.71.106@o2ib.
On the OSS:
Dec 19 18:27:52 os6 kernel: LustreError:
0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback
timer expired after 101s: evicting client at 12.12.12.32@o2ib ns:
filter-lustre-OST0005_UUID lock: ffff810087d66200/0xae56b014db6d6d0a
lrc: 3/0,0 mode: PR/PR res: 158015656/0 rrc: 2 type: EXT
[0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020
remote: 0xe02336632642c5fc expref: 27 pid: 5333 timeout 7284896273
Dec 19 18:28:57 os6 kernel: LustreError:
5407:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error
(-107) req@ffff8103dd91b400 x1343016412725286/t0 o10-><?>@<?>:0/0
lens 400/0 e 0 to 0 dl 1292754580 ref 1 fl Interpret:/0/0 rc -107/0
The MDS logged nothing related to this. I don't know whether these
messages are connected to the problem or not.
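(For what it's worth, the error numbers in those messages decode to
standard Linux errno values; a quick check using python3 from the shell:)

```shell
# Decode the errno values seen in the logs above.
# -107 -> ENOTCONN (transport endpoint not connected): the client lost
#         its connection to the OST, consistent with the eviction.
# -5   -> EIO: the failed truncate surfaced to the application as an I/O error.
python3 -c 'import errno, os
for n in (107, 5):
    print(n, errno.errorcode[n], "-", os.strerror(n))'
```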
My Lustre version is 1.8.0 on SLES 10 SP2. By the way, the MDS crashed
yesterday with bug #19528, so Lustre on the MDS is now patched with
attachments 23574, 23648 and 23751 from bz #19528; the other nodes have
no patches.
Thanks all.