[Lustre-discuss] An odd problem in my lustre 1.8.0
Larry
tsrjzq at gmail.com
Sun Dec 19 05:20:10 PST 2010
Dear all,
I ran into an odd problem today on my Lustre 1.8.0 system. All of the
OSSes and the MDS appear healthy, but one client has a problem: when I
create a file on OST5 (one of my OSTs) and then dd or echo something
into it, the process hangs and never completes. For example:
client1:/home # lfs setstripe -o 5 test.txt
client1:/home # lfs getstripe test.txt
OBDS:
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID ACTIVE
3: lustre-OST0003_UUID ACTIVE
4: lustre-OST0004_UUID ACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
7: lustre-OST0007_UUID ACTIVE
8: lustre-OST0008_UUID ACTIVE
9: lustre-OST0009_UUID ACTIVE
10: lustre-OST000a_UUID ACTIVE
11: lustre-OST000b_UUID ACTIVE
12: lustre-OST000c_UUID ACTIVE
13: lustre-OST000d_UUID ACTIVE
14: lustre-OST000e_UUID ACTIVE
15: lustre-OST000f_UUID ACTIVE
16: lustre-OST0010_UUID ACTIVE
test.txt
obdidx objid objid group
5 158029029 0x96b54e5 0
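(As a side note, the two objid columns printed by lfs getstripe are the
same value in decimal and hex, which is a quick way to cross-check them:)

```shell
# The object id is printed twice by lfs getstripe: decimal, then hex.
printf '0x%x\n' 158029029   # the objid shown above
```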
client1:/home # dd if=/dev/zero of=test.txt bs=1M count=100
The dd process hangs and never returns. If I edit the file and save it,
its location changes to another OST, no longer OST5. For example:
client1:/home # dd if=/dev/zero of=test.txt bs=1M count=100 #(ctrl-C)
1+0 records in
0+0 records out
0 bytes (0 B) copied, 173.488 seconds, 0.0 kB/s
client1:/home # vi test.txt #add something
client1:/home # lfs getstripe test.txt
OBDS:
0: lustre-OST0000_UUID ACTIVE
1: lustre-OST0001_UUID ACTIVE
2: lustre-OST0002_UUID ACTIVE
3: lustre-OST0003_UUID ACTIVE
4: lustre-OST0004_UUID ACTIVE
5: lustre-OST0005_UUID ACTIVE
6: lustre-OST0006_UUID ACTIVE
7: lustre-OST0007_UUID ACTIVE
8: lustre-OST0008_UUID ACTIVE
9: lustre-OST0009_UUID ACTIVE
10: lustre-OST000a_UUID ACTIVE
11: lustre-OST000b_UUID ACTIVE
12: lustre-OST000c_UUID ACTIVE
13: lustre-OST000d_UUID ACTIVE
14: lustre-OST000e_UUID ACTIVE
15: lustre-OST000f_UUID ACTIVE
16: lustre-OST0010_UUID ACTIVE
test.txt
obdidx objid objid group
6 159122026 0x97c026a 0
Yet both the client and the OSS seem fine, and other clients and OSSes
do not have this problem:
client1:/home # lfs check servers
lustre-MDT0000-mdc-ffff810438d12c00 active.
lustre-OST000a-osc-ffff810438d12c00 active.
lustre-OST000f-osc-ffff810438d12c00 active.
lustre-OST000c-osc-ffff810438d12c00 active.
lustre-OST0006-osc-ffff810438d12c00 active.
lustre-OST000e-osc-ffff810438d12c00 active.
lustre-OST0009-osc-ffff810438d12c00 active.
lustre-OST0000-osc-ffff810438d12c00 active.
lustre-OST000d-osc-ffff810438d12c00 active.
lustre-OST0003-osc-ffff810438d12c00 active.
lustre-OST0002-osc-ffff810438d12c00 active.
lustre-OST0008-osc-ffff810438d12c00 active.
lustre-OST000b-osc-ffff810438d12c00 active.
lustre-OST0004-osc-ffff810438d12c00 active.
lustre-OST0007-osc-ffff810438d12c00 active.
lustre-OST0005-osc-ffff810438d12c00 active.
lustre-OST0010-osc-ffff810438d12c00 active.
lustre-OST0001-osc-ffff810438d12c00 active.
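(For completeness, a few read-only checks on the affected client; this
is only a sketch, assuming Lustre 1.8 client tools, with the OSS NID
taken from the eviction messages below:)

```shell
# Read-only client-side diagnostics (Lustre 1.8 assumed; the OSS NID
# comes from the log messages below). Guarded so it degrades gracefully
# on a host without the Lustre tools installed.
if command -v lctl >/dev/null 2>&1; then
    lctl ping 12.12.71.106@o2ib    # is the OSS reachable over LNET?
    lctl dl | grep -i ost0005      # state of the OST0005 OSC device
else
    echo "lctl not available on this host"
fi
```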
I have tried this many times, but the logs reported error messages only
once.
On the client:
Dec 19 18:28:57 client1 kernel: LustreError: 11-0: an error occurred
while communicating with 12.12.71.106@o2ib. The ost_punch operation
failed with -107
Dec 19 18:28:57 client1 kernel: LustreError: Skipped 1 previous similar message
Dec 19 18:28:57 client1 kernel: Lustre:
lustre-OST0005-osc-ffff810438d12c00: Connection to service
lustre-OST0005 via nid 12.12.71.106@o2ib was lost; in progress
operations using this service will wait for recovery to complete.
Dec 19 18:28:57 client1 kernel: LustreError:
4570:0:(import.c:909:ptlrpc_connect_interpret()) lustre-OST0005_UUID
went back in time (transno 189979771521 was
previously committed, server now claims 0)! See
https://bugzilla.lustre.org/show_bug.cgi?id=9646
Dec 19 18:28:57 client1 kernel: LustreError: 167-0: This client was
evicted by lustre-OST0005; in progress operations using this service
will fail.
Dec 19 18:28:57 client1 kernel: LustreError:
7128:0:(rw.c:192:ll_file_punch()) obd_truncate fails (-5) ino 41729130
Dec 19 18:28:57 client1 kernel: Lustre:
lustre-OST0005-osc-ffff810438d12c00: Connection restored to service
lustre-OST0005 using nid 12.12.71.106@o2ib.
On the OSS:
Dec 19 18:27:52 os6 kernel: LustreError:
0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback
timer expired after 101s: evicting client at 12.12.12.32@o2ib ns:
filter-lustre-OST0005_UUID lock: ffff810087d66200/0xae56b014db6d6d0a
lrc: 3/0,0 mode: PR/PR res: 158015656/0 rrc: 2 type: EXT
[0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020
remote: 0xe02336632642c5fc expref: 27 pid: 5333 timeout 7284896273
Dec 19 18:28:57 os6 kernel: LustreError:
5407:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error
(-107) req@ffff8103dd91b400 x1343016412725286/t0 o10-><?>@<?>:0/0
lens 400/0 e 0 to 0 dl 1292754580 ref 1 fl Interpret:/0/0 rc -107/0
The MDS logged nothing related to this. I don't know whether these
messages are connected to the problem or not.
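(For what it's worth, the error numbers in those messages decode to
standard Linux errno values; a quick check using python3 from the shell:)

```shell
# Decode the errno values seen in the logs above.
# -107 -> ENOTCONN (transport endpoint not connected): the client lost
#         its connection to the OST, consistent with the eviction.
# -5   -> EIO: the failed truncate surfaced to the application as an I/O error.
python3 -c 'import errno, os
for n in (107, 5):
    print(n, errno.errorcode[n], "-", os.strerror(n))'
```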
My Lustre version is 1.8.0 on SLES 10 SP2. By the way, the MDS crashed
yesterday with bug #19528, so Lustre on the MDS is now patched with
attachments 23574, 23648 and 23751 from bz #19528; the other nodes have
no patches.
Thanks all.