[Lustre-discuss] Recovery Problem
Stefano Elmopi
stefano.elmopi at sociale.it
Mon May 24 05:09:35 PDT 2010
Hi,
so this is a bit strange, at least to me!
The problem occurs even when the clocks are all synchronized,
but what misled me is that, under identical conditions, the
problem does not always appear.
The test is always the same: launch a copy process and then
reboot the OSS server involved.
These are the logs when the recovery process succeeds:
MGS/MDS
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(lib-move.c:
2441:LNetPut()) Error sending PUT to 12345-172.16.100.121 at tcp: -113
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(lib-move.c:
2441:LNetPut()) Skipped 5 previous similar messages
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(events.c:
66:request_out_callback()) @@@ type 4, status -113
req at ffff81011b6dd400 x1336357074007206/t0 o400->lustre01-OST0000_UUID at 172.16.100.121
@tcp:28/4 lens 192/384 e 0 to 1 dl 1274688078 ref 2 fl Rpc:N/0/0 rc 0/0
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(events.c:
66:request_out_callback()) Skipped 5 previous similar messages
May 24 10:00:21 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection to service lustre01-OST0000 via nid 172.16.100.121 at tcp was
lost; in progress operations using this service will wait for recovery
to complete.
May 24 10:01:43 mdt01prdpom kernel: Lustre: 3690:0:(ldlm_lib.c:
575:target_handle_reconnect()) MGS: 52138191-d920-7519-563b-
ab022b922751 reconnecting
May 24 10:02:17 mdt01prdpom kernel: Lustre: 3559:0:(quota_master.c:
1716:mds_quota_recovery()) Only 1/3 OSTs are active, abort quota
recovery
May 24 10:02:17 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection restored to service lustre01-OST0000 using nid
172.16.100.121 at tcp.
May 24 10:02:17 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000:
lustre01-OST0000_UUID now active, resetting orphans
OSS
May 24 10:01:14 oss1prdpom kernel: LDISKFS-fs: file extents enabled
May 24 10:01:14 oss1prdpom kernel: LDISKFS-fs: mballoc enabled
May 24 10:01:19 oss1prdpom kernel: Lustre: 3640:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336607320834049 sent
from MGC172.16.100.111 at tcp to NID 172.16.100.111 at tcp 5s ago has timed
out (5s prior to deadline).
May 24 10:01:19 oss1prdpom kernel: req at ffff81012ec78c00
x1336607320834049/t0 o250->MGS at MGC172.16.100.111@tcp_0:26/25 lens
368/584 e 0 to 1 dl 1274688079 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: Lustre: 3640:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336607320834051 sent
from MGC172.16.100.111 at tcp to NID 0 at lo 5s ago has timed out (5s prior
to deadline).
May 24 10:01:44 oss1prdpom kernel: req at ffff81013ff1cc00
x1336607320834051/t0 o250->MGS at MGC172.16.100.111@tcp_1:26/25 lens
368/584 e 0 to 1 dl 1274688104 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: LustreError: 3557:0:(client.c:
858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81012d15e400
x1336607320834052/t0 o101->MGS at MGC172.16.100.111@tcp_1:26/25 lens
296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
May 24 10:01:44 oss1prdpom kernel: Lustre: 4036:0:(filter.c:
990:filter_init_server_data()) RECOVERY: service lustre01-OST0000, 2
recoverable clients, 0 delayed clients, last_rcvd 103079215728
May 24 10:01:44 oss1prdpom kernel: Lustre: lustre01-OST0000: Now
serving lustre01-OST0000 on /dev/mpath/mpath1 with recovery enabled
May 24 10:01:44 oss1prdpom kernel: Lustre: lustre01-OST0000: Will be
in recovery for at least 5:00, or until 2 clients reconnect
May 24 10:01:44 oss1prdpom kernel: Bluetooth: Core ver 2.10
May 24 10:01:44 oss1prdpom kernel: NET: Registered protocol family 31
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HCI device and
connection manager initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HCI socket layer
initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: L2CAP ver 2.8
May 24 10:01:44 oss1prdpom kernel: Bluetooth: L2CAP socket layer
initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HIDP (Human Interface
Emulation) ver 1.1
May 24 10:01:44 oss1prdpom hidd[4106]: Bluetooth HID daemon
May 24 10:01:45 oss1prdpom kernel: Lustre: MGC172.16.100.111 at tcp:
Reactivating import
May 24 10:01:45 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 10:01:45 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 10:01:45 oss1prdpom rhnsd[4179]: Red Hat Network Services
Daemon starting up.
May 24 10:02:12 oss1prdpom kernel: Lustre: 3761:0:(ldlm_lib.c:
1788:target_queue_last_replay_reply()) lustre01-OST0000: 1 recoverable
clients remain
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: Recovery
period over after 0:06, of 2 clients 2 recovered and 0 were evicted.
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: sending
delayed replies to recovered clients
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: received
MDS connection from 172.16.100.111 at tcp
CLIENT
May 24 09:59:13 mdt02prdpom kernel: LustreError: 11-0: an error
occurred while communicating with 172.16.100.121 at tcp. The ost_write
operation failed with -107
May 24 09:59:13 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection to service lustre01-OST0000 via nid
172.16.100.121 at tcp was lost; in progress operations using this service
will wait for recovery to complete.
May 24 10:00:13 mdt02prdpom kernel: Lustre: There was an unexpected
network error while writing to 172.16.100.121: -110.
May 24 10:02:17 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection restored to service lustre01-OST0000
using nid 172.16.100.121 at tcp.
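In the successful run the key OSS markers are "Will be in recovery for at least 5:00, or until 2 clients reconnect" and "Recovery period over after 0:06, of 2 clients 2 recovered and 0 were evicted." A small sketch (a hypothetical helper of mine, not part of Lustre) that pulls these markers and their syslog timestamps out of a log excerpt like the ones above:

```python
# Hypothetical helper (not shipped with Lustre): extract the recovery
# markers and their syslog timestamps from a log excerpt.
MARKERS = (
    "Connection to service",   # import to the OST was lost
    "Will be in recovery",     # OSS opened the recovery window
    "Recovery period over",    # OSS finished recovery
    "Connection restored",     # import re-established
)

def recovery_timeline(log_text):
    """Return (timestamp, marker) pairs in the order they appear."""
    timeline = []
    for line in log_text.splitlines():
        for marker in MARKERS:
            if marker in line:
                # A syslog timestamp is the first three fields: "May 24 10:01:44"
                timestamp = " ".join(line.split()[:3])
                timeline.append((timestamp, marker))
    return timeline
```

Applied to the successful OSS log above, this shows the recovery window opening at 10:01:44 and closing at 10:02:18.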
These are the logs when the recovery process fails:
MGS/MDS
May 24 09:52:01 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection to service lustre01-OST0000 via nid 172.16.100.121 at tcp was
lost; in progress operations using this service will wait for recovery
to complete.
May 24 09:52:51 mdt01prdpom kernel: Lustre: 3559:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336357074007136 sent
from lustre01-OST0000-osc to NID 172.16.100.121 at tcp 0s ago has failed
due to network error (25s prior to deadline).
May 24 09:52:51 mdt01prdpom kernel: req at ffff81012d814c00
x1336357074007136/t0 o8->lustre01-OST0000_UUID at 172.16.100.121@tcp:28/4
lens 368/584 e 0 to 1 dl 1274687596 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 09:52:51 mdt01prdpom kernel: Lustre: 3559:0:(client.c:
1463:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
May 24 09:53:43 mdt01prdpom kernel: Lustre: 3560:0:(import.c:
517:import_select_connection()) lustre01-OST0000-osc: tried all
connections, increasing latency to 21s
May 24 09:53:43 mdt01prdpom kernel: Lustre: 3560:0:(import.c:
517:import_select_connection()) Skipped 4 previous similar messages
May 24 09:53:43 mdt01prdpom kernel: LustreError: 167-0: This client
was evicted by lustre01-OST0000; in progress operations using this
service will fail.
May 24 09:53:43 mdt01prdpom kernel: Lustre: 5071:0:(quota_master.c:
1716:mds_quota_recovery()) Only 1/3 OSTs are active, abort quota
recovery
May 24 09:53:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection restored to service lustre01-OST0000 using nid
172.16.100.121 at tcp.
May 24 09:53:43 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000:
lustre01-OST0000_UUID now active, resetting orphans
May 24 09:54:35 mdt01prdpom kernel: Lustre: MGS: haven't heard from
client b6c79384-6c45-6d9c-ab9b-f12969d74da0 (at 172.16.100.121 at tcp) in
235 seconds. I think it's dead, and I am evicting it.
OSS
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_UP): eth0: link is
not ready
May 24 09:53:02 oss1prdpom kernel: e1000e: eth0 NIC Link is Up 1000
Mbps Full Duplex, Flow Control: RX/TX
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_CHANGE): eth0: link
becomes ready
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_UP): eth1: link is
not ready
May 24 09:53:02 oss1prdpom kernel: e1000e: eth1 NIC Link is Up 1000
Mbps Full Duplex, Flow Control: RX/TX
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_CHANGE): eth1: link
becomes ready
May 24 09:53:02 oss1prdpom kernel: Lustre: OBD class driver, http://www.lustre.org/
May 24 09:53:02 oss1prdpom kernel: Lustre: Lustre Version: 1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Build Version:
1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Lustre Version: 1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Build Version:
1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
May 24 09:53:03 oss1prdpom kernel: Lustre: Added LNI
172.16.100.121 at tcp [8/256/0/180]
May 24 09:53:03 oss1prdpom kernel: Lustre: Accept secure, port 988
May 24 09:53:03 oss1prdpom kernel: Lustre: Lustre Client File System; http://www.lustre.org/
May 24 09:53:03 oss1prdpom kernel: init dynlocks cache
May 24 09:53:03 oss1prdpom kernel: ldiskfs created from ext3-2.6-rhel5
May 24 09:53:03 oss1prdpom kernel: kjournald starting. Commit
interval 5 seconds
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs warning: maximal mount
count reached, running e2fsck is recommended
May 24 09:53:03 oss1prdpom kernel: LDISKFS FS on dm-0, internal journal
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
May 24 09:53:03 oss1prdpom multipathd: dm-0: umount map (uevent)
May 24 09:53:03 oss1prdpom kernel: kjournald starting. Commit
interval 5 seconds
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs warning: maximal mount
count reached, running e2fsck is recommended
May 24 09:53:03 oss1prdpom kernel: LDISKFS FS on dm-0, internal journal
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: file extents enabled
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mballoc enabled
May 24 09:53:03 oss1prdpom kernel: Lustre: MGC172.16.100.111 at tcp:
Reactivating import
May 24 09:53:03 oss1prdpom kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
May 24 09:53:03 oss1prdpom kernel: Lustre: lustre01-OST0000: Now
serving lustre01-OST0000 on /dev/mpath/mpath1 with recovery enabled
May 24 09:53:04 oss1prdpom kernel: Bluetooth: Core ver 2.10
May 24 09:53:04 oss1prdpom kernel: NET: Registered protocol family 31
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HCI device and
connection manager initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HCI socket layer
initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: L2CAP ver 2.8
May 24 09:53:04 oss1prdpom kernel: Bluetooth: L2CAP socket layer
initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HIDP (Human Interface
Emulation) ver 1.1
May 24 09:53:04 oss1prdpom hidd[4108]: Bluetooth HID daemon
May 24 09:53:04 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 09:53:04 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 09:53:04 oss1prdpom rhnsd[4181]: Red Hat Network Services
Daemon starting up.
May 24 09:53:44 oss1prdpom kernel: Lustre: lustre01-OST0000: received
MDS connection from 172.16.100.111 at tcp
CLIENT
May 24 09:51:28 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection to service lustre01-OST0000 via nid
172.16.100.121 at tcp was lost; in progress operations using this service
will wait for recovery to complete.
May 24 09:51:38 mdt02prdpom kernel: Lustre: There was an unexpected
network error while writing to 172.16.100.121: -110.
May 24 09:52:22 mdt02prdpom kernel: Lustre: 3796:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336351691692671 sent
from lustre01-OST0000-osc-ffff8101337bbc00 to NID 172.16.100.121 at tcp
0s ago has failed due to network error (27s prior to deadline).
May 24 09:52:22 mdt02prdpom kernel: req at ffff810109b92c00
x1336351691692671/t0 o8->lustre01-OST0000_UUID at 172.16.100.121@tcp:28/4
lens 368/584 e 0 to 1 dl 1274687569 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 09:52:22 mdt02prdpom kernel: Lustre: 3796:0:(client.c:
1463:ptlrpc_expire_one_request()) Skipped 23 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: Lustre: 3797:0:(import.c:
517:import_select_connection()) lustre01-OST0000-osc-ffff8101337bbc00:
tried all connections, increasing latency to 23s
May 24 09:53:18 mdt02prdpom kernel: Lustre: 3797:0:(import.c:
517:import_select_connection()) Skipped 5 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(lib-move.c:
2441:LNetPut()) Error sending PUT to 12345-172.16.100.121 at tcp: -113
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(lib-move.c:
2441:LNetPut()) Skipped 4 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(events.c:
66:request_out_callback()) @@@ type 4, status -113
req at ffff810023fcfc00 x1336351691692679/t0 o8->lustre01-OST0000_UUID at 172.16.100.121
@tcp:28/4 lens 368/584 e 0 to 1 dl 1274687626 ref 2 fl Rpc:N/0/0 rc 0/0
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(events.c:
66:request_out_callback()) Skipped 4 previous similar messages
May 24 09:54:16 mdt02prdpom kernel: LustreError: 167-0: This client
was evicted by lustre01-OST0000; in progress operations using this
service will fail.
May 24 09:54:16 mdt02prdpom kernel: LustreError: 3795:0:(client.c:
858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81012fa9f000
x1336351691692691/t0 o4->lustre01-OST0000_UUID at 172.16.100.121@tcp:6/4
lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
May 24 09:54:16 mdt02prdpom kernel: LustreError: 3795:0:(client.c:
858:ptlrpc_import_delay_req()) Skipped 18 previous similar messages
May 24 09:54:16 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection restored to service lustre01-OST0000
using nid 172.16.100.121 at tcp.
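Comparing the two runs side by side: in the failing run the OSS mounts the OST ("Now serving lustre01-OST0000 ... with recovery enabled") but never prints the "Will be in recovery ... until N clients reconnect" line, and the client is subsequently evicted. A minimal sketch encoding that observation (the marker strings are copied from the logs above; the classifier itself is my own hypothetical helper, not a Lustre tool):

```python
# Hypothetical classifier built on the markers visible in the logs above.
def classify_recovery(oss_log, client_log):
    """Rough verdict on a recovery attempt from OSS and client syslog text."""
    entered = "Will be in recovery" in oss_log    # OSS opened a recovery window
    finished = "Recovery period over" in oss_log  # window closed normally
    evicted = "evicted by" in client_log          # client lost its state
    if finished and not evicted:
        return "recovered"
    if evicted or not entered:
        return "failed"
    return "in progress"
```

Under this reading, the first run classifies as "recovered" and the second as "failed", matching the behaviour of the copy process in each test.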
Thanks!
Ing. Stefano Elmopi
Gruppo Darco - Resp. ICT Sistemi
Via Ostiense 131/L Corpo B, 00154 Roma
cell. 3466147165
tel. 0657060500
email:stefano.elmopi at sociale.it
On 21 May 2010, at 16:56, Andreas Dilger wrote:
>
> On 2010-05-21, at 5:49, Stefano Elmopi <stefano.elmopi at sociale.it>
> wrote:
>>
>> I realized that the server times differed considerably across
>> machines; there was at least a few hours of difference.
>> I had been running the tests without paying attention to time
>> synchronization, but now I have aligned the time on all servers,
>> configured the ntpd service, and the problem no longer occurs.
>> I imagine the cause of the problem was simply the time
>> misalignment.
>
> The client and server clocks should have nothing to do with the
> functioning of Lustre, so it is surprising that this would be the cause.
>
>
>> On 20 May 2010, at 13:28, Johann Lombardi wrote:
>>
>>> On Thu, May 20, 2010 at 12:29:41PM +0200, Stefano Elmopi wrote:
>>>> Hi Andreas,
>>>> my Lustre version is 1.8.3.
>>>> Sorry for my bad English; I used the wrong word, "crash" is
>>>> not the right word.
>>>> Let me explain better: I start copying a large file onto the
>>>> file system, and while the copy is in progress I reboot the
>>>> OSS server; the copy process enters the "- stalled -" state.
>>>> I expected that once the server was back online the copy
>>>> would resume and complete normally; instead, the copy process
>>>> fails.
>>>> Apart from the failed copy, Lustre continues to work fine.
>>>
>>> May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This
>>> client was
>>> evicted by lustre01-OST0000; in progress operations using this
>>> service
>>> will fail.
>>>
>>> The cp process failed because the client got evicted by the OSS.
>>> We need to look at the OSS logs to figure out the root cause of
>>> the eviction.
>>>
>>> Johann
>>