[Lustre-discuss] Recovery Problem
Stefano Elmopi
stefano.elmopi at sociale.it
Mon May 24 05:09:35 PDT 2010
Hi,
so this is a bit strange, at least to me!
The problem occurs even when the clocks are all synchronized,
but what misled me is that, under identical conditions, the
problem does not always appear.
The test is always the same: launch a copy process and then
reboot the OSS server involved.
These are the logs when the recovery process succeeds:
MGS/MDS
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(lib-move.c:
2441:LNetPut()) Error sending PUT to 12345-172.16.100.121 at tcp: -113
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(lib-move.c:
2441:LNetPut()) Skipped 5 previous similar messages
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(events.c:
66:request_out_callback()) @@@ type 4, status -113
req at ffff81011b6dd400 x1336357074007206/t0 o400->lustre01-OST0000_UUID at 172.16.100.121
@tcp:28/4 lens 192/384 e 0 to 1 dl 1274688078 ref 2 fl Rpc:N/0/0 rc 0/0
May 24 10:00:21 mdt01prdpom kernel: LustreError: 3558:0:(events.c:
66:request_out_callback()) Skipped 5 previous similar messages
May 24 10:00:21 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection to service lustre01-OST0000 via nid 172.16.100.121 at tcp was
lost; in progress operations using this service will wait for recovery
to complete.
May 24 10:01:43 mdt01prdpom kernel: Lustre: 3690:0:(ldlm_lib.c:
575:target_handle_reconnect()) MGS: 52138191-d920-7519-563b-
ab022b922751 reconnecting
May 24 10:02:17 mdt01prdpom kernel: Lustre: 3559:0:(quota_master.c:
1716:mds_quota_recovery()) Only 1/3 OSTs are active, abort quota
recovery
May 24 10:02:17 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection restored to service lustre01-OST0000 using nid
172.16.100.121 at tcp.
May 24 10:02:17 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000:
lustre01-OST0000_UUID now active, resetting orphans
OSS
May 24 10:01:14 oss1prdpom kernel: LDISKFS-fs: file extents enabled
May 24 10:01:14 oss1prdpom kernel: LDISKFS-fs: mballoc enabled
May 24 10:01:19 oss1prdpom kernel: Lustre: 3640:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336607320834049 sent
from MGC172.16.100.111 at tcp to NID 172.16.100.111 at tcp 5s ago has timed
out (5s prior to deadline).
May 24 10:01:19 oss1prdpom kernel: req at ffff81012ec78c00
x1336607320834049/t0 o250->MGS at MGC172.16.100.111@tcp_0:26/25 lens
368/584 e 0 to 1 dl 1274688079 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: Lustre: 3640:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336607320834051 sent
from MGC172.16.100.111 at tcp to NID 0 at lo 5s ago has timed out (5s prior
to deadline).
May 24 10:01:44 oss1prdpom kernel: req at ffff81013ff1cc00
x1336607320834051/t0 o250->MGS at MGC172.16.100.111@tcp_1:26/25 lens
368/584 e 0 to 1 dl 1274688104 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: LustreError: 3557:0:(client.c:
858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81012d15e400
x1336607320834052/t0 o101->MGS at MGC172.16.100.111@tcp_1:26/25 lens
296/544 e 0 to 1 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
May 24 10:01:44 oss1prdpom kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
May 24 10:01:44 oss1prdpom kernel: Lustre: 4036:0:(filter.c:
990:filter_init_server_data()) RECOVERY: service lustre01-OST0000, 2
recoverable clients, 0 delayed clients, last_rcvd 103079215728
May 24 10:01:44 oss1prdpom kernel: Lustre: lustre01-OST0000: Now
serving lustre01-OST0000 on /dev/mpath/mpath1 with recovery enabled
May 24 10:01:44 oss1prdpom kernel: Lustre: lustre01-OST0000: Will be
in recovery for at least 5:00, or until 2 clients reconnect
May 24 10:01:44 oss1prdpom kernel: Bluetooth: Core ver 2.10
May 24 10:01:44 oss1prdpom kernel: NET: Registered protocol family 31
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HCI device and
connection manager initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HCI socket layer
initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: L2CAP ver 2.8
May 24 10:01:44 oss1prdpom kernel: Bluetooth: L2CAP socket layer
initialized
May 24 10:01:44 oss1prdpom kernel: Bluetooth: HIDP (Human Interface
Emulation) ver 1.1
May 24 10:01:44 oss1prdpom hidd[4106]: Bluetooth HID daemon
May 24 10:01:45 oss1prdpom kernel: Lustre: MGC172.16.100.111 at tcp:
Reactivating import
May 24 10:01:45 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 10:01:45 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 10:01:45 oss1prdpom rhnsd[4179]: Red Hat Network Services
Daemon starting up.
May 24 10:02:12 oss1prdpom kernel: Lustre: 3761:0:(ldlm_lib.c:
1788:target_queue_last_replay_reply()) lustre01-OST0000: 1 recoverable
clients remain
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: Recovery
period over after 0:06, of 2 clients 2 recovered and 0 were evicted.
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: sending
delayed replies to recovered clients
May 24 10:02:18 oss1prdpom kernel: Lustre: lustre01-OST0000: received
MDS connection from 172.16.100.111 at tcp
CLIENT
May 24 09:59:13 mdt02prdpom kernel: LustreError: 11-0: an error
occurred while communicating with 172.16.100.121 at tcp. The ost_write
operation failed with -107
May 24 09:59:13 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection to service lustre01-OST0000 via nid
172.16.100.121 at tcp was lost; in progress operations using this service
will wait for recovery to complete.
May 24 10:00:13 mdt02prdpom kernel: Lustre: There was an unexpected
network error while writing to 172.16.100.121: -110.
May 24 10:02:17 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection restored to service lustre01-OST0000
using nid 172.16.100.121 at tcp.
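In the successful run the key OSS markers are "Will be in recovery for at least 5:00, or until 2 clients reconnect" and "Recovery period over after 0:06, of 2 clients 2 recovered and 0 were evicted." A small sketch (a hypothetical helper of mine, not part of Lustre) that pulls these markers and their syslog timestamps out of a log excerpt like the ones above:

```python
# Hypothetical helper (not shipped with Lustre): extract the recovery
# markers and their syslog timestamps from a log excerpt.
MARKERS = (
    "Connection to service",   # import to the OST was lost
    "Will be in recovery",     # OSS opened the recovery window
    "Recovery period over",    # OSS finished recovery
    "Connection restored",     # import re-established
)

def recovery_timeline(log_text):
    """Return (timestamp, marker) pairs in the order they appear."""
    timeline = []
    for line in log_text.splitlines():
        for marker in MARKERS:
            if marker in line:
                # A syslog timestamp is the first three fields: "May 24 10:01:44"
                timestamp = " ".join(line.split()[:3])
                timeline.append((timestamp, marker))
    return timeline
```

Applied to the successful OSS log above, this shows the recovery window opening at 10:01:44 and closing at 10:02:18.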
These are the logs when the recovery process fails:
MGS/MDS
May 24 09:52:01 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection to service lustre01-OST0000 via nid 172.16.100.121 at tcp was
lost; in progress operations using this service will wait for recovery
to complete.
May 24 09:52:51 mdt01prdpom kernel: Lustre: 3559:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336357074007136 sent
from lustre01-OST0000-osc to NID 172.16.100.121 at tcp 0s ago has failed
due to network error (25s prior to deadline).
May 24 09:52:51 mdt01prdpom kernel: req at ffff81012d814c00
x1336357074007136/t0 o8->lustre01-OST0000_UUID at 172.16.100.121@tcp:28/4
lens 368/584 e 0 to 1 dl 1274687596 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 09:52:51 mdt01prdpom kernel: Lustre: 3559:0:(client.c:
1463:ptlrpc_expire_one_request()) Skipped 9 previous similar messages
May 24 09:53:43 mdt01prdpom kernel: Lustre: 3560:0:(import.c:
517:import_select_connection()) lustre01-OST0000-osc: tried all
connections, increasing latency to 21s
May 24 09:53:43 mdt01prdpom kernel: Lustre: 3560:0:(import.c:
517:import_select_connection()) Skipped 4 previous similar messages
May 24 09:53:43 mdt01prdpom kernel: LustreError: 167-0: This client
was evicted by lustre01-OST0000; in progress operations using this
service will fail.
May 24 09:53:43 mdt01prdpom kernel: Lustre: 5071:0:(quota_master.c:
1716:mds_quota_recovery()) Only 1/3 OSTs are active, abort quota
recovery
May 24 09:53:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection restored to service lustre01-OST0000 using nid
172.16.100.121 at tcp.
May 24 09:53:43 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000:
lustre01-OST0000_UUID now active, resetting orphans
May 24 09:54:35 mdt01prdpom kernel: Lustre: MGS: haven't heard from
client b6c79384-6c45-6d9c-ab9b-f12969d74da0 (at 172.16.100.121 at tcp) in
235 seconds. I think it's dead, and I am evicting it.
OSS
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_UP): eth0: link is
not ready
May 24 09:53:02 oss1prdpom kernel: e1000e: eth0 NIC Link is Up 1000
Mbps Full Duplex, Flow Control: RX/TX
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_CHANGE): eth0: link
becomes ready
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_UP): eth1: link is
not ready
May 24 09:53:02 oss1prdpom kernel: e1000e: eth1 NIC Link is Up 1000
Mbps Full Duplex, Flow Control: RX/TX
May 24 09:53:02 oss1prdpom kernel: ADDRCONF(NETDEV_CHANGE): eth1: link
becomes ready
May 24 09:53:02 oss1prdpom kernel: Lustre: OBD class driver, http://www.lustre.org/
May 24 09:53:02 oss1prdpom kernel: Lustre: Lustre Version: 1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Build Version:
1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Lustre Version: 1.8.3
May 24 09:53:02 oss1prdpom kernel: Lustre: Build Version:
1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
May 24 09:53:03 oss1prdpom kernel: Lustre: Added LNI
172.16.100.121 at tcp [8/256/0/180]
May 24 09:53:03 oss1prdpom kernel: Lustre: Accept secure, port 988
May 24 09:53:03 oss1prdpom kernel: Lustre: Lustre Client File System; http://www.lustre.org/
May 24 09:53:03 oss1prdpom kernel: init dynlocks cache
May 24 09:53:03 oss1prdpom kernel: ldiskfs created from ext3-2.6-rhel5
May 24 09:53:03 oss1prdpom kernel: kjournald starting. Commit
interval 5 seconds
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs warning: maximal mount
count reached, running e2fsck is recommended
May 24 09:53:03 oss1prdpom kernel: LDISKFS FS on dm-0, internal journal
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
May 24 09:53:03 oss1prdpom multipathd: dm-0: umount map (uevent)
May 24 09:53:03 oss1prdpom kernel: kjournald starting. Commit
interval 5 seconds
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs warning: maximal mount
count reached, running e2fsck is recommended
May 24 09:53:03 oss1prdpom kernel: LDISKFS FS on dm-0, internal journal
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mounted filesystem with
ordered data mode.
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: file extents enabled
May 24 09:53:03 oss1prdpom kernel: LDISKFS-fs: mballoc enabled
May 24 09:53:03 oss1prdpom kernel: Lustre: MGC172.16.100.111 at tcp:
Reactivating import
May 24 09:53:03 oss1prdpom kernel: Lustre: Filtering OBD driver; http://www.lustre.org/
May 24 09:53:03 oss1prdpom kernel: Lustre: lustre01-OST0000: Now
serving lustre01-OST0000 on /dev/mpath/mpath1 with recovery enabled
May 24 09:53:04 oss1prdpom kernel: Bluetooth: Core ver 2.10
May 24 09:53:04 oss1prdpom kernel: NET: Registered protocol family 31
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HCI device and
connection manager initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HCI socket layer
initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: L2CAP ver 2.8
May 24 09:53:04 oss1prdpom kernel: Bluetooth: L2CAP socket layer
initialized
May 24 09:53:04 oss1prdpom kernel: Bluetooth: HIDP (Human Interface
Emulation) ver 1.1
May 24 09:53:04 oss1prdpom hidd[4108]: Bluetooth HID daemon
May 24 09:53:04 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 09:53:04 oss1prdpom rhnsd: Red Hat Network Services Daemon
running with check_in interval set to 240 seconds.
May 24 09:53:04 oss1prdpom rhnsd[4181]: Red Hat Network Services
Daemon starting up.
May 24 09:53:44 oss1prdpom kernel: Lustre: lustre01-OST0000: received
MDS connection from 172.16.100.111 at tcp
CLIENT
May 24 09:51:28 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection to service lustre01-OST0000 via nid
172.16.100.121 at tcp was lost; in progress operations using this service
will wait for recovery to complete.
May 24 09:51:38 mdt02prdpom kernel: Lustre: There was an unexpected
network error while writing to 172.16.100.121: -110.
May 24 09:52:22 mdt02prdpom kernel: Lustre: 3796:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336351691692671 sent
from lustre01-OST0000-osc-ffff8101337bbc00 to NID 172.16.100.121 at tcp
0s ago has failed due to network error (27s prior to deadline).
May 24 09:52:22 mdt02prdpom kernel: req at ffff810109b92c00
x1336351691692671/t0 o8->lustre01-OST0000_UUID at 172.16.100.121@tcp:28/4
lens 368/584 e 0 to 1 dl 1274687569 ref 1 fl Rpc:N/0/0 rc 0/0
May 24 09:52:22 mdt02prdpom kernel: Lustre: 3796:0:(client.c:
1463:ptlrpc_expire_one_request()) Skipped 23 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: Lustre: 3797:0:(import.c:
517:import_select_connection()) lustre01-OST0000-osc-ffff8101337bbc00:
tried all connections, increasing latency to 23s
May 24 09:53:18 mdt02prdpom kernel: Lustre: 3797:0:(import.c:
517:import_select_connection()) Skipped 5 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(lib-move.c:
2441:LNetPut()) Error sending PUT to 12345-172.16.100.121 at tcp: -113
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(lib-move.c:
2441:LNetPut()) Skipped 4 previous similar messages
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(events.c:
66:request_out_callback()) @@@ type 4, status -113
req at ffff810023fcfc00 x1336351691692679/t0 o8->lustre01-OST0000_UUID at 172.16.100.121
@tcp:28/4 lens 368/584 e 0 to 1 dl 1274687626 ref 2 fl Rpc:N/0/0 rc 0/0
May 24 09:53:18 mdt02prdpom kernel: LustreError: 3796:0:(events.c:
66:request_out_callback()) Skipped 4 previous similar messages
May 24 09:54:16 mdt02prdpom kernel: LustreError: 167-0: This client
was evicted by lustre01-OST0000; in progress operations using this
service will fail.
May 24 09:54:16 mdt02prdpom kernel: LustreError: 3795:0:(client.c:
858:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81012fa9f000
x1336351691692691/t0 o4->lustre01-OST0000_UUID at 172.16.100.121@tcp:6/4
lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
May 24 09:54:16 mdt02prdpom kernel: LustreError: 3795:0:(client.c:
858:ptlrpc_import_delay_req()) Skipped 18 previous similar messages
May 24 09:54:16 mdt02prdpom kernel: Lustre: lustre01-OST0000-osc-
ffff8101337bbc00: Connection restored to service lustre01-OST0000
using nid 172.16.100.121 at tcp.
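Comparing the two runs side by side: in the failing run the OSS mounts the OST ("Now serving lustre01-OST0000 ... with recovery enabled") but never prints the "Will be in recovery ... until N clients reconnect" line, and the client is subsequently evicted. A minimal sketch encoding that observation (the marker strings are copied from the logs above; the classifier itself is my own hypothetical helper, not a Lustre tool):

```python
# Hypothetical classifier built on the markers visible in the logs above.
def classify_recovery(oss_log, client_log):
    """Rough verdict on a recovery attempt from OSS and client syslog text."""
    entered = "Will be in recovery" in oss_log    # OSS opened a recovery window
    finished = "Recovery period over" in oss_log  # window closed normally
    evicted = "evicted by" in client_log          # client lost its state
    if finished and not evicted:
        return "recovered"
    if evicted or not entered:
        return "failed"
    return "in progress"
```

Under this reading, the first run classifies as "recovered" and the second as "failed", matching the behaviour of the copy process in each test.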
Thanks!
Ing. Stefano Elmopi
Gruppo Darco - Resp. ICT Sistemi
Via Ostiense 131/L Corpo B, 00154 Roma
cell. 3466147165
tel. 0657060500
email:stefano.elmopi at sociale.it
On 21 May 2010, at 16:56, Andreas Dilger wrote:
>
> On 2010-05-21, at 5:49, Stefano Elmopi <stefano.elmopi at sociale.it>
> wrote:
>>
>> I realized that the server times differed considerably across
>> machines; there was at least a few hours of difference.
>> I had been running the tests without paying attention to time
>> synchronization, but now I have aligned the time on all servers,
>> configured the ntpd service, and the problem no longer occurs.
>> I imagine the cause of the problem was simply the time
>> misalignment.
>
> The client and server clocks should have nothing to do with the
> functioning of Lustre, so it is surprising that this would be the cause.
>
>
>> On 20 May 2010, at 13:28, Johann Lombardi wrote:
>>
>>> On Thu, May 20, 2010 at 12:29:41PM +0200, Stefano Elmopi wrote:
>>>> Hi Andreas,
>>>> my Lustre version is 1.8.3.
>>>> Sorry for my bad English; I used the wrong word, "crash" is
>>>> not the right word.
>>>> Let me explain better: I start copying a large file onto the
>>>> file system, and while the copy is in progress I reboot the
>>>> OSS server; the copy process enters the "- stalled -" state.
>>>> I expected that once the server was back online the copy
>>>> would resume and complete normally; instead, the copy process
>>>> fails.
>>>> Apart from the failed copy, Lustre continues to work fine.
>>>
>>> May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This
>>> client was
>>> evicted by lustre01-OST0000; in progress operations using this
>>> service
>>> will fail.
>>>
>>> The cp process failed because the client got evicted by the OSS.
>>> We need to look at the OSS logs to figure out the root cause of
>>> the eviction.
>>>
>>> Johann
>>