[Lustre-discuss] How To change server recovery timeout

Wojciech Turek wjt27 at cam.ac.uk
Wed Nov 7 10:46:03 PST 2007


Hi Cliff,

On 7 Nov 2007, at 17:58, Cliff White wrote:

> Wojciech Turek wrote:
>> Hi,
>> Our lustre environment is:
>> 2.6.9-55.0.9.EL_lustre.1.6.3smp
>> I would like to change recovery timeout from default value 250s to  
>> something longer
>> I tried example from manual:
>> set_timeout <secs> Sets the timeout (obd_timeout) for a server
>> to wait before failing recovery.
>> We performed that experiment on our test lustre installation with  
>> one OST.
>> storage02 is our OSS
>> [root@storage02 ~]# lctl dl
>>   0 UP mgc MGC10.143.245.3@tcp 31259d9b-e655-cdc4-c760-45d3df426d86 5
>>   1 UP ost OSS OSS_uuid 3
>>   2 UP obdfilter home-md-OST0001 home-md-OST0001_UUID 7
>> [root@storage02 ~]# lctl --device 2 set_timeout 600
>> set_timeout has been deprecated. Use conf_param instead.
>> e.g. conf_param lustre-MDT0000 obd_timeout=50
>> usage: conf_param obd_timeout=<secs>
>> run <command> after connecting to device <devno>
>> --device <devno> <command [args ...]>
>> [root@storage02 ~]# lctl --device 1 conf_param obd_timeout=600
>> No device found for name MGS: Invalid argument
>> error: conf_param: No such device
>> It looks like I need to run this command from the MGS node, so I
>> then moved to the MGS server, called storage03
>> [root@storage03 ~]# lctl dl
>>   0 UP mgs MGS MGS 9
>>   1 UP mgc MGC10.143.245.3@tcp f51a910b-a08e-4be6-5ada-b602a5ca9ab3 5
>>   2 UP mdt MDS MDS_uuid 3
>>   3 UP lov home-md-mdtlov home-md-mdtlov_UUID 4
>>   4 UP mds home-md-MDT0000 home-md-MDT0000_UUID 5
>>   5 UP osc home-md-OST0001-osc home-md-mdtlov_UUID 5
>> [root@storage03 ~]# lctl device 5
>> [root@storage03 ~]# lctl conf_param obd_timeout=600
>> error: conf_param: Function not implemented
>> [root@storage03 ~]# lctl --device 5 conf_param obd_timeout=600
>> error: conf_param: Function not implemented
>> [root@storage03 ~]# lctl help conf_param
>> conf_param: set a permanent config param. This command must be run  
>> on the MGS node
>> usage: conf_param <target.keyword=val> ...
>> [root@storage03 ~]# lctl conf_param home-md-MDT0000.obd_timeout=600
>> error: conf_param: Invalid argument
>> [root@storage03 ~]#
>> I searched all of /proc/*/lustre for a file that might store this
>> timeout value, but found nothing.
>> Could someone advise how to change value for recovery timeout?
>> Cheers,
>> Wojciech Turek
>
> It looks like your file system is named 'home' - you can confirm with
> tunefs.lustre --print <MDS device> | grep "Lustre FS"
>
> The correct command (Run on the MGS) would be
> # lctl conf_param home.sys.timeout=<val>
>
> Example:
> [root@ft4 ~]# tunefs.lustre --print /dev/sdb |grep "Lustre FS"
> Lustre FS:  lustre
> [root@ft4 ~]# cat /proc/sys/lustre/timeout
> 130
> [root@ft4 ~]# lctl conf_param lustre.sys.timeout=150
> [root@ft4 ~]# cat /proc/sys/lustre/timeout
> 150
Thanks for your email. I am afraid your tips don't help in this case.
As stated in the subject, I am asking about the recovery timeout.
You can see it, for example, in
/proc/fs/lustre/obdfilter/<OST>/recovery_status while one of your OSTs
is in recovery. By default this timeout is 250s.
You are talking about the system obd timeout (CFS documentation,
chapter 4.1.2), which is not what I am concerned with.
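
For reference, a minimal sketch of how one could look at that value from
the shell, assuming the recovery_status files sit under
/proc/fs/lustre/obdfilter as described above. The function name
show_recovery_status and its optional base-directory argument are made up
for illustration. It only prints each file it finds and does not try to
parse any fields, since the layout may differ between versions.

```shell
#!/bin/sh
# Print the recovery_status file of every OST found under a base directory.
# The base directory defaults to the /proc path mentioned above; the
# argument is hypothetical and exists only so the function can be pointed
# at another location.
show_recovery_status() {
    base="${1:-/proc/fs/lustre/obdfilter}"
    found=0
    for f in "$base"/*/recovery_status; do
        # If the glob matched nothing, $f is the literal pattern; skip it.
        [ -r "$f" ] || continue
        found=1
        echo "== $f"
        cat "$f"
    done
    if [ "$found" -eq 0 ]; then
        echo "no recovery_status files under $base" >&2
        return 1
    fi
}
```

Called with no argument on the OSS, this should list whatever each OST
currently reports in recovery_status, which is where the timeout shows up
while an OST is recovering.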

Anyway, I tried your example just to see if it works, and I am afraid
it doesn't work for me either; see below.
I have a combined MGS and MDS configuration.

[root@storage03 ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             10317828   3452824   6340888  36% /
/dev/sda6              7605856     49788   7169708   1% /local
/dev/sda3              4127108     41000   3876460   2% /tmp
/dev/sda2              4127108    753668   3163792  20% /var
/dev/dm-2            1845747840 447502120 1398245720  25% /mnt/sdb
/dev/dm-1            6140723200 4632947344 1507775856  76% /mnt/sdc
/dev/dm-3            286696376   1461588 268850900   1% /mnt/home-md/mdt
[root@storage03 ~]# tunefs.lustre --print /dev/dm-3 |grep "Lustre FS"
Lustre FS:  home-md
Lustre FS:  home-md
[root@storage03 ~]# cat /proc/sys/lustre/timeout
100
[root@storage03 ~]# lctl conf_param home-md.sys.timeout=150
error: conf_param: Invalid argument
[root@storage03 ~]#

Cheers,

Wojciech Turek



>
> cliffw
>

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517


