[Lustre-discuss] How To change server recovery timeout
Cliff White
Cliff.White at Sun.COM
Wed Nov 7 12:38:33 PST 2007
Wojciech Turek wrote:
> Hi Cliff,
>
> On 7 Nov 2007, at 17:58, Cliff White wrote:
>
>> Wojciech Turek wrote:
>>> Hi,
>>> Our lustre environment is:
>>> 2.6.9-55.0.9.EL_lustre.1.6.3smp
>>> I would like to change the recovery timeout from its default value of 250s
>>> to something longer.
>>> I tried example from manual:
>>> set_timeout <secs> Sets the timeout (obd_timeout) for a server
>>> to wait before failing recovery.
>>> We performed that experiment on our test lustre installation with one
>>> OST.
>>> storage02 is our OSS
>>> [root at storage02 ~]# lctl dl
>>> 0 UP mgc MGC10.143.245.3 at tcp 31259d9b-e655-cdc4-c760-45d3df426d86 5
>>> 1 UP ost OSS OSS_uuid 3
>>> 2 UP obdfilter home-md-OST0001 home-md-OST0001_UUID 7
>>> [root at storage02 ~]# lctl --device 2 set_timeout 600
>>> set_timeout has been deprecated. Use conf_param instead.
>>> e.g. conf_param lustre-MDT0000 obd_timeout=50
>>> usage: conf_param obd_timeout=<secs>
>>> run <command> after connecting to device <devno>
>>> --device <devno> <command [args ...]>
>>> [root at storage02 ~]# lctl --device 1 conf_param obd_timeout=600
>>> No device found for name MGS: Invalid argument
>>> error: conf_param: No such device
>>> It looks like I need to run this command from the MGS node, so I moved
>>> to the MGS server, called storage03:
>>> [root at storage03 ~]# lctl dl
>>> 0 UP mgs MGS MGS 9
>>> 1 UP mgc MGC10.143.245.3 at tcp f51a910b-a08e-4be6-5ada-b602a5ca9ab3 5
>>> 2 UP mdt MDS MDS_uuid 3
>>> 3 UP lov home-md-mdtlov home-md-mdtlov_UUID 4
>>> 4 UP mds home-md-MDT0000 home-md-MDT0000_UUID 5
>>> 5 UP osc home-md-OST0001-osc home-md-mdtlov_UUID 5
>>> [root at storage03 ~]# lctl device 5
>>> [root at storage03 ~]# lctl conf_param obd_timeout=600
>>> error: conf_param: Function not implemented
>>> [root at storage03 ~]# lctl --device 5 conf_param obd_timeout=600
>>> error: conf_param: Function not implemented
>>> [root at storage03 ~]# lctl help conf_param
>>> conf_param: set a permanent config param. This command must be run on
>>> the MGS node
>>> usage: conf_param <target.keyword=val> ...
>>> [root at storage03 ~]# lctl conf_param home-md-MDT0000.obd_timeout=600
>>> error: conf_param: Invalid argument
>>> [root at storage03 ~]#
>>> I searched the whole of /proc/*/lustre for a file that stores this timeout
>>> value, but found nothing.
>>> Could someone advise how to change value for recovery timeout?
>>> Cheers,
>>> Wojciech Turek
>>
>> It looks like your file system is named 'home' - you can confirm with
>> tunefs.lustre --print <MDS device> | grep "Lustre FS"
>>
>> The correct command (Run on the MGS) would be
>> # lctl conf_param home.sys.timeout=<val>
>>
>> Example:
>> [root at ft4 ~]# tunefs.lustre --print /dev/sdb |grep "Lustre FS"
>> Lustre FS: lustre
>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>> 130
>> [root at ft4 ~]# lctl conf_param lustre.sys.timeout=150
>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>> 150
> Thanks for your email. I am afraid your tips aren't very helpful in this
> case. As stated in the subject I am asking about recovery timeout.
> You can find it for example in
> /proc/fs/lustre/obdfilter/<OST>/recovery_status whilst one of your OST's
> is in recovery state. By default this timeout is 250s.
> Whereas you are talking about system obd timeout (according to CFS
> documentation chapter 4.1.2 ) which is not a subject of my concern.
>
> Anyway, I tried your example just to see if it works, and again I am
> afraid it doesn't work for me; see below:
> I have combined mgs and mds configuration.
>
> [root at storage03 ~]# df
> Filesystem 1K-blocks Used Available Use% Mounted on
> /dev/sda1 10317828 3452824 6340888 36% /
> /dev/sda6 7605856 49788 7169708 1% /local
> /dev/sda3 4127108 41000 3876460 2% /tmp
> /dev/sda2 4127108 753668 3163792 20% /var
> /dev/dm-2 1845747840 447502120 1398245720 25% /mnt/sdb
> /dev/dm-1 6140723200 4632947344 1507775856 76% /mnt/sdc
> /dev/dm-3 286696376 1461588 268850900 1% /mnt/home-md/mdt
> [root at storage03 ~]# tunefs.lustre --print /dev/dm-3 |grep "Lustre FS"
> Lustre FS: home-md
> Lustre FS: home-md
> [root at storage03 ~]# cat /proc/sys/lustre/timeout
> 100
> [root at storage03 ~]# lctl conf_param home-md.sys.timeout=150
> error: conf_param: Invalid argument
> [root at storage03 ~]#
>
Hmm, I'm not sure why that isn't working for you; I tested the example I
gave. Sorry about the misread. The obd recovery timeout is defined in
relation to obd_timeout and, afaik, is not changeable on its own at runtime:
lustre/include/lustre_lib.h
#define OBD_RECOVERY_TIMEOUT (obd_timeout * 5 / 2)
...which gives the default 250 seconds for the default obd_timeout (100
seconds)
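
Since the macro ties the recovery window to obd_timeout, the arithmetic can be sketched in a few lines of shell. This is only an illustration of the formula quoted above: the function names recovery_timeout and required_obd_timeout are made up for this example and are not Lustre commands, and the integer rounding is assumed to match the C macro.

```shell
#!/bin/sh
# Illustrative helpers only; these are NOT part of Lustre or lctl.

# Mirrors OBD_RECOVERY_TIMEOUT (obd_timeout * 5 / 2) using integer
# arithmetic, as the C macro would with an integer obd_timeout.
recovery_timeout() {
    echo $(( $1 * 5 / 2 ))
}

# Inverse: smallest obd_timeout whose recovery window is at least
# the requested number of seconds (ceiling of target * 2 / 5).
required_obd_timeout() {
    echo $(( ($1 * 2 + 4) / 5 ))
}

recovery_timeout 100        # default obd_timeout of 100 seconds
required_obd_timeout 600    # obd_timeout needed for a 600 second window
```

With these assumptions, a 600 second recovery window would need obd_timeout to be 240. Whether that value can actually be applied on a given installation is a separate question.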
cliffw
> Cheers,
>
> Wojciech Turek
>
>
>
>>
>> cliffw
>>
>>> ------------------------------------------------------------------------
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at clusterfs.com
>>> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>>
>
> Mr Wojciech Turek
> Assistant System Manager
> University of Cambridge
> High Performance Computing service
> email: wjt27 at cam.ac.uk
> tel. +441223763517