[Lustre-discuss] How To change server recovery timeout

Nathan Rutman Nathan.Rutman at Sun.COM
Wed Nov 7 14:31:00 PST 2007


Cliff White wrote:
> Wojciech Turek wrote:
>   
>> Hi Cliff,
>>
>> On 7 Nov 2007, at 17:58, Cliff White wrote:
>>
>>     
>>> Wojciech Turek wrote:
>>>       
>>>> Hi,
>>>> Our Lustre environment is:
>>>> 2.6.9-55.0.9.EL_lustre.1.6.3smp
>>>> I would like to change the recovery timeout from the default value of
>>>> 250s to something longer.
>>>> I tried the example from the manual:
>>>> set_timeout <secs> Sets the timeout (obd_timeout) for a server
>>>> to wait before failing recovery.
>>>> We performed that experiment on our test lustre installation with one 
>>>> OST.
>>>> storage02 is our OSS
>>>> [root at storage02 ~]# lctl dl
>>>>   0 UP mgc MGC10.143.245.3 at tcp 31259d9b-e655-cdc4-c760-45d3df426d86 5
>>>>   1 UP ost OSS OSS_uuid 3
>>>>   2 UP obdfilter home-md-OST0001 home-md-OST0001_UUID 7
>>>> [root at storage02 ~]# lctl --device 2 set_timeout 600
>>>> set_timeout has been deprecated. Use conf_param instead.
>>>> e.g. conf_param lustre-MDT0000 obd_timeout=50
>>>>         
Sorry about this bad help message.  It's wrong.
>>>> usage: conf_param obd_timeout=<secs>
>>>> run <command> after connecting to device <devno>
>>>> --device <devno> <command [args ...]>
>>>> [root at storage02 ~]# lctl --device 1 conf_param obd_timeout=600
>>>> No device found for name MGS: Invalid argument
>>>> error: conf_param: No such device
>>>> It looks like I need to run this command from the MGS node, so I then
>>>> moved to our MGS server, storage03.
>>>> [root at storage03 ~]# lctl dl
>>>>   0 UP mgs MGS MGS 9
>>>>   1 UP mgc MGC10.143.245.3 at tcp f51a910b-a08e-4be6-5ada-b602a5ca9ab3 5
>>>>   2 UP mdt MDS MDS_uuid 3
>>>>   3 UP lov home-md-mdtlov home-md-mdtlov_UUID 4
>>>>   4 UP mds home-md-MDT0000 home-md-MDT0000_UUID 5
>>>>   5 UP osc home-md-OST0001-osc home-md-mdtlov_UUID 5
>>>> [root at storage03 ~]# lctl device 5
>>>> [root at storage03 ~]# lctl conf_param obd_timeout=600
>>>> error: conf_param: Function not implemented
>>>> [root at storage03 ~]# lctl --device 5 conf_param obd_timeout=600
>>>> error: conf_param: Function not implemented
>>>> [root at storage03 ~]# lctl help conf_param
>>>> conf_param: set a permanent config param. This command must be run on 
>>>> the MGS node
>>>> usage: conf_param <target.keyword=val> ...
>>>> [root at storage03 ~]# lctl conf_param home-md-MDT0000.obd_timeout=600
>>>> error: conf_param: Invalid argument
>>>> [root at storage03 ~]#
>>>> I searched the whole of /proc/*/lustre for a file that stores this
>>>> timeout value, but nothing was found.
>>>> Could someone advise how to change the recovery timeout value?
>>>> Cheers,
>>>> Wojciech Turek
>>>>         
>>> It looks like your file system is named 'home' - you can confirm with
>>> tunefs.lustre --print <MDS device> | grep "Lustre FS"
>>>
>>> The correct command (Run on the MGS) would be
>>> # lctl conf_param home.sys.timeout=<val>
>>>
>>> Example:
>>> [root at ft4 ~]# tunefs.lustre --print /dev/sdb |grep "Lustre FS"
>>> Lustre FS:  lustre
>>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>>> 130
>>> [root at ft4 ~]# lctl conf_param lustre.sys.timeout=150
>>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>>> 150
>>>       
>> Thanks for your email. I am afraid your tips aren't very helpful in this
>> case. As stated in the subject, I am asking about the recovery timeout.
>> You can see it, for example, in
>> /proc/fs/lustre/obdfilter/<OST>/recovery_status while one of your OSTs
>> is in the recovery state. By default this timeout is 250s.
>> You, on the other hand, are talking about the system obd timeout
>> (CFS documentation, chapter 4.1.2), which is not what I am asking about.
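>>
>> For reference, this is the file I mean, checked on our test OSS
>> (storage02) against the OST shown in the device list above; just a
>> sketch of where to look, since the value only appears there while the
>> OST is actually in recovery:
>>
>> [root at storage02 ~]# cat /proc/fs/lustre/obdfilter/home-md-OST0001/recovery_status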
>>
>> Anyway, I tried your example just to see if it works, and again I am
>> afraid it doesn't work for me; see below.
>> I have a combined MGS and MDS configuration.
>>
>> [root at storage03 ~]# df
>> Filesystem           1K-blocks      Used Available Use% Mounted on
>> /dev/sda1             10317828   3452824   6340888  36% /
>> /dev/sda6              7605856     49788   7169708   1% /local
>> /dev/sda3              4127108     41000   3876460   2% /tmp
>> /dev/sda2              4127108    753668   3163792  20% /var
>> /dev/dm-2            1845747840 447502120 1398245720  25% /mnt/sdb
>> /dev/dm-1            6140723200 4632947344 1507775856  76% /mnt/sdc
>> /dev/dm-3            286696376   1461588 268850900   1% /mnt/home-md/mdt
>> [root at storage03 ~]# tunefs.lustre --print /dev/dm-3 |grep "Lustre FS"
>> Lustre FS:  home-md
>> Lustre FS:  home-md
>> [root at storage03 ~]# cat /proc/sys/lustre/timeout
>> 100
>> [root at storage03 ~]# lctl conf_param home-md.sys.timeout=150
>> error: conf_param: Invalid argument
>> [root at storage03 ~]#
>>     
You need to do this on the MGS node, with the MGS running.

mgs> lctl conf_param testfs.sys.timeout=150
anynode> cat /proc/sys/lustre/timeout
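
Mapping that onto the filesystem name in this thread (home-md, from the
tunefs.lustre output earlier), the command would presumably be:

mgs> lctl conf_param home-md.sys.timeout=150
anynode> cat /proc/sys/lustre/timeout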



> Hmm, not sure why that isn't working for you; I tested the example I
> gave. Sorry about the misread. The obd recovery timeout is defined in
> relation to obd_timeout and, AFAIK, is not changeable at runtime:
>
> lustre/include/lustre_lib.h
> #define OBD_RECOVERY_TIMEOUT (obd_timeout * 5 / 2)
> ...which gives the default 250 seconds for the default obd_timeout (100 
> seconds)
>
> cliffw
>
>   
That's correct.  The two are tied together before Lustre 1.6.4.
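
In practice, a rough sketch of how you would get the ~600s recovery window
you originally asked for on 1.6.3 (assuming the home-md filesystem from
this thread): since OBD_RECOVERY_TIMEOUT = obd_timeout * 5 / 2, a
600-second window implies obd_timeout = 240.

mgs> lctl conf_param home-md.sys.timeout=240
anynode> cat /proc/sys/lustre/timeout

Note that this raises the system-wide obd timeout, not just the recovery
window.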

>> Cheers,
>>
>> Wojciech Turek
>>
>>
>>
>>     
>>> cliffw
>>>
>>>       
>> Mr Wojciech Turek
>> Assistant System Manager
>> University of Cambridge
>> High Performance Computing service
>> email: wjt27 at cam.ac.uk
>> tel. +441223763517