[Lustre-discuss] How To change server recovery timeout

Nathan Rutman Nathan.Rutman at Sun.COM
Thu Nov 8 10:54:55 PST 2007


Wojciech Turek wrote:
>
> On 7 Nov 2007, at 22:31, Nathan Rutman wrote:
>
>> Cliff White wrote:
>>> Wojciech Turek wrote:
>>>
>>>> Hi Cliff,
>>>>
>>>> On 7 Nov 2007, at 17:58, Cliff White wrote:
>>>>
>>>>> Wojciech Turek wrote:
>>>>>
>>>>>> Hi,
>>>>>> Our lustre environment is:
>>>>>> 2.6.9-55.0.9.EL_lustre.1.6.3smp
>>>>>> I would like to change recovery timeout from default value 250s 
>>>>>> to something longer
>>>>>> I tried example from manual:
>>>>>> set_timeout <secs> Sets the timeout (obd_timeout) for a server
>>>>>> to wait before failing recovery.
>>>>>> We performed that experiment on our test lustre installation with 
>>>>>> one OST.
>>>>>> storage02 is our OSS
>>>>>> [root at storage02 ~]# lctl dl
>>>>>>   0 UP mgc MGC10.143.245.3 at tcp 31259d9b-e655-cdc4-c760-45d3df426d86 5
>>>>>>   1 UP ost OSS OSS_uuid 3
>>>>>>   2 UP obdfilter home-md-OST0001 home-md-OST0001_UUID 7
>>>>>> [root at storage02 ~]# lctl --device 2 set_timeout 600
>>>>>> set_timeout has been deprecated. Use conf_param instead.
>>>>>> e.g. conf_param lustre-MDT0000 obd_timeout=50
>>>>>>
>> Sorry about this bad help message; it's wrong.
>>>>>> usage: conf_param obd_timeout=<secs>
>>>>>> run <command> after connecting to device <devno>
>>>>>> --device <devno> <command [args ...]>
>>>>>> [root at storage02 ~]# lctl --device 1 conf_param obd_timeout=600
>>>>>> No device found for name MGS: Invalid argument
>>>>>> error: conf_param: No such device
>>>>>> It looks like I need to run this command from the MGS node, so I
>>>>>> then moved to the MGS server called storage03
>>>>>> [root at storage03 ~]# lctl dl
>>>>>>   0 UP mgs MGS MGS 9
>>>>>>   1 UP mgc MGC10.143.245.3 at tcp f51a910b-a08e-4be6-5ada-b602a5ca9ab3 5
>>>>>>   2 UP mdt MDS MDS_uuid 3
>>>>>>   3 UP lov home-md-mdtlov home-md-mdtlov_UUID 4
>>>>>>   4 UP mds home-md-MDT0000 home-md-MDT0000_UUID 5
>>>>>>   5 UP osc home-md-OST0001-osc home-md-mdtlov_UUID 5
>>>>>> [root at storage03 ~]# lctl device 5
>>>>>> [root at storage03 ~]# lctl conf_param obd_timeout=600
>>>>>> error: conf_param: Function not implemented
>>>>>> [root at storage03 ~]# lctl --device 5 conf_param obd_timeout=600
>>>>>> error: conf_param: Function not implemented
>>>>>> [root at storage03 ~]# lctl help conf_param
>>>>>> conf_param: set a permanent config param. This command must be 
>>>>>> run on the MGS node
>>>>>> usage: conf_param <target.keyword=val> ...
>>>>>> [root at storage03 ~]# lctl conf_param home-md-MDT0000.obd_timeout=600
>>>>>> error: conf_param: Invalid argument
>>>>>> [root at storage03 ~]#
>>>>>> I searched the whole /proc/*/lustre tree for a file that stores this
>>>>>> timeout value, but nothing was found.
>>>>>> Could someone advise how to change value for recovery timeout?
>>>>>> Cheers,
>>>>>> Wojciech Turek
>>>>>>
>>>>> It looks like your file system is named 'home' - you can confirm with
>>>>> tunefs.lustre --print <MDS device> | grep "Lustre FS"
>>>>>
>>>>> The correct command (Run on the MGS) would be
>>>>> # lctl conf_param home.sys.timeout=<val>
>>>>>
>>>>> Example:
>>>>> [root at ft4 ~]# tunefs.lustre --print /dev/sdb |grep "Lustre FS"
>>>>> Lustre FS:  lustre
>>>>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>>>>> 130
>>>>> [root at ft4 ~]# lctl conf_param lustre.sys.timeout=150
>>>>> [root at ft4 ~]# cat /proc/sys/lustre/timeout
>>>>> 150
>>>>>
>>>> Thanks for your email. I am afraid your tips aren't very helpful in
>>>> this case. As stated in the subject, I am asking about the recovery
>>>> timeout. You can find it, for example, in
>>>> /proc/fs/lustre/obdfilter/<OST>/recovery_status while one of your
>>>> OSTs is in the recovery state. By default this timeout is 250s.
>>>> You, on the other hand, are talking about the system obd timeout
>>>> (CFS documentation, chapter 4.1.2), which is not what I am asking about.
>>>>
>>>> Anyway, I tried your example just to see if it works, and again I am
>>>> afraid it doesn't; see below.
>>>> I have a combined MGS and MDS configuration.
>>>>
>>>> [root at storage03 ~]# df
>>>> Filesystem           1K-blocks      Used Available Use% Mounted on
>>>> /dev/sda1             10317828   3452824   6340888  36% /
>>>> /dev/sda6              7605856     49788   7169708   1% /local
>>>> /dev/sda3              4127108     41000   3876460   2% /tmp
>>>> /dev/sda2              4127108    753668   3163792  20% /var
>>>> /dev/dm-2            1845747840 447502120 1398245720  25% /mnt/sdb
>>>> /dev/dm-1            6140723200 4632947344 1507775856  76% /mnt/sdc
>>>> /dev/dm-3            286696376   1461588 268850900   1% /mnt/home-md/mdt
>>>> [root at storage03 ~]# tunefs.lustre --print /dev/dm-3 |grep "Lustre FS"
>>>> Lustre FS:  home-md
>>>> Lustre FS:  home-md
>>>> [root at storage03 ~]# cat /proc/sys/lustre/timeout
>>>> 100
>>>> [root at storage03 ~]# lctl conf_param home-md.sys.timeout=150
>>>> error: conf_param: Invalid argument
>>>> [root at storage03 ~]#
>>>>
>> You need to do this on the MGS node, with the MGS running.
>>
>> mgs> lctl conf_param testfs.sys.timeout=150
>> anynode> cat /proc/sys/lustre/timeout
> This isn't working for me. In my production configuration I have the MGS
> combined with an MDT on the same server. My Lustre configuration consists
> of two file systems.
> [root at mds01 ~]# tunefs.lustre --print /dev/dm-0
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
>    Read previous values:
> Target:     ddn-home-MDT0000
> Index:      0
> Lustre FS:  ddn-home
> Mount type: ldiskfs
> Flags:      0x5
>               (MDT MGS )
> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
> Parameters: failover.node=10.143.245.202 at tcp mgsnode=10.143.245.202 at tcp
>
>
>    Permanent disk data:
> Target:     ddn-home-MDT0000
> Index:      0
> Lustre FS:  ddn-home
> Mount type: ldiskfs
> Flags:      0x5
>               (MDT MGS )
> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
> Parameters: failover.node=10.143.245.202 at tcp mgsnode=10.143.245.202 at tcp
>
> exiting before disk write.
> [root at mds01 ~]# tunefs.lustre --print /dev/dm-1
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
>    Read previous values:
> Target:     ddn-data-MDT0000
> Index:      0
> Lustre FS:  ddn-data
> Mount type: ldiskfs
> Flags:      0x1
>               (MDT )
> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
> Parameters: mgsnode=10.143.245.201 at tcp failover.node=10.143.245.202 at tcp
>
>
>    Permanent disk data:
> Target:     ddn-data-MDT0000
> Index:      0
> Lustre FS:  ddn-data
> Mount type: ldiskfs
> Flags:      0x1
>               (MDT )
> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
> Parameters: mgsnode=10.143.245.201 at tcp failover.node=10.143.245.202 at tcp
>
> exiting before disk write.
> [root at mds01 ~]# 
>
> As you can see above, the MGS is on /dev/dm-0, combined with the MDT for
> the ddn-home file system.
> If I try the command line from your example, I get this:
> [root at mds01 ~]# lctl conf_param ddn-home.sys.timeout=200
> error: conf_param: Invalid argument
>
> Server mds01 is definitely the MGS node. What is wrong here, then? I can
> think of only two possible reasons for this problem: the file system name
> contains a "-" character (although I found nothing in the documentation
> saying that this character is not allowed), or the fact that the MGS is
> combined with the MDS.
>
> syslog contains the following messages:
>
> Nov  7 18:38:35 mds01 kernel: LustreError: 
> 3273:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. 
>  cfg_device from lctl is 'ddn-home'
> Nov  7 18:38:35 mds01 kernel: LustreError: 
> 3273:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
> Nov  7 18:39:46 mds01 kernel: LustreError: 
> 3274:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. 
>  cfg_device from lctl is 'ddn-data'
> Nov  7 18:39:46 mds01 kernel: LustreError: 
> 3274:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
> Nov  7 18:39:54 mds01 kernel: LustreError: 
> 3275:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. 
>  cfg_device from lctl is 'ddn-data'
> Nov  7 18:39:54 mds01 kernel: LustreError: 
> 3275:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
> Nov  7 18:40:01 mds01 kernel: LustreError: 
> 3282:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. 
>  cfg_device from lctl is 'ddn-data'
> Nov  7 18:40:01 mds01 kernel: LustreError: 
> 3282:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
> Nov  7 18:41:06 mds01 kernel: LustreError: 
> 3305:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. 
>  cfg_device from lctl is 'ddn-data'
> Nov  7 18:41:06 mds01 kernel: LustreError: 
> 3305:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
> Nov  7 18:41:15 mds01 kernel: LustreError: 
> 3306:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. 
>  cfg_device from lctl is 'ddn-home'
> Nov  7 18:41:15 mds01 kernel: LustreError: 
> 3306:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>
> From the above it looks like only the first part of the file system
> name, "ddn", is recognized, and the -home or -data suffix is dropped.
>
> Please advise.
>
> Wojciech Turek
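The recovery timeout under discussion is reported per-OST in recovery_status, as noted above. A minimal sketch of pulling the remaining time out of such a dump (the sample text below is assumed for illustration, not taken from a live system):

```shell
# Sample recovery_status text (format assumed for illustration); on a
# real OSS this would come from:
#   cat /proc/fs/lustre/obdfilter/<OST>/recovery_status
sample='status: RECOVERING
recovery_start: 1194510000
time_remaining: 187'
# Extract the seconds left in the recovery window
printf '%s\n' "$sample" | awk -F': ' '/time_remaining/ {print $2}'
```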

You seem to have found a bug.  I just tried this myself and it doesn't 
work with a "-" in the name.  Maybe use a '.' instead until we fix it.
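The MGS errors quoted above ("No filesystem targets for ddn" while cfg_device is 'ddn-home') are consistent with the parameter parser treating the first '-' as the end of the fsname. A plain-shell illustration of that failure mode (not Lustre code):

```shell
# "ddn-home.sys.timeout=200" parsed as if the fsname ended at the
# first '-': everything from '-home' onward is lost.
param='ddn-home.sys.timeout=200'
fsname=${param%%-*}   # strip from the first '-' to the end
echo "$fsname"        # -> ddn
```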
