[Lustre-discuss] How To change server recovery timeout

Nathan Rutman Nathan.Rutman at Sun.COM
Thu Nov 8 11:04:42 PST 2007


Nathan Rutman wrote:
> Wojciech Turek wrote:
>   
>> On 7 Nov 2007, at 22:31, Nathan Rutman wrote:
>>
>>> Cliff White wrote:
>>>       
>>>> Wojciech Turek wrote:
>>>>
>>>>> Hi Cliff,
>>>>>
>>>>> On 7 Nov 2007, at 17:58, Cliff White wrote:
>>>>>
>>>>>> Wojciech Turek wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Our lustre environment is:
>>>>>>> 2.6.9-55.0.9.EL_lustre.1.6.3smp
>>>>>>> I would like to change the recovery timeout from the default value 
>>>>>>> of 250s to something longer.
>>>>>>> I tried the example from the manual:
>>>>>>> set_timeout <secs> Sets the timeout (obd_timeout) for a server
>>>>>>> to wait before failing recovery.
>>>>>>> We performed this experiment on our test Lustre installation with 
>>>>>>> one OST.
>>>>>>> storage02 is our OSS:
>>>>>>> [root@storage02 ~]# lctl dl
>>>>>>>   0 UP mgc MGC10.143.245.3@tcp 31259d9b-e655-cdc4-c760-45d3df426d86 5
>>>>>>>   1 UP ost OSS OSS_uuid 3
>>>>>>>   2 UP obdfilter home-md-OST0001 home-md-OST0001_UUID 7
>>>>>>> [root@storage02 ~]# lctl --device 2 set_timeout 600
>>>>>>> set_timeout has been deprecated. Use conf_param instead.
>>>>>>> e.g. conf_param lustre-MDT0000 obd_timeout=50
>>>>>>>
>>> sorry about this bad help message.  It's wrong.
>>>>>>> usage: conf_param obd_timeout=<secs>
>>>>>>> run <command> after connecting to device <devno>
>>>>>>> --device <devno> <command [args ...]>
>>>>>>> [root@storage02 ~]# lctl --device 1 conf_param obd_timeout=600
>>>>>>> No device found for name MGS: Invalid argument
>>>>>>> error: conf_param: No such device
>>>>>>> It looks like I need to run this command from the MGS node, so I 
>>>>>>> then moved to the MGS server, called storage03:
>>>>>>> [root@storage03 ~]# lctl dl
>>>>>>>   0 UP mgs MGS MGS 9
>>>>>>>   1 UP mgc MGC10.143.245.3@tcp f51a910b-a08e-4be6-5ada-b602a5ca9ab3 5
>>>>>>>   2 UP mdt MDS MDS_uuid 3
>>>>>>>   3 UP lov home-md-mdtlov home-md-mdtlov_UUID 4
>>>>>>>   4 UP mds home-md-MDT0000 home-md-MDT0000_UUID 5
>>>>>>>   5 UP osc home-md-OST0001-osc home-md-mdtlov_UUID 5
>>>>>>> [root@storage03 ~]# lctl device 5
>>>>>>> [root@storage03 ~]# lctl conf_param obd_timeout=600
>>>>>>> error: conf_param: Function not implemented
>>>>>>> [root@storage03 ~]# lctl --device 5 conf_param obd_timeout=600
>>>>>>> error: conf_param: Function not implemented
>>>>>>> [root@storage03 ~]# lctl help conf_param
>>>>>>> conf_param: set a permanent config param. This command must be 
>>>>>>> run on the MGS node
>>>>>>> usage: conf_param <target.keyword=val> ...
>>>>>>> [root@storage03 ~]# lctl conf_param home-md-MDT0000.obd_timeout=600
>>>>>>> error: conf_param: Invalid argument
>>>>>>> [root@storage03 ~]#
>>>>>>> I searched the whole of /proc/*/lustre for a file that could store 
>>>>>>> this timeout value, but nothing was found.
>>>>>>> Could someone advise how to change the recovery timeout?
>>>>>>> Cheers,
>>>>>>> Wojciech Turek
>>>>>>>
>>>>>> It looks like your file system is named 'home' - you can confirm with
>>>>>> tunefs.lustre --print <MDS device> | grep "Lustre FS"
>>>>>>
>>>>>> The correct command (Run on the MGS) would be
>>>>>> # lctl conf_param home.sys.timeout=<val>
>>>>>>
>>>>>> Example:
>>>>>> [root@ft4 ~]# tunefs.lustre --print /dev/sdb |grep "Lustre FS"
>>>>>> Lustre FS:  lustre
>>>>>> [root@ft4 ~]# cat /proc/sys/lustre/timeout
>>>>>> 130
>>>>>> [root@ft4 ~]# lctl conf_param lustre.sys.timeout=150
>>>>>> [root@ft4 ~]# cat /proc/sys/lustre/timeout
>>>>>> 150
>>>>>>
>>>>> Thanks for your email. I am afraid your tips aren't very helpful in 
>>>>> this case. As stated in the subject, I am asking about the recovery 
>>>>> timeout. You can find it, for example, in 
>>>>> /proc/fs/lustre/obdfilter/<OST>/recovery_status while one of your 
>>>>> OSTs is in the recovery state. By default this timeout is 250s.
>>>>> You, on the other hand, are talking about the system obd timeout 
>>>>> (per CFS documentation chapter 4.1.2), which is not the subject of 
>>>>> my concern.
>>>>>
>>>>> Anyway, I tried your example just to see if it works, and again I am 
>>>>> afraid it doesn't work for me; see below.
>>>>> I have a combined MGS and MDS configuration.
>>>>>
>>>>> [root@storage03 ~]# df
>>>>> Filesystem           1K-blocks      Used Available Use% Mounted on
>>>>> /dev/sda1             10317828   3452824   6340888  36% /
>>>>> /dev/sda6              7605856     49788   7169708   1% /local
>>>>> /dev/sda3              4127108     41000   3876460   2% /tmp
>>>>> /dev/sda2              4127108    753668   3163792  20% /var
>>>>> /dev/dm-2            1845747840 447502120 1398245720  25% /mnt/sdb
>>>>> /dev/dm-1            6140723200 4632947344 1507775856  76% /mnt/sdc
>>>>> /dev/dm-3            286696376   1461588 268850900   1% /mnt/home-md/mdt
>>>>> [root@storage03 ~]# tunefs.lustre --print /dev/dm-3 |grep "Lustre FS"
>>>>> Lustre FS:  home-md
>>>>> Lustre FS:  home-md
>>>>> [root@storage03 ~]# cat /proc/sys/lustre/timeout
>>>>> 100
>>>>> [root@storage03 ~]# lctl conf_param home-md.sys.timeout=150
>>>>> error: conf_param: Invalid argument
>>>>> [root@storage03 ~]#
>>>>>
>>> You need to do this on the MGS node, with the MGS running.
>>>
>>> mgs> lctl conf_param testfs.sys.timeout=150
>>> anynode> cat /proc/sys/lustre/timeout
>> This isn't working for me. In my production configuration I have the 
>> MGS combined with an MDT on the same server. My Lustre configuration 
>> consists of two file systems.
>> [root@mds01 ~]# tunefs.lustre --print /dev/dm-0
>> checking for existing Lustre data: found CONFIGS/mountdata
>> Reading CONFIGS/mountdata
>>
>>    Read previous values:
>> Target:     ddn-home-MDT0000
>> Index:      0
>> Lustre FS:  ddn-home
>> Mount type: ldiskfs
>> Flags:      0x5
>>               (MDT MGS )
>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>> Parameters: failover.node=10.143.245.202@tcp mgsnode=10.143.245.202@tcp
>>
>>
>>    Permanent disk data:
>> Target:     ddn-home-MDT0000
>> Index:      0
>> Lustre FS:  ddn-home
>> Mount type: ldiskfs
>> Flags:      0x5
>>               (MDT MGS )
>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>> Parameters: failover.node=10.143.245.202@tcp mgsnode=10.143.245.202@tcp
>>
>> exiting before disk write.
>> [root@mds01 ~]# tunefs.lustre --print /dev/dm-1
>> checking for existing Lustre data: found CONFIGS/mountdata
>> Reading CONFIGS/mountdata
>>
>>    Read previous values:
>> Target:     ddn-data-MDT0000
>> Index:      0
>> Lustre FS:  ddn-data
>> Mount type: ldiskfs
>> Flags:      0x1
>>               (MDT )
>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>> Parameters: mgsnode=10.143.245.201@tcp failover.node=10.143.245.202@tcp
>>
>>
>>    Permanent disk data:
>> Target:     ddn-data-MDT0000
>> Index:      0
>> Lustre FS:  ddn-data
>> Mount type: ldiskfs
>> Flags:      0x1
>>               (MDT )
>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>> Parameters: mgsnode=10.143.245.201@tcp failover.node=10.143.245.202@tcp
>>
>> exiting before disk write.
>> [root@mds01 ~]#
>>
>> As you can see above, the MGS is on /dev/dm-0, combined with the MDT 
>> for the ddn-home file system.
>> If I try the command line from your example, I get this:
>> [root@mds01 ~]# lctl conf_param ddn-home.sys.timeout=200
>> error: conf_param: Invalid argument
>>
>> Server mds01 is definitely the MGS node, so what is wrong here? The 
>> only two reasons for this problem I can think of are that the file 
>> system name contains a "-" character (although I didn't find anything 
>> in the documentation saying that this character is not allowed), or 
>> that the MGS is combined with the MDS.
>>
>> syslog contains the following messages:
>>
>> Nov  7 18:38:35 mds01 kernel: LustreError: 
>> 3273:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. 
>>  cfg_device from lctl is 'ddn-home'
>> Nov  7 18:38:35 mds01 kernel: LustreError: 
>> 3273:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>> Nov  7 18:39:46 mds01 kernel: LustreError: 
>> 3274:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. 
>>  cfg_device from lctl is 'ddn-data'
>> Nov  7 18:39:46 mds01 kernel: LustreError: 
>> 3274:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>> Nov  7 18:39:54 mds01 kernel: LustreError: 
>> 3275:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. 
>>  cfg_device from lctl is 'ddn-data'
>> Nov  7 18:39:54 mds01 kernel: LustreError: 
>> 3275:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>> Nov  7 18:40:01 mds01 kernel: LustreError: 
>> 3282:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. 
>>  cfg_device from lctl is 'ddn-data'
>> Nov  7 18:40:01 mds01 kernel: LustreError: 
>> 3282:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>> Nov  7 18:41:06 mds01 kernel: LustreError: 
>> 3305:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. 
>>  cfg_device from lctl is 'ddn-data'
>> Nov  7 18:41:06 mds01 kernel: LustreError: 
>> 3305:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>> Nov  7 18:41:15 mds01 kernel: LustreError: 
>> 3306:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn. 
>>  cfg_device from lctl is 'ddn-home'
>> Nov  7 18:41:15 mds01 kernel: LustreError: 
>> 3306:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>
>> From the above it looks like only the first part of the file system 
>> name, "ddn", is recognized, and the "-home" or "-data" suffix is 
>> dropped.
>>
>> Please advise.
>>
>> Wojciech Turek
>
> You seem to have found a bug.  I just tried this myself and it doesn't 
> work with a "-" in the name.  Maybe use a '.' instead until we fix it.
Argh, sorry, that doesn't work with conf_param either.  But an 
underscore '_' does.  I'm filing a bug report...
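For archive readers, here is a minimal sketch of the sequence that should work once the fsname contains no '-' or '.'; `home_md`, `/dev/dm-3`, and the OST name below are hypothetical example values, not taken from a real system. If the 250s default recovery window scales with obd_timeout (250 = 2.5 x the 100s default), raising sys.timeout should lengthen the recovery window as well, which may be why sys.timeout was suggested above.

```shell
# Run on the MGS node, with the MGS mounted and running.
# 'home_md' is a placeholder fsname; in 1.6.3 a '-' or '.' in the
# fsname triggers the conf_param parsing bug discussed in this thread.

# 1. Confirm the file system name recorded on disk (example device path):
tunefs.lustre --print /dev/dm-3 | grep "Lustre FS"

# 2. Permanently set the system obd timeout for all nodes of that fs:
lctl conf_param home_md.sys.timeout=600

# 3. Verify on any node of the file system:
cat /proc/sys/lustre/timeout

# 4. While an OST is actually in recovery, its recovery window and
#    remaining time are visible on the OSS in:
cat /proc/fs/lustre/obdfilter/home_md-OST0001/recovery_status
```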



