[Lustre-discuss] How To change server recovery timeout
Nathan Rutman
Nathan.Rutman@Sun.COM
Thu Nov 8 11:04:42 PST 2007
Nathan Rutman wrote:
> Wojciech Turek wrote:
>
>> On 7 Nov 2007, at 22:31, Nathan Rutman wrote:
>>
>>> Cliff White wrote:
>>>
>>>> Wojciech Turek wrote:
>>>>
>>>>
>>>>
>>>>
>>>>> Hi Cliff,
>>>>>
>>>>> On 7 Nov 2007, at 17:58, Cliff White wrote:
>>>>>
>>>>>> Wojciech Turek wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Our lustre environment is:
>>>>>>> 2.6.9-55.0.9.EL_lustre.1.6.3smp
>>>>>>> I would like to change the recovery timeout from its default value
>>>>>>> of 250s to something longer.
>>>>>>> I tried the example from the manual:
>>>>>>> set_timeout <secs> Sets the timeout (obd_timeout) for a server
>>>>>>> to wait before failing recovery.
>>>>>>> We performed that experiment on our test Lustre installation with
>>>>>>> one OST.
>>>>>>> storage02 is our OSS
>>>>>>> [root@storage02 ~]# lctl dl
>>>>>>> 0 UP mgc MGC10.143.245.3@tcp 31259d9b-e655-cdc4-c760-45d3df426d86 5
>>>>>>> 1 UP ost OSS OSS_uuid 3
>>>>>>> 2 UP obdfilter home-md-OST0001 home-md-OST0001_UUID 7
>>>>>>> [root@storage02 ~]# lctl --device 2 set_timeout 600
>>>>>>> set_timeout has been deprecated. Use conf_param instead.
>>>>>>> e.g. conf_param lustre-MDT0000 obd_timeout=50
>>>>>>>
>>> sorry about this bad help message. It's wrong.
>>>
>>>>>>> usage: conf_param obd_timeout=<secs>
>>>>>>> run <command> after connecting to device <devno>
>>>>>>> --device <devno> <command [args ...]>
>>>>>>> [root@storage02 ~]# lctl --device 1 conf_param obd_timeout=600
>>>>>>> No device found for name MGS: Invalid argument
>>>>>>> error: conf_param: No such device
>>>>>>> It looks like I need to run this command from the MGS node, so I
>>>>>>> moved to the MGS server, called storage03:
>>>>>>> [root@storage03 ~]# lctl dl
>>>>>>> 0 UP mgs MGS MGS 9
>>>>>>> 1 UP mgc MGC10.143.245.3@tcp f51a910b-a08e-4be6-5ada-b602a5ca9ab3 5
>>>>>>> 2 UP mdt MDS MDS_uuid 3
>>>>>>> 3 UP lov home-md-mdtlov home-md-mdtlov_UUID 4
>>>>>>> 4 UP mds home-md-MDT0000 home-md-MDT0000_UUID 5
>>>>>>> 5 UP osc home-md-OST0001-osc home-md-mdtlov_UUID 5
>>>>>>> [root@storage03 ~]# lctl device 5
>>>>>>> [root@storage03 ~]# lctl conf_param obd_timeout=600
>>>>>>> error: conf_param: Function not implemented
>>>>>>> [root@storage03 ~]# lctl --device 5 conf_param obd_timeout=600
>>>>>>> error: conf_param: Function not implemented
>>>>>>> [root@storage03 ~]# lctl help conf_param
>>>>>>> conf_param: set a permanent config param. This command must be
>>>>>>> run on the MGS node
>>>>>>> usage: conf_param <target.keyword=val> ...
>>>>>>> [root@storage03 ~]# lctl conf_param home-md-MDT0000.obd_timeout=600
>>>>>>> error: conf_param: Invalid argument
>>>>>>> [root@storage03 ~]#
>>>>>>> I searched all of /proc/*/lustre for a file that stores this
>>>>>>> timeout value, but found nothing.
>>>>>>> Could someone advise how to change the recovery timeout value?
>>>>>>> Cheers,
>>>>>>> Wojciech Turek
>>>>>>>
>>>>>> It looks like your file system is named 'home' - you can confirm with
>>>>>> tunefs.lustre --print <MDS device> | grep "Lustre FS"
>>>>>>
>>>>>> The correct command (Run on the MGS) would be
>>>>>> # lctl conf_param home.sys.timeout=<val>
>>>>>>
>>>>>> Example:
>>>>>> [root@ft4 ~]# tunefs.lustre --print /dev/sdb |grep "Lustre FS"
>>>>>> Lustre FS: lustre
>>>>>> [root@ft4 ~]# cat /proc/sys/lustre/timeout
>>>>>> 130
>>>>>> [root@ft4 ~]# lctl conf_param lustre.sys.timeout=150
>>>>>> [root@ft4 ~]# cat /proc/sys/lustre/timeout
>>>>>> 150
>>>>>>
>>>>> Thanks for your email. I am afraid your tips aren't very helpful in
>>>>> this case. As stated in the subject, I am asking about the recovery
>>>>> timeout. You can find it, for example, in
>>>>> /proc/fs/lustre/obdfilter/<OST>/recovery_status while one of your
>>>>> OSTs is in the recovery state. By default this timeout is 250s.
>>>>> You, on the other hand, are talking about the system obd timeout
>>>>> (CFS documentation, chapter 4.1.2), which is not the subject of my
>>>>> concern.
>>>>>
>>>>> Anyway, I tried your example just to see if it works, and again I am
>>>>> afraid it doesn't work for me; see below. I have a combined MGS and
>>>>> MDS configuration.
>>>>>
>>>>> [root@storage03 ~]# df
>>>>> Filesystem 1K-blocks Used Available Use% Mounted on
>>>>> /dev/sda1 10317828 3452824 6340888 36% /
>>>>> /dev/sda6 7605856 49788 7169708 1% /local
>>>>> /dev/sda3 4127108 41000 3876460 2% /tmp
>>>>> /dev/sda2 4127108 753668 3163792 20% /var
>>>>> /dev/dm-2 1845747840 447502120 1398245720 25% /mnt/sdb
>>>>> /dev/dm-1 6140723200 4632947344 1507775856 76% /mnt/sdc
>>>>> /dev/dm-3 286696376 1461588 268850900 1%
>>>>> /mnt/home-md/mdt
>>>>> [root@storage03 ~]# tunefs.lustre --print /dev/dm-3 |grep "Lustre FS"
>>>>> Lustre FS: home-md
>>>>> Lustre FS: home-md
>>>>> [root@storage03 ~]# cat /proc/sys/lustre/timeout
>>>>> 100
>>>>> [root@storage03 ~]# lctl conf_param home-md.sys.timeout=150
>>>>> error: conf_param: Invalid argument
>>>>> [root@storage03 ~]#
>>>>>
>>> You need to do this on the MGS node, with the MGS running.
>>>
>>> mgs> lctl conf_param testfs.sys.timeout=150
>>> anynode> cat /proc/sys/lustre/timeout
>>>
>> This isn't working for me. In my production configuration, the MGS is
>> combined with an MDT on the same server, and my Lustre configuration
>> consists of two file systems.
>> [root@mds01 ~]# tunefs.lustre --print /dev/dm-0
>> checking for existing Lustre data: found CONFIGS/mountdata
>> Reading CONFIGS/mountdata
>>
>> Read previous values:
>> Target: ddn-home-MDT0000
>> Index: 0
>> Lustre FS: ddn-home
>> Mount type: ldiskfs
>> Flags: 0x5
>> (MDT MGS )
>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>> Parameters: failover.node=10.143.245.202@tcp mgsnode=10.143.245.202@tcp
>>
>>
>> Permanent disk data:
>> Target: ddn-home-MDT0000
>> Index: 0
>> Lustre FS: ddn-home
>> Mount type: ldiskfs
>> Flags: 0x5
>> (MDT MGS )
>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>> Parameters: failover.node=10.143.245.202@tcp mgsnode=10.143.245.202@tcp
>>
>> exiting before disk write.
>> [root@mds01 ~]# tunefs.lustre --print /dev/dm-1
>> checking for existing Lustre data: found CONFIGS/mountdata
>> Reading CONFIGS/mountdata
>>
>> Read previous values:
>> Target: ddn-data-MDT0000
>> Index: 0
>> Lustre FS: ddn-data
>> Mount type: ldiskfs
>> Flags: 0x1
>> (MDT )
>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>> Parameters: mgsnode=10.143.245.201@tcp failover.node=10.143.245.202@tcp
>>
>>
>> Permanent disk data:
>> Target: ddn-data-MDT0000
>> Index: 0
>> Lustre FS: ddn-data
>> Mount type: ldiskfs
>> Flags: 0x1
>> (MDT )
>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>> Parameters: mgsnode=10.143.245.201@tcp failover.node=10.143.245.202@tcp
>>
>> exiting before disk write.
>> [root@mds01 ~]#
>>
>> As you can see above, the MGS is on /dev/dm-0, combined with the MDT
>> for the ddn-home file system.
>> If I try the command line from your example, I get this:
>> [root@mds01 ~]# lctl conf_param ddn-home.sys.timeout=200
>> error: conf_param: Invalid argument
>>
>> Server mds01 is definitely the MGS node, so what is wrong here? The
>> only two reasons for the problem I can think of are that the file
>> system name contains a "-" character (although I didn't find anything
>> in the documentation saying that this character is not allowed), or
>> that the MGS is combined with the MDS.
>>
>> syslog contains the following messages:
>>
>> Nov 7 18:38:35 mds01 kernel: LustreError:
>> 3273:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.
>> cfg_device from lctl is 'ddn-home'
>> Nov 7 18:38:35 mds01 kernel: LustreError:
>> 3273:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>> Nov 7 18:39:46 mds01 kernel: LustreError:
>> 3274:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.
>> cfg_device from lctl is 'ddn-data'
>> Nov 7 18:39:46 mds01 kernel: LustreError:
>> 3274:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>> Nov 7 18:39:54 mds01 kernel: LustreError:
>> 3275:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.
>> cfg_device from lctl is 'ddn-data'
>> Nov 7 18:39:54 mds01 kernel: LustreError:
>> 3275:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>> Nov 7 18:40:01 mds01 kernel: LustreError:
>> 3282:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.
>> cfg_device from lctl is 'ddn-data'
>> Nov 7 18:40:01 mds01 kernel: LustreError:
>> 3282:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>> Nov 7 18:41:06 mds01 kernel: LustreError:
>> 3305:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.
>> cfg_device from lctl is 'ddn-data'
>> Nov 7 18:41:06 mds01 kernel: LustreError:
>> 3305:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>> Nov 7 18:41:15 mds01 kernel: LustreError:
>> 3306:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.
>> cfg_device from lctl is 'ddn-home'
>> Nov 7 18:41:15 mds01 kernel: LustreError:
>> 3306:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>
>> From the above it looks like only the first part of the file system
>> name, "ddn", is recognized, and "-home" or "-data" is dropped.
>>
>> Please advise.
>>
>> Wojciech Turek
>>
>
> You seem to have found a bug. I just tried this myself and it doesn't
> work with a "-" in the name. Maybe use a '.' instead until we fix it.
>
Argh, sorry, that doesn't work with conf_param either. But an
underscore '_' does. I'm filing a bug report...
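
For anyone curious why the hyphen trips this up: target names have the
form <fsname>-MDT0000, so a parser that splits the fsname off at the
first '-' will also split a hyphenated name like 'ddn-home' down to
'ddn', which matches the "No filesystem targets for ddn" errors in the
syslog above. A minimal Python sketch of the pitfall (purely
illustrative, not the actual mgs_llog.c code; both function names are
made up):

```python
def naive_fsname(target_name: str) -> str:
    # Target names look like '<fsname>-MDT0000', so a naive parser cuts
    # at the FIRST '-' ... which also cuts 'ddn-home' down to 'ddn'.
    return target_name.split("-", 1)[0]

def safe_fsname(target_name: str) -> str:
    # Cutting at the LAST '-' keeps hyphenated fsnames intact.
    return target_name.rsplit("-", 1)[0]

print(naive_fsname("lustre-MDT0000"))    # -> lustre
print(naive_fsname("ddn-home-MDT0000"))  # -> ddn  (the symptom in the logs)
print(safe_fsname("ddn-home-MDT0000"))   # -> ddn-home
```

Until the bug is fixed, an fsname without '-' (underscores work) avoids
the ambiguity entirely.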