[Lustre-discuss] How To change server recovery timeout

Wojciech Turek wjt27 at cam.ac.uk
Thu Nov 8 18:38:37 PST 2007


Hi,

The lesson for me is not to change old habits. I have always used "_",
but for the latest file system I made an exception because I thought "-"
looked neater, and here we are.
Can I change the file system name without reformatting everything? The
file system with the bad name is in production, and it is essential for
me to fix it without a long service downtime.
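
A minimal sketch of a pre-flight check on file system names, assuming
only what is reported in this thread for 1.6.3: conf_param worked with
"_" in the name and failed with "-" or ".". The function name and the
character set are illustrative assumptions, not part of Lustre.

```python
# Characters reported in this thread to break "lctl conf_param" on Lustre 1.6.3.
# This set is an assumption drawn from the thread, not from Lustre documentation.
PROBLEM_CHARS = {"-", "."}


def problem_chars_in(fsname):
    """Return the sorted problem characters found in fsname (empty list if none)."""
    return sorted(PROBLEM_CHARS.intersection(fsname))


for name in ("ddn-home", "ddn_home", "home-md", "lustre"):
    found = problem_chars_in(name)
    status = "ok" if not found else "contains " + " ".join(found)
    print(f"{name}: {status}")
```

Running a candidate name through such a check before formatting would
have flagged "ddn-home" and "ddn-data" and passed "ddn_home".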

Thanks

Wojciech Turek

On 8 Nov 2007, at 19:04, Nathan Rutman wrote:

> Nathan Rutman wrote:
>> Wojciech Turek wrote:
>>
>>> On 7 Nov 2007, at 22:31, Nathan Rutman wrote:
>>>
>>>
>>>> Cliff White wrote:
>>>>
>>>>> Wojciech Turek wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi Cliff,
>>>>>>
>>>>>> On 7 Nov 2007, at 17:58, Cliff White wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Wojciech Turek wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> Our lustre environment is:
>>>>>>>> 2.6.9-55.0.9.EL_lustre.1.6.3smp
>>>>>>>> I would like to change the recovery timeout from its default
>>>>>>>> value of 250s to something longer.
>>>>>>>> I tried the example from the manual:
>>>>>>>> set_timeout <secs> Sets the timeout (obd_timeout) for a server
>>>>>>>> to wait before failing recovery.
>>>>>>>> We performed that experiment on our test lustre installation  
>>>>>>>> with one OST.
>>>>>>>> storage02 is our OSS
>>>>>>>> [root@storage02 ~]# lctl dl
>>>>>>>>   0 UP mgc MGC10.143.245.3@tcp 31259d9b-e655-cdc4-c760-45d3df426d86 5
>>>>>>>>   1 UP ost OSS OSS_uuid 3
>>>>>>>>   2 UP obdfilter home-md-OST0001 home-md-OST0001_UUID 7
>>>>>>>> [root@storage02 ~]# lctl --device 2 set_timeout 600
>>>>>>>> set_timeout has been deprecated. Use conf_param instead.
>>>>>>>> e.g. conf_param lustre-MDT0000 obd_timeout=50
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>> sorry about this bad help message.  It's wrong.
>>>>
>>>>>>>> usage: conf_param obd_timeout=<secs>
>>>>>>>> run <command> after connecting to device <devno>
>>>>>>>> --device <devno> <command [args ...]>
>>>>>>>> [root@storage02 ~]# lctl --device 1 conf_param obd_timeout=600
>>>>>>>> No device found for name MGS: Invalid argument
>>>>>>>> error: conf_param: No such device
>>>>>>>> It looks like I need to run this command from the MGS node, so I
>>>>>>>> then moved to the MGS server, called storage03:
>>>>>>>> [root@storage03 ~]# lctl dl
>>>>>>>>   0 UP mgs MGS MGS 9
>>>>>>>>   1 UP mgc MGC10.143.245.3@tcp f51a910b-a08e-4be6-5ada-b602a5ca9ab3 5
>>>>>>>>   2 UP mdt MDS MDS_uuid 3
>>>>>>>>   3 UP lov home-md-mdtlov home-md-mdtlov_UUID 4
>>>>>>>>   4 UP mds home-md-MDT0000 home-md-MDT0000_UUID 5
>>>>>>>>   5 UP osc home-md-OST0001-osc home-md-mdtlov_UUID 5
>>>>>>>> [root@storage03 ~]# lctl device 5
>>>>>>>> [root@storage03 ~]# lctl conf_param obd_timeout=600
>>>>>>>> error: conf_param: Function not implemented
>>>>>>>> [root@storage03 ~]# lctl --device 5 conf_param obd_timeout=600
>>>>>>>> error: conf_param: Function not implemented
>>>>>>>> [root@storage03 ~]# lctl help conf_param
>>>>>>>> conf_param: set a permanent config param. This command must be run on the MGS node
>>>>>>>> usage: conf_param <target.keyword=val> ...
>>>>>>>> [root@storage03 ~]# lctl conf_param home-md-MDT0000.obd_timeout=600
>>>>>>>> error: conf_param: Invalid argument
>>>>>>>> [root@storage03 ~]#
>>>>>>>> I searched the whole of /proc/*/lustre for a file that might store
>>>>>>>> this timeout value, but found nothing.
>>>>>>>> Could someone advise how to change the value of the recovery timeout?
>>>>>>>> Cheers,
>>>>>>>> Wojciech Turek
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> It looks like your file system is named 'home' - you can  
>>>>>>> confirm with
>>>>>>> tunefs.lustre --print <MDS device> | grep "Lustre FS"
>>>>>>>
>>>>>>> The correct command (Run on the MGS) would be
>>>>>>> # lctl conf_param home.sys.timeout=<val>
>>>>>>>
>>>>>>> Example:
>>>>>>> [root@ft4 ~]# tunefs.lustre --print /dev/sdb |grep "Lustre FS"
>>>>>>> Lustre FS:  lustre
>>>>>>> [root@ft4 ~]# cat /proc/sys/lustre/timeout
>>>>>>> 130
>>>>>>> [root@ft4 ~]# lctl conf_param lustre.sys.timeout=150
>>>>>>> [root@ft4 ~]# cat /proc/sys/lustre/timeout
>>>>>>> 150
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> Thanks for your email. I am afraid your tips aren't very helpful
>>>>>> in this case. As stated in the subject, I am asking about the
>>>>>> recovery timeout.
>>>>>> You can find it, for example, in
>>>>>> /proc/fs/lustre/obdfilter/<OST>/recovery_status
>>>>>> while one of your OSTs is in the recovery state. By default this
>>>>>> timeout is 250s.
>>>>>> You, on the other hand, are talking about the system obd timeout
>>>>>> (CFS documentation, chapter 4.1.2), which is not my concern.
>>>>>>
>>>>>> Anyway, I tried your example just to see if it works and, again,
>>>>>> I am afraid it doesn't work for me; see below.
>>>>>> I have a combined MGS and MDS configuration.
>>>>>>
>>>>>> [root@storage03 ~]# df
>>>>>> Filesystem           1K-blocks      Used Available Use% Mounted on
>>>>>> /dev/sda1             10317828   3452824   6340888  36% /
>>>>>> /dev/sda6              7605856     49788   7169708   1% /local
>>>>>> /dev/sda3              4127108     41000   3876460   2% /tmp
>>>>>> /dev/sda2              4127108    753668   3163792  20% /var
>>>>>> /dev/dm-2            1845747840 447502120 1398245720  25% /mnt/sdb
>>>>>> /dev/dm-1            6140723200 4632947344 1507775856  76% /mnt/sdc
>>>>>> /dev/dm-3            286696376   1461588 268850900   1% /mnt/home-md/mdt
>>>>>> [root@storage03 ~]# tunefs.lustre --print /dev/dm-3 |grep "Lustre FS"
>>>>>> Lustre FS:  home-md
>>>>>> Lustre FS:  home-md
>>>>>> [root@storage03 ~]# cat /proc/sys/lustre/timeout
>>>>>> 100
>>>>>> [root@storage03 ~]# lctl conf_param home-md.sys.timeout=150
>>>>>> error: conf_param: Invalid argument
>>>>>> [root@storage03 ~]#
>>>>>>
>>>>>>
>>>>>>
>>>> You need to do this on the MGS node, with the MGS running.
>>>>
>>>> mgs> lctl conf_param testfs.sys.timeout=150
>>>> anynode> cat /proc/sys/lustre/timeout
>>>>
>>> This isn't working for me. In my production configuration I have the
>>> MGS combined with an MDT on the same server. My Lustre configuration
>>> consists of two file systems.
>>> [root@mds01 ~]# tunefs.lustre --print /dev/dm-0
>>> checking for existing Lustre data: found CONFIGS/mountdata
>>> Reading CONFIGS/mountdata
>>>
>>>    Read previous values:
>>> Target:     ddn-home-MDT0000
>>> Index:      0
>>> Lustre FS:  ddn-home
>>> Mount type: ldiskfs
>>> Flags:      0x5
>>>               (MDT MGS )
>>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>> Parameters: failover.node=10.143.245.202@tcp mgsnode=10.143.245.202@tcp
>>>
>>>
>>>    Permanent disk data:
>>> Target:     ddn-home-MDT0000
>>> Index:      0
>>> Lustre FS:  ddn-home
>>> Mount type: ldiskfs
>>> Flags:      0x5
>>>               (MDT MGS )
>>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>> Parameters: failover.node=10.143.245.202@tcp mgsnode=10.143.245.202@tcp
>>>
>>> exiting before disk write.
>>> [root@mds01 ~]# tunefs.lustre --print /dev/dm-1
>>> checking for existing Lustre data: found CONFIGS/mountdata
>>> Reading CONFIGS/mountdata
>>>
>>>    Read previous values:
>>> Target:     ddn-data-MDT0000
>>> Index:      0
>>> Lustre FS:  ddn-data
>>> Mount type: ldiskfs
>>> Flags:      0x1
>>>               (MDT )
>>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>> Parameters: mgsnode=10.143.245.201@tcp failover.node=10.143.245.202@tcp
>>>
>>>
>>>    Permanent disk data:
>>> Target:     ddn-data-MDT0000
>>> Index:      0
>>> Lustre FS:  ddn-data
>>> Mount type: ldiskfs
>>> Flags:      0x1
>>>               (MDT )
>>> Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>> Parameters: mgsnode=10.143.245.201@tcp failover.node=10.143.245.202@tcp
>>>
>>> exiting before disk write.
>>> [root@mds01 ~]#
>>> As you can see above, the MGS is on /dev/dm-0, combined with the MDT
>>> for the ddn-home file system.
>>> If I try the command line from your example I get this:
>>> [root@mds01 ~]# lctl conf_param ddn-home.sys.timeout=200
>>> error: conf_param: Invalid argument
>>>
>>> Server mds01 is definitely the MGS node. What is wrong here, then? I
>>> can think of only two possible reasons. One is that the file system
>>> name contains a "-" character, although I didn't find anything in the
>>> documentation saying that this character is not allowed. The other is
>>> that the MGS is combined with the MDS.
>>>
>>> syslog contains following messages:
>>>
>>> Nov  7 18:38:35 mds01 kernel: LustreError: 3273:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.  cfg_device from lctl is 'ddn-home'
>>> Nov  7 18:38:35 mds01 kernel: LustreError: 3273:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>> Nov  7 18:39:46 mds01 kernel: LustreError: 3274:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.  cfg_device from lctl is 'ddn-data'
>>> Nov  7 18:39:46 mds01 kernel: LustreError: 3274:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>> Nov  7 18:39:54 mds01 kernel: LustreError: 3275:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.  cfg_device from lctl is 'ddn-data'
>>> Nov  7 18:39:54 mds01 kernel: LustreError: 3275:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>> Nov  7 18:40:01 mds01 kernel: LustreError: 3282:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.  cfg_device from lctl is 'ddn-data'
>>> Nov  7 18:40:01 mds01 kernel: LustreError: 3282:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>> Nov  7 18:41:06 mds01 kernel: LustreError: 3305:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.  cfg_device from lctl is 'ddn-data'
>>> Nov  7 18:41:06 mds01 kernel: LustreError: 3305:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>> Nov  7 18:41:15 mds01 kernel: LustreError: 3306:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.  cfg_device from lctl is 'ddn-home'
>>> Nov  7 18:41:15 mds01 kernel: LustreError: 3306:0:(mgs_handler.c:605:mgs_iocontrol()) setparam err -22
>>>
>>> From the above it looks like only the first part of the file system
>>> name, "ddn", is recognized, and "-home" or "-data" is dropped.
>>>
>>> Please advise.
>>>
>>> Wojciech Turek
>>>
>>
>> You seem to have found a bug.  I just tried this myself and it  
>> doesn't work with a "-" in the name.  Maybe use a '.' instead  
>> until we fix it.
>>
> Argh, sorry, that doesn't work with conf_param either.  But an  
> underscore '_' does.  I'm filing a bug report...
>
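
The mismatch in the kernel messages quoted above can be pulled out
mechanically. What follows is a minimal sketch, assuming the message
layout stays exactly as it appears in this thread; the regular
expression and the function name are illustrative, not part of Lustre.

```python
import re

# Two of the kernel messages quoted above, each on a single line.
LOG = """\
Nov  7 18:38:35 mds01 kernel: LustreError: 3273:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.  cfg_device from lctl is 'ddn-home'
Nov  7 18:39:46 mds01 kernel: LustreError: 3274:0:(mgs_llog.c:1957:mgs_setparam()) No filesystem targets for ddn.  cfg_device from lctl is 'ddn-data'
"""

# "seen" is the name the MGS searched for; "given" is what lctl passed in.
PATTERN = re.compile(
    r"No filesystem targets for (?P<seen>\S+?)\.\s+"
    r"cfg_device from lctl is '(?P<given>[^']+)'"
)


def name_mismatches(log_text):
    """Return (given, seen) pairs where the two names in a message differ."""
    pairs = []
    for match in PATTERN.finditer(log_text):
        given, seen = match.group("given"), match.group("seen")
        if given != seen:
            pairs.append((given, seen))
    return pairs


for given, seen in name_mismatches(LOG):
    print(f"lctl passed {given!r}, MGS searched for {seen!r}")
```

On these two messages it reports that 'ddn-home' and 'ddn-data' were
both looked up as 'ddn', which matches the truncation at the first "-"
described earlier in the thread.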

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517


