[lustre-discuss] Suspended jobs and rebooting lustre servers

Patrick Farrell pfarrell at whamcloud.com
Thu Feb 28 05:12:46 PST 2019


This is very good advice, and you can also vary it to aid in removing old OSTs (thinking of the previous message): simply take the old OSTs you wish to remove out of the pool, and new files will no longer be created on them.  That makes migration easier.
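For example, a rough sketch of draining an old OST this way, using placeholder filesystem, pool, and mount point names (the lfs find | lfs_migrate pipeline follows the pattern in the Lustre manual):

# lctl pool_remove <fsname>.<pool> <fsname>-OST0004
# lfs find <mntpoint> --ost <fsname>-OST0004 -type f | lfs_migrate -y

With the OST out of the pool, no new files land on it, and the existing files can be migrated off before the OST is deactivated and removed.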

One thing though:
Setting a default layout everywhere may be prohibitively slow for a large fs.

If you set a default layout on the root of the file system, it is automatically used as the default for all directories that do not have another default set.

So if you have not previously set a default layout on any directories, there is no need to go through the fs changing them like this.  (And perhaps if you have, you can find those directories and handle them manually, rather than setting a pool on every directory.)
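For example (a minimal sketch with placeholder pool and mount point names): a read-only scan can check for directories that already carry their own default, and setting the default on the root is then a single command.

# lfs find <mntpoint> -type d | while read -r DIR ; do lfs getstripe -d "$DIR" | grep -q pool && echo "$DIR" ; done
# lfs setstripe -p <pool> <mntpoint>
# lfs getstripe -d <mntpoint>

The grep is only an approximate check; any directory it prints names a pool in its own default layout and can be handled manually. The last command confirms the new filesystem-wide default on the root.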

- Patrick
________________________________
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Jongwoo Han <jongwoohan at gmail.com>
Sent: Thursday, February 28, 2019 6:09:18 AM
To: Stephane Thiell
Cc: lustre-discuss
Subject: Re: [lustre-discuss] Suspended jobs and rebooting lustre servers


My strategy for adding new OSTs to a live filesystem is to define a pool containing the currently running OSTs and apply a pool stripe (lfs setstripe -p [live-ost-pool]) to all existing directories. This is easier when it is done at initial filesystem creation.
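As a concrete sketch, with placeholder filesystem, pool, and OST names (the lctl pool commands are run on the MGS):

# lctl pool_new <fsname>.<pool>
# lctl pool_add <fsname>.<pool> <fsname>-OST[0000-000f]
# lctl pool_list <fsname>.<pool>
# lfs setstripe -p <pool> <mntpoint>

The OST index range here is only an example; it should cover all of the currently running OSTs.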

After that, you can safely add new OSTs without newly created files flooding in - the newly added OSTs will remain idle until you add them to the pool.

Try failover tests with the new OSTs and OSSes while they do not yet store any files. After the failover/restart tests are done on the new OSSes and OSTs, you can add the new OSTs to the pool, and they will start to store files shortly after.

If you did not create a pool, create one with the old OSTs, and then

# lfs find <mntpoint> -type d | while read -r DIR ; do echo "processing: $DIR" ; lfs setstripe -p <pool> "$DIR" ; done

will set the pool on all directories, so the newly added OSTs are safe from incoming files until they are added to the pool.
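Once the new OSTs have passed testing, adding them to the pool is the only step needed to bring them into service (again with placeholder names and an example index range):

# lctl pool_add <fsname>.<pool> <fsname>-OST[0010-0013]
# lctl pool_list <fsname>.<pool>

New files created under directories that use the pool layout can then be placed on the new OSTs.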

I always expand a live filesystem in this manner, so I do not have to worry about heavy-load situations.

On Thu, Feb 28, 2019 at 1:02 AM Stephane Thiell <sthiell at stanford.edu> wrote:
On one of our filesystems, we add a few new OSTs almost every month with no downtime, which is very convenient. The only thing that I would recommend is to avoid doing that during a peak of I/O on your filesystem (we usually do it as early as possible in the morning), as the added OSTs will immediately receive a heavy I/O load, likely because they are empty.

Best,

Stephane


> On Feb 22, 2019, at 2:03 PM, Andreas Dilger <adilger at whamcloud.com> wrote:
>
> This is not really correct.
>
> Lustre clients can handle the addition of OSTs to a running filesystem. The MGS will register the new OSTs, and the clients will be notified by the MGS that the OSTs have been added, so no need to unmount the clients during this process.
>
>
> Cheers, Andreas
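As an illustration of that behaviour, once a new OST has been formatted and mounted on its OSS, a client should see it appear without being remounted; for example (placeholder mount point, standard lfs/lctl commands run on a client):

# lfs df -h <mntpoint>
# lctl dl | grep osc

The new OST shows up as an additional line in the lfs df output and as a new osc device on the client.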
>
> On Feb 21, 2019, at 19:23, Raj <rajgautam at gmail.com> wrote:
>
>> Hello Raj,
>> It’s best and safest to unmount all the clients and then do the upgrade. Your FS is getting more OSTs and the configuration of the existing ones is changing, so your clients need to pick up the new layout by remounting.
>> Also, you mentioned client eviction: during eviction the client has to drop its dirty pages, and all the open file descriptors in the FS will be gone.
>>
>> On Thu, Feb 21, 2019 at 12:25 PM Raj Ayyampalayam <ansraj at gmail.com> wrote:
>> What can I expect to happen to the jobs that are suspended during the file system restart?
>> Will the processes holding an open file handle die when I unsuspend them after the filesystem restart?
>>
>> Thanks!
>> -Raj
>>
>>
>> On Thu, Feb 21, 2019 at 12:52 PM Colin Faber <cfaber at gmail.com> wrote:
>> Ah yes,
>>
>> If you're adding to an existing OSS, then you will need to reconfigure the file system, which requires a writeconf event.
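For reference, a writeconf regeneration roughly follows the pattern below; this is only a sketch with placeholder device paths, and on an ExaScaler system the exact procedure should come from the vendor tooling. With all clients unmounted and all targets stopped:

On the MDS:
# tunefs.lustre --writeconf /dev/<mdt_device>
On each OSS, for every OST:
# tunefs.lustre --writeconf /dev/<ost_device>

Then remount in order: MGS/MDT first, followed by the OSTs and finally the clients.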
>>
>> On Thu, Feb 21, 2019 at 10:00 AM Raj Ayyampalayam <ansraj at gmail.com> wrote:
>> The new OSTs will be added to the existing file system (the OSS nodes are already part of the filesystem), so I will have to re-configure the current HA resource configuration to tell it about the 4 new OSTs.
>> Our ExaScaler's HA monitors the individual OSTs, and I need to re-configure the HA on the existing filesystem.
>>
>> Our vendor support has confirmed that we would have to restart the filesystem if we want to regenerate the HA configs to include the new OSTs.
>>
>> Thanks,
>> -Raj
>>
>>
>> On Thu, Feb 21, 2019 at 11:23 AM Colin Faber <cfaber at gmail.com> wrote:
>> It seems to me that steps may still be missing?
>>
>> You're going to rack/stack and provision the OSS nodes with new OSTs.
>>
>> Then you're going to introduce failover options somewhere? On the new OSTs? The existing system? Etc.?
>>
>> If you're introducing failover with the new OSTs and leaving the existing system in place, you should be able to accomplish this without bringing the system offline.
>>
>> If you're going to be introducing failover to your existing system, then you will need to reconfigure the file system to accommodate the new failover settings (failover nodes, etc.).
>>
>> -cf
>>
>>
>> On Thu, Feb 21, 2019 at 9:13 AM Raj Ayyampalayam <ansraj at gmail.com> wrote:
>> Our upgrade strategy is as follows:
>>
>> 1) Load all disks into the storage array.
>> 2) Create RAID pools and virtual disks.
>> 3) Create the new Lustre OSTs using the mkfs.lustre command. (I still have to figure out all the parameters used on the existing OSTs; see the sketch below.)
>> 4) Create mount points on all OSSs.
>> 5) Mount the lustre OSTs.
>> 6) Maybe rebalance the filesystem.
>> My understanding is that the above can be done without bringing the filesystem down. I want to create the HA configuration (corosync and pacemaker) for the new OSTs. This step requires the filesystem to be down. I want to know what would happen to the suspended processes across the cluster when I bring the filesystem down to re-generate the HA configs.
>>
>> Thanks,
>> -Raj
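Regarding step 3 in the list above: the parameters used when the existing OSTs were formatted can usually be read back with tunefs.lustre, and the new OSTs formatted to match. A rough sketch, with placeholder devices, NIDs, and indices rather than the actual ExaScaler values:

# tunefs.lustre --dryrun /dev/<existing_ost_device>
# mkfs.lustre --ost --fsname=<fsname> --index=<next_free_index> \
    --mgsnode=<mgs_nid> --servicenode=<oss1_nid> --servicenode=<oss2_nid> \
    /dev/<new_ost_device>
# mount -t lustre /dev/<new_ost_device> /mnt/<fsname>-OST<next_free_index>

The --dryrun output shows the fsname, index, mgsnode, failover and mount options without changing anything; the --servicenode options are only needed if the new OSTs are meant to fail over between a pair of OSSes.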
>>
>> On Thu, Feb 21, 2019 at 12:59 AM Colin Faber <cfaber at gmail.com> wrote:
>> Can you provide more details on your upgrade strategy? In some cases expanding your storage shouldn't impact client / job activity at all.
>>
>> On Wed, Feb 20, 2019, 11:09 AM Raj Ayyampalayam <ansraj at gmail.com> wrote:
>> Hello,
>>
>> We are planning on expanding our storage by adding more OSTs to our lustre file system. It looks like it would be easier to expand if we bring the filesystem down and perform the necessary operations. We are planning to suspend all the jobs running on the cluster. We originally planned to add new OSTs to the live filesystem.
>>
>> We are trying to determine the potential impact to the suspended jobs if we bring down the filesystem for the upgrade.
>> One of the questions we have is what would happen to the suspended processes that hold an open file handle in the lustre file system when the filesystem is brought down for the upgrade?
>> Will they recover from the client eviction?
>>
>> We do have vendor support and have engaged them. I wanted to ask the community and get some feedback.
>>
>> Thanks,
>> -Raj

_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
Jongwoo Han
+82-505-227-6108

