[lustre-discuss] Suspended jobs and rebooting lustre servers

Raj Ayyampalayam ansraj at gmail.com
Thu Feb 21 10:25:31 PST 2019


What can I expect to happen to the jobs that are suspended during the
filesystem restart?
Will processes holding open file handles die when I unsuspend them after
the filesystem restart?

Thanks!
-Raj


On Thu, Feb 21, 2019 at 12:52 PM Colin Faber <cfaber at gmail.com> wrote:

> Ah yes,
>
> If you're adding OSTs to an existing OSS, then you will need to reconfigure
> the file system, which requires a writeconf event.
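>
> A minimal sketch of the usual writeconf procedure, assuming a fully stopped
> filesystem and hypothetical device paths (the MGT is regenerated first,
> then the MDTs, then the OSTs, and targets are remounted in the same order):
>
>     # run with all Lustre targets unmounted on every server
>     tunefs.lustre --writeconf /dev/mgt_dev   # on the MGS
>     tunefs.lustre --writeconf /dev/mdt_dev   # on each MDS
>     tunefs.lustre --writeconf /dev/ost_dev   # on each OSS
>     # then remount: MGT first, MDT(s) next, OSTs last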
>
> On Thu, Feb 21, 2019 at 10:00 AM Raj Ayyampalayam <ansraj at gmail.com>
> wrote:
>
>> The new OSTs will be added to the existing file system (the OSS nodes
>> are already part of the filesystem), so I will have to re-configure the
>> current HA resource configuration to tell it about the 4 new OSTs.
>> Our ExaScaler HA monitors each individual OST, and I need to re-configure
>> the HA on the existing filesystem.
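>>
>> For context, the per-OST resources the HA stack currently manages can be
>> listed with standard pacemaker tooling (ExaScaler wraps these, so this is
>> only a generic sketch):
>>
>>     pcs status resources
>>     crm_mon -1    # one-shot cluster status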
>>
>> Our vendor support has confirmed that we would have to restart the
>> filesystem if we want to regenerate the HA configs to include the new OSTs.
>>
>> Thanks,
>> -Raj
>>
>>
>> On Thu, Feb 21, 2019 at 11:23 AM Colin Faber <cfaber at gmail.com> wrote:
>>
>>> It seems to me that steps may still be missing?
>>>
>>> You're going to rack/stack and provision the OSS nodes with new OSTs.
>>>
>>> Then you're going to introduce failover options somewhere? On the new
>>> OSTs? On the existing system?
>>>
>>> If you're introducing failover with the new OSTs and leaving the
>>> existing system in place, you should be able to accomplish this without
>>> bringing the system offline.
>>>
>>> If you're going to be introducing failover to your existing system, then
>>> you will need to reconfigure the file system to accommodate the new
>>> failover settings (failover nodes, etc.)
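>>>
>>> As a hedged sketch (the NIDs and device path below are hypothetical),
>>> failover partners are recorded on each target with tunefs.lustre, and
>>> changing them on an existing target is what forces the writeconf:
>>>
>>>     tunefs.lustre --writeconf \
>>>         --servicenode=10.0.0.1@tcp \
>>>         --servicenode=10.0.0.2@tcp /dev/ost_dev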
>>>
>>> -cf
>>>
>>>
>>> On Thu, Feb 21, 2019 at 9:13 AM Raj Ayyampalayam <ansraj at gmail.com>
>>> wrote:
>>>
>>>> Our upgrade strategy is as follows:
>>>>
>>>> 1) Load all disks into the storage array.
>>>> 2) Create RAID pools and virtual disks.
>>>> 3) Create the lustre file system using the mkfs.lustre command. (I still
>>>> have to figure out all the parameters used on the existing OSTs; see the
>>>> sketch after this list.)
>>>> 4) Create mount points on all OSSs.
>>>> 5) Mount the lustre OSTs.
>>>> 6) Maybe rebalance the filesystem.
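>>>>
>>>> A hedged sketch of steps 3-5 (the device paths, OST index, and MGS NID
>>>> below are hypothetical; step 3's existing parameters can be read back
>>>> with tunefs.lustre --dryrun):
>>>>
>>>>     # inspect an existing OST's format parameters
>>>>     tunefs.lustre --dryrun /dev/existing_ost
>>>>
>>>>     # format a new OST against the same MGS
>>>>     mkfs.lustre --ost --fsname=lustre --index=42 \
>>>>         --mgsnode=10.0.0.10@tcp /dev/new_ost
>>>>
>>>>     # create the mount point and mount it on the OSS
>>>>     mkdir -p /mnt/lustre/ost42
>>>>     mount -t lustre /dev/new_ost /mnt/lustre/ost42
>>>>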
>>>> My understanding is that the above can be done without bringing the
>>>> filesystem down. I want to create the HA configuration (corosync and
>>>> pacemaker) for the new OSTs. This step requires the filesystem to be down.
>>>> I want to know what would happen to the suspended processes across the
>>>> cluster when I bring the filesystem down to re-generate the HA configs.
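>>>>
>>>> For the HA piece, a minimal pacemaker sketch (the resource, group, and
>>>> device names are hypothetical; our ExaScaler tooling generates its own
>>>> equivalents):
>>>>
>>>>     pcs resource create ost42 ocf:heartbeat:Filesystem \
>>>>         device=/dev/new_ost directory=/mnt/lustre/ost42 \
>>>>         fstype=lustre --group oss1-group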
>>>>
>>>> Thanks,
>>>> -Raj
>>>>
>>>> On Thu, Feb 21, 2019 at 12:59 AM Colin Faber <cfaber at gmail.com> wrote:
>>>>
>>>>> Can you provide more details on your upgrade strategy? In some cases
>>>>> expanding your storage shouldn't impact client / job activity at all.
>>>>>
>>>>> On Wed, Feb 20, 2019, 11:09 AM Raj Ayyampalayam <ansraj at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We are planning on expanding our storage by adding more OSTs to our
>>>>>> lustre file system. It looks like it would be easier to expand if we bring
>>>>>> the filesystem down and perform the necessary operations. We are planning
>>>>>> to suspend all the jobs running on the cluster. We originally planned to
>>>>>> add new OSTs to the live filesystem.
>>>>>>
>>>>>> We are trying to determine the potential impact to the suspended jobs
>>>>>> if we bring down the filesystem for the upgrade.
>>>>>> One of the questions we have is: what would happen to suspended
>>>>>> processes that hold open file handles in the lustre file system when the
>>>>>> filesystem is brought down for the upgrade?
>>>>>> Will they recover from the client eviction?
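>>>>>>
>>>>>> (As a hedged aside: while the servers are down, a process blocking on
>>>>>> Lustre I/O normally just hangs until recovery completes; whether a
>>>>>> client was actually evicted can be checked afterwards from the client,
>>>>>> e.g.:
>>>>>>
>>>>>>     lctl get_param osc.*.import | grep state
>>>>>>     dmesg | grep -i evict
>>>>>>
>>>>>> These are standard commands, but treat this as a sketch rather than
>>>>>> vendor procedure.)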
>>>>>>
>>>>>> We do have vendor support and have engaged them. I wanted to ask the
>>>>>> community and get some feedback.
>>>>>>
>>>>>> Thanks,
>>>>>> -Raj
>>>>>>