[lustre-discuss] Imbalanced incoming and outgoing network load

Kulyavtsev, Alex Ivanovich alexku at anl.gov
Fri Jul 7 09:18:55 PDT 2023


There is QoS in Lustre: the feature is called NRS (Network Request Scheduler).
It lets you set different request scheduling policies on the servers.
Would it address the issue?

The manual has an entry on NRS, and there were a few presentations about it at LUG/LAD.
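
For what it's worth, the basic idea behind a rate-limiting policy such as TBF (Token Bucket Filter, one of the NRS policies) can be sketched in a few lines of Python. This is only a toy model of a token bucket, not Lustre code: the class and function names are made up for illustration, and the numbers are arbitrary.

```python
class TokenBucket:
    """Toy token bucket: about `rate` requests per second, bursts up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # bucket capacity
        self.tokens = float(burst)   # the bucket starts full
        self.last = 0.0              # time of the last refill, in seconds

    def allow(self, now):
        """Return True if a request arriving at time `now` may be served."""
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to the elapsed time, capped at the capacity.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def served_in_one_second(offered_per_sec, rate, burst):
    """Count how many of `offered_per_sec` evenly spaced requests are admitted."""
    bucket = TokenBucket(rate, burst)
    return sum(bucket.allow(i / offered_per_sec) for i in range(offered_per_sec))


if __name__ == "__main__":
    # Offered load well above the limit: only about rate + burst requests pass.
    print(served_in_one_second(1000, rate=100, burst=10))
    # Offered load below the limit: every request passes.
    print(served_in_one_second(50, rate=100, burst=10))
```

As far as I understand the manual, the TBF rate is expressed in RPCs per second rather than in bytes per second, so whether this maps well onto a bandwidth mismatch is something that would need testing.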

I have not used NRS myself, but I would like to learn.
Alex.

> On Jul 7, 2023, at 06:48, Anna Fuchs via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
> 
> Dear all,
> 
> I have some questions regarding the following scenario:
> - A large HPC system.
> - Let's assume that Job X is running on 1 compute node and is reading a very large file with a large stripe count (anywhere from >>1 up to -1, i.e. striped over all OSTs). Alternatively, tons of files are read at once, each with smaller striping, but distributed across all OSSs/OSTs.
> - The compute node is connected with, for example, a 100 Gb/s link, and there are 50 servers, each with a 200 Gb/s link. The servers can then send up to 50 x 200 Gb/s = 10 Tb/s towards a client that can only receive 100 Gb/s.
> - Job Y, which uses the same network and potentially doesn't even perform I/O, suffers a lot as a result.
> 
> Does this scenario sound familiar to you?
> Is the sequence of events correct?
> What could be done in this situation?
> 
> One could try to avoid:
> a) having such single/few-node jobs
> b) striping large files with a stripe count of up to -1
> c) reading millions of files at once
> However, I am concerned that users will keep doing this, either intentionally or accidentally, and that it would only shift the problem rather than solve it.
> One could tweak the network design, reconfigure it, or separate I/O from communication traffic, but that would hardly optimize all use cases. Virtual lanes could potentially be a solution as well, though they might not help if Job Y also involves some I/O.
> 
> Wouldn't it be better if Lustre somehow recognized this imbalance between incoming and outgoing network traffic and loaded the file(s)/data gradually rather than all at once, saturating or slightly overloading the consumer 100Gb/s connection rather than by a factor of 100? Does this sound reasonable, and is there already a solution for it?
> I would appreciate any opinions.
> 
> Best regards
> Anna
> 
> --
> Anna Fuchs
> Universität Hamburg
> https://wr.informatik.uni-hamburg.de/people/anna_fuchs
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


