[lustre-discuss] Imbalanced incoming and outgoing network load

Anna Fuchs anna.fuchs at uni-hamburg.de
Fri Jul 7 04:48:34 PDT 2023


Dear all,

I have some questions regarding the following scenario:
- A large HPC system.
- Let's assume that Job X is running on 1 compute node and is reading a 
very large file with a large stripe count (>>1, up to -1, i.e. striped 
across all OSTs). Alternatively, tons of files are read at once with 
smaller striping each, but distributed across all OSS/OSTs.
- The compute node is connected, for example, with a 100Gb/s link, and 
there are 50 servers, each with a 200Gb/s link. This generates a network 
load of up to 50x200Gb/s (10Tb/s in aggregate), which the compute node 
can only absorb at 100Gb/s.
- Job Y, which requires the same network and potentially doesn't even 
perform I/O, suffers a lot as a result.
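
To make the numbers concrete, here is a minimal sketch of the arithmetic 
above. The function name is only illustrative and not part of Lustre, 
and it assumes the worst case where every server sends at full line 
rate at the same time:

```python
def oversubscription(num_servers, server_link_gbps, client_link_gbps):
    """Return (peak offered load in Gb/s, ratio of that load to the client link).

    Assumes the worst case: every server sends at full line rate at once.
    """
    offered_gbps = num_servers * server_link_gbps
    return offered_gbps, offered_gbps / client_link_gbps


# Values from the scenario: 50 servers at 200Gb/s, one client at 100Gb/s.
offered, factor = oversubscription(50, 200, 100)
print(f"peak offered load: {offered} Gb/s, {factor:.0f}x the client link")
```

With these values the servers can offer up to 10000Gb/s towards a 
single 100Gb/s link, i.e. a factor of 100.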

Does this scenario sound familiar to you?
Is the sequence of events correct?
What could be done in this situation?

One could try to avoid:
a) having such single/few-node jobs
b) striping large files across up to all OSTs (-1)
c) reading millions of files at once
However, I have concerns that users will persist in doing so, either 
intentionally or accidentally, and that it would only shift the problem 
rather than solve it.
One could tweak the network design, reconfigure it, or separate I/O from 
communication, but that would hardly optimize all use cases. Virtual 
lanes could potentially be a solution as well, although that might not 
help if Job Y also performs some I/O.

Wouldn't it be better if Lustre somehow recognized this imbalance 
between incoming and outgoing network traffic and loaded the 
file(s)/data gradually rather than all at once, saturating or only 
slightly overloading the consumer's 100Gb/s connection rather than 
oversubscribing it by a factor of 100? Does this sound reasonable, and 
is there already a solution for it?
I would appreciate any opinions.
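
To illustrate what I mean by "gradually", here is a rough sketch of the 
pacing I have in mind. It is purely hypothetical: the function and the 
`overload` parameter are my own illustration, not an existing Lustre 
mechanism or tunable, and it assumes the load is spread evenly across 
all servers:

```python
def per_server_rate_gbps(client_link_gbps, num_servers, overload=1.0):
    """Rate each server could send at so that the aggregate equals
    overload x the client link (hypothetical, evenly shared pacing)."""
    if num_servers <= 0:
        raise ValueError("num_servers must be positive")
    return client_link_gbps * overload / num_servers


# Scenario values: a 100Gb/s client, 50 servers, 10% deliberate overload.
rate = per_server_rate_gbps(100, 50, overload=1.1)
print(f"each server would send at about {rate:.1f} Gb/s")
```

In this example each server would be paced to roughly 2.2Gb/s instead 
of its full 200Gb/s, which keeps the client link busy without flooding 
the shared network.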

Best regards
Anna

--
Anna Fuchs
Universität Hamburg
https://wr.informatik.uni-hamburg.de/people/anna_fuchs

