<div dir="ltr"><div class="gmail_default" style="font-size:small">We recently updated to Lustre 2.8 on our cluster, and have started seeing some unusual load issues.<br></div><div class="gmail_default" style="font-size:small">Last night our MDS load climbed to well over 100, and client performance dropped to almost zero.<br></div><div class="gmail_default" style="font-size:small">Initially this appeared to be related to a number of jobs that were doing large numbers of opens/closes, but even after killing those jobs, the MDS load did not recover.<br><br></div><div class="gmail_default" style="font-size:small">Looking at stats in /proc/fs/lustre/mdt/scratch-MDT0000/exports showed little to no activity on the MDS.  Looking at iostat showed almost no disk activity to the MDT (or to any device, for that matter), and minimal IO wait.<br></div><div class="gmail_default" style="font-size:small">Memory usage (the machine has 128GB) showed over half of that memory free.<br><br></div><div class="gmail_default" style="font-size:small">I eventually ended up unmounting the MDT and failing it over to a backup MDS, which promptly recovered and now has a load of near zero.<br><br></div><div class="gmail_default" style="font-size:small">Has anyone seen this before?  Any suggestions for what I should look at if this happens again?<br><br></div><div class="gmail_default" style="font-size:small">Thanks!<br></div><div class="gmail_default" style="font-size:small">Kevin<br><br>--<br></div><div class="gmail_default" style="font-size:small">Kevin Hildebrand<br></div><div class="gmail_default" style="font-size:small">University of Maryland, College Park<br></div><div class="gmail_default" style="font-size:small">Division of IT<br></div><div class="gmail_default" style="font-size:small"><br></div></div>