[Lustre-discuss] quotacheck blows up MDT

Thomas Roth t.roth at gsi.de
Fri Apr 24 06:28:39 PDT 2009


Hi all,

in a recent shutdown of our Lustre cluster (network reconfig, version
upgrade to 1.6.7_patched), I decided to try to switch on quotas - this
had already failed when the cluster went operational last year.
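
(For reference: as far as I understand it, quotas on 1.6 first have to
be switched on on the targets before quotacheck can do anything. The
commands below are my reconstruction from the manual, so treat the
exact parameter names as assumptions, not as a verified recipe:

  # on the MGS: permanently enable user/group quotas for a
  # filesystem named "lustre"
  lctl conf_param lustre.mdt.quota_type=ug
  lctl conf_param lustre.ost.quota_type=ug
)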

Again, I ran into the same error as last year - failure, and
"device/resource busy". This time, I was sure there was no activity at
all on the system. But on the MDS, I observed a steep increase in
machine load, up to values of 70, and the machine reacted very slowly.
It is, however, an 8-core Xeon server with 32 GB RAM and Raptor disks,
and in normal operation this machine never showed any sign of
overloading, no matter what our users did.
Nevertheless, the Lustre log complained about connection losses to
some OSTs (at least one was set inactive); Heartbeat, which controls
the IP of the MGS, complained about timeouts, and so did DRBD, which
mirrors the MGS and MDT disks to a slave server. Probably the machine
simply lost contact on its eth0/1/2/3/4 network interfaces, which are
used by these services.

After 30 minutes, the "lfs quotacheck -ug /lustre" command aborted
with the errors mentioned above. The same thing happened today, when
we gave it another try. This time we unmounted Lustre, removed all
Lustre modules, mounted everything again and repeated the quotacheck.
Similar behavior on the MDS, but this time the command ran through,
the services recovered, Lustre survived and remained mountable, and -
the quotas seem to work.
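
(In case it helps anyone else, the second, successful attempt was
roughly the following sequence - mount points are ours, and I am
reproducing the commands from memory, so take this as a sketch:

  # on the MDS and all OSSes: stop Lustre and unload all modules
  umount /mnt/lustre/mdt        # resp. the OST mounts on the OSSes
  lustre_rmmod                  # removes all Lustre/LNET modules

  # remount the targets as usual, then, from a client:
  lfs quotacheck -ug /lustre    # build the quota files (30+ min here)
  lfs quotaon -ug /lustre       # if not already enabled by quotacheck
  lfs quota -u someuser /lustre # check that usage is now reported
)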

So, after this lengthy intro, my question: Is this extreme loading or
overloading of the MDS during quotacheck a "normal" feature?

Is there a connection to the fact that the filesystem, 128 TB in
total, is already 75% full?

We have 68 OSTs, half of them 2.3 TB, half of them 2.7 TB.
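
(The fill level of the individual targets can be checked from any
client with e.g.

  lfs df -h /lustre

which lists the usage of the MDT and of each of the 68 OSTs.)
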
All servers run 64-bit Debian Etch with kernel 2.6.22.

Regards,
Thomas


