[Lustre-discuss] Client hangs when reading from Lustre ...

Mon Feb 4 17:37:35 PST 2008

On 2/4/08 5:07 PM, "Andreas Dilger" <adilger at sun.com> did etch on stone
tablets:

> On Feb 04, 2008  15:47 -0800, Klaus Steden wrote:
>> Thanks Andreas ... That would make sense, although the only error message
>> (or, message vaguely resembling an error message) that I could find was this
>> one:
>> 
>> -- cut --
>> /var/log/messages.1:Feb  1 09:28:09 tiger-oss-0-0 kernel: LDISKFS-fs error
>> (device sdb): ldiskfs_journal_start_sb: Detected aborted journal
>> -- cut --
>> 
>> I'm assuming that's causing the problem -- but what's the next step? Punt
>> all the clients, stop Lustre, and run e2fsck on the affected device?
> 
> Yes.  An aborted journal means an error at the journal layer...  Maybe with
> a "JBD" error message?
> 
I didn't see anything like that, but I did see a bundle of journal commit
errors, a number of errors from the SCSI layer, and a message about the LUN
being remounted read-only.

Two questions ... 

1. Assuming all the bad blocks can be re-mapped at the device layer, what is
the potential for data loss from running e2fsck?

2. Is it possible to get notification from a cluster component when
something like this happens, via SNMP, Ganglia, or some other monitoring
system?
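
To make question 2 concrete, here is a minimal sketch of the kind of check I
have in mind: something a cron job or monitoring agent could run over syslog
on each OSS. This is an illustration only, not an existing Lustre facility.
The function name and pattern names are hypothetical, and the read-only
remount pattern is a guess at the kernel's wording, since I have not quoted
that message exactly.

```python
import re

# Assumption: these patterns are based on the messages mentioned in this
# thread; adjust them to the exact text your kernel actually logs.
ALERT_PATTERNS = {
    "ldiskfs_error": re.compile(r"LDISKFS-fs error"),
    "aborted_journal": re.compile(r"Detected aborted journal"),
    # Guessed wording for the read-only remount message.
    "readonly_remount": re.compile(r"remount\w*\b.*read-only", re.IGNORECASE),
}


def find_storage_alerts(lines):
    """Return (alert_name, line) pairs for every log line matching a pattern."""
    alerts = []
    for line in lines:
        for name, pattern in ALERT_PATTERNS.items():
            if pattern.search(line):
                alerts.append((name, line.rstrip("\n")))
    return alerts
```

Fed the lines of /var/log/messages, this returns one entry per matching
pattern per line. A wrapper could then raise an SNMP trap or publish a
Ganglia metric whenever the result is non-empty.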

cheers,
Klaus