[Lustre-discuss] potential issue with data corruption

Thu Jul 14 13:05:32 PDT 2011

Hello!

On Jul 14, 2011, at 3:55 PM, Lisa Giacchetti wrote:
>>> Jul  7 07:10:08 cmsls6 kernel: Lustre: 15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
>>> Some of these errors seem really bad - like the bulk IO comm error or the eviction due to a locking call back.
>>> What should I be looking for here?  I have determined that some of the messages saying a client has been evicted because the
>>> OSS thinks it's dead are not due to the system being down. So what makes the OSS think the client is dead?
>> Well, the clients become unresponsive for some reason; you really need to look at the client-side logs for some clues on that.
> I have been doing this as I was waiting for a reply and going through the manual and lustre-discuss archives.
> Here is an example of one of the client's logs during the appropriate time frame:
> Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while communicating with 131.225.191.164@tcp. The obd_ping operation failed with -107
> Jul  7 11:55:33 cmswn1526 kernel: Lustre: cmsprod1-OST0033-osc-ffff810617966400: Connection to service cmsprod1-OST0033 via nid 131.225.191.164@tcp was lost; in progress operations using this service will wait for recovery to complete.

This is way too late in the game; by this point the server has already evicted the client.
Was there anything before then?
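
One way to check is to pull every Lustre kernel message the client logged in the minutes before the first failed obd_ping. Here is a minimal Python sketch, assuming standard syslog lines like the ones quoted above. The function name, the 30-minute window and the year argument are illustrative choices for this example, not anything provided by Lustre.

```python
from datetime import datetime, timedelta


def lustre_lines_before(lines, cutoff, minutes=30, year=2011):
    """Return Lustre kernel messages logged in the `minutes` before `cutoff`.

    `cutoff` is a syslog-style timestamp such as "Jul  7 11:55:33".
    Syslog lines carry no year, so `year` is an assumption.
    """
    def stamp(text):
        # The first three whitespace-separated tokens are month, day, time.
        mon, day, clock = text.split()[:3]
        return datetime.strptime(f"{year} {mon} {day} {clock}",
                                 "%Y %b %d %H:%M:%S")

    end = stamp(cutoff)
    start = end - timedelta(minutes=minutes)
    hits = []
    for line in lines:
        if "Lustre" not in line:
            continue
        try:
            when = stamp(line)
        except ValueError:
            continue  # not a syslog-formatted line
        if start <= when < end:
            hits.append(line.rstrip("\n"))
    return hits


# Example use (the log path is an assumption; adjust for the client):
# with open("/var/log/messages") as log:
#     for entry in lustre_lines_before(log, "Jul  7 11:55:33"):
#         print(entry)
```

Anything this turns up ahead of the obd_ping failure (timeouts, reconnects, lock callback complaints) is the part worth looking at.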

>>> Also is there any way to determine what files are involved in these errors?
>> Well, the lock blocking callbacks message will provide you with ost number and object index that you might be able to backreference to a file.
> I know there is a way to do this from the /proc file system (at least I think it's /proc) but I can't find any reference to this
> in the book I got from class on this or in the manual.
> Can someone refresh my memory?

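Actually, I think you can do it with a combination of lfs find and lfs getstripe: lfs find to list the files that have objects on the OST in question, and lfs getstripe to show the object ids for each of those files.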
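
As a rough illustration of that back-reference: lfs getstripe prints, for each file, the OST index (obdidx) and object id (objid) of every stripe, so its output can be searched for the OST/object pair named in the server's lock callback message. The Python sketch below parses saved lfs getstripe output. The column layout it expects is an assumption based on 1.8-era output and may differ between versions, and the function name is made up for this example.

```python
def files_with_object(getstripe_text, ost_index, object_id):
    """Return paths whose stripe list contains (ost_index, object_id).

    Assumed layout: a line with the file path, a header line starting
    with 'obdidx', then one row per stripe holding
    obdidx, objid (decimal), objid (hex), group.
    """
    matches = []
    current_path = None
    in_table = False
    for raw in getstripe_text.splitlines():
        line = raw.strip()
        if not line:
            in_table = False
            continue
        fields = line.split()
        if fields[0] == "obdidx":
            in_table = True
            continue
        is_row = (in_table and len(fields) >= 2
                  and fields[0].isdigit() and fields[1].isdigit())
        if is_row:
            if (current_path is not None
                    and int(fields[0]) == ost_index
                    and int(fields[1]) == object_id
                    and current_path not in matches):
                matches.append(current_path)
            continue
        # Anything else ends the stripe table; absolute paths start a new entry.
        in_table = False
        if line.startswith("/"):
            current_path = line
    return matches
```

Note that the index in an OST name such as cmsprod1-OST002d is hexadecimal, while obdidx is printed in decimal, so the name would be converted with int("002d", 16) before calling the function.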

>> All that said, 1.8.3 is quite old and I think it would be a much better idea to try 1.8.6 and see if it improves things.
> downtimes are few and far between for us so this may take a while to get scheduled.
> If there is anything that can be done in the meantime I'd like to try it.

I suspect there have been several bugs fixed since 1.8.3 that might have manifested as slowness in replying to lock callback requests,
and you'll end up needing downtime to upgrade the clients one way or the other.

Bye,
    Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.