[Lustre-discuss] potential issue with data corruption

Thu Jul 14 13:07:24 PDT 2011

On 7/14/11 3:05 PM, Oleg Drokin wrote:
> Hello!
>
> On Jul 14, 2011, at 3:55 PM, Lisa Giacchetti wrote:
>>>> Jul  7 07:10:08 cmsls6 kernel: Lustre: 15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
>>>> Some of these errors seem really bad - like the bulk IO comm error or the eviction due to a locking call back.
>>>> What should I be looking for here?  I have determined that some of the messages saying a client has been evicted because the
>>>> OSS thinks it's dead are not due to the system being down. So what makes the OSS think the client is dead?
>>> Well, the clients become unresponsive for some reason, you really need to look at the client side logs for some clues on that.
>> I have been doing this as I was waiting for a reply and going through the manual and lustre-discuss archives.
>> Here is an example of one of the client's logs during the appropriate time frame:
>> Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while communicating with 131.225.191.164@tcp. The obd_ping operation failed with -107
>> Jul  7 11:55:33 cmswn1526 kernel: Lustre: cmsprod1-OST0033-osc-ffff810617966400: Connection to service cmsprod1-OST0033 via nid 131.225.191.164@tcp was lost; in progress operations using this service will wait for recovery to complete.
> This is way too late in the game, here the server already evicted the client.
> Was there anything before then?
>
No, there is nothing before then.
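
When scanning client logs like the one quoted above, it can help to pull out the failed operation and translate the numeric code. The following is only a minimal sketch: the function name decode_failure and the small errno table are illustrative assumptions, not part of Lustre. The table hard-codes a few Linux errno values, and -107 corresponds to ENOTCONN, which fits Oleg's point that by the time this message appears the connection is already gone.

```python
import re

# A few Linux errno values, hard-coded so the sketch does not depend on the
# platform it runs on. This table is an illustrative assumption, not a
# complete list of codes Lustre can report.
LINUX_ERRNO = {5: "EIO", 107: "ENOTCONN", 108: "ESHUTDOWN", 110: "ETIMEDOUT"}

# Matches the "The <op> operation failed with <code>" part of a LustreError line.
FAILED_RE = re.compile(r"The (\w+) operation failed with (-?\d+)")


def decode_failure(line):
    """Return (operation, errno name) for a LustreError line, or None."""
    match = FAILED_RE.search(line)
    if match is None:
        return None
    operation = match.group(1)
    code = abs(int(match.group(2)))
    return operation, LINUX_ERRNO.get(code, "errno %d" % code)


if __name__ == "__main__":
    sample = ("Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error "
              "occurred while communicating with 131.225.191.164@tcp. "
              "The obd_ping operation failed with -107")
    print(decode_failure(sample))
```

Running the same function over the whole client log and sorting by timestamp would show whether anything was logged before the eviction.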

>>>> Also is there any way to determine what files are involved in these errors?
>>> Well, the lock blocking callbacks message will provide you with ost number and object index that you might be able to backreference to a file.
>> I know there is a way to do this from the /proc file system (at least I think it's /proc) but I can't find any reference to this
>> in the book I got from class on this or in the manual.
>> Can someone refresh my memory?
> Actually I think you can do it with a combination of lfs find and lfs getstripe.
Hmm. OK, let me try that.
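
One way to back-reference an OST number and object index to a file is to list candidate files (for example with lfs find and its --obd option) and then check each file's stripe listing for the object. The following is only a rough sketch, assuming the stripe listing comes from lfs getstripe and has an "obdidx objid objid group" table as in 1.8-era output. The sample text, the path and the object id 1234567 are made up for illustration, and the helper names are hypothetical.

```python
def ost_index(ost_name):
    """Turn a name like 'cmsprod1-OST002d' into its numeric index (hex suffix)."""
    return int(ost_name.rsplit("-OST", 1)[1], 16)


def stripe_objects(getstripe_text):
    """Return [(obdidx, objid), ...] parsed from one file's stripe listing.

    Assumes a header row starting with 'obdidx objid' followed by rows whose
    first two columns are decimal numbers.
    """
    objects = []
    in_table = False
    for line in getstripe_text.splitlines():
        fields = line.split()
        if fields[:2] == ["obdidx", "objid"]:
            in_table = True
            continue
        if in_table:
            if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
                objects.append((int(fields[0]), int(fields[1])))
            else:
                in_table = False  # blank line or next file ends the table
    return objects


def holds_object(getstripe_text, ost_name, object_id):
    """True if the listing shows object_id on the OST named ost_name."""
    return (ost_index(ost_name), object_id) in stripe_objects(getstripe_text)


# Hypothetical listing for illustration only; not taken from a real system.
SAMPLE = (
    "/mnt/lustre/example.dat\n"
    "\tobdidx\t\t objid\t\tobjid\t\t group\n"
    "\t    45\t       1234567\t     0x12d687\t             0\n"
)

if __name__ == "__main__":
    print(holds_object(SAMPLE, "cmsprod1-OST002d", 1234567))
```

In practice the text would come from running the stripe listing on each file that lfs find reports for the OST in question; the exact column layout may differ between Lustre versions, so the parser should be checked against real output first.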

>>> All that said, 1.8.3 is quite old and I think it would be a much better idea to try 1.8.6 and see if it improves things.
>> downtimes are few and far between for us so this may take a while to get scheduled.
>> If there is anything that can be done in the meantime I'd like to try it.
> I suspect there might have been several bugs since 1.8.3 that might have manifested in slowness to reply to lock callback requests
> and you'll end up having downtime to upgrade the clients one way or the other.
>
> Bye,
>      Oleg
> --
> Oleg Drokin
> Senior Software Engineer
> Whamcloud, Inc.
>
