[Lustre-discuss] LustreError: 3193:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable error -5

Parinay Kondekar parinay_kondekar at xyratex.com
Fri Apr 11 05:19:46 PDT 2014


I would go and see what's in the manual in this case.

http://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#idp554480
e.g.
If the reported error is anything else (such as -5, "I/O error"), it likely
indicates a storage failure. The low-level file system returns this error
if it is unable to read from the storage device.
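As a quick cross-check of that mapping (a sketch using Python's errno module rather than the kernel headers), the positive counterpart of -5 resolves to EIO:

```python
import errno
import os

# Lustre reports the negated errno, so "-5" in the logs is errno 5.
code = 5
print(errno.errorcode[code], "->", os.strerror(code))  # EIO -> Input/output error
```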

lctl ping <NID> would help us understand whether there is anything wrong with
the network communication.

As I said, look into the server logs; that would help.
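To quantify which OSS NIDs and operations are failing, the client syslog can be summarized with a small awk sketch (the function name is mine; field matching assumes the one-line LustreError format shown in the logs below — note the archive renders "@" as " at ", while real syslog lines contain "@o2ib"):

```shell
# Count (NID, operation) pairs for "operation failed with -5" lines.
# On a client:  summarize_failures < /var/log/messages
summarize_failures() {
    awk '/operation failed with -5/ {
        for (i = 1; i <= NF; i++) {
            # NID field, e.g. "192.168.1.46@o2ib." (strip trailing period)
            if ($i ~ /@o2ib/)              { sub(/\.$/, "", $i); nid = $i }
            # operation field, e.g. "ost_write"
            if ($i ~ /^ost_(read|write)$/) { op = $i }
        }
        count[nid " " op]++
    }
    END { for (k in count) print count[k], k }'
}
```

If one NID dominates the counts, that OSS (or the OSTs behind it) is the place to start looking.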

One more question I have in mind: why Lustre 2.3 here?

HTH


On 11 April 2014 17:13, Vijay Amirtharaj A <vijayamirtharajit at gmail.com> wrote:

> Hi Parinay Kondekar,
>
> Thanks for your reply.
>
> I am new to Lustre; please explain to me how to gather the information.
>
> Regards,
>
> Vijay Amirtharaj A
>
>
> On Fri, Apr 11, 2014 at 2:55 PM, Parinay Kondekar <
> parinay_kondekar at xyratex.com> wrote:
>
>> Apr 11 04:31:19 node16 kernel: LustreError: 3185:0:(osc_request.c:1689:osc_brw_redo_request())
>> @@@ redo for recoverable error -5  req at ffff8802d1826400 x1464726686245296/t0(0) o4->lustre-OST0002-osc-
>> ffff88106ab4dc00 at 192.168.1.46@o2ib:6/4 lens 488/416 e 0 to 0 dl
>> 1397170923 ref 2 fl Interpret:R/0/0 rc -5/-5
>>
>> The ost_write operation failed with -5. o4 = OST_WRITE
>>
>> The ost_read operation failed with -5. o3 = OST_READ
>>
>> #define EIO             5       /* I/O error */
>>
>> IMO, check the network, especially between the clients and the OSS.
>> It would be good to know what's happening on the servers.
>>
>>
>> HTH
>>
>>
>>
>>
>>
>>
>>
>> On 11 April 2014 14:12, Vijay Amirtharaj A <vijayamirtharajit at gmail.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> We have 50 TB of storage on Lustre; we are using Lustre
>>> 2.3.0-2.6.32_279.5.1.el6.x86_64.x86_64, OS: CentOS 6.3.
>>>
>>> We have 31 compute nodes.
>>>
>>> My issue is:
>>>
>>> When we restart the storage, my jobs run fine; that is, they write
>>> without any issue.
>>>
>>> After some time, my jobs exit with this error message:
>>>
>>> /var/spool/torque/mom_priv/jobs/8321.taavare.tuecms.com.SC: line 10: :
>>> No such file or directory
>>> -bash: /lustre/home/bala/.bash_profile: Cannot send after transport
>>> endpoint shutdown
>>> -bash: mpdallexit: command not found
>>>
>>> The following Lustre errors are repeating on the compute nodes.
>>>
>>> Apr 11 04:31:19 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.46 at o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 04:31:19 node16 kernel: LustreError: Skipped 1 previous similar
>>> message
>>> Apr 11 04:31:19 node16 kernel: LustreError:
>>> 3185:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5  req at ffff8802d1826400 x1464726686245296/t0(0)
>>> o4->lustre-OST0002-osc-ffff88106ab4dc00 at 192.168.1.46@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397170923 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 04:31:19 node16 kernel: LustreError:
>>> 3185:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 1 previous
>>> similar message
>>> Apr 11 05:34:07 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.44 at o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 05:34:07 node16 kernel: LustreError: Skipped 7 previous similar
>>> messages
>>> Apr 11 05:34:07 node16 kernel: LustreError:
>>> 3193:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5  req at ffff88081a33b400 x1464726686360348/t0(0)
>>> o4->lustre-OST0004-osc-ffff88106ab4dc00 at 192.168.1.44@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397174691 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 05:34:07 node16 kernel: LustreError:
>>> 3193:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 6 previous
>>> similar messages
>>> Apr 11 05:34:07 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.46 at o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 05:34:07 node16 kernel: LustreError: Skipped 2 previous similar
>>> messages
>>> Apr 11 05:34:07 node16 kernel: LustreError:
>>> 3199:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5  req at ffff880818b19800 x1464726686360319/t0(0)
>>> o4->lustre-OST0002-osc-ffff88106ab4dc00 at 192.168.1.46@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397174691 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 05:34:07 node16 kernel: LustreError:
>>> 3199:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 2 previous
>>> similar messages
>>> Apr 11 05:54:13 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.44 at o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 05:54:13 node16 kernel: LustreError: Skipped 5 previous similar
>>> messages
>>> Apr 11 05:54:13 node16 kernel: LustreError:
>>> 3193:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5  req at ffff88081a33cc00 x1464726686397633/t0(0)
>>> o4->lustre-OST0004-osc-ffff88106ab4dc00 at 192.168.1.44@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397175897 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 05:54:13 node16 kernel: LustreError:
>>> 3193:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 5 previous
>>> similar messages
>>> Apr 11 06:29:25 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.45 at o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 06:29:25 node16 kernel: LustreError:
>>> 3192:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5  req at ffff88081a249400 x1464726686461600/t0(0)
>>> o4->lustre-OST0006-osc-ffff88106ab4dc00 at 192.168.1.45@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397177972 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 06:29:26 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.45 at o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 06:29:26 node16 kernel: LustreError: Skipped 4 previous similar
>>> messages
>>> Apr 11 06:29:26 node16 kernel: LustreError:
>>> 3184:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5  req at ffff8807814bac00 x1464726686461778/t0(0)
>>> o4->lustre-OST0006-osc-ffff88106ab4dc00 at 192.168.1.45@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397177973 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 06:29:26 node16 kernel: LustreError:
>>> 3184:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 4 previous
>>> similar messages
>>> Apr 11 06:29:28 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.45 at o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 06:29:28 node16 kernel: LustreError: Skipped 4 previous similar
>>> messages
>>> Apr 11 06:29:28 node16 kernel: LustreError:
>>> 3192:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5  req at ffff88104a184c00 x1464726686461931/t0(0)
>>> o4->lustre-OST0006-osc-ffff88106ab4dc00 at 192.168.1.45@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397177975 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 06:29:28 node16 kernel: LustreError:
>>> 3192:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 4 previous
>>> similar messages
>>> Apr 11 07:10:05 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.44 at o2ib. The ost_write operation
>>> failed with -5
>>> Apr 11 07:10:05 node16 kernel: LustreError: Skipped 4 previous similar
>>> messages
>>> Apr 11 07:10:05 node16 kernel: LustreError:
>>> 3185:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5  req at ffff88081a33c800 x1464726686536452/t0(0)
>>> o4->lustre-OST0004-osc-ffff88106ab4dc00 at 192.168.1.44@o2ib:6/4 lens
>>> 488/416 e 0 to 0 dl 1397180449 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 07:10:05 node16 kernel: LustreError:
>>> 3185:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 3 previous
>>> similar messages
>>>
>>>
>>> Apr 11 08:34:31 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.45 at o2ib. The ost_read operation
>>> failed with -5
>>> Apr 11 08:34:31 node16 kernel: LustreError:
>>> 3193:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5  req at ffff880bb45a4000 x1464726686700285/t0(0)
>>> o3->lustre-OST0006-osc-ffff88106ab4dc00 at 192.168.1.45@o2ib:6/4 lens
>>> 488/400 e 0 to 0 dl 1397185515 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 08:34:57 node16 kernel: LustreError: 11-0: an error occurred
>>> while communicating with 192.168.1.45 at o2ib. The ost_read operation
>>> failed with -5
>>> Apr 11 08:34:57 node16 kernel: LustreError: Skipped 17 previous similar
>>> messages
>>> Apr 11 08:34:57 node16 kernel: LustreError:
>>> 3196:0:(osc_request.c:1689:osc_brw_redo_request()) @@@ redo for recoverable
>>> error -5  req at ffff881052fe2800 x1464726686701760/t0(0)
>>> o3->lustre-OST0006-osc-ffff88106ab4dc00 at 192.168.1.45@o2ib:6/4 lens
>>> 488/400 e 0 to 0 dl 1397185541 ref 2 fl Interpret:R/0/0 rc -5/-5
>>> Apr 11 08:34:57 node16 kernel: LustreError:
>>> 3196:0:(osc_request.c:1689:osc_brw_redo_request()) Skipped 17 previous
>>> similar messages
>>> Apr 11 08:37:35 node16 mpd: mpd ending mpdid=node16_50196 (inside
>>> cleanup)
>>>
>>>
>>> Please help me solve this issue.
>>>
>>> Regards,
>>> Vijay Amirtharaj A
>>>
>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>>>
>>
>>
>>
>

