<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <br>
    Hi Ashok<br>
    <br>
    If you have a valid support contract, log a call with your local SGI
    office; you have a couple of bad IB ports, probably a cable or some
    other such fault. Include the information you provided below<br>
    and ask them to help out.<br>
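The two failing ports flagged below can be confirmed with the same OFED infiniband-diags tools already used in this thread. A minimal sketch (the LID/port numbers are taken from the ibcheckerrors output quoted below; ibclearerrors is the legacy counter-reset script shipped alongside ibcheckerrors in those releases):<br>
<br>

```shell
# Clear the fabric-wide IB error counters, let real traffic run, then re-check.
# Counters that climb again quickly (SymbolErrors, RcvErrors) point at a
# genuinely bad cable/port rather than stale history from an old event.
ibclearerrors            # reset error counters on every port in the fabric
sleep 600                # give traffic time to exercise the links
ibcheckerrors            # any lid/port that FAILs again is the real culprit

# Read the PMA counters for just the suspect port (lid 9, port 1 below):
perfquery 9 1
```

<br>
Reseating or swapping the cable on the suspect port and repeating the check is the usual way to tell cable from HCA.<br>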
    <br>
    <br>
    On 30-September-2011 6:37 PM, Ashok nulguda wrote:
    <blockquote
cite="mid:CACGS=M9ithnCXkE4PVR2vZOzuLef+JHGNF-bQ3fT4gDfQj8AmA@mail.gmail.com"
      type="cite">
      <meta http-equiv="Content-Type" content="text/html;
        charset=ISO-8859-1">
      Dear Sir,<br>
      <br>
      <br>
      Thanks for your help.<br>
      <br>
      My system is an ICE 8400 cluster of 64 nodes with 30 TB of Lustre storage.<br>
      oss1:~ # df -h <br>
      Filesystem            Size  Used Avail Use% Mounted on<br>
      /dev/sda3             100G  5.8G   95G   6% /<br>
      tmpfs                  12G  1.1M   12G   1% /dev<br>
      tmpfs                  12G   88K   12G   1% /dev/shm<br>
      /dev/sda1            1020M  181M  840M  18% /boot<br>
      /dev/sda4             170G  6.6M  170G   1% /data1<br>
      /dev/mapper/3600a0b8000755ee0000010964dc231bc_part1<br>
                            2.1T   74G  1.9T   4% /OST1<br>
      /dev/mapper/3600a0b8000755ed1000010614dc23425_part1<br>
                            1.7T   67G  1.5T   5% /OST4<br>
      /dev/mapper/3600a0b8000755ee0000010a04dc23323_part1<br>
                            2.1T   67G  1.9T   4% /OST5<br>
      /dev/mapper/3600a0b8000755f1f000011224dc239d7_part1<br>
                            1.7T   67G  1.5T   5% /OST8<br>
      /dev/mapper/3600a0b8000755dbe000010de4dc23997_part1<br>
                            2.1T   66G  1.9T   4% /OST9<br>
      /dev/mapper/3600a0b8000755f1f000011284dc23b5a_part1<br>
                            1.7T   66G  1.5T   5% /OST12<br>
      /dev/mapper/3600a0b8000755eb3000011304dc23db1_part1<br>
                            2.1T   66G  1.9T   4% /OST13<br>
      /dev/mapper/3600a0b8000755f22000011104dc23ec7_part1<br>
                            1.7T   66G  1.5T   5% /OST16<br>
      <br>
      <br>
      oss1:~ # rpm -qa | grep -i lustre<br>
      kernel-default-2.6.27.39-0.3_lustre.1.8.4<br>
      kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default<br>
      lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
      kernel-default-base-2.6.27.39-0.3_lustre.1.8.4<br>
      lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
      lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default<br>
      <br>
      <br>
      oss2:~ # Filesystem            Size  Used Avail Use% Mounted on<br>
      /dev/sdcw3            100G  8.3G   92G   9% /<br>
      tmpfs                  12G  1.1M   12G   1% /dev<br>
      tmpfs                  12G   88K   12G   1% /dev/shm<br>
      /dev/sdcw1           1020M  144M  876M  15% /boot<br>
      /dev/sdcw4            170G   13M  170G   1% /data1<br>
      /dev/mapper/3600a0b8000755ed10000105e4dc23397_part1<br>
                            1.7T   69G  1.5T   5% /OST2<br>
      /dev/mapper/3600a0b8000755ee00000109b4dc232a0_part1<br>
                            2.1T   68G  1.9T   4% /OST3<br>
      /dev/mapper/3600a0b8000755ed1000010644dc2349f_part1<br>
                            1.7T   67G  1.5T   5% /OST6<br>
      /dev/mapper/3600a0b8000755dbe000010d94dc23873_part1<br>
                            2.1T   67G  1.9T   4% /OST7<br>
      /dev/mapper/3600a0b8000755f1f000011254dc23add_part1<br>
                            1.7T   66G  1.5T   5% /OST10<br>
      /dev/mapper/3600a0b8000755dbe000010e34dc23a09_part1<br>
                            2.1T   66G  1.9T   4% /OST11<br>
      /dev/mapper/3600a0b8000755f220000110d4dc23e36_part1<br>
                            1.7T   66G  1.5T   5% /OST14<br>
      /dev/mapper/3600a0b8000755eb3000011354dc23e39_part1<br>
                            2.1T   66G  1.9T   4% /OST15<br>
      /dev/mapper/3600a0b8000755eb30000113a4dc23ec4_part1<br>
                            1.4T   66G  1.3T   6% /OST17<br>
      <br>
      [1]+  Done                    df -h<br>
      <br>
      oss2:~ # rpm -qa | grep -i lustre<br>
      lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
      kernel-default-base-2.6.27.39-0.3_lustre.1.8.4<br>
      kernel-default-2.6.27.39-0.3_lustre.1.8.4<br>
      kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default<br>
      lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default<br>
      lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
      <br>
      mdc1:~ # Filesystem            Size  Used Avail Use% Mounted on<br>
      /dev/sde2             100G  5.2G   95G   6% /<br>
      tmpfs                  12G  184K   12G   1% /dev<br>
      tmpfs                  12G   88K   12G   1% /dev/shm<br>
      /dev/sde1            1020M  181M  840M  18% /boot<br>
      /dev/sde4             167G  196M  159G   1% /data1<br>
      /dev/mapper/3600a0b8000755f22000011134dc23f7e_part1<br>
                            489G  2.3G  458G   1% /MDC<br>
      <br>
      [1]+  Done                    df -h<br>
      mdc1:~ # <br>
      <br>
      <br>
      mdc1:~ # rpm -qa | grep -i lustre<br>
      lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
      kernel-default-2.6.27.39-0.3_lustre.1.8.4<br>
      lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default<br>
      kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default<br>
      lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
      kernel-default-base-2.6.27.39-0.3_lustre.1.8.4<br>
      mdc1:~ # <br>
      <br>
      mdc2:~ # Filesystem            Size  Used Avail Use% Mounted on<br>
      /dev/sde3             100G  5.0G   95G   5% /<br>
      tmpfs                  18G  184K   18G   1% /dev<br>
      tmpfs                 7.8G   88K  7.8G   1% /dev/shm<br>
      /dev/sde1            1020M  144M  876M  15% /boot<br>
      /dev/sde4             170G  6.6M  170G   1% /data1<br>
      <br>
      [1]+  Done                    df -h<br>
      mdc2:~ # rpm -qqa | grep -i lustre<br>
      lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
      kernel-default-base-2.6.27.39-0.3_lustre.1.8.4<br>
      kernel-default-2.6.27.39-0.3_lustre.1.8.4<br>
      lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default<br>
      kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default<br>
      lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
      mdc2:~ # <br>
      <br>
      <br>
      service0:~ # ibstat<br>
      CA 'mlx4_0'<br>
          CA type: MT26428<br>
          Number of ports: 2<br>
          Firmware version: 2.7.0<br>
          Hardware version: a0<br>
          Node GUID: 0x0002c903000a6028<br>
          System image GUID: 0x0002c903000a602b<br>
          Port 1:<br>
              State: Active<br>
              Physical state: LinkUp<br>
              Rate: 40<br>
              Base lid: 9<br>
              LMC: 0<br>
              SM lid: 1<br>
              Capability mask: 0x02510868<br>
              Port GUID: 0x0002c903000a6029<br>
          Port 2:<br>
              State: Active<br>
              Physical state: LinkUp<br>
              Rate: 40<br>
              Base lid: 10<br>
              LMC: 0<br>
              SM lid: 1<br>
              Capability mask: 0x02510868<br>
              Port GUID: 0x0002c903000a602a<br>
      service0:~ # <br>
      <br>
      <br>
      <br>
      service0:~ # ibstatus <br>
      Infiniband device 'mlx4_0' port 1 status:<br>
          default gid:     fec0:0000:0000:0000:0002:c903:000a:6029<br>
          base lid:     0x9<br>
          sm lid:         0x1<br>
          state:         4: ACTIVE<br>
          phys state:     5: LinkUp<br>
          rate:         40 Gb/sec (4X QDR)<br>
      <br>
      Infiniband device 'mlx4_0' port 2 status:<br>
          default gid:     fec0:0000:0000:0000:0002:c903:000a:602a<br>
          base lid:     0xa<br>
          sm lid:         0x1<br>
          state:         4: ACTIVE<br>
          phys state:     5: LinkUp<br>
          rate:         40 Gb/sec (4X QDR)<br>
      <br>
      service0:~ # <br>
      <br>
      <br>
      <br>
      service0:~ # ibdiagnet <br>
      Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2<br>
      -W- Topology file is not specified.<br>
          Reports regarding cluster links will use direct routes.<br>
      Loading IBDM from: /usr/lib64/ibdm1.2<br>
      -W- A few ports of local device are up.<br>
          Since port-num was not specified (-p option), port 1 of device
      1 will be<br>
          used as the local port.<br>
      -I- Discovering ... 88 nodes (9 Switches & 79 CA-s)
      discovered.<br>
      <br>
      <br>
      -I---------------------------------------------------<br>
      -I- Bad Guids/LIDs Info<br>
      -I---------------------------------------------------<br>
      -I- No bad Guids were found<br>
      <br>
      -I---------------------------------------------------<br>
      -I- Links With Logical State = INIT<br>
      -I---------------------------------------------------<br>
      -I- No bad Links (with logical state = INIT) were found<br>
      <br>
      -I---------------------------------------------------<br>
      -I- PM Counters Info<br>
      -I---------------------------------------------------<br>
      -I- No illegal PM counters values were found<br>
      <br>
      -I---------------------------------------------------<br>
      -I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts
      list)<br>
      -I---------------------------------------------------<br>
      -I-    PKey:0x7fff Hosts:81 full:81 partial:0<br>
      <br>
      -I---------------------------------------------------<br>
      -I- IPoIB Subnets Check<br>
      -I---------------------------------------------------<br>
      -I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte
      rate:10Gbps SL:0x00<br>
      -W- Suboptimal rate for group. Lowest member rate:20Gbps >
      group-rate:10Gbps<br>
      <br>
      -I---------------------------------------------------<br>
      -I- Bad Links Info<br>
      -I- No bad link were found<br>
      -I---------------------------------------------------<br>
      ----------------------------------------------------------------<br>
      -I- Stages Status Report:<br>
          STAGE                                    Errors Warnings<br>
          Bad GUIDs/LIDs Check                     0      0     <br>
          Link State Active Check                  0      0     <br>
          Performance Counters Report              0      0     <br>
          Partitions Check                         0      0     <br>
          IPoIB Subnets Check                      0      1     <br>
      <br>
      Please see /tmp/ibdiagnet.log for complete log<br>
      ----------------------------------------------------------------<br>
       <br>
      -I- Done. Run time was 9 seconds.<br>
      service0:~ # <br>
      <br>
      <br>
      service0:~ # ibcheckerrors <br>
      #warn: counter VL15Dropped = 18584     (threshold 100) lid 1 port
      1<br>
      Error check on lid 1 (r1lead HCA-1) port 1:  FAILED <br>
      #warn: counter SymbolErrors = 42829     (threshold 10) lid 9 port
      1<br>
      #warn: counter RcvErrors = 9279     (threshold 10) lid 9 port 1<br>
      Error check on lid 9 (service0 HCA-1) port 1:  FAILED <br>
      <br>
      ## Summary: 88 nodes checked, 0 bad nodes found<br>
      ##          292 ports checked, 2 ports have errors beyond
      threshold<br>
      service0:~ # <br>
      <br>
      <br>
      service0:~ # ibchecknet <br>
      <br>
      # Checking Ca: nodeguid 0x0002c903000abfc2<br>
      <br>
      # Checking Ca: nodeguid 0x0002c903000ac00e<br>
      <br>
      # Checking Ca: nodeguid 0x0002c903000a69dc<br>
      <br>
      # Checking Ca: nodeguid 0x0002c9030009cd46<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d878<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d880<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d87c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d884<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d888<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d88c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d890<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d894<br>
      <br>
      # Checking Ca: nodeguid 0x0002c9020029fa50<br>
      #warn: counter VL15Dropped = 18617     (threshold 100) lid 1 port
      1<br>
      Error check on lid 1 (r1lead HCA-1) port 1:  FAILED <br>
      <br>
      # Checking Ca: nodeguid 0x0002c90300054eac<br>
      <br>
      # Checking Ca: nodeguid 0x0002c9030009cebe<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4c9f8<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db08<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db40<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db44<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db48<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db4c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db0c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dca0<br>
      <br>
      # Checking Ca: nodeguid 0x0002c903000abfe2<br>
      <br>
      # Checking Ca: nodeguid 0x0002c903000abfe6<br>
      <br>
      # Checking Ca: nodeguid 0x0002c9030009dd28<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db54<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db58<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4c9f4<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db50<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db3c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db38<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db14<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db10<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d8a8<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d8ac<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d8b4<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d8b0<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db70<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db68<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db64<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db78<br>
      <br>
      # Checking Ca: nodeguid 0x0002c903000a69f0<br>
      <br>
      # Checking Ca: nodeguid 0x0002c9030006004a<br>
      <br>
      # Checking Ca: nodeguid 0x0002c9030009dd2c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d8b8<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d8bc<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d8a4<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d8a0<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db7c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db80<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db6c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db74<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dcb8<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dcd0<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dc5c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dc60<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dc54<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dc50<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dc4c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dcd4<br>
      <br>
      # Checking Ca: nodeguid 0x0002c903000a6164<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dcf0<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db5c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dc90<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dc8c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dc58<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dc94<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dc9c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db60<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d89c<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d898<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dad8<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4dadc<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db30<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4db34<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d874<br>
      <br>
      # Checking Ca: nodeguid 0x003048fffff4d870<br>
      <br>
      # Checking Ca: nodeguid 0x0002c903000a6028<br>
      #warn: counter SymbolErrors = 44150     (threshold 10) lid 9 port
      1<br>
      #warn: counter RcvErrors = 9283     (threshold 10) lid 9 port 1<br>
      Error check on lid 9 (service0 HCA-1) port 1:  FAILED <br>
      <br>
      ## Summary: 88 nodes checked, 0 bad nodes found<br>
      ##          292 ports checked, 0 bad ports found<br>
      ##          2 ports have errors beyond threshold<br>
      <br>
      <br>
      <br>
      service0:~ # ibcheckstate<br>
      <br>
      ## Summary: 88 nodes checked, 0 bad nodes found<br>
      ##          292 ports checked, 0 ports with bad state found<br>
      service0:~ # ibcheckwidth<br>
      <br>
      ## Summary: 88 nodes checked, 0 bad nodes found<br>
      ##          292 ports checked, 0 ports with 1x width in error
      found<br>
      service0:~ # <br>
      <br>
      <br>
      Thanks and Regards<br>
      Ashok<br>
      <br>
      <br>
      <br>
      <div class="gmail_quote">On 30 September 2011 12:39, Brian
        O'Connor <span dir="ltr"><<a moz-do-not-send="true"
            href="mailto:briano@sgi.com">briano@sgi.com</a>></span>
        wrote:<br>
        <blockquote class="gmail_quote" style="margin:0 0 0
          .8ex;border-left:1px #ccc solid;padding-left:1ex;">
          <div bgcolor="#FFFFFF" text="#000000"> Hello Ashok<br>
            <br>
            Is the cluster hanging or otherwise behaving badly? The logs
            below show that the client<br>
            lost connection to 10.148.0.106 for 10 seconds or so. It
            should have recovered OK.<br>
            <br>
            If you want further help from the list you need to add more
            detail about the cluster, i.e.<br>
            a general description of the number of OSSes/OSTs, clients,
            the version of Lustre etc., and a description<br>
            of what is actually going wrong, i.e. hanging, offline etc.<br>
            <br>
            The first thing to do is check the infrastructure, i.e. in this
            case you should check your IB network for errors.
            <div>
              <div class="h5"><br>
                <br>
                <br>
                <br>
                On 30-September-2011 2:39 PM, Ashok nulguda wrote: </div>
            </div>
            <blockquote type="cite">
              <div>
                <div class="h5"> Dear All,<br>
                  <br>
                  I am getting Lustre errors on my HPC system, as given
                  below. Can anyone please help me resolve this
                  problem?<br>
                  Thanks in advance.<br>
                  Sep 30 08:40:23 service0 kernel: [343138.837222]
                  Lustre:
                  8300:0:(client.c:1476:ptlrpc_expire_one_request())
                  Skipped 1 previous similar message<br>
                  Sep 30 08:40:23 service0 kernel: [343138.837233]
                  Lustre: lustre-OST0008-osc-ffff880b272cf800:
                  Connection to service lustre-OST0008 via nid
                  10.148.0.106@o2ib was lost; in progress operations
                  using this service will wait for recovery to complete.<br>
                  Sep 30 08:40:24 service0 kernel: [343139.837260]
                  Lustre:
                  8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@
                  Request x1380984193067288 sent from
                  lustre-OST0006-osc-ffff880b272cf800 to NID
                  10.148.0.106@o2ib 7s ago has timed out (7s prior to
                  deadline).<br>
                  Sep 30 08:40:24 service0 kernel: [343139.837263]  
                  req@ffff880a5f800c00 x1380984193067288/t0 o3-><a
                    moz-do-not-send="true"
                    href="mailto:lustre-OST0006_UUID@10.148.0.106@o2ib:6/4"
                    target="_blank">lustre-OST0006_UUID@10.148.0.106@o2ib:6/4</a>
                  lens 448/592 e 0 to 1 dl 1317352224 ref 2 fl Rpc:/0/0
                  rc 0/0<br>
                  Sep 30 08:40:24 service0 kernel: [343139.837269]
                  Lustre:
                  8300:0:(client.c:1476:ptlrpc_expire_one_request())
                  Skipped 38 previous similar messages<br>
                  Sep 30 08:40:24 service0 kernel: [343140.129284]
                  LustreError:
                  9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got
                  rc -11 from cancel RPC: canceling anyway<br>
                  Sep 30 08:40:24 service0 kernel: [343140.129290]
                  LustreError:
                  9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req())
                  Skipped 1 previous similar message<br>
                  Sep 30 08:40:24 service0 kernel: [343140.129295]
                  LustreError:
                  9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list())
                  ldlm_cli_cancel_list: -11<br>
                  Sep 30 08:40:24 service0 kernel: [343140.129299]
                  LustreError:
                  9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list())
                  Skipped 1 previous similar message<br>
                  Sep 30 08:40:25 service0 kernel: [343140.837308]
                  Lustre:
                  8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@
                  Request x1380984193067299 sent from
                  lustre-OST0010-osc-ffff880b272cf800 to NID
                  10.148.0.106@o2ib 7s ago has timed out (7s prior to
                  deadline).<br>
                  Sep 30 08:40:25 service0 kernel: [343140.837311]  
                  req@ffff880a557c4400 x1380984193067299/t0 o3-><a
                    moz-do-not-send="true"
                    href="mailto:lustre-OST0010_UUID@10.148.0.106@o2ib:6/4"
                    target="_blank">lustre-OST0010_UUID@10.148.0.106@o2ib:6/4</a>
                  lens 448/592 e 0 to 1 dl 1317352225 ref 2 fl Rpc:/0/0
                  rc 0/0<br>
                  Sep 30 08:40:25 service0 kernel: [343140.837316]
                  Lustre:
                  8300:0:(client.c:1476:ptlrpc_expire_one_request())
                  Skipped 4 previous similar messages<br>
                  Sep 30 08:40:26 service0 kernel: [343141.245365]
                  LustreError:
                  30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req())
                  Got rc -11 from cancel RPC: canceling anyway<br>
                  Sep 30 08:40:26 service0 kernel: [343141.245371]
                  LustreError:
                  22729:0:(ldlm_request.c:1587:ldlm_cli_cancel_list())
                  ldlm_cli_cancel_list: -11<br>
                  Sep 30 08:40:26 service0 kernel: [343141.245378]
                  LustreError:
                  30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req())
                  Skipped 1 previous similar message<br>
                  Sep 30 08:40:33 service0 kernel: [343148.245683]
                  Lustre:
                  22725:0:(client.c:1476:ptlrpc_expire_one_request())
                  @@@ Request x1380984193067302 sent from
                  lustre-OST0004-osc-ffff880b272cf800 to NID
                  10.148.0.106@o2ib 14s ago has timed out (14s prior to
                  deadline).<br>
                  Sep 30 08:40:33 service0 kernel: [343148.245686]  
                  req@ffff8805c879e800 x1380984193067302/t0 o103-><a
                    moz-do-not-send="true"
                    href="mailto:lustre-OST0004_UUID@10.148.0.106@o2ib:17/18"
                    target="_blank">lustre-OST0004_UUID@10.148.0.106@o2ib:17/18</a>
                  lens 296/384 e 0 to 1 dl 1317352233 ref 1 fl Rpc:N/0/0
                  rc 0/0<br>
                  Sep 30 08:40:33 service0 kernel: [343148.245692]
                  Lustre:
                  22725:0:(client.c:1476:ptlrpc_expire_one_request())
                  Skipped 2 previous similar messages<br>
                  Sep 30 08:40:33 service0 kernel: [343148.245708]
                  LustreError:
                  22725:0:(ldlm_request.c:1025:ldlm_cli_cancel_req())
                  Got rc -11 from cancel RPC: canceling anyway<br>
                  Sep 30 08:40:33 service0 kernel: [343148.245714]
                  LustreError:
                  22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list())
                  ldlm_cli_cancel_list: -11<br>
                  Sep 30 08:40:33 service0 kernel: [343148.245717]
                  LustreError:
                  22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list())
                  Skipped 1 previous similar message<br>
                  Sep 30 08:40:36 service0 kernel: [343151.548005]
                  LustreError: 11-0: an error occurred while
                  communicating with 10.148.0.106@o2ib. The ost_connect
                  operation failed with -16<br>
                  Sep 30 08:40:36 service0 kernel: [343151.548008]
                  LustreError: Skipped 1 previous similar message<br>
                  Sep 30 08:40:36 service0 kernel: [343151.548024]
                  LustreError: 167-0: This client was evicted by
                  lustre-OST000b; in progress operations using this
                  service will fail.<br>
                  Sep 30 08:40:36 service0 kernel: [343151.548250]
                  LustreError:
                  30452:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't
                  unlock -5<br>
                  Sep 30 08:40:36 service0 kernel: [343151.550210]
                  LustreError:
                  8300:0:(client.c:858:ptlrpc_import_delay_req()) @@@
                  IMP_INVALID  req@ffff88049528c400 x1380984193067406/t0
                  o3-><a moz-do-not-send="true"
                    href="mailto:lustre-OST000b_UUID@10.148.0.106@o2ib:6/4"
                    target="_blank">lustre-OST000b_UUID@10.148.0.106@o2ib:6/4</a>
                  lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0<br>
                  Sep 30 08:40:36 service0 kernel: [343151.594742]
                  Lustre: lustre-OST0000-osc-ffff880b272cf800:
                  Connection restored to service lustre-OST0000 using
                  nid 10.148.0.106@o2ib.<br>
                  Sep 30 08:40:36 service0 kernel: [343151.837203]
                  Lustre: lustre-OST0006-osc-ffff880b272cf800:
                  Connection restored to service lustre-OST0006 using
                  nid 10.148.0.106@o2ib.<br>
                  Sep 30 08:40:37 service0 kernel: [343152.842631]
                  Lustre: lustre-OST0003-osc-ffff880b272cf800:
                  Connection restored to service lustre-OST0003 using
                  nid 10.148.0.106@o2ib.<br>
                  Sep 30 08:40:37 service0 kernel: [343152.842636]
                  Lustre: Skipped 3 previous similar messages<br>
                  <br>
                  <br>
                  Thanks and Regards<br>
                  Ashok<br clear="all">
                  <br>
                  -- <br>
                  <div style="margin:0in 0in 0pt"><b><font
                        face="Cambria">Ashok Nulguda<br>
                      </font></b></div>
                  <div style="margin:0in 0in 0pt"><b><font
                        face="Cambria">TATA ELXSI LTD</font></b></div>
                  <div style="margin:0in 0in 0pt"><span
                      style="font-family:'Cambria','serif'"></span></div>
                  <div style="margin:0in 0in 0pt"><span
                      style="font-family:'Cambria','serif'"></span></div>
                  <div style="margin:0in 0in 0pt"><span
                      style="font-family:'Cambria','serif'"><b>Mb : +91
                        9689945767<br>
                      </b></span></div>
                  <div style="margin:0in 0in 0pt"><span
                      style="font-family:'Cambria','serif'"></span><span
                      style="font-family:'Cambria','serif'"><font
                        color="#0000ff"><b>Email :<a
                            moz-do-not-send="true"
                            href="mailto:tshrikant@tataelxsi.co.in"
                            target="_blank">ashokn@tataelxsi.co.in</a></b></font></span></div>
                  <br>
                  <br>
                  <fieldset></fieldset>
                  <br>
                </div>
              </div>
              <pre>_______________________________________________
Lustre-discuss mailing list
<a moz-do-not-send="true" href="mailto:Lustre-discuss@lists.lustre.org" target="_blank">Lustre-discuss@lists.lustre.org</a>
<a moz-do-not-send="true" href="http://lists.lustre.org/mailman/listinfo/lustre-discuss" target="_blank">http://lists.lustre.org/mailman/listinfo/lustre-discuss</a>
</pre>
            </blockquote>
            <font color="#888888"> <br>
              <br>
              <pre cols="72">-- 
Brian O'Connor
-------------------------------------------------
SGI Consulting
Email: <a moz-do-not-send="true" href="mailto:briano@sgi.com" target="_blank">briano@sgi.com</a>, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124 
AUSTRALIA <a moz-do-not-send="true" href="http://www.sgi.com/support/services" target="_blank">http://www.sgi.com/support/services</a>
-------------------------------------------------

 
</pre>
            </font></div>
        </blockquote>
      </div>
      <br>
      <br clear="all">
      <br>
      -- <br>
      <div style="margin:0in 0in 0pt"><b><font face="Cambria">Ashok
            Nulguda<br>
          </font></b></div>
      <div style="margin:0in 0in 0pt"><b><font face="Cambria">TATA ELXSI
            LTD</font></b></div>
      <div style="margin:0in 0in 0pt"><span
          style="font-family:'Cambria','serif'"></span></div>
      <div style="margin:0in 0in 0pt"><span
          style="font-family:'Cambria','serif'"></span></div>
      <div style="margin:0in 0in 0pt"><span
          style="font-family:'Cambria','serif'"><b>Mb : +91 9689945767<br>
          </b></span></div>
      <div style="margin:0in 0in 0pt"><span
          style="font-family:'Cambria','serif'"></span><span
          style="font-family:'Cambria','serif'"><font color="#0000ff"><b>Email
              :<a moz-do-not-send="true"
                href="mailto:tshrikant@tataelxsi.co.in" target="_blank">ashokn@tataelxsi.co.in</a></b></font></span></div>
      <br>
    </blockquote>
    <br>
    <br>
    <pre class="moz-signature" cols="72">-- 
Brian O'Connor
-------------------------------------------------
SGI Consulting
Email: <a class="moz-txt-link-abbreviated" href="mailto:briano@sgi.com">briano@sgi.com</a>, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124 
AUSTRALIA <a class="moz-txt-link-freetext" href="http://www.sgi.com/support/services">http://www.sgi.com/support/services</a>
-------------------------------------------------

 
</pre>
  </body>
</html>