<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
Hi Ashok<br>
<br>
If you have a valid support contract, log a call with your local SGI
office. The ibcheckerrors output shows a couple of bad IB ports
(lid 1 port 1 on r1lead and lid 9 port 1 on service0), possibly caused
by a faulty cable or something similar. Include the information you provided below<br>
and ask them to help out.<br>
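<br>
In case it helps when you log the call: comparing the two runs quoted
below, the error counters are still climbing, which suggests the errors
are ongoing rather than historical. A rough Python sketch of that
comparison follows. It assumes the warning lines have the single-line
form shown in the sample strings; the function names parse_warnings and
counter_growth are made up for illustration and are not part of any IB
tool.<br>
<pre>
```python
import re

# Matches lines such as:
#   #warn: counter SymbolErrors = 42829 (threshold 10) lid 9 port 1
# The exact line layout is an assumption based on the output quoted below.
WARN_RE = re.compile(
    r"#warn: counter (\w+) = (\d+)\s+\(threshold (\d+)\)\s+lid (\d+) port (\d+)"
)


def parse_warnings(text):
    """Return {(lid, port, counter): value} for every #warn line in text."""
    found = {}
    for match in WARN_RE.finditer(text):
        counter, value, _threshold, lid, port = match.groups()
        found[(int(lid), int(port), counter)] = int(value)
    return found


def counter_growth(earlier, later):
    """Return how much each counter grew between two runs."""
    return {key: later[key] - earlier[key] for key in later if key in earlier}


# Values copied from the ibcheckerrors and ibchecknet runs quoted below.
first_run = """
#warn: counter VL15Dropped = 18584 (threshold 100) lid 1 port 1
#warn: counter SymbolErrors = 42829 (threshold 10) lid 9 port 1
#warn: counter RcvErrors = 9279 (threshold 10) lid 9 port 1
"""
second_run = """
#warn: counter VL15Dropped = 18617 (threshold 100) lid 1 port 1
#warn: counter SymbolErrors = 44150 (threshold 10) lid 9 port 1
#warn: counter RcvErrors = 9283 (threshold 10) lid 9 port 1
"""

growth = counter_growth(parse_warnings(first_run), parse_warnings(second_run))
for (lid, port, counter), delta in sorted(growth.items()):
    print(f"lid {lid} port {port}: {counter} grew by {delta}")
```
</pre>
With the values from your output, this reports SymbolErrors on lid 9
port 1 growing by 1321 between the two runs.<br>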
<br>
<br>
On 30-September-2011 6:37 PM, Ashok nulguda wrote:
<blockquote
cite="mid:CACGS=M9ithnCXkE4PVR2vZOzuLef+JHGNF-bQ3fT4gDfQj8AmA@mail.gmail.com"
type="cite">
Dear Sir,<br>
<br>
<br>
Thanks for your help.<br>
<br>
My system is a 64-node ICE 8400 cluster with 30 TB of Lustre storage.<br>
oss1:~ # df -h <br>
Filesystem Size Used Avail Use% Mounted on<br>
/dev/sda3 100G 5.8G 95G 6% /<br>
tmpfs 12G 1.1M 12G 1% /dev<br>
tmpfs 12G 88K 12G 1% /dev/shm<br>
/dev/sda1 1020M 181M 840M 18% /boot<br>
/dev/sda4 170G 6.6M 170G 1% /data1<br>
/dev/mapper/3600a0b8000755ee0000010964dc231bc_part1<br>
2.1T 74G 1.9T 4% /OST1<br>
/dev/mapper/3600a0b8000755ed1000010614dc23425_part1<br>
1.7T 67G 1.5T 5% /OST4<br>
/dev/mapper/3600a0b8000755ee0000010a04dc23323_part1<br>
2.1T 67G 1.9T 4% /OST5<br>
/dev/mapper/3600a0b8000755f1f000011224dc239d7_part1<br>
1.7T 67G 1.5T 5% /OST8<br>
/dev/mapper/3600a0b8000755dbe000010de4dc23997_part1<br>
2.1T 66G 1.9T 4% /OST9<br>
/dev/mapper/3600a0b8000755f1f000011284dc23b5a_part1<br>
1.7T 66G 1.5T 5% /OST12<br>
/dev/mapper/3600a0b8000755eb3000011304dc23db1_part1<br>
2.1T 66G 1.9T 4% /OST13<br>
/dev/mapper/3600a0b8000755f22000011104dc23ec7_part1<br>
1.7T 66G 1.5T 5% /OST16<br>
<br>
<br>
oss1:~ # rpm -qa | grep -i lustre<br>
kernel-default-2.6.27.39-0.3_lustre.1.8.4<br>
kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default<br>
lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
kernel-default-base-2.6.27.39-0.3_lustre.1.8.4<br>
lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default<br>
<br>
<br>
oss2:~ # Filesystem Size Used Avail Use% Mounted on<br>
/dev/sdcw3 100G 8.3G 92G 9% /<br>
tmpfs 12G 1.1M 12G 1% /dev<br>
tmpfs 12G 88K 12G 1% /dev/shm<br>
/dev/sdcw1 1020M 144M 876M 15% /boot<br>
/dev/sdcw4 170G 13M 170G 1% /data1<br>
/dev/mapper/3600a0b8000755ed10000105e4dc23397_part1<br>
1.7T 69G 1.5T 5% /OST2<br>
/dev/mapper/3600a0b8000755ee00000109b4dc232a0_part1<br>
2.1T 68G 1.9T 4% /OST3<br>
/dev/mapper/3600a0b8000755ed1000010644dc2349f_part1<br>
1.7T 67G 1.5T 5% /OST6<br>
/dev/mapper/3600a0b8000755dbe000010d94dc23873_part1<br>
2.1T 67G 1.9T 4% /OST7<br>
/dev/mapper/3600a0b8000755f1f000011254dc23add_part1<br>
1.7T 66G 1.5T 5% /OST10<br>
/dev/mapper/3600a0b8000755dbe000010e34dc23a09_part1<br>
2.1T 66G 1.9T 4% /OST11<br>
/dev/mapper/3600a0b8000755f220000110d4dc23e36_part1<br>
1.7T 66G 1.5T 5% /OST14<br>
/dev/mapper/3600a0b8000755eb3000011354dc23e39_part1<br>
2.1T 66G 1.9T 4% /OST15<br>
/dev/mapper/3600a0b8000755eb30000113a4dc23ec4_part1<br>
1.4T 66G 1.3T 6% /OST17<br>
<br>
[1]+ Done df -h<br>
<br>
oss2:~ # rpm -qa | grep -i lustre<br>
lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
kernel-default-base-2.6.27.39-0.3_lustre.1.8.4<br>
kernel-default-2.6.27.39-0.3_lustre.1.8.4<br>
kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default<br>
lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default<br>
lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
<br>
mdc1:~ # Filesystem Size Used Avail Use% Mounted on<br>
/dev/sde2 100G 5.2G 95G 6% /<br>
tmpfs 12G 184K 12G 1% /dev<br>
tmpfs 12G 88K 12G 1% /dev/shm<br>
/dev/sde1 1020M 181M 840M 18% /boot<br>
/dev/sde4 167G 196M 159G 1% /data1<br>
/dev/mapper/3600a0b8000755f22000011134dc23f7e_part1<br>
489G 2.3G 458G 1% /MDC<br>
<br>
[1]+ Done df -h<br>
mdc1:~ # <br>
<br>
<br>
mdc1:~ # rpm -qa | grep -i lustre<br>
lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
kernel-default-2.6.27.39-0.3_lustre.1.8.4<br>
lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default<br>
kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default<br>
lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
kernel-default-base-2.6.27.39-0.3_lustre.1.8.4<br>
mdc1:~ # <br>
<br>
mdc2:~ # Filesystem Size Used Avail Use% Mounted on<br>
/dev/sde3 100G 5.0G 95G 5% /<br>
tmpfs 18G 184K 18G 1% /dev<br>
tmpfs 7.8G 88K 7.8G 1% /dev/shm<br>
/dev/sde1 1020M 144M 876M 15% /boot<br>
/dev/sde4 170G 6.6M 170G 1% /data1<br>
<br>
[1]+ Done df -h<br>
mdc2:~ # rpm -qqa | grep -i lustre<br>
lustre-modules-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
kernel-default-base-2.6.27.39-0.3_lustre.1.8.4<br>
kernel-default-2.6.27.39-0.3_lustre.1.8.4<br>
lustre-ldiskfs-3.1.3-2.6.27_39_0.3_lustre.1.8.4_default<br>
kernel-ib-1.5.1-2.6.27.39_0.3_lustre.1.8.4_default<br>
lustre-1.8.4-2.6.27_39_0.3_lustre.1.8.4_default<br>
mdc2:~ # <br>
<br>
<br>
service0:~ # ibstat<br>
CA 'mlx4_0'<br>
CA type: MT26428<br>
Number of ports: 2<br>
Firmware version: 2.7.0<br>
Hardware version: a0<br>
Node GUID: 0x0002c903000a6028<br>
System image GUID: 0x0002c903000a602b<br>
Port 1:<br>
State: Active<br>
Physical state: LinkUp<br>
Rate: 40<br>
Base lid: 9<br>
LMC: 0<br>
SM lid: 1<br>
Capability mask: 0x02510868<br>
Port GUID: 0x0002c903000a6029<br>
Port 2:<br>
State: Active<br>
Physical state: LinkUp<br>
Rate: 40<br>
Base lid: 10<br>
LMC: 0<br>
SM lid: 1<br>
Capability mask: 0x02510868<br>
Port GUID: 0x0002c903000a602a<br>
service0:~ # <br>
<br>
<br>
<br>
service0:~ # ibstatus <br>
Infiniband device 'mlx4_0' port 1 status:<br>
default gid: fec0:0000:0000:0000:0002:c903:000a:6029<br>
base lid: 0x9<br>
sm lid: 0x1<br>
state: 4: ACTIVE<br>
phys state: 5: LinkUp<br>
rate: 40 Gb/sec (4X QDR)<br>
<br>
Infiniband device 'mlx4_0' port 2 status:<br>
default gid: fec0:0000:0000:0000:0002:c903:000a:602a<br>
base lid: 0xa<br>
sm lid: 0x1<br>
state: 4: ACTIVE<br>
phys state: 5: LinkUp<br>
rate: 40 Gb/sec (4X QDR)<br>
<br>
service0:~ # <br>
<br>
<br>
<br>
service0:~ # ibdiagnet <br>
Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2<br>
-W- Topology file is not specified.<br>
Reports regarding cluster links will use direct routes.<br>
Loading IBDM from: /usr/lib64/ibdm1.2<br>
-W- A few ports of local device are up.<br>
Since port-num was not specified (-p option), port 1 of device
1 will be<br>
used as the local port.<br>
-I- Discovering ... 88 nodes (9 Switches &amp; 79 CA-s)
discovered.<br>
<br>
<br>
-I---------------------------------------------------<br>
-I- Bad Guids/LIDs Info<br>
-I---------------------------------------------------<br>
-I- No bad Guids were found<br>
<br>
-I---------------------------------------------------<br>
-I- Links With Logical State = INIT<br>
-I---------------------------------------------------<br>
-I- No bad Links (with logical state = INIT) were found<br>
<br>
-I---------------------------------------------------<br>
-I- PM Counters Info<br>
-I---------------------------------------------------<br>
-I- No illegal PM counters values were found<br>
<br>
-I---------------------------------------------------<br>
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts
list)<br>
-I---------------------------------------------------<br>
-I- PKey:0x7fff Hosts:81 full:81 partial:0<br>
<br>
-I---------------------------------------------------<br>
-I- IPoIB Subnets Check<br>
-I---------------------------------------------------<br>
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte
rate:10Gbps SL:0x00<br>
-W- Suboptimal rate for group. Lowest member rate:20Gbps >
group-rate:10Gbps<br>
<br>
-I---------------------------------------------------<br>
-I- Bad Links Info<br>
-I- No bad link were found<br>
-I---------------------------------------------------<br>
----------------------------------------------------------------<br>
-I- Stages Status Report:<br>
STAGE Errors Warnings<br>
Bad GUIDs/LIDs Check 0 0 <br>
Link State Active Check 0 0 <br>
Performance Counters Report 0 0 <br>
Partitions Check 0 0 <br>
IPoIB Subnets Check 0 1 <br>
<br>
Please see /tmp/ibdiagnet.log for complete log<br>
----------------------------------------------------------------<br>
<br>
-I- Done. Run time was 9 seconds.<br>
service0:~ # <br>
<br>
<br>
service0:~ # ibcheckerrors <br>
#warn: counter VL15Dropped = 18584 (threshold 100) lid 1 port
1<br>
Error check on lid 1 (r1lead HCA-1) port 1: FAILED <br>
#warn: counter SymbolErrors = 42829 (threshold 10) lid 9 port
1<br>
#warn: counter RcvErrors = 9279 (threshold 10) lid 9 port 1<br>
Error check on lid 9 (service0 HCA-1) port 1: FAILED <br>
<br>
## Summary: 88 nodes checked, 0 bad nodes found<br>
## 292 ports checked, 2 ports have errors beyond
threshold<br>
service0:~ # <br>
<br>
<br>
service0:~ # ibchecknet <br>
<br>
# Checking Ca: nodeguid 0x0002c903000abfc2<br>
<br>
# Checking Ca: nodeguid 0x0002c903000ac00e<br>
<br>
# Checking Ca: nodeguid 0x0002c903000a69dc<br>
<br>
# Checking Ca: nodeguid 0x0002c9030009cd46<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d878<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d880<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d87c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d884<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d888<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d88c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d890<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d894<br>
<br>
# Checking Ca: nodeguid 0x0002c9020029fa50<br>
#warn: counter VL15Dropped = 18617 (threshold 100) lid 1 port
1<br>
Error check on lid 1 (r1lead HCA-1) port 1: FAILED <br>
<br>
# Checking Ca: nodeguid 0x0002c90300054eac<br>
<br>
# Checking Ca: nodeguid 0x0002c9030009cebe<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4c9f8<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db08<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db40<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db44<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db48<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db4c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db0c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dca0<br>
<br>
# Checking Ca: nodeguid 0x0002c903000abfe2<br>
<br>
# Checking Ca: nodeguid 0x0002c903000abfe6<br>
<br>
# Checking Ca: nodeguid 0x0002c9030009dd28<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db54<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db58<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4c9f4<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db50<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db3c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db38<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db14<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db10<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d8a8<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d8ac<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d8b4<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d8b0<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db70<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db68<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db64<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db78<br>
<br>
# Checking Ca: nodeguid 0x0002c903000a69f0<br>
<br>
# Checking Ca: nodeguid 0x0002c9030006004a<br>
<br>
# Checking Ca: nodeguid 0x0002c9030009dd2c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d8b8<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d8bc<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d8a4<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d8a0<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db7c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db80<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db6c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db74<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dcb8<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dcd0<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dc5c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dc60<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dc54<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dc50<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dc4c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dcd4<br>
<br>
# Checking Ca: nodeguid 0x0002c903000a6164<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dcf0<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db5c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dc90<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dc8c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dc58<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dc94<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dc9c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db60<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d89c<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d898<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dad8<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4dadc<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db30<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4db34<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d874<br>
<br>
# Checking Ca: nodeguid 0x003048fffff4d870<br>
<br>
# Checking Ca: nodeguid 0x0002c903000a6028<br>
#warn: counter SymbolErrors = 44150 (threshold 10) lid 9 port
1<br>
#warn: counter RcvErrors = 9283 (threshold 10) lid 9 port 1<br>
Error check on lid 9 (service0 HCA-1) port 1: FAILED <br>
<br>
## Summary: 88 nodes checked, 0 bad nodes found<br>
## 292 ports checked, 0 bad ports found<br>
## 2 ports have errors beyond threshold<br>
<br>
<br>
<br>
service0:~ # ibcheckstate<br>
<br>
## Summary: 88 nodes checked, 0 bad nodes found<br>
## 292 ports checked, 0 ports with bad state found<br>
service0:~ # ibcheckwidth<br>
<br>
## Summary: 88 nodes checked, 0 bad nodes found<br>
## 292 ports checked, 0 ports with 1x width in error
found<br>
service0:~ # <br>
<br>
<br>
Thanks and Regards<br>
Ashok<br>
<br>
<br>
<br>
<div class="gmail_quote">On 30 September 2011 12:39, Brian
O'Connor <span dir="ltr">&lt;<a moz-do-not-send="true"
href="mailto:briano@sgi.com">briano@sgi.com</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div bgcolor="#FFFFFF" text="#000000"> Hello Ashok<br>
<br>
Is the cluster hanging or otherwise behaving badly? The logs
below show that the client<br>
lost its connection to 10.148.0.106 for about 10 seconds. It
should have recovered OK.<br>
<br>
If you want further help from the list, you need to add more
detail about the cluster, i.e.<br>
a general description of the number of OSSs/OSTs and clients, the
version of Lustre, etc., and a description<br>
of what is actually going wrong (hanging, offline, etc.).<br>
<br>
The first thing is to check the infrastructure; in this
case you should check your IB network for errors.
<div>
<div class="h5"><br>
<br>
<br>
<br>
On 30-September-2011 2:39 PM, Ashok nulguda wrote: </div>
</div>
<blockquote type="cite">
<div>
<div class="h5"> Dear All,<br>
<br>
I am getting Lustre errors on my HPC cluster, as shown
below. Could anyone please help me resolve this
problem?<br>
Thanks in advance.<br>
Sep 30 08:40:23 service0 kernel: [343138.837222]
Lustre:
8300:0:(client.c:1476:ptlrpc_expire_one_request())
Skipped 1 previous similar message<br>
Sep 30 08:40:23 service0 kernel: [343138.837233]
Lustre: lustre-OST0008-osc-ffff880b272cf800:
Connection to service lustre-OST0008 via nid
10.148.0.106@o2ib was lost; in progress operations
using this service will wait for recovery to complete.<br>
Sep 30 08:40:24 service0 kernel: [343139.837260]
Lustre:
8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@
Request x1380984193067288 sent from
lustre-OST0006-osc-ffff880b272cf800 to NID
10.148.0.106@o2ib 7s ago has timed out (7s prior to
deadline).<br>
Sep 30 08:40:24 service0 kernel: [343139.837263]
req@ffff880a5f800c00 x1380984193067288/t0 o3-><a
moz-do-not-send="true"
href="mailto:lustre-OST0006_UUID@10.148.0.106@o2ib:6/4"
target="_blank">lustre-OST0006_UUID@10.148.0.106@o2ib:6/4</a>
lens 448/592 e 0 to 1 dl 1317352224 ref 2 fl Rpc:/0/0
rc 0/0<br>
Sep 30 08:40:24 service0 kernel: [343139.837269]
Lustre:
8300:0:(client.c:1476:ptlrpc_expire_one_request())
Skipped 38 previous similar messages<br>
Sep 30 08:40:24 service0 kernel: [343140.129284]
LustreError:
9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got
rc -11 from cancel RPC: canceling anyway<br>
Sep 30 08:40:24 service0 kernel: [343140.129290]
LustreError:
9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req())
Skipped 1 previous similar message<br>
Sep 30 08:40:24 service0 kernel: [343140.129295]
LustreError:
9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list())
ldlm_cli_cancel_list: -11<br>
Sep 30 08:40:24 service0 kernel: [343140.129299]
LustreError:
9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list())
Skipped 1 previous similar message<br>
Sep 30 08:40:25 service0 kernel: [343140.837308]
Lustre:
8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@
Request x1380984193067299 sent from
lustre-OST0010-osc-ffff880b272cf800 to NID
10.148.0.106@o2ib 7s ago has timed out (7s prior to
deadline).<br>
Sep 30 08:40:25 service0 kernel: [343140.837311]
req@ffff880a557c4400 x1380984193067299/t0 o3-><a
moz-do-not-send="true"
href="mailto:lustre-OST0010_UUID@10.148.0.106@o2ib:6/4"
target="_blank">lustre-OST0010_UUID@10.148.0.106@o2ib:6/4</a>
lens 448/592 e 0 to 1 dl 1317352225 ref 2 fl Rpc:/0/0
rc 0/0<br>
Sep 30 08:40:25 service0 kernel: [343140.837316]
Lustre:
8300:0:(client.c:1476:ptlrpc_expire_one_request())
Skipped 4 previous similar messages<br>
Sep 30 08:40:26 service0 kernel: [343141.245365]
LustreError:
30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req())
Got rc -11 from cancel RPC: canceling anyway<br>
Sep 30 08:40:26 service0 kernel: [343141.245371]
LustreError:
22729:0:(ldlm_request.c:1587:ldlm_cli_cancel_list())
ldlm_cli_cancel_list: -11<br>
Sep 30 08:40:26 service0 kernel: [343141.245378]
LustreError:
30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req())
Skipped 1 previous similar message<br>
Sep 30 08:40:33 service0 kernel: [343148.245683]
Lustre:
22725:0:(client.c:1476:ptlrpc_expire_one_request())
@@@ Request x1380984193067302 sent from
lustre-OST0004-osc-ffff880b272cf800 to NID
10.148.0.106@o2ib 14s ago has timed out (14s prior to
deadline).<br>
Sep 30 08:40:33 service0 kernel: [343148.245686]
req@ffff8805c879e800 x1380984193067302/t0 o103-><a
moz-do-not-send="true"
href="mailto:lustre-OST0004_UUID@10.148.0.106@o2ib:17/18"
target="_blank">lustre-OST0004_UUID@10.148.0.106@o2ib:17/18</a>
lens 296/384 e 0 to 1 dl 1317352233 ref 1 fl Rpc:N/0/0
rc 0/0<br>
Sep 30 08:40:33 service0 kernel: [343148.245692]
Lustre:
22725:0:(client.c:1476:ptlrpc_expire_one_request())
Skipped 2 previous similar messages<br>
Sep 30 08:40:33 service0 kernel: [343148.245708]
LustreError:
22725:0:(ldlm_request.c:1025:ldlm_cli_cancel_req())
Got rc -11 from cancel RPC: canceling anyway<br>
Sep 30 08:40:33 service0 kernel: [343148.245714]
LustreError:
22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list())
ldlm_cli_cancel_list: -11<br>
Sep 30 08:40:33 service0 kernel: [343148.245717]
LustreError:
22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list())
Skipped 1 previous similar message<br>
Sep 30 08:40:36 service0 kernel: [343151.548005]
LustreError: 11-0: an error occurred while
communicating with 10.148.0.106@o2ib. The ost_connect
operation failed with -16<br>
Sep 30 08:40:36 service0 kernel: [343151.548008]
LustreError: Skipped 1 previous similar message<br>
Sep 30 08:40:36 service0 kernel: [343151.548024]
LustreError: 167-0: This client was evicted by
lustre-OST000b; in progress operations using this
service will fail.<br>
Sep 30 08:40:36 service0 kernel: [343151.548250]
LustreError:
30452:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't
unlock -5<br>
Sep 30 08:40:36 service0 kernel: [343151.550210]
LustreError:
8300:0:(client.c:858:ptlrpc_import_delay_req()) @@@
IMP_INVALID req@ffff88049528c400 x1380984193067406/t0
o3-><a moz-do-not-send="true"
href="mailto:lustre-OST000b_UUID@10.148.0.106@o2ib:6/4"
target="_blank">lustre-OST000b_UUID@10.148.0.106@o2ib:6/4</a>
lens 448/592 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0<br>
Sep 30 08:40:36 service0 kernel: [343151.594742]
Lustre: lustre-OST0000-osc-ffff880b272cf800:
Connection restored to service lustre-OST0000 using
nid 10.148.0.106@o2ib.<br>
Sep 30 08:40:36 service0 kernel: [343151.837203]
Lustre: lustre-OST0006-osc-ffff880b272cf800:
Connection restored to service lustre-OST0006 using
nid 10.148.0.106@o2ib.<br>
Sep 30 08:40:37 service0 kernel: [343152.842631]
Lustre: lustre-OST0003-osc-ffff880b272cf800:
Connection restored to service lustre-OST0003 using
nid 10.148.0.106@o2ib.<br>
Sep 30 08:40:37 service0 kernel: [343152.842636]
Lustre: Skipped 3 previous similar messages<br>
<br>
<br>
Thanks and Regards<br>
Ashok<br clear="all">
<br>
-- <br>
<div style="margin:0in 0in 0pt"><b><font
face="Cambria">Ashok Nulguda<br>
</font></b></div>
<div style="margin:0in 0in 0pt"><b><font
face="Cambria">TATA ELXSI LTD</font></b></div>
<div style="margin:0in 0in 0pt"><span
style="font-family:'Cambria','serif'"></span></div>
<div style="margin:0in 0in 0pt"><span
style="font-family:'Cambria','serif'"></span></div>
<div style="margin:0in 0in 0pt"><span
style="font-family:'Cambria','serif'"><b>Mb : +91
9689945767<br>
</b></span></div>
<div style="margin:0in 0in 0pt"><span
style="font-family:'Cambria','serif'"></span><span
style="font-family:'Cambria','serif'"><font
color="#0000ff"><b>Email :<a
moz-do-not-send="true"
href="mailto:tshrikant@tataelxsi.co.in"
target="_blank">ashokn@tataelxsi.co.in</a></b></font></span></div>
<br>
<br>
<br>
</div>
</div>
<pre>_______________________________________________
Lustre-discuss mailing list
<a moz-do-not-send="true" href="mailto:Lustre-discuss@lists.lustre.org" target="_blank">Lustre-discuss@lists.lustre.org</a>
<a moz-do-not-send="true" href="http://lists.lustre.org/mailman/listinfo/lustre-discuss" target="_blank">http://lists.lustre.org/mailman/listinfo/lustre-discuss</a>
</pre>
</blockquote>
<font color="#888888"> <br>
<br>
<pre cols="72">--
Brian O'Connor
-------------------------------------------------
SGI Consulting
Email: <a moz-do-not-send="true" href="mailto:briano@sgi.com" target="_blank">briano@sgi.com</a>, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA <a moz-do-not-send="true" href="http://www.sgi.com/support/services" target="_blank">http://www.sgi.com/support/services</a>
-------------------------------------------------
</pre>
</font></div>
</blockquote>
</div>
<br>
<br clear="all">
<br>
<br>
</blockquote>
<br>
<br>
<pre class="moz-signature" cols="72">--
Brian O'Connor
-------------------------------------------------
SGI Consulting
Email: <a class="moz-txt-link-abbreviated" href="mailto:briano@sgi.com">briano@sgi.com</a>, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA <a class="moz-txt-link-freetext" href="http://www.sgi.com/support/services">http://www.sgi.com/support/services</a>
-------------------------------------------------
</pre>
</body>
</html>