[Lustre-discuss] Kernel panics while mounting OSTs

Adam Gandelman gandelman.a at gmail.com
Wed Mar 25 16:47:21 PDT 2009


Hi list-

I'm in the process of configuring my first Lustre cluster for testing.  
I have access to 10 nodes right now, all built from identical hardware.  
I decided to start small: configure 3 nodes, and once those are up and 
working, set the remaining OSS's up the same way.  I've successfully 
gotten 3 nodes running in a very simple Lustre setup, with the MDT/MGS 
on one node and the other two set up as OSS's.  Each OSS's OST is a 
local 40 GB drive (hdb: Maxtor 6E040L0, ATA DISK drive).  Everything 
formats fine, connects to the MGS, and the filesystem mounts on client 
nodes without issue.
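
For reference, the working nodes were set up with roughly the following 
commands (the filesystem name "testfs", the hostname "node1", and the 
device used for the MDT are placeholders here, not necessarily my exact 
values):

  # Node 1: combined MGS/MDT (device shown is illustrative)
  mkfs.lustre --fsname=testfs --mgs --mdt /dev/hdb1
  mount -t lustre /dev/hdb1 /mnt/mdt

  # Nodes 2 and 3: one OST each on the local Maxtor drive
  mkfs.lustre --fsname=testfs --ost --mgsnode=node1@tcp0 /dev/hdb1
  mount -t lustre /dev/hdb1 /mnt/ost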

I then decided to move on to nodes 4 and 5, and that's where the kernel 
panics began.  I set these up as OSS's following the same steps as on 
nodes 2 and 3, using the same stock Lustre kernel and RPMs.  I had no 
trouble creating the Lustre filesystem on /dev/hdb1, but as soon as I 
issued mount /dev/hdb1 /mnt/ost, the kernel panicked.  The same happened 
on both nodes.  Wondering if it might be a networking issue, I isolated 
one of the faulty nodes (node 4) together with a known-working OSS node 
(node 2).  I set up a new Lustre configuration with node 4 acting as a 
combined MGS/MDT server and node 2 as the OSS.  I was able to create the 
filesystem on node 4 and even mount the device at /mnt/mdt; however, as 
soon as I tried mounting the OST on node 2, node 4 went into a kernel 
panic again.
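
The isolation test looked roughly like this (again, the fsname and NIDs 
are illustrative, not my exact values):

  # Node 4: reformatted as a combined MGS/MDT
  mkfs.lustre --reformat --fsname=testfs --mgs --mdt /dev/hdb1
  mount -t lustre /dev/hdb1 /mnt/mdt      # this mount worked

  # Node 2: reformatted as an OST pointing at node 4's MGS
  mkfs.lustre --reformat --fsname=testfs --ost --mgsnode=node4@tcp0 /dev/hdb1
  mount -t lustre /dev/hdb1 /mnt/ost      # node 4 panicked at this point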

Frustrated, I powered off nodes 4 and 5 and moved on to nodes 6 and 7 to 
see how they fare.  Node 6 works great; node 7 runs into the same 
problem.  Again, all of these boxes contain identical hardware.

I've started from scratch twice, switching from CentOS 5.2 to RHEL 5 on 
all nodes and downgrading from Lustre 1.6.7 to 1.6.6, to see if that 
made a difference.

I'm lost as to where to begin resolving this.  I've attached the 
relevant info; any tips anyone can provide will be greatly appreciated.

Thanks.

Network setup: node 1 hosts the combined MDT/MGS.  All other nodes act 
as OSS's, mounting /dev/hdb1 as their only OST.
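
Clients mount the filesystem against node 1's MGS with something like 
the following (fsname and mount point are illustrative):

  mount -t lustre node1@tcp0:/testfs /mnt/lustre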

Hardware on all nodes:
Intel(R) Xeon(TM) CPU 1.70GHz
hda: WDC WD102BA, ATA DISK drive
hdb: Maxtor 6E040L0, ATA DISK drive

On all nodes: Linux 2.6.18-92.1.10.el5_lustre.1.6.6smp #1 SMP Tue Aug 26 
12:05:09 EDT 2008 i686 i686 i386 GNU/Linux

BUG: soft lockup - CPU#0 stuck for 10s [socknal_cd00:2785]

esi: c048bd14   edi: d303403c   ebp: d3034000   esp: d1279df8
ds: 007b        es: 007b        ss:0068
Process socknal_cd00 (pid:2776, ti=d1278000 task=df195550 task.ti=d1278000)
Stack:  c063858a d5b74d44 83010183 d3034000 d5b74d40 d3947180 e0d8fc20 d1255400
        00000000 00000000 00000000 00000005 00000005 0000000e 00000001 00000000
        d1269240 00000000 d1269240 e0d86854 00000000 d1279f00 d1279f00 00001388
Call Trace:
[<e0d8fc20>] ksocknal_read_callback+0x100/0x270 [ksocklnd]
[<e0d86854>] ksocknal_create_conn+0x19a4/0x1f90 [ksocklnd]
[<e0b66b31>] libcfs_sock_write+0xb1/0x3a0 [libcfs]
[<c0483471>] __posix_lock_file_conf+0x431/0x48e
[<e0e2c641>] lnet_connect+0xb1/0x150 [lnet]
[<e0d89ee4>] ksocknal_connect+0x124/0x540 [ksocklnd]
[<e0d8f792>] ksocknal_connd+0x2a2/0x400 [ksocklnd]
[<c044b863>] audit_syscall_exit+0x2cc/0x2e2
[<c043631d>] autoremove_wake_function+0x0/0x2d
[<e0d8f4f0>] ksocknal_connd+0x0/0x400 [ksocklnd]
[<c0405c0f>] kernel_thread_helper+0x7/0x10
========================

Code: 74 17 50 52 68 4d 85 63 c0 e8 f6 eb f3 ff 0f 0b 1a 00 ff 84 63 c0 82 c4 0c 8b 06 39 d8 74 17 50 53 68 8a 85 63 c0 e8 d9 eb f3 ff <0f> 0b 1f 00 ff 84 63 c0 83 c4 0c 89 7b 04 89 1f 89 77 04 89 3

EIP: [<c04e7d9d>] __list_add+0x39/0x52 SS:ESP 0068:d1279df8
<0>Kernel panic - not syncing: Fatal exception in interrupt






