[Lustre-discuss] odd kernel crash after a heartbeat failover

Cliff White cliff.white at oracle.com
Thu Apr 15 21:45:03 PDT 2010


John White wrote:
> This is actually happening repeatedly. Any idea if this is a Lustre-side error?
> Apr 15 14:37:57 n0008.lustre kernel: Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:  
> Apr 15 14:37:57 n0008.lustre kernel: <2>LDISKFS-fs error (device dm-7) in ldiskfs_reserve_inode_write: Journal has aborted 
> Apr 15 14:37:57 n0008.lustre kernel: Oops: 0002 [1] SMP  
> Apr 15 14:37:57 n0008.lustre kernel: Oops: 0002 [1] SMP  
> Apr 15 14:37:57 n0008.lustre kernel: last sysfs file: /block/dm-7/dev 
> Apr 15 14:37:57 n0008.lustre kernel: RIP  [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e 
> Apr 15 14:37:57 n0008.lustre kernel: CR2: 0000000000000000 
> Apr 15 14:37:57 n0008.lustre kernel: CR2: 0000000000000000 
> 

Looks more like a messed-up-disk problem. I would suggest checking the 
health of your journal device.
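
To narrow down which devices to look at first, here is a minimal sketch,
assuming the syslog excerpts quoted in this thread have been saved as plain
text lines. The function name suspect_devices and the regular expressions are
illustrative assumptions, not part of Lustre or any standard tool. It lists,
for each host that logged a jbd oops, the block devices named in the "last
sysfs file" and LDISKFS-fs error messages. The "last sysfs file" line is only
a hint, so treat the result as a starting point rather than a diagnosis.

```python
import re
from collections import defaultdict

# Patterns modelled on the kernel messages quoted in this thread (assumed format).
HOST_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+kernel:\s*(.*)$")
SYSFS_RE = re.compile(r"last sysfs file: /block/([^/\s]+)/dev")
LDISKFS_RE = re.compile(r"LDISKFS-fs error \(device ([^)]+)\)")
JBD_RE = re.compile(r":jbd:(\w+)\+0x")


def suspect_devices(lines):
    """Return {host: sorted device names} for hosts that logged a jbd oops."""
    devices = defaultdict(set)
    jbd_hosts = set()
    for line in lines:
        # Drop email quote markers ("> ", ">> ") before parsing.
        match = HOST_RE.match(line.lstrip("> \t"))
        if not match:
            continue
        host, msg = match.groups()
        if JBD_RE.search(msg):
            jbd_hosts.add(host)
        for pattern in (SYSFS_RE, LDISKFS_RE):
            found = pattern.search(msg)
            if found:
                devices[host].add(found.group(1))
    return {host: sorted(devices[host]) for host in jbd_hosts}
```

Fed the lines quoted in this thread, it would report dm-7 for n0008.lustre and
dm-3 for n0007.lustre, and skip n0006.lustre, which logged no jbd oops.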
cliffw

> ----------------
> John White
> High Performance Computing Services (HPCS)
> (510) 486-7307
> One Cyclotron Rd, MS: 50B-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720
> 
> On Apr 15, 2010, at 1:10 PM, John White wrote:
> 
>> Hello Folks,
>> 	We just had a very odd crash after a heartbeat failover; the two may or may not be related.  I'm not sure whether this was an I/O error on the disk (I see no actual EIO, just the journal commit crash).  Any ideas?  The FS went through recovery just fine and doesn't appear to have any corruption:
>> [...]
>> Apr 15 11:49:32 n0007.lustre heartbeat: [10940]: info: mach_down takeover complete. 
>> Apr 15 12:09:55 n0007.lustre kernel: Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:  
>> Apr 15 12:09:55 n0007.lustre kernel:  [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e 
>> Apr 15 12:09:55 n0007.lustre kernel: Oops: 0002 [1] SMP  
>> Apr 15 12:09:55 n0007.lustre kernel: Oops: 0002 [1] SMP  
>> Apr 15 12:09:55 n0007.lustre kernel: last sysfs file: /block/dm-3/dev 
>> Apr 15 12:09:55 n0007.lustre kernel: RIP  [<ffffffff88abc375>] :jbd:journal_commit_transaction+0xc33/0x132e 
>> Apr 15 12:09:55 n0007.lustre kernel: CR2: 0000000000000000 
>> Apr 15 12:09:55 n0007.lustre kernel: CR2: 0000000000000000 
>> Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.4890 
>> Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.3719 
>> Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.3719 
>> Apr 15 12:13:25 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358805.4807 
>> Apr 15 12:13:26 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358806.3725 
>> Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.3714 
>> Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.4796 
>> Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.3740 
>> Apr 15 12:13:30 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358810.4991 
>> Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.3727 
>> Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.5109 
>> Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.5072 
>> Apr 15 12:13:39 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358819.4812 
>> Apr 15 12:13:40 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358820.3720 
>> Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.3732 
>> Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.5047 
>> Apr 15 12:13:41 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271358821.4862 
>> Apr 15 12:24:09 n0006.lustre kernel: LustreError: dumping log to /tmp/lustre-log.1271359449.3725
>>
>>
>> Logs available upon request.
>> ----------------
>> John White
>> High Performance Computing Services (HPCS)
>> (510) 486-7307
>> One Cyclotron Rd, MS: 50B-3209C
>> Lawrence Berkeley National Lab
>> Berkeley, CA 94720
>>
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss



