[Lustre-discuss] lustre client goes wacky?

Eric Mei Eric.Mei at Sun.COM
Wed Feb 13 10:01:11 PST 2008


Yes, there seem to be some problems. I filed bug 14881 to track this.

Ron, thanks for reporting this. In the meantime, please don't use the CVS
version with your 1.6.4 server until 14881 gets fixed.
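If you can't drop back to a released client right away, disabling adaptive
timeouts on the clients may work around the incompatibility. This is an
untested sketch: it assumes the CVS tree's adaptive-timeout code is
controlled by the ptlrpc "at_max" module parameter (as in later releases,
where at_max=0 disables the feature), and that /mnt/datafs is your mount
point -- verify both against your build first.

```shell
# Hypothetical workaround, NOT verified against bug 14881:
# add to /etc/modprobe.conf (or modprobe.d) on each CVS client --
# at_max=0 is assumed to disable adaptive timeouts entirely.
options ptlrpc at_max=0

# Then reload the Lustre modules with the filesystem unmounted:
umount /mnt/datafs
lustre_rmmod
modprobe lustre
mount -t lustre 192.168.241.247@tcp:/datafs /mnt/datafs
```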

--
Eric

Nathaniel Rutman wrote:
> The clients you pulled from CVS have a feature called adaptive timeouts,
> which apparently has an interoperability issue with your 1.6.4.1 servers.
> Eric, can you make sure our interoperability is working?
> 
> Moving this thread to lustre-discuss; devel is more for 
> architecture/coding stuff.
> 
> Ron wrote:
>> Hi,
>> I don't know if this is a bug, a misconfig, or something
>> else.
>>
>> What I have is:
>>     server = 1.6.4.1+vanilla 2.6.18.8   (mgs+2*ost+mdt all on a single
>> server)
>>    clients = cvs.20080116+2.6.23.12
>>
>> I mounted the server from several clients and several hours later
>> noticed the top display below. dmesg shows some Lustre errors (also
>> below). Can someone comment on what could be going on?
>>
>> Thanks,
>> Ron
>>
>> top - 18:28:09 up 5 days,  3:36,  1 user,  load average: 12.00, 12.00, 11.94
>> Tasks: 168 total,  13 running, 136 sleeping,   0 stopped,  19 zombie
>> Cpu(s):  0.0% us, 37.5% sy,  0.0% ni, 62.5% id,  0.0% wa,  0.0% hi,  0.0% si
>> Mem:  16468196k total,   526828k used, 15941368k free,    42996k buffers
>> Swap:  4192924k total,        0k used,  4192924k free,   294916k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  1533 root      20   0     0    0    0 R  100  0.0 308:54.05 ll_cfg_requeue
>> 32071 root      20   0     0    0    0 R  100  0.0 308:15.95 socknal_reaper
>> 32073 root      20   0     0    0    0 R  100  0.0 308:48.90 ptlrpcd
>>     1 root      20   0  4832  588  492 R    0  0.0   0:02.48 init
>>     2 root      15  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
>>
>>
>> Lustre: OBD class driver, info@clusterfs.com
>>         Lustre Version: 1.6.4.50
>>         Build Version: b1_6-20080210103536-CHANGED-.usr.src.linux-2.6.23.12-2.6.23.12
>> Lustre: Added LNI 192.168.241.42@tcp [8/256]
>> Lustre: Accept secure, port 988
>> Lustre: Lustre Client File System; info@clusterfs.com
>> Lustre: Binding irq 17 to CPU 0 with cmd: echo 1 > /proc/irq/17/smp_affinity
>> Lustre: MGC192.168.241.247@tcp: Reactivating import
>> Lustre: setting import datafs-OST0002_UUID INACTIVE by administrator request
>> Lustre: datafs-OST0002-osc-ffff810241ad7800.osc: set parameter active=0
>> LustreError: 32181:0:(lov_obd.c:230:lov_connect_obd()) not connecting OSC datafs-OST0002_UUID; administratively disabled
>> Lustre: Client datafs-client has started
>> Lustre: Request x7684 sent from MGC192.168.241.247@tcp to NID 192.168.241.247@tcp 15s ago has timed out (limit 15s).
>> LustreError: 166-1: MGC192.168.241.247@tcp: Connection to service MGS via nid 192.168.241.247@tcp was lost; in progress operations using this service will fail.
>> LustreError: 32073:0:(import.c:212:ptlrpc_invalidate_import()) MGS: rc = -110 waiting for callback (1 != 0)
>> LustreError: 32073:0:(import.c:216:ptlrpc_invalidate_import()) @@@ still on sending list  req@ffff81040fa14600 x7684/t0 o400->MGS@192.168.241.247@tcp:26/25 lens 128/256 e 0 to 11 dl 1202843837 ref 1 fl Rpc:EXN/0/0 rc -4/0
>> Lustre: Request x7685 sent from datafs-MDT0000-mdc-ffff810241ad7800 to NID 192.168.241.247@tcp 115s ago has timed out (limit 15s).
>> Lustre: datafs-MDT0000-mdc-ffff810241ad7800: Connection to service datafs-MDT0000 via nid 192.168.241.247@tcp was lost; in progress operations using this service will wait for recovery to complete.
>> Lustre: MGC192.168.241.247@tcp: Reactivating import
>> Lustre: MGC192.168.241.247@tcp: Connection restored to service MGS using nid 192.168.241.247@tcp.
>> LustreError: 32059:0:(events.c:116:reply_in_callback()) ASSERTION(ev->mlength == lustre_msg_early_size()) failed
>> LustreError: 32059:0:(tracefile.c:432:libcfs_assertion_failed()) LBUG
>>
>> Call Trace:
>>  [<ffffffff88000b53>] :libcfs:lbug_with_loc+0x73/0xc0
>>  [<ffffffff88007bd4>] :libcfs:libcfs_assertion_failed+0x54/0x60
>>  [<ffffffff8815c746>] :ptlrpc:reply_in_callback+0x426/0x430
>>  [<ffffffff88027f35>] :lnet:lnet_enq_event_locked+0xc5/0xf0
>>  [<ffffffff88028475>] :lnet:lnet_finalize+0x1e5/0x270
>>  [<ffffffff880625d9>] :ksocklnd:ksocknal_process_receive+0x469/0xab0
>>  [<ffffffff88060350>] :ksocklnd:ksocknal_tx_done+0x80/0x1e0
>>  [<ffffffff8806301c>] :ksocklnd:ksocknal_scheduler+0x12c/0x7e0
>>  [<ffffffff8024e850>] autoremove_wake_function+0x0/0x30
>>  [<ffffffff8024e850>] autoremove_wake_function+0x0/0x30
>>  [<ffffffff8020c918>] child_rip+0xa/0x12
>>  [<ffffffff88062ef0>] :ksocklnd:ksocknal_scheduler+0x0/0x7e0
>>  [<ffffffff8020c90e>] child_rip+0x0/0x12
>>
>> LustreError: dumping log to /tmp/lustre-log.1202843942.32059
>> Lustre: Request x7707 sent from MGC192.168.241.247@tcp to NID 192.168.241.247@tcp 15s ago has timed out (limit 15s).
>> Lustre: Skipped 2 previous similar messages
>>
>>
>> _______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>   
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
