[Lustre-discuss] lustre client goes wacky?
Eric Mei
Eric.Mei at Sun.COM
Wed Feb 13 10:01:11 PST 2008
Yes, there does seem to be a problem. I filed bug 14881 to track this.
Ron, thanks for reporting it. In the meantime, please don't use the CVS
version with your 1.6.4 server until 14881 gets fixed.
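If you need those clients against the 1.6.4 servers before the fix lands, one
possible workaround is to disable adaptive timeouts on the clients. This is a
sketch only: it assumes the ptlrpc "at_max" tunable documented for later 1.6
releases is present in this CVS build, and that setting it to 0 disables the
feature; verify both on your build before relying on it.

```shell
# HYPOTHETICAL workaround: turn off adaptive timeouts on the CVS clients.
# Assumes the ptlrpc module exposes an "at_max" tunable (as in later 1.6
# releases) and that at_max=0 disables adaptive timeouts entirely.

# At module load time, in /etc/modprobe.conf on each client:
#   options ptlrpc at_max=0

# Or at runtime, if this build's lctl supports set_param:
lctl set_param at_max=0
```

Either way, this only papers over the interoperability problem; the real fix
will come through 14881.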
--
Eric
Nathaniel Rutman wrote:
> The clients you pulled from CVS have a feature called adaptive timeouts,
> which apparently has an interoperability issue with your 1.6.4.1 servers.
> Eric, can you make sure our interoperability is working?
>
> Moving this thread to lustre-discuss; devel is more for
> architecture/coding stuff.
>
> Ron wrote:
>> Hi,
>> I don't know if this is a bug, a misconfiguration, or something
>> else.
>>
>> What I have is:
>> server = 1.6.4.1+vanilla 2.6.18.8 (mgs+2*ost+mdt all on a single
>> server)
>> clients = cvs.20080116+2.6.23.12
>>
>> I mounted the server from several clients and, several hours later,
>> noticed the top display below. dmesg shows some Lustre errors (also
>> below). Can someone comment on what could be going on?
>>
>> Thanks,
>> Ron
>>
>> top - 18:28:09 up 5 days, 3:36, 1 user, load average: 12.00, 12.00, 11.94
>> Tasks: 168 total, 13 running, 136 sleeping, 0 stopped, 19 zombie
>> Cpu(s): 0.0% us, 37.5% sy, 0.0% ni, 62.5% id, 0.0% wa, 0.0% hi, 0.0% si
>> Mem:  16468196k total,  526828k used, 15941368k free,  42996k buffers
>> Swap:  4192924k total,       0k used,  4192924k free, 294916k cached
>>
>>   PID USER  PR NI VIRT RES SHR S %CPU %MEM     TIME+ COMMAND
>>  1533 root  20  0    0   0   0 R  100  0.0 308:54.05 ll_cfg_requeue
>> 32071 root  20  0    0   0   0 R  100  0.0 308:15.95 socknal_reaper
>> 32073 root  20  0    0   0   0 R  100  0.0 308:48.90 ptlrpcd
>>     1 root  20  0 4832 588 492 R    0  0.0   0:02.48 init
>>     2 root  15 -5    0   0   0 S    0  0.0   0:00.00 kthreadd
>>
>>
>> Lustre: OBD class driver, info at clusterfs.com
>> Lustre Version: 1.6.4.50
>> Build Version: b1_6-20080210103536-CHANGED-.usr.src.linux-2.6.23.12-2.6.23.12
>> Lustre: Added LNI 192.168.241.42 at tcp [8/256]
>> Lustre: Accept secure, port 988
>> Lustre: Lustre Client File System; info at clusterfs.com
>> Lustre: Binding irq 17 to CPU 0 with cmd: echo 1 > /proc/irq/17/
>> smp_affinity
>> Lustre: MGC192.168.241.247 at tcp: Reactivating import
>> Lustre: setting import datafs-OST0002_UUID INACTIVE by administrator request
>> Lustre: datafs-OST0002-osc-ffff810241ad7800.osc: set parameter active=0
>> LustreError: 32181:0:(lov_obd.c:230:lov_connect_obd()) not connecting OSC datafs-OST0002_UUID; administratively disabled
>> Lustre: Client datafs-client has started
>> Lustre: Request x7684 sent from MGC192.168.241.247 at tcp to NID 192.168.241.247 at tcp 15s ago has timed out (limit 15s).
>> LustreError: 166-1: MGC192.168.241.247 at tcp: Connection to service MGS via nid 192.168.241.247 at tcp was lost; in progress operations using this service will fail.
>> LustreError: 32073:0:(import.c:212:ptlrpc_invalidate_import()) MGS: rc = -110 waiting for callback (1 != 0)
>> LustreError: 32073:0:(import.c:216:ptlrpc_invalidate_import()) @@@ still on sending list req at ffff81040fa14600 x7684/t0 o400->MGS at 192.168.241.247@tcp:26/25 lens 128/256 e 0 to 11 dl 1202843837 ref 1 fl Rpc:EXN/0/0 rc -4/0
>> Lustre: Request x7685 sent from datafs-MDT0000-mdc-ffff810241ad7800 to NID 192.168.241.247 at tcp 115s ago has timed out (limit 15s).
>> Lustre: datafs-MDT0000-mdc-ffff810241ad7800: Connection to service datafs-MDT0000 via nid 192.168.241.247 at tcp was lost; in progress operations using this service will wait for recovery to complete.
>> Lustre: MGC192.168.241.247 at tcp: Reactivating import
>> Lustre: MGC192.168.241.247 at tcp: Connection restored to service MGS using nid 192.168.241.247 at tcp.
>> LustreError: 32059:0:(events.c:116:reply_in_callback()) ASSERTION(ev->mlength == lustre_msg_early_size()) failed
>> LustreError: 32059:0:(tracefile.c:432:libcfs_assertion_failed()) LBUG
>>
>> Call Trace:
>> [<ffffffff88000b53>] :libcfs:lbug_with_loc+0x73/0xc0
>> [<ffffffff88007bd4>] :libcfs:libcfs_assertion_failed+0x54/0x60
>> [<ffffffff8815c746>] :ptlrpc:reply_in_callback+0x426/0x430
>> [<ffffffff88027f35>] :lnet:lnet_enq_event_locked+0xc5/0xf0
>> [<ffffffff88028475>] :lnet:lnet_finalize+0x1e5/0x270
>> [<ffffffff880625d9>] :ksocklnd:ksocknal_process_receive+0x469/0xab0
>> [<ffffffff88060350>] :ksocklnd:ksocknal_tx_done+0x80/0x1e0
>> [<ffffffff8806301c>] :ksocklnd:ksocknal_scheduler+0x12c/0x7e0
>> [<ffffffff8024e850>] autoremove_wake_function+0x0/0x30
>> [<ffffffff8024e850>] autoremove_wake_function+0x0/0x30
>> [<ffffffff8020c918>] child_rip+0xa/0x12
>> [<ffffffff88062ef0>] :ksocklnd:ksocknal_scheduler+0x0/0x7e0
>> [<ffffffff8020c90e>] child_rip+0x0/0x12
>>
>> LustreError: dumping log to /tmp/lustre-log.1202843942.32059
>> Lustre: Request x7707 sent from MGC192.168.241.247 at tcp to NID 192.168.241.247 at tcp 15s ago has timed out (limit 15s).
>> Lustre: Skipped 2 previous similar messages
>>
>>
>> _______________________________________________
>> Lustre-devel mailing list
>> Lustre-devel at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss