[Lustre-discuss] Contents of Lustre-discuss digest...
Cliff White
Cliff.White at Sun.COM
Tue Feb 12 14:01:47 PST 2008
ashok bharat bayana wrote:
> Hi,
> I just want to know whether there are any alternative file systems to HP SFS.
> I heard that there is a Cluster Gateway from Polyserve. Can anybody please help me find out more about this Cluster Gateway?
Polyserve is now owned by HP, so I would ask there.
cliffw
>
> Thanks and Regards,
> Ashok Bharat
>
> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org on behalf of lustre-discuss-request at lists.lustre.org
> Sent: Tue 2/12/2008 11:05 AM
> To: lustre-discuss at lists.lustre.org
> Subject: Lustre-discuss Digest, Vol 25, Issue 19
>
> Send Lustre-discuss mailing list submissions to
> lustre-discuss at lists.lustre.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> or, via email, send a message with subject or body 'help' to
> lustre-discuss-request at lists.lustre.org
>
> You can reach the person managing the list at
> lustre-discuss-owner at lists.lustre.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Lustre-discuss digest..."
>
>
> Today's Topics:
>
> 1. Re: multihomed clients ignoring lnet options (Cliff White)
> 2. Re: multihomed clients ignoring lnet options (Joe Little)
> 3. Re: multihomed clients ignoring lnet options (Steden Klaus)
> 4. Re: Lustre-discuss Digest, Vol 25, Issue 17 (ashok bharat bayana)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 11 Feb 2008 20:00:10 -0800
> From: Cliff White <Cliff.White at Sun.COM>
> Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options
> To: Aaron Knister <aaron at iges.org>
> Cc: lustre-discuss at lists.lustre.org
> Message-ID: <47B119CA.4050105 at sun.com>
> Content-Type: text/plain; format=flowed; charset=ISO-8859-1
>
> Aaron Knister wrote:
>> I believe that's correct. The nids of the various server components
>> are stored on the filesystem itself.
>
> Yes, and you can always see them with
> tunefs.lustre --print <device>
>
> cliffw
>
>> On Feb 10, 2008, at 12:58 AM, Joe Little wrote:
>>
>>> Never mind... the problem was resolved by recreating the MGS and
>>> the OSTs using the same parameters on the server. I was able to
>>> change the parameters and still have the servers working, but my guess
>>> is that those options are permanently etched into the filesystem.
>>>
>>>
>>> On Feb 9, 2008 8:16 PM, Joe Little <jmlittle at gmail.com> wrote:
>>>> I have all of my servers and clients using eth1 for the TCP Lustre
>>>> LNET.
>>>>
>>>> All have modprobe.conf entries of:
>>>>
>>>> options lnet networks="tcp0(eth1)"
>>>>
>>>> and all report with "lctl list_nids" that they are using the IP
>>>> address associated with that interface (a net 192.168.200.x address)
>>>>
>>>> However, when my client connects, it ignores the above and goes with
>>>> eth0 for routing, even though the MDS/MGS is on that network range:
>>>>
>>>> client dmesg:
>>>>
>>>> Lustre: 4756:0:(module.c:382:init_libcfs_module()) maximum lustre
>>>> stack 8192
>>>> Lustre: Added LNI 192.168.200.100 at tcp [8/256]
>>>> Lustre: Accept secure, port 988
>>>> Lustre: OBD class driver, info at clusterfs.com
>>>> Lustre Version: 1.6.4.2
>>>> Build Version:
>>>> 1.6.4.2-19691231190000-PRISTINE-.cache.build.BUILD.lustre-
>>>> kernel-2.6.9.lustre.linux-2.6.9-55.0.9.EL_lustre.1.6.4.2smp
>>>> Lustre: Lustre Client File System; info at clusterfs.com
>>>> LustreError: 4799:0:(socklnd_cb.c:2167:ksocknal_recv_hello()) Error
>>>> -104 reading HELLO from 192.168.2.201
>>>> LustreError: 11b-b: Connection to 192.168.2.201 at tcp at host
>>>> 192.168.2.201 on port 988 was reset: is it running a compatible
>>>> version of Lustre and is 192.168.2.201 at tcp one of its NIDs?
>>>>
>>>> server dmesg:
>>>> LustreError: 120-3: Refusing connection from 192.168.2.192 for
>>>> 192.168.2.201 at tcp: No matching NI
>>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> Aaron Knister
>> Associate Systems Analyst
>> Center for Ocean-Land-Atmosphere Studies
>>
>> (301) 595-7000
>> aaron at iges.org
>>
>>
>>
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
>
> ------------------------------
>
> Message: 2
> Date: Mon, 11 Feb 2008 20:51:20 -0800
> From: "Joe Little" <jmlittle at gmail.com>
> Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options
> To: "Cliff White" <Cliff.White at sun.com>
> Cc: lustre-discuss at lists.lustre.org
> Message-ID:
> <e3849caa0802112051q7e24e6acv5af03a16f2bca2c3 at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Feb 11, 2008 8:00 PM, Cliff White <Cliff.White at sun.com> wrote:
>> Aaron Knister wrote:
>>> I believe that's correct. The nids of the various server components
>>> are stored on the filesystem itself.
>> Yes, and you can always see them with
>> tunefs.lustre --print <device>
>>
>> cliffw
>
> Is there any way to change them after the fact?
>> [...]
>
>
> ------------------------------
>
> Message: 3
> Date: Mon, 11 Feb 2008 20:53:41 -0800
> From: "Steden Klaus" <Klaus.Steden at thomson.net>
> Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options
> To: <jmlittle at gmail.com>, <Cliff.White at sun.com>
> Cc: lustre-discuss at lists.lustre.org
> Message-ID:
> <23480D326186CF49819F5EF363276C9003AB2AB3 at BRBKSMAIL04.am.thmulti.com>
> Content-Type: text/plain; charset="utf-8"
>
>
> If you have root, you can change them using tunefs.lustre after the file system has been shut down.
>
> I've done this a number of times to test various lnet configs.
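>
> For example, something like this (just a sketch; the device path and the
> new NID are made up, so substitute your own):
>
>     umount /mnt/ost0                      # targets must be stopped first
>     tunefs.lustre --print /dev/sdb1       # inspect the stored parameters
>     tunefs.lustre --erase-params --mgsnode=192.168.200.1@tcp \
>         --writeconf /dev/sdb1             # store the new NID, regenerate config logs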
>
> Klaus
>
>
> ----- Original Message -----
> From: lustre-discuss-bounces at lists.lustre.org <lustre-discuss-bounces at lists.lustre.org>
> To: Cliff White <Cliff.White at sun.com>
> Cc: lustre-discuss at lists.lustre.org <lustre-discuss at lists.lustre.org>
> Sent: Mon Feb 11 20:51:20 2008
> Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options
>
> [...]
>
> ------------------------------
>
> Message: 4
> Date: Tue, 12 Feb 2008 11:15:18 +0530
> From: "ashok bharat bayana" <ashok.bharat.bayana at iiitb.ac.in>
> Subject: Re: [Lustre-discuss] Lustre-discuss Digest, Vol 25, Issue 17
> To: <lustre-discuss at lists.lustre.org>
> Message-ID: <8626C1B7EB748940BCDD7596134632BE850213 at jal.iiitb.ac.in>
> Content-Type: text/plain; charset="iso-8859-1"
>
>
> Hi,
> I just want to know whether there are any alternative file systems to HP SFS.
> I heard that there is a Cluster Gateway from Polyserve. Can anybody please help me find out more about this Cluster Gateway?
>
> Thanks and Regards,
> Ashok Bharat
>
> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org on behalf of lustre-discuss-request at lists.lustre.org
> Sent: Tue 2/12/2008 3:18 AM
> To: lustre-discuss at lists.lustre.org
> Subject: Lustre-discuss Digest, Vol 25, Issue 17
>
> [...]
>
>
> Today's Topics:
>
> 1. Re: Benchmarking Lustre (Marty Barnaby)
> 2. Re: Luster clients getting evicted (Aaron Knister)
> 3. Re: Luster clients getting evicted (Tom.Wang)
> 4. Re: Luster clients getting evicted (Craig Prescott)
> 5. Re: rc -43: Identifier removed (Andreas Dilger)
> 6. Re: Luster clients getting evicted (Brock Palen)
> 7. Re: Luster clients getting evicted (Aaron Knister)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 11 Feb 2008 11:25:48 -0700
> From: "Marty Barnaby" <mlbarna at sandia.gov>
> Subject: Re: [Lustre-discuss] Benchmarking Lustre
> To: "lustre-discuss at lists.lustre.org"
> <lustre-discuss at lists.lustre.org>
> Message-ID: <47B0932C.2090200 at sandia.gov>
> Content-Type: text/plain; charset=iso-8859-1; format=flowed
>
> Do you have any special interests, like: writing from a true MPI job;
> collective vs. independent; one file per processor vs. a single, shared
> file; or writing via MPI-IO vs. POSIX?
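>
> If you want one tool that covers most of those axes, IOR is the usual
> choice; a sketch (flags from memory, so double-check them against your
> build's ior -h):
>
>     mpirun -np 16 ./IOR -a MPIIO -c -b 1g -t 4m   # collective MPI-IO, single shared file
>     mpirun -np 16 ./IOR -a POSIX -F -b 1g -t 4m   # POSIX, one file per process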
>
>
> Marty Barnaby
>
>
> mayur bhosle wrote:
>> Hi everyone,
>>
>> I am a student at Georgia Tech, and as part of a project I need to
>> benchmark the Lustre file system. I did a lot of searching for possible
>> benchmarks, but I need some advice on which would be most suitable. If
>> anyone can post a suggestion, that would be really helpful.
>>
>> Thanks in advance,
>>
>> Mayur
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Mon, 11 Feb 2008 14:16:20 -0500
> From: Aaron Knister <aaron at iges.org>
> Subject: Re: [Lustre-discuss] Luster clients getting evicted
> To: Tom.Wang <Tom.Wang at Sun.COM>
> Cc: lustre-discuss at lists.lustre.org
> Message-ID: <79343CD8-77EA-4686-A2AE-BEE6FAC59914 at iges.org>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> I'm having a similar issue with Lustre 1.6.4.2 and InfiniBand. Under
> load, the clients hang about every 10 minutes, which is really bad for
> a production machine. The only way to fix the hang is to reboot the
> server. My users are getting extremely impatient :-/
>
> I see this on the clients-
>
> LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@
> timeout (sent at 1202756629, 301s ago) req at ffff8100af233600 x1796079/
> t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336 ref 1 fl
> Rpc:/0/0 rc 0/-22
> Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service data-
> OST0000 via nid 192.168.64.71 at o2ib was lost; in progress operations
> using this service will wait for recovery to complete.
> LustreError: 11-0: an error occurred while communicating with
> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
> LustreError: 11-0: an error occurred while communicating with
> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
>
> I've increased the timeout to 300 seconds and it has helped marginally.
>
> -Aaron
>
> On Feb 9, 2008, at 12:06 AM, Tom.Wang wrote:
>
>> Hi,
>> Aha, this bug has been fixed in bug 14360.
>>
>> https://bugzilla.lustre.org/show_bug.cgi?id=14360
>>
>> The patch there should fix your problem; it should be released in
>> 1.6.5.
>>
>> Thanks
>>
>> Brock Palen wrote:
>>> Sure, attached. Note, though, that we rebuilt our Lustre source for
>>> another box that uses the largesmp kernel, but it used the same
>>> options and compiler.
>>>
>>>
>>> Brock Palen
>>> Center for Advanced Computing
>>> brockp at umich.edu
>>> (734)936-1985
>>>
>>>
>>> On Feb 8, 2008, at 2:47 PM, Tom.Wang wrote:
>>>
>>>> Hello,
>>>>
>>>> m45_amp214_om D 0000000000000000 0 2587 1 31389
>>>> 2586 (NOTLB)
>>>> 00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001
>>>> 00000100080f1a40 0000000000000246 00000101f6b435a8
>>>> 0000000380136025
>>>> 00000102270a1030 00000000000000d0
>>>> Call Trace:<ffffffffa0216e79>{:lnet:LNetPut+1689}
>>>> <ffffffff8030e45f>{__down+147}
>>>> <ffffffff80134659>{default_wake_function+0}
>>>> <ffffffff8030ff7d>{__down_failed+53}
>>>> <ffffffffa04292e1>{:lustre:.text.lock.file+5}
>>>> <ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798}
>>>> <ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456}
>>>> <ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107}
>>>> <ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213}
>>>> <ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56}
>>>> <ffffffffa02c3dbc>{:ptlrpc:search_queue+284}
>>>> <ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99}
>>>> <ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915}
>>>> <ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435}
>>>> <ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313}
>>>> <ffffffffa0394e6f>{:mdc:mdc_enqueue+1023}
>>>> <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}
>>>> <ffffffffa0268730>{:obdclass:class_handle2object+224}
>>>> <ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794}
>>>> <ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31}
>>>> <ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595}
>>>> <ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140}
>>>> <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}
>>>> <ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154}
>>>> <ffffffffa039617d>{:mdc:mdc_intent_lock+685}
>>>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>>> <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}
>>>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>>> <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}
>>>> <ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139}
>>>> <ffffffffa0418a32>{:lustre:ll_intent_file_open+450}
>>>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>>> <ffffffff80192006>{__d_lookup+287}
>>>> <ffffffffa0419724>{:lustre:ll_file_open+2100}
>>>> <ffffffffa0428a18>{:lustre:ll_inode_permission+184}
>>>> <ffffffff80179bdb>{sys_access+349}
>>>> <ffffffff8017a1ee>{__dentry_open+201}
>>>> <ffffffff8017a3a9>{filp_open+95}
>>>> <ffffffff80179bdb>{sys_access+349}
>>>> <ffffffff801f00b5>{strncpy_from_user+74}
>>>> <ffffffff8017a598>{sys_open+57}
>>>> <ffffffff8011026a>{system_call+126}
>>>>
>>>> It seems the blocking_ast process was blocked here. Could you dump
>>>> lustre/llite/namei.o with objdump -S lustre/llite/namei.o and send
>>>> the output to me?
>>>>
>>>> Thanks
>>>> WangDi
>>>>
>>>> Brock Palen wrote:
>>>>>>> On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote:
>>>>>>>>> MDT dmesg:
>>>>>>>>>
>>>>>>>>> LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg())
>>>>>>>>> @@@ processing error (-107) req at 000001002b
>>>>>>>>> 52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl
>>>>>>>>> Interpret:/0/0 rc -107/0
>>>>>>>>> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback())
>>>>>>>>> ###
>>>>>>>>> lock callback timer expired: evicting cl
>>>>>>>>> ient
>>>>>>>>> 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID
>>>>>>>>> nid 10.164.0.141 at tcp ns: mds-nobackup
>>>>>>>>> -MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc:
>>>>>>>>> 1/0,0 mode: CR/CR res: 11240142/324715850 bi
>>>>>>>>> ts 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08
>>>>>>>>> expref: 372 pid 26925
>>>>>>>>>
>>>>>>>> The client was evicted because this lock could not be released
>>>>>>>> on the client in time. Could you provide the stack trace of the
>>>>>>>> client at that time?
>>>>>>>>
>>>>>>>> I assume increasing obd_timeout could fix your problem. Or maybe
>>>>>>>> you should wait for the 1.6.5 release, which includes a new
>>>>>>>> feature, adaptive timeouts, that will adjust the timeout value
>>>>>>>> according to network congestion and server load. It should help
>>>>>>>> with your problem.
>>>>>>> Waiting for the next version of Lustre might be the best thing.
>>>>>>> I had upped the timeout a few days back, but the next day I had
>>>>>>> errors on the MDS box. I have switched it back:
>>>>>>>
>>>>>>> lctl conf_param nobackup-MDT0000.sys.timeout=300
>>>>>>>
>>>>>>> I would love to give you that trace but I don't know how to get
>>>>>>> it. Is there a debug option to turn on in the clients?
>>>>>> You can get that by echo t > /proc/sysrq-trigger on client.
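>>>>>> (the trace lands in the kernel ring buffer, so you can pull it
>>>>>> with dmesg or from /var/log/messages)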
>>>>>>
>>>>> Cool command; the output from the client is attached. The four
>>>>> m45_amp214_om processes are the application that hung when working
>>>>> off of Lustre. You can see it's stuck in an I/O state.
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Lustre-discuss mailing list
>>>>> Lustre-discuss at lists.lustre.org
>>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>>
>>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
>
> (301) 595-7000
> aaron at iges.org
>
>
>
>
>
>
> ------------------------------
>
> Message: 3
> Date: Mon, 11 Feb 2008 15:04:05 -0500
> From: "Tom.Wang" <Tom.Wang at Sun.COM>
> Subject: Re: [Lustre-discuss] Luster clients getting evicted
> To: Aaron Knister <aaron at iges.org>
> Cc: lustre-discuss at lists.lustre.org
> Message-ID: <47B0AA35.7070303 at sun.com>
> Content-Type: text/plain; format=flowed; charset=ISO-8859-1
>
> Aaron Knister wrote:
>> I'm having a similar issue with Lustre 1.6.4.2 and InfiniBand. Under
>> load, the clients hang about every 10 minutes, which is really bad for
>> a production machine. The only way to fix the hang is to reboot the
>> server. My users are getting extremely impatient :-/
>>
>> I see this on the clients-
>>
>> LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@
>> timeout (sent at 1202756629, 301s ago) req at ffff8100af233600
>> x1796079/t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336
>> ref 1 fl Rpc:/0/0 rc 0/-22
> It means the OST could not respond to the request (unlink, o6) within
> 300 seconds, so the client disconnected the import to the OST and tried
> to reconnect. Does this disconnection always happen on unlink? Could you
> please post the process trace and console messages of the OST at that time?
>
> Thanks
> WangDi
>> [...]
>
>
>
> ------------------------------
>
> Message: 4
> Date: Mon, 11 Feb 2008 15:19:21 -0500
> From: Craig Prescott <prescott at hpc.ufl.edu>
> Subject: Re: [Lustre-discuss] Luster clients getting evicted
> To: Aaron Knister <aaron at iges.org>
> Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org
> Message-ID: <47B0ADC9.8020501 at hpc.ufl.edu>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Aaron Knister wrote:
>> I'm having a similar issue with Lustre 1.6.4.2 and InfiniBand. Under
>> load, the clients hang about every 10 minutes, which is really bad for
>> a production machine. The only way to fix the hang is to reboot the
>> server. My users are getting extremely impatient :-/
>>
>> [...]
>> I've increased the timeout to 300 seconds and it has helped marginally.
>
> Hi Aaron;
>
> We set the timeout to a big number (1000 seconds) on our 400-node
> cluster (mostly o2ib, some TCP clients). Until we did this, we had
> loads of evictions. In our case, it solved the problem.
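>
> For what it's worth, the setting can be changed the same way Brock
> showed earlier in the digest, e.g. (fsname here is just a placeholder):
>
>     lctl conf_param <fsname>-MDT0000.sys.timeout=1000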
>
> Cheers,
> Craig
>
>
> ------------------------------
>
> Message: 5
> Date: Mon, 11 Feb 2008 14:11:45 -0700
> From: Andreas Dilger <adilger at sun.com>
> Subject: Re: [Lustre-discuss] rc -43: Identifier removed
> To: Per Lundqvist <perl at nsc.liu.se>
> Cc: Lustre Discuss <lustre-discuss at lists.lustre.org>
> Message-ID: <20080211211145.GJ3029 at webber.adilger.int>
> Content-Type: text/plain; charset=us-ascii
>
> On Feb 11, 2008 17:04 +0100, Per Lundqvist wrote:
>> I got this error today when testing a newly set up 1.6 filesystem:
>>
>> n50 1% cd /mnt/test
>> n50 2% ls
>> ls: reading directory .: Identifier removed
>>
>> n50 3% ls -alrt
>> total 8
>> ?--------- ? ? ? ? ? dir1
>> ?--------- ? ? ? ? ? dir2
>> drwxr-xr-x 4 root root 4096 Feb 8 15:46 ../
>> drwxr-xr-x 4 root root 4096 Feb 11 15:11 ./
>>
>> n50 4% stat .
>> File: `.'
>> Size: 4096 Blocks: 8 IO Block: 4096 directory
>> Device: b438c888h/-1271347064d Inode: 27616681 Links: 2
>> Access: (0755/drwxr-xr-x) Uid: ( 1120/ faxen) Gid: ( 500/ nsc)
>> Access: 2008-02-11 16:11:48.336621154 +0100
>> Modify: 2008-02-11 15:11:27.000000000 +0100
>> Change: 2008-02-11 15:11:31.352841294 +0100
>>
>> This seems to happen almost all the time when I am running as a
>> specific user on this system. Note that the stat call always works... I
>> haven't yet been able to reproduce this problem when running as my own
>> user.
>
> EIDRM (Identifier removed) means that your MDS has a user database
> (/etc/passwd and /etc/group) that is missing the particular user ID.
>
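> A quick check (using the UID from your stat output above) is to run
> this on the MDS:
>
>     getent passwd 1120
>
> and confirm the user resolves; if it prints nothing, add that user to
> the MDS's passwd/group files.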
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
>
> ------------------------------
>
> Message: 6
> Date: Mon, 11 Feb 2008 16:17:37 -0500
> From: Brock Palen <brockp at umich.edu>
> Subject: Re: [Lustre-discuss] Luster clients getting evicted
> To: Craig Prescott <prescott at hpc.ufl.edu>
> Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org
> Message-ID: <38A6B1A2-E20A-40BC-80C2-CEBB971BDC09 at umich.edu>
> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>
>>> I've increased the timeout to 300 seconds and it has helped
>>> marginally.
>> Hi Aaron;
>>
>> We set the timeout to a big number (1000 seconds) on our 400-node
>> cluster (mostly o2ib, some TCP clients). Until we did this, we had
>> loads of evictions. In our case, it solved the problem.
>
> This feels excessive, but at this point I guess I'll try it.
>
>> Cheers,
>> Craig
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>>
>
>
>
> ------------------------------
>
> Message: 7
> Date: Mon, 11 Feb 2008 16:48:05 -0500
> From: Aaron Knister <aaron at iges.org>
> Subject: Re: [Lustre-discuss] Luster clients getting evicted
> To: Brock Palen <brockp at umich.edu>
> Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org
> Message-ID: <7A1D46E5-CC69-4C37-9CC7-B229FCA43BA1 at iges.org>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> So far it's helped. If this doesn't fix it, I'm going to apply the
> patch mentioned here - https://bugzilla.lustre.org/attachment.cgi?id=14006&action=edit
> I'll let you know how it goes. If you'd like a copy of the patched
> version, let me know. Are you running RHEL or SLES? What version of the
> OS and Lustre?
>
> -Aaron
>
> On Feb 11, 2008, at 4:17 PM, Brock Palen wrote:
>
>> [...]
>
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
>
> (301) 595-7000
> aaron at iges.org
>
>
>
>
>
>
> ------------------------------
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
> End of Lustre-discuss Digest, Vol 25, Issue 17
> **********************************************
>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: not available
> Type: application/ms-tnef
> Size: 11404 bytes
> Desc: not available
> Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20080212/610bb025/attachment.bin
>
> ------------------------------
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
> End of Lustre-discuss Digest, Vol 25, Issue 19
> **********************************************
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss