[Lustre-discuss] Contents of Lustre-discuss digest...

Cliff White Cliff.White at Sun.COM
Tue Feb 12 14:01:47 PST 2008


ashok bharat bayana wrote:
> Hi,
> I just want to know whether there are any alternative file systems to HP SFS.
> I heard that there is a Cluster Gateway from Polyserve. Can anybody please help me find out more about this Cluster Gateway?

Polyserve is now owned by HP, so I would ask there.
cliffw

> 
> Thanks and Regards,
> Ashok Bharat
> 
> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org on behalf of lustre-discuss-request at lists.lustre.org
> Sent: Tue 2/12/2008 11:05 AM
> To: lustre-discuss at lists.lustre.org
> Subject: Lustre-discuss Digest, Vol 25, Issue 19
>  
> Send Lustre-discuss mailing list submissions to
> 	lustre-discuss at lists.lustre.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://lists.lustre.org/mailman/listinfo/lustre-discuss
> or, via email, send a message with subject or body 'help' to
> 	lustre-discuss-request at lists.lustre.org
> 
> You can reach the person managing the list at
> 	lustre-discuss-owner at lists.lustre.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Lustre-discuss digest..."
> 
> 
> Today's Topics:
> 
>    1. Re: multihomed clients ignoring lnet options (Cliff White)
>    2. Re: multihomed clients ignoring lnet options (Joe Little)
>    3. Re: multihomed clients ignoring lnet options (Steden Klaus)
>    4. Re: Lustre-discuss Digest, Vol 25, Issue 17 (ashok bharat bayana)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Mon, 11 Feb 2008 20:00:10 -0800
> From: Cliff White <Cliff.White at Sun.COM>
> Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options
> To: Aaron Knister <aaron at iges.org>
> Cc: lustre-discuss at lists.lustre.org
> Message-ID: <47B119CA.4050105 at sun.com>
> Content-Type: text/plain; format=flowed; charset=ISO-8859-1
> 
> Aaron Knister wrote:
>> I believe that's correct. The NIDs of the various server components
>> are stored on the filesystem itself.
> 
> Yes, and you can always see them with
> tunefs.lustre --print <device>
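> 
> For example (just a sketch -- the device path is a placeholder for one
> of your real MDT/OST devices):
> 
>    tunefs.lustre --print /dev/sdb1
> 
> That should dump the target name and the parameters stored at format
> time, including the mgsnode NID.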
> 
> cliffw
> 
>> On Feb 10, 2008, at 12:58 AM, Joe Little wrote:
>>
>>> Never mind. The problem was resolved by recreating the MGS and
>>> the OSTs using the same parameters on the server. I was able to
>>> change the parameters and still have the servers working, but my guess
>>> is that those options are permanently etched into the filesystem.
>>>
>>>
>>> On Feb 9, 2008 8:16 PM, Joe Little <jmlittle at gmail.com> wrote:
>>>> I have all of my servers and clients using eth1 for the TCP Lustre
>>>> LNET.
>>>>
>>>> All have modprobe.conf entries of:
>>>>
>>>> options lnet networks="tcp0(eth1)"
>>>>
>>>> and all report with "lctl list_nids" that they are using the IP
>>>> address associated with that interface (a net 192.168.200.x address).
>>>>
>>>> However, when my client connects, it ignores the above and goes with
>>>> eth0 for routing, even though the MDS/MGS is on that network range:
>>>>
>>>> client dmesg:
>>>>
>>>> Lustre: 4756:0:(module.c:382:init_libcfs_module()) maximum lustre  
>>>> stack 8192
>>>> Lustre: Added LNI 192.168.200.100 at tcp [8/256]
>>>> Lustre: Accept secure, port 988
>>>> Lustre: OBD class driver, info at clusterfs.com
>>>>        Lustre Version: 1.6.4.2
>>>>        Build Version:
>>>> 1.6.4.2-19691231190000-PRISTINE-.cache.build.BUILD.lustre- 
>>>> kernel-2.6.9.lustre.linux-2.6.9-55.0.9.EL_lustre.1.6.4.2smp
>>>> Lustre: Lustre Client File System; info at clusterfs.com
>>>> LustreError: 4799:0:(socklnd_cb.c:2167:ksocknal_recv_hello()) Error
>>>> -104 reading HELLO from 192.168.2.201
>>>> LustreError: 11b-b: Connection to 192.168.2.201 at tcp at host
>>>> 192.168.2.201 on port 988 was reset: is it running a compatible
>>>> version of Lustre and is 192.168.2.201 at tcp one of its NIDs?
>>>>
>>>> server dmesg:
>>>> LustreError: 120-3: Refusing connection from 192.168.2.192 for
>>>> 192.168.2.201 at tcp: No matching NI
>>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> Aaron Knister
>> Associate Systems Analyst
>> Center for Ocean-Land-Atmosphere Studies
>>
>> (301) 595-7000
>> aaron at iges.org
>>
>>
>>
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Mon, 11 Feb 2008 20:51:20 -0800
> From: "Joe Little" <jmlittle at gmail.com>
> Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options
> To: "Cliff White" <Cliff.White at sun.com>
> Cc: lustre-discuss at lists.lustre.org
> Message-ID:
> 	<e3849caa0802112051q7e24e6acv5af03a16f2bca2c3 at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> On Feb 11, 2008 8:00 PM, Cliff White <Cliff.White at sun.com> wrote:
>> Aaron Knister wrote:
>>> I believe that's correct. The NIDs of the various server components
>>> are stored on the filesystem itself.
>> Yes, and you can always see them with
>> tunefs.lustre --print <device>
>>
>> cliffw
> 
> Any way to change them after the fact?
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Mon, 11 Feb 2008 20:53:41 -0800
> From: "Steden Klaus" <Klaus.Steden at thomson.net>
> Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options
> To: <jmlittle at gmail.com>,	<Cliff.White at sun.com>
> Cc: lustre-discuss at lists.lustre.org
> Message-ID:
> 	<23480D326186CF49819F5EF363276C9003AB2AB3 at BRBKSMAIL04.am.thmulti.com>
> Content-Type: text/plain;	charset="utf-8"
> 
> 
> If you have root, you can change them using tunefs.lustre after the file system has been shut down.
> 
> I've done this a number of times to test various lnet configs.
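> 
> Roughly like this, from memory (a sketch -- the device and NID are
> placeholders, so check the tunefs.lustre man page before running it):
> 
>    umount /mnt/ost0                   # target must be stopped first
>    tunefs.lustre --print /dev/sdb1    # note the current parameters
>    tunefs.lustre --erase-params --mgsnode=192.168.200.1@tcp0 \
>        --writeconf /dev/sdb1          # store the new NID
> 
> With --writeconf the configuration logs get regenerated on the next
> mount, so bring the servers back up before the clients.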
> 
> Klaus
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Tue, 12 Feb 2008 11:15:18 +0530
> From: "ashok bharat bayana" <ashok.bharat.bayana at iiitb.ac.in>
> Subject: Re: [Lustre-discuss] Lustre-discuss Digest, Vol 25, Issue 17
> To: <lustre-discuss at lists.lustre.org>
> Message-ID: <8626C1B7EB748940BCDD7596134632BE850213 at jal.iiitb.ac.in>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> 
> Hi,
> I just want to know whether there are any alternative file systems to HP SFS.
> I heard that there is a Cluster Gateway from Polyserve. Can anybody please help me find out more about this Cluster Gateway?
> 
> Thanks and Regards,
> Ashok Bharat
> 
> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org on behalf of lustre-discuss-request at lists.lustre.org
> Sent: Tue 2/12/2008 3:18 AM
> To: lustre-discuss at lists.lustre.org
> Subject: Lustre-discuss Digest, Vol 25, Issue 17
>  
> Send Lustre-discuss mailing list submissions to
> 	lustre-discuss at lists.lustre.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://lists.lustre.org/mailman/listinfo/lustre-discuss
> or, via email, send a message with subject or body 'help' to
> 	lustre-discuss-request at lists.lustre.org
> 
> You can reach the person managing the list at
> 	lustre-discuss-owner at lists.lustre.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Lustre-discuss digest..."
> 
> 
> Today's Topics:
> 
>    1. Re: Benchmarking Lustre (Marty Barnaby)
>    2. Re: Lustre clients getting evicted (Aaron Knister)
>    3. Re: Lustre clients getting evicted (Tom.Wang)
>    4. Re: Lustre clients getting evicted (Craig Prescott)
>    5. Re: rc -43: Identifier removed (Andreas Dilger)
>    6. Re: Lustre clients getting evicted (Brock Palen)
>    7. Re: Lustre clients getting evicted (Aaron Knister)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Mon, 11 Feb 2008 11:25:48 -0700
> From: "Marty Barnaby" <mlbarna at sandia.gov>
> Subject: Re: [Lustre-discuss] Benchmarking Lustre
> To: "lustre-discuss at lists.lustre.org"
> 	<lustre-discuss at lists.lustre.org>
> Message-ID: <47B0932C.2090200 at sandia.gov>
> Content-Type: text/plain; charset=iso-8859-1; format=flowed
> 
> Do you have any special interests, like: writing from a true MPI job;
> collective vs. independent; one-file-per-processor vs. a single, shared
> file; or writing via MPI-IO vs. POSIX?
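> 
> For instance, with a tool like IOR (purely an illustration -- the
> process count, sizes, and path are placeholders):
> 
>    mpirun -np 16 ./IOR -a MPIIO -c -b 1g -t 4m -o /mnt/lustre/testfile
> 
> would exercise collective MPI-IO to a single shared file; adding -F
> switches it to one file per process.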
> 
> 
> Marty Barnaby
> 
> 
> mayur bhosle wrote:
>> Hi everyone,
>>
>> I am a student at Georgia Tech, and as part of a project I need to
>> benchmark the Lustre file system. I did a lot of searching regarding
>> possible benchmarks, but I need some advice on which ones would be
>> most suitable. If anyone can post a suggestion, that would be really
>> helpful.
>>
>> Thanks in advance,
>>
>> Mayur
> 
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Mon, 11 Feb 2008 14:16:20 -0500
> From: Aaron Knister <aaron at iges.org>
> Subject: Re: [Lustre-discuss] Lustre clients getting evicted
> To: Tom.Wang <Tom.Wang at Sun.COM>
> Cc: lustre-discuss at lists.lustre.org
> Message-ID: <79343CD8-77EA-4686-A2AE-BEE6FAC59914 at iges.org>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
> 
> I'm having a similar issue with Lustre 1.6.4.2 and InfiniBand. Under
> load, the clients hang about every 10 minutes, which is really bad for
> a production machine. The only way to fix the hang is to reboot the
> server. My users are getting extremely impatient :-/
> 
> I see this on the clients-
> 
> LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@  
> timeout (sent at 1202756629, 301s ago)  req at ffff8100af233600 x1796079/ 
> t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336 ref 1 fl  
> Rpc:/0/0 rc 0/-22
> Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service data- 
> OST0000 via nid 192.168.64.71 at o2ib was lost; in progress operations  
> using this service will wait for recovery to complete.
> LustreError: 11-0: an error occurred while communicating with  
> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
> LustreError: 11-0: an error occurred while communicating with  
> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
> 
> I've increased the timeout to 300 seconds and it has helped marginally.
> 
> -Aaron
> 
> On Feb 9, 2008, at 12:06 AM, Tom.Wang wrote:
> 
>> Hi,
>> Aha, this bug has been fixed in bug 14360.
>>
>> https://bugzilla.lustre.org/show_bug.cgi?id=14360
>>
>> The patch there should fix your problem; the fix should be released in
>> 1.6.5.
>>
>> Thanks
>>
>> Brock Palen wrote:
>>> Sure, attached. Note, though, that we rebuilt our Lustre source for
>>> another box that uses the largesmp kernel, but it used the same
>>> options and compiler.
>>>
>>>
>>> Brock Palen
>>> Center for Advanced Computing
>>> brockp at umich.edu
>>> (734)936-1985
>>>
>>>
>>> On Feb 8, 2008, at 2:47 PM, Tom.Wang wrote:
>>>
>>>> Hello,
>>>>
>>>> m45_amp214_om D 0000000000000000     0  2587      1         31389
>>>> 2586 (NOTLB)
>>>> 00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001
>>>>      00000100080f1a40 0000000000000246 00000101f6b435a8
>>>> 0000000380136025
>>>>      00000102270a1030 00000000000000d0
>>>> Call Trace:<ffffffffa0216e79>{:lnet:LNetPut+1689}
>>>> <ffffffff8030e45f>{__down+147}
>>>>      <ffffffff80134659>{default_wake_function+0}
>>>> <ffffffff8030ff7d>{__down_failed+53}
>>>>      <ffffffffa04292e1>{:lustre:.text.lock.file+5}
>>>> <ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798}
>>>>      <ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456}
>>>> <ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107}
>>>>      <ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213}
>>>>      <ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56}
>>>>      <ffffffffa02c3dbc>{:ptlrpc:search_queue+284}
>>>> <ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99}
>>>>      <ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915}
>>>>      <ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435}
>>>>      <ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313}
>>>>      <ffffffffa0394e6f>{:mdc:mdc_enqueue+1023}
>>>> <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}
>>>>      <ffffffffa0268730>{:obdclass:class_handle2object+224}
>>>>      <ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794}
>>>>      <ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31}
>>>>      <ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595}
>>>>      <ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140}
>>>>      <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}
>>>> <ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154}
>>>>      <ffffffffa039617d>{:mdc:mdc_intent_lock+685}
>>>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>>>      <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}
>>>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>>>      <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}
>>>> <ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139}
>>>>      <ffffffffa0418a32>{:lustre:ll_intent_file_open+450}
>>>>      <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>>> <ffffffff80192006>{__d_lookup+287}
>>>>      <ffffffffa0419724>{:lustre:ll_file_open+2100}
>>>> <ffffffffa0428a18>{:lustre:ll_inode_permission+184}
>>>>      <ffffffff80179bdb>{sys_access+349}
>>>> <ffffffff8017a1ee>{__dentry_open+201}
>>>>      <ffffffff8017a3a9>{filp_open+95}
>>>> <ffffffff80179bdb>{sys_access+349}
>>>>      <ffffffff801f00b5>{strncpy_from_user+74}
>>>> <ffffffff8017a598>{sys_open+57}
>>>>      <ffffffff8011026a>{system_call+126}
>>>>
>>>> It seems the blocking_ast process was blocked here. Could you dump
>>>> lustre/llite/namei.o with "objdump -S lustre/llite/namei.o" and send
>>>> the result to me?
>>>>
>>>> Thanks
>>>> WangDi
>>>>
>>>> Brock Palen wrote:
>>>>>>> On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote:
>>>>>>>>> MDT dmesg:
>>>>>>>>>
>>>>>>>>> LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg())
>>>>>>>>> @@@  processing error (-107)  req at 000001002b
>>>>>>>>> 52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl
>>>>>>>>> Interpret:/0/0  rc -107/0
>>>>>>>>> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback())  
>>>>>>>>> ###
>>>>>>>>> lock  callback timer expired: evicting cl
>>>>>>>>> ient
>>>>>>>>> 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID
>>>>>>>>> nid 10.164.0.141 at tcp  ns: mds-nobackup
>>>>>>>>> -MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc:
>>>>>>>>> 1/0,0  mode: CR/CR res: 11240142/324715850 bi
>>>>>>>>> ts 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08
>>>>>>>>> expref:  372 pid 26925
>>>>>>>>>
>>>>>>>> The client was evicted because this lock could not be released
>>>>>>>> on the client in time. Could you provide the stack trace of the
>>>>>>>> client at that time?
>>>>>>>>
>>>>>>>> I assume increasing obd_timeout could fix your problem. Or maybe
>>>>>>>> you should wait for the 1.6.5 release, which includes a new
>>>>>>>> feature, adaptive timeouts, that adjusts the timeout value
>>>>>>>> according to network congestion and server load. It should help
>>>>>>>> with your problem.
>>>>>>> Waiting for the next version of Lustre might be the best thing.
>>>>>>> I had upped the timeout a few days back, but the next day I had
>>>>>>> errors on the MDS box.  I have switched it back:
>>>>>>>
>>>>>>> lctl conf_param nobackup-MDT0000.sys.timeout=300
>>>>>>>
>>>>>>> I would love to give you that trace but I don't know how to get
>>>>>>> it.  Is there a debug option to turn on in the clients?
>>>>>> You can get that with "echo t > /proc/sysrq-trigger" on the client.
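>>>>>>
>>>>>> Something like this (assuming sysrq is compiled into the kernel):
>>>>>>
>>>>>>    echo 1 > /proc/sys/kernel/sysrq   # enable sysrq if it is off
>>>>>>    echo t > /proc/sysrq-trigger      # dump all task stacks
>>>>>>    dmesg > /tmp/client-tasks.txt     # save the trace to a file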
>>>>>>
>>>>> Cool command; the output from the client is attached.  The four
>>>>> m45_amp214_om processes are the application that hung when working
>>>>> off of Lustre.  You can see it is stuck in the I/O state.
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Lustre-discuss mailing list
>>>>> Lustre-discuss at lists.lustre.org
>>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>>
>>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
> 
> (301) 595-7000
> aaron at iges.org
> 
> 
> 
> 
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Mon, 11 Feb 2008 15:04:05 -0500
> From: "Tom.Wang" <Tom.Wang at Sun.COM>
> Subject: Re: [Lustre-discuss] Lustre clients getting evicted
> To: Aaron Knister <aaron at iges.org>
> Cc: lustre-discuss at lists.lustre.org
> Message-ID: <47B0AA35.7070303 at sun.com>
> Content-Type: text/plain; format=flowed; charset=ISO-8859-1
> 
> Aaron Knister wrote:
>> I'm having a similar issue with Lustre 1.6.4.2 and InfiniBand. Under
>> load, the clients hang about every 10 minutes, which is really bad for
>> a production machine. The only way to fix the hang is to reboot the
>> server. My users are getting extremely impatient :-/
>>
>> I see this on the clients-
>>
>> LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@ 
>> timeout (sent at 1202756629, 301s ago)  req at ffff8100af233600 
>> x1796079/t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336 
>> ref 1 fl Rpc:/0/0 rc 0/-22
> It means the OST could not respond to the request (unlink, o6) in 300
> seconds, so the client disconnects the import to the OST and tries to
> reconnect. Does this disconnection always happen when doing unlink?
> Could you please post the process trace and console messages of the
> OST at that time?
> 
> Thanks
> WangDi
>> Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service 
>> data-OST0000 via nid 192.168.64.71 at o2ib was lost; in progress 
>> operations using this service will wait for recovery to complete.
>> LustreError: 11-0: an error occurred while communicating with 
>> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
>> LustreError: 11-0: an error occurred while communicating with 
>> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
>>
>> I've increased the timeout to 300 seconds and it has helped marginally.
>>
>> -Aaron
>>
> 
>>
>>
>>
>>
> 
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Mon, 11 Feb 2008 15:19:21 -0500
> From: Craig Prescott <prescott at hpc.ufl.edu>
> Subject: Re: [Lustre-discuss] Lustre clients getting evicted
> To: Aaron Knister <aaron at iges.org>
> Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org
> Message-ID: <47B0ADC9.8020501 at hpc.ufl.edu>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> Aaron Knister wrote:
>> I'm having a similar issue with Lustre 1.6.4.2 and InfiniBand. Under
>> load, the clients hang about every 10 minutes, which is really bad for
>> a production machine. The only way to fix the hang is to reboot the
>> server. My users are getting extremely impatient :-/
>>
>> I see this on the clients-
>>
>> LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@  
>> timeout (sent at 1202756629, 301s ago)  req at ffff8100af233600 x1796079/ 
>> t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336 ref 1 fl  
>> Rpc:/0/0 rc 0/-22
>> Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service data- 
>> OST0000 via nid 192.168.64.71 at o2ib was lost; in progress operations  
>> using this service will wait for recovery to complete.
>> LustreError: 11-0: an error occurred while communicating with  
>> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
>> LustreError: 11-0: an error occurred while communicating with  
>> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
>>
>> I've increased the timeout to 300 seconds and it has helped marginally.
> 
> Hi Aaron;
> 
> We set the timeout to a big number (1000 secs) on our 400-node cluster
> (mostly o2ib, some tcp clients).  Until we did this, we had loads
> of evictions.  In our case, it solved the problem.
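> 
> Roughly like so (a sketch -- substitute your own fsname, and run
> conf_param on the MGS/MDS node):
> 
>    lctl conf_param data-MDT0000.sys.timeout=1000
>    cat /proc/sys/lustre/timeout     # verify the value on each node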
> 
> Cheers,
> Craig
> 
> 
> ------------------------------
> 
> Message: 5
> Date: Mon, 11 Feb 2008 14:11:45 -0700
> From: Andreas Dilger <adilger at sun.com>
> Subject: Re: [Lustre-discuss] rc -43: Identifier removed
> To: Per Lundqvist <perl at nsc.liu.se>
> Cc: Lustre Discuss <lustre-discuss at lists.lustre.org>
> Message-ID: <20080211211145.GJ3029 at webber.adilger.int>
> Content-Type: text/plain; charset=us-ascii
> 
> On Feb 11, 2008  17:04 +0100, Per Lundqvist wrote:
>> I got this error today when testing a newly set up 1.6 filesystem:
>>
>>    n50 1% cd /mnt/test
>>    n50 2% ls
>>    ls: reading directory .: Identifier removed
>>    
>>    n50 3% ls -alrt
>>    total 8
>>    ?---------  ? ?    ?       ?            ? dir1
>>    ?---------  ? ?    ?       ?            ? dir2
>>    drwxr-xr-x  4 root root 4096 Feb  8 15:46 ../
>>    drwxr-xr-x  4 root root 4096 Feb 11 15:11 ./
>>
>>    n50 4% stat .
>>      File: `.'
>>      Size: 4096            Blocks: 8          IO Block: 4096   directory
>>    Device: b438c888h/-1271347064d  Inode: 27616681    Links: 2
>>    Access: (0755/drwxr-xr-x)  Uid: ( 1120/   faxen)   Gid: (  500/     nsc)
>>    Access: 2008-02-11 16:11:48.336621154 +0100
>>    Modify: 2008-02-11 15:11:27.000000000 +0100
>>    Change: 2008-02-11 15:11:31.352841294 +0100
>>    
>> This seems to happen almost all the time when I am running as a
>> specific user on this system. Note that the stat call always works... I
>> haven't yet been able to reproduce this problem when running as my own
>> user.
> 
> EIDRM (Identifier removed) means that your MDS has a user database
> (/etc/passwd and /etc/group) that is missing the particular user ID.
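> 
> A quick sanity check on the MDS (the UID/GID here are taken from the
> stat output above):
> 
>    getent passwd 1120    # should print the 'faxen' entry
>    getent group 500      # and the 'nsc' group
> 
> If either comes back empty, adding the user/group on the MDS should
> clear the error.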
> 
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
> 
> 
> 
> ------------------------------
> 
> Message: 6
> Date: Mon, 11 Feb 2008 16:17:37 -0500
> From: Brock Palen <brockp at umich.edu>
> Subject: Re: [Lustre-discuss] Lustre clients getting evicted
> To: Craig Prescott <prescott at hpc.ufl.edu>
> Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org
> Message-ID: <38A6B1A2-E20A-40BC-80C2-CEBB971BDC09 at umich.edu>
> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
> 
>>> I've increased the timeout to 300 seconds and it has helped
>>> marginally.
>> Hi Aaron;
>>
>> We set the timeout to a big number (1000 secs) on our 400-node cluster
>> (mostly o2ib, some tcp clients).  Until we did this, we had loads
>> of evictions.  In our case, it solved the problem.
> 
> This feels excessive.  But at this point I guess I'll try it.
> 
>> Cheers,
>> Craig
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>>
> 
> 
> 
> ------------------------------
> 
> Message: 7
> Date: Mon, 11 Feb 2008 16:48:05 -0500
> From: Aaron Knister <aaron at iges.org>
> Subject: Re: [Lustre-discuss] Lustre clients getting evicted
> To: Brock Palen <brockp at umich.edu>
> Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org
> Message-ID: <7A1D46E5-CC69-4C37-9CC7-B229FCA43BA1 at iges.org>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
> 
> So far it's helped. If this doesn't fix it, I'm going to apply the
> patch mentioned here - https://bugzilla.lustre.org/attachment.cgi?id=14006&action=edit
> I'll let you know how it goes. If you'd like a copy of the patched
> version, let me know. Are you running RHEL or SLES? What version of the
> OS and Lustre?
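> 
> For anyone else who wants to try it, roughly (a sketch -- the -p strip
> level depends on how the patch was generated):
> 
>    wget -O bug14360.patch \
>        'https://bugzilla.lustre.org/attachment.cgi?id=14006'
>    cd lustre-1.6.4.2
>    patch -p0 --dry-run < bug14360.patch   # make sure it applies
>    patch -p0 < bug14360.patch
> 
> and then rebuild the Lustre modules.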
> 
> -Aaron
> 
> On Feb 11, 2008, at 4:17 PM, Brock Palen wrote:
> 
>>>> I've increased the timeout to 300 seconds and it has helped
>>>> marginally.
>>> Hi Aaron;
>>>
>>> We set the timeout to a big number (1000 secs) on our 400-node cluster
>>> (mostly o2ib, some tcp clients).  Until we did this, we had loads
>>> of evictions.  In our case, it solved the problem.
>> This feels excessive.  But at this point I guess I'll try it.
>>
>>> Cheers,
>>> Craig
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>>>
> 
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
> 
> (301) 595-7000
> aaron at iges.org
> 
> 
> 
> 
> 
> 
> ------------------------------
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> 
> End of Lustre-discuss Digest, Vol 25, Issue 17
> **********************************************
> 
> 
> ------------------------------
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> 
> End of Lustre-discuss Digest, Vol 25, Issue 19
> **********************************************
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss




More information about the lustre-discuss mailing list