[Lustre-discuss] Contents of Lustre-discuss digest...

ashok bharat bayana ashok.bharat.bayana at iiitb.ac.in
Mon Feb 11 21:48:35 PST 2008


Hi,
I just want to know whether there are any alternative file systems to HP SFS.
I heard that there is a Cluster Gateway product from PolyServe. Can anybody please help me find more information about this Cluster Gateway?

Thanks and Regards,
Ashok Bharat

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org on behalf of lustre-discuss-request at lists.lustre.org
Sent: Tue 2/12/2008 11:05 AM
To: lustre-discuss at lists.lustre.org
Subject: Lustre-discuss Digest, Vol 25, Issue 19
 
Send Lustre-discuss mailing list submissions to
	lustre-discuss at lists.lustre.org

To subscribe or unsubscribe via the World Wide Web, visit
	http://lists.lustre.org/mailman/listinfo/lustre-discuss
or, via email, send a message with subject or body 'help' to
	lustre-discuss-request at lists.lustre.org

You can reach the person managing the list at
	lustre-discuss-owner at lists.lustre.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Lustre-discuss digest..."


Today's Topics:

   1. Re: multihomed clients ignoring lnet options (Cliff White)
   2. Re: multihomed clients ignoring lnet options (Joe Little)
   3. Re: multihomed clients ignoring lnet options (Steden Klaus)
   4. Re: Lustre-discuss Digest, Vol 25, Issue 17 (ashok bharat bayana)


----------------------------------------------------------------------

Message: 1
Date: Mon, 11 Feb 2008 20:00:10 -0800
From: Cliff White <Cliff.White at Sun.COM>
Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options
To: Aaron Knister <aaron at iges.org>
Cc: lustre-discuss at lists.lustre.org
Message-ID: <47B119CA.4050105 at sun.com>
Content-Type: text/plain; format=flowed; charset=ISO-8859-1

Aaron Knister wrote:
> I believe that's correct. The nids of the various server components  
> are stored on the filesystem itself.

Yes, and you can always see them with
tunefs.lustre --print <device>
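
For example (a quick sketch; /dev/sdb is a hypothetical device path,
substitute your actual MDT/OST block device):

  tunefs.lustre --print /dev/sdb

The "Parameters:" lines in the output (e.g. mgsnode=...) show the NIDs
that were stored on the target when it was formatted.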

cliffw

> 
> On Feb 10, 2008, at 12:58 AM, Joe Little wrote:
> 
>> Never mind; the problem was resolved by recreating the MGS and
>> the OSTs using the same parameters on the server. I was able to
>> change the parameters and still have the servers working, but my guess
>> is that those options are permanently etched into the filesystem.
>>
>>
>> On Feb 9, 2008 8:16 PM, Joe Little <jmlittle at gmail.com> wrote:
>>> I have all of my servers and clients using eth1 for the tcp lustre  
>>> lnet.
>>>
>>> All have modprobe.conf entries of:
>>>
>>> options lnet networks="tcp0(eth1)"
>>>
>>> and all report with "lctl list_nids" that they are using the IP
>>> address associated with that interface (a net 192.168.200.x address)
>>>
>>> However, when my client connects, it ignores the above and goes with
>>> eth0 for routing, even though the mds/mgs is on that network range:
>>>
>>> client dmesg:
>>>
>>> Lustre: 4756:0:(module.c:382:init_libcfs_module()) maximum lustre  
>>> stack 8192
>>> Lustre: Added LNI 192.168.200.100 at tcp [8/256]
>>> Lustre: Accept secure, port 988
>>> Lustre: OBD class driver, info at clusterfs.com
>>>        Lustre Version: 1.6.4.2
>>>        Build Version:
>>> 1.6.4.2-19691231190000-PRISTINE-.cache.build.BUILD.lustre- 
>>> kernel-2.6.9.lustre.linux-2.6.9-55.0.9.EL_lustre.1.6.4.2smp
>>> Lustre: Lustre Client File System; info at clusterfs.com
>>> LustreError: 4799:0:(socklnd_cb.c:2167:ksocknal_recv_hello()) Error
>>> -104 reading HELLO from 192.168.2.201
>>> LustreError: 11b-b: Connection to 192.168.2.201 at tcp at host
>>> 192.168.2.201 on port 988 was reset: is it running a compatible
>>> version of Lustre and is 192.168.2.201 at tcp one of its NIDs?
>>>
>>> server dmesg:
>>> LustreError: 120-3: Refusing connection from 192.168.2.192 for
>>> 192.168.2.201 at tcp: No matching NI
>>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
> 
> (301) 595-7000
> aaron at iges.org
> 
> 
> 
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss



------------------------------

Message: 2
Date: Mon, 11 Feb 2008 20:51:20 -0800
From: "Joe Little" <jmlittle at gmail.com>
Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options
To: "Cliff White" <Cliff.White at sun.com>
Cc: lustre-discuss at lists.lustre.org
Message-ID:
	<e3849caa0802112051q7e24e6acv5af03a16f2bca2c3 at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

On Feb 11, 2008 8:00 PM, Cliff White <Cliff.White at sun.com> wrote:
> Aaron Knister wrote:
> > I believe that's correct. The nids of the various server components
> > are stored on the filesystem itself.
>
> Yes, and you can always see them with
> tunefs.lustre --print <device>
>
> cliffw

Is there any way to change them after the fact?
>
>
> >
> > On Feb 10, 2008, at 12:58 AM, Joe Little wrote:
> >
> >> Never mind; the problem was resolved by recreating the MGS and
> >> the OSTs using the same parameters on the server. I was able to
> >> change the parameters and still have the servers working, but my guess
> >> is that those options are permanently etched into the filesystem.
> >>
> >>
> >> On Feb 9, 2008 8:16 PM, Joe Little <jmlittle at gmail.com> wrote:
> >>> I have all of my servers and clients using eth1 for the tcp lustre
> >>> lnet.
> >>>
> >>> All have modprobe.conf entries of:
> >>>
> >>> options lnet networks="tcp0(eth1)"
> >>>
> >>> and all report with "lctl list_nids" that they are using the IP
> >>> address associated with that interface (a net 192.168.200.x address)
> >>>
> >>> However, when my client connects, it ignores the above and goes with
> >>> eth0 for routing, even though the mds/mgs is on that network range:
> >>>
> >>> client dmesg:
> >>>
> >>> Lustre: 4756:0:(module.c:382:init_libcfs_module()) maximum lustre
> >>> stack 8192
> >>> Lustre: Added LNI 192.168.200.100 at tcp [8/256]
> >>> Lustre: Accept secure, port 988
> >>> Lustre: OBD class driver, info at clusterfs.com
> >>>        Lustre Version: 1.6.4.2
> >>>        Build Version:
> >>> 1.6.4.2-19691231190000-PRISTINE-.cache.build.BUILD.lustre-
> >>> kernel-2.6.9.lustre.linux-2.6.9-55.0.9.EL_lustre.1.6.4.2smp
> >>> Lustre: Lustre Client File System; info at clusterfs.com
> >>> LustreError: 4799:0:(socklnd_cb.c:2167:ksocknal_recv_hello()) Error
> >>> -104 reading HELLO from 192.168.2.201
> >>> LustreError: 11b-b: Connection to 192.168.2.201 at tcp at host
> >>> 192.168.2.201 on port 988 was reset: is it running a compatible
> >>> version of Lustre and is 192.168.2.201 at tcp one of its NIDs?
> >>>
> >>> server dmesg:
> >>> LustreError: 120-3: Refusing connection from 192.168.2.192 for
> >>> 192.168.2.201 at tcp: No matching NI
> >>>
> >> _______________________________________________
> >> Lustre-discuss mailing list
> >> Lustre-discuss at lists.lustre.org
> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> >
> > Aaron Knister
> > Associate Systems Analyst
> > Center for Ocean-Land-Atmosphere Studies
> >
> > (301) 595-7000
> > aaron at iges.org
> >
> >
> >
> >
> > _______________________________________________
> > Lustre-discuss mailing list
> > Lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>


------------------------------

Message: 3
Date: Mon, 11 Feb 2008 20:53:41 -0800
From: "Steden Klaus" <Klaus.Steden at thomson.net>
Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options
To: <jmlittle at gmail.com>,	<Cliff.White at sun.com>
Cc: lustre-discuss at lists.lustre.org
Message-ID:
	<23480D326186CF49819F5EF363276C9003AB2AB3 at BRBKSMAIL04.am.thmulti.com>
Content-Type: text/plain;	charset="utf-8"


If you have root, you can change them using tunefs.lustre after the file system has been shut down.

I've done this a number of times to test various lnet configs.
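
A minimal sketch of that procedure, in case it helps (the device path
and NID below are made up; run this on each target with the filesystem
stopped, and adjust for your own setup):

  umount /mnt/ost0                      # target must be offline
  tunefs.lustre --print /dev/sdb        # check the current parameters
  tunefs.lustre --erase-params --mgsnode=192.168.200.10@tcp0 \
      --writeconf /dev/sdb

Then bring the MGS/MDT back up first and the OSTs after it, so the
configuration logs get regenerated with the new NIDs.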

Klaus


----- Original Message -----
From: lustre-discuss-bounces at lists.lustre.org <lustre-discuss-bounces at lists.lustre.org>
To: Cliff White <Cliff.White at sun.com>
Cc: lustre-discuss at lists.lustre.org <lustre-discuss at lists.lustre.org>
Sent: Mon Feb 11 20:51:20 2008
Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options

On Feb 11, 2008 8:00 PM, Cliff White <Cliff.White at sun.com> wrote:
> Aaron Knister wrote:
> > I believe that's correct. The nids of the various server components
> > are stored on the filesystem itself.
>
> Yes, and you can always see them with
> tunefs.lustre --print <device>
>
> cliffw

Is there any way to change them after the fact?
>
>
> >
> > On Feb 10, 2008, at 12:58 AM, Joe Little wrote:
> >
> >> Never mind; the problem was resolved by recreating the MGS and
> >> the OSTs using the same parameters on the server. I was able to
> >> change the parameters and still have the servers working, but my guess
> >> is that those options are permanently etched into the filesystem.
> >>
> >>
> >> On Feb 9, 2008 8:16 PM, Joe Little <jmlittle at gmail.com> wrote:
> >>> I have all of my servers and clients using eth1 for the tcp lustre
> >>> lnet.
> >>>
> >>> All have modprobe.conf entries of:
> >>>
> >>> options lnet networks="tcp0(eth1)"
> >>>
> >>> and all report with "lctl list_nids" that they are using the IP
> >>> address associated with that interface (a net 192.168.200.x address)
> >>>
> >>> However, when my client connects, it ignores the above and goes with
> >>> eth0 for routing, even though the mds/mgs is on that network range:
> >>>
> >>> client dmesg:
> >>>
> >>> Lustre: 4756:0:(module.c:382:init_libcfs_module()) maximum lustre
> >>> stack 8192
> >>> Lustre: Added LNI 192.168.200.100 at tcp [8/256]
> >>> Lustre: Accept secure, port 988
> >>> Lustre: OBD class driver, info at clusterfs.com
> >>>        Lustre Version: 1.6.4.2
> >>>        Build Version:
> >>> 1.6.4.2-19691231190000-PRISTINE-.cache.build.BUILD.lustre-
> >>> kernel-2.6.9.lustre.linux-2.6.9-55.0.9.EL_lustre.1.6.4.2smp
> >>> Lustre: Lustre Client File System; info at clusterfs.com
> >>> LustreError: 4799:0:(socklnd_cb.c:2167:ksocknal_recv_hello()) Error
> >>> -104 reading HELLO from 192.168.2.201
> >>> LustreError: 11b-b: Connection to 192.168.2.201 at tcp at host
> >>> 192.168.2.201 on port 988 was reset: is it running a compatible
> >>> version of Lustre and is 192.168.2.201 at tcp one of its NIDs?
> >>>
> >>> server dmesg:
> >>> LustreError: 120-3: Refusing connection from 192.168.2.192 for
> >>> 192.168.2.201 at tcp: No matching NI
> >>>
> >> _______________________________________________
> >> Lustre-discuss mailing list
> >> Lustre-discuss at lists.lustre.org
> >> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> >
> > Aaron Knister
> > Associate Systems Analyst
> > Center for Ocean-Land-Atmosphere Studies
> >
> > (301) 595-7000
> > aaron at iges.org
> >
> >
> >
> >
> > _______________________________________________
> > Lustre-discuss mailing list
> > Lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

------------------------------

Message: 4
Date: Tue, 12 Feb 2008 11:15:18 +0530
From: "ashok bharat bayana" <ashok.bharat.bayana at iiitb.ac.in>
Subject: Re: [Lustre-discuss] Lustre-discuss Digest, Vol 25, Issue 17
To: <lustre-discuss at lists.lustre.org>
Message-ID: <8626C1B7EB748940BCDD7596134632BE850213 at jal.iiitb.ac.in>
Content-Type: text/plain; charset="iso-8859-1"


Hi,
I just want to know whether there are any alternative file systems to HP SFS.
I heard that there is a Cluster Gateway product from PolyServe. Can anybody please help me find more information about this Cluster Gateway?

Thanks and Regards,
Ashok Bharat

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org on behalf of lustre-discuss-request at lists.lustre.org
Sent: Tue 2/12/2008 3:18 AM
To: lustre-discuss at lists.lustre.org
Subject: Lustre-discuss Digest, Vol 25, Issue 17
 
Send Lustre-discuss mailing list submissions to
	lustre-discuss at lists.lustre.org

To subscribe or unsubscribe via the World Wide Web, visit
	http://lists.lustre.org/mailman/listinfo/lustre-discuss
or, via email, send a message with subject or body 'help' to
	lustre-discuss-request at lists.lustre.org

You can reach the person managing the list at
	lustre-discuss-owner at lists.lustre.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Lustre-discuss digest..."


Today's Topics:

   1. Re: Benchmarking Lustre (Marty Barnaby)
   2. Re: Luster clients getting evicted (Aaron Knister)
   3. Re: Luster clients getting evicted (Tom.Wang)
   4. Re: Luster clients getting evicted (Craig Prescott)
   5. Re: rc -43: Identifier removed (Andreas Dilger)
   6. Re: Luster clients getting evicted (Brock Palen)
   7. Re: Luster clients getting evicted (Aaron Knister)


----------------------------------------------------------------------

Message: 1
Date: Mon, 11 Feb 2008 11:25:48 -0700
From: "Marty Barnaby" <mlbarna at sandia.gov>
Subject: Re: [Lustre-discuss] Benchmarking Lustre
To: "lustre-discuss at lists.lustre.org"
	<lustre-discuss at lists.lustre.org>
Message-ID: <47B0932C.2090200 at sandia.gov>
Content-Type: text/plain; charset=iso-8859-1; format=flowed

Do you have any special interests, like: writing from a true MPI job;
collective vs. independent I/O; one-file-per-processor vs. a single,
shared file; or writing via MPI-IO vs. POSIX?
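
If you want to cover several of those axes at once, IOR is one common
choice; a rough sketch of two runs (assuming IOR and an MPI launcher
are installed, and with made-up process counts, sizes, and paths):

  # single shared file via MPI-IO, write then read
  mpirun -np 32 ./ior -a MPIIO -w -r -b 1g -t 4m -o /mnt/lustre/ior.dat
  # file-per-process via POSIX
  mpirun -np 32 ./ior -a POSIX -w -r -b 1g -t 4m -F -o /mnt/lustre/ior.dat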


Marty Barnaby


mayur bhosle wrote:
> Hi everyone,
>
> I am a student at Georgia Tech, and as part of a project I need to
> benchmark the Lustre file system. I did a lot of searching regarding
> possible benchmarks, but I need some advice on which ones would be
> most suitable. If anyone can post a suggestion, that would be really
> helpful.
>
> Thanks in advance.
>
> Mayur




------------------------------

Message: 2
Date: Mon, 11 Feb 2008 14:16:20 -0500
From: Aaron Knister <aaron at iges.org>
Subject: Re: [Lustre-discuss] Luster clients getting evicted
To: Tom.Wang <Tom.Wang at Sun.COM>
Cc: lustre-discuss at lists.lustre.org
Message-ID: <79343CD8-77EA-4686-A2AE-BEE6FAC59914 at iges.org>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

I'm having a similar issue with Lustre 1.6.4.2 and InfiniBand. Under
load, the clients hang about every 10 minutes, which is really bad for
a production machine. The only way to fix the hang is to reboot the
server. My users are getting extremely impatient :-/

I see this on the clients-

LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@  
timeout (sent at 1202756629, 301s ago)  req at ffff8100af233600 x1796079/ 
t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336 ref 1 fl  
Rpc:/0/0 rc 0/-22
Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service data- 
OST0000 via nid 192.168.64.71 at o2ib was lost; in progress operations  
using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with  
192.168.64.71 at o2ib. The ost_connect operation failed with -16
LustreError: 11-0: an error occurred while communicating with  
192.168.64.71 at o2ib. The ost_connect operation failed with -16

I've increased the timeout to 300 seconds and it has helped marginally.

-Aaron

On Feb 9, 2008, at 12:06 AM, Tom.Wang wrote:

> Hi,
> Aha, this bug has already been fixed; see bug 14360.
>
> https://bugzilla.lustre.org/show_bug.cgi?id=14360
>
> The patch there should fix your problem; it should be released in
> 1.6.5.
>
> Thanks
>
> Brock Palen wrote:
>> Sure, attached. Note, though, that we rebuilt our Lustre source for
>> another box that uses the largesmp kernel, but it used the same
>> options and compiler.
>>
>>
>> Brock Palen
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>>
>> On Feb 8, 2008, at 2:47 PM, Tom.Wang wrote:
>>
>>> Hello,
>>>
>>> m45_amp214_om D 0000000000000000     0  2587      1         31389
>>> 2586 (NOTLB)
>>> 00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001
>>>      00000100080f1a40 0000000000000246 00000101f6b435a8
>>> 0000000380136025
>>>      00000102270a1030 00000000000000d0
>>> Call Trace:<ffffffffa0216e79>{:lnet:LNetPut+1689}
>>> <ffffffff8030e45f>{__down+147}
>>>      <ffffffff80134659>{default_wake_function+0}
>>> <ffffffff8030ff7d>{__down_failed+53}
>>>      <ffffffffa04292e1>{:lustre:.text.lock.file+5}
>>> <ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798}
>>>      <ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456}
>>> <ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107}
>>>      <ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213}
>>>      <ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56}
>>>      <ffffffffa02c3dbc>{:ptlrpc:search_queue+284}
>>> <ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99}
>>>      <ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915}
>>>      <ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435}
>>>      <ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313}
>>>      <ffffffffa0394e6f>{:mdc:mdc_enqueue+1023}
>>> <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}
>>>      <ffffffffa0268730>{:obdclass:class_handle2object+224}
>>>      <ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794}
>>>      <ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31}
>>>      <ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595}
>>>      <ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140}
>>>      <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}
>>> <ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154}
>>>      <ffffffffa039617d>{:mdc:mdc_intent_lock+685}
>>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>>      <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}
>>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>>      <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}
>>> <ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139}
>>>      <ffffffffa0418a32>{:lustre:ll_intent_file_open+450}
>>>      <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>> <ffffffff80192006>{__d_lookup+287}
>>>      <ffffffffa0419724>{:lustre:ll_file_open+2100}
>>> <ffffffffa0428a18>{:lustre:ll_inode_permission+184}
>>>      <ffffffff80179bdb>{sys_access+349}
>>> <ffffffff8017a1ee>{__dentry_open+201}
>>>      <ffffffff8017a3a9>{filp_open+95}
>>> <ffffffff80179bdb>{sys_access+349}
>>>      <ffffffff801f00b5>{strncpy_from_user+74}
>>> <ffffffff8017a598>{sys_open+57}
>>>      <ffffffff8011026a>{system_call+126}
>>>
>>> It seems the blocking_ast process was blocked here. Could you dump
>>> lustre/llite/namei.o with objdump -S lustre/llite/namei.o and send
>>> it to me?
>>>
>>> Thanks
>>> WangDi
>>>
>>> Brock Palen wrote:
>>>>>> On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote:
>>>>>>>> MDT dmesg:
>>>>>>>>
>>>>>>>> LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg())
>>>>>>>> @@@  processing error (-107)  req at 000001002b
>>>>>>>> 52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl
>>>>>>>> Interpret:/0/0  rc -107/0
>>>>>>>> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback())  
>>>>>>>> ###
>>>>>>>> lock  callback timer expired: evicting cl
>>>>>>>> ient
>>>>>>>> 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID
>>>>>>>> nid 10.164.0.141 at tcp  ns: mds-nobackup
>>>>>>>> -MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc:
>>>>>>>> 1/0,0  mode: CR/CR res: 11240142/324715850 bi
>>>>>>>> ts 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08
>>>>>>>> expref:  372 pid 26925
>>>>>>>>
>>>>>>> The client was evicted because this lock could not be released
>>>>>>> on the client in time. Could you provide the stack trace of the
>>>>>>> client at that time?
>>>>>>>
>>>>>>> I assume increasing obd_timeout could fix your problem. Or you
>>>>>>> could wait for the 1.6.5 release, which includes a new adaptive
>>>>>>> timeouts feature that adjusts the timeout value according to
>>>>>>> network congestion and server load. That should help with your
>>>>>>> problem.
>>>>>>
>>>>>> Waiting for the next version of Lustre might be the best thing.
>>>>>> I had upped the timeout a few days back, but the next day I had
>>>>>> errors on the MDS box.  I have switched it back:
>>>>>>
>>>>>> lctl conf_param nobackup-MDT0000.sys.timeout=300
>>>>>>
>>>>>> I would love to give you that trace but I don't know how to get
>>>>>> it.  Is there a debug option to turn on in the clients?
>>>>> You can get that by running echo t > /proc/sysrq-trigger on the client.
>>>>>
>>>> Cool command; output from the client is attached.  The four
>>>> m45_amp214_om processes are the application that hung when working
>>>> off of Lustre.  You can see it's stuck in an I/O wait state.
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>>
>>>> _______________________________________________
>>>> Lustre-discuss mailing list
>>>> Lustre-discuss at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>>>
>>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron at iges.org






------------------------------

Message: 3
Date: Mon, 11 Feb 2008 15:04:05 -0500
From: "Tom.Wang" <Tom.Wang at Sun.COM>
Subject: Re: [Lustre-discuss] Luster clients getting evicted
To: Aaron Knister <aaron at iges.org>
Cc: lustre-discuss at lists.lustre.org
Message-ID: <47B0AA35.7070303 at sun.com>
Content-Type: text/plain; format=flowed; charset=ISO-8859-1

Aaron Knister wrote:
> I'm having a similar issue with Lustre 1.6.4.2 and InfiniBand. Under
> load, the clients hang about every 10 minutes, which is really bad for
> a production machine. The only way to fix the hang is to reboot the
> server. My users are getting extremely impatient :-/
>
> I see this on the clients-
>
> LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@ 
> timeout (sent at 1202756629, 301s ago)  req at ffff8100af233600 
> x1796079/t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336 
> ref 1 fl Rpc:/0/0 rc 0/-22
It means the OST could not respond to the request (unlink, o6) within
300 seconds, so the client disconnects its import to the OST and tries
to reconnect. Does this disconnection always happen during an unlink?
Could you please post the process trace and console messages of the OST
at that time?
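
(One way to capture that, as a sketch, assuming you can still log in to
the OSS while it is hung: dump the kernel task states with sysrq and
save the console ring buffer; the output file name is just an example.)

  echo t > /proc/sysrq-trigger
  dmesg > /tmp/oss-task-trace.txt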

Thanks
WangDi
> Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service 
> data-OST0000 via nid 192.168.64.71 at o2ib was lost; in progress 
> operations using this service will wait for recovery to complete.
> LustreError: 11-0: an error occurred while communicating with 
> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
> LustreError: 11-0: an error occurred while communicating with 
> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
>
> I've increased the timeout to 300 seconds and it has helped marginally.
>
> -Aaron
>

>
>
>
>
>



------------------------------

Message: 4
Date: Mon, 11 Feb 2008 15:19:21 -0500
From: Craig Prescott <prescott at hpc.ufl.edu>
Subject: Re: [Lustre-discuss] Luster clients getting evicted
To: Aaron Knister <aaron at iges.org>
Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org
Message-ID: <47B0ADC9.8020501 at hpc.ufl.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Aaron Knister wrote:
> I'm having a similar issue with Lustre 1.6.4.2 and InfiniBand. Under
> load, the clients hang about every 10 minutes, which is really bad for
> a production machine. The only way to fix the hang is to reboot the
> server. My users are getting extremely impatient :-/
> 
> I see this on the clients-
> 
> LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@  
> timeout (sent at 1202756629, 301s ago)  req at ffff8100af233600 x1796079/ 
> t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336 ref 1 fl  
> Rpc:/0/0 rc 0/-22
> Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service data- 
> OST0000 via nid 192.168.64.71 at o2ib was lost; in progress operations  
> using this service will wait for recovery to complete.
> LustreError: 11-0: an error occurred while communicating with  
> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
> LustreError: 11-0: an error occurred while communicating with  
> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
> 
> I've increased the timeout to 300 seconds and it has helped marginally.

Hi Aaron;

We set the timeout to a big number (1000 secs) on our 400-node cluster
(mostly o2ib, some tcp clients).  Until we did this, we had loads
of evictions.  In our case, it solved the problem.
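
One way to set that persistently is on the MGS node (a sketch; the
filesystem name "data" is just an example, use your own):

  lctl conf_param data-MDT0000.sys.timeout=1000

and the value currently in effect on a node can be checked with
cat /proc/sys/lustre/timeout.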

Cheers,
Craig


------------------------------

Message: 5
Date: Mon, 11 Feb 2008 14:11:45 -0700
From: Andreas Dilger <adilger at sun.com>
Subject: Re: [Lustre-discuss] rc -43: Identifier removed
To: Per Lundqvist <perl at nsc.liu.se>
Cc: Lustre Discuss <lustre-discuss at lists.lustre.org>
Message-ID: <20080211211145.GJ3029 at webber.adilger.int>
Content-Type: text/plain; charset=us-ascii

On Feb 11, 2008  17:04 +0100, Per Lundqvist wrote:
> I got this error today when testing a newly set up 1.6 filesystem:
> 
>    n50 1% cd /mnt/test
>    n50 2% ls
>    ls: reading directory .: Identifier removed
>    
>    n50 3% ls -alrt
>    total 8
>    ?---------  ? ?    ?       ?            ? dir1
>    ?---------  ? ?    ?       ?            ? dir2
>    drwxr-xr-x  4 root root 4096 Feb  8 15:46 ../
>    drwxr-xr-x  4 root root 4096 Feb 11 15:11 ./
> 
>    n50 4% stat .
>      File: `.'
>      Size: 4096            Blocks: 8          IO Block: 4096   directory
>    Device: b438c888h/-1271347064d  Inode: 27616681    Links: 2
>    Access: (0755/drwxr-xr-x)  Uid: ( 1120/   faxen)   Gid: (  500/     nsc)
>    Access: 2008-02-11 16:11:48.336621154 +0100
>    Modify: 2008-02-11 15:11:27.000000000 +0100
>    Change: 2008-02-11 15:11:31.352841294 +0100
>    
> this seems to happen almost all the time when I am running as a
> specific user on this system. Note that the stat call always works... I 
> haven't yet been able to reproduce this problem when running as my own 
> user.

EIDRM (Identifier removed) means that your MDS has a user database
(/etc/passwd and /etc/group) that is missing the particular user ID.
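
A quick way to confirm (a sketch; the UID/GID values below come from
the stat output above for that particular user):

  # run on the MDS, then compare with a client where ls works
  getent passwd 1120
  getent group 500

If the MDS returns nothing for either lookup, add the user and group
there, or point the MDS at the same NIS/LDAP source the clients use.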


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



------------------------------

Message: 6
Date: Mon, 11 Feb 2008 16:17:37 -0500
From: Brock Palen <brockp at umich.edu>
Subject: Re: [Lustre-discuss] Luster clients getting evicted
To: Craig Prescott <prescott at hpc.ufl.edu>
Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org
Message-ID: <38A6B1A2-E20A-40BC-80C2-CEBB971BDC09 at umich.edu>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed

>> I've increased the timeout to 300 seconds and it has helped
>> marginally.
>
> Hi Aaron;
>
> We set the timeout to a big number (1000 secs) on our 400-node cluster
> (mostly o2ib, some tcp clients).  Until we did this, we had loads
> of evictions.  In our case, it solved the problem.

This feels excessive.  But at this point I guess I'll try it.

>
> Cheers,
> Craig
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>



------------------------------

Message: 7
Date: Mon, 11 Feb 2008 16:48:05 -0500
From: Aaron Knister <aaron at iges.org>
Subject: Re: [Lustre-discuss] Luster clients getting evicted
To: Brock Palen <brockp at umich.edu>
Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org
Message-ID: <7A1D46E5-CC69-4C37-9CC7-B229FCA43BA1 at iges.org>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

So far it's helped. If this doesn't fix it, I'm going to apply the
patch mentioned here: https://bugzilla.lustre.org/attachment.cgi?id=14006&action=edit
I'll let you know how it goes. If you'd like a copy of the patched
version, let me know. Are you running RHEL or SLES? What version of the
OS and Lustre?

-Aaron

On Feb 11, 2008, at 4:17 PM, Brock Palen wrote:

>>> I've increased the timeout to 300 seconds and it has helped
>>> marginally.
>>
>> Hi Aaron;
>>
>> We set the timeout to a big number (1000 secs) on our 400-node cluster
>> (mostly o2ib, some tcp clients).  Until we did this, we had loads
>> of evictions.  In our case, it solved the problem.
>
> This feels excessive.  But at this point I guess I'll try it.
>
>>
>> Cheers,
>> Craig
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>>
>

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron at iges.org






------------------------------

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


End of Lustre-discuss Digest, Vol 25, Issue 17
**********************************************


------------------------------

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


End of Lustre-discuss Digest, Vol 25, Issue 19
**********************************************


