[Lustre-discuss] OST went down running Lustre 1.6.6

Brian Stone bgstone at sgi.com
Wed Feb 11 07:23:02 PST 2009


My customer believes that this scenario is leading to corrupted files. 
If so, is there any way to avoid the file corruption when an OSS goes down?

Probably related is the following:

A filesystem hang was reported to me on Friday 2/6. The problem appeared 
to be localized to the OSS for /nobackupp2. Examination showed that all 
512 ll_ost_io threads were stuck waiting for lock callbacks to complete. 
I think the stack trace for the ost_io threads indicates that they're 
waiting for a blocking AST to complete, which was requested on behalf of 
an ost_punch (truncate) operation. For example:

 ll_ost_io_329 S ffff8103fae15888     0   634      1           635   633 (L-TLB)
 ffff8103fae15838 0000000000000046 ffff8100ace9ad40 000000000000000a
        ffff8103fae0da48 ffff8103fae0d7f0 ffff810009035800 000578ee6a132423
        0000000000001971 00000000000061a8
 Call Trace: <ffffffff885903fd>{:ptlrpc:ldlm_expired_completion_wait+173}
        <ffffffff88590350>{:ptlrpc:ldlm_expired_completion_wait+0}
        <ffffffff88591dfd>{:ptlrpc:ldlm_completion_ast+845}
        <ffffffff885780bb>{:ptlrpc:ldlm_lock_enqueue+2171}
        <ffffffff8012c8a9>{default_wake_function+0}
        <ffffffff88573dca>{:ptlrpc:ldlm_lock_addref_internal_nolock+58}
        <ffffffff88590a9b>{:ptlrpc:ldlm_cli_enqueue_local+1275}
        <ffffffff885b56f2>{:ptlrpc:lustre_swab_buf+66}
        <ffffffff887dae38>{:ost:ost_punch+1320}
        <ffffffff8858e270>{:ptlrpc:ldlm_blocking_ast+0}
        <ffffffff88591ab0>{:ptlrpc:ldlm_completion_ast+0}
        <ffffffff8858afc0>{:ptlrpc:ldlm_glimpse_ast+0}
        <ffffffff885b2ff5>{:ptlrpc:lustre_msg_get_opc+53}
        <ffffffff887ded5d>{:ost:ost_msg_check_version+317}
        <ffffffff887e4c7e>{:ost:ost_handle+13454}
        <ffffffff801823a9>{cache_alloc_refill+109}
        <ffffffff88515995>{:obdclass:class_handle2object+213}
        <ffffffff885b2765>{:ptlrpc:lustre_msg_get_conn_cnt+53}
        <ffffffff8012bac9>{find_busiest_group+360}
        <ffffffff885bc60a>{:ptlrpc:ptlrpc_check_req+26}
        <ffffffff885be867>{:ptlrpc:ptlrpc_server_handle_request+2503}
        <ffffffff8010f239>{do_gettimeofday+92}
        <ffffffff8847c3d6>{:libcfs:lcw_update_time+38}
        <ffffffff8012ac78>{__wake_up_common+64}
        <ffffffff885c19d1>{:ptlrpc:ptlrpc_main+3745}
        <ffffffff8012c8a9>{default_wake_function+0}
        <ffffffff8010bfc2>{child_rip+8}
        <ffffffff885c0b30>{:ptlrpc:ptlrpc_main+0}
        <ffffffff8010bfba>{child_rip+0}
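
(For reference, a task dump like the one above can usually be produced on 
demand with the kernel's sysrq facility; that is plain Linux rather than 
anything Lustre-specific, and it assumes sysrq is enabled and the ring 
buffer is big enough to hold all 512 thread stacks:)

  # enable all sysrq functions if they are not already
  echo 1 > /proc/sys/kernel/sysrq
  # dump every task's stack into the kernel ring buffer
  echo t > /proc/sysrq-trigger
  # pick the ost_io threads out of the log
  dmesg | grep -A 30 ll_ost_io
  # or just check where the threads are sleeping
  ps -eLo pid,stat,wchan:30,comm | grep ll_ost_io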

Rebooting cleared up the problem.

This appears to be Lustre bug 16129. Can you confirm? Also, if this is 
the issue, it looks as though a patch is available: bug 16129 says the 
patch is attached to bug 17748, but it wasn't clear to me which of the 
attachments there is the patch.

Thanks,
Brian Stone

Brian J. Murrell wrote:
> On Mon, 2009-02-09 at 13:39 -0500, Brian Stone wrote:
>   
>> Some clients are still evicted, but not all. Why would the recovery 
>> complete if all clients did not reconnect?
>>     
>
> Recovery can't wait indefinitely for all clients to connect.  If it did,
> you could wind up with a recovery that never completes.  After a
> timeout, if not all clients have reconnected, recovery is aborted and
> the target proceeds to the completion state so that it can respond to
> new requests.
>
> b.
>  
>   
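
As a side note for anyone else hitting this: on 1.6 the recovery window 
Brian describes can be watched per target while it runs. Paths below are 
from memory, so double-check them on your release; the exact fields vary, 
but the status file should at least show whether recovery is still in 
progress and how many clients have reconnected:

  # on each OSS (one file per OST target)
  cat /proc/fs/lustre/obdfilter/*/recovery_status
  # and the equivalent on the MDS
  cat /proc/fs/lustre/mds/*/recovery_status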