[Lustre-discuss] OST went down running Lustre 1.6.6
Brian Stone
bgstone at sgi.com
Wed Feb 11 07:23:02 PST 2009
My customer believes that this scenario is leading to corrupted files.
If so, is there any way to avoid the file corruption when an OSS goes down?
Probably related is the following:
A filesystem hang was reported to me on Friday 2/6. The problem appeared
to be localized to (OSS for /nobackupp2). Examination showed that all
512 ll_ost_io threads were stuck waiting for lock callbacks to complete.
I think the stack trace for the ost_io threads indicates that they're
waiting for a blocking AST to complete, requested as part of an
ost_punch (truncate) operation. For example:
ll_ost_io_329 S ffff8103fae15888 0 634 1 635 633 (L-TLB)
ffff8103fae15838 0000000000000046 ffff8100ace9ad40 000000000000000a
ffff8103fae0da48 ffff8103fae0d7f0 ffff810009035800 000578ee6a132423
0000000000001971 00000000000061a8
Call Trace: <ffffffff885903fd>{:ptlrpc:ldlm_expired_completion_wait+173}
<ffffffff88590350>{:ptlrpc:ldlm_expired_completion_wait+0}
<ffffffff88591dfd>{:ptlrpc:ldlm_completion_ast+845}
<ffffffff885780bb>{:ptlrpc:ldlm_lock_enqueue+2171}
<ffffffff8012c8a9>{default_wake_function+0}
<ffffffff88573dca>{:ptlrpc:ldlm_lock_addref_internal_nolock+58}
<ffffffff88590a9b>{:ptlrpc:ldlm_cli_enqueue_local+1275}
<ffffffff885b56f2>{:ptlrpc:lustre_swab_buf+66}
<ffffffff887dae38>{:ost:ost_punch+1320}
<ffffffff8858e270>{:ptlrpc:ldlm_blocking_ast+0}
<ffffffff88591ab0>{:ptlrpc:ldlm_completion_ast+0}
<ffffffff8858afc0>{:ptlrpc:ldlm_glimpse_ast+0}
<ffffffff885b2ff5>{:ptlrpc:lustre_msg_get_opc+53}
<ffffffff887ded5d>{:ost:ost_msg_check_version+317}
<ffffffff887e4c7e>{:ost:ost_handle+13454}
<ffffffff801823a9>{cache_alloc_refill+109}
<ffffffff88515995>{:obdclass:class_handle2object+213}
<ffffffff885b2765>{:ptlrpc:lustre_msg_get_conn_cnt+53}
<ffffffff8012bac9>{find_busiest_group+360}
<ffffffff885bc60a>{:ptlrpc:ptlrpc_check_req+26}
<ffffffff885be867>{:ptlrpc:ptlrpc_server_handle_request+2503}
<ffffffff8010f239>{do_gettimeofday+92}
<ffffffff8847c3d6>{:libcfs:lcw_update_time+38}
<ffffffff8012ac78>{__wake_up_common+64}
<ffffffff885c19d1>{:ptlrpc:ptlrpc_main+3745}
<ffffffff8012c8a9>{default_wake_function+0}
<ffffffff8010bfc2>{child_rip+8}
<ffffffff885c0b30>{:ptlrpc:ptlrpc_main+0}
<ffffffff8010bfba>{child_rip+0}
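(As an aside: a dump like the one above is typically captured with
Alt-SysRq-t, i.e. `echo t > /proc/sysrq-trigger`, on the OSS console.
Below is a minimal sketch of tallying how many ll_ost_io threads show
up in a saved copy of such a dump; the file name `sample.log` and its
contents are hypothetical stand-ins for a real captured log.)

```shell
# Capture all kernel thread stacks on the OSS (run as root):
#   echo t > /proc/sysrq-trigger
# The dump lands in the console log / dmesg. The lines below count the
# ll_ost_io threads in a saved copy; sample.log is a hypothetical
# stand-in for that captured log.
cat > sample.log <<'EOF'
ll_ost_io_329 S ffff8103fae15888 0 634 1 635 633 (L-TLB)
ll_ost_io_330 S ffff8103fae15890 0 636 1 637 635 (L-TLB)
EOF
grep -c '^ll_ost_io_' sample.log
```

If all 512 threads appear with the same ldlm_completion_ast frames, that
matches the "stuck waiting for lock callbacks" symptom described above.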
Rebooting cleared up the problem.
This appears to be Lustre bug 16129. Can you confirm? Also, if this is
the issue, a patch appears to be available: 16129 says the patch is
attached to 17748, but it wasn't clear to me which attachment there
was the patch.
Thanks,
Brian Stone
Brian J. Murrell wrote:
> On Mon, 2009-02-09 at 13:39 -0500, Brian Stone wrote:
>
>> Some clients are still evicted, but not all. Why would the recovery
>> complete if all clients did not reconnect?
>>
>
> Recovery can't wait indefinitely for all clients to connect; if it
> did, you could wind up with a recovery that never completes. After a
> timeout, if all clients have not reconnected, recovery is aborted and
> the target proceeds to the completion state so that it can respond to
> new requests.
>
> b.
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>