<div dir="auto">Bumping this thread. </div><div dir="auto">The issue is still relevant. </div><div dir="auto"><br></div><div dir="auto">Thanks, </div><div dir="auto">Stepan </div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">---------- Forwarded message ---------<br>From: <strong class="gmail_sendername" dir="auto">Stepan Beskrovnyy</strong> <span dir="auto"><<a href="mailto:bsm099@gmail.com">bsm099@gmail.com</a>></span><br>Date: Thu, Apr 16, 2026 at 3:33 PM<br>Subject: Re: [lustre-discuss] EC branch not working with actually “killing” targets<br>To: Patrick Farrell <<a href="mailto:pfarrell@ddn.com">pfarrell@ddn.com</a>><br></div><br><br><div dir="ltr"><div><div><div>Additional results. <br><br></div>I ran the same test as test_40a in sanity-ec.sh: <br><br>test_40a() {<br> enable_ec<br> [[ $OSTCOUNT -lt 6 ]] && skip_env "needs >= 6 OSTs"<br><br> local tf=$DIR/$tfile<br><br> $LFS setstripe -E -1 -S 1M -c 4 --ec 4+2 $tf ||<br> error "setstripe --ec 4+2 failed"<br><br> dd if=/dev/urandom of=$tf bs=1M count=4 || error "dd failed"<br> $LFS mirror resync $tf || error "mirror resync failed"<br><br> local sum1=$(md5sum $tf | awk '{print $1}')<br><br> ec_start_read_fault<br> stack_trap "ec_stop_read_fault"<br><br> local sum2=$(md5sum $tf | awk '{print $1}')<br> [[ "$sum1" == "$sum2" ]] ||<br> error "data changed after read fault: $sum1 vs $sum2"<br> echo "** Checksum before: $sum1 after: $sum2"<br>}<br>run_test 40a "test recovery read" <br><br><br><br></div>For the results, see the picture: <br><br><br><img src="cid:ii_mo1givbb2" alt="image.png" style="width: 844px; max-width: 100%;"><br><div></div>As far as I can see, mirror resync is not working (a bug?). I used revision <a href="https://review.whamcloud.com/c/fs/lustre-release/+/64970" target="_blank">https://review.whamcloud.com/c/fs/lustre-release/+/64970</a> (Patchset 33). <br><br></div><div>Any ideas for a hotfix? 
My guess is that something broke during the refactoring of <span role="gridcell"><a href="https://review.whamcloud.com/c/fs/lustre-release/+/64970/33/lustre/utils/liblustreapi_layout.c" target="_blank"><span title="lustre/utils/liblustreapi_layout.c"><span>lustre/</span><span>utils/</span><span>liblustreapi_layout.c </span></span></a></span></div><div><span role="gridcell"><br></span></div><div><span role="gridcell">Thanks, </span></div></div><div dir="ltr"><div><span role="gridcell"><br></span></div><div><span role="gridcell">Stepan</span></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Apr 15, 2026 at 2:22 PM Stepan Beskrovnyy <<a href="mailto:bsm099@gmail.com" target="_blank">bsm099@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)"><div dir="ltr"><div><div><div><div><div>Hi Patrick! <br><br></div>I'm trying the latest commits on the EC branch. <br><br></div>It built fine, then I set up a 4+2 scheme on 6 OSTs, all hosted on the same OSS. <br></div>I created a 1 GB random file, and all parity blocks ended up in the stale state. <br></div>Mirror resync can't see the file (a bug?). <br></div><div>See the attached pictures. <br><br><br></div><div>I unmounted one of the OSTs and got an I/O error immediately (which is good, to be honest :) ). <br><br></div><div>But I have no idea why resync is not working or why the parity chunks are STALE by default. <br><br></div><div>Thoughts? 
<br><br></div><div>Thanks, <br></div><div>Stepan </div><div><img src="cid:ii_mnzylwnm2" alt="image.png" style="width: 844px; max-width: 100%;"><br><img src="cid:ii_mnzym6y53" alt="image.png" style="width: 844px; max-width: 100%;"><br><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Apr 14, 2026 at 9:29 PM Patrick Farrell <<a href="mailto:pfarrell@ddn.com" target="_blank">pfarrell@ddn.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)"><div>
<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
It may not work right now, to be honest.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
We haven't been testing full server failure yet - we're still working with introducing IO errors on individual OSTs. Testing for
<i style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif">that</i> works in most cases; perhaps you can scale back to that for now?</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Sorry it's not further along, but that's where it's at right now - it is very much in development rather than ready. We might have it ready in the next few weeks, but we can't say for sure yet.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
-Patrick</div>
<div id="m_8852766854412107699m_-7081531694692939761m_-854487528585035114appendonsend"></div>
<hr style="display:inline-block;width:98%">
<div id="m_8852766854412107699m_-7081531694692939761m_-854487528585035114divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(0,0,0)"><b style="font-family:Calibri,sans-serif">From:</b> Stepan Beskrovnyy <<a href="mailto:bsm099@gmail.com" target="_blank" style="font-family:Calibri,sans-serif">bsm099@gmail.com</a>><br>
<b style="font-family:Calibri,sans-serif">Sent:</b> Tuesday, April 14, 2026 8:56 AM<br>
<b style="font-family:Calibri,sans-serif">To:</b> Patrick Farrell <<a href="mailto:pfarrell@ddn.com" target="_blank" style="font-family:Calibri,sans-serif">pfarrell@ddn.com</a>><br>
<b style="font-family:Calibri,sans-serif">Subject:</b> Re: [lustre-discuss] EC branch not working with actually “killing” targets</font>
<div> </div>
</div>
<div>
<div dir="ltr">
<div>
<div>
<div>
<div>
<div>
<div>Hey Patrick! <br>
<br>
</div>
I ran a lot of tests on the EC feature; sometimes it worked, sometimes it did not. <br>
<br>
</div>
Could you please describe, step by step, how to test the feature correctly? <br>
<br>
</div>
Ultimately, I want to shut down 1 or 2 Object Storage Servers (OSS) by rebooting them, and then read a file from the Lustre filesystem in degraded-read mode, reconstructing it from the parity blocks. <br>
<br>
</div>
Can you explain how to test this properly? <br>
<br>
</div>
Thanks, <br>
</div>
Stepan</div>
<br>
<div>
<div dir="ltr">On Sat, Apr 11, 2026 at 2:12 AM Patrick Farrell <<a href="mailto:pfarrell@ddn.com" target="_blank">pfarrell@ddn.com</a>> wrote:<br>
</div>
<blockquote style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)">
<div>
<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Stepan,</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Sorry, I've lost track of our chain a bit - you're right about the build failure in the other message.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
It was something specific to some OSes, I think. If you want to try <a href="https://review.whamcloud.com/c/fs/lustre-release/+/64970" target="_blank" style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif">
https://review.whamcloud.com/c/fs/lustre-release/+/64970</a> again, that would be good - the build should be fixed shortly. Not sure about the crash, of course.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I will say - this is very much unfinished code, so it may be difficult to get things working.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
If it's crashing, there won't really be debug information from dk, of course. Could you share the end of vmcore-dmesg.txt from /var/crash? Maybe tail -n 200 of it?</div>
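<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
A minimal sketch of pulling that tail, assuming the usual kdump layout of one subdirectory per crash under /var/crash. The helper name latest_crash_dmesg is made up for illustration and is not part of Lustre or kdump:</div>

```shell
# Print the last N lines of the newest vmcore-dmesg.txt under a crash directory.
# Assumed layout: <crash_dir>/<one subdirectory per crash>/vmcore-dmesg.txt
latest_crash_dmesg() {
    crash_dir=${1:-/var/crash}
    lines=${2:-200}
    # ls -t sorts by modification time, newest first; keep only the first match
    newest=$(ls -t "$crash_dir"/*/vmcore-dmesg.txt 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        echo "no vmcore-dmesg.txt found under $crash_dir" >&2
        return 1
    fi
    tail -n "$lines" "$newest"
}
```

<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Called with no arguments it reads from /var/crash and prints 200 lines; on most systems that needs root.</div>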
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
-Patrick</div>
<div id="m_8852766854412107699m_-7081531694692939761m_-854487528585035114x_m_-4480255084746058134appendonsend"></div>
<hr style="display:inline-block;width:98%">
<div id="m_8852766854412107699m_-7081531694692939761m_-854487528585035114x_m_-4480255084746058134divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(0,0,0)"><b style="font-family:Calibri,sans-serif">From:</b> Stepan Beskrovnyy <<a href="mailto:bsm099@gmail.com" target="_blank" style="font-family:Calibri,sans-serif">bsm099@gmail.com</a>><br>
<b style="font-family:Calibri,sans-serif">Sent:</b> Friday, April 10, 2026 8:04 AM<br>
<b style="font-family:Calibri,sans-serif">To:</b> Patrick Farrell <<a href="mailto:pfarrell@ddn.com" target="_blank" style="font-family:Calibri,sans-serif">pfarrell@ddn.com</a>><br>
<b style="font-family:Calibri,sans-serif">Subject:</b> Re: [lustre-discuss] EC branch not working with actually “killing” targets</font>
<div> </div>
</div>
<div>
<div dir="auto">
<div dir="auto" style="border-color:rgb(0,0,0)">Hi Patrick! </div>
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">I made a 20+4 scheme.</div>
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">Created a 2GB file.</div>
<div dir="auto" style="border-color:rgb(0,0,0)">Performed mirror resync – successful, all parity objects went from the stale state to init.</div>
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">Killed the first node, which took down the 4 OSTs on it.</div>
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">Calculated the checksum – it matched, but it took even less time than before killing the node. Conclusion: the file was cached.</div>
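<div dir="auto" style="border-color:rgb(0,0,0)">To rule out the client cache before re-reading, one option is to flush it first. This is only a sketch: the function name and the RUN dry-run switch are made up for illustration, and it assumes root on the Lustre client.</div>

```shell
# Flush client-side caches so that the next read has to go back to the OSTs.
# Set RUN=echo to print the commands instead of executing them (dry run).
drop_client_caches() {
    run=${RUN:-}
    # Write out dirty data first
    $run sync
    # Cancel this client's unused OSC locks, which also drops the pages they cover
    $run lctl set_param 'ldlm.namespaces.*osc*.lru_size=clear'
    # Ask the kernel to drop clean page cache, dentries and inodes
    $run sysctl -w vm.drop_caches=3
}
```

<div dir="auto" style="border-color:rgb(0,0,0)">Running RUN=echo drop_client_caches shows the three commands without touching the system.</div>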
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">Manually updated the state of the OSTs on the client, marking those on the killed node as INACTIVE, using: </div>
<div dir="auto" style="border-color:rgb(0,0,0)">lctl set_param osc.lus-ec-OSTxxxx-osc-*.active=0</div>
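<div dir="auto" style="border-color:rgb(0,0,0)">A sketch of generating that command for several OSTs. It relies on Lustre target names carrying the OST index as four hex digits (OST0000, OST000a, ...); the helper names are made up, and the indices in the usage note are placeholders, not the ones from this test.</div>

```shell
# Build the OSC "active" parameter name for one OST.
# fsname: filesystem name (lus-ec in this thread); index: decimal OST index.
osc_active_param() {
    fsname=$1
    index=$2
    printf 'osc.%s-OST%04x-osc-*.active\n' "$fsname" "$index"
}

# Print (not run) the lctl commands that would deactivate the given OSTs
# on this client.
deactivate_cmds() {
    fsname=$1
    shift
    for index in "$@"; do
        printf 'lctl set_param %s=0\n' "$(osc_active_param "$fsname" "$index")"
    done
}
```

<div dir="auto" style="border-color:rgb(0,0,0)">For example, deactivate_cmds lus-ec 0 1 2 3 prints one lctl line per OST, which can be reviewed before being pasted into a root shell.</div>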
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">Opened device_list – the correct OSTs on the killed node are marked INACTIVE.</div>
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">Ran the checksum calculation – the client crashed with a core dump and rebooted.</div>
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">From other clients – I/O error when accessing the test file.</div>
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">Conclusion: It didn't work.</div>
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">Ran the same test with a 16+8 scheme: same result. </div>
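<div dir="auto" style="border-color:rgb(0,0,0)">As a sanity check on what these schemes should tolerate, here is a small arithmetic sketch. It assumes a code in which any k of the k+p chunks are enough to rebuild the data (Reed-Solomon style) and that each lost OST holds at most one chunk of a given stripe; whether the EC branch behaves this way today is exactly what is in question. The function names are made up.</div>

```shell
# Succeeds if a data+parity layout should, in principle, survive losing
# `lost` targets: the surviving chunks must still number at least `data`.
ec_can_recover() {
    data=$1
    parity=$2
    lost=$3
    [ $(( data + parity - lost )) -ge "$data" ]
}

# Parity overhead as a whole-number percentage of the data size.
ec_overhead_pct() {
    echo $(( $2 * 100 / $1 ))
}

ec_can_recover 20 4 4 && echo "20+4 with 4 OSTs lost: recoverable in principle"
```

<div dir="auto" style="border-color:rgb(0,0,0)">Under these assumptions, losing one node with 4 OSTs sits right at the limit for 20+4 and well inside it for 16+8, so both reads would be expected to succeed once degraded reads work.</div>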
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">
<div dir="auto" style="border-color:rgb(0,0,0)">To speed up the tests, I set the adaptive timeouts at_max and at_min to 0, timeout = 20, and ldlm_timeout = 20. Could this be the reason? Perhaps some errors with RPC sync? </div>
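<div dir="auto" style="border-color:rgb(0,0,0)">To check whether the shortened timeouts matter, the values can be saved and put back between runs. A minimal sketch, assuming the four tunables are visible to lctl under the same names used above; the function names are made up.</div>

```shell
# Save the current timeout tunables as name=value lines.
# Relies on `lctl get_param -n`, which prints only the value.
save_timeouts() {
    for p in at_min at_max timeout ldlm_timeout; do
        printf '%s=%s\n' "$p" "$(lctl get_param -n "$p")"
    done
}

# Re-apply name=value lines read from stdin, one lctl call per line.
restore_timeouts() {
    while IFS= read -r kv; do
        lctl set_param "$kv"
    done
}
```

<div dir="auto" style="border-color:rgb(0,0,0)">Typical use would be save_timeouts > /tmp/timeouts.saved before changing anything and restore_timeouts < /tmp/timeouts.saved afterwards, then repeating the failing read with the original values.</div>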
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">I also tried collecting debug info with lctl dk as you recommended, but I see NO debug information about EC at all. </div>
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">Thanks, </div>
<div dir="auto" style="border-color:rgb(0,0,0)"><br>
</div>
<div dir="auto" style="border-color:rgb(0,0,0)">Stepan</div>
</div>
</div>
<div><br>
<div>
<div dir="ltr">On Fri, Apr 10, 2026 at 3:32 PM Patrick Farrell <<a href="mailto:pfarrell@ddn.com" target="_blank">pfarrell@ddn.com</a>> wrote:<br>
</div>
<blockquote style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)">
<div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Stepan,</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
You'll have to give a bit more detail about how it fails, sorry! Happy to have you interested.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Patrick</div>
<div id="m_8852766854412107699m_-7081531694692939761m_-854487528585035114x_m_-4480255084746058134x_m_3917808877587152443appendonsend"></div>
<hr style="display:inline-block;width:98%">
<div dir="ltr" id="m_8852766854412107699m_-7081531694692939761m_-854487528585035114x_m_-4480255084746058134x_m_3917808877587152443divRplyFwdMsg">
<font face="Calibri, sans-serif" style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(0,0,0)"><b style="font-family:Calibri,sans-serif">From:</b> lustre-discuss <<a href="mailto:lustre-discuss-bounces@lists.lustre.org" style="font-family:Calibri,sans-serif" target="_blank">lustre-discuss-bounces@lists.lustre.org</a>>
on behalf of Stepan Beskrovnyy via lustre-discuss <<a href="mailto:lustre-discuss@lists.lustre.org" style="font-family:Calibri,sans-serif" target="_blank">lustre-discuss@lists.lustre.org</a>><br>
<b style="font-family:Calibri,sans-serif">Sent:</b> Thursday, April 9, 2026 8:33 AM<br>
<b style="font-family:Calibri,sans-serif">To:</b> <a href="mailto:lustre-discuss@lists.lustre.org" style="font-family:Calibri,sans-serif" target="_blank">
lustre-discuss@lists.lustre.org</a> <<a href="mailto:lustre-discuss@lists.lustre.org" style="font-family:Calibri,sans-serif" target="_blank">lustre-discuss@lists.lustre.org</a>><br>
<b style="font-family:Calibri,sans-serif">Subject:</b> [lustre-discuss] EC branch not working with actually “killing” targets</font>
<div> </div>
</div>
</div>
<div>
<div>Hey everyone!
<div dir="auto"><br>
</div>
<div dir="auto">A new update on the EC branch. </div>
<div dir="auto"><br>
</div>
<div dir="auto">I tested several server shutdown scenarios and found a failure case. If I unmount an OST on an OSS, or kill a whole OSS (by rebooting the server), the filesystem goes down and stops working entirely. </div>
<div dir="auto">With 24 OSTs and -c 16, fine, I know why the system goes down. But with --ec--expert the mirror resync function dies too (it lags and dumps core). </div>
<div dir="auto"><br>
</div>
<div dir="auto">Thoughts ? </div>
<div dir="auto"><br>
</div>
<div dir="auto"><br>
</div>
<div dir="auto">Thanks, </div>
<div dir="auto"><br>
</div>
<div dir="auto">Stepan</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div></blockquote></div>
</blockquote></div>
</div></div>