[lustre-discuss] Fwd: EC branch not working with actually “killing” targets
Stepan Beskrovnyy
bsm099 at gmail.com
Mon Apr 20 00:44:47 PDT 2026
Bumping this thread. The issue is still relevant.
Thanks,
Stepan
---------- Forwarded message ---------
From: Stepan Beskrovnyy <bsm099 at gmail.com>
Date: Thu, Apr 16, 2026 at 15:33
Subject: Re: [lustre-discuss] EC branch not working with actually “killing”
targets
To: Patrick Farrell <pfarrell at ddn.com>
Additional results. I ran the same test as test_40a in sanity-ec.sh:
test_40a() {
	enable_ec
	[[ $OSTCOUNT -lt 6 ]] && skip_env "needs >= 6 OSTs"

	local tf=$DIR/$tfile

	$LFS setstripe -E -1 -S 1M -c 4 --ec 4+2 $tf ||
		error "setstripe --ec 4+2 failed"
	dd if=/dev/urandom of=$tf bs=1M count=4 || error "dd failed"
	$LFS mirror resync $tf || error "mirror resync failed"

	local sum1=$(md5sum $tf | awk '{print $1}')

	ec_start_read_fault
	stack_trap "ec_stop_read_fault"

	local sum2=$(md5sum $tf | awk '{print $1}')
	[[ "$sum1" == "$sum2" ]] ||
		error "data changed after read fault: $sum1 vs $sum2"
	echo "** Checksum before: $sum1 after: $sum2"
}
run_test 40a "test recovery read"
Results: see the attached screenshot.
[image: image.png]
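For reference, the same sequence can be reproduced by hand outside the test
framework. This is only a sketch under assumptions from my setup: /mnt/lustre
as the mount point, lus-ec-OST0000 as the target name, and deactivating one
OSC as a stand-in for the framework-only ec_start_read_fault helper.

```shell
# Hypothetical mount point and target name; adjust for your system.
tf=/mnt/lustre/ec-test-file

# Same 4+2 erasure-coded layout as test_40a
lfs setstripe -E -1 -S 1M -c 4 --ec 4+2 $tf
dd if=/dev/urandom of=$tf bs=1M count=4
lfs mirror resync $tf       # parity components should leave the stale state
sum1=$(md5sum $tf | awk '{print $1}')

# Force a degraded read by deactivating one OSC on the client
# (stand-in for ec_start_read_fault, which exists only in the test suite)
lctl set_param osc.lus-ec-OST0000-osc-*.active=0
sum2=$(md5sum $tf | awk '{print $1}')
lctl set_param osc.lus-ec-OST0000-osc-*.active=1

[ "$sum1" = "$sum2" ] && echo "checksums match" || echo "checksum MISMATCH"
```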
As far as I can see, mirror resync is not working (bug?). I used revision
https://review.whamcloud.com/c/fs/lustre-release/+/64970 (patchset 33).
Any ideas for a hotfix? My guess is that something broke during the
refactoring of lustre/utils/liblustreapi_layout.c
<https://review.whamcloud.com/c/fs/lustre-release/+/64970/33/lustre/utils/liblustreapi_layout.c>
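To check whether the parity components really are stale (and whether resync
changed anything), the per-component flags can be read straight from the
verbose layout output; the file path here is hypothetical:

```shell
# Verbose layout dump; stale components show "lcme_flags: stale",
# while resynced ones show "lcme_flags: init".
lfs getstripe -v /mnt/lustre/ec-test-file | grep -E 'lcme_id|lcme_flags'
```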
Thanks,
Stepan
On Wed, Apr 15, 2026 at 2:22 PM Stepan Beskrovnyy <bsm099 at gmail.com> wrote:
> Hi Patrick!
>
> I'm trying the latest commits on the EC branch.
>
> It built fine. I then set up a 4+2 scheme on 6 OSTs, all hosted on the
> same OSS, and created a 1 GB random file; all parity blocks came up in
> the stale state. Mirror resync can't see the file (bug?).
> See the attached screenshots.
>
>
> I unmounted one of the OSTs and got an I/O error instantly (that's good,
> to be honest :))
>
> But I have no idea why resync is not working, or why the parity chunks
> became STALE by default.
>
> Thoughts?
>
> Thanks,
> Stepan
> [image: image.png]
> [image: image.png]
>
>
> On Tue, Apr 14, 2026 at 9:29 PM Patrick Farrell <pfarrell at ddn.com> wrote:
>
>> It may not work right now, to be honest.
>>
>> We haven't been testing full server failure yet - we're still working
>> with introducing IO errors on individual OSTs. Testing for *that* works
>> in most cases, perhaps you can scale back to that for now?
>>
>> Sorry it's not further along, but that's where it's at right now - it is
>> very much in development rather than ready. It might be we have it ready
>> in the next few weeks but we can't say currently.
>>
>> -Patrick
>> ------------------------------
>> *From:* Stepan Beskrovnyy <bsm099 at gmail.com>
>> *Sent:* Tuesday, April 14, 2026 8:56 AM
>> *To:* Patrick Farrell <pfarrell at ddn.com>
>> *Subject:* Re: [lustre-discuss] EC branch not working with actually
>> “killing” targets
>>
>> Hey Patrick!
>>
>> Did a lot of things about testing EC feature and sometimes it worked
>> sometimes not.
>>
>> Could you write out a reference test case for me, step by step, so I can
>> test the feature correctly?
>>
>> Ultimately, I want to shut down one or two object storage servers (OSSes)
>> by rebooting them, and then read a file from the Lustre filesystem in
>> degraded mode, with the file reconstructed from the parity blocks.
>>
>> Can you explain the right way to test this?
>>
>> Thanks,
>> Stepan
>>
>> On Sat, Apr 11, 2026 at 2:12 AM Patrick Farrell <pfarrell at ddn.com> wrote:
>>
>> Stepan,
>>
>> Sorry, I've lost track of our chain a bit - you're right about the build
>> failure in the other message.
>>
>> It was something specific to some OSes, I think. If you want to try
>> https://review.whamcloud.com/c/fs/lustre-release/+/64970 again, that
>> would be good - the build should be fixed shortly. Not sure about the
>> crash, of course.
>>
>> I will say - this is very much unfinished code, so it may be difficult to
>> get things working.
>>
>> If it's crashing, there won't really be debug information from dk, of
>> course. Could you share the end of vmcore-dmesg.txt from /var/crash?
>> Maybe tail -n 200 of it?
>>
>> -Patrick
>> ------------------------------
>> *From:* Stepan Beskrovnyy <bsm099 at gmail.com>
>> *Sent:* Friday, April 10, 2026 8:04 AM
>> *To:* Patrick Farrell <pfarrell at ddn.com>
>> *Subject:* Re: [lustre-discuss] EC branch not working with actually
>> “killing” targets
>>
>> Hi Patrick!
>>
>>
>> I made a 20+4 scheme.
>>
>> Created a 2GB file.
>> Performed mirror resync – successful; all parity objects went from the
>> stale state to init.
>>
>> Killed the first node, which took down 4 OSTs.
>>
>> Calculated the checksum – it matched, but it took even less time than
>> before killing the node. Conclusion: the file was served from cache.
>>
>> Manually marked the OSTs on the killed node as INACTIVE with:
>> lctl set_param osc.lus-ec-OSTxxxx-osc-*.active=0
>>
>> Checked the device list – the OSTs on the killed node are correctly
>> marked INACTIVE.
>>
>> Ran the checksum calculation – the client crashed with a core dump and
>> rebooted.
>>
>> From other clients – I/O error when accessing the test file.
>>
>> Conclusion: It didn't work.
>>
>>
>> Ran the same test with a 16+8 scheme -> same result.
>>
>> To speed up the tests, I set the adaptive timeouts at_max and at_min to 0,
>> timeout = 20, and ldlm_timeout = 20. Could that be the reason? Any errors
>> with RPC sync?
>>
>> I also tried collecting debug info with lctl dk as you recommended, but I
>> see no debug information about EC at all.
>>
>> Thanks,
>>
>> Stepan
>>
>> On Fri, Apr 10, 2026 at 15:32, Patrick Farrell <pfarrell at ddn.com> wrote:
>>
>> Stepan,
>>
>> You'll have to give a bit more detail about how it fails, sorry! Happy
>> to have you interested.
>>
>>
>> Patrick
>> ------------------------------
>> *From:* lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on
>> behalf of Stepan Beskrovnyy via lustre-discuss <
>> lustre-discuss at lists.lustre.org>
>> *Sent:* Thursday, April 9, 2026 8:33 AM
>> *To:* lustre-discuss at lists.lustre.org <lustre-discuss at lists.lustre.org>
>> *Subject:* [lustre-discuss] EC branch not working with actually
>> “killing” targets
>>
>> Hey everyone!
>>
>> New update about EC branch.
>>
>> I tested different server shutdowns and found a failure case. If I umount
>> an OST on an OSS, or just kill the whole OSS (reboot the server), the
>> filesystem falls over and stops working entirely.
>> With 24 OSTs and -c 16 it's okay – I understand why the system fails in
>> that case. But with --ec, the mirror resync function dies too (hangs and
>> dumps core).
>>
>> Thoughts?
>>
>>
>> Thanks,
>>
>> Stepan
>>
>>