[lustre-discuss] Does Lustre client cause bus error when running code?

Trung Đặng Đinh Quốc tyler at cinnamon.is
Thu Oct 28 01:32:30 PDT 2021


Hello,

While working with Lustre, I have observed that my running code occasionally
encounters bus errors.
The details shown by *"sudo dmesg -T"* are as follows:

> INFO: task python:29299 blocked for more than 120 seconds.
>       Tainted: P      OE        4.19.0-9-cloud-amd64 #1 Debian 4.19.118-2+deb10u1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> python          D         29299             14931                0x80000000
> Call Trace:
> ? __schedule+0x2a2/0x870
> ? _cond_resched+0x15/0x30
> schedule+0x28/0x80
> rwsem_down_read_failed+0x111/0x180
> call_rwsem_down_read_failed+0x14/0x30
> down_read+0x1c/0x30
> do_exit+0x22d/0xb90
> ? lprocfs_counter_add+0xd2/0x140 [obdclass]
> do_group_exit+0x3a/0xa0
> get_signal+0x36/0x610
> ? handle_mm_fault+0xd6/0x200
> ? up_read+0x1b/0x20
> ? __do_page_fault+0x26c/0x4f0
> ? page_fault+0x8/0x30
> exit_to_usermode_loop+0x89/0xf0
> prepare_exit_to_usermode+0x55/0x60
> retint_user+0x8/0x8
> RIP: 0033:0x7fa2612f73a0
> Code: Bad RIP value.


The details shown by the Python faulthandler are as follows:

> Fatal Python error: Bus error
> RuntimeError: DataLoader worker (pid 4882) is killed by signal: Bus error.
> It is possible that dataloader's workers are out of shared memory. Please
> try to raise your shared memory limit.


I'm not sure whether the cause of this error is related to the Lustre
client, or whether it is related to memory issues. However, I read
somewhere that storing executables on a Lustre filesystem (in my case,
the Anaconda and Python executables) may lead to bus errors.
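
To help tell the two suspected causes apart, a minimal Python sketch along the following lines could report which filesystem type the running interpreter lives on and how much space is free in /dev/shm. It assumes a Linux client where /proc/mounts is readable and where a Lustre mount is listed with filesystem type "lustre"; the helper names (parse_mounts, fs_type_for, report) are hypothetical and not part of Lustre, Python, or PyTorch. It only gathers facts and does not by itself confirm either cause.

```python
import os
import shutil
import sys


def parse_mounts(text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # Fields are: device, mount point, filesystem type, options, ...
            mounts.append((fields[1], fields[2]))
    return mounts


def fs_type_for(path, mounts):
    """Return the fs type of the longest mount point containing path.

    `path` is expected to be absolute and already resolved. Mount points
    with escaped characters (e.g. "\\040" for spaces) are not decoded.
    """
    best_mount, best_type = "", None
    for mount_point, fs_type in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same point override an earlier one.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type


def report():
    """Print where the interpreter lives and how full /dev/shm is."""
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as f:
            mounts = parse_mounts(f.read())
        exe = os.path.realpath(sys.executable)
        print("interpreter:", exe, "fs type:", fs_type_for(exe, mounts))
    if os.path.isdir("/dev/shm"):
        usage = shutil.disk_usage("/dev/shm")
        print("/dev/shm free MiB:", usage.free // 2**20)


if __name__ == "__main__":
    report()
```

If the interpreter turns out to be on a "lustre" mount, copying the environment to local disk and rerunning would be one way to test the executable-on-Lustre theory; if /dev/shm is nearly full while the DataLoader workers run, the shared memory hint in the faulthandler output becomes the more likely lead.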

Additional details:
My Lustre server version is *2.10.8*, on CentOS 7.9 systems, kernel
*3.10.0-957.1.3.el7_lustre.x86_64*.
My Lustre client version is *2.14.54*, on Debian 10 systems, kernel
*4.19.0-9-cloud-amd64*.

May I have some confirmation on this issue? In addition, in case the cause
of this error is related to the Lustre client, what should I do to solve
this problem?
Thank you very much.