[lustre-discuss] intermittently can't start ll_agl thread and can't start ll_sa thread, and sysctl kernel.pid_max

Faaland, Olaf P. faaland1 at llnl.gov
Wed Jun 8 17:22:39 PDT 2022


Hi All,

This is not a Lustre problem proper, but others might run into it with a 64-bit Lustre client on RHEL 7, and I hope to save others the time it took us to nail it down.  We saw it on a node running the "Starfish" policy engine, which reads through the entire file system tree repeatedly and consumes changelogs.  Starfish itself creates and destroys processes frequently, and the workload causes Lustre to create and destroy threads as well, by triggering statahead thread creation and changelog thread creation.

For the impatient, the fix was to increase pid_max.  We used:
kernel.pid_max=524288
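
A quick way to check how close a node is to that ceiling (standard sysctl/procps, nothing Lustre-specific; every thread, kernel or user, consumes a PID):

  # current ceiling on PIDs
  sysctl kernel.pid_max
  # rough count of PIDs in use: one line per thread
  ps -eLf --no-headers | wc -l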

The symptoms are:

1) console log messages like
LustreError: 10525:0:(statahead.c:970:ll_start_agl()) can't start ll_agl thread, rc: -12
LustreError: 15881:0:(statahead.c:1614:start_statahead_thread()) can't start ll_sa thread, rc: -12
LustreError: 15881:0:(statahead.c:1614:start_statahead_thread()) Skipped 45 previous similar messages
LustreError: 15878:0:(statahead.c:1614:start_statahead_thread()) can't start ll_sa thread, rc: -12
LustreError: 15878:0:(statahead.c:1614:start_statahead_thread()) Skipped 17 previous similar messages 

Note the return codes are -12, which is -ENOMEM.

2) attempts to create new user-space processes also failing intermittently, for example:

sf_lustre.liblustreCmds 10983 'MainThread' : ("can't start new thread",) [liblustreCmds.py:216]

and

[faaland1 at solfish2 lustre]$git fetch llnlstash
Enter passphrase for key '/g/g0/faaland1/.ssh/swdev': 
Enter passphrase for key '/g/g0/faaland1/.ssh/swdev': 
remote: Enumerating objects: 1377, done.
remote: Counting objects: 100% (1236/1236), done.
remote: Compressing objects: 100% (271/271), done.
error: cannot fork() for index-pack: Cannot allocate memory
fatal: fetch-pack: unable to fork off index-pack

We wasted a lot of time chasing the idea that this was in fact due to insufficient free memory on the node, but the actual problem was that sysctl kernel.pid_max was too low.

When a new process or thread is created via fork(), kthread_create(), or similar, the kernel has to allocate a PID.  It keeps a data structure tracking which PIDs are in use, and there is some delay after a process is destroyed before its PID may be reused.

We found that on this node the kernel would occasionally find no PIDs available when creating a new process.  Specifically, copy_process() would call alloc_pidmap(), which would return -1.  This tended to happen when the system was processing a large number of changes on the file system, so Lustre and Starfish were both suddenly doing a lot of work and both were creating new threads in response to the load.  This node normally runs about 700-800 processes according to top(1).  At the time these errors occurred, I don't know how many processes were running or how quickly they were being created and destroyed.
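
Because the exhaustion is transient, a simple monitor would have caught it in the act; a minimal sketch (the one-second interval and output format are arbitrary):

  #!/bin/bash
  # Log total thread count (each thread holds a PID) alongside the pid_max
  # ceiling, once per second, so short-lived spikes are visible afterwards.
  PID_MAX=$(cat /proc/sys/kernel/pid_max)
  while true; do
      echo "$(date +%s) threads=$(ps -eLf --no-headers | wc -l) pid_max=$PID_MAX"
      sleep 1
  done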

Ftrace showed this:

|        copy_namespaces();
|        copy_thread();
|        alloc_pid() {
|          kmem_cache_alloc() {
|            __might_sleep();
|            _cond_resched();
|          }
|          kmem_cache_free();
|        }
|        exit_task_namespaces() {
|          switch_task_namespaces() {
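
For anyone who wants to reproduce that view, a function_graph trace of the fork path can be captured with the stock ftrace interface under /sys/kernel/debug/tracing; the symbol may need a wildcard, since copy_process is sometimes emitted as copy_process.part.N:

  cd /sys/kernel/debug/tracing
  echo function_graph > current_tracer
  # restrict the graph to fork's workhorse so the output stays readable
  echo 'copy_process*' > set_graph_function
  echo 1 > tracing_on
  # ... reproduce the failure ...
  echo 0 > tracing_on
  cat trace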

On this particular node (32 cores, x86_64, RHEL 7), pid_max was 36K.  We added
kernel.pid_max=524288
to our sysctl.conf, which resolved the issue.
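
For completeness, the same change can be made without a reboot and persisted with a sysctl.d drop-in (the drop-in file name below is arbitrary; editing /etc/sysctl.conf, as we did, works just as well):

  # take effect immediately
  sysctl -w kernel.pid_max=524288
  # persist across reboots
  echo 'kernel.pid_max = 524288' > /etc/sysctl.d/90-pid-max.conf
  sysctl --system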

I don't expect this to be an issue under RHEL 8 (or clone of your choice), because in RHEL 8.2 systemd puts a config file in place that sets pid_max to 2^22.
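
If you want to confirm that on a RHEL 8-family node, something like the following should show it; the drop-in file name is what upstream systemd ships and may differ between distributions:

  sysctl kernel.pid_max                   # should report 4194304 (2^22)
  cat /usr/lib/sysctl.d/50-pid-max.conf   # drop-in shipped by systemd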

-Olaf

