[Lustre-discuss] Stalled autofs + lustre summary

Heiko Schröter schroete at iup.physik.uni-bremen.de
Fri Nov 20 00:31:16 PST 2009


Hello,

FYI, over the last few weeks we had lustre mounts stalling in conjunction with automount.
This is a short summary in case you are using automount + lustre.

When lustre gets automounted correctly, you will see messages as in 1).

A user can stall the lustre mount by not giving the complete file name, i.e. by accessing a name that does not exist.
Example file: /lustre_automount/myfile.dat

When lustre is *NOT* mounted, a user can stall the client mount for at least 100s with 'ls /lustre_automount/myfile' (no asterisk after myfile!).
Error messages as in 2) will pop up, including the 'lnet_try_match_md()' sequence.
After that you will see messages of type 3), which may indicate a network problem (hm, well, ok to us ...)
After 100s the user gets back 'ls: cannot access /lustre_automount/myfile.dat: No such file or directory'
After that it looks as if lustre is mounted. But a simple 'ls /lustre_automount/' in a second shell will not return anything and produces the same message sequence as above.

Attention:
When several 'ill-formed' ls commands are sent at once, the lustre mount freezes completely and permanently on that client.
This happened in our case because the command sequence was driven by scripts running in parallel.
You have to 'umount -f /lustre_automount/' or even 'lustre_rmmod' to recover.
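
For scripts, one way to see what the client currently lists at the mount point, without touching the file system itself, is to read the mount table. A minimal sketch, assuming /proc/mounts has the usual "device mountpoint fstype ..." layout; the function name is_mounted_as is made up for illustration and is not part of lustre or autofs. It only tells you whether the kernel lists the mount, not whether the mount is responsive.

```shell
# Hypothetical helper: succeed if mount point $1 is listed with file system
# type $2 in a mounts table ($3, defaulting to /proc/mounts).
# Only the table is read; the mount point itself is never accessed.
is_mounted_as() {
    awk -v mp="$1" -v fs="$2" '
        $2 == mp && $3 == fs { found = 1 }
        END { exit !found }
    ' "${3:-/proc/mounts}"
}
```

Usage could look like: is_mounted_as /lustre_automount lustre || echo "lustre not mounted"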

If umount works correctly, it looks like 4).

Because a lot of other messages appear between 1), 2) and 3), we were misled and searched for the error in the wrong places, especially the MDS/MGS hardware.
Additionally, because of 2), we replaced nearly every network component we could get our hands on.
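
To check a client syslog for the two signatures shown in 2) and 3), something like the following could be used. This is a rough sketch: the function name stall_signatures is hypothetical, and the patterns are taken only from the log excerpts in this mail, so other lustre versions may word the messages differently.

```shell
# Hypothetical helper: print the lines of the given log file(s) that carry
# either signature from this report:
#   - the lnet_try_match_md() "too big" error, as in 2)
#   - the request timeout that follows it, as in 3)
stall_signatures() {
    grep -E 'lnet_try_match_md\(\)\).*too big|has timed out \(limit [0-9]+s\)' "$@"
}
```

Usage could look like: stall_signatures /var/log/messages (the log path is an assumption and depends on your syslog setup).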

By comparison, the same ill-formed ls command over an NFS automount does not stall the system; it returns the 'cannot access' message at once.

Examples of what does work correctly when lustre is not mounted:
a) ls /lustre_automount/myfi*
b) find /lustre_automount -iname 'myfi*' (optionally: -maxdepth 1)
c) lfs find /lustre_automount --name 'myfile*' --maxdepth 1 (returns the file)
d) lfs find /lustre_automount --name 'myfile' --maxdepth 1 (does not return anything, but will not freeze the system)
.....
Another 'ill-formed' command is 'gunzip -c /lustre_automount/myfile > /tmp/test' instead of 'gunzip -c /lustre_automount/myfile.gz > /tmp/test'.
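
The find variant b) can be wrapped so that scripts test for a file by listing the directory rather than by looking up the full path. This is only a sketch: exists_via_find is a hypothetical name, it assumes GNU find, and whether an exact -name pattern is as safe on an unmounted lustre automount as the wildcard forms above has not been verified here.

```shell
# Hypothetical helper: succeed if an entry named $2 exists directly inside
# directory $1. The check goes through find on the directory instead of
# handing the full path to ls, gunzip, etc.
# Note: find treats $2 as a glob pattern, so wildcards in it are expanded.
exists_via_find() {
    # -mindepth 1 excludes the directory itself, -maxdepth 1 stays at top level
    [ -n "$(find "$1" -mindepth 1 -maxdepth 1 -name "$2" -print 2>/dev/null | head -n 1)" ]
}
```

Usage could look like: exists_via_find /lustre_automount myfile.gz && gunzip -c /lustre_automount/myfile.gz > /tmp/test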

The solution seems to be not to use autofs + lustre if the above cannot be ruled out for sure, including mistyping.
Or to tar and feather the user .... that's what we did .... ;-)

Hairless by now
Heiko

################################################################
Gentoo x86_64 GNU/Linux
lustre: 1.6.6
vanilla-kernel 2.6.22.19
autofs 5.0.3-r6
mount 2.14.2
################################################################
Client Syslog. Automount timing 60s + 120s WAIT, just for testing. The same holds true for timeouts of 600s.
1) Mounting OK:
Nov 19 17:29:58 quadcore2 automount[21803]: attempting to mount entry /lustre_automount
Nov 19 17:29:58 quadcore2 Lustre: fs_lustre-OST0006-osc-ffff8101c918b800.osc: set parameter active=0
Nov 19 17:29:58 quadcore2 Lustre: Skipped 16 previous similar messages
Nov 19 17:29:58 quadcore2 LustreError: 24764:0:(lov_obd.c:316:lov_connect_obd()) not connecting OSC fs_lustre-OST0006_UUID; administratively disabled
Nov 19 17:29:58 quadcore2 LustreError: 24764:0:(lov_obd.c:316:lov_connect_obd()) Skipped 13 previous similar messages
Nov 19 17:29:58 quadcore2 Lustre: Client fs_lustre-client has started
Nov 19 17:29:58 quadcore2 automount[21803]: mount(generic): mounted mds1@tcp0:mds2@tcp0:/fs_lustre type lustre on /lustre_automount
Nov 19 17:29:58 quadcore2 automount[21803]: mounted /lustre_automount

2) Mounting failed:
Nov 19 17:43:09 quadcore2 automount[21803]: attempting to mount entry /lustre_automount
Nov 19 17:43:09 quadcore2 Lustre: Client fs_lustre-client has started
Nov 19 17:43:09 quadcore2 automount[21803]: mount(generic): mounted mds1@tcp0:mds2@tcp0:/fs_lustre type lustre on /lustre_automount
Nov 19 17:43:09 quadcore2 automount[21803]: mounted /lustre_automount
Nov 19 17:43:10 quadcore2 LustreError: 25321:0:(lib-move.c:111:lnet_try_match_md()) Matching packet from 12345-192.168.16.122@tcp, match 776 length 1336 too big: 1272 left, 1272 allowed
Nov 19 17:43:16 quadcore2 automount[21803]: 1 remaining in /home

3) The possible network problem message:
Nov 19 17:44:50 quadcore2 Lustre: Request x776 sent from fs_lustre-MDT0000-mdc-ffff8101aac5f400 to NID 192.168.16.122@tcp 100s ago has timed out (limit 100s).
Nov 19 17:44:50 quadcore2 Lustre: fs_lustre-MDT0000-mdc-ffff8101aac5f400: Connection to service fs_lustre-MDT0000 via nid 192.168.16.122@tcp was lost; in progress operations using this service will wait for recovery to complete.
Nov 19 17:44:50 quadcore2 LustreError: 25692:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -4
Nov 19 17:44:50 quadcore2 Lustre: fs_lustre-MDT0000-mdc-ffff8101aac5f400: Connection restored to service fs_lustre-MDT0000 using nid 192.168.16.122@tcp.

4) Umount OK:
Nov 19 17:45:37 quadcore2 automount[21803]: expiring path /lustre_automount
Nov 19 17:45:37 quadcore2 automount[21803]: unmounting dir = /lustre_automount
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Skipped 2 previous similar messages
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Nov 19 17:45:37 quadcore2 LustreError: 25717:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) Skipped 2 previous similar messages
Nov 19 17:45:37 quadcore2 LustreError: 25298:0:(connection.c:155:ptlrpc_put_connection()) NULL connection
Nov 19 17:45:37 quadcore2 LustreError: 25298:0:(connection.c:155:ptlrpc_put_connection()) Skipped 13 previous similar messages
Nov 19 17:45:37 quadcore2 Lustre: client ffff8101aac5f400 umount complete
Nov 19 17:45:37 quadcore2 automount[21803]: expired /lustre_automount


