[Lustre-discuss] Newbie w/issues

Cliff White cliff.white@oracle.com
Tue Apr 27 17:00:35 PDT 2010


Brian Andrus wrote:
> Ok, I inherited a Lustre filesystem used on a cluster.
> 
> I am seeing an issue where, on the frontend, I see all of /work.
> On the nodes, however, I only see SOME of the users' directories.

That's rather odd. The directory structure is all on the MDS, so
it's usually either all there or not there. Are any of the user errors
permission-related? That's the only thing I can think of that would
change which directories one node sees vs. another.
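As a quick check, you could compare listings from the frontend and a
node and then look at the modes of anything that differs. A minimal
sketch (the node name and username below are just placeholders):

    # On the frontend: record the top-level directories under /work
    ls -1 /work | sort > /tmp/work.frontend

    # Same listing as seen from a compute node
    ssh compute-0-0 'ls -1 /work | sort' > /tmp/work.node

    # Directories visible on one side but not the other
    diff /tmp/work.frontend /tmp/work.node

    # Ownership/modes of a directory that is missing on the node
    ls -ld /work/someuser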
> 
> /work consists of one MDT/MGS and 3 OSTs.
> The OSTs are LVMs served from a DDN via InfiniBand.
> 
> Running the kernel modules/client on the nodes/frontend:
> lustre-client-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
> lustre-client-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
> 
> and on the OST/MDT:
> lustre-modules-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
> kernel-2.6.18-164.11.1.el5_lustre.1.8.2
> lustre-1.8.2-2.6.18_164.11.1.el5_lustre.1.8.2
> lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.2
> 
> I have so many error messages in the logs, I am not sure which to sift 
> through for this issue.
> A quick tail on the MDT:
> =========================
> Apr 27 16:15:19 nas-0-1 kernel: LustreError: 
> 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
> (-107)  req@ffff810669d35c50 x1334203739385128/t0 o400-><?>@<?>:0/0 lens 
> 192/0 e 0 to 0 dl 1272410135 ref 1 fl Interpret:H/0/0 rc -107/0
> Apr 27 16:15:19 nas-0-1 kernel: LustreError: 
> 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 419 previous 
> similar messages
> Apr 27 16:16:38 nas-0-1 kernel: LustreError: 
> 4155:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS 
> from 12345-10.1.255.55@tcp
> Apr 27 16:16:38 nas-0-1 kernel: LustreError: 
> 4155:0:(handler.c:1518:mds_handle()) Skipped 177 previous similar messages
> Apr 27 16:25:21 nas-0-1 kernel: LustreError: 
> 6789:0:(mgs_handler.c:573:mgs_handle()) lustre_mgs: operation 400 on 
> unconnected MGS
> Apr 27 16:25:21 nas-0-1 kernel: LustreError: 
> 6789:0:(mgs_handler.c:573:mgs_handle()) Skipped 229 previous similar 
> messages
> Apr 27 16:25:21 nas-0-1 kernel: LustreError: 
> 6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error 
> (-107)  req@ffff810673a78050 x1334009404220652/t0 o400-><?>@<?>:0/0 lens 
> 192/0 e 0 to 0 dl 1272410737 ref 1 fl Interpret:H/0/0 rc -107/0
> Apr 27 16:25:21 nas-0-1 kernel: LustreError: 
> 6789:0:(ldlm_lib.c:1848:target_send_reply_msg()) Skipped 404 previous 
> similar messages
> Apr 27 16:26:41 nas-0-1 kernel: LustreError: 
> 4173:0:(handler.c:1518:mds_handle()) operation 400 on unconnected MDS 
> from 12345-10.1.255.46@tcp
> Apr 27 16:26:41 nas-0-1 kernel: LustreError: 
> 4173:0:(handler.c:1518:mds_handle()) Skipped 181 previous similar messages
> =========================
> 

The ENOTCONN (-107) errors point at server/network health. Operation
400 is OBD_PING, so these look like evicted or stale clients still
pinging the servers. I would unmount the clients, verify server health,
and then verify LNET connectivity. However, this would not explain the
missing directories. In the absence of other explanations, check the
MDT with fsck; that's more of a generically useful thing to do than
something indicated by your data.
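A rough sequence, assuming ldiskfs backends (the NID and device name
below are placeholders for your MDS NID and MDT volume):

    # From a client: can we reach the MDS over LNET?
    lctl ping 10.1.255.1@tcp

    # On each server: confirm the NIDs and device states look sane
    lctl list_nids
    lctl dl

    # With the MDT unmounted, do a read-only pass first
    # (use the Lustre-patched e2fsprogs for ldiskfs targets)
    e2fsck -fn /dev/mdtvol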

I would also look through older logs, if available, and see if you can
find the point in time where things went bad. The first error is always
the most useful.
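Something like the following can surface the earliest errors in the
rotated syslogs (default RHEL paths; adjust for your syslog setup):

    # First LustreError in each log, oldest rotation first
    for f in /var/log/messages.4 /var/log/messages.3 \
             /var/log/messages.2 /var/log/messages.1 \
             /var/log/messages; do
        echo "== $f =="
        grep -m 1 'LustreError' "$f"
    done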
> Any direction/insight would be most helpful.

Hope this helps
cliffw

> 
> Brian Andrus
> 
> 



