[Lustre-discuss] Client Cannot Mount File System

Mohr Jr, Richard Frank (Rick Mohr) rmohr at utk.edu
Mon Jun 16 09:43:07 PDT 2014


Is it possible that your problematic node had the same IPoIB address configured as another one of your nodes?  That could explain why the problem was resolved when you changed the IP address.  I ran into a similar issue on one of our clusters a while back.  I don't recall if I had any problems mounting the file system, but we had some nodes that were plagued by timeouts and other unexplained errors.  Once we discovered that the same IP was being used on two different nodes, we fixed it and everything worked fine after that.
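
If you still have the old address handy, here is a rough sketch of how I would check whether something else on the fabric is answering for it (the interface name ib0 and the suspect address are placeholders; adjust for your setup):

  # Duplicate address detection over the IPoIB interface; any reply
  # means another host already claims the address:
  arping -D -c 3 -I ib0 <suspect-IPoIB-address>

  # From a second node, see which hardware address answers for it:
  ping -c 1 <suspect-IPoIB-address>
  ip neigh show <suspect-IPoIB-address>

  # Confirm the NID the client advertises and that it can reach the MGS:
  lctl list_nids
  lctl ping 10.13.68.1@o2ib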

-- 
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu


On Jun 12, 2014, at 10:18 AM, Charles Taylor <chasman at ufl.edu> wrote:

> MDS/OSSs: 1.8.8-wc1_2.6.18_308.4.1.el5_gbc88c4c
> Client:           1.8.9-wc1_2.6.32_358.23.2.el6
> 
> One (out of hundreds) of our clients has been unable to mount our Lustre file system.  We could find no host or network issues.  Attempts to mount yielded the following on the client:
> 
> mount -t lustre -o localflock 10.13.68.1@o2ib:10.13.68.2@o2ib:/lfs /lfs/scratch
> mount.lustre: mount 10.13.68.1@o2ib:10.13.68.2@o2ib:/lfs at /lfs/scratch failed:
> Interrupted system call
> Error: Failed to mount 10.13.68.1@o2ib:10.13.68.2@o2ib:/lfs
> 
> with the following syslog messages.
> 
> Jun 10 15:21:05 r15a-s40 kernel: Lustre: 1269:0:(o2iblnd_cb.c:1813:kiblnd_close_conn_locked()) Closing conn to 10.13.79.252@o2ib2: error 0(waiting)
> Jun 10 15:21:05 r15a-s40 kernel: LustreError: 166-1: MGC10.13.68.1@o2ib: Connection to service MGS via nid 10.13.68.1@o2ib was lost; in progress operations using this service will fail.
> Jun 10 15:21:05 r15a-s40 kernel: LustreError: 15c-8: MGC10.13.68.1@o2ib: The configuration from log 'lfs-client' failed (-4). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> Jun 10 15:21:05 r15a-s40 kernel: LustreError: 4012:0:(llite_lib.c:1099:ll_fill_super()) Unable to process log: -4
> Jun 10 15:21:05 r15a-s40 kernel: LustreError: 4012:0:(lov_obd.c:1012:lov_cleanup()) lov tgt 1 not cleaned! deathrow=0, lovrc=1
> Jun 10 15:21:05 r15a-s40 kernel: LustreError: 4012:0:(lov_obd.c:1012:lov_cleanup()) Skipped 5 previous similar messages
> Jun 10 15:21:05 r15a-s40 kernel: LustreError: 4012:0:(lov_obd.c:1012:lov_cleanup()) lov tgt 13 not cleaned! deathrow=1, lovrc=1
> Jun 10 15:21:05 r15a-s40 kernel: LustreError: 4012:0:(mdc_request.c:1500:mdc_precleanup()) client import never connected
> Jun 10 15:21:05 r15a-s40 kernel: Lustre: MGC10.13.68.1@o2ib: Reactivating import
> Jun 10 15:21:05 r15a-s40 kernel: Lustre: MGC10.13.68.1@o2ib: Connection restored to service MGS using nid 10.13.68.1@o2ib.
> Jun 10 15:21:05 r15a-s40 kernel: Lustre: client lfs-client(ffff88061e105c00) umount complete
> Jun 10 15:21:05 r15a-s40 kernel: LustreError: 4012:0:(obd_mount.c:2067:lustre_fill_super()) Unable to mount  (-4)
> 
> Nothing noteworthy on the MDS.   
> 
> After reconfiguring the client with a new IPoIB address (and hence a new NID), it was able to mount with no problems and is working fine.  Additionally, the MDS was rebooted at least once while this client was unable to mount, so whatever state was blocking the mount survived the reboot - presumably persisted on the MDT.
> 
> I'm particularly curious about the "ll_fill_super" message.  To what "log" is it referring?   
> 
> Anyone seen this before and have an idea what we need to clear on the MDS/MDT to allow this client to successfully mount the file system again?
> 
> Thanks,
> 
> Charlie Taylor
> UF Research Computing
> 
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
