[lustre-discuss] lustre 2.5.2 - unable to mount ost

Rafael Perez rjperez at bnl.gov
Mon Nov 23 11:13:02 PST 2015


Hi Rick,

Thanks for your suggestions.  It turns out I was able to get the
filesystem started this morning and restore access to the critical data.
It was a long journey of troubleshooting, but here are the steps I ended
up taking to fix the issue.

- stop the lustre filesystem (umount the osts and mdt/mgt)
- mount the ldiskfs filesystem for the problematic ost (/dev/mapper/ost5 
to /mnt/ost5 in this case)
- backup the CONFIGS/lfs1-client file
     # cp -a /mnt/ost5/CONFIGS/lfs1-client /mnt/ost5/CONFIGS/lfs1-client.ORIG
- copy a working, non-corrupted 'lfs1-client' file from the MGS (from the
mounted ldiskfs filesystem on the MGS) over the bad one on the ost
     (llog_reader gave unexpected output when I ran it against the bad
lfs1-client file, which pointed to corruption in that file)
- umount all ldiskfs filesystems
- run a writeconf on the MDT and all OSTs
     # tunefs.lustre --verbose --writeconf /dev/mapper/ostX
- restart the filesystem
     (this is where lfs1-OST0006 finally mounted!)
- mount the filesystem on a client
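
For reference, the config-file part of the steps above could be scripted
roughly as follows. This is only a sketch under assumptions: the path
/tmp/lfs1-client.from-mgs is a hypothetical location for the good copy
taken from the MGS, and the run and repair_ost_config helpers are
illustrative names, not part of Lustre. It only prints the commands unless
DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Back up and replace the CONFIGS/lfs1-client llog on one OST.
#   $1 = OST block device, $2 = ldiskfs mount point,
#   $3 = known-good lfs1-client copied from the MGS (hypothetical path)
repair_ost_config() {
    ost_dev="$1"
    ost_mnt="$2"
    good_copy="$3"

    run mount -t ldiskfs "$ost_dev" "$ost_mnt"
    # Keep the suspect file so it can still be inspected later.
    run cp -a "$ost_mnt/CONFIGS/lfs1-client" "$ost_mnt/CONFIGS/lfs1-client.ORIG"
    # Dump the suspect llog; unexpected output suggests corruption.
    run llog_reader "$ost_mnt/CONFIGS/lfs1-client"
    run cp -a "$good_copy" "$ost_mnt/CONFIGS/lfs1-client"
    run umount "$ost_mnt"
}

repair_ost_config /dev/mapper/ost5 /mnt/ost5 /tmp/lfs1-client.from-mgs
```

The Lustre targets need to be unmounted before the ldiskfs mount, as in
the first step of the list.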

Our setup has 2 OSS servers (oss1 and oss2), each serving 3 OSTs:
oss1:
/mnt/ost0
/mnt/ost1
/mnt/ost2

oss2:
/mnt/ost3
/mnt/ost4
/mnt/ost5
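
With several OSTs per server, the writeconf step can be looped over the
devices on each OSS. The sketch below assumes the devices follow the
/dev/mapper/ostX naming used above (only ost5 is confirmed in this
thread), and run and writeconf_targets are illustrative helper names. It
prints the commands unless DRY_RUN=0 is set. All targets should be
unmounted while this runs, and the MGS/MDT is normally brought back up
before the OSTs afterwards.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run a writeconf against every device given as an argument.
writeconf_targets() {
    for dev in "$@"; do
        run tunefs.lustre --verbose --writeconf "$dev"
    done
}

# Example for oss2 in this layout (device names are assumed, not confirmed).
writeconf_targets /dev/mapper/ost3 /dev/mapper/ost4 /dev/mapper/ost5
```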

I'm sending this out for reference.

Thanks again,
Rafael


On 11/23/2015 10:57 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
>> On Nov 22, 2015, at 6:12 PM, Perez, Rafael <rjperez at bnl.gov> wrote:
>>
>> LustreError: 10476:0:(mgc_request.c:1707:mgc_llog_local_copy()) MGC172.31.11.121@o2ib: failed to copy remote log lfs1-client: rc = -5
>> LustreError: 13a-8: Failed to get MGS log lfs1-client and no local copy.
>> LustreError: 15c-8: MGC172.31.11.121@o2ib: The configuration from log 'lfs1-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
>> LustreError: 10476:0:(obd_mount_server.c:1285:server_start_targets()) lfs1-OST0006: failed to start LWP: -2
> Does this server have other OSTs that mount? Or is this the only OST on this OSS server?  You can use tune2fs to list the OST config parameters and verify that they are correct.  I have also seen this kind of error when there are network problems.  I would look for IB errors or other signs of problems.  (Maybe even do a bandwidth test to see if it is performing as expected.)  You can also run “lctl ping” to test LNet connectivity between the OSS server and the MGS server.
>
> If the network checks out and it really is the llog that is the problem, you can try doing a writeconf to fix things up.
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>

-- 
Rafael Perez
rjperez at bnl.gov
ITD HPC Support, Sr Technology Engineer
(631) 344-4426


