[lustre-discuss] lustre 2.5.2 - unable to mount ost
rjperez at bnl.gov
Mon Nov 23 11:13:02 PST 2015
Thanks for your suggestions. Turns out I was able to get the filesystem
started this morning and restore access to the critical data. It was a
long journey of troubleshooting but here are the steps I ended up taking
to fix the issue.
- stop the lustre filesystem (umount the osts and mdt/mgt)
- mount the ldiskfs filesystem for the problematic ost (/dev/mapper/ost5
to /mnt/ost5 in this case)
- backup the CONFIGS/lfs1-client file
# cp -a /mnt/ost5/CONFIGS/lfs1-client
- copy a working, non-corrupted 'lfs1-client' file from the MGS (from the
mounted ldiskfs filesystem on the MGS) over the bad copy on the OST
(llog_reader gave unexpected output when I ran it against the bad
lfs1-client file, which pointed to corruption)
- umount all ldiskfs filesystems
- run a writeconf to the MDS and all OSTs
# tunefs.lustre --verbose --writeconf /dev/mapper/ostX
- restart the filesystem
(this is where lfs1-OST0006 finally mounted!)
- mount the filesystem on a client
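
For reference, the OSS-side part of the steps above could be scripted roughly as follows. This is only a sketch under assumptions: the backup directory and the location of the good copy fetched from the MGS (BACKUP_DIR and GOOD_COPY) are hypothetical names that do not appear in the steps, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Sketch of the per-OSS recovery steps; prints commands unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

FSNAME="lfs1"                            # filesystem name, from the post
BAD_OST_DEV="/dev/mapper/ost5"           # problematic OST device, from the post
BAD_OST_MNT="/mnt/ost5"                  # ldiskfs mount point, from the post
BACKUP_DIR="/root/llog-backup"           # hypothetical backup location
GOOD_COPY="/root/lfs1-client.from-mgs"   # hypothetical: good llog already copied from the MGS

# Echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Back up the suspect llog, inspect it, then overwrite it with the good copy.
replace_client_llog() {
    local llog="$BAD_OST_MNT/CONFIGS/$FSNAME-client"
    run mount -t ldiskfs "$BAD_OST_DEV" "$BAD_OST_MNT" || return 1
    run mkdir -p "$BACKUP_DIR" || return 1
    run cp -a "$llog" "$BACKUP_DIR/" || return 1
    run llog_reader "$llog"              # output is for manual inspection only
    run cp -a "$GOOD_COPY" "$llog" || return 1
    run umount "$BAD_OST_MNT"
}

# Run writeconf on every device given as an argument, in the order given.
writeconf_targets() {
    local dev
    for dev in "$@"; do
        run tunefs.lustre --verbose --writeconf "$dev" || return 1
    done
}
```

Sourcing the file and calling replace_client_llog, then writeconf_targets with the MDT device first followed by the OST devices, prints the command sequence for review. The whole filesystem must already be stopped before either function is run for real, and the targets are normally brought back with the MGS/MDT first, then the OSTs, then the clients.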
Our setup has 2 OSS servers (oss1 and oss2), each serving 3 OSTs.
I'm sending this out for reference.
On 11/23/2015 10:57 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
>> On Nov 22, 2015, at 6:12 PM, Perez, Rafael <rjperez at bnl.gov> wrote:
>> LustreError: 10476:0:(mgc_request.c:1707:mgc_llog_local_copy()) MGC172.31.11.121@o2ib: failed to copy remote log lfs1-client: rc = -5
>> LustreError: 13a-8: Failed to get MGS log lfs1-client and no local copy.
>> LustreError: 15c-8: MGC172.31.11.121@o2ib: The configuration from log 'lfs1-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
>> LustreError: 10476:0:(obd_mount_server.c:1285:server_start_targets()) lfs1-OST0006: failed to start LWP: -2
> Does this server have other OSTs that mount? Or is this the only OST on this OSS server? You can use tunefs.lustre to list the OST config parameters and verify that they are correct. I have also seen this kind of error when there are network problems. I would look for IB errors or other signs of problems. (Maybe even do a bandwidth test to see if it is performing as expected.) You can also run “lctl ping” to test LNet connectivity between the OSS server and the MGS server.
> If the network checks out and it really is the llog that is the problem, you can try doing a writeconf to fix things up.
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
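
The checks suggested in the quoted reply could be run from the OSS roughly like this. It is a sketch only: the MGS NID is taken from the error messages in the thread, the device is the one named in the steps, and using tunefs.lustre --dryrun to print the target's stored parameters is my reading of "list the OST config parameters". Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Sketch of basic OSS-to-MGS checks; prints commands unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

MGS_NID="172.31.11.121@o2ib"   # MGS NID as it appears in the error messages
OST_DEV="/dev/mapper/ost5"     # OST device named in the post

# Echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

check_ost_to_mgs() {
    run lctl list_nids                       # NIDs configured on this OSS
    run lctl ping "$MGS_NID"                 # LNet reachability of the MGS
    run tunefs.lustre --dryrun "$OST_DEV"    # print stored target parameters, no changes written
}
```

Calling check_ost_to_mgs with the default settings prints the three commands. A failing lctl ping would point at the network rather than the llog, which is the distinction the reply is trying to draw.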
rjperez at bnl.gov
ITD HPC Support, Sr Technology Engineer