[lustre-discuss] Unable to mount client with 56 MDSes and beyond
Andreas Dilger
adilger at whamcloud.com
Wed May 22 01:02:59 PDT 2019
Scott, if you haven't already done so, it is probably best to file a ticket in Jira with the details. Please include the client syslog/dmesg as well as a Lustre debug log ("lctl dk /tmp/debug") so that the problem can be isolated.
During DNE development we tested with up to 128 MDTs in AWS, but haven't tested that many MDTs in some time.
Cheers, Andreas
On May 8, 2019, at 12:28, White, Scott F <sfpwhite at lanl.gov> wrote:
> We’ve been testing DNE Phase II and tried scaling the number of MDSes (one MDT each for all of our tests) very high, but when we did that, we couldn’t mount the filesystem on a client. After trial and error, we discovered that we were unable to mount the filesystem when there were 56 MDSes. 55 MDSes mounted without issue, and it appears any number below that will mount. The failure at 56 MDSes was reproducible across different nodes being used for the MDSes, all of which had been tested in working configurations, so it doesn’t seem to be a bad server.
> Here’s the error info we saw in dmesg on the client:
> LustreError: 28880:0:(obd_config.c:559:class_setup()) setup lustre-MDT0037-mdc-ffff95923d31b000 failed (-16)
> LustreError: 28880:0:(obd_config.c:1836:class_config_llog_handler()) MGCx.x.x.x@o2ib: cfg command failed: rc = -16
> Lustre: cmd=cf003 0:lustre-MDT0037-mdc 1:lustre-MDT0037_UUID 2:x.x.x.x@o2ib
> LustreError: 15c-8: MGCx.x.x.x@o2ib: The configuration from log 'lustre-client' failed (-16). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> LustreError: 28858:0:(obd_config.c:610:class_cleanup()) Device 58 not setup
> Lustre: Unmounted lustre-client
> LustreError: 28858:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount (-16)
> OS: CentOS 7.6.1810
> Kernel: 3.10.0-957.5.1.el7.x86_64
> Lustre: 2.12.1
> Network card: Qlogic InfiniPath_QLE7340
> Other things to note for completeness’ sake: this happened with both ldiskfs and zfs backfstypes, and these tests were using files in memory as the backing devices.
> Is there something I’m missing as to why 56 or more MDSes won’t mount?
> Thanks,
> Scott White
> Scientist, HPC
> Los Alamos National Laboratory
Andreas Dilger
Principal Lustre Architect
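
A note on reading the quoted dmesg output: Lustre target names carry a four-digit hexadecimal index, so MDT0037 is index 0x37 = 55, which is the 56th MDT when counting from zero. That matches the observation that 55 MDSes mount and 56 do not. The return code -16 is a negated Linux errno, and errno 16 is EBUSY. The snippet below is only a minimal sketch for pulling those two values out of a class_setup() failure line. The regular expression and the function name decode_setup_failure are assumptions made for illustration and are not part of Lustre or its tools, and the sketch does not explain why the setup returned EBUSY.

```python
import errno
import os
import re

# Hypothetical pattern for messages of the form
# "setup <fsname>-MDT<hex index>-mdc-<instance> failed (<rc>)".
SETUP_FAILED = re.compile(
    r"setup (\S+?)-MDT([0-9a-fA-F]{4})-mdc\S* failed \((-?\d+)\)"
)


def decode_setup_failure(line):
    """Return (fsname, mdt_index, errno_name, errno_text), or None if no match."""
    m = SETUP_FAILED.search(line)
    if m is None:
        return None
    fsname = m.group(1)
    mdt_index = int(m.group(2), 16)   # target index is hexadecimal
    code = abs(int(m.group(3)))       # kernel return codes are negated errno values
    name = errno.errorcode.get(code, "UNKNOWN")
    return fsname, mdt_index, name, os.strerror(code)


if __name__ == "__main__":
    line = ("LustreError: 28880:0:(obd_config.c:559:class_setup()) "
            "setup lustre-MDT0037-mdc-ffff95923d31b000 failed (-16)")
    print(decode_setup_failure(line))
```

Run against the first LustreError line from the client, this reports filesystem "lustre", MDT index 55 and EBUSY, which may be handy to state explicitly in a Jira ticket alongside the syslog and the "lctl dk" debug log.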