[Lustre-discuss] simulations

Cliff White Cliff.White at Sun.COM
Thu Aug 7 10:59:22 PDT 2008


Mag Gam wrote:
> We do a lot of fluid simulations at my university, but on a similar
> note I would like to know what the Lustre experts will do in
> particular simulated scenarios...
> 
> The environment is this:
> 30 Servers (All Linux)
> 1000+ Clients (All Linux)
> 
> 30 Servers
> 1 MDS
> 30 OSTs each with 2TB of storage
> 
> No fail over capabilities.
> 
> 
> Scenario 1:
> Your client is trying to mount the Lustre filesystem using the lustre
> module, and it hangs. What do you do?
Answer 0 to all questions:
"Read the Lustre Manual. File doc bugs in Lustre Bugzilla if there's a 
part you don't understand, or a part that is missing."

Answer 1 for all your questions.
"Check syslogs/consoles on the impacted clients.
Check syslogs/consoles on _all_ Lustre servers.
Pay careful attention to timestamps.
Work backwards to the first error."
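
As a rough illustration of "work backwards to the first error", here is a minimal Python sketch. It assumes you have already collected the syslog lines from each host, that they use the classic "Mon DD HH:MM:SS" syslog prefix, that all clocks are synchronized, and that errors of interest contain the string "LustreError". The function names and the year default are made up for this example.

```python
from datetime import datetime

def parse_syslog_line(host, line, year=2008):
    """Return (timestamp, host, line). Classic syslog omits the year, so it is passed in."""
    stamp = " ".join(line.split()[:3])  # e.g. "Aug 7 10:59:22"
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, host, line.rstrip("\n")

def first_error(logs, marker="LustreError"):
    """logs maps hostname -> iterable of syslog lines.

    Returns the earliest (timestamp, host, line) containing marker, or None.
    """
    hits = []
    for host, lines in logs.items():
        for line in lines:
            if marker in line:
                hits.append(parse_syslog_line(host, line))
    return min(hits, default=None)
```

Feed it the logs from the impacted clients and from every server together. The host in the returned tuple is where to start reading.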

Is the problem restricted to one client or seen by multiple clients?
If multiple clients, start with the network, use lctl ping to check 
lustre connectivity.
If a single client, it's generally a client config/network config issue.
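
If many clients are affected, the lctl ping check can be looped over the server NIDs. The Python sketch below is only an outline: it assumes lctl is on the PATH of the node you run it from, and it treats a nonzero exit status from "lctl ping" as a failure, which is my assumption rather than something documented here. The function names are made up for the example.

```python
import subprocess

def lctl_ping(nid, timeout=30):
    """Run `lctl ping <nid>`; True if the command exits with status 0 (assumed to mean success)."""
    try:
        result = subprocess.run(
            ["lctl", "ping", nid],
            capture_output=True, text=True, timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # lctl is not installed here, or the ping never came back.
        return False
    return result.returncode == 0

def unreachable(nids, ping=lctl_ping):
    """Return the NIDs that did not answer, preserving input order."""
    return [nid for nid in nids if not ping(nid)]
```

Call unreachable() with the NIDs of the MDS and all OSS nodes (for example "192.168.0.10@tcp"). An empty list means LNET connectivity looks fine from that node.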
> 
> Scenario 2:
> Your MDS won't mount up. It's saying, "The server is already running".
> You try to mount it up a couple of times and still it's not

Be certain the server is not already running.
Be certain no hung mount processes exist.
Unload all Lustre modules (the lustre_rmmod script will do this).
Retry; if it still fails, see answer 1.

> 
> Scenario 3:
> OST/OSS reboots due to a power outage. Some files are striped on this,
> and some aren't. What happens? What to do for minimal outage?

- Clients can be mounted with a dead OST using the exclude option to 
the mount command. lfs getstripe can be run from clients to find files
on the bad OST. See answer 0 for the detailed process.
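
To pick the affected files out of saved lfs getstripe output, something like the Python sketch below may help. It is built on assumptions about the output layout: each file path appears on its own line starting with "/", followed by a table whose header begins with "obdidx" and whose rows start with the numeric OST index. Check that against what your Lustre version actually prints. The function name is made up.

```python
def files_on_ost(getstripe_output, ost_index):
    """Return paths that have at least one stripe object on the given OST index."""
    hits, path, in_table = [], None, False
    for line in getstripe_output.splitlines():
        fields = line.split()
        if line.startswith("/"):
            # A new file entry begins (assumes absolute paths were given to lfs).
            path, in_table = line.strip(), False
        elif fields and fields[0] == "obdidx":
            in_table = True  # header row of the stripe table
        elif in_table and fields and fields[0].isdigit():
            if int(fields[0]) == ost_index and path not in hits:
                hits.append(path)
        else:
            in_table = False  # blank line or other text ends the table
    return hits
```

Files that are not in the returned list have no objects on the dead OST and should stay readable once clients are mounted with it excluded.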
> 
> Scenario 4:
> lctl dl shows some devices in "ST" state. What does that mean, and how
> do I clear it?

ST = stopped.
Clear this by cleaning up all devices (answer 0)
or restarting the stopped devices.
Usually indicates an error/issue with the stopped device, so see
answer 1.
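
For spotting stopped devices on many nodes, the "lctl dl" output can be filtered with a few lines of Python. This sketch assumes the usual column order of index, state, type, name, UUID, refcount; the function name is made up for the example.

```python
def stopped_devices(lctl_dl_output, state="ST"):
    """Return (index, type, name) for every device whose state column equals `state`."""
    devices = []
    for line in lctl_dl_output.splitlines():
        fields = line.split()
        # Expect at least: index, state, type, name
        if len(fields) >= 4 and fields[0].isdigit() and fields[1] == state:
            devices.append((int(fields[0]), fields[2], fields[3]))
    return devices
```

Pass it the captured text of "lctl dl" from a node. Whatever it returns is the list of devices whose logs deserve the answer 1 treatment.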
> 
> 
> I know some of these scenarios may be ambiguous, but please let me
> know which so I can further elaborate. I am eventually planning to
> wiki this for future reference and other lustre newbies.

Please contribute to wiki.lustre.org - there is considerable information 
there already, and a decent existing structure.
> 
> If anyone else has any other scenarios, please don't be shy and ask
> away. We can create a good trouble shooting doc similar to the
> operations manual.

Again, please file doc bugs at bugzilla.lustre.org and contribute to 
wiki.lustre.org. Hope this helps!
cliffw

> 
> 
> TIA