[Lustre-discuss] Fw: How to install Lustre onto LVM?

Andrea Rucks andrea.rucks at us.lawson.com
Thu Mar 5 11:22:08 PST 2009


Thanks much Daire,

Your insights are very much appreciated.

> Interesting configuration. Is there a particular reason why you decided
> on using Xen VMs? Is failover better with Xen instances? I'm guessing 
> you don't have hundreds of clients hammering the hardware.

Cost was the deciding factor for XEN.  True, failover and STONITH are FAST 
with XEN (it's mostly all up in memory and killing a domU is easy), and 
boot time of XEN domUs is measured in seconds (35-50) rather than minutes. 
But it all came down to cost.  We only get about 5k folks a day knocking 
at the Customer Portal door, so not a lot of traffic (we're not Google by 
any stretch of the imagination).

A small group within our ITS department got together to architect this 
solution in 3 days (Security, Portal Apps Devel, App Admins, System 
Admins, Network Admins, Storage Admins).  The choice to use XEN VMs really 
was based on cost.  We were pleasantly surprised by our interim VP to 
receive X number of dollars to create a high availability solution for 
our client-facing customer portal, but those dollars weren't a lot and 
were only available for a limited time, so we had to hustle (use it or 
lose it for the year).  Out of that bucket we needed to purchase network 
and server hardware, OS support, App and Portal support, backup server 
client licenses, LDAP support and high availability disk solution support 
(Lustre training, ongoing support and initial configuration).  So the 
funds got gobbled up fast, and anywhere we could save a buck was reviewed 
and carried weight.  Purchasing RHEL 5 Advanced Platform Premium gave us 
24x7 support and an unlimited number of XEN guests (right value, right 
price).

We bought low-cost IBM Intel xSeries servers for DEV, CERT, and PROD.  The 
DEV server is running 5 XEN domUs.  The 3 CERT servers together are 
running 13 XEN domUs.  The 5 PROD servers together are running 22 XEN 
domUs.  That's nine physical servers and 40 virtual in total, hardware 
that's maxed out and configured with multiple dual HBAs and quad GigE 
NICs, to the tune of about $200k I think.  We've got WAS and WPS servers, 
HTTP servers, LDAP servers, DB2 servers (thankfully free with the Portal 
server), and Lustre MDS / OSS all running side by side.

Our current customer-facing portal (we haven't cut over to the new 
hardware yet) consists of 16 servers, and dropping to 9 reduces our 
"carbon footprint" but definitely increases the complexity of our 
environment.  Our data center, like many, is power constrained (we're 
fully using our UPS capacity), and we have a large internal push to 
consolidate and virtualize to realize full server utilization potential 
(do more with less) as well as reduce energy costs.  We currently use 
RHEL's GFS / CS, but we're on 3U8 (which uses disk pools), and from RHEL 4 
onward GFS uses LVM instead of disk pools.  That requires a complete 
rebuild and hardware refresh any way you look at it, so we opened the 
playing field to all HA disk solutions.  We wanted to decouple the HA disk 
layer from the Application Server layer (to allow the app layer to remain 
up when GFS panics...and yes, GFS has panicked the entire 6-node cluster 
before and brought down the Application layer; bye-bye portal access; 
RHEL 3 was very buggy).  Lustre allows us to do that and has a great 
support base.

> I'm curious as to why you created 5 filesystems on the same "hardware" 
> instead of one big filesystem?

Legacy filesystems, before my time and reaching far into the past.  Those 
filesystems have migrated from an IBM Regatta-class p690 server, to the 
current 16-server RHEL GFS environment, to this new environment.  We're 
working with the data content owners to establish new filesystem 
guidelines, which will include archiving old / unused data and better 
identifying data ownership.  But that's Phase 2 of this project and 
another type of migration.  Phase 1 is the migration and implementation of 
a solid HA hardware / software environment (or as SPOF-free as we can make 
it).  Phase 2 will change the entire filesystem structure and provide us 
with tools to enforce disk usage and accountability (along with 
establishing better control over disk growth and who to charge back for 
SAN expansion).  To change those filesystems now would mean our entire 
development and publishing structure would break down (automated 
publishing scripts would all break, several integral connected servers 
that check for data existence would go nuts, VPs would have words with 
VPs; not a pretty scene).  Basically, politics and corporate culture are 
the current reasons.  We just need time to plan and carefully coordinate 
with all parties to develop a new filesystem structure and get folks to 
start posting new data and migrating existing data to the new filesystems.

> I'm not sure it is possible to migrate an existing filesystem to LVM 
> easily - you would need to do a file backup of your MDT first and 
> restore to the LVM device (section 15.1.3.1 of the manual). So in your 
> case to
> wipe (!) a single MDT and create a new one I'd do something like:
> 
>   pvcreate /dev/xvdj
>   vgcreate lusfs01 /dev/xvdj
>   vgchange -a y lusfs01
>   lvcreate -L3G -nmdt lusfs01
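
Just to make sure I have the whole procedure straight: after the lvcreate 
I'm assuming the next steps would be to format the new logical volume as 
the MDT and mount it, something roughly like the sketch below.  The 
fsname, mount point and co-located MGS are my guesses for our setup, not 
something you prescribed, so please correct me if I'm off base.

   mkfs.lustre --fsname=lusfs01 --mgs --mdt /dev/lusfs01/mdt
   mkdir -p /mnt/lusfs01-mdt
   mount -t lustre /dev/lusfs01/mdt /mnt/lusfs01-mdt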

I have another limited-time opportunity open to me.  Into the middle of 
this large Customer Portal project another project got dropped: a new 
SAN, with all the fun that comes with a new SAN (migration, migration, 
migration).  Since I'm being asked to migrate all my virtual domU OSes 
(which are located on the old SAN) along with their corresponding data 
disks (also on the old SAN), I figured I'd take advantage of that 
migration and instead rebuild Lustre with LVM to get the benefits of 
journaling and snapshotting, as you'd mentioned.  Thank you for making 
clear that the MDTs are where you'd recommend using LVM; I wasn't sure 
whether it was just the MDS servers or both MDS and OSS.
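
For the snapshotting piece, my (possibly naive) reading is that once the 
MDT sits on a logical volume we can grab a point-in-time copy with a 
plain LVM snapshot and back that up out of band, something along these 
lines (sizes, names and mount points are placeholders I made up, and a 
full MDT backup would still need the extended attributes saved per the 
manual):

   # snapshot the MDT logical volume (needs free space left in the VG)
   lvcreate -s -L1G -n mdt-snap /dev/lusfs01/mdt
   # mount the snapshot read-only as ldiskfs and archive it
   mkdir -p /mnt/mdt-snap
   mount -t ldiskfs -o ro /dev/lusfs01/mdt-snap /mnt/mdt-snap
   tar czf /backup/mdt-snap.tgz --sparse -C /mnt/mdt-snap .
   # clean up
   umount /mnt/mdt-snap
   lvremove -f /dev/lusfs01/mdt-snap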

And thank you for taking the time to answer.  Your reply is absolutely 
brilliant and exactly what I'd hoped for (it's just what I need to present 
my case to the business).  We're not live yet, so let's recreate and bring 
on board these additional LVM features!  Give me some new SAN data disks 
for building up LVM, I'll build it alongside the old SAN data disks, 
transfer the MDT data and then drop the old SAN.  This is faster and more 
efficient than using our SAN vendor's migration solution for the data 
disks (I still have to use it for the OS disk, but rebuilding with LVM is 
still a time savings and a known procedure).
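
For my own notes, the MDT transfer I have in mind is the file-level 
backup / restore you pointed me at (section 15.1.3.1), roughly like the 
sketch below.  The device names and paths are placeholders for our 
environment, and I'd rehearse the whole thing in DEV before touching PROD:

   # with the filesystem stopped, back up the old (non-LVM) MDT
   mkdir -p /mnt/mdt-old
   mount -t ldiskfs /dev/xvdi /mnt/mdt-old     # old MDT device (placeholder)
   cd /mnt/mdt-old
   getfattr -R -d -m '.*' -e hex -P . > /backup/ea.bak   # save the Lustre EAs
   tar czf /backup/mdt.tgz --sparse .
   cd /; umount /mnt/mdt-old

   # restore onto the freshly formatted LVM-backed MDT
   mkdir -p /mnt/mdt-new
   mount -t ldiskfs /dev/lusfs01/mdt /mnt/mdt-new
   tar xzpf /backup/mdt.tgz -C /mnt/mdt-new
   cd /mnt/mdt-new; setfattr --restore=/backup/ea.bak
   cd /; umount /mnt/mdt-new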

Cheers and many thanks again, Daire,

Ms. Andrea D. Rucks
Sr. Unix Systems Administrator,
Lawson ITS Unix Server Team