[Lustre-discuss] 1.8.4 and write-through cache

Stu Midgley sdm900 at gmail.com
Mon Sep 13 02:31:06 PDT 2010


I upgraded our OSSes from 1.8.3 to 1.8.4 on Saturday (due to
https://bugzilla.lustre.org/show_bug.cgi?id=22755) and suffered a
great deal of pain.

We have 30 OSSes of multiple vintages.  The basic difference between them is:

  * md (Linux software RAID) on the first 20 nodes
  * 3ware 9650SE ML12 on the last 10 nodes

After the upgrade to 1.8.4 we were seeing terrible throughput on the
nodes with 3ware cards (and only the nodes with 3ware cards).  This
was typified by the block device being 100% utilised (per iostat)
while doing only about 100 r/s and 400 KB/s, with all the ost_io
threads in D state (no writes).  They would sit in this state for 10
minutes, then suddenly wake up and start pushing data again; 1-2
minutes later they would lock up again.
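For anyone wanting to spot the same symptom, the stuck service threads are easy to count from the shell.  A minimal sketch (ll_ost_io is the usual OSS I/O thread-name prefix; adjust if yours differ):

```shell
# Hedged sketch: count Lustre OSS I/O service threads stuck in
# uninterruptible sleep (D state).  'll_ost_io' is assumed to be the
# thread-name prefix -- check with 'ps' on your own OSS first.
count_stuck_ost_io() {
    ps -eo stat=,comm= | awk '$1 ~ /^D/ && $2 ~ /^ll_ost_io/ { n++ } END { print n+0 }'
}
count_stuck_ost_io
```

On a healthy OSS this should print 0 or close to it; during the lockups above it covered essentially every ost_io thread.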

The OSSes were dumping stacks all over the place, crawling along, and
generally making our Lustre filesystem unusable.

After trying different kernels and RAID card drivers, changing the
write-back policy on the RAID cards, etc., the fix was to run

    lctl set_param obdfilter.*.writethrough_cache_enable=0
    lctl set_param obdfilter.*.read_cache_enable=0

on all the nodes with the 3ware cards.
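In case it helps anyone hitting the same thing, pushing the workaround to the affected nodes is just a loop.  A hedged sketch (oss21..oss30 are placeholder host names for our last 10 nodes; the default dry-run mode only prints the commands):

```shell
# Hedged sketch: apply the cache workaround on each 3ware-backed OSS.
# oss21..oss30 are placeholder host names.  Pass "run" to actually
# execute over ssh; by default the commands are only printed.
push_cache_workaround() {
    mode=${1:-dry-run}
    for i in $(seq 21 30); do
        host="oss$i"
        for param in writethrough_cache_enable read_cache_enable; do
            cmd="lctl set_param obdfilter.*.$param=0"
            if [ "$mode" = run ]; then
                ssh "$host" "$cmd"
            else
                echo "$host: $cmd"
            fi
        done
    done
}
push_cache_workaround            # dry run: prints the 20 commands
```

Note that lctl set_param is a runtime setting, so as far as I know it needs re-applying after a remount or reboot.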

Has anyone else seen this?  I am completely baffled as to why it only
affects our nodes with 3ware cards.

These nodes were working very well under 1.8.3...

Dr Stuart Midgley
sdm900 at gmail.com
