[Lustre-discuss] 1.8.4 and write-through cache

Kevin Van Maren kevin.van.maren at oracle.com
Thu Sep 16 14:57:45 PDT 2010

Stu Midgley wrote:
> Afternoon
> I upgraded our OSSes from 1.8.3 to 1.8.4 on Saturday (due to
> https://bugzilla.lustre.org/show_bug.cgi?id=22755) and suffered a
> great deal of pain.
> We have 30 OSSes of multiple vintages.  The basic difference between them is:
>   * md software RAID on the first 20 nodes
>   * 3ware 9650SE ML12 on the last 10 nodes
> After the upgrade to 1.8.4 we were seeing terrible throughput on the
> nodes with 3ware cards (and only the nodes with 3ware cards).  This
> was typified by seeing the block device 100% utilised (iostat),
> doing about 100 r/s and 400 kB/s, with all the ost_io threads in D state
> (no writes).  They would stay in this state for 10 minutes, then suddenly
> wake and start pushing data again.  1-2 minutes later, they would lock
> up again.
> The OSSes were dumping stacks all over the place, crawling along and
> generally making our Lustre filesystem unusable.
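[For anyone trying to reproduce the observation above: the quoted numbers come from iostat's extended per-device statistics, and D-state threads can be listed with ps. The device name below is illustrative.]

    # Extended stats every 5 seconds; %util, r/s and rkB/s are the
    # columns referenced above (device name is illustrative).
    iostat -x 5 /dev/sda

    # List threads in uninterruptible sleep (D state) and what they
    # are waiting on.
    ps -eo state,pid,comm,wchan | awk '$1 == "D"'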

Would you post a few of the stack traces?  Presumably these were driven
by watchdog timeouts, but it would help to know where they were getting stuck.
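[The Lustre watchdog normally dumps the stack of a stuck service thread to the console/syslog by itself; a full dump for every task can also be forced via sysrq. Rough sketch, requires root on the OSS; the `ll_ost_io` pattern is the 1.8 OSS I/O thread name.]

    echo 1 > /proc/sys/kernel/sysrq   # make sure sysrq is enabled
    echo t > /proc/sysrq-trigger      # dump state and stack of every task
    dmesg | grep -A 20 ll_ost_io      # pull out the ost_io thread traces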

> After trying different kernels and RAID card drivers, changing the
> write-back policy on the RAID cards, etc., the solution was to run
>     lctl set_param obdfilter.*.writethrough_cache_enable=0
>     lctl set_param obdfilter.*.read_cache_enable=0
> on all the nodes with the 3ware cards.
> Has anyone else seen this?  I am completely baffled as to why it only
> affects our nodes with 3ware cards.
> These nodes were working very well under 1.8.3...
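[For anyone hitting the same symptoms: the workaround quoted above can be verified with get_param. Note that set_param does not survive a remount of the OSTs, so it has to be reapplied (or made persistent; the exact conf_param spelling varies by release, so check the manual for your version).]

    # On each affected OSS: check the current cache settings...
    lctl get_param obdfilter.*.writethrough_cache_enable
    lctl get_param obdfilter.*.read_cache_enable

    # ...then disable both caches (takes effect immediately, but is
    # lost when the OSTs are remounted).
    lctl set_param obdfilter.*.writethrough_cache_enable=0
    lctl set_param obdfilter.*.read_cache_enable=0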
