[lustre-discuss] old Lustre 2.8.0 panic'ing continously
Degremont, Aurelien
degremoa at amazon.com
Thu Mar 5 01:20:06 PST 2020
Hello Torsten,
- What is the exact error message when the panic happens? Could you copy/paste a few log messages from the panic?
- Did you try searching for this pattern on jira.whamcloud.com to see whether this is an already known bug?
- It seems related to quota. Is disabling quota an option for you?
- Lustre 2.10.8 supports CentOS 6 and was an LTS release; it received many more fixes and is very stable. It could be an easy upgrade path for you before your new system arrives.
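Regarding the quota suggestion, a minimal sketch of turning off quota enforcement on a Lustre 2.4+ filesystem (run on the MGS node; "lustre" is a placeholder filesystem name, substitute your own):

```shell
# Run on the MGS. "lustre" is a hypothetical fsname; replace it with
# the actual filesystem name from your mount/config.
# Disable quota enforcement on all OSTs and on the MDT:
lctl conf_param lustre.quota.ost=none
lctl conf_param lustre.quota.mdt=none

# Check the effective quota state on a server node:
lctl get_param osd-*.*.quota_slave.info
```

This only disables enforcement/accounting at the Lustre layer; it does not rewrite on-disk quota files, so it is easy to re-enable later (e.g. `quota.ost=ug`).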
Aurélien
On 05/03/2020 08:49, « lustre-discuss on behalf of Torsten Harenberg » <lustre-discuss-bounces at lists.lustre.org on behalf of harenberg at physik.uni-wuppertal.de> wrote:
Dear all,
I know it is daring to ask for help with such an old system.
We still run a CentOS 6 based Lustre 2.8.0 system
(kernel-2.6.32-573.12.1.el6_lustre.x86_64,
lustre-2.8.0-2.6.32_573.12.1.el6_lustre.x86_64.x86_64).
It is out of warranty and about to be replaced. The approval process for
the new grant took over a year, and we are currently preparing an EU-wide
tender; all of this has taken far more time than we expected.
The problem is:
one OSS server keeps running into kernel panics. The panic seems to be
tied to one of that server's OST mount points: if we mount the
LUNs of that server (all data is on a 3PAR SAN) on a different server,
that server starts panicking too.
We always run file system checks after such a panic, but these show only
minor issues of the kind you would expect after a crashed machine, e.g.:
[QUOTA WARNING] Usage inconsistent for ID 2901:actual (757747712, 217) != expected (664182784, 215)
We would love to avoid an upgrade to CentOS 7 on these old machines,
but the crashes have become very frequent; yesterday the server
panicked after only 30 minutes.
Now we're running out of ideas.
If anyone has an idea how we could identify the source of the problem,
we would really appreciate it.
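One way to pin down the source is to capture the full panic backtrace with kdump, which is available on CentOS 6. The commands below are a sketch; the `crashkernel=` reservation size and debuginfo paths may need adjusting for your hardware and kernel build.

```shell
# Install the kdump tooling and the crash analyzer (CentOS 6)
yum install -y kexec-tools crash

# Reserve memory for the crash kernel: add "crashkernel=128M" to the
# kernel line in /boot/grub/grub.conf, then reboot the OSS.

# Enable and start the kdump service
chkconfig kdump on
service kdump start

# After the next panic, a vmcore is written under /var/crash/<timestamp>/.
# With the matching kernel-debuginfo installed, inspect the backtrace:
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/*/vmcore
```

Even without a full vmcore, the last lines of the panic on a serial or netconsole console are usually enough to match the trace against known tickets on jira.whamcloud.com.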
Kind regards
Torsten
--
Dr. Torsten Harenberg harenberg at physik.uni-wuppertal.de
Bergische Universitaet
Fakultät 4 - Physik Tel.: +49 (0)202 439-3521
Gaussstr. 20 Fax : +49 (0)202 439-2811
42097 Wuppertal