[lustre-discuss] MDT corruption

Alastair Basden a.g.basden at durham.ac.uk
Tue Jan 27 13:29:58 PST 2026


Hi all,

We have solved the problem, so posting back here for completeness, in case 
it helps anyone else.

It turns out that the oi.16* files are the Object Index (OI) files, which 
behave like a cache: Lustre rebuilds them (via OI scrub) if they are 
missing, so they are not really needed.

So, taking hints from the user manual section about file-level backups, 
we mounted the MDT as ldiskfs, removed all the oi.16* files (64 of them) 
and a few others (lfsck_*, LFSCK, CATALOGS), and then remounted as Lustre.
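
For anyone repeating this, the removal step can be previewed with something 
like the sketch below. It is only an illustration, not what we actually ran: 
the function names are made up, and it assumes the MDT is already mounted as 
ldiskfs at some mount point (e.g. a hypothetical /mnt/mdt) with the entries 
named above sitting at the top level of that mount. It defaults to a dry run 
so the list can be checked (or backed up) before anything is deleted.

```python
import fnmatch
import os
import shutil

# Name patterns taken from the message above: the OI files plus the
# LFSCK state entries and the CATALOGS file.
REBUILDABLE_PATTERNS = ("oi.16*", "lfsck_*", "LFSCK", "CATALOGS")


def find_rebuildable(mdt_root):
    """Return the sorted top-level entry names under mdt_root that match."""
    return sorted(
        name
        for name in os.listdir(mdt_root)
        if any(fnmatch.fnmatchcase(name, pat) for pat in REBUILDABLE_PATTERNS)
    )


def remove_rebuildable(mdt_root, dry_run=True):
    """Return the matching names; delete them only when dry_run is False."""
    names = find_rebuildable(mdt_root)
    if not dry_run:
        for name in names:
            path = os.path.join(mdt_root, name)
            # Handle either a plain file or a directory (assumption: an
            # entry such as LFSCK may be a directory on some versions).
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
    return names
```

Calling remove_rebuildable("/mnt/mdt") only returns the names; passing 
dry_run=False performs the deletion. Only the top level of the mount is 
examined, so nothing under ROOT or the other directories is touched.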

After a few hours of running LFSCK (via lctl), all appears to be well.

Hopefully that will help someone having a future Lustre panic!

Cheers,
Alastair.

On Mon, 26 Jan 2026, Alastair Basden via lustre-discuss wrote:

> Hi all,
>
> We are wondering whether anyone can shed some light for us.
>
> An MDT RAID controller failed, and the DRBD replica seems to be corrupted,
> since we can't mount the MDT on another node (where it should have been
> replicated to).
>
> We are using Lustre 2.12.6.
>
> Errors are (when trying to mount):
>
> LDISKFS-fs (drbd3): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
> LustreError: 114156:0:(osd_iam.c:182:iam_load_idle_blocks()) drbd3: cannot load idle blocks, blk = 1244, err = -5
> LustreError: 114156:0:(osd_oi.c:324:osd_oi_table_open()) drbd3: can't open oi.16.6: rc = -5
> LustreError: 114156:0:(osd_oi.c:327:osd_oi_table_open()) drbd3: expect to open total 64 OI files.
> LustreError: 114156:0:(obd_config.c:559:class_setup()) setup cos8-MDT0003-osd failed (-5)
> LustreError: 114156:0:(obd_mount.c:202:lustre_start_simple()) cos8-MDT0003-osd setup error -5
> LustreError: 114156:0:(obd_mount_server.c:1958:server_fill_super()) Unable to start osd on /dev/drbd3: -5
> LustreError: 114156:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount (-5)
>
> We can mount as ldiskfs, and the oi.16.6 file is there; however, we suspect
> it is corrupted (based on the above error).
>
> We are wondering whether replacing this file from a backup (or indeed from
> the failed raid once the controller is back online) would be an option,
> and allow the system to continue again, albeit with some potential data
> loss of recent accesses.
>
> The failed MDT is not the primary one.
>
> Anyone any ideas?
>
> Thanks,
> Alastair.