[lustre-discuss] can not get the hsm_release command to work with Lustre 2.7 and Oracle HSM 6.1

Colin Faber colin.faber at seagate.com
Thu Jun 23 12:38:45 PDT 2016


Hi

On Thu, Jun 23, 2016 at 12:57 PM, Michael Skiba <michael.skiba at oracle.com>
wrote:

> Colin, the first file was a data file, so I made another file, a text file
> named test. I archived it, then changed the file; the output is below. The
> copytool daemon messages are different this time: they complain about
> "cannot get path of FID" and "cannot set attributes".
>
>
>
>
>
> [root at isr-x4150-01 mnt]# lfs hsm_archive --archive=1 /mnt/test
>

Space used on the file system won't change with an archive action, only
with a release. Even then, the stub left in place is a sparse file, so you
have to look at its allocated blocks rather than its apparent size.
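
As a minimal sketch of what "look at allocated blocks" means, the snippet
below uses an ordinary sparse file in a scratch directory as a stand-in for a
released stub. It assumes GNU coreutils (truncate, stat); the file name
sparse_demo is made up for illustration, and nothing here touches Lustre.

```shell
# Stand-in for a released stub: a 1 MiB sparse file in a scratch directory.
# (On Lustre you would inspect the real file after `lfs hsm_release <file>`.)
tmpdir=$(mktemp -d)
demo="$tmpdir/sparse_demo"
truncate --size 1M "$demo"

apparent_bytes=$(stat -c '%s' "$demo")    # logical size, as shown by ls -l
allocated_blocks=$(stat -c '%b' "$demo")  # number of blocks actually allocated
block_unit=$(stat -c '%B' "$demo")        # size in bytes of each %b block
allocated_bytes=$(( allocated_blocks * block_unit ))

echo "apparent=${apparent_bytes}B allocated=${allocated_bytes}B"
rm -rf "$tmpdir"
```

The same two stat fields on a released file should show the full apparent
size with few or no allocated blocks, which is why df alone is not a useful
check after an archive.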


> [root at isr-x4150-01 mnt]# df -h
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/sda2             134G  7.0G  121G   6% /
> tmpfs                 3.9G     0  3.9G   0% /dev/shm
> /dev/sda1             477M  105M  348M  24% /boot
> 10.80.191.134 at tcp0:/lustre
>                       136G  9.6G  120G   8% /mnt
> 10.80.191.161:/samqfs1
>                       558G  9.6G  549G   2% /samqfs1
>


To me, the hsm_state output below indicates that the archive request was
made but the copytool failed to complete it.

> [root at isr-x4150-01 mnt]# lfs hsm_state /mnt/test
>
> /mnt/test: (0x00000001) exists, archive_id:1
>
>

Editing the file dirties it; however, since no successful archive request
was made (by the copytool), the file loses its HSM state (possibly a bug, or
intended behavior?).

> [root at isr-x4150-01 mnt]# vi test
>
> [root at isr-x4150-01 mnt]# lfs hsm_state /mnt/test
>
> /mnt/test: (0x00000000)
> *(does this mean it's dirty?)*
>
> [root at isr-x4150-01 mnt]#
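
On the question of what the hex value means: the number printed by lfs
hsm_state is a bitmask, so 0x00000000 means no HSM flags are set at all, and
a dirty file would carry the 0x2 bit. The helper below is a rough sketch for
decoding such a value. It assumes the HS_* bit values from lustre_user.h in
Lustre 2.x; the function name decode_hsm_flags is hypothetical, and the flag
labels are approximations of what lfs prints.

```shell
# Decode an HSM state bitmask such as the one printed by `lfs hsm_state`.
# Bit values are assumed from lustre_user.h (HS_EXISTS=0x1, HS_DIRTY=0x2,
# HS_RELEASED=0x4, HS_ARCHIVED=0x8, HS_NORELEASE=0x10, HS_NOARCHIVE=0x20,
# HS_LOST=0x40).
decode_hsm_flags() {
    flags=$(( $1 ))
    names=""
    for pair in 1:exists 2:dirty 4:released 8:archived \
                16:never_release 32:never_archive 64:lost_from_hsm; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( flags & bit )) -ne 0 ]; then
            names="${names:+$names }$name"
        fi
    done
    echo "${names:-none}"
}

decode_hsm_flags 0x00000001   # state seen after the archive request
decode_hsm_flags 0x00000000   # state seen after editing the file
decode_hsm_flags 0x0000000b   # an example with the dirty bit set
```

Under those assumptions, a file that was archived successfully and then
modified would be expected to show the exists, dirty and archived bits
together, which is not what the output above shows.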
>
>
>
>
>
> The logs
>
>
>
> 1466707271.922641 lhsmtool_posix[14489]: copytool fs=lustre archive#=1
> item_count=1
>
> 1466707271.922696 lhsmtool_posix[14489]: waiting for message from kernel
>
> 1466707271.922717 lhsmtool_posix[8023]: '[0x200000400:0x16:0x0]' action
> ARCHIVE reclen 72, cookie=0x576ab8e1
>
> ioctl err -19: No such device (19)
>

It's possible that the FID error occurs because the original file was
removed while the HSM coordinator queue still had a request for it (another
bug, or a feature?).


> 1466707271.922850 lhsmtool_posix[8023]: cannot get path of FID
> [0x200000400:0x16:0x0]: No such device (19)
>
> 1466707271.924618 lhsmtool_posix[8023]: archiving
> 'mnt/.lustre/fid/0x200000400:0x16:0x0' to
> 'samqfs1/0016/0000/0400/0000/0002/0000/0x200000400:0x16:0x0_tmp'
>
> 1466707271.939245 lhsmtool_posix[8023]: saving stripe info of
> 'mnt/.lustre/fid/0x200000400:0x16:0x0' in
> samqfs1/0016/0000/0400/0000/0002/0000/0x200000400:0x16:0x0_tmp.lov
>
> 1466707271.941082 lhsmtool_posix[8023]: start copy of 973891 bytes from
> 'mnt/.lustre/fid/0x200000400:0x16:0x0' to
> 'samqfs1/0016/0000/0400/0000/0002/0000/0x200000400:0x16:0x0_tmp'
>
> 1466707271.955555 lhsmtool_posix[8023]: copied 973891 bytes in 0.015133
> seconds
>
> 1466707271.972955 lhsmtool_posix[8023]: data archiving for
> 'mnt/.lustre/fid/0x200000400:0x16:0x0' to
> 'samqfs1/0016/0000/0400/0000/0002/0000/0x200000400:0x16:0x0_tmp' done
>
> 1466707271.973479 lhsmtool_posix[8023]: cannot set attributes of
> 'mnt/.lustre/fid/0x200000400:0x16:0x0': Operation not permitted (1)
>
> 1466707271.973499 lhsmtool_posix[8023]: cannot copy attr of
> 'mnt/.lustre/fid/0x200000400:0x16:0x0' to
> 'samqfs1/0016/0000/0400/0000/0002/0000/0x200000400:0x16:0x0_tmp': Operation
> not permitted (1)
>
> 1466707271.973509 lhsmtool_posix[8023]: attr file for
> 'mnt/.lustre/fid/0x200000400:0x16:0x0' saved to archive
> 'samqfs1/0016/0000/0400/0000/0002/0000/0x200000400:0x16:0x0_tmp'
>
> 1466707271.973869 lhsmtool_posix[8023]: fsetxattr of 'trusted.hsm' on
> 'samqfs1/0016/0000/0400/0000/0002/0000/0x200000400:0x16:0x0_tmp' rc=-1
> (Operation not supported)
>
> 1466707271.973885 lhsmtool_posix[8023]: fsetxattr of 'trusted.link' on
> 'samqfs1/0016/0000/0400/0000/0002/0000/0x200000400:0x16:0x0_tmp' rc=-1
> (Operation not supported)
>
> 1466707271.973906 lhsmtool_posix[8023]: fsetxattr of 'trusted.lov' on
> 'samqfs1/0016/0000/0400/0000/0002/0000/0x200000400:0x16:0x0_tmp' rc=-1
> (Operation not supported)
>
> 1466707271.973919 lhsmtool_posix[8023]: fsetxattr of 'trusted.lma' on
> 'samqfs1/0016/0000/0400/0000/0002/0000/0x200000400:0x16:0x0_tmp' rc=-1
> (Operation not supported)
>
> 1466707271.974100 lhsmtool_posix[8023]: fsetxattr of 'lustre.lov' on
> 'samqfs1/0016/0000/0400/0000/0002/0000/0x200000400:0x16:0x0_tmp' rc=-1
> (Operation not supported)
>
> 1466707271.974112 lhsmtool_posix[8023]: xattr file for
> 'mnt/.lustre/fid/0x200000400:0x16:0x0' saved to archive
> 'samqfs1/0016/0000/0400/0000/0002/0000/0x200000400:0x16:0x0_tmp'
>
> ioctl err -19: No such device (19)
>
> 1466707271.975415 lhsmtool_posix[8023]: cannot get FID of
> '[0x200000400:0x16:0x0]': No such device (19)
>
> 1466707271.975848 lhsmtool_posix[8023]: Action completed, notifying
> coordinator cookie=0x576ab8e1, FID=[0x200000400:0x16:0x0], hp_flags=0 err=1
>
> 1466707271.976507 lhsmtool_posix[8023]: llapi_hsm_action_end() on
> 'mnt/.lustre/fid/0x200000400:0x16:0x0' ok (rc=0)
>
>
>
>
>
>  *<snip>*
>

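A log this long is easier to read once it is reduced to the lines that
matter. The sketch below pulls the cookie, FID and err value out of the
"Action completed" line and counts the "cannot ..." lines. The two sample
lines are copied from the log above with the mail wrapping rejoined; reading
a non-zero err as a failed action is an assumption on my part.

```shell
# Two lines copied from the copytool log in this thread (wrapped lines rejoined).
log=$(cat <<'EOF'
1466707271.973479 lhsmtool_posix[8023]: cannot set attributes of 'mnt/.lustre/fid/0x200000400:0x16:0x0': Operation not permitted (1)
1466707271.975848 lhsmtool_posix[8023]: Action completed, notifying coordinator cookie=0x576ab8e1, FID=[0x200000400:0x16:0x0], hp_flags=0 err=1
EOF
)

# Print "cookie FID err=N" for every completed action.
completed=$(printf '%s\n' "$log" |
    sed -n 's/.*Action completed.*cookie=\([^,]*\), FID=\(\[[^]]*\]\),.* err=\([0-9-]*\).*/\1 \2 err=\3/p')
echo "$completed"

# Count the lines where the copytool reported that it could not do something.
failures=$(printf '%s\n' "$log" | grep -c 'cannot ')
echo "failure lines: $failures"
```

Against a real log you would replace the here-document with the copytool's
own output file.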
Here's a test for you to try to keep things simple and to verify that all
components are working correctly.

Step 1:

Shut down your copytool listening for archive 1. Next, generate a test file
and gather information about it.

# truncate --size +1M /mnt/hsm_test
# lfs path2fid /mnt/hsm_test
# lfs hsm_archive --archive 1 /mnt/hsm_test

Step 2:

Now check the pending actions on the MDS:

# cat /proc/fs/lustre/mdt/*/hsm/actions
lrh=[type=10680000 len=136 idx=1/43086] fid=[0x2000013a1:0x1f17d:0x0]
dfid=[0x2000013a1:0x1f17d:0x0] compound/cookie=0x574badc0/0x574baddb
action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0
datalen=0 status=WAITING data=[]

Step 3:

Restart your copytool on archive 1 and monitor it.

Once you believe the file to be archived, check your actions queue again to
see if the archive succeeded (it probably failed).
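
To make the actions queue checks in Steps 2 and 3 easier to compare, here is
a small sketch that reduces each record to its FID, action and status. The
function name summarize_hsm_actions is hypothetical; the sample record is the
one from Step 2 with the mail wrapping rejoined, on the assumption that the
MDS prints each record on a single line.

```shell
# Reads hsm/actions records on stdin (one per line) and prints
# "fid action status" for each of them.
summarize_hsm_actions() {
    sed -n 's/.* fid=\(\[[^]]*\]\).* action=\([A-Z]*\).* status=\([A-Z]*\).*/\1 \2 \3/p'
}

# Sample record from Step 2, rejoined onto one line.
sample='lrh=[type=10680000 len=136 idx=1/43086] fid=[0x2000013a1:0x1f17d:0x0] dfid=[0x2000013a1:0x1f17d:0x0] compound/cookie=0x574badc0/0x574baddb action=ARCHIVE archive#=1 flags=0x0 extent=0x0-0xffffffffffffffff gid=0x0 datalen=0 status=WAITING data=[]'

summary=$(printf '%s\n' "$sample" | summarize_hsm_actions)
echo "$summary"
```

On the MDS you would feed it the real queue, for example
cat /proc/fs/lustre/mdt/*/hsm/actions | summarize_hsm_actions, and compare
the FID with the lfs path2fid output from Step 1 before and after restarting
the copytool.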


I can replicate your exact behavior by introducing failures into the posix
copytool (e.g. setting hsm-root to /dev/null), and it will note that the
archive was attempted and failed (exists, with flag 0x00000001).

It's interesting that you're seeing the FID issue. I think the first thing
to do is pin that down and see why the copytool is receiving the bad FID.

-cf