[Lustre-discuss] OST unavailable tests

Shantanu S Pavgi pavgi at uab.edu
Tue Jul 7 14:33:49 PDT 2009


Hello again,

I am struggling to understand the recovery procedure (or how to abort it)
after an OST becomes unavailable. I have tried deactivating the corresponding
OSC on the MDS, but that hasn't worked so far.

- How do I stop an already connected client from hanging on the df command?
If a new client (not connected at failure time) tries to mount the file
system, it does not receive any error or denial message.
- I assume 'lctl --device <OST device number> abort_recovery' needs to
be run on the MGS. Please correct me if I am wrong.
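
In case it helps to pin down what I mean by the device number: as far as I
understand, it is the index in the first column of 'lctl dl' output on the
node in question. Below is a minimal Python sketch of that lookup, assuming
the usual column order (index, status, type, name, UUID, refcount). The
listing in SAMPLE_DL is made up for illustration and was not captured from my
machines, and find_device_number / lctl_command are hypothetical helper names.

```python
# Hypothetical 'lctl dl' listing, for illustration only (not real output from
# the test installation). Assumed columns: index, status, type, name, uuid,
# refcount.
SAMPLE_DL = """\
  0 UP mgs MGS MGS 5
  1 UP lov pacific-mdtlov pacific-mdtlov_UUID 4
  2 UP mds pacific-MDT0000 pacific-MDT0000_UUID 3
  5 UP osc pacific-OST0000-osc pacific-mdtlov_UUID 5
  6 UP osc pacific-OST0001-osc pacific-mdtlov_UUID 5
"""


def find_device_number(dl_output, name):
    """Return the device index (first column) of the device called `name`."""
    for line in dl_output.splitlines():
        fields = line.split()
        # The fourth column is assumed to hold the device name.
        if len(fields) >= 4 and fields[3] == name:
            return int(fields[0])
    raise LookupError(f"no device named {name!r} in the listing")


def lctl_command(dl_output, name, action):
    """Build (but do not run) an 'lctl --device <no> <action>' command line."""
    devno = find_device_number(dl_output, name)
    return ["lctl", "--device", str(devno), action]


if __name__ == "__main__":
    cmd = lctl_command(SAMPLE_DL, "pacific-OST0000-osc", "deactivate")
    print(" ".join(cmd))
```

On the sample listing this prints 'lctl --device 5 deactivate'. It only builds
the command line and does not run lctl.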

I would really appreciate it if someone could help me with this issue.

Thanks,
Shantanu Pavgi.



Shantanu S Pavgi wrote:
> Hi,
>
> I am exploring Lustre configuration on my test installation and running
> some tests of OST crash/unavailability and recovery. I am unable to
> unmount the file system from the client, and the client hangs even after
> I deactivate the corresponding OSC on the client. I would like to get a
> better understanding of what is happening here. The test installation is
> as follows:
> - combined MGS/MDS with OSS/T
> - separate OSS/T
> - separate client
>
> * The following steps were performed:
> Step 1: unmounted the OST on the separate OSS box.
>  -- df command hung waiting for a response from the corresponding OST.
>
> Step 2: mounted the OST again
>  -- df showed output after the latency time was over.
>
> Step 3: unmounted the OST on the separate OSS box, deactivated the
> corresponding OSC on the MDS (lctl --device <no> deactivate)
>  -- df command hung waiting for a response from the corresponding OST
>
> Step 4: deactivated the corresponding OSC on the client (lctl --device <no>
> deactivate)
>  -- df command hung waiting for a response from the corresponding OST
>
> Step 5: unmounted the file system from the client
>  -- device busy message
>
> * The following are the log messages on the MDS/MGS machine:
> Jul  6 14:36:46 localhost kernel: Lustre: Request x18446744073241928855
> sent from pacific-OST0000-osc to NID 10.0.0.15 at tcp 56s ago has timed out
> (limit 56s).
> Jul  6 14:36:46 localhost kernel: Lustre: Skipped 7 previous similar
> messages
> Jul  6 14:40:38 localhost dhclient: DHCPREQUEST on eth0 to 10.0.0.91 port 67
> Jul  6 14:40:38 localhost dhclient: DHCPACK from 10.0.0.91
> Jul  6 14:40:38 localhost dhclient: bound to 10.0.0.18 -- renewal in 376
> seconds.
> Jul  6 14:40:50 localhost kernel: Lustre:
> 7082:0:(import.c:508:import_select_connection()) pacific-OST0000-osc:
> tried all connections, increasing latency to 51s
> Jul  6 14:40:50 localhost kernel: Lustre:
> 7082:0:(import.c:508:import_select_connection()) Skipped 7 previous
> similar messages
> Jul  6 14:46:54 localhost dhclient: DHCPREQUEST on eth0 to 10.0.0.91 port 67
> Jul  6 14:46:54 localhost dhclient: DHCPACK from 10.0.0.91
> Jul  6 14:46:54 localhost dhclient: bound to 10.0.0.18 -- renewal in 415
> seconds.
> Jul  6 14:48:01 localhost kernel: Lustre: Request x18446744073241928909
> sent from pacific-OST0000-osc to NID 10.0.0.15 at tcp 56s ago has timed out
> (limit 56s).
> Jul  6 14:48:01 localhost kernel: Lustre: Skipped 8 previous similar
> messages
>
> * The following are the log messages from the client:
> Jul  6 14:30:19 localhost kernel: Lustre: Request x1685132016 sent from
> pacific-OST0000-osc-c472e000 to NID 10.0.0.15 at tcp 56s ago has timed out
> (limit 56s).
> Jul  6 14:30:19 localhost kernel: Lustre: Skipped 7 previous similar
> messages
> Jul  6 14:31:53 localhost kernel: Lustre:
> 1910:0:(import.c:508:import_select_connection())
> pacific-OST0000-osc-c472e000: tried all connections, increasing latency
> to 51s
> Jul  6 14:31:53 localhost kernel: Lustre:
> 1910:0:(import.c:508:import_select_connection()) Skipped 7 previous
> similar messages
> Jul  6 14:35:29 localhost dhclient: DHCPREQUEST on eth0 to 10.0.0.91 port 67
> Jul  6 14:35:29 localhost dhclient: DHCPACK from 10.0.0.91
> Jul  6 14:35:29 localhost dhclient: bound to 10.0.0.11 -- renewal in 372
> seconds.
> Jul  6 14:41:34 localhost kernel: Lustre: Request x1685132085 sent from
> pacific-OST0000-osc-c472e000 to NID 10.0.0.15 at tcp 56s ago has timed out
> (limit 56s).
> Jul  6 14:41:34 localhost kernel: Lustre: Skipped 8 previous similar
> messages
> Jul  6 14:41:41 localhost dhclient: DHCPREQUEST on eth0 to 10.0.0.91 port 67
> Jul  6 14:41:41 localhost dhclient: DHCPACK from 10.0.0.91
> Jul  6 14:41:41 localhost dhclient: bound to 10.0.0.11 -- renewal in 442
> seconds.
> Jul  6 14:41:53 localhost kernel: Lustre:
> 1910:0:(import.c:508:import_select_connection())
> pacific-OST0000-osc-c472e000: tried all connections, increasing latency
> to 51s
> Jul  6 14:41:53 localhost kernel: Lustre:
> 1910:0:(import.c:508:import_select_connection()) Skipped 7 previous
> similar messages
>
> Any insights? 
>
> Thanks,
> Shantanu Pavgi.
>
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>   
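
To make the quoted logs easier to compare, here is a rough Python sketch that
counts the request timeouts per OSC and target NID. It assumes the message
layout seen in the excerpts above. The pattern accepts the NID written either
as '10.0.0.15@tcp' or as '10.0.0.15 at tcp', since the list archive may
rewrite the '@'. The names normalize and summarize_timeouts are my own and
are not part of Lustre.

```python
import re
from collections import Counter

# Excerpt of the quoted syslog lines, kept with the "> " quote prefix and the
# line wrapping introduced by the mail client.
SAMPLE = """\
> Jul  6 14:36:46 localhost kernel: Lustre: Request x18446744073241928855
> sent from pacific-OST0000-osc to NID 10.0.0.15 at tcp 56s ago has timed out
> (limit 56s).
> Jul  6 14:36:46 localhost kernel: Lustre: Skipped 7 previous similar
> messages
> Jul  6 14:30:19 localhost kernel: Lustre: Request x1685132016 sent from
> pacific-OST0000-osc-c472e000 to NID 10.0.0.15 at tcp 56s ago has timed out
> (limit 56s).
> Jul  6 14:41:34 localhost kernel: Lustre: Request x1685132085 sent from
> pacific-OST0000-osc-c472e000 to NID 10.0.0.15 at tcp 56s ago has timed out
> (limit 56s).
"""

# One request-timeout message: request id, sending OSC, target NID, age, limit.
TIMEOUT_RE = re.compile(
    r"Request x(?P<xid>\d+) sent from (?P<osc>\S+) to NID "
    r"(?P<addr>\S+?)(?:@| at )(?P<net>\S+) (?P<age>\d+)s ago has timed out "
    r"\(limit (?P<limit>\d+)s\)"
)


def normalize(text):
    """Drop leading '>' quote markers and join wrapped lines with spaces."""
    parts = []
    for raw in text.splitlines():
        line = raw.lstrip()
        if line.startswith(">"):
            line = line[1:]
        parts.append(line.strip())
    return " ".join(parts)


def summarize_timeouts(text):
    """Count timed-out requests per (OSC name, target NID).

    Only messages that are printed are counted; the "Skipped N previous
    similar messages" lines are ignored.
    """
    counts = Counter()
    for m in TIMEOUT_RE.finditer(normalize(text)):
        nid = f"{m.group('addr')}@{m.group('net')}"
        counts[(m.group("osc"), nid)] += 1
    return counts


if __name__ == "__main__":
    for (osc, nid), n in sorted(summarize_timeouts(SAMPLE).items()):
        print(f"{osc} -> {nid}: {n} timeout(s)")
```

On the excerpt it reports one timeout for pacific-OST0000-osc (the MDS side)
and two for pacific-OST0000-osc-c472e000 (the client side), all against
10.0.0.15@tcp.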



