[Lustre-discuss] Nodes claim error with files, then say everything is fine.

Chris Worley worleys at gmail.com
Wed Aug 6 11:11:16 PDT 2008


On Wed, Aug 6, 2008 at 11:26 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
>
> +rpctrace debug is probably the way to go to see what the client
> are(n't) doing in terms of keeping the MDS aware of it's existence.
>

The log from the client perspective starts with a lot of packet mismatches:

00000100:00020000:3:1217629906.902955:0:5592:0:(events.c:116:reply_in_callback())
early reply sized 176, expect 128
00000100:00020000:3:1217633409.730495:0:5590:0:(events.c:116:reply_in_callback())
early reply sized 240, expect 128
00000400:00000100:3:1217633608.884380:0:5588:0:(lib-move.c:1647:lnet_parse_put())
Dropping PUT from 12345-36.102.29.1 at o2ib portal 25 match 157556 offset
0 length 128
: 2
00000100:00020000:3:1217633710.886759:0:5587:0:(events.c:116:reply_in_callback())
early reply sized 176, expect 128
00000100:00020000:3:1217633852.887489:0:5586:0:(events.c:116:reply_in_callback())
early reply sized 176, expect 128
00000100:00020000:3:1217634099.512727:0:5591:0:(events.c:116:reply_in_callback())
early reply sized 240, expect 128

...repeat many times.  Then it repeats this many times:

00000100:00080000:6:1217651769.943608:0:5596:0:(recover.c:54:ptlrpc_initiate_recovery())
MGS: starting recovery
00000100:00080000:6:1217651769.943619:0:5596:0:(import.c:465:ptlrpc_connect_import())
ffff810275e06188 MGS: changing import state from DISCONN to CONNECTING
00000100:00080000:6:1217651769.943623:0:5596:0:(import.c:342:import_select_connection())
MGC36.102.29.1 at o2ib: connect to NID 36.102.29.1 at o2ib last attempt
4342590785
00000100:00080000:6:1217651769.943628:0:5596:0:(import.c:423:import_select_connection())
MGC36.102.29.1 at o2ib: import ffff810275e06188 using connection
MGC36.102.29.1
@o2ib_0/36.102.29.1 at o2ib
00000100:00080000:6:1217651769.943630:0:5596:0:(import.c:504:ptlrpc_connect_import())
Last reconnection attempt (743) for MGS
00000100:00080000:6:1217651820.942855:0:5596:0:(recover.c:54:ptlrpc_initiate_recovery())
MGS: starting recovery
00000100:00080000:6:1217651820.942858:0:5596:0:(import.c:465:ptlrpc_connect_import())
ffff810275e06188 MGS: changing import state from DISCONN to CONNECTING
00000100:00080000:6:1217651820.942862:0:5596:0:(import.c:342:import_select_connection())
MGC36.102.29.1 at o2ib: connect to NID 36.102.29.1 at o2ib last attempt
4342641786
00000100:00080000:6:1217651820.942866:0:5596:0:(import.c:423:import_select_connection())
MGC36.102.29.1 at o2ib: import ffff810275e06188 using connection
MGC36.102.29.1
@o2ib_0/36.102.29.1 at o2ib
00000100:00080000:6:1217651820.942868:0:5596:0:(import.c:504:ptlrpc_connect_import())
Last reconnection attempt (744) for MGS

... repeat... with one of these embedded:

00000100:00080000:6:1217651860.941624:0:5596:0:(pinger.c:144:ptlrpc_pinger_main())
not pinging MGS (in recovery: EVICTED or recovery disabled: 0/0)

...ending with this:


00000100:00080000:6:1217652002.937993:0:5596:0:(import.c:504:ptlrpc_connect_import())
Last reconnection attempt (748) for MGS
00000100:00080000:3:1217663095.721928:0:9878:0:(import.c:1016:ptlrpc_invalidate_import_thread())
thread invalidate import MGC36.102.29.1 at o2ib to MGS at 36.102.29.1@o2ib
10000000:01000000:3:1217663095.721931:0:9878:0:(mgc_request.c:880:mgc_import_event())
import event 0x808003
00000100:00080000:3:1217663095.722708:0:9878:0:(import.c:1025:ptlrpc_invalidate_import_thread())
ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:3:1217663095.722710:0:9878:0:(import.c:1103:ptlrpc_import_recovery_state_machine())
reconnected to MGS at 36.102.29.1@o2ib
00000100:00080000:3:1217663095.722712:0:9878:0:(import.c:1108:ptlrpc_import_recovery_state_machine())
ffff810275e06188 MGS: changing import state from RECOVER to FUL
L
10000000:01000000:3:1217663095.722714:0:9878:0:(mgc_request.c:880:mgc_import_event())
import event 0x808004
10000000:02000400:3:1217663095.722715:0:9878:0:(mgc_request.c:894:mgc_import_event())
MGC36.102.29.1 at o2ib: Reactivating import
00000100:02000000:3:1217663095.722719:0:9878:0:(import.c:1116:ptlrpc_import_recovery_state_machine())
MGC36.102.29.1 at o2ib: Connection restored to service MGS using n
id 36.102.29.1 at o2ib.

...then followed by length mismatches interspersed with "recovery":

00000100:00020000:3:1217663393.547988:0:5586:0:(events.c:116:reply_in_callback())
early reply sized 240, expect 128
00000100:00020000:3:1217663677.841799:0:5591:0:(events.c:116:reply_in_callback())
early reply sized 240, expect 128
00000100:00020000:3:1217663714.726333:0:5586:0:(events.c:116:reply_in_callback())
early reply sized 176, expect 128
00000100:00020000:3:1217663819.561596:0:5592:0:(events.c:116:reply_in_callback())
early reply sized 240, expect 128
00000100:00020000:3:1217664140.732610:0:5592:0:(events.c:116:reply_in_callback())
early reply sized 176, expect 128
00000100:00020000:3:1217664244.038963:0:5590:0:(events.c:116:reply_in_callback())
early reply sized 240, expect 128
00000100:00020000:3:1217664556.735280:0:5586:0:(events.c:116:reply_in_callback())
early reply sized 176, expect 128
00000100:00080000:3:1217665408.738550:0:9927:0:(import.c:1016:ptlrpc_invalidate_import_thread())
thread invalidate import MGC36.102.29.1 at o2ib to MGS at 36.102.29.1@o2ib
10000000:01000000:3:1217665408.738554:0:9927:0:(mgc_request.c:880:mgc_import_event())
import event 0x808003
00000100:00080000:3:1217665408.739328:0:9927:0:(import.c:1025:ptlrpc_invalidate_import_thread())
ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:3:1217665408.739330:0:9927:0:(import.c:1103:ptlrpc_import_recovery_state_machine())
reconnected to MGS at 36.102.29.1@o2ib
00000100:00080000:3:1217665408.739333:0:9927:0:(import.c:1108:ptlrpc_import_recovery_state_machine())
ffff810275e06188 MGS: changing import state from RECOVER to FUL
L
10000000:01000000:3:1217665408.739335:0:9927:0:(mgc_request.c:880:mgc_import_event())
import event 0x808004
10000000:02000400:3:1217665408.739336:0:9927:0:(mgc_request.c:894:mgc_import_event())
MGC36.102.29.1 at o2ib: Reactivating import
00000100:02000000:3:1217665408.739340:0:9927:0:(import.c:1116:ptlrpc_import_recovery_state_machine())
MGC36.102.29.1 at o2ib: Connection restored to service MGS using n
id 36.102.29.1 at o2ib.
00000100:00080000:3:1217665408.739344:0:9927:0:(recover.c:175:ptlrpc_wake_delayed())
@@@ waking (set 0000000000000000):  req at ffff8101d3aa78e8 x328561/t0
o501->MGS at 36
.102.29.1 at o2ib:26/25 lens 200/304 e 0 to 40 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
00000100:00020000:3:1217665653.806267:0:5589:0:(events.c:116:reply_in_callback())
early reply sized 240, expect 128
00000100:00020000:3:1217665692.740540:0:5589:0:(events.c:116:reply_in_callback())
early reply sized 176, expect 128
00000100:00020000:6:1217666544.737120:0:5592:0:(events.c:116:reply_in_callback())
early reply sized 176, expect 128
00000100:00020000:6:1217666647.982634:0:5590:0:(events.c:116:reply_in_callback())
early reply sized 240, expect 128
00000100:00020000:6:1217667112.732267:0:5588:0:(events.c:116:reply_in_callback())
early reply sized 176, expect 128
00000100:00080000:3:1217667254.732738:0:9966:0:(import.c:1016:ptlrpc_invalidate_import_thread())
thread invalidate import MGC36.102.29.1 at o2ib to MGS at 36.102.29.1@o2ib
10000000:01000000:3:1217667254.732742:0:9966:0:(mgc_request.c:880:mgc_import_event())
import event 0x808003
00000100:00080000:3:1217667254.733543:0:9966:0:(import.c:1025:ptlrpc_invalidate_import_thread())
ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:3:1217667254.733546:0:9966:0:(import.c:1103:ptlrpc_import_recovery_state_machine())
reconnected to MGS at 36.102.29.1@o2ib
00000100:00080000:3:1217667254.733549:0:9966:0:(import.c:1108:ptlrpc_import_recovery_state_machine())
ffff810275e06188 MGS: changing import state from RECOVER to FUL
L
10000000:01000000:3:1217667254.733552:0:9966:0:(mgc_request.c:880:mgc_import_event())
import event 0x808004
10000000:02000400:3:1217667254.733553:0:9966:0:(mgc_request.c:894:mgc_import_event())
MGC36.102.29.1 at o2ib: Reactivating import
00000100:02000000:3:1217667254.733556:0:9966:0:(import.c:1116:ptlrpc_import_recovery_state_machine())
MGC36.102.29.1 at o2ib: Connection restored to service MGS using n
id 36.102.29.1 at o2ib.
00000100:00080000:3:1217667254.733559:0:9966:0:(recover.c:175:ptlrpc_wake_delayed())
@@@ waking (set 0000000000000000):  req at ffff8102341658e8 x338536/t0
o501->MGS at 36
.102.29.1 at o2ib:26/25 lens 200/304 e 0 to 40 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
00000100:00020000:6:1217667254.733672:0:5586:0:(events.c:116:reply_in_callback())
early reply sized 176, expect 128
00000100:00020000:6:1217667502.281243:0:5589:0:(events.c:116:reply_in_callback())
early reply sized 240, expect 128
00000100:00020000:3:1217668248.722633:0:5587:0:(events.c:116:reply_in_callback())
early reply sized 176, expect 128

... lots of these mismatches, followed by what look like more recoveries:

00000100:02000000:0:1217678411.535026:0:12692:0:(import.c:1116:ptlrpc_import_recovery_state_machine())
MGC36.102.29.1 at o2ib: Connection restored to service MGS using
nid 36.102.29.1 at o2ib.
00000100:00020000:0:1217678425.863040:0:5590:0:(events.c:116:reply_in_callback())
early reply sized 240, expect 128
00000100:00080000:0:1217678462.534057:0:12693:0:(import.c:1016:ptlrpc_invalidate_import_thread())
thread invalidate import MGC36.102.29.1 at o2ib to MGS at 36.102.29.1@o2i
b
10000000:01000000:0:1217678462.534061:0:12693:0:(mgc_request.c:880:mgc_import_event())
import event 0x808003
00000100:00080000:0:1217678462.534192:0:12693:0:(import.c:1025:ptlrpc_invalidate_import_thread())
ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:0:1217678462.534194:0:12693:0:(import.c:1103:ptlrpc_import_recovery_state_machine())
reconnected to MGS at 36.102.29.1@o2ib
00000100:00080000:0:1217678462.534195:0:12693:0:(import.c:1108:ptlrpc_import_recovery_state_machine())
ffff810275e06188 MGS: changing import state from RECOVER to FU
LL
10000000:01000000:0:1217678462.534196:0:12693:0:(mgc_request.c:880:mgc_import_event())
import event 0x808004
10000000:02000400:0:1217678462.534197:0:12693:0:(mgc_request.c:894:mgc_import_event())
MGC36.102.29.1 at o2ib: Reactivating import

...repeated many, many, many times... then:

00000100:00080000:4:1217780793.292959:0:16703:0:(import.c:1108:ptlrpc_import_recovery_state_machine())
ffff810275e06188 MGS: changing import state from RECOVER to FU
LL
10000000:01000000:4:1217780793.292960:0:16703:0:(mgc_request.c:880:mgc_import_event())
import event 0x808004
10000000:02000400:4:1217780793.292961:0:16703:0:(mgc_request.c:894:mgc_import_event())
MGC36.102.29.1 at o2ib: Reactivating import
00000100:02000000:4:1217780793.292962:0:16703:0:(import.c:1116:ptlrpc_import_recovery_state_machine())
MGC36.102.29.1 at o2ib: Connection restored to service MGS using
nid 36.102.29.1 at o2ib.
10000000:01000000:4:1217780806.980814:0:6127:0:(mgc_request.c:289:mgc_requeue_thread())
updating log lfs-client
10000000:01000000:4:1217780806.980817:0:6127:0:(mgc_request.c:1092:mgc_process_log())
Process log lfs-client:ffff81027cca8d20 from 530
10000000:01000000:4:1217780806.980819:0:6127:0:(mgc_request.c:627:mgc_enqueue())
Enqueue for lfs-client (res 0x73666c)
00000100:00020000:0:1217780806.980984:0:5588:0:(events.c:116:reply_in_callback())
early reply sized 240, expect 128
10000000:01000000:4:1217780844.291599:0:6127:0:(mgc_request.c:1111:mgc_process_log())
Can't get cfg lock: -5
00000100:00080000:0:1217780844.291621:0:5596:0:(recover.c:54:ptlrpc_initiate_recovery())
MGS: starting recovery
00000100:00080000:4:1217780844.291623:0:6127:0:(client.c:1801:ptlrpc_queue_wait())
@@@ "ll_cfg_requeue" waiting for recovery: (FULL != DISCONN)
req at ffff810238534098
 x935044/t0 o501->MGS at 36.102.29.1@o2ib:26/25 lens 200/304 e 0 to 40 dl
0 ref 1 fl Rpc:/0/0 rc 0/0
00000100:00080000:0:1217780844.291625:0:5596:0:(import.c:465:ptlrpc_connect_import())
ffff810275e06188 MGS: changing import state from DISCONN to CONNECTING
00000100:00080000:0:1217780844.291630:0:5596:0:(import.c:342:import_select_connection())
MGC36.102.29.1 at o2ib: connect to NID 36.102.29.1 at o2ib last attempt
4471669871
10000000:01000000:4:1217780844.291636:0:5660:0:(mgc_request.c:565:mgc_blocking_ast())
Lock res 0x73666c (lfs)
00000100:00080000:0:1217780844.291637:0:5596:0:(import.c:423:import_select_connection())
MGC36.102.29.1 at o2ib: import ffff810275e06188 using connection
MGC36.102.29.1
@o2ib_0/36.102.29.1 at o2ib
10000000:01000000:4:1217780844.291638:0:5660:0:(mgc_request.c:590:mgc_blocking_ast())
log lfs-client: original grant failed, will requeue later
00000100:00080000:0:1217780844.291639:0:5596:0:(import.c:504:ptlrpc_connect_import())
Last reconnection attempt (3480) for MGS
00000100:00080000:4:1217780844.291870:0:5595:0:(import.c:709:ptlrpc_connect_interpret())
reconnected to MGS at 36.102.29.1@o2ib after partition
00000100:00080000:4:1217780844.291873:0:5595:0:(import.c:714:ptlrpc_connect_interpret())
MGC36.102.29.1 at o2ib: reconnected but import is invalid; marking
evicted
00000100:00080000:4:1217780844.291874:0:5595:0:(import.c:715:ptlrpc_connect_interpret())
ffff810275e06188 MGS: changing import state from CONNECTING to EVICTED
00000100:00080000:4:1217780844.291877:0:5595:0:(import.c:1053:ptlrpc_import_recovery_state_machine())
evicted from MGS at 36.102.29.1@o2ib; invalidating
10000000:01000000:4:1217780844.291975:0:5595:0:(mgc_request.c:880:mgc_import_event())
import event 0x808005
00000100:00080000:0:1217780844.291996:0:16704:0:(import.c:1016:ptlrpc_invalidate_import_thread())
thread invalidate import MGC36.102.29.1 at o2ib to MGS at 36.102.29.1@o2i
b
10000000:01000000:0:1217780844.292000:0:16704:0:(mgc_request.c:880:mgc_import_event())
import event 0x808003
00000100:00080000:0:1217780844.292132:0:16704:0:(import.c:1025:ptlrpc_invalidate_import_thread())
ffff810275e06188 MGS: changing import state from EVICTED to RECOVER
00000100:00080000:0:1217780844.292134:0:16704:0:(import.c:1103:ptlrpc_import_recovery_state_machine())
reconnected to MGS at 36.102.29.1@o2ib
00000100:00080000:0:1217780844.292135:0:16704:0:(import.c:1108:ptlrpc_import_recovery_state_machine())
ffff810275e06188 MGS: changing import state from RECOVER to FU
LL


Does this give any hints... there's a lot more log to go.

Thanks,

Chris



More information about the lustre-discuss mailing list