[Lustre-discuss] yet another lustre error
Brock Palen
brockp at umich.edu
Mon Mar 10 06:46:24 PDT 2008
On Mar 9, 2008, at 10:01 PM, Aaron Knister wrote:
> Hi! I have a few questions for you-
>
> 1. How many nodes was his job running on?
Around 64 serial jobs accessing the same directory (not the same files).
> 2. What version of lustre and linux kernel are you running on your
> servers/clients?
Lustre servers:
2.6.9-55.0.9.EL_lustre.1.6.4.1smp
Clients:
2.6.9-67.0.1.ELsmp
> 3. What ethernet module are you using on the servers/clients?
Most use the tg3, some use e1000.
>
> I honestly am not sure what the RPC errors mean but I've had
> similar issues caused by ethernet-level errors.
Over the weekend the MDS/MGS went into an unhealthy state, which forced
a reboot+fsck. When it came back up, the directory was accessible
again and jobs started working again.
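
On what the RPC errors mean: the -16 in the messages quoted below is a negated Linux errno, and errno 16 is EBUSY, which matches the MDS reporting that it is "still busy with 2 active RPCs". A minimal Python sketch for pulling both facts out of log lines like these follows. The function names and regular expressions are illustrative assumptions, not part of any Lustre tool, and the sketch assumes each wrapped log message has first been joined onto one line.

```python
import errno
import re

# Client-side message, e.g. "The mds_connect operation failed with -16".
FAILED_RE = re.compile(r"The (\w+) operation failed with (-?\d+)")

# Server-side refusal, e.g. "still busy with 2 active RPCs".
BUSY_RE = re.compile(r"still busy with (\d+) active RPCs")


def decode_rc(rc):
    """Map a negative kernel-style return code to its errno symbol."""
    return errno.errorcode.get(abs(rc), "UNKNOWN")


def parse_client_error(line):
    """Return (operation, rc, errno symbol), or None if the line does not match."""
    m = FAILED_RE.search(line)
    if m is None:
        return None
    op, rc = m.group(1), int(m.group(2))
    return op, rc, decode_rc(rc)


def active_rpcs(line):
    """Return the active RPC count from a refusal line, or None if absent."""
    m = BUSY_RE.search(line)
    return int(m.group(1)) if m else None


if __name__ == "__main__":
    client = "The mds_connect operation failed with -16"
    server = "refuse reconnection; still busy with 2 active RPCs"
    print(parse_client_error(client))
    print(active_rpcs(server))
```

Run over the client syslog and the MDS lustre-log dumps, something like this would show which operations are being refused as busy and how many RPCs the server still considers outstanding for each reconnecting client.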
>
> -Aaron
>
> On Mar 7, 2008, at 6:45 PM, Brock Palen wrote:
>
>> On a file system that's been up for only 57 days, I have:
>>
>> 505 lustre-log. dumps.
>>
>> The problem at hand is that a user has many jobs that are now
>> hung trying to create a directory from his PBS script. On the
>> clients I see:
>>
>> LustreError: 11-0: an error occurred while communicating with
>> 141.212.30.184@tcp. The mds_connect operation failed with -16
>> LustreError: Skipped 2 previous similar messages
>>
>> On every client his jobs are on.
>>
>> In the most recent /tmp/lustre-log. on the MDS/MGS I see this
>> message:
>>
>> @@@ processing error (-16) req@000001001af9a600 x12808293/t0 o38->32633f05-02c6-50a5-b496-047150f1fe81@NET_0x200000aa4003e_UUID:-1 lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
>> ldlm_lib.c
>> target_handle_reconnect
>> nobackup-MDT0000: 34b4fbea-200b-1f7c-dac0-516b8ce786fc reconnecting
>> ldlm_lib.c
>> target_handle_connect
>> nobackup-MDT0000: refuse reconnection from 34b4fbea-200b-1f7c-dac0-516b8ce786fc@10.164.0.111@tcp to 0x00000100069a7000; still busy with 2 active RPCs
>> ldlm_lib.c
>> target_send_reply_msg
>> @@@ processing error (-16) req@0000010019159a00 x11199816/t0 o38->34b4fbea-200b-1f7c-dac0-516b8ce786fc@NET_0x200000aa4006f_UUID:-1 lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
>>
>>
>> I see messages about active RPCs in other logs as well. What would
>> this mean? Is something stuck someplace?
>>
>>
>>
>> Brock Palen
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
>
> (301) 595-7000
> aaron at iges.org
>
>
>
>
>
>