Client not accepting response over network hangs server

Description

A customer has been plagued by server hangs for quite a while.

Finally we seem to have found the root cause.
Server logs with log_events set to 2047:
06/14/2017 01:18:36;0100;Server@host2046;Req;;Type 49 request received from user1@<hostname>, sock=40
06/14/2017 01:18:36;0100;Server@host2046;Req;;Type 0 request received from user2@<hostname>, sock=40
06/14/2017 01:18:36;0100;Server@host2046;Req;;Type 49 request received from user2@<hostname>, sock=42
06/14/2017 01:18:36;0100;Server@host2046;Req;;Type 21 request received from user2@<hostname>, sock=40
06/14/2017 01:18:36;0100;Server@host2046;Req;;Type 0 request received from user2@<hostname>, sock=40
06/14/2017 01:18:36;0100;Server@host2046;Req;;Type 49 request received from user2@<hostname>, sock=42
06/14/2017 01:18:36;0100;Server@host2046;Req;;Type 21 request received from user2@<hostname>, sock=40
06/14/2017 01:18:36;0100;Server@host2046;Req;;Type 19 request received from user2@<hostname>, sock=40
06/14/2017 01:18:37;0100;Server@host2046;Req;;Type 19 request received from user1@<hostname>, sock=16
06/14/2017 01:29:36;0002;Server@host2046;Svr;Log;Log opened
06/14/2017 01:29:36;0002;Server@host2046;Svr;Server@host2046;pbs_version=PBSPro_13.1.0.160576

The last two lines are due to the customer restarting the server, which he considered hung. He has a point: it had been unresponsive to clients for more than 10 minutes.

Before restarting the server, the customer took three stack traces, which are identical:
Thread 3 (Thread 0x2b5e4fc7b700 (LWP 6408)):
#0 0x00002b5e4c1593e8 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002b5e4d607072 in ?? ()
#2 0x0000000000000000 in ?? ()
Thread 2 (Thread 0x2b5e4fa7a700 (LWP 2337)):
#0 0x00002b5e4cfd6603 in epoll_wait () from /lib64/libc.so.6
#1 0x0000000000523ef7 in tpp_em_wait (em_ctx=<optimized out>, ev_array=<optimized out>, timeout=<optimized out>) at /home/pbsbuild/PBSPro_mainline/PBSPro_13.1.0.160576/pbs/src/lib/Libtpp/tpp_em.c:280
#2 0x0000000000521441 in work (v=0xfd99f0) at /home/pbsbuild/PBSPro_mainline/PBSPro_13.1.0.160576/pbs/src/lib/Libtpp/tpp_transport.c:1510
#3 0x00002b5e4c1550a4 in start_thread () from /lib64/libpthread.so.0
#4 0x00002b5e4cfd602d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2b5e4b4d4600 (LWP 2336)):
#0 0x00002b5e4cfcdbfd in poll () from /lib64/libc.so.6
#1 0x0000000000503b6e in DIS_tcp_wflush (fd=fd@entry=16) at /home/pbsbuild/PBSPro_mainline/PBSPro_13.1.0.160576/pbs/src/lib/Libpbs/../Libifl/tcp_dis.c:361
#2 0x0000000000503c52 in tcp_puts (fd=16, str=0x8468c11 "<jsdl-hpcpa:Argument>-W</jsdl-hpcpa:Argument><jsdl-hpcpa:Argument>umask=027</jsdl-hpcpa:Argument><jsdl-hpcpa:Argument>-S</jsdl-hpcpa:Argument><jsdl-hpcpa:Argument>/bin/ksh</jsdl-hpcpa:Argument><jsdl-h"..., ct=1237) at /home/pbsbuild/PBSPro_mainline/PBSPro_13.1.0.160576/pbs/src/lib/Libpbs/../Libifl/tcp_dis.c:501
#3 0x0000000000510341 in diswcs (stream=stream@entry=16, value=value@entry=0x8468c11 "<jsdl-hpcpa:Argument>-W</jsdl-hpcpa:Argument><jsdl-hpcpa:Argument>umask=027</jsdl-hpcpa:Argument><jsdl-hpcpa:Argument>-S</jsdl-hpcpa:Argument><jsdl-hpcpa:Argument>/bin/ksh</jsdl-hpcpa:Argument><jsdl-h"..., nchars=1237) at /home/pbsbuild/PBSPro_mainline/PBSPro_13.1.0.160576/pbs/src/lib/Libpbs/../Libdis/diswcs.c:50
#4 0x00000000005035ce in encode_DIS_svrattrl (sock=sock@entry=16, psattl=<optimized out>) at /home/pbsbuild/PBSPro_mainline/PBSPro_13.1.0.160576/pbs/src/lib/Libpbs/../Libifl/enc_svrattrl.c:83
#5 0x0000000000503395 in encode_DIS_reply_inner (sock=16, reply=<optimized out>) at /home/pbsbuild/PBSPro_mainline/PBSPro_13.1.0.160576/pbs/src/lib/Libpbs/../Libifl/enc_reply.c:114
#6 0x00000000004798b4 in dis_reply_write (preq=0x2564380, sfds=16) at /home/pbsbuild/PBSPro_mainline/PBSPro_13.1.0.160576/pbs/src/server/reply_send.c:149
#7 reply_send (request=request@entry=0x2564380) at /home/pbsbuild/PBSPro_mainline/PBSPro_13.1.0.160576/pbs/src/server/reply_send.c:253
#8 0x0000000000490fd8 in req_stat_job (preq=preq@entry=0x2564380) at /home/pbsbuild/PBSPro_mainline/PBSPro_13.1.0.160576/pbs/src/server/req_stat.c:356
#9 0x0000000000477db4 in dispatch_request (sfds=16, request=0x2564380) at /home/pbsbuild/PBSPro_mainline/PBSPro_13.1.0.160576/pbs/src/server/process_request.c:1014
#10 0x00000000004ce25f in wait_request (waittime=waittime@entry=2) at /home/pbsbuild/PBSPro_mainline/PBSPro_13.1.0.160576/pbs/src/lib/Libnet/net_server.c:476
#11 0x0000000000450247 in main (argc=<optimized out>, argv=0x7ffcda2f01a8) at /home/pbsbuild/PBSPro_mainline/PBSPro_13.1.0.160576/pbs/src/server/pbsd_main.c:2131

In other words: we are stuck in DIS_tcp_wflush, called from tcp_puts on behalf of encode_DIS_svrattrl, trying to write the reply to a qstat command to the socket with fd 16 (note that this is the socket in the last request line of the log).
DIS_tcp_wflush only poll()s when a write() to the socket cannot write a single byte and errno is EAGAIN; it then poll()s until POLLOUT is set on the socket, i.e. until the socket is ready to accept more data.

If the socket does not become writable within 30 seconds, the poll() call gives up; that failure is turned into a DIS_PROTO error in the callers and should abort the reply to the client.

./include/libpbs.h:#define PBS_DIS_TCP_TIMEOUT_SHORT 30
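
For illustration, here is a minimal sketch of that flush logic as described above. It is not the actual tcp_dis.c code; the function name flush_sketch and its structure are invented for this write-up, but the EAGAIN / poll(POLLOUT) / 30-second-timeout behaviour matches what the stack trace shows.

#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define PBS_DIS_TCP_TIMEOUT_SHORT 30            /* seconds, from include/libpbs.h */

/* Flush len bytes from buf to the non-blocking socket fd.
 * Returns 0 on success, -1 on error or timeout. */
static int flush_sketch(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n > 0) {                             /* some data accepted: keep going */
            buf += n;
            len -= (size_t)n;
            continue;
        }
        if (n == -1 && errno == EINTR)
            continue;
        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct pollfd pfd;
            pfd.fd = fd;
            pfd.events = POLLOUT;
            pfd.revents = 0;
            /* Wait up to 30 s for the socket to become writable again.
             * Note that this bounds only ONE poll(); every partial write
             * above re-arms the 30 s clock from scratch. */
            if (poll(&pfd, 1, PBS_DIS_TCP_TIMEOUT_SHORT * 1000) > 0 &&
                (pfd.revents & POLLOUT))
                continue;                        /* writable again: retry write() */
            return -1;                           /* timeout/error -> DIS_PROTO upstream */
        }
        return -1;                               /* hard write error */
    }
    return 0;
}

The important point is that the timeout bounds only a single poll() call, not the flush of the whole reply.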

That in itself is already a problem, since the timeout is anything but short and at many sites it will cause cascading failures (usually alarms in any hook that calls pbs.server()).

But here things are worse: the network connection to the client is unreliable, yet some data does trickle through, so we end up in the poll() call many times, enough to have been stuck here for over 10 minutes. Every partial write restarts the 30-second timer, so in practice, on most OSes, it is enough for the connection to accept one MTU or one memory page every 30 seconds and we will never decide to bail out; that can be a bandwidth as low as 0.13 KB/s.

Of course that means that, for very large qstat outputs over a sufficiently sick connection, the time we hang here, servicing no other requests, is essentially unbounded.
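
To put rough numbers on that (back-of-the-envelope, assuming the client drains about one 4 KB page per 30-second poll window, the slowest rate that still never trips the timeout):

4096 bytes / 30 s ≈ 137 bytes/s ≈ 0.13 KB/s
a 1 MB qstat reply at that rate: 1,048,576 / 137 ≈ 7,650 s, i.e. more than 2 hours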

We have been able to confirm that this particular host was having issues: over the course of two days there were many hangs, but always of qstat requests associated with this host. And its logs do contain these snippets:

messages:2017-06-13T18:18:17.125866+02:00 <hostname> kernel: [3164577.901402] e1000e: eth0 NIC Link is Down

messages:2017-06-13T18:18:20.553843+02:00 <hostname> kernel: [3164581.328328] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

messages:2017-06-14T01:44:46.269864+02:00 <hostname> kernel: [3191344.424276] e1000e: eth0 NIC Link is Down

messages:2017-06-14T01:44:49.693868+02:00 <hostname> kernel: [3191347.847234] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

So the working hypothesis is that the host's networking is extremely flaky: not flaky enough to prevent it from sending short requests to us, but flaky enough to keep the server from delivering its reply, effectively hanging the server.

Indeed, since the NIC and cable have been replaced, the problems have ceased.

Acceptance Criteria

None

Status

Assignee

Unassigned

Reporter

Sam Goosen

Severity

2-Medium

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Fix versions

Priority

High