In a high-availability setup, the new server's leaf connection to pbs_comm keeps getting rejected because the stale connection from the old server is still registered

Description

When failover happens without a clean server shutdown on the old node, the leaf connection to pbs_comm from the old server node isn't torn down.

On the new node, the server starts up but sees no nodes as up and cannot contact the scheduler. The cause is that its leaf connection to pbs_comm is rejected, because pbs_comm refuses a duplicate connection request (one with a matching IP address and port). The old connection is stale, but pbs_comm does not know that, because not enough time has passed for the keepalive settings on the TCP/IP connection to detect it.
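To illustrate why keepalive can take a long time to notice a dead peer, here is a minimal sketch of how the three standard TCP keepalive tunables combine into a worst-case detection time. The struct and function names are hypothetical and are not taken from the PBS sources. The values pbs_comm actually uses are not stated in this ticket. With the stock Linux defaults (7200 s idle, 75 s interval, 9 probes) the window is a little over two hours.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical holder for the three keepalive tunables. */
struct keepalive_cfg {
	int idle_secs;  /* idle time before the first probe is sent */
	int intvl_secs; /* interval between unanswered probes */
	int probe_cnt;  /* unanswered probes before the peer is declared dead */
};

/* Worst-case seconds before a silent peer is declared dead. */
long keepalive_detect_secs(const struct keepalive_cfg *c)
{
	assert(c != NULL);
	return (long)c->idle_secs + (long)c->intvl_secs * c->probe_cnt;
}

/*
 * Enable keepalive on a TCP socket and, where the platform exposes the
 * per-socket options (Linux does), apply the tunables.
 * Returns 0 on success, -1 on failure.
 */
int keepalive_apply(int fd, const struct keepalive_cfg *c)
{
	int on = 1;

	assert(c != NULL);
	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
		return -1;
#ifdef TCP_KEEPIDLE
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &c->idle_secs, sizeof(int)) != 0 ||
	    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &c->intvl_secs, sizeof(int)) != 0 ||
	    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &c->probe_cnt, sizeof(int)) != 0)
		return -1;
#endif
	return 0;
}
```

Until that window elapses, the receiving side has no indication that the old server's connection is dead, so a duplicate check based only on "is an fd still registered for this address" will keep rejecting the new server.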

The pbs_comm log shows:

01/05/2017 03:42:53;0c06;Comm@server;TPP;Comm@server(Thread 1);tfd=16, Leaf registered address xxx.xxx.xxx.xxx:15001

01/05/2017 03:42:53;0c06;Comm@server;TPP;Comm@server(Thread 1);tfd=16, Leaf xxx.xxx.xxx.xxx:15001 still connected while another leaf connect arrived, dropping

The problem is that the rejections continue even after the stale TCP/IP connection has disappeared from netstat -tnap output: hours later, every reconnect attempt from the server (one every two seconds) is still rejected with the same message.

When it detects a duplicate connection, pbs_comm needs to either drop the old connection and accept the new one or, more conservatively, drop the old connection and reject the new incoming connection, since the leaf client will then retry and connect successfully.
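The conservative option can be sketched with a toy leaf table. This is an illustration only. The table, struct, and function names below are hypothetical and do not mirror the real tpp_router.c data structures. Closing the old connection is modelled by freeing the table slot directly, which is an assumption about what the real close eventually leads to.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LEAVES 8
#define ADDR_LEN 64

/* Hypothetical leaf table entry; not the real tpp_router.c structure. */
struct leaf_entry {
	char addr[ADDR_LEN]; /* "ip:port" the leaf registered with */
	int conn_fd;         /* fd of the registered connection, -1 if slot is free */
};

static struct leaf_entry leaves[MAX_LEAVES];
static int leaves_ready;

/* Mark every slot free on first use. */
static void leaves_init(void)
{
	int i;

	if (leaves_ready)
		return;
	for (i = 0; i < MAX_LEAVES; i++)
		leaves[i].conn_fd = -1;
	leaves_ready = 1;
}

/* Return the registered entry for addr, or NULL if none. */
static struct leaf_entry *leaf_find(const char *addr)
{
	int i;

	for (i = 0; i < MAX_LEAVES; i++)
		if (leaves[i].conn_fd != -1 && strcmp(leaves[i].addr, addr) == 0)
			return &leaves[i];
	return NULL;
}

/* Stand-in for closing the transport fd: here it only frees the slot. */
static void leaf_close(struct leaf_entry *l)
{
	l->conn_fd = -1;
	l->addr[0] = '\0';
}

/* Return the fd registered for addr, or -1 if addr is not registered. */
int leaf_conn_fd(const char *addr)
{
	struct leaf_entry *l;

	leaves_init();
	l = leaf_find(addr);
	return l != NULL ? l->conn_fd : -1;
}

/*
 * Register a leaf. On a duplicate address, drop the old connection AND
 * reject the new one (-1); the caller is expected to retry.
 * Returns 0 on success, -2 if the table is full.
 */
int leaf_register(const char *addr, int tfd)
{
	struct leaf_entry *l;
	int i;

	assert(addr != NULL && tfd >= 0);
	leaves_init();

	l = leaf_find(addr);
	if (l != NULL) {
		leaf_close(l); /* possibly stale: drop it so the retry can succeed */
		return -1;
	}
	for (i = 0; i < MAX_LEAVES; i++) {
		if (leaves[i].conn_fd == -1) {
			snprintf(leaves[i].addr, ADDR_LEN, "%s", addr);
			leaves[i].conn_fd = tfd;
			return 0;
		}
	}
	return -2;
}
```

In this model the first connect from the new server is rejected, but it clears the stale entry as a side effect, so the next retry registers normally. If the old connection happens to be alive rather than stale, that leaf also reconnects after being dropped, which is why this policy is the more conservative of the two.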

Subhasis has produced a patch to be applied to tpp_router.c (line numbers will differ):

--- tpp_router.orig.c 2017-01-05 22:10:02.780722549 +0100
+++ tpp_router.c 2017-01-05 22:10:02.756721991 +0100
@@ -1167,8 +1167,9 @@
      * so close this incoming connection
      */
     snprintf(tpp_get_logbuf(), TPP_LOGBUF_SZ, "tfd=%d, pbs_comm %s is still connected while "
-        "another connect arrived, dropping", tfd, r->router_name);
+        "another connect arrived, dropping original connection %d", tfd, r->router_name, r->conn_fd);
     tpp_log_func(LOG_CRIT, NULL, tpp_get_logbuf());
+    tpp_transport_close(r->conn_fd);
     tpp_unlock(&router_lock);
     return -1;
     }
@@ -1281,9 +1282,10 @@
      * so close this incoming connection
      */
     snprintf(tpp_get_logbuf(), TPP_LOGBUF_SZ, "tfd=%d, Leaf %s still connected while "
-        "another leaf connect arrived, dropping",
-        tfd, tpp_netaddr(&l->leaf_addrs[0]));
+        "another leaf connect arrived, dropping original connection %d",
+        tfd, tpp_netaddr(&l->leaf_addrs[0]), l->conn_fd);
     tpp_log_func(LOG_CRIT, NULL, tpp_get_logbuf());
+    tpp_transport_close(l->conn_fd);
     tpp_unlock(&router_lock);
     return -1;
     }

This patch has been verified to avoid the problem: the first connection attempt from the new server is still rejected, but it closes the old TPP transport fd, and after 2 seconds the server reconnects successfully (usually reusing the fd that was just freed).

Acceptance Criteria

None

Status

Assignee

Krishnan Dilip

Reporter

Scott Campbell

Severity

None

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Components

Fix versions

Affects versions

Priority

Critical