Jobs always stay in the queue, weird errors in server logs

Description

Hi!

I installed an ohpc cluster and everything works fine except PBS. I followed the manual and the setup appears fine: the nodes got added and are free and reservable, but submitted jobs look like this:

Job: 1.lms-matmeca

09/23/2018 17:57:08 S enqueuing into workq, state 1 hop 1
09/23/2018 17:57:08 S Job Queued at request of kpetrov@lms-matmeca, owner =
kpetrov@lms-matmeca, job name = myjob, queue = workq

and they stay queued forever.
The interesting part is in server_log:

[root@lms-matmeca centos7.5]# tail /var/spool/pbs/server_logs/20180923
09/23/2018 17:57:08;0040;Server@lms-matmeca;Svr;lms-matmeca;Scheduler sent command 1
09/23/2018 17:57:08;0040;Server@lms-matmeca;Svr;lms-matmeca;Scheduler sent command 0
09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;?;Req Header bad, errno 104, dis error 7
09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;req_reject;Reject reply code=15056, aux=0, type=0, from @lms-matmeca
09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;?;Req Header bad, errno 104, dis error 7
09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;req_reject;Reject reply code=15056, aux=0, type=0, from @lms-matmeca
09/23/2018 17:57:17;0100;Server@lms-matmeca;Req;;Type 0 request received from kpetrov@lms-matmeca, sock=15

and then nothing.
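
To make it easier to pick the rejects out of a longer log, here is a minimal sketch that splits lines of the shape shown above into fields. It assumes each line has six semicolon-separated fields (timestamp, hexadecimal event code, daemon name, object type, object name, message), which matches the excerpts in this report. The names `PbsLogRecord`, `parse_pbs_log_line`, and `rejected_requests` are made up for illustration and are not part of PBS.

```python
from typing import Iterable, List, NamedTuple, Optional


class PbsLogRecord(NamedTuple):
    """One log line, split into the six fields seen in the excerpts above."""
    timestamp: str
    event_code: int      # parsed from the hex field, e.g. "0080" -> 128
    daemon: str
    object_type: str
    object_name: str
    message: str


def parse_pbs_log_line(line: str) -> Optional[PbsLogRecord]:
    """Return a record, or None if the line does not have the assumed shape."""
    # Split at most 5 times so semicolons inside the message are preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        code = int(parts[1], 16)
    except ValueError:
        return None
    return PbsLogRecord(parts[0], code, parts[2], parts[3], parts[4], parts[5])


def rejected_requests(lines: Iterable[str]) -> List[PbsLogRecord]:
    """Keep only the records whose object name is req_reject."""
    records = (parse_pbs_log_line(line) for line in lines)
    return [r for r in records if r is not None and r.object_name == "req_reject"]


if __name__ == "__main__":
    sample = [
        "09/23/2018 17:57:08;0040;Server@lms-matmeca;Svr;lms-matmeca;Scheduler sent command 1",
        "09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;?;Req Header bad, errno 104, dis error 7",
        "09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;req_reject;Reject reply code=15056, aux=0, type=0, from @lms-matmeca",
    ]
    for record in rejected_requests(sample):
        print(record.timestamp, record.message)
```

This only filters and prints; it does not interpret the reply code or the errno.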

comm_log has nothing of interest, only things like:
[root@lms-matmeca centos7.5]# tail /var/spool/pbs/server_logs/20180923
09/23/2018 17:57:08;0040;Server@lms-matmeca;Svr;lms-matmeca;Scheduler sent command 1
09/23/2018 17:57:08;0040;Server@lms-matmeca;Svr;lms-matmeca;Scheduler sent command 0
09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;?;Req Header bad, errno 104, dis error 7
09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;req_reject;Reject reply code=15056, aux=0, type=0, from @lms-matmeca
09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;?;Req Header bad, errno 104, dis error 7
09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;req_reject;Reject reply code=15056, aux=0, type=0, from @lms-matmeca
09/23/2018 17:57:17;0100;Server@lms-matmeca;Req;;Type 0 request received from kpetrov@lms-matmeca, sock=15
09/23/2018 17:57:17;0100;Server@lms-matmeca;Req;;Type 49 request received from kpetrov@lms-matmeca, sock=16
09/23/2018 17:57:17;0100;Server@lms-matmeca;Req;;Type 21 request received from kpetrov@lms-matmeca, sock=15
09/23/2018 17:57:17;0100;Server@lms-matmeca;Req;;Type 19 request received from kpetrov@lms-matmeca, sock=15

Now, on node01, I get this from the MoM:

09/23/2018 17:42:41;0080;pbs_mom;Hook;PBS_power.HK;copy hook-related file request received
09/23/2018 17:51:22;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 10.0.0.254:15001 on stream 1
09/23/2018 17:51:22;0002;pbs_mom;Svr;im_eof;Server closed connection.
09/23/2018 17:51:23;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm lms-matmeca:17001 down
09/23/2018 17:51:23;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net down handler called
09/23/2018 17:52:23;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 10.0.0.1:15003 to pbs_comm
09/23/2018 17:52:23;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm lms-matmeca:17001
09/23/2018 17:52:23;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
09/23/2018 17:52:23;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at lms-matmeca:15001
09/23/2018 17:52:23;0d80;pbs_mom;TPP;pbs_mom(Thread 0);sd 2, Received noroute to dest 10.0.0.254:15001, msg="tfd=15, pbs_comm:10.0.0.254:17001: Dest not found"

even though the server answers pings from the node:
[root@node01 ~]# ping lms-matmeca
PING lms-matmeca (10.0.0.254) 56(84) bytes of data.

I am not an expert on PBS, so forgive me if this is silly. My hosts file is:
10.0.0.254 lms-matmeca master lms-matmeca.cluster.intern
followed by 52
10.0.0.1 node01 node01.cluster.intern

The situation is the same in 14 and 18.

Acceptance Criteria

None

Status

Assignee

Unassigned

Reporter

Konstantin Petrov

Severity

None

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Time tracking

0m

Components

Affects versions

Priority

Low