PBS Pro / PP-1304

jobs always stay in the queue, weird errors in server logs

    Details

    • Type: Bug
    • Status: Open
    • Priority: Low
    • Resolution: Unresolved
    • Affects versions: 14.1.2, 18.1.2
    • Fix versions: None
    • Components: Other PBS
    • Labels:
      None
    • Environment:

      ohpc centos 7.5

    • Sprint:
    • Story Points:
      1

      Description

      Hi!

      I installed an ohpc cluster and everything works fine except PBS. I followed the manual and everything appears fine: the nodes were added, are free and reservable. But submitted jobs stay like this:

      Job: 1.lms-matmeca

      09/23/2018 17:57:08 S enqueuing into workq, state 1 hop 1
      09/23/2018 17:57:08 S Job Queued at request of kpetrov@lms-matmeca, owner =
      kpetrov@lms-matmeca, job name = myjob, queue = workq

      forever.
      The interesting part is in the server log:

      [root@lms-matmeca centos7.5]# tail /var/spool/pbs/server_logs/20180923
      09/23/2018 17:57:08;0040;Server@lms-matmeca;Svr;lms-matmeca;Scheduler sent command 1
      09/23/2018 17:57:08;0040;Server@lms-matmeca;Svr;lms-matmeca;Scheduler sent command 0
      09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;?;Req Header bad, errno 104, dis error 7
      09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;req_reject;Reject reply code=15056, aux=0, type=0, from @lms-matmeca
      09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;?;Req Header bad, errno 104, dis error 7
      09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;req_reject;Reject reply code=15056, aux=0, type=0, from @lms-matmeca
      09/23/2018 17:57:17;0100;Server@lms-matmeca;Req;;Type 0 request received from kpetrov@lms-matmeca, sock=15

      and then nothing.
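To make excerpts like the one above easier to filter, here is a minimal sketch that splits each line on `;` into six fields and picks out the "Req Header bad" entries. The field names (`source`, `obj_type`, and so on) are labels chosen for this illustration, and `parse_pbs_log_line` is a hypothetical helper, not part of PBS.

```python
import re
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: str
    event_code: int  # written as hexadecimal in the log
    source: str      # e.g. "Server@lms-matmeca" (label is an assumption)
    obj_type: str    # e.g. "Svr", "Req"
    obj_name: str
    message: str


def parse_pbs_log_line(line: str) -> Optional[LogRecord]:
    """Split one log line into six ';'-separated fields, or return None."""
    parts = line.strip().split(";", 5)  # the message itself may contain ';'
    if len(parts) != 6:
        return None
    timestamp, code, source, obj_type, obj_name, message = parts
    try:
        event_code = int(code, 16)
    except ValueError:
        return None
    return LogRecord(timestamp, event_code, source, obj_type, obj_name, message)


ERRNO_RE = re.compile(r"errno (\d+)")

# Lines copied from the server log excerpt in this report.
SAMPLE_LOG = """\
09/23/2018 17:57:08;0040;Server@lms-matmeca;Svr;lms-matmeca;Scheduler sent command 1
09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;?;Req Header bad, errno 104, dis error 7
09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;req_reject;Reject reply code=15056, aux=0, type=0, from @lms-matmeca
09/23/2018 17:57:17;0100;Server@lms-matmeca;Req;;Type 0 request received from kpetrov@lms-matmeca, sock=15
"""

records = [r for r in map(parse_pbs_log_line, SAMPLE_LOG.splitlines()) if r]
bad_headers = [r for r in records if "Req Header bad" in r.message]

# Collect the distinct errno values mentioned in the bad-header lines.
errnos = sorted(
    {int(m.group(1)) for r in bad_headers for m in [ERRNO_RE.search(r.message)] if m}
)

for r in bad_headers:
    print(r.timestamp, hex(r.event_code), r.message)
print("errno values seen:", errnos)
```

On Linux, errno 104 is ECONNRESET (connection reset by peer), so these entries suggest the peer closed the connection before the server finished reading the request header.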

      comm_log has nothing notable. The server log then shows things like:
      [root@lms-matmeca centos7.5]# tail /var/spool/pbs/server_logs/20180923
      09/23/2018 17:57:08;0040;Server@lms-matmeca;Svr;lms-matmeca;Scheduler sent command 1
      09/23/2018 17:57:08;0040;Server@lms-matmeca;Svr;lms-matmeca;Scheduler sent command 0
      09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;?;Req Header bad, errno 104, dis error 7
      09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;req_reject;Reject reply code=15056, aux=0, type=0, from @lms-matmeca
      09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;?;Req Header bad, errno 104, dis error 7
      09/23/2018 17:57:08;0080;Server@lms-matmeca;Req;req_reject;Reject reply code=15056, aux=0, type=0, from @lms-matmeca
      09/23/2018 17:57:17;0100;Server@lms-matmeca;Req;;Type 0 request received from kpetrov@lms-matmeca, sock=15
      09/23/2018 17:57:17;0100;Server@lms-matmeca;Req;;Type 49 request received from kpetrov@lms-matmeca, sock=16
      09/23/2018 17:57:17;0100;Server@lms-matmeca;Req;;Type 21 request received from kpetrov@lms-matmeca, sock=15
      09/23/2018 17:57:17;0100;Server@lms-matmeca;Req;;Type 19 request received from kpetrov@lms-matmeca, sock=15

      On node01, the MoM log shows:

      09/23/2018 17:42:41;0080;pbs_mom;Hook;PBS_power.HK;copy hook-related file request received
      09/23/2018 17:51:22;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 10.0.0.254:15001 on stream 1
      09/23/2018 17:51:22;0002;pbs_mom;Svr;im_eof;Server closed connection.
      09/23/2018 17:51:23;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm lms-matmeca:17001 down
      09/23/2018 17:51:23;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net down handler called
      09/23/2018 17:52:23;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 10.0.0.1:15003 to pbs_comm
      09/23/2018 17:52:23;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm lms-matmeca:17001
      09/23/2018 17:52:23;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
      09/23/2018 17:52:23;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at lms-matmeca:15001
      09/23/2018 17:52:23;0d80;pbs_mom;TPP;pbs_mom(Thread 0);sd 2, Received noroute to dest 10.0.0.254:15001, msg="tfd=15, pbs_comm:10.0.0.254:17001: Dest not found"
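The MoM excerpt above mentions several address:port pairs. As a rough sketch, the snippet below pulls them out of each message so it is easier to see which endpoint is reported as unreachable. `endpoints_by_line` is a hypothetical helper written for this illustration, and the regular expression only covers dotted IPv4 addresses.

```python
import re

# Matches dotted IPv4 address followed by ":port".
ENDPOINT_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")

# Lines copied from the MoM log excerpt in this report.
MOM_LOG = """\
09/23/2018 17:51:22;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 10.0.0.254:15001 on stream 1
09/23/2018 17:52:23;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 10.0.0.1:15003 to pbs_comm
09/23/2018 17:52:23;0d80;pbs_mom;TPP;pbs_mom(Thread 0);sd 2, Received noroute to dest 10.0.0.254:15001, msg="tfd=15, pbs_comm:10.0.0.254:17001: Dest not found"
"""


def endpoints_by_line(log_text):
    """Return (message, [(ip, port), ...]) for each line that names an endpoint."""
    result = []
    for line in log_text.splitlines():
        message = line.split(";", 5)[-1]  # last field is the free-text message
        found = [(ip, int(port)) for ip, port in ENDPOINT_RE.findall(message)]
        if found:
            result.append((message, found))
    return result


parsed = endpoints_by_line(MOM_LOG)

# Endpoints mentioned on "noroute" lines: the destination and the reporting pbs_comm.
noroute = [eps for msg, eps in parsed if "noroute" in msg]

for msg, eps in parsed:
    print(eps, "<-", msg[:60])
```

In this excerpt the "noroute" line names 10.0.0.254:15001 as the destination and 10.0.0.254:17001 as the pbs_comm that reported "Dest not found", while the MoM itself registered as 10.0.0.1:15003.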

      while ping to the server works:
      [root@node01 ~]# ping lms-matmeca
      PING lms-matmeca (10.0.0.254) 56(84) bytes of data.

      I am not an expert on PBS, so forgive me if this is silly. My hosts file is:
      10.0.0.254 lms-matmeca master lms-matmeca.cluster.intern
      followed by 52
      10.0.0.1 node01 node01.cluster.intern

      The situation is the same in versions 14 and 18.

            People

            • Assignee:
              Unassigned
              Reporter:
              const.petrov Konstantin Petrov