External interface design for subjobs surviving server restarts.

Overview:

Currently, when the server restarts, job arrays that have running subjobs are terminated because subjob state is stored only in memory. With this RFE, running subjobs continue to run after a server restart. The RFE also stores the information that is unique to each subjob, such as run_count, resources_used, and comment, so that qstat on a finished subjob no longer returns just the parent's information.

Interface Design:

  • Interface 1:
    • Change control: Stable
    • Synopsis: Status of running subjobs of an array job persists across pbs_server restarts
    • Details:
      • Currently, when a server restarts (gracefully or abruptly), any running subjobs of an array job are killed and requeued, and they start from the beginning because the whole parent array job is requeued.
      • With this RFE, subjob and array job status is persistent across server restarts, so any running subjobs of an array job continue to run when the server is restarted.
      • This is achieved by putting a running subjob on par with a regular job: each subjob's job object and its attributes are stored in the job and job_attr tables of the PBS database and are recovered during the subsequent server start (pbsd_init()).
      • The "pbs.subjob_track" table is removed from the PBS database schema.
      • The job comment field for a subjob is updated more accurately, as listed below:
        • When Job History is enabled
          • For running subjob: "Job run at <date> at <time> on <exec_vnode>"
          • For finished subjob: "Job run at <date> at <time> on <exec_vnode> and <finished | failed | terminated>"
          • For MoM rejected subjob: "Not Running: PBS Error: Execution server rejected request"
        • When Job History is disabled
          • For running subjob: "Job run at <date> at <time> on <exec_vnode>"
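
As a rough illustration of the comment formats listed above, the sketch below assembles the strings in Python. The function name `subjob_comment`, its parameters, and the constant names are hypothetical and exist only for this example; the server builds these comments internally, and the exact date and time format is not specified here.

```python
from typing import Optional

# Outcomes named in the finished-subjob comment template above.
OUTCOMES = ("finished", "failed", "terminated")

# Comment used when MoM rejects the subjob (text taken from the list above).
MOM_REJECT_COMMENT = "Not Running: PBS Error: Execution server rejected request"


def subjob_comment(date: str, time: str, exec_vnode: str,
                   outcome: Optional[str] = None) -> str:
    """Build a subjob comment (hypothetical helper, not PBS source code).

    outcome=None models a running subjob; otherwise it must be one of OUTCOMES.
    """
    base = f"Job run at {date} at {time} on {exec_vnode}"
    if outcome is None:
        return base
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return f"{base} and {outcome}"
```

For example, `subjob_comment("<date>", "<time>", "<exec_vnode>", "failed")` yields the finished-subjob form ending in "and failed", while omitting `outcome` yields the running-subjob form.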


  • Interface 2:   < deleted >


  • Interface 3:
    • Change control: Stable
    • Synopsis: Impact on running subjobs when terminating the PBS server using "qterm -t quick"
    • Details: 
      • When the PBS server is terminated with "qterm -t quick" (i.e., the type of shutdown requested is "quick"), running subjobs are not requeued and are left running after the server shuts down.


  • Interface 4:
    • Change control: Stable
    • Synopsis: Content of "qstat -xtf" output with respect to subjobs
    • Details: 
      • The values that "qstat -xtf" displays for each attribute and resource of a finished subjob are now obtained from the corresponding subjob's job object instead of being copied from the parent array job object.
      • Hence the output shows the information that is unique to each subjob, such as run_count, resources_used, and comment, rather than just the parent's information once the subjob is finished.
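
One way to check that per-subjob values now differ is to parse the full-format status output and compare attributes across subjobs. The Python sketch below is a minimal parser written under assumptions: it expects each job to begin with a "Job Id:" line followed by indented "name = value" lines, with wrapped values continued on lines that start with a tab. The sample text is hand-written for illustration and is not captured from a real server; `parse_qstat_full` is a hypothetical helper name.

```python
def parse_qstat_full(text: str) -> dict:
    """Parse 'Job Id:' blocks of 'name = value' lines into nested dicts."""
    jobs = {}
    current = None   # attribute dict of the job being parsed
    last_key = None  # most recent attribute, for wrapped values
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            job_id = line.split(":", 1)[1].strip()
            current = jobs.setdefault(job_id, {})
            last_key = None
        elif current is not None and line.startswith("\t") and last_key:
            # Assumed wrap convention: continuation lines begin with a tab.
            current[last_key] += line.strip()
        elif current is not None and " = " in line:
            key, value = line.strip().split(" = ", 1)
            current[key] = value
            last_key = key
    return jobs


# Hand-written sample, not real server output.
SAMPLE = """Job Id: 10[1].svr
    job_state = F
    run_count = 1
    Exit_status = 0
Job Id: 10[2].svr
    job_state = F
    run_count = 2
    Exit_status = 1
"""

subjobs = parse_qstat_full(SAMPLE)
```

With the sample above, `subjobs["10[1].svr"]["run_count"]` and `subjobs["10[2].svr"]["run_count"]` hold different values, which is the per-subjob distinction this interface is meant to expose.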


  • Interface 5:   < deleted >


  • Interface 6:   < deleted >


P.S.: Administrators should take note of an inconsistency caused by a limitation originating from Interface 1.

As stated in Interface 1, we are moving towards making subjobs equivalent to regular jobs.

Specifically, this means that, like a regular job, when a subjob finishes, the information about the subjob (exit status, etc.) can be seen only as long as the subjob is in job history.

If a subjob finishes and is no longer in job history (or history is disabled), then information specific to that subjob is no longer available.

In that case, a stat on such a subjob (no longer in history) wrongly shows default values for the subjob (such as state = finished, exit_status = 0).

This leads to the following inconsistencies for a running array job when the server is restarted:

  1. A failed subjob is shown as finished with Exit_status = 0, and the job comment for that subjob changes from "Subjob failed" to "Subjob finished".
  2. A terminated subjob (deleted using qdel) is shown as finished with Exit_status = 0, and the job comment for that subjob changes from "Subjob terminated" to "Subjob finished".
  3. In both of the above situations, the exit status and the job comment of the finished array job can be wrong.
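
The fallback described above can be modeled with a small Python sketch. This is a toy model under stated assumptions, not the server's implementation: `SubjobStatus`, `DEFAULT_STATUS`, and `stat_subjob` are hypothetical names, and the `history` dict stands in for whatever per-subjob records the server still holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubjobStatus:
    state: str
    exit_status: int
    comment: str


# Defaults reported when no per-subjob record is available,
# matching the values described in the text above.
DEFAULT_STATUS = SubjobStatus("finished", 0, "Subjob finished")


def stat_subjob(history: dict, subjob_id: str) -> SubjobStatus:
    """Return the subjob's own record if it exists, else the defaults."""
    return history.get(subjob_id, DEFAULT_STATUS)


# A failed subjob whose record is still available reports its real values.
history = {"5[1]": SubjobStatus("finished", 1, "Subjob failed")}
before = stat_subjob(history, "5[1]")

# Once the record is gone, the same query falls back to the defaults.
history.clear()
after = stat_subjob(history, "5[1]")
```

Here `before` carries exit status 1 and the comment "Subjob failed", whereas `after` reports exit status 0 and "Subjob finished", which is the inconsistency listed in items 1 and 2.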


