Overview:

Currently, on server restart, job arrays that have running subjobs are terminated because the subjobs are stored only in memory. With this RFE, the behavior is changed so that running subjobs continue to run after a server restart. The RFE also enables storing the information that is unique to each subjob, such as run_count, resources_used, and comments, so that a qstat of a subjob no longer returns just the parent's information once the job is finished.

Interface Design:







P.S.: Admins should take note of an inconsistency caused by a limitation originating from interface 1.

As stated in interface 1, we are progressing towards making subjobs equivalent to regular jobs.

Specifically, this means that, as with a regular job, once a subjob finishes, the information about the subjob (exit status, etc.) can be seen only as long as the subjob is in job history.

If a subjob finishes and is no longer in job history (or history is disabled), then information specific to that subjob is no longer available.

In that case, a stat on such a subjob (one that is no longer in history) will wrongly show default values for the subjob (such as state = finished, exit_status = 0, etc.).

This leads to the following inconsistencies for a running array job when the server is restarted:

  1. A failed subjob will be shown as finished with Exit_status = 0, and the job comment for that subjob will change from "Subjob failed" to "Subjob finished".
  2. A terminated subjob (deleted using qdel) will be shown as finished with Exit_status = 0, and the job comment for that subjob will change from "Subjob terminated" to "Subjob finished".
  3. In both of the above situations, the exit status and the job comment of the finished array job can be wrong.
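
The fallback to default values can be illustrated with a small model. This is only a sketch of the behavior described here, not the server's implementation: the names SubjobRecord, DEFAULT_RECORD, and stat_subjob, as well as the subjob ID "123[1]", are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SubjobRecord:
    """Hypothetical per-subjob information kept while the subjob is in history."""
    state: str
    exit_status: int
    comment: str


# Assumed defaults reported when no subjob-specific record is available.
DEFAULT_RECORD = SubjobRecord(state="finished", exit_status=0,
                              comment="Subjob finished")


def stat_subjob(history: Dict[str, SubjobRecord], subjob_id: str) -> SubjobRecord:
    """Return the stored record, or the defaults if the subjob left history."""
    return history.get(subjob_id, DEFAULT_RECORD)


# A subjob that failed, while it is still in job history.
history = {"123[1]": SubjobRecord("finished", 1, "Subjob failed")}
before = stat_subjob(history, "123[1]")

# Model the subjob dropping out of history (or history being disabled).
history.clear()
after = stat_subjob(history, "123[1]")

print(before.exit_status, before.comment)
print(after.exit_status, after.comment)
```

In this model the same failed subjob reports its real exit status while it is in history and the default values afterwards, which is the inconsistency listed in items 1 and 2.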


