Analysis for PP-465

Issue
qrerun reports a timeout when large job files have to be transferred, even though the server continues requeuing the job in the background.

Analysis
Below are the relevant events that happen when a job is requested to be rerun.
  1. A rerun request is sent from the client.
  2. The server performs the necessary checks (req_rerunjob2) and issues SIG_RERUN (issue_signal) to the MoM, the batch request being PBS_BATCH_SignalJob.
  3. The server sets post_rerun as the callback function and the job substate to JOB_SUBSTATE_RERUN.
  4. The MoM, on receiving this signal, kills the job and sends the job obit.
  5. The server, on receiving the job obit, creates a work task with on_job_rerun as its function and a timeout task with timeout_rerun_request as its function.
  6. on_job_rerun sends a rerun batch request to the MoM.
  7. The MoM, on receiving the rerun request, prepares the job files to be sent to the server through the batch request PBS_BATCH_MvJobFile.
  8. The MoM sends the job files in blocks of 4KB.
  9. The server spools the files and then requeues the job.
  10. These files are sent back to the MoM when the job is next scheduled to run.
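The block-wise transfer in step 8 can be sketched as follows. This is an illustrative Python model, not PBS code; send_block stands in for the PBS_BATCH_MvJobFile request that carries each block, and the 4 KB block size matches the step above.

```python
BLOCK_SIZE = 4096  # MoM streams job files in 4 KB blocks (step 8)

def send_job_file(data: bytes, send_block) -> int:
    """Stream `data` one block at a time via `send_block`.

    Returns the number of blocks sent. A large job file means many
    round trips, which is what makes the transfer slow.
    """
    blocks = 0
    for offset in range(0, len(data), BLOCK_SIZE):
        send_block(data[offset:offset + BLOCK_SIZE])
        blocks += 1
    return blocks
```

For example, a 10 000-byte file goes out as three blocks (4096 + 4096 + 1808 bytes), so transfer time grows linearly with file size.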
If the job files being moved in step 10 above are huge, they cause throttling; that issue was fixed through PP-367.
However, the transfer of huge files in step 8 may take longer than PBS_DIS_TCP_TIMEOUT_RERUN (45 seconds) or the value set in the attribute job_requeue_timeout.
When that happens, the timeout task fires and timeout_rerun_request is called, which makes qrerun report a timeout to the client.
In the background, however, the file transfer continues and the job is rerun.

So, all the timeout currently does is raise a spurious error that causes problems for the client without anything actually having gone wrong.

In the event of a network failure, the job will get requeued by node_fail_requeue.

Proposed Solution

As discussed in the forum and documented in the EDD, the server attribute job_requeue_timeout will be renamed to job_requeue_delay.