Analysis for PP-465
- A rerun request is sent from the client.
- The server performs the necessary checks (req_rerunjob2) and issues SIG_RERUN to the MoM (issue_signal); the batch request is PBS_BATCH_SignalJob.
- The server sets post_rerun as the callback function and sets the job substate to JOB_SUBSTATE_RERUN.
- On receiving this signal, the MoM kills the job and sends the job obit.
- On receiving the job obit, the server creates a task with on_job_rerun as its function and sets a timeout task with timeout_rerun_request as its function.
- on_job_rerun sends the rerun batch request to the MoM.
- On receiving the rerun request, the MoM prepares the job files to be sent to the server through the batch request PBS_BATCH_MvJobFile.
- The MoM sends the job files in blocks of 4 KB.
- Server spools the files and then requeues the job.
- These files are sent to the MoM when the job is scheduled to run.
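
The block-wise file transfer and spooling in the steps above can be pictured with a minimal sketch. This is not the PBS implementation: the names iter_blocks and spool_blocks, the sequence numbers, and the in-memory buffers are all hypothetical and chosen only for illustration. The only detail taken from the analysis is the 4 KB block size.

```python
import io

BLOCK_SIZE = 4 * 1024  # 4 KB, the block size stated in the analysis


def iter_blocks(stream, block_size=BLOCK_SIZE):
    """Yield (sequence_number, data) pairs until the stream is exhausted.

    Hypothetical sender side: stands in for the MoM reading a job file.
    """
    seq = 0
    while True:
        data = stream.read(block_size)
        if not data:
            break
        yield seq, data
        seq += 1


def spool_blocks(blocks):
    """Append blocks in arrival order and return the reassembled bytes.

    Hypothetical receiver side: stands in for the server spooling the file.
    Raises ValueError if a block arrives out of sequence.
    """
    spool = io.BytesIO()
    for expected, (seq, data) in enumerate(blocks):
        if seq != expected:
            raise ValueError(f"out-of-order block {seq}, expected {expected}")
        spool.write(data)
    return spool.getvalue()


# A 10240-byte payload splits into blocks of 4096, 4096 and 2048 bytes.
payload = bytes(range(256)) * 40
blocks = list(iter_blocks(io.BytesIO(payload)))
restored = spool_blocks(blocks)
```

The point of the sketch is that the transfer time grows with the size of the job files, since each file needs one block per 4 KB, so a large file can take a long time to spool even when nothing has failed.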
So, for now, all the timeout does is throw a spurious error that causes problems for the client without anything having really gone wrong.
In the event of a network failure, the job will get requeued by node_fail_requeue.
As discussed in the forum and documented in the EDD, the server attribute job_requeue_timeout will be renamed to job_requeue_delay.