Analysis for PP-351

Requirement:

The accounting record of type 'R' should include the resources used by the job up to the point at which it was re-queued.

Analysis:

  • The current behaviour does not record the resources used by the job up to the time it was re-queued.
  • As a result, the usage reported for the job is lower than what it actually consumed.
  • If resource usage is added to the 'R' record, the reported usage will be closer to the actual usage, although still not 100% accurate.
  • An 'R' record is logged when a job is re-queued for one of the following reasons:
    1. The node the job was running on goes down and node_fail_requeue timeout is hit.
    2. It is rerun using qrerun <job-id>.
    3. It is rerun using qrerun -Wforce <job-id>.
    4. Provisioning for a vnode fails.
    5. MoM is restarted without any options or with the '-r' option.
  • Currently, the 'R' record contains resources_used information only for items 2 and 5 above.
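
The change described above amounts to appending the job's resources_used values to the message text of the 'R' record. The following is a minimal sketch of that step, not the actual PBS code: struct res_used and append_resources_used() are hypothetical names introduced for illustration, and the " resources_used.<name>=<value>" layout is assumed to match what other accounting records already use.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one resources_used entry; not a PBS type. */
struct res_used {
	const char *name;  /* e.g. "cput" */
	const char *value; /* e.g. "00:00:10" */
};

/*
 * Append " resources_used.<name>=<value>" for each entry to the
 * NUL-terminated string already in buf.
 * Returns 0 on success. Returns -1 if buf is too small, in which case
 * buf is left holding only the entries that fit completely.
 */
int
append_resources_used(char *buf, size_t buflen,
	const struct res_used *res, size_t nres)
{
	size_t used;
	size_t i;

	assert(buf != NULL && res != NULL);

	used = strlen(buf);
	if (used >= buflen)
		return -1;

	for (i = 0; i < nres; i++) {
		int n = snprintf(buf + used, buflen - used,
			" resources_used.%s=%s", res[i].name, res[i].value);

		if (n < 0 || (size_t) n >= buflen - used) {
			buf[used] = '\0'; /* drop the partially written entry */
			return -1;
		}
		used += (size_t) n;
	}
	return 0;
}
```

In this sketch the caller would build the existing 'R' message text first and then call append_resources_used() on the same buffer before handing it to the accounting logger. Dropping a partially written entry keeps the record parseable if the buffer runs out.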

Code flow:

Case 1. Job re-queued because node_fail_requeue was triggered:

node_down_requeue() --> discard_job() --> post_discard_job() --> account_jobend()

Cases 2 and 5. Job is rerun using qrerun <job-id>, or MoM is restarted:

on_job_rerun() [job substate == JOB_SUBSTATE_RERUN3] --> account_jobend()

Case 3. Job is rerun using qrerun -Wforce <job-id>:

req_rerunjob() --> req_rerunjob2() --> force_requeue() --> account_jobend()

Case 4. Provisioning of a vnode fails:

check_and_run_jobs() --> fail_vnode_job() --> force_requeue() --> account_jobend()
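
All five cases converge on account_jobend(), reached from three distinct callers. The table below is a small sketch that summarizes this, together with which paths already report resources_used according to the analysis above. The enum, the struct and both helper functions are hypothetical and exist only to make the mapping explicit; only the caller names come from the code flow listed above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical identifiers for the five re-queue causes listed in the analysis. */
enum requeue_cause {
	RQ_NODE_FAIL = 1,  /* case 1: node_fail_requeue timeout hit */
	RQ_QRERUN,         /* case 2: qrerun <job-id> */
	RQ_QRERUN_FORCE,   /* case 3: qrerun -Wforce <job-id> */
	RQ_PROVISION_FAIL, /* case 4: provisioning of a vnode fails */
	RQ_MOM_RESTART     /* case 5: MoM restarted */
};

struct requeue_path {
	enum requeue_cause cause;
	const char *caller;      /* function that calls account_jobend() */
	bool has_resources_used; /* does the 'R' record carry usage today? */
};

static const struct requeue_path requeue_paths[] = {
	{RQ_NODE_FAIL,      "post_discard_job", false},
	{RQ_QRERUN,         "on_job_rerun",     true},
	{RQ_QRERUN_FORCE,   "force_requeue",    false},
	{RQ_PROVISION_FAIL, "force_requeue",    false},
	{RQ_MOM_RESTART,    "on_job_rerun",     true},
};

#define N_REQUEUE_PATHS (sizeof(requeue_paths) / sizeof(requeue_paths[0]))

/* Look up the path for a cause; returns NULL for an unknown cause. */
const struct requeue_path *
find_requeue_path(enum requeue_cause cause)
{
	size_t i;

	for (i = 0; i < N_REQUEUE_PATHS; i++)
		if (requeue_paths[i].cause == cause)
			return &requeue_paths[i];
	return NULL;
}

/* Count the paths whose 'R' record currently lacks resources_used. */
size_t
count_paths_missing_usage(void)
{
	size_t i;
	size_t missing = 0;

	for (i = 0; i < N_REQUEUE_PATHS; i++)
		if (!requeue_paths[i].has_resources_used)
			missing++;
	return missing;
}
```

Read this way, the callers that would need attention are post_discard_job() (case 1) and force_requeue() (cases 3 and 4), since on_job_rerun() already covers cases 2 and 5.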