Objective:
As of today, if a server_dyn_res program/script hangs or does not return, the scheduler keeps waiting for it to finish executing.
The objective of this design document is to propose a solution to this hang.
Interface 1: New Configurable Scheduler attribute: server_dyn_res_alarm
- Visibility: sched object | Operator Read | Manager Read/Write
- Change Control: Stable
- Details:
- The administrator can configure the scheduler attribute "server_dyn_res_alarm". The default value is 30 seconds.
- Usage:
qmgr -c "set sched server_dyn_res_alarm = 15"
- PBS will start polling from the time the server_dyn_res program/script starts executing and will wait up to "server_dyn_res_alarm" seconds for it to finish. When the timeout expires, the scheduler will stop interacting with the program/script and log a timeout info message.
- The timeout applies separately to each server_dyn_res program/script.
- On timeout, the value of the resource will be assumed to be "0" and the scheduling cycle will continue normally.
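The timeout behavior above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pbs_sched implementation: run_server_dyn_res and run_all are hypothetical helpers, the program string is assumed to be a command line that can be split with shlex, and the resource value is assumed to be the first line of the program's standard output. Python's subprocess.run() measures its timeout from the moment the child starts and kills the child when the timeout expires; the design itself only says that interaction with the program ends.

```python
import shlex
import subprocess

DEFAULT_ALARM = 30  # seconds: the documented default of server_dyn_res_alarm


def run_server_dyn_res(program, alarm=DEFAULT_ALARM, log=print):
    """Hypothetical helper: run one server_dyn_res program, return its value.

    If the program has not exited within `alarm` seconds of starting,
    log the timeout and fall back to "0".
    """
    argv = shlex.split(program)
    try:
        # Waits at most `alarm` seconds, measured from process start.
        result = subprocess.run(argv, capture_output=True, text=True, timeout=alarm)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising. Whether the
        # real scheduler kills the program is not specified by this design.
        log(f"program {program} timed out")  # Interface 2 message text
        log(f"{program} = 0")                 # Interface 3 message text
        return "0"
    # Assumption: the resource value is the first line of standard output.
    lines = result.stdout.splitlines()
    return lines[0].strip() if lines else ""


def run_all(programs, alarm=DEFAULT_ALARM, log=print):
    """Query every resource; each program gets its own `alarm` budget."""
    return {res: run_server_dyn_res(prog, alarm, log) for res, prog in programs.items()}
```

For example, run_all({"foo": "/bin/get_foo"}, alarm=15) would return {"foo": "0"} and log both messages if /bin/get_foo ran longer than 15 seconds, while any other resources in the mapping would still be queried with their own 15-second budget.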
Interface 2: Log message for timeout of server_dyn_res program/script
- Visibility: Scheduler log message at PBSEVENT_SCHED, PBS_EVENTCLASS_SERVER, and syslog LOG_INFO
- Change Control: Unstable
- Details:
- When the timeout is reached, a timeout info message is logged in the scheduler log, for example:
... ...;0040;pbs_sched;Svr;server_dyn_res;program /bin/get_foo timed out
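As a sketch of how the timeout line is laid out, the helpers below rebuild everything after the elided timestamp fields of the sample line. format_log_record and timeout_message are hypothetical; the four-digit hex event-type field (0040 here, 0080 in the Interface 3 sample) and the pbs_sched;Svr;server_dyn_res fields are copied from the sample lines rather than taken from the PBS logging API.

```python
# Event-type values as they appear (four hex digits) in the sample lines.
PBSEVENT_SCHED = 0x0040
PBSEVENT_DEBUG = 0x0080


def format_log_record(event_type, obj_name, msg, daemon="pbs_sched", obj_class="Svr"):
    """Hypothetical helper: build the part of a log line after the timestamp."""
    return f"{event_type:04x};{daemon};{obj_class};{obj_name};{msg}"


def timeout_message(program):
    """Interface 2: the info message logged when a program times out."""
    return format_log_record(PBSEVENT_SCHED, "server_dyn_res", f"program {program} timed out")
```

timeout_message("/bin/get_foo") returns "0040;pbs_sched;Svr;server_dyn_res;program /bin/get_foo timed out", which matches the sample line after its timestamp fields.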
Interface 3: Log message for value of server_dyn_res on timeout
- Visibility: Scheduler log message at PBSEVENT_DEBUG, PBS_EVENTCLASS_SERVER, and syslog LOG_DEBUG
- Change Control: Unstable
- Details:
- On timeout, a debug message is logged in the scheduler log recording that the resource value is assumed to be "0", for example:
... ...;0080;pbs_sched;Svr;server_dyn_res;/bin/get_foo = 0
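Because the debug line records only the fallback value, an administrator may want to pair it with the matching timeout line. The hypothetical find_timeouts below scans scheduler log lines for both sample messages and returns each timed-out program together with the value logged for it. The regular expressions assume the exact message text shown in Interfaces 2 and 3, which is marked Unstable and may change.

```python
import re

# Message text from the sample lines of Interfaces 2 and 3; the leading
# timestamp fields are ignored. Both interfaces are Unstable, so these
# patterns are assumptions that may need updating.
TIMED_OUT = re.compile(r";0040;pbs_sched;Svr;server_dyn_res;program (.+) timed out$")
ASSUMED = re.compile(r";0080;pbs_sched;Svr;server_dyn_res;(.+) = (\S+)$")


def find_timeouts(lines):
    """Hypothetical helper: map each timed-out program to its logged value."""
    timed_out = set()
    values = {}
    for line in lines:
        line = line.rstrip("\n")
        match = TIMED_OUT.search(line)
        if match:
            timed_out.add(match.group(1))
            continue
        match = ASSUMED.search(line)
        if match:
            values[match.group(1)] = match.group(2)
    # Keep only values whose program also has a timeout message.
    return {prog: values[prog] for prog in timed_out if prog in values}
```

Fed the two sample lines (for instance from an open scheduler log file, which yields lines), it returns {"/bin/get_foo": "0"}; a value line without a matching timeout line is ignored.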