This is a design proposal to configure PBS so that it releases only a limited set of resources (as specified by the admin) when a job is suspended.

PBS in its current form releases all the consumable resources requested by a job when the job is suspended. In reality, when the system is out of swap space, a suspended job's processes hold on to the memory they have consumed and release only ncpus (because the kernel stops the processes). In some cases the admin might have configured an alternate suspend signal that makes the job release a few resources (such as licenses) upon suspension. Therefore, it would be better if PBS gave admins a way to specify which resources can be released from a job upon suspension.

Link to forum discussion.

Interface 1: New server attribute to specify which resources can be released.

  • Visibility: Public
  • Change Control: Stable
  • Details:
    • A new server attribute “restrict_res_to_release_on_suspend” is a comma-separated list of resource names that can be released when a job is suspended. The resources that get released on suspension will be restricted to the resources listed in "restrict_res_to_release_on_suspend".
    • This server attribute is of type “string_array” and can only be set by a manager. Python type of this attribute is string.
    • If a manager tries to set the attribute to a list containing a resource that does not exist, the qmgr command will print the following error on the console:

# qmgr -c "s s restrict_res_to_release_on_suspend = 'ncpus, abcd'"

   qmgr obj=abcd svr=default: Unknown resource

   qmgr: Error (15035) returned from server

    • By default, this attribute is unset, and after suspending a job PBS will release all the consumable resources requested by the job.
    • If this attribute is unset after having been set, PBS will go back to the default behavior of releasing all the consumable resources upon job suspension.
    • A PBS manager can also add/remove resources to/from the "restrict_res_to_release_on_suspend" attribute by using the "+="/"-=" operators.
    • The resources specified in this new server attribute will be released (provided the job has requested them) every time a job is suspended (by preemption or qsig).
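The selection rule described in the bullets above can be sketched as follows. This is only an illustration of the intended semantics, not PBS code: the function names, the dictionary representation of a job's request, and the `consumable` set are all assumptions made for the example.

```python
def resources_to_release(requested, consumable, restrict_list=None):
    """Pick the resources a suspended job gives back (illustrative only).

    requested     -- dict of resource name -> amount requested by the job
    consumable    -- set of resource names that are consumable (assumed input)
    restrict_list -- names from restrict_res_to_release_on_suspend,
                     or None/empty when the attribute is unset
    """
    if not restrict_list:
        # Attribute unset: default behavior, release every consumable resource.
        return {name: amt for name, amt in requested.items() if name in consumable}
    allowed = set(restrict_list)
    # Attribute set: release only listed resources that the job requested.
    return {name: amt for name, amt in requested.items() if name in allowed}


def apply_list_operator(current, op, names):
    """Mimic the "+=" and "-=" operators on the attribute's list of names."""
    if op == "+=":
        return current + [n for n in names if n not in current]
    if op == "-=":
        return [n for n in current if n not in names]
    raise ValueError("unsupported operator: " + op)


job = {"ncpus": 4, "mem": "8gb", "license": 1, "arch": "linux"}
consumable = {"ncpus", "mem", "license"}

release_all = resources_to_release(job, consumable)
release_some = resources_to_release(job, consumable, ["ncpus", "license"])
shrunk = apply_list_operator(["ncpus", "license"], "-=", ["license"])
```

With the attribute unset, the sketch releases ncpus, mem and license; with the list set to ncpus and license, mem stays with the suspended job.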

...

Interface 2: New Job attribute “resources_released”

  • Visibility: Public
  • Change Control: Stable
  • Details:

A new job attribute “resources_released” is added.

This attribute is of type string and is read-only for users, operators and managers. It is set internally by the server when a job is suspended. Python type of this attribute is string.

It stores a string that describes the amount of each resource released on each node that the job was running on (provided these resources are also part of the “restrict_res_to_release_on_suspend” string). The format of the string is similar to that of exec_vnode.
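As a rough illustration of an exec_vnode-like layout, the sketch below assembles such a string from per-vnode amounts. The vnode names, the amounts, and the helper name `format_resources_released` are hypothetical; the exact string PBS produces may differ.

```python
def format_resources_released(per_vnode):
    """Build an exec_vnode-style string, e.g. "(vn1:ncpus=2)+(vn2:ncpus=4)".

    per_vnode -- list of (vnode_name, {resource: amount}) pairs, in the
                 order the vnodes appear in the job's exec_vnode
    """
    chunks = []
    for vnode, released in per_vnode:
        # Each chunk is the vnode name followed by resource=amount pairs.
        fields = [vnode] + ["%s=%s" % (name, amt) for name, amt in released.items()]
        chunks.append("(" + ":".join(fields) + ")")
    return "+".join(chunks)


# Hypothetical job spread over two vnodes.
example = format_resources_released([
    ("vn1", {"ncpus": 2, "license": 1}),
    ("vn2", {"ncpus": 4, "license": 1}),
])
```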

...

•This job attribute is populated at the time of job suspension only if the “restrict_res_to_release_on_suspend” server attribute is set and has a list of legitimate resources to be released.

This attribute is set by the server whenever it suspends a job. Scheduler will populate this job attribute by sending a ModifyJob batch request to server.

...

Interface 3: New Job attribute “resource_released_list”

  • Visibility: Public
  • Change Control: Stable
  • Details:

A new job attribute “resource_released_list” is added.

This attribute is of type “resource_list” and is read-only for users, operators and managers. It is set internally by the server when a job is suspended. Python type of this attribute is pbs_resource.

It stores the cumulative value of all the consumable resources requested by the job (provided these resources are also part of the “restrict_res_to_release_on_suspend” string).

Using the example in Interface 2: qstat -f 1 | grep resource_released_list

         resource_released_list.license = 2

         resource_released_list.ncpus = 6

•This job attribute is populated only if the “restrict_res_to_release_on_suspend” server attribute is set and has a list of legitimate resources to be released.
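The relationship between the per-node “resources_released” string and the cumulative “resource_released_list” values can be sketched as a simple sum over chunks. The input string below is a hypothetical one chosen to add up to the license = 2 and ncpus = 6 totals shown above; the parser name is an assumption and it handles only integer amounts (no size suffixes such as gb).

```python
from collections import defaultdict


def sum_released(resources_released):
    """Sum integer resource amounts across all chunks of an
    exec_vnode-style string such as "(vn1:ncpus=2)+(vn2:ncpus=4)"."""
    totals = defaultdict(int)
    for chunk in resources_released.split("+"):
        fields = chunk.strip("()").split(":")
        # fields[0] is the vnode name; the rest are resource=amount pairs.
        for field in fields[1:]:
            name, _, amount = field.partition("=")
            totals[name] += int(amount)
    return dict(totals)


# Hypothetical per-node string, not taken from the proposal.
totals = sum_released("(vn1:ncpus=2:license=1)+(vn2:ncpus=4:license=1)")
```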

...

Interface 4: New server log message

  • Visibility: Public
  • Change Control: Experimental
  • Details:
    • If the server is unable to populate the “resources_released” job attribute while suspending a job, it will log the following message at the LOG_INFO log level, with event type PBSEVENT_JOB.

Unable to create resource released list


Interface 5: New

...

  • Visibility: Public
  • Change Control: Stable
  • Details:
    • If the scheduler is unable to populate the “resources_released” job attribute while suspending a job, it will log the following message at the LOG_INFO log level.

...

Interface 6: New error message while deleting a custom resource

  • Visibility: Public
  • Change Control: Stable
  • Details:
    • If an admin tries to delete a custom resource that is part of the restrict_res_to_release_on_suspend server attribute, the qmgr command will fail with a “resource busy” error code.

...

   qmgr obj=res1 svr=default: Resource busy on server

   qmgr: Error (15174) returned from server
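The deletion check can be pictured with the small sketch below. It is a simplified model under stated assumptions, not server code: the exception class, the function name, and the in-memory sets are hypothetical, and only the error number 15174 is taken from the qmgr output above.

```python
class ResourceBusyError(Exception):
    """Stand-in for the "resource busy" failure reported by qmgr."""
    code = 15174  # error number shown in the qmgr output above


def delete_custom_resource(name, defined_resources, restrict_list):
    """Refuse to delete a resource that is still listed in
    restrict_res_to_release_on_suspend; otherwise remove it."""
    if name in restrict_list:
        raise ResourceBusyError("%s: Resource busy on server" % name)
    defined_resources.discard(name)


defined = {"res1", "res2"}
restrict = ["ncpus", "res1"]

try:
    delete_custom_resource("res1", defined, restrict)
    outcome = "deleted"
except ResourceBusyError as err:
    outcome = "error %d" % err.code

delete_custom_resource("res2", defined, restrict)  # not listed, so it is removed
```

In this model, an admin would first remove res1 from the attribute (for example with the "-=" operator) and then delete the resource.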

...