...

  • Visibility: Public
  • Change Control: Stable
  • Python type: pbs.select
  • Input Format: rjselect="[N:][chunk specification][+[N:]chunk specification]" (same as select spec)
    where N is the number of chunks requested, and chunk specification is of the form:
    "<resource name>=<value>[:<resource name>=<value>...]"
    rjselect can be set in one of the following ways:
    • qsub -l rjselect=[N:][chunk specification][+[N:]chunk specification]
    • qalter -l rjselect=[N:][chunk specification][+[N:]chunk specification]
    • As a PBS directive in a job script:
                   #PBS -l rjselect=[N:][chunk specification][+[N:]chunk specification]
    • Within a Python hook script, use the pbs.select() type.
                 pbs.event().job.Resource_List["rjselect"] = pbs.select("[N:][chunk specification][+[N:]chunk specification]")
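    For example (an illustrative submission; the chunk values are taken from the Normal case example below, and the script name job.scr is hypothetical), a PBS operator or manager could request both resources at submission time:
                  qsub -l select="ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb" -l rjselect="ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb" job.scr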
  • Privilege: only root, a PBS admin, or a PBS operator can set the rjselect value
  • Details:
    This is a builtin resource used to cause a job to be started reliably. rjselect must mirror the original 'select' request, but with extra chunks added so that extra vnodes are assigned. When rjselect is set, the schedselect job attribute is mapped to this value instead of the original 'select', causing the scheduler to allocate the extra nodes to the job. At the start of the job, before the user's program/job script is executed, any failure of sister nodes is tolerated. Then, just before the execjob_launch hook (if any) is executed, and also before the user's program/job script is started, the job's assigned nodes are pruned back (the failed nodes are removed) to a minimum list that is just enough to satisfy the original 'select' request.

    When the job is updated, the MS (mother superior) mom tells the sister moms (via an IM_UPDATE_JOB request) that the nodes mapping has changed. This is needed so that spawning multi-node tasks via the TM interface continues to work.
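    As an illustration of how rjselect mirrors the original 'select', below is a minimal sketch of a hook that adds one extra chunk to every chunk after the first. The choice of a queuejob hook, the bump-by-one policy, and the string handling are assumptions made for this sketch; only the pbs.select() assignment shown above is prescribed by this design.

      # Sketch: queuejob hook that derives rjselect from the original select.
      # Assumption: each chunk after the first (MS) chunk gets one spare chunk.
      import pbs

      e = pbs.event()
      j = e.job

      sel = j.Resource_List["select"]
      if sel is not None:
          chunks = str(sel).split("+")
          new_chunks = []
          for i, chunk in enumerate(chunks):
              # A chunk may carry a leading "<N>:" count; split it off if present.
              head, sep, rest = chunk.partition(":")
              if head.isdigit():
                  count, spec = int(head), rest
              else:
                  count, spec = 1, chunk
              if i > 0:
                  count += 1      # keep the first chunk as is (MS host)
              new_chunks.append("%d:%s" % (count, spec))
          j.Resource_List["rjselect"] = pbs.select("+".join(new_chunks))

    With the select value from the Normal case example below, this sketch produces rjselect=1:ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb, which is equivalent to the value shown there (an explicit leading '1:' count is optional).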

  • Normal case:
    • The rjselect value is a mirror of the original select value, with the 'N' part of a chunk specification (i.e. [N:][chunk specification][+[N:]chunk specification]) changed to reflect the additional chunks.
    • Example: given qsub -l select="ncpus=3:mem=1gb+ncpus=2:mem=2gb+ncpus=1:mem=3gb", where there is 1 chunk of the "ncpus=3:mem=1gb" spec, 1 chunk of the "ncpus=2:mem=2gb" spec, and 1 chunk of the "ncpus=1:mem=3gb" spec, an rjselect value can bump up the second chunk to 2 of the "ncpus=2:mem=2gb" spec and the third chunk to 2 of the "ncpus=1:mem=3gb" spec as follows:    rjselect=ncpus=3:mem=1gb+2:ncpus=2:mem=2gb+2:ncpus=1:mem=3gb
    •   The first chunk assigned is always preserved since that is allocated from the MS host, so there's really no need to bump the number of chunks for the first chunk in the original select request.
  • Log/Error messages:
    1. If 'rjselect' is set by a regular, non-privileged user, then qsub or qalter will report the following error message:
                     "not allowed to set 'rjselect' unless PBS operator or manager"
    2. When a job's assigned nodes get pruned, mom_logs will show the following info under PBSEVENT_JOB log level:

      ";Job;<jobid>;pruned from exec_vnode=<original value_satisfying_rjselect>"
      ";Job;<jobid>;pruned to exec_vnode=<new value_mapping_original_select>"
      ";Job;<jobid>;pruned from schedselect=<original schedselect value_satisfying_rjselect>"
      ";Job;<jobid>;pruned to schedselect=<new schedselect value_mapping_original_select>"

    3. When the mother superior fails to prune the currently assigned chunk resources (which satisfied the rjselect request) into a minimum set that satisfies the original select specification, the job is requeued, and the following detailed mom_logs messages are shown under PBSEVENT_DEBUG log level unless otherwise noted:

      1. "<jobid>;job not started, Retry 3" under PBSEVENT_ERROR log level

      2. "could not satisfy 1st select chunk (<resc1>=<val1> <resc2>=<val2>... <rescN>=valN) with first available chunk (<resc1>=<val1> <resc2>=<val2>...<rescN>=<valN>"  when first chunk from the original select spec could not be satisfied

      3. "could not satisfy the original select chunk (<resc1>=<val1> <resc2>=<val2> ...<rescN>=valN) with first available chunk <resc1>=<val1> <resc2>=<val2>..."  when a secondary (sister) chunk from the original select spec could not be satisfied

      4. "job node_list_fail: node <node_name1>" which shows a job's vnode that is managed by an unhealthy mom

      5. "job node_list: node <node_name1>" which shows a job's vnode managed by a fully functionining mom.

    4. When a sister mom has updated its internal nodes info, mom_logs on the sister host will show the following message at PBSEVENT_JOB log level:

          ";pbs_mom;Job;<jobid>;updated nodes info"

    5. When the mother superior notices that not all acks were received from the sister moms regarding the update of their internal nodes data, mom_logs will show the
       PBSEVENT_DEBUG2 message: "NOT all job updates to sister moms completed". Note that eventually all the nodes will automatically complete updating their info.
    6. If a sister mom receives a TM request but its nodes data have not been updated yet, then mom_logs on the sister host will show under PBSEVENT_JOB log level:
      "job reliably started but missing updated nodes list. Try spawn() again later."
      NOTE: The client will get an "error on spawn" message while doing tm_spawn.

  • Examples:

...