This design only applies to Cray ALPS systems.

On a Cray ALPS system, when a job's script finishes running, PBS sends ALPS a release reservation request.  PBS then polls intermittently until ALPS responds with "No entry for resId".  That response tells PBS that the ALPS reservation has been successfully canceled.
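
The release-and-poll cycle described above can be sketched roughly as follows. This is only an illustration in Python, not the PBS mom code: `wait_for_release`, `send_release` and `next_delay_s` are hypothetical names, and the 600-second give-up time stands in for the existing alps_release_timeout tunable mentioned later on this page.

```python
import time
from typing import Callable

# Response text that this design names as proof the reservation is gone.
RELEASED_MARKER = "No entry for resId"


def wait_for_release(
    send_release: Callable[[], str],
    next_delay_s: Callable[[int], float],
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Resend the release request until ALPS reports no such reservation.

    send_release and next_delay_s are hypothetical callbacks: the first
    returns the ALPS response text, the second maps an attempt number to
    the number of seconds to wait before the next request.
    """
    deadline = clock() + timeout_s
    attempt = 0
    while True:
        if RELEASED_MARKER in send_release():
            return True  # ALPS no longer knows the reservation
        if clock() >= deadline:
            return False  # give up once the timeout has passed
        sleep(next_delay_s(attempt))
        attempt += 1
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.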

What happens today

Today, the interval between ALPS release reservation requests grows exponentially with each attempt.  PBS also adds a random 0 to 4 seconds of jitter to each interval, so that when many jobs end at the same time PBS does not overwhelm ALPS with simultaneous release requests.  The jitter spreads the release requests for different reservations across different intervals.  The total time between ALPS release reservation requests is therefore the exponentially growing base interval plus a random value between 0 and 4 seconds.  Both the interval and the jitter timings were requested by Cray.
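
As a rough illustration of the current timing, the sketch below computes one wait in Python. Two things here are assumptions, not taken from the PBS source: that the base interval doubles starting from 1 second, and that the jitter is a whole number of seconds. The page only states that the base grows exponentially and that the jitter is between 0 and 4 seconds. `legacy_delay_s` is a hypothetical name.

```python
import random

MAX_JITTER_S = 4  # upper end of the 0 to 4 second jitter described above


def legacy_delay_s(attempt: int, rng: random.Random) -> int:
    """Seconds to wait before retry number `attempt` (0-based).

    ASSUMPTION: the base doubles from 1 second (1, 2, 4, ...). The page
    says only that the interval grows exponentially, not what the base is.
    ASSUMPTION: jitter is drawn in whole seconds, 0 to 4 inclusive.
    """
    base = 2 ** attempt
    jitter = rng.randint(0, MAX_JITTER_S)
    return base + jitter
```

Under these assumptions the fifth retry alone would wait between 16 and 20 seconds, which shows why a finished reservation can go unnoticed for a while.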

New proposal

Cray says that things have changed and PBS should now be able to poll at a different interval.  This way, the release of a job's ALPS reservation can be discovered sooner, and the next job can use those resources sooner.  The best way for PBS to handle this is to put control in the PBS administrator's hands.  Two new mom tunables will allow the PBS administrator to adjust the base interval and the maximum amount of jitter independently.  The total interval is the value of alps_release_interval_usec plus a randomly generated value between 0 and alps_release_jitter_usec.
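
The proposed calculation can be sketched as below. This is a minimal Python illustration, not the mom implementation: `release_delay_usec` is a hypothetical name, and the default values are the ones this page gives for the two tunables.

```python
import random
from typing import Optional

DEFAULT_INTERVAL_USEC = 500_000    # default alps_release_interval_usec (0.5 sec)
DEFAULT_JITTER_USEC = 4_000_000    # default alps_release_jitter_usec (4 sec)


def release_delay_usec(
    interval_usec: int = DEFAULT_INTERVAL_USEC,
    jitter_usec: int = DEFAULT_JITTER_USEC,
    rng: Optional[random.Random] = None,
) -> int:
    """Microseconds to wait before the next ALPS release request.

    The result is the fixed base interval plus a random jitter drawn
    from 0 to jitter_usec inclusive.
    """
    if rng is None:
        rng = random.Random()
    return interval_usec + rng.randint(0, jitter_usec)
```

With the defaults, every wait falls between 0.5 and 4.5 seconds, and unlike today's behavior it does not grow with the number of attempts.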

Tunable 1 - alps_release_interval_usec
  • This sets the base time in microseconds to wait between ALPS release reservation requests 
  • It is an integer, with valid values starting from 1
    • Remember, there is an existing mom tunable alps_release_timeout which defaults to 600 seconds (10 min).  That is the point at which PBS gives up trying to contact ALPS, and no more ALPS release reservation requests will be sent to ALPS.
  • Set alps_release_interval_usec in the mom_priv/config file
  • If it is not set in the mom_priv/config file, the default value of alps_release_interval_usec is 500000 usecs (0.5 sec)
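
To see how the interval interacts with alps_release_timeout, the sketch below bounds how many waits fit inside the timeout. It is a back-of-the-envelope calculation in Python that ignores the time ALPS takes to answer each request. `wait_bounds` is a hypothetical helper, and the 4000000 usec jitter default is the one given for alps_release_jitter_usec on this page.

```python
from typing import Tuple

USEC_PER_SEC = 1_000_000


def wait_bounds(timeout_s: int, interval_usec: int, jitter_usec: int) -> Tuple[int, int]:
    """Return (fewest, most) whole waits that fit inside the timeout.

    fewest: every wait draws the maximum jitter.
    most:   every wait draws zero jitter.
    """
    timeout_usec = timeout_s * USEC_PER_SEC
    fewest = timeout_usec // (interval_usec + jitter_usec)
    most = timeout_usec // interval_usec
    return fewest, most


# Defaults from this page: 600 sec timeout, 0.5 sec interval, 4 sec jitter.
print(wait_bounds(600, 500_000, 4_000_000))  # (133, 1200)
```

Lowering alps_release_interval_usec therefore raises the number of requests ALPS may see before PBS gives up, which is worth keeping in mind when choosing a value.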

Tunable 2 - alps_release_jitter_usec

The jitter was turned into a tunable so that the PBS administrator can choose to increase or decrease the amount of jitter added to the interval.

  • This sets the maximum jitter.  For each interval, PBS randomly generates a jitter amount between 0 and alps_release_jitter_usec microseconds and adds it to the base interval.
  • alps_release_jitter_usec is an integer with valid values starting from 1
  • Set alps_release_jitter_usec in the mom_priv/config file
  • If it is not set in the mom_priv/config file, the default value of alps_release_jitter_usec is 4000000 usecs (4 sec)
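
The rules for both tunables (integer, valid values starting from 1, default used when unset) can be sketched as a small reader for config text. This is an illustration in Python only. It assumes the entries use the `$name value` form common to mom config directives and that `#` starts a comment, neither of which is stated on this page, and `read_release_tunables` is a hypothetical name.

```python
DEFAULTS = {
    "alps_release_interval_usec": 500_000,    # 0.5 sec
    "alps_release_jitter_usec": 4_000_000,    # 4 sec
}


def read_release_tunables(config_text: str) -> dict:
    """Return both tunables, taking defaults for any that are not set.

    ASSUMPTION: entries look like "$alps_release_interval_usec 250000".
    Raises ValueError for a non-integer value or a value below 1.
    """
    values = dict(DEFAULTS)
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        parts = line.split()
        name = parts[0].lstrip("$")
        if name not in DEFAULTS:
            continue  # some other mom setting, not handled here
        if len(parts) != 2:
            raise ValueError(f"{name}: expected exactly one value")
        value = int(parts[1])  # int() raises ValueError if not an integer
        if value < 1:
            raise ValueError(f"{name}: valid values start from 1")
        values[name] = value
    return values
```

For example, config text that sets only alps_release_interval_usec to 250000 would yield that value together with the 4000000 usec jitter default.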



