PP-516: Direct write of the job's stdout/err files.

Interface 1: New option to have the job's output (stdout/stderr) files written directly to the final destination, instead of being staged, if the final destination is known to be writable from the job execution node.

  • Visibility: Public
  • Change Control: Stable
  • Details: A user can choose to have their job’s output (.o and .e) files written directly to the final destination, instead of being staged, if the file system is available from mother superior.
  • "d" modifier can be used with existing "qsub -k" option. (Ex. qsub -k oed)
  • The phrase "known to be writable" or "usecp-able" means "the file's ultimate destination host:path is mapped from the primary execution node via the existing $usecp directive in mom config" or "the execution node is the same as the destination host".
  • The job's Output_path and Error_path are settable with the -o and -e options, and will be honored if the "d" modifier is used for the corresponding file.
  • The admin can make this behavior the default by using "default_qsub_arguments = -koed".
  • If the d modifier for -k is used but the specified file's final destination path(s) are NOT usecp-able, the mom should log a warning and continue running the job with normal spooling and staging to the final destination.
  • This is reflected in qstat -f output as: Keep_Files = oed
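
The per-file decision described above can be sketched as follows. This is a minimal illustration, not the mom implementation: the function name plan_std_files, its inputs, and the result labels are hypothetical, and the "keep_on_exec_host" case reflects the existing -koe behavior as described in the examples of this document.

```python
def plan_std_files(keep_files, usecp_able):
    """Return how each standard stream would be handled.

    keep_files: the -k sub-argument string, e.g. "oed" (hypothetical input).
    usecp_able: dict mapping "o"/"e" to True when that file's final
                destination is usecp-able from the primary execution host.
    """
    direct = "d" in keep_files
    plan = {}
    for stream in ("o", "e"):
        if stream not in keep_files:
            # Nothing requested for this file: existing spool-and-stage path.
            plan[stream] = "spool_and_stage"
        elif not direct:
            # Existing -k o / -k e behavior: file is kept on the execution host.
            plan[stream] = "keep_on_exec_host"
        elif usecp_able.get(stream, False):
            plan[stream] = "direct_write"
        else:
            # Not usecp-able: fall back to spooling; the mom logs a warning.
            plan[stream] = "spool_and_stage"
    return plan


if __name__ == "__main__":
    print(plan_std_files("od", {"o": True}))
```

With -kod and a usecp-able Output_path, only the output file is written directly; the error file follows the existing spool-and-stage path.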

Interface 2: A user shall be able to provide an option at job submission time to have PBS remove the output files (.o and .e) for that job, if it completes successfully.

  • Visibility: Public
  • Change Control: Stable
  • Details: Introduce a new job attribute, Remove_Files, set with the "R" option of qsub, which means "remove output/error files upon job completion".
  • "job completion" means terminated with no errors.
  • qsub -R oe job.sh
  • The admin can make this behavior the default by using "default_qsub_arguments = -Roe".
  • The user can choose which files are deleted (.e, .o, or both).
  • This is reflected in qstat -f output as: Remove_Files = oe
  • The format is a string; the valid values are "e", "o", or both of them.

  • The default value is None. (option will be disabled)
  • This attribute can be set or read by user, operator, manager.
  • If the job has the Remove_Files attribute set and the job has succeeded, the std_files do not have to be copied. In that case, the server will not send a stage out request to the mom. In this scenario, $PBS_HOME/spool will contain the std_files and they will be removed as part of the cleanup routines.
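
The removal rule above can be sketched as a small function. This is only an illustration: the name files_to_remove is hypothetical, and treating "terminated with no errors" as an exit status of 0 is an assumption, since the text does not define the exact success test.

```python
def files_to_remove(remove_files, exit_status):
    """Return the set of streams ("o", "e") to delete once the job ends.

    remove_files: value of the Remove_Files attribute, or None when unset.
    exit_status:  the job's exit status (assumption: 0 means success).
    """
    if not remove_files:
        # Default value is None: the option is disabled.
        return set()
    invalid = set(remove_files) - {"o", "e"}
    if invalid:
        raise ValueError("invalid Remove_Files value: %r" % remove_files)
    if exit_status != 0:
        # Unsuccessful job: keep the files.
        return set()
    return set(remove_files)


if __name__ == "__main__":
    print(sorted(files_to_remove("oe", 0)))
```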

Interface 3: Warning messages will be generated by the mom in the following scenarios.

  • Visibility: Public
  • Change Control: Stable
  • Details:
  • The following warning message will be logged if direct write was requested but the path(s) are not usecp-able from the primary execution host:
  • "Direct write is requested for job:$job_id but the destination: $final_destination_directory is not usecp-able from $mom_hostname" (DEBUG3)
  • The same message will also be written to the job's stderr file.
  • The following warning message will be logged if the mom encountered a problem and wants post-job processing restarted while direct_write is enabled:

  • "Skipping copy of directly written $which file of job $jobid" (DEBUG4)

  • The following warning will be logged if the mom concludes that the stdout/err files may have been written directly and are therefore not available in the spool area:
  • "Skipping directly written/absent spool file $file_path" (DEBUG4)

Interface 4: direct_write and remove_files options can be used with qalter.

  • Visibility: Public
  • Change Control: Stable
  • Details: A user can change the provided options for a particular job if the job has not started yet.
  • Usage of direct_write with qalter:
  • qalter -koed $jobid
  • Usage of remove_files with qalter:
  • qalter -Roe $jobid
  • If the job has already started running, qalter will report the following (already existing) error:
  • qalter: Cannot modify attribute while job running  Remove_Files $jobid
  • Existing behavior of -koe is retained for 'd' option (will throw error if tried to modify a running job).

Interface 5: More lenient usage of -k sub-arguments

  • Visibility: Public
  • Change Control: Stable
  • Details: The user can specify any combination of (oedn)*, with the only exception that n cannot be used with o or e.
  • The earlier implementation of keep_files was more restrictive about sub-arguments:
    only o, e, oe, or eo were accepted with -k.
  • After this RFE, the following usages will be valid.
    1. -kode
      o and e do not have to be adjacent.
    2. -koded
      Multiple occurrences of a sub-argument do not result in an illegal operation error, as the rule is not violated.
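
The relaxed rule can be sketched as a validator. This is a minimal illustration under stated assumptions: the name is_valid_keep_arg is hypothetical, rejecting an empty string is an assumption, and the text does not say whether n may be combined with d, so the sketch applies only the stated o/e restriction.

```python
def is_valid_keep_arg(value):
    """Check a -k sub-argument string against the relaxed rule."""
    if not value:
        return False  # assumption: an empty sub-argument is rejected
    chars = set(value)
    if not chars <= set("oedn"):
        return False  # only o, e, d and n are recognised
    if "n" in chars and chars & {"o", "e"}:
        return False  # n cannot be used with o or e
    return True       # order and repetition do not matter


if __name__ == "__main__":
    for arg in ("ode", "oded", "on"):
        print(arg, is_valid_keep_arg(arg))
```

Reducing the argument to a set of characters is what makes ordering (-kode) and repetition (-koded) irrelevant to the check.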


Examples:

  • qsub -koed 

Means: write both the job's output and error files directly to the Output_path and Error_path if host:path is usecp-able from the primary exec host.  If they are not, issue a warning in the mom log and the stderr file, then do normal spooling and staging.

  • qsub -kod 

Means: write the job's output file directly to the Output_path if host:path is usecp-able from the primary exec host.  If it is not, issue a warning in the mom log and the stderr file, then do normal spooling and staging of the output file.  The job's error file will be spooled in $PBS_HOME/spool and staged to Error_path per existing behavior, since nothing concerning it was specified.

  • qsub -ked 

Means: write the job's error file directly to the Error_path if host:path is usecp-able from the primary exec host.  If it is not, issue a warning in the mom log and the stderr file, then do normal spooling and staging of the error file.  The job's output file will be spooled and staged per existing behavior, since nothing concerning it was specified.

  • qsub -Roe -koe

Means: write both files to the user's local home directory (it does not matter whether it is usecp-able in this case; this is existing -koe functionality), then remove both files upon successful job completion.

  • qsub -koed -Roe

Means: write both the job's output and error files directly to the Output_path and Error_path if host:path is usecp-able from the primary exec host.  If they are not, issue a warning in the mom log and the stderr file, then do normal spooling.  When the job completes successfully, remove the output and error files either from their directly written location or from $PBS_HOME/spool.  If the job is unsuccessful, leave directly written files in place, or stage spooled files to Output_path and Error_path.

  • qsub -keo -Re

Means: write both files to the user's local home directory (it does not matter whether it is usecp-able in this case; this is existing -koe functionality), then remove only the error file upon successful job completion.

  • qsub -koed -Re

Means: both output and error files are written directly to the Output_path and Error_path if host:path is usecp-able from the primary exec host.  The error file will be removed upon successful job completion; the output file is retained.

  • qsub -Wsandbox=PRIVATE -Roe

Means: the output files will be written to the sandbox and deleted upon successful job completion.

  • When used with the -j option: if the user specifies -joe, both stdout and stderr are streamed to the .o file. If -ke is also specified, the mismatch is silently ignored rather than reported as an error.
  • If $usecp is mapped and -o and/or -e point to the source directory of the $usecp mapping, the file is written under the mapped destination directory:

    $usecp *:/home/user  /tmp/dest

    qsub -koed -o /home/user/outputfile job.sh

           In this case the output file will be generated at /tmp/dest/outputfile
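
The path translation in this example can be sketched as below. This is an illustration only: parse_usecp and map_usecp are hypothetical helpers, and the use of shell-style wildcard matching on the host part and plain prefix matching on the path are assumptions, not a description of how the mom actually evaluates $usecp. A result of None corresponds to a destination that is not mapped by any rule.

```python
from fnmatch import fnmatch


def parse_usecp(line):
    """Split a '$usecp host:source_prefix dest_prefix' line into its parts."""
    directive, source, dest_prefix = line.split()
    if directive != "$usecp":
        raise ValueError("not a $usecp directive: %r" % line)
    host_pattern, src_prefix = source.split(":", 1)
    return host_pattern, src_prefix.rstrip("/"), dest_prefix.rstrip("/")


def map_usecp(rules, host, path):
    """Return the local path for host:path, or None if no rule matches."""
    for host_pattern, src_prefix, dest_prefix in rules:
        if not fnmatch(host, host_pattern):
            continue
        # Match whole path components only, so /home/username is not
        # treated as being under /home/user.
        if path == src_prefix or path.startswith(src_prefix + "/"):
            return dest_prefix + path[len(src_prefix):]
    return None


if __name__ == "__main__":
    rules = [parse_usecp("$usecp *:/home/user  /tmp/dest")]
    # "submit.example.com" is a placeholder host name.
    print(map_usecp(rules, "submit.example.com", "/home/user/outputfile"))
    # prints /tmp/dest/outputfile
```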


Community discussion