Target release: 17.1.1
Epic: PP-758
Document status: DRAFT
Forum Discussion/Review: http://community.pbspro.org/t/pp-758-add-pbs-snapshot-tool-to-capture-state-logs-from-pbs/520/22
Document owner
Designer
Developers
QA

Introduction:

This tool is meant to replace the 'pbs_diag' script which is currently the means to capture data from PBS for diagnostics.

"pbs_snapshot" will be written in Python and will use PTL libraries to interact with the PBS system that it is capturing. Because PTL is now updated in tandem with PBS, major changes to PBS should require little or no refactoring of pbs_snapshot, and the tool should keep working with the latest version of PBSPro.

A new set of utilities (PBSSnapUtils) will also be added to PTL itself for this tool. These will be directly available to PTL test writers and developers whose tests or debugging tools need the ability to take snapshots of PBS.

The first version of the tool will also come with the ability to anonymize/obfuscate PBS data to enable users with sensitive data to obfuscate and share snapshots for bug reporting and debugging.

Shape and Form of a 'snapshot':

A 'snapshot', which will be the output produced by the pbs_snapshot tool, will be a tarball (.tgz file) containing the following directory structure & files:

  • server/
    • qstat_B.out: output of "qstat -B"
    • qstat_Bf.out: output of "qstat -Bf"
    • qmgr_ps.out: output of "qmgr print server"
    • server_priv/: a copy of the 'server_priv' directory inside PBS_HOME. It may or may not include accounting logs (see the -L option under "Interface Documentation"). Core files are not captured (see core_file_bt/).
    • server_logs/ (optional): contains server logs from the PBS_HOME/server_logs directory for the number of days specified by -L option
  • job/
    • qstat.out: output of "qstat"
    • qstat_f.out: output of "qstat -f"
    • qstat_t.out: output of "qstat -t"
    • qstat_tf.out: output of "qstat -tf"
    • qstat_x.out: output of "qstat -x"
    • qstat_xf.out: output of "qstat -xf"
    • qstat_ns.out: output of "qstat -ns"
    • qstat_fx_F_dsv.out: output of "qstat -fx -F dsv"
    • qstat_f_F_dsv.out: output of "qstat -f -F dsv"
  • node/
    • pbsnodes_va.out: output of "pbsnodes -va"
    • pbsnodes_a.out: output of "pbsnodes -a"
    • pbsnodes_avSj.out: output of "pbsnodes -avSj"
    • pbsnodes_aSj.out: output of "pbsnodes -aSj"
    • pbsnodes_avS.out: output of "pbsnodes -avS"
    • pbsnodes_aS.out: output of "pbsnodes -aS"
    • pbsnodes_aFdsv.out: output of "pbsnodes -aFdsv"
    • pbsnodes_avFdsv.out: output of "pbsnodes -avFdsv"
    • qmgr_pn_default.out: output of "qmgr print node @default"
    • mom_priv/

      • Copies of the following files: 'config', 'prologue', 'epilogue', 'mom.lock'

      • config.d/: contains copy of all vnode def files from inside PBS_HOME/mom_priv/config.d/

    • mom_logs/ (optional): contains mom logs from the PBS_HOME/mom_logs directory for the number of days specified by -L option
  • comm/
    • comm_logs/ (optional): contains comm logs from the PBS_HOME/comm_logs directory for the number of days specified by -L option
  • queue/
    • qstat_Q.out: output of "qstat -Q"
    • qstat_Qf.out: output of "qstat -Qf"
  • hook/
    • qmgr_ph_default.out: output of "qmgr print hook @default"
    • qmgr_lpbshook.out: output of "qmgr list pbshook"
  • scheduler/
    • qmgr_psched.out: output of "qmgr print sched"
    • sched_priv/: a copy of the 'sched_priv' directory inside PBS_HOME with all the files. Core files are not captured (see core_file_bt/).
    • sched_logs/ (optional): contains scheduler logs from the PBS_HOME/sched_logs directory for the number of days specified by -L option
  • reservation/
    • pbs_rstat_f.out: output of "pbs_rstat -f"
    • pbs_rstat.out: output of "pbs_rstat"
  • resource/
    • qmgr_pr.out: output of "qmgr print resource"
    • rscs_all (derived from the resourcedef file): Will list out built-in as well as custom resources in the following format:

          Name: <resource name>
               type = <resource type attribute>
               flag = <resource flag attribute>

          Name: <resource name>
               type = <resource type attribute>
               flag = <resource flag attribute>

          ...
          ...

  • datastore/

    • pg_log/: a copy of the "PBS_HOME/datastore/pg_log" directory

  • pbs/

    • pbs.conf: a copy of the pbs.conf file for the PBS system

    • pbs_probe_v.out: output of "pbs_probe -v"

    • pbs_hostn_v.out: output of "pbs_hostn -v $(hostname)"
    • pbs_environment: copy of PBS_HOME/pbs_environment file
  • core_file_bt/ (stack backtrace from core files)

    • sched_priv/: files containing the output of "thread apply all backtrace full" on all core files captured from PBS_HOME/sched_priv

    • server_priv/: files containing the output of "thread apply all backtrace full" on all core files captured from PBS_HOME/server_priv
    • mom_priv/: files containing the output of "thread apply all backtrace full" on all core files captured from PBS_HOME/mom_priv
  • system/
    • os_info: Information about the OS: version, flavour of Linux, etc. (output of "uname -a" and "cat /etc/*release*" for Linux)
    • process_info: List of processes running on the system when the snapshot was taken (output of "ps -ef | grep pbs | grep -v grep" for Linux)
    • lsof_pbs.out: output of "lsof | grep pbs | grep -v grep", only on Linux systems
    • ps_aux_pbs.out: output of "ps -aux | grep pbs | grep -v grep", only on Linux systems
    • etc_hosts: copy of the "/etc/hosts" file, only on Linux systems
    • etc_nsswitch_conf: copy of the "/etc/nsswitch.conf" file, only on Linux systems
    • vmstat.out: output of "vmstat", only on Linux systems
    • df_h.out: output of "df -h", only on Linux systems
    • dmesg.out: output of "dmesg", only on Linux systems
  • ctime: this will log the time (since epoch) when the snapshot was taken.
  • pbs_snapshot.log (optional): captures the logs generated by pbs_snapshot if the -l option is provided.
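
As an illustration of how the rscs_all layout described under resource/ could be consumed, here is a minimal Python sketch that turns the "Name: / type = / flag =" blocks into a dictionary. The function name parse_rscs_all and the sample resource names and values are made up for this example; they are not part of pbs_snapshot or PTL.

```python
def parse_rscs_all(text):
    """Parse 'Name: <resource>' blocks followed by 'key = value' lines.

    Returns {resource_name: {"type": ..., "flag": ...}}.
    """
    resources = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines separate the blocks
        if line.startswith("Name:"):
            current = line[len("Name:"):].strip()
            resources[current] = {}
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            resources[current][key.strip()] = value.strip()
    return resources


# Hypothetical sample content, only to exercise the parser.
SAMPLE = """\
Name: res_a
     type = long
     flag = nh

Name: res_b
     type = string
     flag = h
"""

parsed = parse_rscs_all(SAMPLE)
print(parsed)
```

Keeping the parser tolerant of blank lines and indentation means it does not depend on the exact whitespace used in the file.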

Interface Documentation:

The interface for pbs_snapshot will be as follows:

sudo pbs_snapshot [OPTION]

-d <pbs_diag>: diag directory to use as input
-H <hostname>: hostname to operate on. Defaults to the value of PBS_SERVER
-L <num days>: number of days of pbs logs to collect
-l <loglevel>: set log level to one of INFO, INFOCLI, INFOCLI2, DEBUG, DEBUG2,
                      WARNING, ERROR, FATAL
-o <dir>: Output directory
--additional_hosts=<hostname>: Also capture logs from additional hosts
                                                        'hostname' is a comma separated list of hostnames to take logs from
--map=<file>: Optional path to filename to store the mapping of obfuscated data
--obfuscate: Obfuscates euser, egroup, project, account_name, hostnames,
                     IP addresses, PBS dataservice username
                     Deletes mail endpoints, owner, managers, operators, variable_list
                     ACLs, group_list, job name, jobdir
--version: print version number and exit


sudo - Currently pbs_snapshot will need to be run as a user with sudo privileges because it needs to access protected PBS information (e.g. information inside the PBS_HOME/*_priv directories).


Interface: Option -d <pbs_diag>

  • Synopsis: Option to provide path to a pbs_diag directory to be used to generate the snapshot
  • Details:
    • This option is meant to make pbs_snapshot usable on diags generated by the pbs_diag script.
    • <pbs_diag> should be the path to a pbs_diag directory generated by unpacking the tarball that pbs_diag produces.
    • This option will instruct pbs_snapshot not to query a live PBS system and to use the information captured inside the diag to create the snapshot instead.
    • No sudo privileges are needed when running pbs_snapshot using this option

Interface: Option -H <hostname>

  • Synopsis: Option to provide the hostname of the PBS server
  • Details:
    • This option will make pbs_snapshot ignore the value of PBS_SERVER and instead use the one provided.

Interface: Option -L <num days>

  • Synopsis: Option to instruct pbs_snapshot to capture logs for the given number of days going back from the current day
  • Details:
    • This will capture sched_logs, server_logs and accounting logs.
    • The value of <num days> should be >= 0.
      • If the value is 0, only the logs for the current day are captured.
      • If the value is 1, only the logs for the current day and the day before are captured.
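
The day-counting rule above (a value of N covers the current day plus the N days before it, so N + 1 days in total) can be sketched in Python as follows. The helper names are hypothetical, and the YYYYMMDD file naming is an assumption made for this example, not something this document specifies.

```python
from datetime import date, timedelta


def log_dates(num_days, today=None):
    """Return the dates covered by -L <num days>, newest first."""
    if num_days < 0:
        raise ValueError("<num days> should be >= 0")
    today = today or date.today()
    # 0 -> today only, 1 -> today and yesterday, and so on.
    return [today - timedelta(days=offset) for offset in range(num_days + 1)]


def log_file_names(num_days, today=None):
    """Assumption: one log file per day, named YYYYMMDD."""
    return [d.strftime("%Y%m%d") for d in log_dates(num_days, today)]


print(log_file_names(1, today=date(2017, 3, 1)))
```

This matches the sample usage later in the document, where "-L 10" yields 11 days of logs.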

Interface: Option -l <loglevel>

  • Synopsis: Option to set the desired log level for debugging pbs_snapshot
  • Details:
    • The <loglevel> can be set to INFO, INFOCLI, INFOCLI2, DEBUG, DEBUG2, WARNING, ERROR or FATAL.
    • The logging becomes more comprehensive going from FATAL to INFO.
    • By default, the log level will be set to INFOCLI2.
    • The generated logs will also be written out in the file 'pbs_snapshot.log' inside the snapshot directory.

Interface: Option -o <dir>

  • Synopsis: Option to specify path to the output/snapshot directory
  • Details:
    • By default, pbs_snapshot will generate the snapshot directory inside /tmp/, and the snapshot directory will be named "snapshot_<timestamp>".
      • <timestamp> will be in the format: YYYYMMDD_HH_MM_SS
    • If the directory <dir> already exists, pbs_snapshot will prompt the user to confirm whether to overwrite it.
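
A small Python sketch of the default naming and the overwrite prompt described above. Both function names are hypothetical, and the prompt wording is invented for illustration; only the /tmp location and the YYYYMMDD_HH_MM_SS format come from this document.

```python
import os
from datetime import datetime


def default_snapshot_dir(now=None, base="/tmp"):
    """Build the default output path: <base>/snapshot_<timestamp>."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H_%M_%S")  # YYYYMMDD_HH_MM_SS
    return "%s/snapshot_%s" % (base, stamp)


def may_write(path, confirm=input):
    """Return True if it is fine to write the snapshot to 'path'.

    'confirm' is injectable so the prompt can be replaced in tests.
    """
    if not os.path.isdir(path):
        return True
    answer = confirm("%s already exists, overwrite? [y/N] " % path)
    return answer.strip().lower() in ("y", "yes")


print(default_snapshot_dir(datetime(2017, 1, 2, 3, 4, 5)))
```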

Interface: Option --additional_hosts=<hostname>

  • Synopsis: Option to capture logs from hosts other than the one where PBS Server is running
  • Details:
    • This option only works with the -L <num days> option.
    • This option will cause pbs_snapshot to capture the logs of all the daemons running on the specified hosts, for the days specified in the -L option.
    • The 'hostname' argument can be a single hostname, a comma separated list of hostnames to capture logs from, or blank, in which case logs from all the hosts running PBS daemons will be captured.
    • Warning: This can significantly increase the size of the snapshot.
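
The three accepted forms of the argument (single hostname, comma separated list, blank) and the dependency on -L could be handled along these lines. This is a sketch only: parse_additional_hosts is a hypothetical helper, and using None to mean "all hosts running PBS daemons" is a choice made for this example.

```python
def parse_additional_hosts(value, num_days):
    """Interpret the --additional_hosts argument.

    Returns a list of hostnames, or None when the argument is blank,
    which stands for "all hosts running PBS daemons" in this sketch.
    """
    if num_days is None:
        raise ValueError("--additional_hosts only works with -L")
    hosts = [host.strip() for host in value.split(",") if host.strip()]
    return hosts or None


print(parse_additional_hosts("hostA,hostB", num_days=2))
print(parse_additional_hosts("", num_days=2))
```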

Interface: Option --map=<file>

  • Synopsis: Option to generate a map file for obfuscated data
  • Details:
    • This option will cause pbs_snapshot to create a map file with the specified name, containing a "key:value" pair for each piece of data that is obfuscated.
    • This option will work only with the --obfuscate option.
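
To make the "key:value" layout concrete, the following Python sketch writes and reads such a file. The helper names and the sample pairs are invented, and this document does not say which side of the pair holds the original value, so the sketch treats the file as plain pairs. Splitting on the first colon is also an assumption and would not suit keys that themselves contain colons.

```python
import os
import tempfile


def write_map(path, mapping):
    """Write one 'key:value' pair per line."""
    with open(path, "w") as map_file:
        for key in sorted(mapping):
            map_file.write("%s:%s\n" % (key, mapping[key]))


def read_map(path):
    """Read 'key:value' lines back into a dict (split on the first colon)."""
    mapping = {}
    with open(path) as map_file:
        for line in map_file:
            line = line.rstrip("\n")
            if not line:
                continue
            key, _, value = line.partition(":")
            mapping[key] = value
    return mapping


# Made-up pairs, written to a temporary directory for the round trip.
demo = {"alice": "user_1", "staff": "group_1"}
with tempfile.TemporaryDirectory() as tmp_dir:
    map_path = os.path.join(tmp_dir, "mapfile.txt")
    write_map(map_path, demo)
    loaded = read_map(map_path)
print(loaded)
```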

Interface: Option --obfuscate

  • Synopsis: Option to obfuscate/anonymize the PBS data captured
  • Details:
    • This option will instruct pbs_snapshot to obfuscate euser, egroup, project and account_name. If the --map option is provided, it will generate a map file for these attributes.
    • It will also delete mail endpoints, owner, managers, operators, variable_list, ACLs, group_list, job name and jobdir.
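
One way to picture the two behaviours (replace some attributes, drop others) is the Python sketch below. The Obfuscator class, the replacement token scheme, and the exact attribute spellings in DELETE are illustrative assumptions; the real tool may name and handle these differently.

```python
# Attributes that get a stand-in value, as listed above.
OBFUSCATE = ("euser", "egroup", "project", "account_name")
# Illustrative keys for the attributes that are deleted outright.
DELETE = ("owner", "managers", "operators", "variable_list",
          "group_list", "job_name", "jobdir")


class Obfuscator:
    def __init__(self):
        self.mapping = {}    # original value -> replacement
        self._counters = {}  # attribute name -> last number used

    def _replacement(self, attr, value):
        # Reuse the same replacement for a value seen before, so that
        # relationships between records survive obfuscation.
        if value not in self.mapping:
            number = self._counters.get(attr, 0) + 1
            self._counters[attr] = number
            self.mapping[value] = "%s_%d" % (attr, number)
        return self.mapping[value]

    def apply(self, attrs):
        """Return a copy of 'attrs' with values replaced or removed."""
        result = {}
        for name, value in attrs.items():
            if name in DELETE:
                continue
            if name in OBFUSCATE:
                value = self._replacement(name, value)
            result[name] = value
        return result


obfuscator = Obfuscator()
job = {"euser": "alice", "owner": "alice@host1", "queue": "workq"}
print(obfuscator.apply(job))
print(obfuscator.mapping)
```

The mapping dictionary collected here is the kind of data that the --map option would write out.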

Interface: Option --version

  • Synopsis: Option to display the version of pbs_snapshot being used
  • Details:
    • This will just display the version and exit
    • The version displayed will be the version of PBSPro.
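
The option rules stated in the sections above (-L must be >= 0, -l defaults to INFOCLI2, --map needs --obfuscate, --additional_hosts needs -L) can be checked together. The sketch below uses Python's argparse purely for illustration; it is not the tool's actual option parser, and the destination names and version string are placeholders.

```python
import argparse

LOG_LEVELS = ["INFO", "INFOCLI", "INFOCLI2", "DEBUG", "DEBUG2",
              "WARNING", "ERROR", "FATAL"]


def non_negative_int(text):
    value = int(text)
    if value < 0:
        raise argparse.ArgumentTypeError("<num days> should be >= 0")
    return value


def build_parser():
    parser = argparse.ArgumentParser(prog="pbs_snapshot")
    parser.add_argument("-d", dest="diag", metavar="<pbs_diag>")
    parser.add_argument("-H", dest="hostname", metavar="<hostname>")
    parser.add_argument("-L", dest="num_days", type=non_negative_int)
    parser.add_argument("-l", dest="loglevel", choices=LOG_LEVELS,
                        default="INFOCLI2")
    parser.add_argument("-o", dest="outdir", metavar="<dir>")
    parser.add_argument("--additional_hosts")
    parser.add_argument("--map", dest="mapfile")
    parser.add_argument("--obfuscate", action="store_true")
    # Placeholder version string for this sketch.
    parser.add_argument("--version", action="version",
                        version="pbs_snapshot <PBSPro version>")
    return parser


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Cross-option rules taken from the option descriptions.
    if args.mapfile and not args.obfuscate:
        parser.error("--map works only with --obfuscate")
    if args.additional_hosts is not None and args.num_days is None:
        parser.error("--additional_hosts works only with -L")
    return args


args = parse(["-L", "10", "-o", "mysnapshot",
              "--obfuscate", "--map=mapfile.txt"])
print(args)
```

Invalid combinations make argparse print a usage message and exit with a non-zero status.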

Sample Usage:

  • sudo pbs_snapshot: Will capture snapshot of the system without any logs, and will store the output inside /tmp/snapshot_<timestamp>
  • sudo pbs_snapshot -L 10 -o mysnapshot: Will capture a snapshot at $PWD/mysnapshot along with 11 days of logs going back from the present day (including the present day's logs)
  • sudo pbs_snapshot -L 10 -o mysnapshot --obfuscate --map=mapfile.txt: Will capture a snapshot at $PWD/mysnapshot along with 11 days of logs, will obfuscate the data and store data mapping in the map file named 'mapfile.txt'.


