Page Properties

  • Target release: 17.1.1
  • Epic: PP-758 (JIRA, pbspro.atlassian.net)
  • Document status: complete
  • Forum Discussion/Review: http://community.pbspro.org/t/pp-758-add-pbs-snapshot-tool-to-capture-state-logs-from-pbs/520/22
  • Document owner:
  • Designer:
  • Developers:
  • QA:


Introduction:

This tool is meant to replace the 'pbs_diag' script which is currently the means to capture data from PBS for diagnostics.

"pbs_snapshot" will be written in Python and will use the PTL libraries to interact with the PBS system that it is capturing. Because PTL is now updated in tandem with PBS, major changes to PBS should require little or no refactoring of pbs_snapshot, and the tool should keep working with the latest version of PBSPro.

Also, a new set of utilities (PBSSnapUtils) will be added to PTL itself for this tool, which will be directly available for PTL test writers and developers to write PTL tests/debugging tools which may need the ability to take snapshots of PBS.

The first version of the tool will also come with the ability to anonymize/obfuscate PBS data to enable users with sensitive data to obfuscate and share snapshots for bug reporting and debugging.

Shape and Form of a 'snapshot':

A 'snapshot', which is the output produced by the pbs_snapshot tool, will be a tarball (.tgz file) named "snapshot_<timestamp>.tgz" containing the following directory structure and files:

  • server/
    • qstat_B.out: output of "qstat -B"
    • qstat_Bf.out: output of "qstat -Bf"
    • qmgr_ps.out: output of "qmgr print server"
    • qstat_Q.out: output of "qstat -Q"
    • qstat_Qf.out: output of "qstat -Qf"
    • qmgr_pr.out: output of "qmgr print resource"
    • qmgr_pq.out: output of "qmgr print queue @default"
  • server_priv/: a copy of the 'server_priv' directory inside PBS_HOME; core files are captured separately (see core_file_bt/)
    • accounting/: contains accounting logs from the PBS_HOME/server_priv/accounting/ directory for the number of days specified by the --accounting-logs option
  • server_logs/: contains server logs from the PBS_HOME/server_logs directory for the number of days specified by the --daemon-logs option
  • job/
    • qstat.out: output of "qstat"
    • qstat_f.out: output of "qstat -f"
    • qstat_t.out: output of "qstat -t"
    • qstat_tf.out: output of "qstat -tf"
    • qstat_x.out: output of "qstat -x"
    • qstat_xf.out: output of "qstat -xf"
    • qstat_ns.out: output of "qstat -ns"
    • qstat_fx_F_dsv.out: output of "qstat -fx -F dsv"
    • qstat_f_F_dsv.out: output of "qstat -f -F dsv"
    • qstat_f_F_json.out: output of "qstat -f -F json"
  • node/
    • pbsnodes_va.out: output of "pbsnodes -va"
    • pbsnodes_a.out: output of "pbsnodes -a"
    • pbsnodes_avSj.out: output of "pbsnodes -avSj"
    • pbsnodes_aSj.out: output of "pbsnodes -aSj"
    • pbsnodes_avS.out: output of "pbsnodes -avS"
    • pbsnodes_aS.out: output of "pbsnodes -aS"
    • pbsnodes_aFdsv.out: output of "pbsnodes -aFdsv"
    • pbsnodes_avFdsv.out: output of "pbsnodes -avFdsv"
    • pbsnodes_avFjson.out: output of "pbsnodes -avFjson"
    • qmgr_pn_default.out: output of "qmgr print node @default"

  • mom_priv/: a copy of the 'mom_priv' directory inside PBS_HOME (for example 'config', 'prologue', 'epilogue', 'mom.lock' and the vnode def files inside config.d/); core files are captured separately (see core_file_bt/)

  • mom_logs/: contains mom logs from the PBS_HOME/mom_logs directory for the number of days specified by the --daemon-logs option
  • comm_logs/: contains comm logs from the PBS_HOME/comm_logs directory for the number of days specified by the --daemon-logs option
  • sched_priv/: a copy of the 'sched_priv' directory inside PBS_HOME with all the files; core files are not captured here (see core_file_bt/)
  • sched_logs/: contains scheduler logs from the PBS_HOME/sched_logs directory for the number of days specified by the --daemon-logs option
  • scheduler/
    • qmgr_lsched.out: output of "qmgr list sched"
    • qmgr_psched.out: output of "qmgr print sched"
  • hook/
    • qmgr_ph_default.out: output of "qmgr print hook @default"
    • qmgr_lpbshook.out: output of "qmgr list pbshook"

  • datastore/
    • pg_log/: a copy of the "PBS_HOME/datastore/pg_log" directory for the number of days specified by the --daemon-logs option

  • reservation/
    • pbs_rstat_f.out: output of "pbs_rstat -f"
    • pbs_rstat.out: output of "pbs_rstat"
  • resource/
    • qmgr_pr.out: output of "qmgr print resource"
    • rscs_all (derived from the resourcedef file): Will list out built-in as well as custom resources in the following format:

          Name: <resource name>
               type = <resource type attribute>
               flag = <resource flag attribute>

          Name: <resource name>
               type = <resource type attribute>
               flag = <resource flag attribute>

          ...
          ...

  • core_file_bt/ (stack backtrace from core files)
    • sched_priv/: files containing the output of "thread apply all backtrace full" on all core files captured from PBS_HOME/sched_priv
    • server_priv/: files containing the output of "thread apply all backtrace full" on all core files captured from PBS_HOME/server_priv
    • mom_priv/: files containing the output of "thread apply all backtrace full" on all core files captured from PBS_HOME/mom_priv
    • misc/: files containing the output of "thread apply all backtrace full" on any other core files found inside PBS_HOME
  • system/
    • pbs_probe_v.out: output of "pbs_probe -v"
    • pbs_hostn_v.out: output of "pbs_hostn -v $(hostname)"
    • pbs_environment: copy of PBS_HOME/pbs_environment file
    • os_info: Information about the OS
    • process_info: List of processes running on the system when the snapshot was taken (output of "ps -aux | grep [p]bs" on linux systems and "tasklist /v" on windows systems)
    • ps_leaf.out: output of "ps -leaf", only on linux systems
    • lsof_pbs.out: output of "lsof | grep [p]bs", only on linux systems
    • etc_hosts: Copy of "/etc/hosts" file, only on linux systems.
    • etc_nsswitch_conf: Copy of "/etc/nsswitch.conf" file, only on linux systems.
    • vmstat.out: Output of the command 'vmstat', only on linux systems.
    • df_h.out: Output of the command 'df -h', only on linux systems.
    • dmesg.out: Output of the 'dmesg' command, only on linux systems.
  • pbs.conf: a copy of the pbs.conf file for the PBS system
  • ctime: this will log the time (since epoch) when the snapshot was taken.
  • pbs_snapshot.log: captures the logs generated by pbs_snapshot.
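
As an illustration of how the layout above could be consumed, the following minimal sketch lists the files in a snapshot tarball and reads the 'ctime' file. It assumes that 'ctime' holds a single number (seconds since epoch) stored as text, and it matches the file by its base name because the exact root path inside the archive is not specified here. The function name summarize_snapshot is hypothetical and is not part of pbs_snapshot or PTL.

```python
import posixpath
import tarfile


def summarize_snapshot(tar_path):
    """Return (sorted list of regular files, ctime value or None)."""
    files = []
    ctime_value = None
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, etc.
            files.append(member.name)
            # Assumption: 'ctime' contains one number (seconds since epoch).
            if posixpath.basename(member.name) == "ctime":
                raw = tar.extractfile(member).read().decode().strip()
                ctime_value = float(raw)
    return sorted(files), ctime_value
```

A caller could pass the path of a generated "snapshot_<timestamp>.tgz" and compare the returned file list against the structure documented above.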

Interface Documentation:

The interface for pbs_snapshot will be as follows (output of pbs_snapshot --help):

Usage: pbs_snapshot -o <path to existing output directory> [OPTION]

    Take snapshot of a PBS system and optionally capture logs for diagnostics

    -H <hostname>                     primary hostname to operate on
                                      Defaults to local host
    -l <loglevel>                     set log level to one of INFO, INFOCLI,
                                      INFOCLI2, DEBUG, DEBUG2, WARNING, ERROR
                                      or FATAL
    -h, --help                        display this usage message
    --daemon-logs=<num days>          number of daemon logs to collect
    --accounting-logs=<num days>      number of accounting logs to collect
    --additional-hosts=<hostname>     collect data from additional hosts
                                      'hostname' is a comma separated list
    --map=<file>                      file to store the map of obfuscated data
    --obfuscate                       obfuscates sensitive data
    --with-sudo                       Uses sudo to capture privileged data
    --version                         print version number and exit



Caveat: Currently pbs_snapshot needs to be run as root because it must access protected PBS information (e.g., information inside the '_priv' directories of PBS_HOME). It can be run either with sudo or as the root user. If it is run with restricted privileges, it will not be able to query all of the data.


Interface: Option -o <path to target directory>

  • Synopsis: Path to the directory where the snapshot tarball will be generated
  • Details:
    • This is a mandatory option to pbs_snapshot.
    • The target directory must exist.
    • As an example, if -o is passed "/temp", then path to the generated snapshot would be "/temp/snapshot_<timestamp>.tgz"

Interface: Option -H <hostname>

...

  • Synopsis: Option to set the desired log level for debugging pbs_snapshot
  • Details:
    • The <loglevel> can be set to INFO, INFOCLI, INFOCLI2, DEBUG, DEBUG2, WARNING, ERROR or FATAL.
    • The logging becomes more comprehensive going from FATAL to INFO.
    • By default, the log level will be set to INFOCLI2.
    • The generated logs will also be written out in the file 'pbs_snapshot.log' inside the snapshot directory.

Interface: Option --daemon-logs=<num days>

  • Synopsis: Option to instruct pbs_snapshot to capture daemon logs for the given number of days going back from the current day
  • Details:
    • This will capture all the daemons' logs available on the host that's running PBS Server.
    • If this option is not specified, a default of 5 days of logs will be collected.
    • The value of <num days> should be >= 0.
      • If the value is 0, no logs are captured.
      • If the value is 1, only the logs for the current day are captured.
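
The day-count semantics above can be sketched as follows. This is only an illustration, not the tool's implementation: it assumes log files are named by date in YYYYMMDD form, and the function name log_days_to_capture is hypothetical.

```python
from datetime import date, timedelta


def log_days_to_capture(num_days, today=None):
    """Return the log dates (newest first) covered by <num days>.

    0 -> no logs, 1 -> only the current day, N -> current day plus N-1 earlier days.
    """
    if num_days < 0:
        raise ValueError("num_days must be >= 0")
    today = today or date.today()
    # Assumption: one log file per day, named YYYYMMDD.
    return [(today - timedelta(days=offset)).strftime("%Y%m%d")
            for offset in range(num_days)]
```

For example, with the default of 5 days the list would contain the current day and the four days before it.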

...

  • Synopsis: Option to capture information from hosts other than the one where PBS Server is running
  • Details:
    • This option only works if daemon logs are being captured (i.e., if --daemon-logs=0, this option has no effect).
    • This option will cause pbs_snapshot to capture the following information:
      • mom and comm logs from the hosts specified, for the number of days of logs being captured.
      • mom_priv from the hosts specified, if present.
      • system information from the hosts specified.
    • The 'hostname' argument can be either a single hostname or a comma separated list of hostnames to capture the information from.
    • Warning: This can significantly increase the size of the snapshot and cause pbs_snapshot to spend a long time copying potentially large amounts of data over the network.

...

  • Synopsis: Option to generate a map file by the name specified for obfuscated data
  • Details:
    • This option will cause pbs_snapshot to create a map file by the name specified with "key:value" pair mapping of the data that's obfuscated.
    • This option will work only with the --obfuscate option.
    • If this option is not specified, a file called "obfuscate.map" will be created by default at the location specified by the -o option.
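
A map file of "key:value" pairs could be read back with a few lines of Python. The sketch below assumes one pair per line and splits on the first colon only; the exact on-disk format is not specified beyond "key:value", so both points are assumptions, and parse_map_text is a hypothetical helper name.

```python
def parse_map_text(text):
    """Parse 'key:value' lines into a dict, ignoring blank or malformed lines."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Assumption: the first colon separates key from value.
        key, value = line.split(":", 1)
        mapping[key.strip()] = value.strip()
    return mapping
```

Such a helper would let the owner of the map file translate obfuscated values in a shared snapshot back to the originals.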

Interface: Option --obfuscate

  • Synopsis: Option to obfuscate/anonymize the PBS data captured
  • Details:
    • This option will instruct pbs_snapshot to obfuscate euser, egroup, project, Account_Name, operators, managers, group_list, Mail_Users, User_List, server_host, acl_groups, acl_users, acl_resv_groups, acl_resv_users, sched_host, acl_resv_hosts, acl_hosts, Job_Owner, exec_host, Host, Mom, resources_available.host and resources_available.vnode.
    • It will also delete Variable_List, Error_Path, Output_Path, mail_from, Job_Name, jobdir, Submit_arguments and Shell_Path_List.
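
The obfuscate/delete behaviour described above can be pictured with the following minimal sketch. It is not the PBSSnapUtils implementation: the attribute sets are a small subset chosen for illustration, values are treated as opaque strings (real attribute values such as host lists would need finer handling), and the helper name obfuscate_attributes is hypothetical.

```python
import secrets

# Illustrative subsets of the attributes listed above.
OBFUSCATE_ATTRS = {"euser", "egroup", "project", "Account_Name", "Job_Owner"}
DELETE_ATTRS = {"Variable_List", "Error_Path", "Output_Path", "Job_Name"}


def obfuscate_attributes(attrs, mapping):
    """Return a copy of attrs with sensitive values replaced or removed.

    'mapping' records original -> obfuscated values so that the same
    original value is always replaced by the same token.
    """
    result = {}
    for name, value in attrs.items():
        if name in DELETE_ATTRS:
            continue  # deleted attributes are dropped entirely
        if name in OBFUSCATE_ATTRS:
            if value not in mapping:
                mapping[value] = secrets.token_hex(4)
            value = mapping[value]
        result[name] = value
    return result
```

Keeping one shared mapping across all captured objects is what makes the obfuscated snapshot internally consistent, and that mapping is the kind of data the --map option would store.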

Interface: Option --version

...

Sample Usage:

  • pbs_snapshot -o /tmp: Will capture a snapshot inside /tmp/snapshot_<timestamp>.tgz along with 30 days of accounting logs and 5 days of daemon logs from the machine that's running PBS Server
  • pbs_snapshot --daemon-logs=1 --accounting-logs=1 -o /tmp --obfuscate --map=mapfile.txt: Will capture a snapshot inside /tmp/snapshot_<timestamp>.tgz along with 1 day of accounting and daemon logs, will obfuscate the data and store the data mapping in the map file named 'mapfile.txt'.
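
For callers that want to drive the tool from a script, the command lines above could be assembled as in the sketch below. It only uses options shown in the --help output in this document; the helper names build_snapshot_command and run_snapshot are hypothetical, and the sketch assumes pbs_snapshot is on PATH and run with sufficient privileges.

```python
import subprocess


def build_snapshot_command(target_dir, daemon_days=None, accounting_days=None,
                           obfuscate=False, map_file=None):
    """Build the pbs_snapshot argument list for the given options."""
    cmd = ["pbs_snapshot", "-o", target_dir]
    if daemon_days is not None:
        cmd.append("--daemon-logs=%d" % daemon_days)
    if accounting_days is not None:
        cmd.append("--accounting-logs=%d" % accounting_days)
    if obfuscate:
        cmd.append("--obfuscate")
        # --map is only meaningful together with --obfuscate.
        if map_file:
            cmd.append("--map=%s" % map_file)
    return cmd


def run_snapshot(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(cmd, check=True)
```

Calling build_snapshot_command("/tmp", 1, 1, True, "mapfile.txt") produces the argument list for the second sample above.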

...