Redesigning pbs_snapshot --obfuscate

Motivation:

  • pbs_snapshot --obfuscate has several problems: it does not obfuscate everything; it mishandles special attributes (for example, it obfuscates only the first entry in managers and acl attributes); and the obfuscated string has the same length as the original, which makes it easier to recover the original value.
  • pbs_snapshot uses a PTL utility called pbs_anonutils.py, whose code is somewhat complicated to maintain and extend. It was designed with separate routines for obfuscating specific kinds of output (tabular, long format, resourcedef, accounting logs, etc.), not for obfuscating a snapshot as a whole.
  • Using the existing pbs_anonutils architecture to obfuscate new outputs such as JSON would require either writing specialized routines for those outputs or rewriting the existing routines to be more generic, which would essentially mean rewriting pbs_anonutils.


Proposal:

Add a snapshot obfuscation utility inside pbs_snaputils.py itself which obfuscates an entire snapshot in one go.

Architecture:

  • Introduce a new class in pbs_snaputils called 'ObfuscateSnapshot', which will contain the following:
    • Information about attributes that need to be deleted or obfuscated
    • A routine called obfuscate_snapshot(<path to snapshot>, <path to map file>) which can obfuscate a snapshot completely
      • The routine currently deletes any sched, server, comm and database logs that were captured, because we cannot obfuscate them yet. Support for obfuscating them is planned as a future enhancement. The idea is to not capture anything that we cannot obfuscate.
      • Algorithm:
        • The routine first obfuscates the long stat format outputs like qstat -f, pbs_rstat -f, pbsnodes -av, etc.
          • While doing this, it also creates an obfuscation map of (sensitive value: obfuscated value)
          • It also deletes the attributes that must be removed and stores their values in a separate list.
        • It then parses custom resources from the resourcedef file, generates obfuscated values for them and adds them to the obfuscation map.
        • It then calls obfuscate_acct_logs() to obfuscate the accounting logs, which can add more entries to the obfuscation map.
        • It then deletes all daemon logs.
        • Binary job files:
          • We capture the printjob output of all .JB files and save it as <jobid>.JB_printjob. These text files are then obfuscated.
          • We delete all other files inside the jobs directory.
        • Finally, it goes through ALL files in the snapshot and does the following:
          • For each (key, val) pair in the obfuscation map created above, runs re.sub(r'\b' + key + r'\b', val, <file content>) to replace any sensitive values in the file.
          • Goes through the list of attribute values to delete and removes them from the file.
    • A routine called obfuscate_acct_logs(<path to snapshot>) which obfuscates all accounting logs under the given path.
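
The two core steps of the algorithm above (building the obfuscation map from long-format output, then doing one global replacement pass) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual pbs_snaputils code: the function names, the "attr = value" line handling, the comma splitting and the random-token scheme are all hypothetical. It also assumes that keys should be passed through re.escape() before being used in a pattern, and it ignores the line wrapping that real long-format output uses.

```python
import re
import secrets


def new_token(nbytes=8):
    """Random hex token; its length does not depend on the original value."""
    return secrets.token_hex(nbytes)


def obfuscate_long_format(text, obfuscate_attrs, delete_attrs,
                          obf_map, deleted_vals):
    """Hypothetical helper: process 'attr = value' lines of long-format output.

    Attributes in delete_attrs are dropped and their values remembered.
    Attributes in obfuscate_attrs have every comma-separated entry replaced,
    and each (sensitive value: obfuscated value) pair is added to obf_map.
    """
    kept = []
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        attr = name.strip()
        if sep and attr in delete_attrs:
            deleted_vals.append(value.strip())
            continue
        if sep and attr in obfuscate_attrs:
            # Handle every entry, not just the first one in the list.
            parts = [v.strip() for v in value.split(",")]
            new_parts = [obf_map.setdefault(v, new_token()) if v else v
                         for v in parts]
            line = name + sep + ",".join(new_parts)
        kept.append(line)
    return "\n".join(kept)


def apply_map(text, obf_map, deleted_vals):
    """Hypothetical final pass over the content of one file."""
    # Longest keys first, so a longer key wins over a shorter one that
    # matches at the same position.
    keys = sorted((k for k in obf_map if k), key=len, reverse=True)
    if keys:
        # re.escape() is an assumption of this sketch: it keeps characters
        # such as '.' or '+' in a key from being treated as regex syntax.
        pattern = re.compile(
            r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")
        text = pattern.sub(lambda m: obf_map[m.group(0)], text)
    for val in deleted_vals:
        if val:
            text = text.replace(val, "")
    return text
```

Doing the replacement in a single pass with one combined pattern means a value that has already been substituted is never matched again. One limitation of relying on \b is that a key which begins or ends with a non-word character only matches when the neighbouring character is a word character.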