additional control over mom hook update behavior is needed

Description

It looks like thousands of mom hook updates are being sent and are not acknowledged within the 2-minute timeout (which is not currently configurable). With log_events=4095, the server log showed:

11/02/2017 15:13:27;0400;Server@hostname;Svr;next_sync_mom_hookfiles;Timing out previous send of mom hook updates (send replies expected=12093 received=4031)
11/02/2017 15:19:24;0400;Server@hostname;Svr;next_sync_mom_hookfiles;Timing out previous send of mom hook updates (send replies expected=12093 received=0)

Some were acknowledged later on (saw 4031 entries of these):

11/02/2017 15:20:57;0400;Server@hostname;Svr;post_sendhookRPP;sendhookRPP reply (tid=403347) not from current batch of hook updates (tid=403348)
11/02/2017 15:20:57;0400;Server@hostname;Svr;post_sendhookRPP;sendhookRPP reply (tid=403347) successfully sent hook file /path/to/hookname.HK to nodename.fqdn:15002

The second message indicates that the hook file copy did succeed; it was just delayed.
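To quantify how many replies were still outstanding at each timeout, the "Timing out previous send" lines can be parsed from the server log. The following is only a rough sketch: the regular expression is inferred from the two sample lines quoted above, and the helper name `unacked_counts` is made up for illustration.

```python
import re

# Pattern inferred from the sample log lines above; adjust if the real
# message text differs.
TIMEOUT_RE = re.compile(
    r"Timing out previous send of mom hook updates "
    r"\(send replies expected=(\d+) received=(\d+)\)"
)


def unacked_counts(log_lines):
    """Return (timestamp, expected, received, missing) per timed-out batch."""
    results = []
    for line in log_lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue
        # The timestamp is the first ';'-separated field of a server log line.
        timestamp = line.split(";", 1)[0]
        expected = int(match.group(1))
        received = int(match.group(2))
        results.append((timestamp, expected, received, expected - received))
    return results
```

Feeding it the server log lines for the affected window gives one tuple per timed-out batch, which makes it easier to see whether the backlog is growing or shrinking over time.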

Around this time the server was very busy servicing other requests, which may have delayed the handling of these hook updates.

Suggested fix:

  • New server attribute 'sync_hook_momfiles_timeout' to specify the number of seconds to wait for the current set of hook updates to complete before starting a new run.

  • New server attribute 'runjob_on_node_pending_hook_update' (or a better name please). Setting it to 'true' means the job runs on that node even if a hook update is pending (the current default action); setting it to 'false' prevents the job from running there. This covers the case logged as:

11/02/2017 13:57:12;0080;Server@hostname;Node;2114000.hostname;vnode nodename's parent mom nodename.fqdn:15002 has a pending copy hook or delete hook request

Some sites would rather have jobs not run at all than run without their mom-side hooks, so we need a way to dictate that jobs must not run on nodes that the server knows do not have the most recent mom hook updates. Whether to change the default from the current behavior (run the job and log the warning above) to not running the job and logging an error, with the new parameter available to restore the current behavior, deserves discussion on the forum.
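The intended semantics of the two proposed attributes can be sketched as follows. This is an illustration of the proposal only, not server code: the attribute names are the ones suggested in this ticket and do not exist yet, the 120-second default simply mirrors the current fixed 2-minute wait, and the helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HookSyncPolicy:
    # Proposed attribute names from this ticket; neither exists today.
    # 120 seconds mirrors the current, non-configurable 2-minute wait.
    sync_hook_momfiles_timeout: int = 120
    # True mirrors the current behavior: run the job and log a warning.
    runjob_on_node_pending_hook_update: bool = True


def batch_timed_out(policy, batch_started_at, now):
    """True once the current batch of hook updates has waited long enough."""
    return now - batch_started_at >= policy.sync_hook_momfiles_timeout


def may_run_job(policy, node_has_pending_hook_update):
    """Decide whether a job may start on a node (hypothetical helper)."""
    if not node_has_pending_hook_update:
        return True
    return policy.runjob_on_node_pending_hook_update
```

With the defaults, a node with a pending hook update still accepts jobs, as it does today; a site that sets `runjob_on_node_pending_hook_update` to false would have `may_run_job` refuse such nodes until the update is acknowledged.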

Acceptance Criteria

None

Status

Assignee

Unassigned

Reporter

Scott Campbell

Severity

None

OS

None

Start Date

None

Pull Request URL

None

Story Points

1

Components

Priority

Low