It looks like thousands of mom hook updates are being sent and are not getting acked within the 2-minute timeout (which is not currently configurable). With log_events=4095, the server log showed:
11/02/2017 15:13:27;0400;Server@hostname;Svr;next_sync_mom_hookfiles;Timing out previous send of mom hook updates (send replies expected=12093 received=4031)
11/02/2017 15:19:24;0400;Server@hostname;Svr;next_sync_mom_hookfiles;Timing out previous send of mom hook updates (send replies expected=12093 received=0)
Some were acknowledged later on (I saw 4031 entries like these):
11/02/2017 15:20:57;0400;Server@hostname;Svr;post_sendhookRPP;sendhookRPP reply (tid=403347) not from current batch of hook updates (tid=403348)
11/02/2017 15:20:57;0400;Server@hostname;Svr;post_sendhookRPP;sendhookRPP reply (tid=403347) successfully sent hook file /path/to/hookname.HK to nodename.fqdn:15002
The second message indicates that the hook file copy did succeed; it was just delayed.
I noticed that around this time the server was very busy servicing other requests, so there may have been a delay in handling these hook updates.
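To gauge how many acks were outstanding at each timeout, the "Timing out previous send" lines can be summarized with a small script. This is only a sketch that assumes the message format shown in the log excerpts above; summarize_timeouts is a hypothetical helper, not part of PBS.

```python
import re

# Matches the server log message written when a batch of mom hook updates
# times out. The pattern is taken from the log excerpts in this report.
TIMEOUT_RE = re.compile(
    r"Timing out previous send of mom hook updates "
    r"\(send replies expected=(\d+) received=(\d+)\)"
)


def summarize_timeouts(lines):
    """Return (timestamp, expected, received, missing) for each timeout line."""
    results = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue  # not a timeout message
        timestamp = line.split(";", 1)[0]  # first ';'-separated field
        expected = int(match.group(1))
        received = int(match.group(2))
        results.append((timestamp, expected, received, expected - received))
    return results


sample = [
    "11/02/2017 15:13:27;0400;Server@hostname;Svr;next_sync_mom_hookfiles;"
    "Timing out previous send of mom hook updates "
    "(send replies expected=12093 received=4031)",
    "11/02/2017 15:19:24;0400;Server@hostname;Svr;next_sync_mom_hookfiles;"
    "Timing out previous send of mom hook updates "
    "(send replies expected=12093 received=0)",
]

for row in summarize_timeouts(sample):
    print(row)
```

For the two lines quoted above this reports 8062 and 12093 missing replies respectively.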
New server attribute 'sync_hook_momfiles_timeout' to specify the number of seconds to wait for the current set of hook updates to complete before starting a new run.
New server attribute 'runjob_on_node_pending_hook_update' (or a better name, please) that can be set to 'true' to mean go ahead and run the job on a node even if that node has a pending hook update (the default action). Setting it to 'false' would not allow the job to run there. This addresses the case reported by:
11/02/2017 13:57:12;0080;Server@hostname;Node;2114000.hostname;vnode nodename's parent mom nodename.fqdn:15002 has a pending copy hook or delete hook request
Some sites would rather have jobs not run at all than run without their mom-side hooks, so we need a way to dictate that jobs should not run on nodes that the server knows do not have the most recent mom hook updates. Changing the default from the current behavior (run the job and warn with the above message) to not running the job and logging an error, with this new theoretical parameter available to restore the current behavior, deserves discussion on the forum.
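The intended semantics of the proposed attribute could look roughly like the following. This is an illustrative sketch only, not the server implementation: may_run_job and its return messages are hypothetical, and the attribute itself does not exist yet.

```python
PENDING_MSG = "parent mom has a pending copy hook or delete hook request"


def may_run_job(node_has_pending_hook_update,
                runjob_on_node_pending_hook_update=True):
    """Decide whether a job may run on a node (hypothetical helper).

    Returns (allowed, log_message). The second argument models the
    proposed server attribute; True mirrors the current behavior.
    """
    if not node_has_pending_hook_update:
        # Node has the latest mom hooks: nothing to report.
        return True, None
    if runjob_on_node_pending_hook_update:
        # Current behavior: run the job anyway, but warn.
        return True, "warning: " + PENDING_MSG
    # Proposed stricter behavior: refuse to run and log an error.
    return False, "error: job not run, " + PENDING_MSG


print(may_run_job(True))          # current default: runs with a warning
print(may_run_job(True, False))   # strict setting: job is held back
```

Whether the attribute should default to 'true' or 'false' is the open question for the forum; the sketch only shows that both behaviors can sit behind one switch.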