Done issues

blocking write can cause pbs_mom to hang when interactive job is exiting
PP-1000
PTL doesn't revert the default mom's configuration file in setup
PP-1225
PTL test case fails before setUp function with AttributeError
PP-1335
Add the dependency for minimum db version require in v19.1 spec file
PP-1329
fix dirname: missing operand error printed from PTL related init scripts
PP-1326
Upgrade fails from 18.1.3 to 19.1.1 with postgresql error on Open SUSE
PP-1325
PBS Installation fails on Open SUSE Leap 15.0 with libical dependency
PP-1324
Insufficient privileges causes the test test_only_explicit_psets in pbs_only_explicit_psets.py fail
PP-1172
Few test cases of TestResvEndHook are failing due to logmatch error in Server logs
PP-1286
Add a M4 macro to enable online data compression in TPP
PP-1258
As an admin, I would like to have unique job ids up to 1 trillion job ids
PP-289
As an admin, I would like to pad the job request, so that if some nodes fail the job will still have enough resources to run
PP-928
The test test_sgi_eoe_job in pbs_power_provisioning_sgi.py fails with “Unknown node” error due to FQDN/shortname mismatch when deleting a node.
PP-1165
TestPartition is failing due to unknown node error
PP-1091
increase interval for log_match in TestAdminSuspend.test_hook
PP-1173
Tests in pbs_acl_host_moms.py fail when the client name (input with -p client=<client>) is not a FQDN
PP-1141
Function "add_to_resource_group" doesn't validate the file as part of creating it
PP-1296
pbs daemons start failed with error : Error initializing the PBS dataservice
PP-1284
test_hold_time_not_counted_in_walltime of "TestMomWalltime" is failing intermittently while checking the walltime of a job
PP-1308
Test "test_sister_mom_crash" of TestSisterMom Fails as it is not able to find pbsdsh path while submitting the job
PP-1307
test_hook of TestHookSwig failing due to very low attempts specified in "max_attempt" parameter
PP-1275
PBS does not update job comment while resuming a suspended job
PP-794
Update copyright header on several files
PP-1301
A race in suspend/resume at end of job "succeed" in suspending but leave job in limbo
PP-1305
Cray - Ramp rate limiting
PP-824
do not purge moved job from history before the job is finished
PP-1287
Stranded array subjobs after communication hiccup
PP-1026
Improvements in PBS Pro MSI installer
PP-962
Enhance test pbs_basil_support.py to handle error in retklist() and teardown()
PP-1291
Secondary server stuck in "Starting" state even after Primary server is killed, with an error in tpp_em_wait() [WINDOWS]
PP-1024
scheduler's attribute allows to control unset resources in placement sets
PP-732
Remove obsolete LMX based licensing code
PP-346
On Windows, pbsdsh fails with "pbsdsh: tm_init failed, rc = not connected (17002) (17002)"
PP-1292
Upgrading from an older version to the latest mainline version fails to start the PBS daemons
PP-756
Engineering work for Reservation end hook event
PP-913
Move pbs_snapshot from unsupported/ to sbin/
PP-935
pbs_probe buffer overflow
PP-1121
Start components of PBS in docker images using environment variables
PP-1016
pbs_server is not getting started on windows
PP-1248
Get rid of the pbs_ifl_wrap.c workaround by building swig under AIX
PP-10
Deleted subjobs get requeued after server restart or failover
PP-1259
Microsecond Logging
PP-261
init script overrides START environment variables with values in pbs.conf file
PP-1043
Add deepcopy support in pbs types in hook
PP-1266
node limits are not working correctly with job equivalence classes feature
PP-903
Creating Reservations Uses Memory after Free
PP-1119
Arrayjob 'E' accounting record has the 'start' value set to 0
PP-1081
Submitting reservations are invalid if only specifying endtime and duration
PP-1079
PBS Comm fails to open log file under valgrind
PP-1224
pbs_mom dumped core in tpp_em_destroy
PP-1272

blocking write can cause pbs_mom to hang when interactive job is exiting

Description

reports:

When users create interactive jobs that send lots of spam to stdout/stderr, they can hang MoM when MoM tries to kill the job.

That is because the pipeline to the actual terminal (attached to the qsub that created the interactive job) is very slow, offering lots of opportunities for the kernel to keep buffers around that take ages to drain. That is compounded by the fact that there is no line discipline (you want an interactive user to see what he types even one character at a time) and that the protocol is very inefficient for lots of data.

Unfortunately, in the MoM code, if we decide to kill such a job, we end up in a quasi-deadlock: before we even try to kill the job we use message_job to insert a message to tell the user why the job is being killed.

message_job then does a BLOCKING write to the job's stdout, which is usually a spool file but here is a pseudo-tty that is being read by the pbs_mom child to channel the data to the qsub process on the submission host. On most OSes that write is going to block until everything that was buffered by another process that issues tty_write() calls has been read by the pbs_mom interactive job shepherd, and in one case we have seen it manifest itself as a very long quasi-hang.

The nasty part is that you cannot even restart MoM with -INT (it's still hung here and will only process the signal "after" it gets out of here) and if you kill it with -KILL and restart MoM with -p, it soon will try to kill the job again and hang once more. You really have to clean up job processes manually to unstick MoM, in a way that cannot be easily automated (since a remote qsub process is also involved and there would be more than one pbs_mom child for interactive jobs).

There are four ways to fix this:
- delay calling message_job until right before we stage out the files (but then we need a place to actually keep those messages for each job),
- use a message_job routine that writes messages to be appended in a local file and then merge these with the "real" stdout only after all processes are killed,
- rewrite message_job to use O_NONBLOCK when opening the file and try to write the message in a loop, keeping an eye on the time in the loop and exiting if e.g. 5 seconds have passed (if so, we should simply write something in the MoM log saying that we could not write the message XYZ to the job's stderr),
- or let message_job attempt to write the message in a child process.
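
The O_NONBLOCK option in the list above can be sketched roughly as follows. This is only an illustration, not the actual message_job code: the function name write_msg_with_deadline, its parameters, and its return convention are hypothetical, and it spins busy until the deadline instead of using poll().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the real message_job): append msg to path
 * without ever blocking for longer than about timeout_sec seconds.
 * Returns 0 if the whole message was written, -1 otherwise.
 */
int
write_msg_with_deadline(const char *path, const char *msg, int timeout_sec)
{
    size_t left;
    time_t deadline;
    int fd;

    assert(path != NULL && msg != NULL); /* sketch-level precondition */

    left = strlen(msg);
    deadline = time(NULL) + timeout_sec;

    /* O_NONBLOCK makes write() fail with EAGAIN instead of blocking */
    fd = open(path, O_WRONLY | O_APPEND | O_NONBLOCK);
    if (fd < 0)
        return -1;

    while (left > 0) {
        ssize_t n = write(fd, msg, left);

        if (n > 0) {
            /* partial or full write: advance past what was accepted */
            msg += n;
            left -= (size_t) n;
        } else if (n < 0 && errno != EAGAIN &&
                   errno != EWOULDBLOCK && errno != EINTR) {
            break; /* hard error, retrying will not help */
        }
        /* give up once the time budget is used up */
        if (left > 0 && time(NULL) >= deadline)
            break;
    }

    close(fd);
    return left == 0 ? 0 : -1;
}
```

On a -1 return the caller would note in the MoM log that the message could not be written to the job's stderr, and then carry on with killing the job.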

The last two are the least risky solutions, since they don't require any rearchitecting: we are already opening the file in the routine itself, so it's trivial to add O_NONBLOCK, and we don't even have to use poll() if we decide to try writing for N seconds, since it's fine for us to spin busy for five seconds trying to write all the time. Alternatively, we can spawn a child in a fire-and-forget manner within the routine (if we kill the job, then eventually whatever is blocking the child will go away).
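
The fire-and-forget child alternative might look roughly like the sketch below. The names write_msg_in_child and msg_child_done are hypothetical and not part of PBS. The sketch assumes the parent reaps the child later without blocking; how pbs_mom actually tracks and reaps its children is not covered here.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: do the (possibly blocking) write in a child so
 * that the parent never stalls. Returns the child's pid in the parent,
 * or -1 if fork() failed.
 */
pid_t
write_msg_in_child(const char *path, const char *msg)
{
    pid_t pid;
    size_t left;
    int fd;

    assert(path != NULL && msg != NULL); /* sketch-level precondition */

    pid = fork();
    if (pid != 0)
        return pid; /* parent returns immediately */

    /* child: a blocking write here stalls only this process */
    left = strlen(msg);
    fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0)
        _exit(1);
    while (left > 0) {
        ssize_t n = write(fd, msg, left);

        if (n <= 0)
            _exit(1);
        msg += n;
        left -= (size_t) n;
    }
    close(fd);
    _exit(0);
}

/*
 * Hypothetical helper: non-blocking check on the child.
 * Returns 1 if it has exited, 0 if still running, -1 on error.
 */
int
msg_child_done(pid_t pid)
{
    int status;
    pid_t r = waitpid(pid, &status, WNOHANG);

    if (r == pid)
        return 1;
    return r == 0 ? 0 : -1;
}
```

The parent can call msg_child_done() from whatever periodic work it already does; if the child is stuck on the pseudo-tty, it stays stuck only until the job's processes are killed and the blocked write returns.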

Acceptance Criteria

None

Status

Assignee

Prakash Varandani

Reporter

Scott Campbell

Severity

None

OS

None

Start Date

None

Pull Request URL

Story Points

1

Components

Fix versions

Affects versions

14.1.0

Priority

Low