blocking write can cause pbs_mom to hang when interactive job is exiting

Description

reports:

When users create interactive jobs that send lots of spam to stdout/stderr, they can hang MoM when MoM tries to kill the job.

That is because the pipeline to the actual terminal (attached to the qsub that created the interactive job) is very slow, offering lots of opportunities for the kernel to keep buffers around that take ages to drain. That is compounded by the fact that there is no line discipline (you want an interactive user to see what he types, even one character at a time) and the protocol is very inefficient for lots of data.

Unfortunately, in the MoM code, if we decide to kill such a job, we end up in a quasi-deadlock: before we even try to kill the job, we use message_job to insert a message telling the user why the job is being killed.

message_job then does a BLOCKING write to the job's stdout -- usually a spool file, but here a pseudo-tty that is being read by the pbs_mom child that channels the data to the qsub process on the submission host. On most OSes that write will block until everything buffered by the other process's tty_write() calls has been read by the pbs_mom interactive-job shepherd, and in one case we have seen it manifest as a very long quasi-hang.
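
For illustration, a minimal sketch of the problematic pattern (hypothetical names, not the actual PBS source; the real message_job resolves the job's stdout/stderr destination itself):

    /*
     * Sketch of the failure mode: a plain blocking write() to the job's
     * stdout path.  When that path is a pseudo-tty whose reader (the
     * pbs_mom interactive-job shepherd) is stalled behind a slow qsub
     * connection, the write() blocks until the kernel buffers drain.
     */
    #include <fcntl.h>
    #include <unistd.h>

    static int message_job_blocking(const char *stdout_path,
                                    const char *msg, size_t len)
    {
        int fd = open(stdout_path, O_WRONLY | O_APPEND);
        if (fd < 0)
            return -1;

        /* Blocks indefinitely if the pty buffer is full and nobody reads. */
        ssize_t n = write(fd, msg, len);

        close(fd);
        return (n == (ssize_t)len) ? 0 : -1;
    }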

The nasty part is that you cannot even restart MoM with -INT (it is still hung here and will only process the signal after it gets out of the blocking write), and if you kill it with -KILL and restart MoM with -p, it will soon try to kill the job again and hang once more. You really have to clean up the job processes manually to unstick MoM, in a way that cannot easily be automated (since a remote qsub process is also involved and there would be more than one pbs_mom child for interactive jobs).

There are four ways to fix this:
-delay calling message_job until right before we stage out the files (but then we need a place to actually keep those messages for each job),
-use a message_job routine that appends messages to a local file and merge these with the "real" stdout only after all processes are killed,
-rewrite message_job to use O_NONBLOCK when opening the file and try to write the message in a loop, keeping an eye on the time and giving up after e.g. 5 seconds (in that case, we should simply note in the MoM log that we could not write message XYZ to the job's stderr),
-let message_job attempt to write the message in a child process.

The last two are the least risky solutions, since they require no rearchitecting. We already open the file in the routine itself, so adding O_NONBLOCK is trivial, and we do not even need poll() if we decide to try writing for N seconds: it is fine to busy-spin for five seconds attempting the write. Alternatively, the routine can spawn a child in a fire-and-forget manner; once we kill the job, whatever is blocking the child will eventually go away. A sketch of the non-blocking variant follows.
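
A hedged sketch of the third option, assuming a 5-second budget; the deadline constant and function name are placeholders, not existing MoM helpers:

    /*
     * Open with O_NONBLOCK and retry the write for at most ~5 seconds,
     * then give up so the caller can log the failure and proceed with
     * killing the job.
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <time.h>
    #include <unistd.h>

    #define MSG_WRITE_TIMEOUT 5 /* seconds to keep retrying (assumed budget) */

    static int message_job_nonblocking(const char *stdout_path,
                                       const char *msg, size_t len)
    {
        int fd = open(stdout_path, O_WRONLY | O_APPEND | O_NONBLOCK);
        if (fd < 0)
            return -1;

        time_t deadline = time(NULL) + MSG_WRITE_TIMEOUT;
        size_t off = 0;

        while (off < len && time(NULL) < deadline) {
            ssize_t n = write(fd, msg + off, len - off);
            if (n > 0) {
                off += (size_t)n;
            } else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK) {
                break;  /* real error, stop trying */
            }
            /* EAGAIN: pty buffer still full, busy-spin until the deadline */
        }

        close(fd);
        /* On failure, the caller would log e.g. "could not write message
         * to job's stdout/stderr" in the MoM log, as described above. */
        return (off == len) ? 0 : -1;
    }

The fourth option is even simpler in structure: fork() and let the child perform the ordinary blocking write while the parent returns immediately; since we are about to kill the job anyway, whatever is blocking the child will eventually go away and the child exits on its own.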

Acceptance Criteria

None

Status

Assignee

Prakash Varandani

Reporter

Scott Campbell

Severity

None

OS

None

Start Date

None

Pull Request URL

Story Points

1

Components

Fix versions

Affects versions

14.1.0

Priority

Low