Custom resources created in mom hook can sometimes cause server crash when mom sends job obit

Description

Clinton has managed to make the server crash while trying to define and aggregate some resources on MOM nodes in hooks.

The error handling isn't sane in the server. We have routine:

static void
recv_job_obit(stream)
int stream;
{
int njobs;
int rc;
struct resc_used_update *prused;

njobs = disrui(stream, &rc); /* number of jobs in update */
if (rc)
return;

while (njobs--) {

/* IMPORTANT NOTE */
/* allocate resc_used_update here, but leave it to job_obit() */
/* to free it when job_obit() has completed */

prused = (struct resc_used_update *)
malloc(sizeof(struct resc_used_update));
if (prused != 0) {

prused->ru_next = NULL;
prused->ru_pjobid = NULL;

if (decode_stat_update(stream, prused) == 0) {

DBPRT(('recv_job_obit: decoded obit for %s\n',
prused->ru_pjobid))
job_obit(prused, stream);
} else {
mominfo_t *mp;

DBPRT(('recv_job_obit: failed to decode obit for %s\n',
prused->ru_pjobid))
/* had a error, discard rest of message */
if ((mp = tfind2((u_long)stream, 0, &streams)) != NULL) {
log_event(PBSEVENT_SYSTEM, PBS_EVENTCLASS_NODE,
LOG_NOTICE, mp->mi_host, 'error in recv_job_obit');
}
rpp_eom(stream);
FREE_RUU(prused)
}
}

}
}

which allocates prused but doesn't set prused->ru_attr. Since malloc and not calloc is used, it will also not be NULL initialised.

Sadly, there are reasons for decode_stat_update to return BEFORE it has actually done

CLEAR_HEAD(prused->ru_attr);

and in those cases, recv_job_obit will try to call FREE_RUU(prused) to free a hypothetical list of structures behind a junk pointer.

The prudent thing to do is to call CLEAR_HEAD(prused->ru_attr) in the routine that mallocs prused (which makes it an empty linked list with all pointers correctly set in the list header). Then it doesn't depend on the routines it calls to ensure that FREE_RUU can be called.

Note that relying on zero initialisation will not work, since GET_NEXT in free_attrlist will then try to dereference ru_attr.next (which will be NULL) to find its ll_struct pointer.

The stack traces for the crashes were:

in /home/pbsbuild/PBSPro_12_2_404/PBSPro_12.2.404.xxxx/pbs/src/lib/Libattr/attr_func.c
(gdb) bt
#0 free_attrlist (pattrlisthead=<optimized out>)
at /home/pbsbuild/PBSPro_12_2_404/PBSPro_12.2.404.xxxx/pbs/src/lib/Libattr/attr_func.c:281
#1 0x0000000000459979 in recv_job_obit (stream=<optimized out>)
at /home/pbsbuild/PBSPro_12_2_404/PBSPro_12.2.404.xxxx/pbs/src/server/node_manager.c:2016
#2 is_request (stream=1667199589, version=<optimized out>)
at /home/pbsbuild/PBSPro_12_2_404/PBSPro_12.2.404.xxxx/pbs/src/server/node_manager.c:4408
#3 0x0000000000460469 in do_rpp (stream=0)
at /home/pbsbuild/PBSPro_12_2_404/PBSPro_12.2.404.xxxx/pbs/src/server/pbsd_main.c:361
#4 0x0000000000460486 in rpp_request (fd=<optimized out>)
at /home/pbsbuild/PBSPro_12_2_404/PBSPro_12.2.404.xxxx/pbs/src/server/pbsd_main.c:391
#5 0x00000000004ad331 in wait_request (waittime=<optimized out>)
at /home/pbsbuild/PBSPro_12_2_404/PBSPro_12.2.404.xxxx/pbs/src/lib/Libnet/net_server.c:530
#6 0x00000000004623ec in main (argc=<optimized out>, argv=0x7fff34f6e348)
at /home/pbsbuild/PBSPro_12_2_404/PBSPro_12.2.404.xxxx/pbs/src/server/pbsd_main.c:1920

Acceptance Criteria

None

Status

Assignee

Unassigned

Reporter

Former user

Severity

None

OS

None

Start Date

None

Pull Request URL

None

Components

Priority

High
Configure