Wrapper support: job launch fails when submitting large jobs

Description

While testing, I discovered that the 'new' method for wrapping Hydra MPIs that Tom was proposing doesn't really work on jobs with very large node counts. For these, Hydra uses a tree-based process spawning method, so launches are issued from nodes other than the mother superior. pbs_tmrsh doesn't work on any node except the mother superior, because it relies on the presence of PBS_NODEFILE for sanity checking, and that file only lives on the mother superior.

My pbs_remsh (or plain ssh plus pbs_attach) style wrappers don't have that problem, but they have other drawbacks: in general, it's better if processes are spawned as tasks, and thus as children of MOM.

I think we need to at least think about it. Sure, you can disable tree-based process launching, but that makes MPI job launching s-l-o-w as heck. So we need to find a way to support tree-based launching with something that uses the TM API but doesn't crash if PBS_NODEFILE is nowhere to be found.
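To make the failure mode concrete, here is a rough sketch of a launcher wrapper that checks for PBS_NODEFILE before picking a remote-start command. This is not the fix asked for above, because it falls back to the ssh plus pbs_attach style with the drawbacks already mentioned. The function name `build_launch_command` is hypothetical, and two things are assumptions to verify against the installed PBS version: the `pbs_attach -j <jobid> <cmd>` form, and PBS_JOBID being set in the environment on sister nodes.

```python
import os


def build_launch_command(host, command, env=None):
    """Return an argv list for starting `command` on `host`.

    Hypothetical helper: it only builds the command line and runs nothing.
    """
    env = os.environ if env is None else env

    nodefile = env.get("PBS_NODEFILE")
    if nodefile and os.path.isfile(nodefile):
        # Node file is present (the mother superior case), so pbs_tmrsh
        # can do its sanity check and spawn the process through TM.
        return ["pbs_tmrsh", host] + list(command)

    # No usable node file (a sister node in a tree-based launch).
    # Fall back to ssh plus pbs_attach; the process is then attached to
    # the job rather than spawned as a TM task.
    job_id = env.get("PBS_JOBID")  # assumption: exported on this node
    if not job_id:
        raise RuntimeError("neither PBS_NODEFILE nor PBS_JOBID is available")
    return ["ssh", host, "pbs_attach", "-j", job_id] + list(command)
```

A wrapper along these lines would at least fail with a clear message instead of crashing when the node file is missing, but a real solution still needs a TM-based path that works on sister nodes.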

Acceptance Criteria

None

Status

Assignee

Unassigned

Reporter

Former user

Severity

None

OS

None

Start Date

None

Pull Request URL

None

Components

Priority

Critical