Community discussion is located here:  http://community.pbspro.org/t/pp-702-tests-for-installation-and-upgrades-on-cray-x-series-cle-5-2-systems/508

Overview:

These are the tests to verify the installation or upgrade of PBS on Cray X-series CLE 5.2 systems.


Pre:

    Use a real Cray machine running CLE 5.2. Use the sdb node as the host for the PBS server, scheduler, and comm. Use the login nodes as the hosts for the PBS MoMs.

    If PBS is currently installed, uninstall it using the steps in the UNINSTALLATION section below before running the INSTALLATION and UPGRADE sections.


1. INSTALLATION


  1.1. Install PBS Pro using the default data service account pbsdata.


    1.1.1 Determine the NID of the node that will run the PBS Pro server and scheduler. In this example, the sdb node will be used.

      boot# ssh sdb cat /proc/cray_xt/nid

      5


    1.1.2 If it does not already exist, create the user pbsdata on the server host.

      For example:

      boot# xtopview -n 5 -e "useradd -c '#Altair' -d /home/users/pbsdata -g 14901 -m -u 12796 pbsdata"

      Note: The user ID and group ID above are examples. Use values appropriate to your system, so that the group ID exists and neither the user ID nor the user name conflicts with an existing account.
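
      To check for conflicts before creating the account (a quick sanity check; this assumes getent works inside the xtopview shared root, and reuses the example NID and IDs above):

      boot# xtopview -n 5 -e "getent passwd pbsdata; getent passwd 12796; getent group 14901"

        expect the two passwd lookups to print nothing (user name and user ID are free) and the group lookup to print the group entry (group ID exists).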


    1.1.3 Follow the fresh install instructions in the link below, using pbsdata as the data service user.

      The message below should not appear; it is an error if it comes up during installation:

      "NOTE: /etc/pbs.conf and the PBS_HOME directory must be deleted manually"


      https://pbspro.atlassian.net/wiki/display/PD/PP-702%3A+Installation+and+upgrades+on+Cray+X-series+CLE+5.2+systems?preview=/50658327/50725795/cray_install.txt


  1.2 Optional: After installing the PBS MoMs, add the line below to the PBS_HOME/mom_priv/config file on each MoM node:

      $usecp *:/home /home


    and HUP the MoM after changing PBS_HOME/mom_priv/config:

      login# pkill -HUP pbs_mom
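
      To confirm the MoM is still running after the HUP (a quick sanity check using standard tools):

      login# pgrep -l pbs_mom

        expect one pbs_mom process to be listed.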


  1.3 After installation, follow the steps in the POST UPGRADE section.


  1.4. Submit jobs

    Follow the job submission steps in the JOBS section.


  1.5. Cleanup

    Uninstall PBS using the steps in the UNINSTALLATION section.


  1.6. Repeat steps 1.1.3 to 1.5 above, using the user crayadm as the data service account.

    In step 1.1.3, start PBS Pro for the first time with:


    boot# ssh sdb

    === Welcome to sdb ===

    sdb# PBS_DATA_SERVICE_USER=crayadm /etc/init.d/pbs start
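
    To confirm the data service came up under crayadm, a quick check (this assumes the PBS Pro data service is the bundled PostgreSQL instance, whose processes run as the data service account):

    sdb# ps -ef | grep '[p]ostgres'

      expect the listed processes to be owned by crayadm.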



2. UPGRADE


  2.1. Install PBS Pro 17.x and then overlay upgrade to PBS Pro 17.y (where y > x).

    Use pbsdata as the data service account.


    2.1.1 Install PBS Pro 17.x using the installation procedure in section 1.1.


    2.1.2 Follow the overlay upgrade instructions in the link below in the section "Overlay upgrade procedure when installed PBS Pro version is 17.0 or higher".


      The message below should not appear; it is an error if it comes up during the upgrade:

      "NOTE: /etc/pbs.conf and the PBS_HOME directory must be deleted manually".

      

      https://pbspro.atlassian.net/wiki/display/PD/PP-702%3A+Installation+and+upgrades+on+Cray+X-series+CLE+5.2+systems?preview=/50658327/50725795/cray_install.txt


    2.1.3 After the upgrade, log in to the server and MoM nodes and:


      2.1.3.1 Perform the checks in the POST UPGRADE section.


      2.1.3.2 Check that PBS_HOME in /etc/pbs.conf is still /var/spool/pbs

          sdb# grep PBS_HOME /etc/pbs.conf

          login# grep PBS_HOME /etc/pbs.conf
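
            expected: a line of the form PBS_HOME=/var/spool/pbs on both nodes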


      2.1.3.3 Check that /var/spool/pbs exists and is populated, as root:

          sdb# ls -R /var/spool/pbs

          login# ls -R /var/spool/pbs


      2.1.3.4 Submit jobs as shown in the JOBS section.


      2.1.3.5 Clean up the PBS installation using the steps in the UNINSTALLATION section.



  2.2 Install PBS Pro 13.0.40x and then overlay upgrade to PBS Pro 17.x or higher.

    Use crayadm as the data service account.


    2.2.1 Install PBS Pro 13.0.40x using the installation procedure in the PBS Pro 13.0 Installation Guide.

      Add the lines below to PBS_HOME/mom_priv/config on each MoM node:

        $alps_client /opt/cray/alps/default/bin/apbasil

        $usecp *:/home /home  (optional)


      and HUP the MoM after changing the PBS_HOME/mom_priv/config:

        login# pkill -HUP pbs_mom


    2.2.2 Follow the overlay upgrade instructions in the link below in the section "Overlay upgrade procedure when installed PBS Pro version is 13.0.40x or lower".


      The message below should not appear; it is an error if it comes up during the upgrade:

      "NOTE: /etc/pbs.conf and the PBS_HOME directory must be deleted manually".


      Prior to version 17, PBS Pro used a script named INSTALL to install and configure the RPMs.

      As of version 17, the INSTALL script is no longer supported. The new packaging

      allows the administrator to install and upgrade PBS Pro as they would any other RPM-based package.


      Due to the significant packaging changes, it is recommended that the administrator uninstall the old version of PBS Pro

      prior to installing the new version. Uninstalling the old version after installing the new version will prevent PBS

      Pro from starting automatically at boot.


      https://pbspro.atlassian.net/wiki/display/PD/PP-702%3A+Installation+and+upgrades+on+Cray+X-series+CLE+5.2+systems?preview=/50658327/50725795/cray_install.txt


    2.2.3 After the upgrade, log in to the server and MoM nodes and:


      2.2.3.1 Perform the checks in the POST UPGRADE section below.


      2.2.3.2 Check that PBS_HOME in /etc/pbs.conf is still /var/spool/PBS

          sdb# grep PBS_HOME /etc/pbs.conf

          login# grep PBS_HOME /etc/pbs.conf
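
            expected: a line of the form PBS_HOME=/var/spool/PBS on both nodes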


      2.2.3.3 Check that /var/spool/PBS exists and is populated, as root:

          sdb# ls -R /var/spool/PBS

          login# ls -R /var/spool/PBS


      2.2.3.4 Submit jobs as shown in the JOBS section.


      2.2.3.5 Clean up the PBS installation using the steps in the UNINSTALLATION section.



3. POST UPGRADE

    Log in to the server and MoM nodes and perform the following checks:


    3.1 Check that PBS_EXEC in /etc/pbs.conf is /opt/pbs

       sdb# grep PBS_EXEC /etc/pbs.conf

       login# grep PBS_EXEC /etc/pbs.conf
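
         expected: a line of the form PBS_EXEC=/opt/pbs on both nodes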


    3.2 Check that /opt/pbs exists and is populated, as root:

      sdb# ls -R /opt/pbs

      login# ls -R /opt/pbs


    3.3 PATH includes /opt/pbs/bin

      sdb# echo $PATH | grep pbs

      login# echo $PATH | grep pbs


        expect to find /opt/pbs/bin


    3.4 MANPATH includes /opt/pbs/man

      sdb# echo $MANPATH | grep pbs

      login# echo $MANPATH | grep pbs


        expect to find /opt/pbs/man


    3.5 'module' includes pbs.

      sdb# module list

      login# module list


        expect to find information about the pbs module.


    3.6 PBS init script is intact

      sdb# ls -l /etc/init.d/pbs

      login# ls -l /etc/init.d/pbs


        expect to find the pbs init script


    3.7 PBS is enabled in chkconfig.

      sdb# chkconfig pbs

      login# chkconfig pbs


        expect pbs to be 'on'


    3.8 Log in to the MoM hosts and check that

      PBS_HOME/mom_priv/config contains these lines:

        $vnodedef_additive 0

        $alps_client /opt/cray/alps/default/bin/apbasil

        $usecp *:/home /home  (optional)


    3.9 Check that the nodes are free:

      sdb# pbsnodes -av

        expected: all nodes have "state = free"


    3.10 Check that the server is active and licensed:

      sdb# qstat -Bf

        expected: qstat output shows "server_state = Active" and a "license_count" showing that the server is licensed.


4. JOBS

  Submit jobs as a regular user, such as crayadm, from the login node.


  4.1 Configuration for MoMs, on each node running a MoM:

    Add the line below to PBS_HOME/mom_priv/config if not there:

        $usecp *:/home /home  (optional)


    and HUP the MoM after changing PBS_HOME/mom_priv/config:


      login# pkill -HUP pbs_mom


  4.2 As a regular user, request a Cray compute node in a job.


    login$ qsub -l select=1:ncpus=1:vntype=cray_compute

      aprun -B sleep 10

      ^D

    login$ apstat -rn

    login$ qstat -f


    expect that:

      - There is a reservation for a compute node. For example, in the output below, NID 28 has ApId 26949 in state "conf,claim".


      login$ apstat -rn

      NID Arch State CU Rv Pl  PgSz     Avl   Conf Placed PEs Apids

        2   XT UP  B  8  -  -    4K 8388608      0      0   0

        <...snip...>

       28   XT UP  B  8  1  1    4K 8388608 262144 262144   1 26949

       29   XT UP  B  8  -  -    4K 8388608      0      0   0

      Compute node summary

          arch config     up   resv    use  avail   down

            XT     24     24      1      1     23      0


        ResId  ApId From    Arch PEs N d Memory State

        49096 26948 batch:2   XT   1 1 1   1024 NID list,conf,claim

      A 49096 26949 batch:2   XT   1 - -   1024 conf,claim


      - The job's exec_vnode is on a compute node.

      - The job terminates normally without errors.
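
      While the job is running, the placement can also be checked directly (123.sdb is a placeholder job ID; substitute the ID returned by qsub):

      login$ qstat -f 123.sdb | grep exec_vnode

        expect the exec_vnode to name a compute vnode.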


  4.3 As a regular user, request a Cray login node in a job.

    login$ qsub -l select=1:ncpus=1:vntype=cray_login

      sleep 10

      ^D

    login$ apstat -rn


    login$ qstat -f


      expect that:

      - There are no reservations on the compute nodes. For example,

      login$ apstat -rn

      NID Arch State CU Rv Pl  PgSz      Avl Conf Placed PEs Apids

        8   XT UP  B 72  -  -    4K 25165824    0      0   0

        <...snip...>

      106   XT UP  B 72  -  -    4K 27262976    0      0   0

      107   XT UP  I 72  -  -    4K 27262976    0      0   0

      Compute node summary

          arch config     up   resv    use  avail   down

            XT     69     69      0      0     69      0


      No resource reservations are present


      - The job's exec_vnode is on a login node.

      - The job terminates normally without errors.


  4.4 Submit an interactive job

    login$ qsub -I -l select=1:ncpus=1:vntype=cray_compute


      inside interactive job:

        - type 'hostname' and 'aprun /bin/hostname'

        - expect that different hostnames are returned

      The steps below may be done inside the interactive job or outside the interactive job (e.g. in another terminal):

        - Check 'apstat -rn' output.

            expect that there is a reservation for a compute node

            (see section 4.2 for an example).

        - Check 'qstat -f' output.

            expect that job's exec_vnode is on a compute node.


      Exit out of the interactive job and check 'apstat -rn' output.

      Expect that there are no reservations on any compute node.

      (see section 4.3 for an example).



5. UNINSTALLATION


  5.1 Shut down and uninstall PBS on the server and MoM nodes


    5.1.1 Drain the system of jobs, from the sdb node:

      sdb# qdel `qselect`

        expect that all the jobs are gone.
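
      Once the jobs have finished exiting, verify that none remain:

      sdb# qselect | wc -l

        expected: 0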


    5.1.2 Shut down PBS Pro on all login nodes.

      login# /etc/init.d/pbs stop


    5.1.3 Shut down PBS Pro on the server.

      sdb# /etc/init.d/pbs stop


    5.1.4 Determine the NID of the node that runs the PBS Pro server and scheduler. In this example, the sdb node is used.

        boot# ssh sdb cat /proc/cray_xt/nid

        5


    5.1.5 There may be more than one version of PBS Pro installed. Obtain the list of all currently installed PBS Pro RPMs.

        boot# xtopview -e "rpm -qa" | grep pbs


    5.1.6 Remove each installed version of PBS Pro.

        boot# xtopview -e "rpm -e pbspro-server"


      During uninstallation, check that the message below appears:

      "NOTE: /etc/pbs.conf and the PBS_HOME directory must be deleted manually".


    5.1.7 Remove the /opt/pbs directory:

        boot# xtopview -e "rm -rf /opt/pbs"


  5.2. After uninstallation, log in to the server and MoM nodes and check that:


    5.2.1 PBS init script has been deleted

      sdb# ls -l /etc/init.d/pbs

      login# ls -l /etc/init.d/pbs


        expect to find that the pbs init script does not exist.


    5.2.2 PBS is disabled in chkconfig.

      sdb# chkconfig pbs

      login# chkconfig pbs


        expected: pbs: unknown service


  5.3. Delete /etc/pbs.conf, as root on the boot node:


    Remove specialization of the existing /etc/pbs.conf file:

      - Remove the node-specialized version of the file:

          boot# xtopview -e "xtunspec -N /etc/pbs.conf"

      - Remove the class-specialized version of the file:

          boot# xtopview -e "xtunspec -C /etc/pbs.conf"


     and delete /etc/pbs.conf:

      boot# xtopview -e "rm /etc/pbs.conf"


  5.4. Delete PBS_HOME from the server and MoM nodes.


    sdb# rm -rf /var/spool/{pbs,PBS}

    login# rm -rf /var/spool/{pbs,PBS}
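
    To verify, on each node:

      sdb# ls -d /var/spool/pbs /var/spool/PBS

        expected: "No such file or directory" reported for both paths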


  5.5. Remove the user pbsdata from the server host (e.g. NID 5) if it still exists:

       boot# xtopview -e "userdel -r pbsdata"

     Remove the home directory of pbsdata from the server host if it still exists:

       sdb# rm -rf /home/users/pbsdata