Friday, February 4, 2011

SGE in PWBC

Below is the output from installing SGE via apt-get:

Creating config file /etc/default/gridengine with new version
Setting up gridengine-master (6.2u5-1ubuntu1) ...
Initializing cluster with the following parameters:
 => SGE_ROOT: /var/lib/gridengine
 => SGE_CELL: default
 => Spool directory: /var/spool/gridengine/spooldb
 => Initial manager user: sgeadmin
Initializing spool (/var/spool/gridengine/spooldb)
Initializing global configuration based on /usr/share/gridengine/default-configuration
Initializing complexes based on /usr/share/gridengine/centry
Initializing usersets based on /usr/share/gridengine/usersets
Adding user sgeadmin as a manager
Cluster creation complete
Setting up libxp6 (1:1.0.0.xsf1-2build1) ...
Setting up lesstif2 (1:0.95.2-1) ...
Setting up gridengine-qmon (6.2u5-1ubuntu1) ...
Processing triggers for libc-bin ...
ldconfig deferred processing now taking place

When running "qstat -f" on the exec host, it complains:

error: commlib error: got select error (No route to host)
error: unable to contact qmaster using port 6444 on host "pwbclinuxlab.garvan.unsw.edu.au"

pwbclinuxlab.garvan.unsw.edu.au is the former qmaster, which has been removed. Even after adding the exec host again on the new qmaster, the exec host still remembers the old one, because its name is hardcoded in:

/var/lib/gridengine/default/common/act_qmaster
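
A minimal sketch for checking which qmaster an exec host currently points at. It assumes act_qmaster holds the qmaster's host name as its first non-empty line; the helper names read_act_qmaster and points_to are made up for illustration and are not part of SGE.

```python
from pathlib import Path

# Path from this post (SGE_ROOT=/var/lib/gridengine, SGE_CELL=default).
ACT_QMASTER = Path("/var/lib/gridengine/default/common/act_qmaster")


def read_act_qmaster(path=ACT_QMASTER):
    """Return the host name stored in act_qmaster (first non-empty line).

    Assumption: the file contains only the qmaster's host name.
    """
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:
            return line
    return ""


def points_to(expected_host, path=ACT_QMASTER):
    """True if act_qmaster names expected_host (case-insensitive compare)."""
    return read_act_qmaster(path).lower() == expected_host.lower()
```

On the exec host, points_to("sgeqmast01.garvan.unsw.edu.au") returning False would suggest the file still names the old qmaster and needs to be updated.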

Running "qstat -f" on the exec host then gives another complaint:

error: commlib error: access denied (client IP resolved to host name "". This is not identical to clients host name "")
error: unable to contact qmaster using port 6444 on host "sgeqmast01.garvan.unsw.edu.au"

Read "Things to think about before installing Grid Engine".

This is very likely related to DNS. SGE requires both forward and reverse DNS lookups to work, so make sure the DNS server has been set up properly. If setting up a DNS server is too difficult, adding the proper entries to /etc/hosts will fix the issue.
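
The forward-and-reverse requirement can be checked with a small script before blaming SGE. This is only a sketch using Python's standard socket module; the function name check_dns and its forward/reverse parameters are my own, added so the resolvers can be swapped out for testing. It does not reproduce SGE's own host name comparison.

```python
import socket


def check_dns(hostname, forward=socket.gethostbyname,
              reverse=socket.gethostbyaddr):
    """Check that hostname resolves to an IP and that IP resolves back.

    Returns (ok, message). The system resolver is used by default, so
    entries in /etc/hosts count as well as real DNS records.
    """
    try:
        ip = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup of {hostname} failed: {exc}"
    try:
        name, aliases, _addresses = reverse(ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"reverse lookup of {ip} failed: {exc}"
    known = {n.lower() for n in [name, *aliases]}
    if hostname.lower() not in known:
        return False, f"{ip} resolves back to {name!r}, not {hostname!r}"
    return True, f"{hostname} <-> {ip}"
```

Running check_dns on both the qmaster and the exec host names, from each machine, shows which direction of the lookup is broken.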

Remember: the qmaster's hosts file must contain a record for every SGE host (submit hosts, exec hosts, etc.), and every other SGE host's hosts file must contain the qmaster's record.

127.0.0.1       localhost
129.94.136.232  sgeqexec01.garvan.unsw.edu.au   sgeqexec01
129.94.136.230  sgeqmast01.garvan.unsw.edu.au   sgeqmast01
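
To verify the rule above on a given machine, something like the following sketch can list which required hosts are missing from a hosts file. The helpers parse_hosts and missing_hosts are illustrative names, and "sgeqsub01" in the usage note is a hypothetical host that does not appear in this setup.

```python
# The three entries shown above, used here as sample input.
SAMPLE_HOSTS = """\
127.0.0.1       localhost
129.94.136.232  sgeqexec01.garvan.unsw.edu.au   sgeqexec01
129.94.136.230  sgeqmast01.garvan.unsw.edu.au   sgeqmast01
"""


def parse_hosts(text):
    """Map every host name and alias in hosts-file text to its IP address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # Keep the first entry for a name, as resolvers usually do.
            table.setdefault(name.lower(), ip)
    return table


def missing_hosts(text, required):
    """Return the required host names that have no entry in the hosts text."""
    table = parse_hosts(text)
    return [host for host in required if host.lower() not in table]
```

For example, missing_hosts(SAMPLE_HOSTS, ["sgeqmast01", "sgeqsub01"]) reports only the hypothetical "sgeqsub01" as missing. On a real machine, pass the contents of /etc/hosts instead of SAMPLE_HOSTS.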