This post follows my previous article on setting up a torque system. However, I found that torque 2.6.1 on Ubuntu is out of date and does not work properly. To work around this, I decided to move to torque/maui for better scheduling efficiency.
http://www.adaptivecomputing.com/support/download-center/torque-download/
I also noticed that Adaptive Computing no longer maintains torque and maui, which means bugs will not be fixed. The ultimate solution for the system really is to move to slurm or Sun Grid Engine.
First, download torque and maui from their websites.
maui has to be installed after torque.
error 1:
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: log_mutex
solution:
echo '/usr/local/lib' > /etc/ld.so.conf.d/torque.conf
ldconfig
error 2:
socket_connect_unix failed: 15137
qstat: cannot connect to server (null) (errno=15137) could not connect to trqauthd
solution: make sure trqauthd is running alongside pbs_mom
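A quick way to see which of the required daemons is missing is sketched below. This is a minimal sketch and not part of the torque distribution: missing_daemons is a hypothetical helper that reads a list of process names on stdin, and the sample process list is made up for illustration.

```shell
# Hypothetical helper: print every daemon named in the arguments that is
# absent from the process-name list supplied on stdin (one name per line).
missing_daemons() {
    md_running=$(cat)
    for md_name in "$@"; do
        printf '%s\n' "$md_running" | grep -qxF -- "$md_name" || printf '%s\n' "$md_name"
    done
}

# Made-up process list standing in for the output of: ps -e -o comm=
printf 'sshd\npbs_mom\ncron\n' | missing_daemons trqauthd pbs_mom
```

On a real node the list would come from ps, for example: ps -e -o comm= | missing_daemons trqauthd pbs_mom. No output means both daemons were found.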
error 3: at the client
pbs_mom
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: dis_getc
error 4: at the client
./torque-mom start
* Starting Torque Mom torque-mom
/usr/sbin/pbs_mom: symbol lookup error: /usr/sbin/pbs_mom: undefined symbol: dis_getc
...fail!
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: dis_getc
solution:
ldd /usr/local/sbin/pbs_mom
linux-vdso.so.1 => (0x00007fff9f7ff000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2abbbed000)
libtorque.so.2 => /usr/local/lib/libtorque.so.2 (0x00007f2abb2f6000)
libxml2.so.2 => /usr/lib/x86_64-linux-gnu/libxml2.so.2 (0x00007f2abaf99000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2abad7c000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f2abab74000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2aba873000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2aba577000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2aba361000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2ab9fa1000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2ab9d9d000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f2ab9b86000)
/lib64/ld-linux-x86-64.so.2 (0x00007f2abbe0b000)
the solution so far is to reinstall torque 5.0.1 (2015-05-27)
it took the whole morning to fix
this happened again on 2015-09-28
the correct binary is located at
/usr/local/sbin/pbs_mom
running that one directly should be ok
the pbs_mom that fails on dis_getc is the old one installed by apt-get
first, remove the torque package that came from the apt repo: apt-get remove torque-mom
now if you run pbs_mom you will see
./pbs_mom
pbs_mom: LOG_ERROR::No such file or directory (2) in chk_file_sec, Security violation with "/var/spool/torque/checkpoint" - /var/spool/torque/checkpoint cannot be lstat'd - errno=2, No such file or directory
then reinstall torque-5.0.1-1_4fa836f5
torque-package-clients-linux-x86_64.sh --install
torque-package-mom-linux-x86_64.sh --install
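The ldd check used for this error can be scripted. The sketch below assumes the usual ldd line layout (soname, =>, path, address) and uses a hypothetical helper, resolved_lib, to print where a given library resolves. The sample line is copied from the ldd listing for /usr/local/sbin/pbs_mom in these notes.

```shell
# Hypothetical helper: read ldd output on stdin and print the path that the
# given soname resolves to (third field of "soname => path (address)").
resolved_lib() {
    awk -v lib="$1" '$1 == lib && $2 == "=>" { print $3 }'
}

# Sample line taken from the ldd listing for /usr/local/sbin/pbs_mom.
printf '%s\n' 'libtorque.so.2 => /usr/local/lib/libtorque.so.2 (0x00007f2abb2f6000)' |
    resolved_lib libtorque.so.2
```

On a node this would be run as: ldd /usr/local/sbin/pbs_mom | resolved_lib libtorque.so.2. If the printed path is not under /usr/local/lib, the binary is probably picking up a different libtorque than the one installed from source.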
question 1:
limit the maximum processes per user
http://docs.adaptivecomputing.com/maui/6.2throttlingpolicies.php
install pam torque
libtool --finish /lib64/security
/lib64/security/ is where the pam modules are located
/etc/security/access.conf controls which users are given access
configure maui to limit the jobs and processes per user
USERCFG[DEFAULT] MAXPROC=64 MAXJOB=5 #working
#GROUPCFG[useraid] MAXJOB[USER]=5 # not working
#CLASSCFG[batch] MAXJOB[USER]=5 working
CLASSCFG[batch] MAXJOB[USER]=5 MAXPROC=64 # not working
Working solution: use pam to prevent users from logging into compute nodes
allow some users onto the compute nodes while keeping the others out
versions: torque-5.0.1-1_4fa836f5 maui-3.3.tar.gz
the official torque documentation at http://docs.adaptivecomputing.com/torque/3-0-5/3.4hostsecurity.php says:
1. first configure torque with ./configure --with-pam
2. add the following lines to /etc/pam.d/sshd:
account required pam_pbssimpleauth.so
account required pam_access.so
3. in /etc/security/access.conf, make sure all users who access the compute node are added to the configuration. This is an example which allows the users root, george, allen, and michael access.
-:ALL EXCEPT root george allen michael torque:ALL
However, I found this method too strict: none of root, george, or allen could log into the compute node.
my solution:
1. there is no need to reinstall torque with ./configure --with-pam
2. put
account required pam_access.so
into /etc/pam.d/sshd
so that pam_access is consulted for every ssh login
3. put
-:ALL EXCEPT root szhang czhang storres torque:ALL
into /etc/security/access.conf
now only the users listed in that rule (root, szhang, czhang, storres, torque) can log into compute nodes
I think this approach works and is easy to understand. At the moment all job submission is handled by pbs_mom, which runs as root, so pam_pbssimpleauth.so does not need to be in effect.
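To make the effect of the access.conf rule concrete, here is a small sketch that imitates it in shell. It is only an illustration under strong assumptions: denied_by_rule is a hypothetical function that understands just the single form "-:ALL EXCEPT <names>:ALL" and compares plain user names, whereas the real pam_access module also handles groups, origins and multiple rules. The user name alice is made up.

```shell
# Hypothetical helper: return 0 if <user> is denied by a rule of the form
# "-:ALL EXCEPT name1 name2 ...:ALL", 1 if the user is listed as an
# exception, and 2 if the rule has any other form.
denied_by_rule() {
    dr_names=$(printf '%s\n' "$1" | sed -n 's/^-:ALL EXCEPT \(.*\):ALL$/\1/p')
    [ -n "$dr_names" ] || return 2
    for dr_name in $dr_names; do
        [ "$dr_name" = "$2" ] && return 1
    done
    return 0
}

rule='-:ALL EXCEPT root szhang czhang storres torque:ALL'
for user in szhang alice; do
    if denied_by_rule "$rule" "$user"; then
        echo "$user: denied"
    else
        echo "$user: allowed"
    fi
done
```

The rule reads as "deny everyone except the listed names, from every origin", so a listed user falls through to the default (allowed) while any other user is rejected.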
reload maui
just restart it; this won't affect the queue
pkill maui && qterm -t quick && sleep 5 && /usr/local/maui/sbin/maui && pbs_server && ps aux | grep maui
showres working
showres -n
checkjob 810 working
checknode macondo01 % very good feedback
showgrid AVGXFACTOR
showstats
mbal this will kill maui!!!!!!!!!!!!!!!
mdiag same as diagnose
I still don't understand maxnode. does it mean all jobs for one user have to go to one particular node?
mjobct
ERROR: corrupt command received
mclient
ERROR: unknown command: 'mclient'
mprof
USAGE ERROR: (tracefile not specified)
mstat
ERROR: command 'mstat' args not handled
ERROR: service 36 not handled
ERROR: Service[36] 'mstat' not implemented
showbf
backfill window (user: 'czhang' group: 'useraid' partition: ALL) Sun Jan 18 15:25:07
231 procs available for 7:11:35:38
175 procs available for 21:18:13:37
118 procs available for 40:14:55:01
62 procs available for 40:21:06:15
diagnose -j | grep -o -P '(?<=job \047).*(?=\047 utilizes more procs than)'
# this line lists the ids of all jobs that trigger the warning
diagnose -j
Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features
381 Running DEF 1 DEF 10:00:00:00 1 1 cwang useraid uq-Civil 00:49:21 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
569 Running DEF 1 DEF 25:00:00:00 1 1 pzhang useraid uq-Civil 00:49:21 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
WARNING: job '569' utilizes more procs than dedicated (10.35 > 1)
650 Running DEF 1 DEF 41:16:00:00 1 1 mgholami useraid uq-Civil 00:49:20 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
WARNING: job '650' utilizes more procs than dedicated (13.00 > 1)
651 Running DEF 1 DEF 41:16:00:00 1 1 mgholami useraid uq-Civil 00:49:20 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
WARNING: job '651' utilizes more procs than dedicated (10.28 > 1)
669 Running DEF 1 DEF 41:16:00:00 1 1 mgholami useraid uq-Civil 00:49:19 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
WARNING: job '669' utilizes more procs than dedicated (14.00 > 1)
671 Running DEF 1 DEF 25:00:00:00 1 1 pzhang useraid uq-Civil 00:49:21 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
WARNING: job '671' utilizes more procs than dedicated (9.57 > 1)
672 Running DEF 1 DEF 25:00:00:00 1 1 pzhang useraid uq-Civil 00:49:21 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
WARNING: job '672' utilizes more procs than dedicated (7.80 > 1)
\047 is the octal ascii code for a single quote
diagnose -j | grep -o -P '(?<=than dedicated \050).*(?=>)'
\050 is the octal ascii code for a left parenthesis
adse=$(diagnose -j | grep -o -P '(?<=than dedicated \050).*(?=>)')
this stores the result in adse
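The two greps above each pull out one field. As a sketch of getting the job id and both numbers in one pass, the snippet below uses sed with a basic regular expression instead of grep -P (a swapped-in technique, chosen because it does not need PCRE support). parse_overuse is a hypothetical name, and the sample lines are copied from the diagnose -j output above.

```shell
# Hypothetical helper: read diagnose -j output on stdin and print
# "<job id> <procs used> <procs dedicated>" for every over-use warning.
parse_overuse() {
    sed -n "s/.*job '\([^']*\)' utilizes more procs than dedicated (\([0-9.]*\) > \([0-9.]*\)).*/\1 \2 \3/p"
}

# Sample warnings copied from the diagnose -j output.
parse_overuse <<'EOF'
WARNING: job '569' utilizes more procs than dedicated (10.35 > 1)
WARNING: job '650' utilizes more procs than dedicated (13.00 > 1)
EOF
```

On the cluster the same function would be fed directly: diagnose -j | parse_overuse. Lines that are not warnings are silently skipped.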
if [ "$a" != "$b" ]
then
echo "$a is not equal to $b."
echo "(string comparison)"
# "4" != "5"
# ASCII 52 != ASCII 53
fi
#!/bin/bash
x=5.0
y=3.0
#ans= $(( $x + $y |bc ))
#ans=$(echo $x + $y |bc )
#ans=$(echo $x / $y |bc -l ) # this ends up with good result
#ans=$(echo $x / $y |bc ) # this does not give good result
#ans=$(python -c "print $x / $y") # this one is also ok but format is a problem
#ans=$(python -c "print( "%.2f" %($x / $y) ) ") #failed
#alpha=`echo "$a/100" | bc -l | awk '{printf("%06.2f", $1);}'`
ans=`echo "$x/$y" | bc -l | awk '{printf("%6.4f", $1);}'`
echo "$x / $y = $ans"
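The usage figures reported by diagnose are decimals, and a plain shell test compares them as strings. A minimal awk-only sketch of the same arithmetic is given below; it avoids the dependency on bc. The helper names divide and float_gt are hypothetical, not from any package.

```shell
# Hypothetical helper: print x / y with the requested number of decimals
# (default 4). Fails with status 1 when y is zero.
divide() {
    awk -v x="$1" -v y="$2" -v d="${3:-4}" 'BEGIN {
        if (y + 0 == 0) exit 1
        fmt = "%." d "f\n"
        printf fmt, x / y
    }'
}

# Hypothetical helper: succeed when a > b, compared as numbers.
float_gt() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 > b + 0) }'
}

x=5.0
y=3.0
echo "$x / $y = $(divide "$x" "$y")"
if float_gt 10.35 1; then
    echo "10.35 is greater than 1"
fi
```

Adding 0 to each operand forces awk to compare numerically, so float_gt 9.57 10 fails as expected even though "9.57" sorts after "10" as a string.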
maui is on its way to being deprecated. use
Sun Grid Engine (SGE, later Oracle Grid Engine; rocks cluster uses this) or slurm instead.
it seems to me that the soft and hard limits only work for groups, not for users
/usr/local/maui
http://www.physics.oregonstate.edu/cluster_install
Problem 2016-01-12:
when running trqauthd:
trqauthd: symbol lookup error: trqauthd: undefined symbol: debug_mode
this happened on the server, which had been running for a few days. once trqauthd was killed, it could not be restarted properly.
root@macondo03:/home/users/uqczhan2# trqauthd
trqauthd: symbol lookup error: trqauthd: undefined symbol: debug_mode
root@macondo03:/home/users/uqczhan2# pbs_server
pbs_server: symbol lookup error: pbs_server: undefined symbol: job_log_mutex
root@macondo03:/home/users/uqczhan2# pbs_mom
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: log_mutex
root@macondo03:/home/users/uqczhan2# which trqauthd
/usr/local/sbin/trqauthd
root@macondo03:/home/users/uqczhan2# pbs_
pbs_demux pbs_mom pbs_restart pbs_sched pbs_server pbs_track
root@macondo03:/home/users/uqczhan2# pbs_sched
pbs_sched: symbol lookup error: pbs_sched: undefined symbol: log_mutex
root@macondo03:/home/users/uqczhan2# pbs_restart
Cannot connect to default server host 'macondo03' - check pbs_server daemon.
qterm: could not connect to server '' (1) Operation not permitted
torque-package-server-linux-x86_64.sh
this installs pbs_sched pbs_server qschedd qserverd
./torque-package-mom-linux-x86_64.sh --install
Installing TORQUE archive...
Done.
root@macondo03:/home/user/uqczhan2/czhang/Downloads/torque-5.0.1-1_4fa836f5# ls /usr/local/sbin
momctl pbs_demux pbs_mom pbs_sched pbs_server qnoded qschedd qserverd
solution:
ldd trqauthd
linux-vdso.so.1 => (0x00007ffcf55e1000)
libtorque.so.2 => /usr/local/lib/libtorque.so.2 (0x00007f365ed33000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f365eb16000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f365e816000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f365e458000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f365e250000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f365df54000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f365dd3e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f365f62a000)
the problem was resolved again today:
in fact the FDS model setup was what made the system hang. it changed the path from which libtorque.so.2 is loaded, so trqauthd stopped working.
solution: I removed everything associated with FDS from LD_LIBRARY_PATH in .bashrc, then checked ldd trqauthd. the correct output should match the listing above.
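The same cleanup can be tried on the current shell's LD_LIBRARY_PATH before editing .bashrc. The sketch below is an illustration under assumptions: strip_path_entries is a hypothetical helper, and /opt/FDS/lib is a made-up directory standing in for wherever FDS put its libraries.

```shell
# Hypothetical helper: drop every colon-separated entry that contains the
# given substring and print the remaining entries, still colon-separated.
strip_path_entries() {
    printf '%s\n' "$2" | tr ':' '\n' | grep -vF -- "$1" | paste -s -d ':' -
}

# /opt/FDS/lib is a made-up example path.
sample='/opt/FDS/lib:/usr/local/lib:/usr/lib'
strip_path_entries FDS "$sample"
```

In a live shell this could be applied as LD_LIBRARY_PATH=$(strip_path_entries FDS "$LD_LIBRARY_PATH"), followed by ldd trqauthd to confirm that libtorque.so.2 resolves to /usr/local/lib again.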
after the restore, there was also a small problem restarting pbs_mom, pbs_server and pbs_sched.
solution:
first, apt-get remove torque-mom torque-server torque-sched, to make sure the torque from the apt system is not installed.
second, reinstall torque 5.0.1 with configure, make, make install.
then start the daemons one by one.
below are the errors that appear when running pbs_mom, pbs_server and pbs_sched.
pbs_mom
pbs_mom: LOG_ERROR::No such file or directory (2) in chk_file_sec, Security violation with "/var/spool/torque/checkpoint" - /var/spool/torque/checkpoint cannot be lstat'd - errno=2, No such file or directory
pbs_server and pbs_sched do not show up as processes in the system after being started.
reinstalling torque 5.0.1 resolved the problem. 2016-01-12
problem
pbsnodes
pbsnodes: Server has no node list MSG=node list is empty - check 'server_priv/nodes' file
cd /var/spool/torque/server_priv
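The nodes file in this directory lists one compute node per line, with the host name followed by np=<number of processors>. The sketch below only prints such lines so they can be reviewed before anything is written; nodes_line is a hypothetical helper, and the processor count of 16 is an assumed value, not taken from this cluster. macondo01 and macondo03 are host names that appear earlier in these notes.

```shell
# Hypothetical helper: print one nodes-file entry for the given host name
# and processor count.
nodes_line() {
    printf '%s np=%s\n' "$1" "$2"
}

# 16 is an assumed processor count; replace it with the real value per node.
for host in macondo01 macondo03; do
    nodes_line "$host" 16
done
```

Once the output looks right it can be appended to /var/spool/torque/server_priv/nodes; pbs_server typically needs a restart before pbsnodes shows the new entries.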