Although I have been using SUTRA for a long time, I still find that a good GUI for preprocessing is lacking. A recent project led me to look further for an effective tool to create quad meshes.
First, the system lacks an input prompt for entering accurate coordinates for an object. The workaround is to enter them manually in the gpt file.
vertical exaggeration = vertical / horizontal ratio
Porosity: select the biggest object -> double click -> evaluated at nodes -> datasets tab -> initial head,
Show nodal no ->
at the moment nature is not able to run:
reason: U-solution inferred from matrix equation a*u=0 solver not called
so I set another
nature_reset.UFluxBcs is always accepted, but the other ones are not working at the moment.
initial head working 9800. * (240-Y)
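As a quick sanity check of the expression above, here is a minimal sketch that evaluates 9800. * (240-Y) for a given Y. The function name initial_value is made up for illustration, and I am only assuming that Y is an elevation in the same units as 240; the notes do not say so.

```shell
# Hypothetical helper: evaluate 9800 * (240 - Y) for one value of Y.
# Assumption: Y is a plain number in the same units as the 240 datum.
initial_value() {
    awk -v y="$1" 'BEGIN { printf "%.1f\n", 9800.0 * (240 - y) }'
}

initial_value 200   # 9800 * 40
initial_value 240   # zero at the datum
```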
Tuesday, 13 January 2015
Monday, 5 January 2015
Setup torque/maui system _debug the system
This post follows my previous article on setting up a Torque system. However, I found that Torque 2.6.1 on Ubuntu is out of date and does not work properly. To get around this, I decided to move to Torque/Maui for better scheduling efficiency.
It is also worth noting that Adaptive Computing no longer seems to maintain Torque and Maui, which means bugs will not be fixed. The ultimate solution really is to move to Slurm or Sun Grid Engine.
First, download Torque and Maui from their websites:
Maui has to be installed after Torque.
error 1:
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: log_mutex
echo '/usr/local/lib' > /etc/
error 2:
socket_connect_unix failed: 15137
qstat: cannot connect to server (null) (errno=15137) could not connect to trqauthd
solution: make sure trqauthd is running with pbs_mom
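A minimal sketch of that check, assuming pgrep is available on the node. The helper name ensure_running is made up; it only starts the given command when no process with that exact name is found.

```shell
# Hypothetical helper: start a daemon only if no process with that
# exact name is already running.
ensure_running() {
    name=$1
    shift
    if pgrep -x "$name" >/dev/null 2>&1; then
        echo "$name already running"
    else
        echo "starting $name"
        "$@"
    fi
}

# Intended use, as root on the node (not executed here):
#   ensure_running trqauthd trqauthd
#   ensure_running pbs_mom pbs_mom
```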
error 3: at the client
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: dis_getc
error 4 at the client
./torque-mom start
* Starting Torque Mom torque-mom
/usr/sbin/pbs_mom: symbol lookup error: /usr/sbin/pbs_mom: undefined symbol: dis_getc!
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: dis_getc
ldd /usr/local/sbin/pbs_mom
=> (0x00007fff9f7ff000)
=> /lib/x86_64-linux-gnu/ (0x00007f2abbbed000)
=> /usr/local/lib/ (0x00007f2abb2f6000)
=> /usr/lib/x86_64-linux-gnu/ (0x00007f2abaf99000)
=> /lib/x86_64-linux-gnu/ (0x00007f2abad7c000)
=> /lib/x86_64-linux-gnu/ (0x00007f2abab74000)
=> /usr/lib/x86_64-linux-gnu/ (0x00007f2aba873000)
=> /lib/x86_64-linux-gnu/ (0x00007f2aba577000)
=> /lib/x86_64-linux-gnu/ (0x00007f2aba361000)
=> /lib/x86_64-linux-gnu/ (0x00007f2ab9fa1000)
=> /lib/x86_64-linux-gnu/ (0x00007f2ab9d9d000)
=> /lib/x86_64-linux-gnu/ (0x00007f2ab9b86000)
/lib64/ (0x00007f2abbe0b000)
The solution so far is to reinstall Torque 5.0.1. On 2015-05-27 it took the whole morning to fix.
This happened again on 2015-09-28.
this file is located in
just running it should be OK
dis_getc comes from the old package installed by apt-get
First, remove the Torque from the apt repo: apt-get remove torque-mom
Now if you run pbs_mom you will see
pbs_mom: LOG_ERROR::No such file or directory (2) in chk_file_sec, Security violation with "/var/spool/torque/checkpoint" - /var/spool/torque/checkpoint cannot be lstat'd - errno=2, No such file or directory
then reinstall torque-5.0.1-1_4fa836f5 --install
question 1:
limit the maximum number of processes per user
install the Torque PAM module
libtool --finish /lib64/security
/lib64/security/ is where the PAM module files are located
/etc/security/access.conf: grant access to whichever users you wish
set Maui to limit the jobs and processes per user
#GROUPCFG[useraid] MAXJOB[USER]=5 # not working
#CLASSCFG[batch] MAXJOB[USER]=5 working
CLASSCFG[batch] MAXJOB[USER]=5 MAXPROC=64 # not working
Working solution: use PAM to prevent users from logging into compute nodes
allow some users onto the compute nodes while keeping others out
versions: torque-5.0.1-1_4fa836f5 maui-3.3.tar.gz
The official Maui tutorial says:
1. first configure torque with ./configure --with-pam
account required
account required
In /etc/security/access.conf make sure all users who access the compute node are added to the configuration. This is an example which allows the users root, george, allen, and michael access.
-:ALL EXCEPT root george allen michael torque:ALL
However, I found this method too strict: none of root, george or allen could log into the compute node.
my solution:
1. there is no need to reinstall torque with ./configure --with-pam
2. put
account required
into /etc/pam.d/sshd
which means pam_access is consulted for each ssh login
3. put
-:ALL EXCEPT root szhang czhang storres torque:ALL
into /etc/security/access.conf
now only the users listed after EXCEPT (root, szhang, czhang, storres and torque) can log into the compute nodes
I think this approach works and is easy to understand: at the moment all job submission is handled by pbs_mom, which runs as root, so the access rule does not have to apply to it.
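To convince myself what the access.conf rule above means, here is a rough sketch that reads one rule of exactly this form and reports whether a user is exempt. It is not how pam_access itself works, it only mimics the single "-:ALL EXCEPT ...:ALL" pattern, and the function name check_access is made up.

```shell
# Hypothetical helper: given a user and a rule of the form
# "-:ALL EXCEPT u1 u2 ...:ALL", print "allowed" or "denied".
check_access() {
    user=$1
    rule=$2
    field=${rule#*:}                  # drop the leading "-:"
    field=${field%:*}                 # drop the trailing ":ALL"
    exempt=${field#"ALL EXCEPT "}     # keep only the exempt user names
    for u in $exempt; do
        if [ "$u" = "$user" ]; then
            echo allowed
            return 0
        fi
    done
    echo denied
}

rule='-:ALL EXCEPT root szhang czhang storres torque:ALL'
check_access czhang "$rule"
check_access george "$rule"
```

With the rule from step 3, czhang is reported as allowed and george as denied.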
reload maui
Just restart it; it won't affect the queue.
pkill maui && qterm -t quick && sleep 5 && /usr/local/maui/sbin/maui && pbs_server && ps aux | grep maui
showres working
showres -n
checkjob 810 working
checknode macondo01 % very good feedback
mbal this will kill maui!
mdiag same as diagnose
I still don't understand maxnode. Does it mean all jobs for one person have to go to one particular node?
ERROR: corrupt command received
ERROR: unknown command: 'mclient'
USAGE ERROR: (tracefile not specified)
ERROR: command 'mstat' args not handled
ERROR: service 36 not handled
ERROR: Service[36] 'mstat' not implemented
backfill window (user: 'czhang' group: 'useraid' partition: ALL) Sun Jan 18 15:25:07
231 procs available for 7:11:35:38
175 procs available for 21:18:13:37
118 procs available for 40:14:55:01
62 procs available for 40:21:06:15
diagnose -j | grep -o -P '(?<=job \047).*(?=\047 utilizes more procs than)'
# this line finds all the jobs for which warnings are raised.
diagnose -j
Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features
381 Running DEF 1 DEF 10:00:00:00 1 1 cwang useraid uq-Civil 00:49:21 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
569 Running DEF 1 DEF 25:00:00:00 1 1 pzhang useraid uq-Civil 00:49:21 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
WARNING: job '569' utilizes more procs than dedicated (10.35 > 1)
650 Running DEF 1 DEF 41:16:00:00 1 1 mgholami useraid uq-Civil 00:49:20 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
WARNING: job '650' utilizes more procs than dedicated (13.00 > 1)
651 Running DEF 1 DEF 41:16:00:00 1 1 mgholami useraid uq-Civil 00:49:20 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
WARNING: job '651' utilizes more procs than dedicated (10.28 > 1)
669 Running DEF 1 DEF 41:16:00:00 1 1 mgholami useraid uq-Civil 00:49:19 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
WARNING: job '669' utilizes more procs than dedicated (14.00 > 1)
671 Running DEF 1 DEF 25:00:00:00 1 1 pzhang useraid uq-Civil 00:49:21 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
WARNING: job '671' utilizes more procs than dedicated (9.57 > 1)
672 Running DEF 1 DEF 25:00:00:00 1 1 pzhang useraid uq-Civil 00:49:21 [NONE] [NONE] [NONE] >=0 >=0 NC0 [batch:1] [NONE]
WARNING: job '672' utilizes more procs than dedicated (7.80 > 1)
\047 is the octal ASCII code for a single quote
diagnose -j | grep -o -P '(?<=than dedicated \050).*(?=>)'
\050 is the octal ASCII code for a left parenthesis
adse=$(diagnose -j | grep -o -P '(?<=than dedicated \050).*(?=>)')
stores the result in adse
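The two grep calls above can also be combined so that each job id comes out next to its measured proc usage. This is only a sketch using sed instead of grep -P; the function name overused_jobs is made up, and the sample lines are copied from the diagnose -j output shown earlier.

```shell
# Hypothetical helper: read `diagnose -j` output on stdin and print
# "<job id> <procs used>" for every over-utilization warning.
overused_jobs() {
    sed -n "s/.*job '\([^']*\)' utilizes more procs than dedicated (\([0-9.]*\) >.*/\1 \2/p"
}

# Sample warning lines copied from the output above
sample="WARNING: job '569' utilizes more procs than dedicated (10.35 > 1)
WARNING: job '672' utilizes more procs than dedicated (7.80 > 1)"

printf '%s\n' "$sample" | overused_jobs
```

On the live system this would be run as diagnose -j | overused_jobs.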
if [ "$a" != "$b" ]
then
echo "$a is not equal to $b."
echo "(string comparison)"
fi
# "4" != "5"
# ASCII 52 != ASCII 53
#ans= $(( $x + $y |bc )) # wrong: inside $(( )) the | is a bitwise OR, not a pipe
#ans=$(echo $x + $y |bc )
#ans=$(echo $x / $y |bc -l ) # this ends up with good result
#ans=$(echo $x / $y |bc ) # this does not give good result (integer division without -l)
#ans=$(python -c "print $x / $y") # this one is also ok but format is a problem
#ans=$(python -c "print( "%.2f" %($x / $y) ) ") # failed: the shell consumes the inner double quotes
#alpha=$(echo "$a/100" | bc -l | awk '{printf("%06.2f", $1);}')
ans=$(echo "$x/$y" | bc -l | awk '{printf("%6.4f", $1);}')
echo "$x / $y = $ans"
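Since awk is already doing the formatting here, the division can probably be done in awk as well, which avoids bc altogether. A small sketch, with the function name ratio made up for illustration:

```shell
# Hypothetical helper: floating-point division with four decimals,
# done entirely in awk instead of bc | awk.
ratio() {
    awk -v x="$1" -v y="$2" 'BEGIN { printf "%.4f\n", x / y }'
}

x=1
y=3
ans=$(ratio "$x" "$y")
echo "$x / $y = $ans"
```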
Maui is on its way to being deprecated. Use Sun Grid Engine (SGE; Rocks Cluster uses this, now Oracle Grid Engine) or Slurm instead.
It seems to me that the soft/hard limits only work for groups, not for users.
Problem 2016-01-12:
when running trqauthd:
trqauthd: symbol lookup error: trqauthd: undefined symbol: debug_mode
This happened on the server, which had been running for a few days. Once trqauthd was killed, it could not be restarted properly.
root@macondo03:/home/users/uqczhan2# trqauthd
trqauthd: symbol lookup error: trqauthd: undefined symbol: debug_mode
root@macondo03:/home/users/uqczhan2# pbs_server
pbs_server: symbol lookup error: pbs_server: undefined symbol: job_log_mutex
root@macondo03:/home/users/uqczhan2# pbs_mom
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: log_mutex
root@macondo03:/home/users/uqczhan2# which trqauthd
root@macondo03:/home/users/uqczhan2# pbs_
pbs_demux pbs_mom pbs_restart pbs_sched pbs_server pbs_track
root@macondo03:/home/users/uqczhan2# pbs_sched
pbs_sched: symbol lookup error: pbs_sched: undefined symbol: log_mutex
root@macondo03:/home/users/uqczhan2# pbs_restart
Cannot connect to default server host 'macondo03' - check pbs_server daemon.
qterm: could not connect to server '' (1) Operation not permitted
we get pbs_sched pbs_server qschedd qserverd
./ --install
Installing TORQUE archive...
root@macondo03:/home/user/uqczhan2/czhang/Downloads/torque-5.0.1-1_4fa836f5# ls /usr/local/sbin
momctl pbs_demux pbs_mom pbs_sched pbs_server qnoded qschedd qserverd
ldd
=> (0x00007ffcf55e1000)
=> /usr/local/lib/ (0x00007f365ed33000)
=> /lib/x86_64-linux-gnu/ (0x00007f365eb16000)
=> /usr/lib/x86_64-linux-gnu/ (0x00007f365e816000)
=> /lib/x86_64-linux-gnu/ (0x00007f365e458000)
=> /lib/x86_64-linux-gnu/ (0x00007f365e250000)
=> /lib/x86_64-linux-gnu/ (0x00007f365df54000)
=> /lib/x86_64-linux-gnu/ (0x00007f365dd3e000)
/lib64/ (0x00007f365f62a000)
Today the problem was resolved again:
In fact, it was the FDS model setup that broke the system: it changes the address of and so trqauthd stops working.
Solution: I removed everything associated with FDS from .bashrc (from LD_LIBRARY_PATH), then checked ldd trqauthd. The correct output should be the same as the listing above.
After the restore, there was still a small problem restarting pbs_mom, pbs_server and pbs_sched.
First, apt-get remove torque-mom torque-server torque-sched, to make sure the apt version of Torque is not installed.
Second, reinstall Torque 5.0.1 with ./configure, make, make install.
Run them one by one.
Below are the errors that appear when running pbs_mom, pbs_server and pbs_sched.
pbs_mom: LOG_ERROR::No such file or directory (2) in chk_file_sec, Security violation with "/var/spool/torque/checkpoint" - /var/spool/torque/checkpoint cannot be lstat'd - errno=2, No such file or directory
For pbs_server and pbs_sched, after starting them they do not show up as processes in the system.
Once Torque 5.0.1 was reinstalled, the problem was resolved. 2016-01-12
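The LD_LIBRARY_PATH cleanup mentioned above can be sketched roughly as follows. The helper name strip_path_entries and the /opt/fds/lib directory are made up for illustration; the real FDS entries in .bashrc will differ.

```shell
# Hypothetical helper: print a colon-separated path list with every
# entry containing the given substring removed.
strip_path_entries() {
    pattern=$1
    list=$2
    printf '%s\n' "$list" | tr ':' '\n' | grep -v -F -- "$pattern" | paste -sd: -
}

# /opt/fds/lib is an invented example entry
strip_path_entries fds "/usr/local/lib:/opt/fds/lib:/usr/lib"
```

In a root shell this could be applied as LD_LIBRARY_PATH=$(strip_path_entries fds "$LD_LIBRARY_PATH") before re-checking the daemon with ldd.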
pbsnodes: Server has no node list MSG=node list is empty - check 'server_priv/nodes' file
cd /var/spool/torque/server_priv
pbsnodes: Server has no node list MSG=node list is empty - check 'server_priv/nodes' file
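The message points at an empty server_priv/nodes file. As far as I know each line in that file is a host name followed by np=<number of processors>, so a sketch for generating the entries could look like this. The helper name nodes_entries and the value 8 are assumptions; the host names are the ones used elsewhere in these notes.

```shell
# Hypothetical helper: print one "host np=N" line per host, in the
# format expected by Torque's server_priv/nodes file.
nodes_entries() {
    np=$1
    shift
    for host in "$@"; do
        printf '%s np=%s\n' "$host" "$np"
    done
}

# np=8 is an assumed processor count
nodes_entries 8 macondo01 macondo03
```

The output would be redirected into /var/spool/torque/server_priv/nodes, followed by a restart of pbs_server.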
Saturday, 3 January 2015
install environment modules in a cluster
use ganglia to monitor the system
1. follow the instructions at to finish the installation.
2. install gexec on each client
3. edit /etc/ganglia/gmond.conf
set gexec = yes
4. restart with sudo service ganglia-monitor restart
Note that the server has to do this last so that all the clients can be found.
One incident: macondo03 went down. After it was rebooted, gstat could not see the other machines. The only way to get everything back to normal was to run "sudo service ganglia-monitor restart" on every client so that the host could find all the machines.
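Restarting the monitor on every machine by hand gets tedious, so here is a sketch that only prints the ssh commands, clients first and the server last. The function name restart_cmds is made up, and the host list is an assumption based on the machine names in these notes.

```shell
# Hypothetical helper: print one ssh restart command per host, in the
# order given (list the server last).
restart_cmds() {
    for host in "$@"; do
        printf 'ssh %s sudo service ganglia-monitor restart\n' "$host"
    done
}

# Dry run: client first, then the server macondo03
restart_cmds macondo01 macondo03
```

Piping the output to sh would actually run the commands, assuming password-less ssh and sudo on each host.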