Wednesday, 25 March 2015

gdb (gfortran --debug) debug

  Intel Fortran on Linux is no longer free, which is annoying for a Fortran user, so I decided to switch to gfortran.
TBH, at the very beginning it is quite annoying to debug through a command-line interface.
  To create an executable that can be debugged, compile with the '-g' flag:

gfortran -g a.F90 -o ab

1.  use gdb (15-06-29)
If gdb does not display the source lines properly, use the -ggdb flag instead.
I also found that all the files have to be compiled by the same compiler on the same machine for gdb to show the code properly.
gfortran -ggdb a.f -o a.out

Then start the debugging session with:

gdb ab

Other useful commands:

break main

run  [r]

step [s]  (step into the subroutine)

print a [p a]

where

quit

next [n]  go to the next line, stepping over subroutine calls rather than into them
n 5   run next 5 times
2.  Info command
info b   list all breakpoints (short for info breakpoints)
info line  show where the program is currently stopped



delete 1   delete breakpoint 1 (numbered as listed by info b)
disable 1  disable breakpoint 1
enable  1  enable breakpoint 1

disable    disable all breakpoints

break sutra.f:333    %stop at line 333 of file sutra.f

frame   display current line

finish  step out of the current subroutine



http://darkdust.net/files/GDB%20Cheat%20Sheet.pdf

http://stackoverflow.com/questions/501486/getting-gdb-to-save-a-list-of-breakpoints
save breakpoints a.txt

How to load the breakpoints back:
source a.txt
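
Since a file loaded with source is just a list of gdb commands, such a file can presumably also be generated by a script instead of being saved from a session. A minimal Python sketch: the function name breakpoint_commands is hypothetical, and the two locations are only examples taken from the commands in these notes.

```python
def breakpoint_commands(locations, pending=True):
    """Build gdb commands that recreate a list of breakpoints."""
    lines = []
    if pending:
        # Let gdb accept breakpoints whose location is not loaded yet.
        lines.append("set breakpoint pending on")
    # One "break <location>" command per requested location.
    lines.extend(f"break {loc}" for loc in locations)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example locations only; replace with your own.
    print(breakpoint_commands(["main", "sutra.f:333"]), end="")
```

Redirect the output to a.txt and load it inside gdb with source a.txt.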


Error: Dummy argument 'cherin' of procedure 'readif' at (1) has an attribute that requires an explicit interface for this procedure


set breakpoint pending on


No symbol table is loaded.  Use the "file" command.


2.  Correct the program. (15-06-29)
TIPS:
  (1) If something is wrong in the source code, there is no need to quit gdb. Just correct the file, rebuild it with the debug flag, and rerun the simulation in gdb using r.

  1. Start the program being debugged. Example 1. The program is printch, which can take an optional command line argument. Start it running with no command line argument.
    (gdb) r
    
    Example 2. Start printch running with command line argument A.
    (gdb) r A
    
  2. Execute a single statement. If the statement is a function call, just single step into the function.
    (gdb) s
    
  3. Execute a single statement. If the statement is a function call, execute the entire function and return to the statement just after the call; that is, step over the function.
    (gdb) n
    
  4. Execute from the current point up to the next breakpoint if there is one, otherwise execute until the program terminates.
    (gdb) c
    
  5. Execute the rest of the current function; that is, step out of the function.
    (gdb) finish


Sunday, 15 February 2015

mflab -- MATLAB reading Excel with xlsread: slowdown problem

1. This is the second time I have found MATLAB doing dodgy things.
Specifically, mflab is extremely slow on some machines. The line that causes the slowdown turns out to be
line 88 h=feval(['COM.' convertedProgID], 'server', machinename, interface);
in
C:\Program Files\MATLAB\R2012b\toolbox\matlab\winfun\actxserver.m
or
C:\Program Files\MATLAB\R2014a\toolbox\matlab\winfun\actxserver.m

Changing it to

line 88 h=feval([convertedProgID], 'server', machinename, interface);
solves the slowdown problem.

2. area object not working.
2015-02-16: changing the fdm3.m file back to the old one makes the area object work.

1. et doesn't work, no matter how many times I check the input.
Solution: restart MATLAB, and use the mflab in the southern folder.
10/03/15

rect_2 is designed to make et very small and start from low; now I am trying to make et very low at the beginning.
Apparently the flow one does not work at all!!!



There is a tricky part in the volumetric flow given by the budget, particularly in the cell where the water table is located: the front or right side flux is not the volumetric flow over the cross-sectional area of the cell, but the volumetric flow over the wetted area of the CELL!!!
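
A small Python sketch of the difference, under my own assumptions: the wetted thickness is taken as the head (capped at the cell top) minus the cell bottom, and the function name face_flux and all the numbers are hypothetical, not taken from any model above.

```python
def face_flux(flow, width, top, bottom, head):
    """Flux through a vertical cell face, using the wetted area.

    flow        : volumetric flow through the face from the budget [L^3/T]
    width       : horizontal width of the face [L]
    top, bottom : cell top and bottom elevations [L]
    head        : hydraulic head in the cell [L]
    """
    # Assumption: only the part of the face below the water table is wet.
    wetted_thickness = min(head, top) - bottom
    if wetted_thickness <= 0:
        raise ValueError("cell is dry: no wetted area")
    return flow / (width * wetted_thickness)


if __name__ == "__main__":
    # Hypothetical cell: 10 wide, 5 thick, water table 2 above the bottom.
    q_wetted = face_flux(flow=2.0, width=10.0, top=5.0, bottom=0.0, head=2.0)
    q_full = 2.0 / (10.0 * (5.0 - 0.0))  # what the full cross section would give
    print(q_wetted, q_full)
```

With these numbers the wetted-area flux is 2.5 times larger than the one based on the full cross section, which is the kind of mismatch to watch for in the water-table cell.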




Comment 1:  if there is only one layer of cells, the first layer SHOULD be set as unconfined rather than varying, so that the result is correct. This is an experience from changing one layer in modflow_no_leak
and modflow_rect_dd_1



Comment 2: if the time-variant specified head is not working properly, in particular if the cells above the time-variant specified head are not rewetted, one should always make sure the cell above the manipulated cell is wet. This can be done either by moving the manipulated cell downward, or by changing the bottom hydraulic head that is being manipulated.


example study:

mf2005\TerwisschaPE     lots of stress periods. theis equation is considered
mf2005\DutchTop\RainLens very good chd package, but the simulation is not working. Need to find a way to get it running, perhaps by going back to a previous version.


Further analysis has indicated that mf2005 performs better than mf2000 in handling wetting and drying. Specifically, when WETDRY is positive, meaning cells from all directions can rewet the dry cell, mf2005 delivers nice results while mf2000 cannot converge.



 *** ERROR OPENING FILE "Q2.HDS" ON UNIT    51
       SPECIFIED FILE STATUS: UNKNOWN
       SPECIFIED FILE FORMAT: BINARY
       SPECIFIED FILE ACCESS: SEQUENTIAL
       SPECIFIED FILE ACTION: READWRITE
  -- STOP EXECUTION (SGWF2BAS7OPEN)

solution:
1. download modflow for unix (
2. in openspec.inc, comment out DATA FORM/'BINARY'/ and uncomment DATA FORM/'UNFORMATTED'/.
3. make the binary file




mf2kgmg.h: In function ‘MF2KGMG_BIGH’:
mf2kgmg.h:448:29: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
pcgn2.f90:26.6:
  USE PCG_MAIN
      1
Fatal Error: Can't open module file 'pcg_main.mod' for reading at (1): No such file or directory
utl7.f:234.14:
         IF ( LVAL.EQ. .FALSE. ) THEN                                
              1
Error: Logicals at (1) must be compared with .eqv. instead of .eq.
utl7.f:263.14:
         IF ( LVAL.EQ. .FALSE. ) THEN                                
              1
Error: Logicals at (1) must be compared with .eqv. instead of .eq.


(1) improve quiver object
$ grep -r -i --include *.m quiver ~/Projects/mflab/trunk/examples/
/home/chenming/Projects/mflab/trunk/examples/Analytic/GGOR/Analytic04.m:if isphi==0 % if averag seepage from second aquiver is given
/home/chenming/Projects/mflab/trunk/examples/mf2k/Qanats/TafiletMdl/mfiles/mf_analyze.m:%gr.quiver(B,1,'k','power',0.5);
/home/chenming/Projects/mflab/trunk/examples/mf2k/Qanats/TafiletMdl/SCENARIOS/mf_analyze.m:%gr.quiver(B,1,'k','power',0.5);

samba share

To use a Linux file system from Windows, a samba share is always the best choice.
The configuration is as follows:

1.  uncomment the following lines in /etc/samba/smb.conf

#======================= Share Definitions =======================

# Un-comment the following (and tweak the other settings below to suit)
# to enable the default home directory shares. This will share each 
# user's home director as \\server\username
[homes]
   comment = Home Directories
   browseable = no

2. create new users with

#smbpasswd -a chenming

3. restart samba process by:

#systemctl restart smbd
#systemctl restart nmbd
4. test the connections by: 
#smbclient -L localhost

I once got this error:
ae429-1105 chenming # smbclient -L 10.33.20.12X
Enter root's password: 
Connection to 10.33.20.120 failed (Error NT_STATUS_CONNECTION_REFUSED)
The problem was solved by restarting samba.



5. now it is time to mount the samba share: right-click My Computer -> Map network drive -> \\10.33.20.120\chenming, then enter the password


Tuesday, 13 January 2015

modelmuse study

Although I have been using SUTRA for a long time, I still find that a good GUI for preprocessing is lacking. A recent project allowed me to search further for an effective tool to create quad meshes.


First, the system lacks an input prompt for entering accurate coordinates of an object. The workaround is to enter them manually in the gpt file.

vertical exaggeration=vertical /horizontal ratio


Porosity:   select the biggest object-> double click-> evaluated at nodes-> datasets tab-> initial head,
Show nodal no ->


At the moment nature is not able to run.
Reason: U-solution inferred from matrix equation a*u=0; solver not called
so I set another

nature_reset.UFluxBcs is always accepted, but the other ones are not working at the moment.

initial head working 9800. * (240-Y)

Monday, 5 January 2015

Setup torque/maui system _debug the system

This one follows my previous article on setting up the torque system. However, I found that torque 2.6.1 in Ubuntu is out of date and not working properly. To circumvent this problem, I decided to move to torque/maui for better scheduling efficiency.
http://www.adaptivecomputing.com/support/download-center/torque-download/
I also noticed that Adaptive Computing is not maintaining torque and maui any more, which means bugs will not be fixed. The ultimate solution really is to move to slurm or Sun Grid Engine.


First, Download torque and maui from their websites:

maui has to be installed after torque installation

error 1:
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: log_mutex

solution:
echo '/usr/local/lib' > /etc/ld.so.conf.d/torque.conf
ldconfig


error 2:
socket_connect_unix failed: 15137
qstat: cannot connect to server (null) (errno=15137) could not connect to trqauthd

solution: make sure trqauthd is running with pbs_mom

error 3: at the client
pbs_mom
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: dis_getc


error 4 at the client
./torque-mom start
 * Starting Torque Mom torque-mom
/usr/sbin/pbs_mom: symbol lookup error: /usr/sbin/pbs_mom: undefined symbol: dis_getc
   ...fail!
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: dis_getc

solution:
ldd /usr/local/sbin/pbs_mom
        linux-vdso.so.1 =>  (0x00007fff9f7ff000)
        libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f2abbbed000)
        libtorque.so.2 => /usr/local/lib/libtorque.so.2 (0x00007f2abb2f6000)
        libxml2.so.2 => /usr/lib/x86_64-linux-gnu/libxml2.so.2 (0x00007f2abaf99000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2abad7c000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f2abab74000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2aba873000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2aba577000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2aba361000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2ab9fa1000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2ab9d9d000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f2ab9b86000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f2abbe0b000)

The solution so far is to reinstall torque 5.0.1. 2015-05-27: it took the whole morning to fix it.
This happened again on 2015-09-28.
The right binary is located at
/usr/local/sbin/pbs_mom
Running that one directly should be OK.
The dis_getc error comes from the old package installed by apt-get.
First, remove the torque from the apt repo: apt-get remove torque-mom
Now if you run pbs_mom you will see
./pbs_mom
pbs_mom: LOG_ERROR::No such file or directory (2) in chk_file_sec, Security violation with "/var/spool/torque/checkpoint" - /var/spool/torque/checkpoint cannot be lstat'd - errno=2, No such file or directory

then reinstall torque-5.0.1-1_4fa836f5
torque-package-clients-linux-x86_64.sh  --install
torque-package-mom-linux-x86_64.sh  --install







question 1:
limit the maximum processes per user
http://docs.adaptivecomputing.com/maui/6.2throttlingpolicies.php


install pam torque
libtool --finish /lib64/security
/lib64/security/ is the place where pam files are located
/etc/security/access.conf give access to anyone you wish to give



set maui to limit the jobs and processes per user

USERCFG[DEFAULT] MAXPROC=64 MAXJOB=5 #working

#GROUPCFG[useraid] MAXJOB[USER]=5  # not working
#CLASSCFG[batch] MAXJOB[USER]=5   working


CLASSCFG[batch] MAXJOB[USER]=5 MAXPROC=64  # not working


Working solution: use pam to prevent users from logging into compute nodes,
letting some users into the compute nodes while keeping the others out

versions: torque-5.0.1-1_4fa836f5 maui-3.3.tar.gz
The official tutorial at http://docs.adaptivecomputing.com/torque/3-0-5/3.4hostsecurity.php
says

1. first configure torque with ./configure --with-pam

2.
In /etc/pam.d/sshd:
account required pam_pbssimpleauth.so
account required pam_access.so

and
3.
In /etc/security/access.conf make sure all users who access the compute node are added to the configuration. This is an example which allows the users root, george, allen, and michael access.
-:ALL EXCEPT root george allen michael torque:ALL


However, I found this method too strict: none of root, george, or allen could log into the compute node.

my solution:

1. there is no need to reinstall torque with  ./configure --with-pam

2. put
account required     pam_access.so
 into /etc/pam.d/sshd

which means pam_access is consulted for each ssh login

3. put

-:ALL EXCEPT root szhang czhang storres torque:ALL
into /etc/security/access.conf
now only the users listed in that line (root, szhang, czhang, storres, torque) can log into compute nodes

I think this approach works and is understandable: at the moment all job submission is done by pbs_mom, which runs as root, so pam_pbssimpleauth.so does not need to take effect.
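
To make the effect of that access.conf line easier to reason about, here is a rough Python sketch of how I understand such a rule to be evaluated. It is a simplification under stated assumptions: only plain user names, the ALL keyword and a single EXCEPT are handled, only rules whose origin is ALL are modelled, and the function names are hypothetical. Groups, netgroups and host origins are ignored.

```python
def user_field_matches(users_field, user):
    """Check a user against a field such as 'ALL EXCEPT root szhang'."""
    tokens = users_field.split()
    if "EXCEPT" in tokens:
        split_at = tokens.index("EXCEPT")
        included, excluded = tokens[:split_at], tokens[split_at + 1:]
    else:
        included, excluded = tokens, []

    def listed(names):
        return "ALL" in names or user in names

    return listed(included) and not listed(excluded)


def login_allowed(rules, user):
    """Apply the rules in order; the first matching rule decides."""
    for line in rules:
        permission, users, origins = (f.strip() for f in line.split(":", 2))
        if origins != "ALL":
            continue  # other origins are not modelled in this sketch
        if user_field_matches(users, user):
            return permission == "+"
    # Assumption: when no rule matches, access is granted.
    return True


if __name__ == "__main__":
    rule = "-:ALL EXCEPT root szhang czhang storres torque:ALL"
    for name in ("szhang", "george"):
        print(name, login_allowed([rule], name))
```

The single deny rule matches everybody who is not in the EXCEPT list, so listed users fall through to the default and everyone else is refused.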







reload maui

Just restart it; it won't affect the queue.

pkill maui && qterm -t quick && sleep 5&& /usr/local/maui/sbin/maui && pbs_server && ps aux |grep maui


showres working
showres -n
checkjob 810 working
checknode macondo01  % very good feedback
showgrid AVGXFACTOR
showstats
mbal this will kill maui!!!!!!!!!!!!!!!
mdiag same as diagnose

I still don't get the idea of maxnode. Does it mean all jobs for one person have to go to one particular node?

mjobct
ERROR:    corrupt command received


mclient
ERROR:    unknown command: 'mclient'

mprof
USAGE ERROR:  (tracefile not specified)

mstat
ERROR:  command 'mstat' args not handled
ERROR:    service 36 not handled
ERROR:    Service[36] 'mstat' not implemented

showbf
backfill window (user: 'czhang' group: 'useraid' partition: ALL) Sun Jan 18 15:25:07

231 procs available for    7:11:35:38
175 procs available for   21:18:13:37
118 procs available for   40:14:55:01
 62 procs available for   40:21:06:15



diagnose -j | grep -o -P '(?<=job \047).*(?=\047 utilizes more procs than)'
# this line finds all the jobs for which warnings are raised.

diagnose -j
Name                  State Par Proc QOS     WCLimit R  Min     User    Group  Account  QueuedTime  Network  Opsys   Arch    Mem   Disk  Procs       Class Features

381                 Running DEF    1 DEF 10:00:00:00 1    1    cwang  useraid uq-Civil    00:49:21   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
569                 Running DEF    1 DEF 25:00:00:00 1    1   pzhang  useraid uq-Civil    00:49:21   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
WARNING:  job '569' utilizes more procs than dedicated (10.35 > 1)
650                 Running DEF    1 DEF 41:16:00:00 1    1 mgholami  useraid uq-Civil    00:49:20   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
WARNING:  job '650' utilizes more procs than dedicated (13.00 > 1)
651                 Running DEF    1 DEF 41:16:00:00 1    1 mgholami  useraid uq-Civil    00:49:20   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
WARNING:  job '651' utilizes more procs than dedicated (10.28 > 1)
669                 Running DEF    1 DEF 41:16:00:00 1    1 mgholami  useraid uq-Civil    00:49:19   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
WARNING:  job '669' utilizes more procs than dedicated (14.00 > 1)
671                 Running DEF    1 DEF 25:00:00:00 1    1   pzhang  useraid uq-Civil    00:49:21   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
WARNING:  job '671' utilizes more procs than dedicated (9.57 > 1)
672                 Running DEF    1 DEF 25:00:00:00 1    1   pzhang  useraid uq-Civil    00:49:21   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
WARNING:  job '672' utilizes more procs than dedicated (7.80 > 1)


\047 is the octal ASCII code for the single quote

diagnose -j | grep -o -P '(?<=than dedicated \050).*(?=>)'

\050 is the octal ASCII code for the left parenthesis

adse=$(diagnose -j | grep -o -P '(?<=than dedicated \050).*(?=>)')
store result into adse
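
The same two lookbehind patterns can be tried out in Python, whose re module accepts the same lookaround syntax. This is only a sketch: the sample text is two WARNING lines copied from the diagnose -j output above, the function name overcommitted is hypothetical, and the second pattern is slightly changed to stop before the space in front of ">".

```python
import re

# Two WARNING lines copied from the diagnose -j output above.
SAMPLE = """\
WARNING:  job '569' utilizes more procs than dedicated (10.35 > 1)
WARNING:  job '650' utilizes more procs than dedicated (13.00 > 1)
"""

# Everything between "job '" and "' utilizes more procs than".
JOB_ID = re.compile(r"(?<=job ').*(?=' utilizes more procs than)")
# Everything between "than dedicated (" and " >".
USED_PROCS = re.compile(r"(?<=than dedicated \().*(?= >)")


def overcommitted(text):
    """Return (job id, procs in use) for every warning in the text."""
    job_ids = JOB_ID.findall(text)
    used = [float(value) for value in USED_PROCS.findall(text)]
    return list(zip(job_ids, used))


if __name__ == "__main__":
    # In a normal (non-raw) Python string, \047 and \050 are octal escapes too.
    print("\047", "\050")
    print(overcommitted(SAMPLE))
```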

if [ "$a" != "$b" ]
then
  echo "$a is not equal to $b."
  echo "(string comparison)"
  #     "4"  != "5"
  # ASCII 52 != ASCII 53
fi

#!/bin/bash
x=5.0
y=3.0
#ans= $(( $x + $y |bc  ))
#ans=$(echo  $x + $y |bc )
#ans=$(echo  $x / $y |bc -l )   # this ends up with good result
#ans=$(echo  $x / $y |bc  )     # this does not give good result

#ans=$(python -c "print $x / $y")    # this one is also ok but format is a problem

#ans=$(python -c "print( "%.2f"     %($x / $y) ) ")  #failed
#alpha=`echo "$a/100" | bc -l | awk '{printf("%06.2f", $1);}'`
ans=`echo "$x/$y" | bc -l | awk '{printf("%6.4f", $1);}'`
echo "$x / $y = $ans"
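
The python -c attempt marked as failed above most likely breaks because the inner double quotes around %.2f end the shell's double-quoted string. Keeping the formatting in a small Python file avoids the quoting problem altogether. A minimal sketch, with the hypothetical function name ratio and the same x and y as in the script above:

```python
def ratio(x, y, digits=4):
    """Return x / y as a string with a fixed number of decimals."""
    return f"{x / y:.{digits}f}"


if __name__ == "__main__":
    x, y = 5.0, 3.0
    print(f"{x} / {y} = {ratio(x, y)}")  # prints: 5.0 / 3.0 = 1.6667
```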


maui is starting to be deprecated. Use Sun Grid Engine (SGE, now Oracle Grid Engine; Rocks cluster uses it) or slurm instead.

It feels to me that the soft/hard limit only works for groups, not for users.
/usr/local/maui
http://www.physics.oregonstate.edu/cluster_install

Problem 2016-01-12:
when running trqauthd:
trqauthd: symbol lookup error: trqauthd: undefined symbol: debug_mode
This happened on the server, which had been running for a few days. Once trqauthd was killed, it could not be restarted properly.



root@macondo03:/home/users/uqczhan2#  trqauthd
trqauthd: symbol lookup error: trqauthd: undefined symbol: debug_mode
root@macondo03:/home/users/uqczhan2# pbs_server
pbs_server: symbol lookup error: pbs_server: undefined symbol: job_log_mutex
root@macondo03:/home/users/uqczhan2# pbs_mom
pbs_mom: symbol lookup error: pbs_mom: undefined symbol: log_mutex
root@macondo03:/home/users/uqczhan2# which trqauthd
/usr/local/sbin/trqauthd
root@macondo03:/home/users/uqczhan2# pbs_
pbs_demux    pbs_mom      pbs_restart  pbs_sched    pbs_server   pbs_track  
root@macondo03:/home/users/uqczhan2# pbs_sched
pbs_sched: symbol lookup error: pbs_sched: undefined symbol: log_mutex
root@macondo03:/home/users/uqczhan2# pbs_restart
Cannot connect to default server host 'macondo03' - check pbs_server daemon.
qterm: could not connect to server '' (1) Operation not permitted


 torque-package-server-linux-x86_64.sh
we get pbs_sched  pbs_server  qschedd  qserverd

./torque-package-mom-linux-x86_64.sh --install

Installing TORQUE archive... 

Done.
root@macondo03:/home/user/uqczhan2/czhang/Downloads/torque-5.0.1-1_4fa836f5# ls /usr/local/sbin
momctl  pbs_demux  pbs_mom  pbs_sched  pbs_server  qnoded  qschedd  qserverd
solution:
ldd trqauthd
        linux-vdso.so.1 =>  (0x00007ffcf55e1000)
        libtorque.so.2 => /usr/local/lib/libtorque.so.2 (0x00007f365ed33000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f365eb16000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f365e816000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f365e458000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f365e250000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f365df54000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f365dd3e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f365f62a000)
Today the problem was resolved again:

In fact the FDS model caused the system to hang: it changes the path that libtorque.so.2 resolves to, so trqauthd stops working.
Solution: I removed everything associated with FDS from .bashrc (from LD_LIBRARY_PATH), then checked ldd trqauthd. The correct output should be the same as the one above.
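
The ldd check can be automated with a few lines of Python. This is a sketch under assumptions: the expected path /usr/local/lib/libtorque.so.2 is the one shown in the ldd output above, the function names are hypothetical, and the /opt/fds path in the demo is an invented example of a wrong resolution, not something taken from the real system.

```python
import re

# Matches lines such as "libtorque.so.2 => /usr/local/lib/libtorque.so.2 (0x...)".
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(/\S+)\s+\(0x[0-9a-f]+\)\s*$")

EXPECTED_LIBTORQUE = "/usr/local/lib/libtorque.so.2"


def resolved_libs(ldd_output):
    """Map each library name to the path ldd resolved it to."""
    libs = {}
    for line in ldd_output.splitlines():
        match = LDD_LINE.match(line)
        if match:
            libs[match.group(1)] = match.group(2)
    return libs


def libtorque_ok(ldd_output):
    """True when libtorque.so.2 resolves to the expected install path."""
    return resolved_libs(ldd_output).get("libtorque.so.2") == EXPECTED_LIBTORQUE


if __name__ == "__main__":
    good = "        libtorque.so.2 => /usr/local/lib/libtorque.so.2 (0x00007f365ed33000)\n"
    # Hypothetical wrong path, as might happen with a modified LD_LIBRARY_PATH.
    bad = "        libtorque.so.2 => /opt/fds/lib/libtorque.so.2 (0x00007f365ed33000)\n"
    print(libtorque_ok(good), libtorque_ok(bad))
```

To run it on the real binary, pass in the text returned by subprocess.run(["ldd", "/usr/local/sbin/trqauthd"], capture_output=True, text=True).stdout.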

Also, after the restore, there was a bit of a problem restarting pbs_mom, pbs_server and pbs_sched.
Solution:
First, apt-get remove torque-mom torque-server torque-sched, to make sure the torque from the apt system is not installed.
Second, reinstall torque 5.0.1 by configure, make, make install.
Run them one by one.
Below are the errors that appear when running pbs_mom, pbs_server and pbs_sched.
 pbs_mom
pbs_mom: LOG_ERROR::No such file or directory (2) in chk_file_sec, Security violation with "/var/spool/torque/checkpoint" - /var/spool/torque/checkpoint cannot be lstat'd - errno=2, No such file or directory


For pbs_server and pbs_sched, once started, they do not show up as processes in the system.


Once torque 5.0.1 was reinstalled the problem was resolved. 2016-01-12

problem
pbsnodes
pbsnodes: Server has no node list MSG=node list is empty - check 'server_priv/nodes' file



cd /var/spool/torque/server_priv



Saturday, 3 January 2015

install environment modules in a cluster

http://linuxcluster.wordpress.com/2012/11/05/installing-and-configuting-environment-modules/

use ganglia to monitor the system

1. follow the instruction at https://www.digitalocean.com/community/tutorials/introduction-to-ganglia-on-ubuntu-14-04 to finish the installation.

2. install gexec in each client

3. change /etc/ganglia/gmond.conf
  change    gexec= yes

4. reboot by sudo service ganglia-monitor restart

Note that the server has to do this last so that all the clients can be found.


One incident: macondo03 went down. After it was rebooted, gstat could not see the other machines. The only way to get everything back to normal was to run "sudo service ganglia-monitor restart" on every client so that the host could find all the machines.