Monitoring
This page includes some backend-specific instructions for monitoring the different sites.
Monitoring sites
UK ARC INFO¶
./src/pyHepGrid/extras/get_site_info.py
Grid storage management
Duncan has written a wrapper around the gfal commands to simplify manual navigation of the DPM filesystems. A frequently updated version can be found at dpm_manager.
dpm_manager <gfal_dir> -s <file search terms> -r <file reject terms> [-cp (copy to gridui)] [-rm (delete from grid storage)] [-j (number threads)] [-cpg (copy from gridui to storage)] ...
More info is given in the help text (dpm_manager -h).
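The -s and -r options amount to substring include/exclude matching over file names. As a rough illustration of that behaviour only (not dpm_manager's actual code; filter_files is a hypothetical name):

```python
def filter_files(filenames, search_terms=(), reject_terms=()):
    """Keep names containing every search term and none of the reject
    terms (sketch of dpm_manager-style -s/-r filtering)."""
    return [
        name for name in filenames
        if all(term in name for term in search_terms)
        and not any(term in name for term in reject_terms)
    ]

files = [
    "NNLOJET.LO.run1.tar.gz",
    "NNLOJET.NLO.run1.tar.gz",
    "NNLOJET.LO.run2.log",
]
# keep LO output files, drop the log
kept = filter_files(files, search_terms=[".LO."], reject_terms=["log"])
```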
NNLOJET distributed warmup
NNLOJET can run warmups in distributed mode by using the vegas-socket implementation.
Compile NNLOJET with sockets=true to enable distribution.
Set up the server:
./vegas_socket.py -p PORT -N NO_CLIENTS -w WAIT
Use one unique port for each separate NNLOJET run (rough range 7000-9999).
The wait parameter is the time limit (in seconds), counted from the first client registering with the server, before starting without all clients present. If not set, the server will wait forever! Jobs die if they try to join after this wait limit.
NB: the wait parameter can't be used for distribution elsewhere, as it relies on nnlorun.py (the grid script) in jobscripts to kill the run.
The server is automatically killed on finish by nnlorun.py (the grid script).
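The wait behaviour described above can be pictured as a minimal accept loop: block indefinitely for the first client, then give the rest at most WAIT seconds to register. This is an illustration of the timeout logic only, not vegas_socket.py's actual code, and wait_for_clients is a hypothetical name:

```python
import socket
import time

def wait_for_clients(port, n_clients, wait=None, host="127.0.0.1"):
    """Accept up to n_clients connections. After the first client
    registers, start anyway once `wait` seconds have passed; with
    wait=None, wait forever for all clients (sketch only)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(n_clients)
    clients = []
    first_seen = None
    while len(clients) < n_clients:
        if first_seen is None:
            srv.settimeout(None)  # block forever for the first client
        else:
            if wait is not None:
                remaining = wait - (time.monotonic() - first_seen)
                if remaining <= 0:
                    break  # timed out: start without the stragglers
                srv.settimeout(remaining)
            else:
                srv.settimeout(None)
        try:
            conn, _addr = srv.accept()
        except socket.timeout:
            break
        clients.append(conn)
        if first_seen is None:
            first_seen = time.monotonic()
    srv.close()
    return clients
```

Clients that try to connect after the loop has exited find no listening socket, which mirrors how late jobs die once the real server has started.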
In the grid runcard, set sockets_active >= N_clients and port = the port number the server is listening on. pyHepGrid will submit sockets_active jobs; if this is more than the number of sockets the server is looking for, any excess jobs will be killed as soon as they start running.
In the NNLOJET runcard, the number of events is the total number of events, to be divided amongst the clients.
The server must be set up on the same gridui as defined in the header parameter server_host, otherwise the server will never be found by the running jobs.
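Putting the settings above together, a socketed run might look something like the following. These are illustrative values only (the hostname is hypothetical, and the exact runcard/header syntax may differ in your setup):

```
# grid runcard (illustrative values)
sockets_active = 8    # >= the -N given to vegas_socket.py; jobs to submit
port = 7001           # must match the -p given to vegas_socket.py

# pyHepGrid header
server_host = "gridui1.example"   # hypothetical: the gridui running vegas_socket.py
```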
Hamilton queues
There are multiple queues I suggest using on the HAMILTON cluster:
par6.q
- 16 cores per node [set warmupthr = 16]
- No known job # limit, so you can chain as many nodes as you like to turbocharge warmups
- 3 day job time limit

par7.q
- 24 cores per node [set warmupthr = 24]
- # jobs limited to 12, so you can use a maximum of 12*24 cores at a given time
- 3 day job time limit

openmp7.q NOT RECOMMENDED
- 58 cores total: this is tiny, so I would recommend par7 or par6
- # jobs limited to 12
- No time limit
- Often in competition with other jobs, so not great for sockets
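As a quick sanity check on the numbers above (a throwaway sketch, not part of pyHepGrid):

```python
# cores per node for the Hamilton queues listed above
CORES_PER_NODE = {"par6.q": 16, "par7.q": 24}

# warmupthr should match the cores per node of the chosen queue
warmupthr = CORES_PER_NODE["par7.q"]

# par7.q allows at most 12 concurrent jobs
max_par7_cores = 12 * CORES_PER_NODE["par7.q"]  # 288 cores at a given time
```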
Use NNLOJET ludicrous mode if possible: it gives a reasonable speedup when using 16-24 core nodes in warmups.
Current monitoring info can be found using the sfree command, which gives the number of cores in use at any given time.