Monitoring

This page includes some backend-specific instructions for monitoring the different sites.

Grid storage management

Duncan has written a wrapper around the gfal commands to simplify manual navigation of the DPM filesystems. A frequently updated version can be found at dpm_manager.

dpm_manager <gfal_dir> -s <file search terms> -r <file reject terms> [-cp (copy to gridui)] [-rm (delete from grid storage)] [-j (number threads)] [-cpg (copy from gridui to storage)] ...

More information is given in the help text (dpm_manager -h).
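
For example, to search a storage directory for files whose names contain "NNLO" while rejecting those containing "RRa", and copy the matches to the gridui using 8 threads (the directory and search terms here are hypothetical; the flags are those from the usage line above):

dpm_manager mydir/NNLOJET -s NNLO -r RRa -j 8 -cp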

NNLOJET distributed warmup

NNLOJET can run warmups in distributed mode by using the vegas-socket implementation. The setup steps are listed below, with a worked example sketched after the list.

  • compile NNLOJET with sockets=true to enable distribution

  • set up the server: ./vegas_socket.py -p PORT -N NO_CLIENTS -w WAIT

    • use one unique port for each separate NNLOJET run (rough range 7000-9999)

    • the wait parameter is a time limit in seconds, measured from the first client registering with the server, after which the run starts without all clients present. If not set, the server will wait forever! Jobs die if they try to join after this wait limit.

      NB: the wait parameter can't be used for distribution elsewhere, since it relies on nnlorun.py (the grid script) in jobscripts to kill the run

    • The server is automatically killed on finish by nnlorun.py (the grid script)

  • In the grid runcard, set sockets_active >= N_clients and set port to the port number the server is listening on.

    • sockets_active is used as the number of jobs to send. If this is more than the number of sockets the server is looking for, any excess jobs will be killed as soon as they start running

  • The number of events in the NNLOJET runcard is the total number of events (to be divided amongst the clients)

  • The server must be set up on the same gridui as defined in the header parameter server_host, otherwise it will never be found by the running jobs.
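
As a sketch, the full chain for a 10-client warmup might look as follows. The port number, client count, and wait time are example values; the build is assumed to pass sockets=true through make; the runcard lines assume Python-style assignments, and the server_host value is a hypothetical hostname:

# on the gridui named in server_host
make sockets=true                          # assumed make-based build with the sockets flag
./vegas_socket.py -p 7001 -N 10 -w 3600    # port 7001, 10 clients, start anyway after 1 hour

# in the grid runcard
sockets_active = 10        # >= the number of clients the server waits for
port = 7001                # must match the port the server listens on
server_host = "gridui1"    # hypothetical; must be the gridui running the server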

Hamilton queues

There are multiple queues I suggest using on the HAMILTON cluster:

  • par6.q

    • 16 cores per node [set warmupthr = 16]

    • No known limit on the number of jobs, so you can chain as many nodes as you like to turbocharge warmups

    • 3 day job time limit

  • par7.q

    • 24 cores per node [set warmupthr = 24; see the snippet after this list]

    • Number of jobs limited to 12, so you can use a maximum of 12*24 = 288 cores at any given time

    • 3 day job time limit

  • openmp7.q NOT RECOMMENDED

    • 58 cores total - this is tiny, so I would recommend par7 or par6

    • # jobs limited to 12

    • No time limit

    • Often in competition with other jobs, so not great for sockets
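
For example, when targeting par7.q the matching runcard line would be the following (assuming the Python-style assignment syntax of the grid runcard):

warmupthr = 24   # one thread per core on a par7.q node; use 16 for par6.q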

Use NNLOJET ludicrous mode if possible - it gives a reasonable speedup when using 16-24 core nodes in warmups.

Current monitoring info can be found using the sfree command, which gives the number of cores in use at any given time.