.. _monitor-label:

===============
Monitoring
===============

This page includes some backend-specific instructions for monitoring the
different sites.

.. contents::
   :local:
   :depth: 3

Monitoring sites
================

Durham ARC monitoring website
-----------------------------

https://grafana.mon.dur.scotgrid.ac.uk/d/LNUGi5yWk/general-grid

DIRAC monitoring website
------------------------

https://dirac.gridpp.ac.uk:8443/DIRAC/

UK ARC INFO
-----------

Status information for the UK ARC sites can be retrieved with the bundled
script:

.. code-block:: bash

    ./src/pyHepGrid/extras/get_site_info.py

Grid storage management
=======================

Duncan has written a wrapper around the gfal commands to simplify manual
navigation of the DPM filesystems. A frequently updated version can be found
in the ``dpm_manager`` repository.

.. code-block:: bash

    dpm_manager -s <search terms> -r <reject terms> [-cp (copy to gridui)] [-rm (delete from grid storage)] [-j (number threads)] [-cpg (copy from gridui to storage)] ...

More info is given in the help text (``dpm_manager -h``).

NNLOJET distributed warmup
==========================

NNLOJET can run warmups in distributed mode by using the vegas-socket
implementation.

- Compile NNLOJET with ``sockets=true`` to enable distribution.
- Set up the server: ``./vegas_socket.py -p PORT -N NO_CLIENTS -w WAIT``.
- Use one unique port for each separate NNLOJET run going on [rough range
  7000-9999].
- The wait parameter is a time limit (in seconds), counted from the first
  client registering with the server, after which the run starts without all
  clients present. If not set, the server will wait *forever*! Jobs die if
  they try to join after this wait limit. NB: the wait parameter can't be
  used for distribution elsewhere, as it relies on ``nnlorun.py`` (the grid
  script) in the jobscripts to kill the run.
- The server is automatically killed on finish by ``nnlorun.py`` (the grid
  script).
- In the grid runcard, set ``sockets_active >= N_clients`` and set ``port``
  to the port number the server is listening on.
- ``sockets_active`` is sent as the number of jobs. If this is more than the
  number of sockets the server is looking for, any excess jobs will be killed
  as soon as they start running.
- The NNLOJET runcard ``# events`` setting is the total number of events,
  which is divided amongst the clients.
- The server must be set up on the same gridui as defined in the header
  parameter ``server_host``; otherwise the server will never be found by the
  running jobs.

Hamilton queues
===============

.. TODO is this up to date/still relevant?

There are multiple queues I suggest using on the Hamilton cluster:

- par6.q

  * 16 cores per node [set warmupthr = 16]
  * No known job # limit, so you can chain as many nodes as you like to
    turbocharge warmups
  * 3 day job time limit

- par7.q

  * 24 cores per node [set warmupthr = 24]
  * # jobs limited to 12, so you can use a maximum of 12*24 cores at a
    given time
  * 3 day job time limit

- openmp7.q **NOT RECOMMENDED**

  * 58 cores total - this is tiny, so I would recommend par7 or par6
  * # jobs limited to 12
  * No time limit
  * Often in competition with other jobs, so not great for sockets

Use NNLOJET ludicrous mode if possible; it gives a reasonable speedup when
using 16-24 core nodes for warmups.

Current monitoring info can be found using the ``sfree`` command, which gives
the number of cores in use at any given time.
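As a concrete illustration of the distributed warmup setup described above, a
server launch could look like the following sketch; the port, client count,
and wait window are illustrative values, not defaults:

.. code-block:: bash

    # Run on the gridui named in the runcard's server_host parameter.
    # Port 7001 is an arbitrary choice from the rough 7000-9999 range;
    # use a unique port per NNLOJET run. -N 10 waits for 10 clients, and
    # -w 300 starts the run anyway 300 s after the first client registers.
    ./vegas_socket.py -p 7001 -N 10 -w 300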
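The matching grid runcard would then set ``port = 7001`` and
``sockets_active`` to at least 10. There is no need to clean up the server by
hand afterwards, since ``nnlorun.py`` kills it when the run finishes.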