
Commit 952668f

Improve deployment documentation
1 parent da2cb14 commit 952668f

15 files changed

Lines changed: 162 additions & 138 deletions

File tree

docs/deployment/allocation.md

Lines changed: 68 additions & 49 deletions
Large diffs are not rendered by default.

docs/deployment/cloud.md

Lines changed: 32 additions & 24 deletions
Original file line number · Diff line number · Diff line change
@@ -1,46 +1,54 @@
1-
# Starting HQ without shared file system
1+
# Starting HQ without a shared filesystem
22

3-
On system without shared file system, all what is needed is to distribute access file (`access.json`) to clients and workers.
4-
This file contains address and port where server is running and secret keys.
5-
By default, client and worker search for `access.json` in `$HOME/.hq-server`.
3+
By default, HyperQueue assumes the existence of a shared filesystem, which it uses to exchange metadata required to connect servers and workers and to run various HQ commands.
64

7-
## Generate access file in advance
5+
On systems without a shared filesystem, you will have to distribute an *access file* (`access.json`) to clients and workers.
6+
This file contains the address and port where the server is running, and also secret keys required for encrypted communication.
87

9-
In many cases you, we want to generate an access file in advance before any server is started;
10-
moreover, we do not want to regenerate secret keys in every start of server,
11-
because we do not want to redistribute access when server is restarted.
8+
## Sharing the access file
129

13-
To solve this, an access file can be generated in advance by command "generate-access", e.g.:
10+
After you start a server, you can find its `access.json` file in the `$HOME/.hq-server/hq-current` directory. You can then copy it to a different filesystem using a method of your choosing, and configure clients and workers to use that file.
1411

15-
```commandline
12+
By default, clients and workers search for the `access.json` file in the `$HOME/.hq-server` directory, but you can override that using the `--server-dir` argument, which is available for all `hq` CLI commands. If you moved the `access.json` file into a directory called `/home/foo/hq-access` on the worker's node, you should start the worker like this:
13+
14+
```bash
15+
$ hq --server-dir=/home/foo/hq-access worker start
16+
```
17+
18+
!!! tip
19+
20+
You can also configure the server directory using an [environment variable](./server.md#server-directory).
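As a sketch, assuming the environment variable is named `HQ_SERVER_DIR` (this name is an assumption; check the linked page, and if it differs, prefer the `--server-dir` flag):

```bash
# Hypothetical variable name; see the linked server page
$ export HQ_SERVER_DIR=/home/foo/hq-access
$ hq worker start
```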
21+
22+
## Generate an access file in advance
23+
24+
In some cases you might want to generate the access file in advance, before the server is started, and let the server, clients and workers use that access file. This can be useful so that you don't have to redistribute the access file to client/worker nodes every time the server restarts, which could be cumbersome.
25+
26+
To achieve this, an access file can be generated in advance by the `generate-access` command:
27+
28+
```bash
1629
$ hq server generate-access myaccess.json --client-port=6789 --worker-port=1234
1730
```
1831

19-
This generates `myaccess.json` that contains generates keys and host information.
32+
This generates a `myaccess.json` file that contains generated keys and host information.
2033

2134
The server can be later started with this configuration as follows:
2235

23-
```commandline
36+
```bash
2437
$ hq server start --access-file=myaccess.json
2538
```
2639

27-
Note: That server still generates and manages "own" `access.json` in the server directory path.
28-
For connecting clients and workers you can use both, `myaccess.json` or newly generated `access.json`, they are same.
40+
Clients and workers should load the pre-generated access file in the same way as described [above](#sharing-the-access-file). However, you will have to rename the generated file to `access.json`, because clients and workers look it up by that exact name in the provided server directory.
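For example, mirroring the worker setup shown earlier (`/mydirectory` is a stand-in path):

```bash
$ mv myaccess.json /mydirectory/access.json
$ hq --server-dir=/mydirectory worker start
```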
2941

30-
Example of starting a worker from `myaccess.json`
42+
!!! note
43+
44+
The server will still generate and manage its own `access.json` in the server directory path, even if you provide your own access file. These files are the same, so you can use either when connecting clients and workers.
3145

32-
```commandline
33-
$ mv myaccess.json /mydirectory/access.json
34-
$ hq --server-dir=/mydirectory worker start
35-
```
3646

3747
## Splitting access for client and workers
3848

39-
Access file contains two secret keys and two points to connect, for clients and for workers.
40-
This information can be divided into two separate files,
41-
containing only information needed only by clients or only by workers.
49+
The default access file contains two secret keys and two TCP/IP addresses, one for clients and one for workers. This metadata can be divided into two separate files, each containing only the information needed by clients or by workers.
4250

43-
```commandline
51+
```bash
4452
$ hq server generate-access full.json --client-file=client.json --worker-file=worker.json --client-port=6789 --worker-port=1234
4553
```
4654

@@ -56,6 +64,6 @@ For starting server (`hq server start --access-file=...`) you have to use `full.
5664

5765
You can use the following command to configure different hostnames under which the server is visible to workers and clients.
5866

59-
```commandline
67+
```bash
6068
hq server generate-access full.json --worker-host=<WORKER_HOST> --client-host=<CLIENT_HOST> ...
6169
```

docs/deployment/index.md

Lines changed: 2 additions & 0 deletions
@@ -16,3 +16,5 @@ not required. A common use-case is to start the server on a login of an HPC syst
1616
[comment]: <> (TODO: describe scheduler)
1717

1818
Learn more about deploying [server](server.md) and the [workers](worker.md).
19+
20+
There is also a third component that we call the **client**: users of HyperQueue who invoke various `hq` commands to communicate with the server.

docs/deployment/server.md

Lines changed: 11 additions & 16 deletions
@@ -44,19 +44,16 @@ $ hq --server-dir=foo worker start
4444
$ hq worker start &
4545
```
4646

47-
!!! important
48-
49-
When you start the server, it will create a new subdirectory in the server directory, which will store the data of the current running instance. It will also create a symlink `hq-current` which will point to the currently active
50-
subdirectory.
51-
Using this approach, you can start a server using the same server directory multiple times without overwriting data
52-
of the previous runs.
53-
5447
!!! danger "Server directory access"
5548

5649
Encryption keys are stored in the server directory. Whoever has access to the server directory may submit jobs,
5750
connect workers to the server and decrypt communication between HyperQueue components. By default, the directory is
5851
only accessible by the user who started the server.
5952

53+
## Running multiple servers
54+
55+
When you start the server, it will create a new subdirectory in the server directory, which will store the data of the current running instance. It will also create a symlink `hq-current` which will point to the currently active subdirectory. Using this approach, you can start a server using the same server directory multiple times without overwriting data of the previous runs.
56+
6057
## Keeping the server alive
6158

6259
The server is supposed to be a long-lived component. If you shut it down, all workers will disconnect and all
@@ -98,7 +95,7 @@ have to be connected to the server after it restarts.
9895

9996
If the server crashes, the last few seconds of progress may be lost. For example,
10097
when a task is finished and the server crashes before the journal is written, then
101-
after resuming the server, the task will be not be computed after a server restart.
98+
after resuming the server, the task will be recomputed.
10299

103100
### Exporting journal events
104101

@@ -110,7 +107,7 @@ $ hq journal export <journal-path>
110107
```
111108

112109
The events will be read from the provided journal and printed to `stdout` encoded in JSON, one
113-
event per line (this corresponds to line-delimited JSON, i.e. [NDJSON](http://ndjson.org/)).
110+
event per line (this corresponds to line-delimited JSON, i.e. [JSON Lines](https://jsonlines.org/)).
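Because each line is a standalone JSON object, the export can be post-processed with ordinary line-oriented tools; for example, assuming `jq` is installed (the journal path is a placeholder):

```bash
$ hq journal export <journal-path> | jq -c .
```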
114111

115112
You can also directly stream events in real-time from the server using the following command:
116113

@@ -123,17 +120,15 @@ $ hq journal stream
123120
The JSON format of the journal events and their definition is currently unstable and can change
124121
with a new HyperQueue version.
125122

126-
### Pruning journal
123+
### Pruning the journal
127124

128-
Command `hq journal prune` removes all completed jobs and disconnected workers from the journal file.
125+
The `hq journal prune` command removes all completed jobs and disconnected workers from the journal file, in order to reduce its size on disk.
129126

130-
### Flushing journal
127+
### Flushing the journal
131128

132-
Command `hq journal flush` will force the server to flush the journal.
133-
It is mainly for the testing purpose or if you are going to `hq journal export` on
134-
a live journal (however, it is usually better to use `hq journal stream`).
129+
The `hq journal flush` command forces the server to flush the journal, so that the latest state is persisted to disk. It is mainly useful for testing or if you are going to run `hq journal export` while the server is running (however, it is usually better to use `hq journal stream`).
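A typical sequence combining the two commands might look like this (the journal path is a placeholder):

```bash
$ hq journal flush
$ hq journal export <journal-path>
```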
135130

136-
## Stopping server
131+
## Stopping the server
137132

138133
You can stop a running server with the following command:
139134

docs/deployment/worker.md

Lines changed: 20 additions & 20 deletions
@@ -1,4 +1,5 @@
1-
Workers connect to a running instance of a HyperQueue [server](server.md) and wait for task assignments. Once some task
1+
Workers manage the computational resources of a single computer (node) and use them to execute tasks submitted to HyperQueue.
2+
They connect to a running instance of a HyperQueue [server](server.md) and wait for task assignments. Once some task
23
is assigned to them, they will compute it and notify the server of its completion.
34

45
## Starting workers
@@ -7,8 +8,8 @@ HPC cluster. You can either use the automatic allocation system of HyperQueue to
78
workers manually.
89

910
### Automatic worker deployment (recommended)
10-
If you are using a job manager (PBS or Slurm) on an HPC cluster, the easiest way of deploying workers is to use
11-
[**Automatic allocation**](allocation.md). It is a component of HyperQueue that takes care of submitting PBS/Slurm jobs
11+
If you are using an allocation manager (PBS or Slurm) on an HPC cluster, the easiest way of deploying workers is to use
12+
[**Automatic allocation**](allocation.md). It is a component of HyperQueue that takes care of submitting PBS/Slurm allocations
1213
and spawning HyperQueue workers.
1314

1415
### Manual worker deployment
@@ -32,11 +33,11 @@ If you want to connect to a different server, use the `--server-dir` option.
3233

3334
However, if a shared filesystem is not available on your cluster, you can just copy the server directory from the
3435
server machine to the worker machine and access it from there. The worker machine still has to be able to initiate
35-
a TCP/IP connection to the server machine though.
36+
a TCP/IP connection to the server machine though. See [this page](./cloud.md) for more details.
3637

3738
#### Deploying a worker using PBS/Slurm
3839
If you want to manually start a worker using PBS or Slurm, simply use the corresponding submit command (`qsub` or `sbatch`)
39-
and run the `hq worker start` command inside the allocated job. If you want to start a worker on each allocated node,
40+
and run the `hq worker start` command inside the created allocation. If you want to start a worker on each allocated node,
4041
you can run this command on each node using e.g. `mpirun`.
4142

4243
Example submission script:
@@ -69,15 +70,15 @@ Example submission script:
6970
srun --overlap /<path-to-hyperqueue>/hq worker start --manager slurm
7071
```
7172

72-
The worker will try to automatically detect that it is started under a PBS/Slurm job, but you can also explicitly pass
73+
The worker will try to automatically detect that it is started under a PBS/Slurm allocation, but you can also explicitly pass
7374
the option `--manager <pbs/slurm>` to tell the worker that it should expect a specific environment.
7475

7576
#### Deploying a worker using SSH
7677

7778
If you have an OpenSSH-compatible `ssh` binary available in your environment, HQ can deploy workers to a set of hostnames using the `deploy-ssh` command:
7879

7980
```bash
80-
$ hq worker deploy-ssh <nodefile> <worker-args>
81+
$ hq worker deploy-ssh <nodefile> <worker-start-args>
8182
```
8283

8384
To use this command, you need to prepare a *hostfile*, which should contain a set of lines describing individual hostnames on which you want to deploy the workers:
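For instance, a hostfile with three workers might look like this (the hostnames are purely illustrative):

```
node1.my-cluster
node2.my-cluster
node3.my-cluster
```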
@@ -109,13 +110,12 @@ $ hq worker stop <selector>
109110

110111
## Time limit
111112
HyperQueue workers are designed to be volatile, i.e. it is expected that they will be stopped from time to time, because
112-
they are often started inside PBS/Slurm jobs that have a limited duration.
113+
they are often started inside PBS/Slurm allocations that have a limited duration.
113114

114-
It is very useful for the workers to know how much remaining time ("lifetime") do they have until they will be stopped.
115+
It is useful for workers to know how much remaining time ("lifetime") they have before they are stopped.
115116
This duration is called the `Worker time limit`.
116117

117-
When a worker is started manually inside a PBS or Slurm job, it will automatically calculate the time limit from the job's
118-
metadata. If you want to set time limit for workers started outside of PBS/Slurm jobs or if you want to
118+
When a worker is started manually inside a PBS or Slurm allocation, it will automatically calculate the time limit from the metadata of the allocation. If you want to set a time limit for workers started outside of PBS/Slurm allocations or if you want to
119119
override the detected settings, you can use the `--time-limit=<DURATION>` option[^1] when starting the worker.
120120

121121
[^1]: You can use various [shortcuts](../cli/shortcuts.md#duration) for the duration value.
@@ -126,7 +126,7 @@ The time limit of a worker affects what tasks can be scheduled to it. For exampl
126126
will not be scheduled onto a worker that only has a remaining time limit of 5 minutes.
127127

128128
## Idle timeout
129-
When you deploy *HQ* workers inside a PBS or Slurm job, keeping the worker alive will drain resources from your
129+
When you deploy *HQ* workers inside a PBS or Slurm allocation, keeping the worker alive will drain resources from your
130130
accounting project (unless you use a free queue). If a worker has nothing to do, it might be better to terminate it
131131
sooner to avoid paying these costs for no reason.
132132

@@ -152,26 +152,30 @@ This value will be then used for each worker that does not explicitly specify it
152152
Each worker can be in one of the following states:
153153

154154
* **Running** Worker is running and is able to process tasks
155-
* **Connection lost** Worker lost connection to the server. Probably someone manually killed the worker or job walltime
156-
in its PBS/Slurm job was [reached](#time-limit).
155+
* **Connection lost** Worker lost connection to the server. Probably someone manually killed the worker or the walltime
156+
of its PBS/Slurm allocation was [reached](#time-limit).
157157
* **Heartbeat lost** Communication between server and worker was interrupted. It usually signifies a network problem or
158158
a hardware crash of the computational node.
159159
* **Stopped** Worker was [stopped](#stopping-workers).
160160
* **Idle timeout** Worker was terminated due to [Idle timeout](#idle-timeout).
161161

162162
### Lost connection to the server
163163

164-
The behavior of what should happen with a worker that lost its connection to the server is configured
164+
The behavior of what should happen when a worker loses its connection to the server is configured
165165
via `hq worker start --on-server-lost=<policy>`. You can select from two policies:
166166

167167
* `stop` - The worker immediately terminates and kills all currently running tasks.
168-
* `finish-running` - The worker does not start to execute any new tasks, but it tries to finish tasks
168+
* `finish-running` - The worker does not start executing any new tasks, but it tries to finish tasks
169169
that are already running. When all such tasks finish, the worker will terminate.
170170

171171
`stop` is the default policy when a worker is manually started by `hq worker start`.
172172
When a worker is started by the [automatic allocator](allocation.md), then `finish-running` is used
173173
as the default value.
174174

175+
## Worker groups
176+
177+
Each worker is a member of exactly one worker group. Groups are used to determine which workers are eligible to execute multi-node tasks. You can find more information about worker groups [here](../jobs/multinode.md#groups).
178+
175179
## Useful worker commands
176180
Here is a list of useful worker commands:
177181

@@ -188,7 +192,3 @@ If you also want to include workers that are offline (i.e. that have crashed or
188192
```bash
189193
$ hq worker info <worker-id>
190194
```
191-
192-
### Worker groups
193-
194-
Each worker is a member exactly of one group. Groups are used when multi-node tasks are used. See more [here](../jobs/multinode.md#groups)

docs/faq.md

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ about anything related to HyperQueue, feel free to ask on our [discussion forum]
5454
each with a single task.
5555

5656
HQ also supports [streaming](jobs/streaming.md) of task outputs into a single file.
57-
This avoids creating many small files for each task on a distributed file system, which improves
57+
This avoids creating many small files for each task on a distributed filesystem, which improves
5858
scaling.
5959

6060
??? question "Does HQ support multi-CPU tasks?"

docs/jobs/arrays.md

Lines changed: 1 addition & 1 deletion
@@ -110,7 +110,7 @@ If `--array` defines an ID that exceeds the number of lines in the file (or the
110110

111111
For example:
112112

113-
```commandline
113+
```bash
114114
$ hq submit --each-line input.txt --array "2, 8-10"
115115
```
116116

docs/jobs/failure.md

Lines changed: 4 additions & 4 deletions
@@ -9,21 +9,21 @@ recompute only tasks with a specific status (e.g. failed tasks).
99
By combining the following commands, you can recompute only failed tasks. Let us assume that we want to recompute
1010
all failed tasks in job 5:
1111

12-
```commandline
12+
```bash
1313
$ hq submit --array=`hq job task-ids 5 --filter=failed` ./my-computation
1414
```
1515
It works as follows: the command `hq job task-ids 5 --filter=failed` returns the IDs of failed tasks of job `5`, and we pass
1616
them to the `--array` parameter, so that only tasks with the given IDs are started.
1717

1818
If we want to recompute all failed tasks and all canceled tasks we can do it as follows:
1919

20-
```commandline
20+
```bash
2121
$ hq submit --array=`hq job task-ids 5 --filter=failed,canceled` ./my-computation
2222
```
2323

2424
Note that it also works with `--each-line` or `--from-json`, i.e.:
2525

26-
```commandline
26+
```bash
2727
# Original computation
2828
$ hq submit --each-line=input.txt ./my-computation
2929

@@ -56,7 +56,7 @@ You can change this behavior with the `--max-fails=<X>` option of the `submit` c
5656
If specified, once more tasks than `X` tasks fail, the rest of the job's tasks that were not completed yet will be canceled.
5757

5858
For example:
59-
```commandline
59+
```bash
6060
$ hq submit --array 1-1000 --max-fails 5 ...
6161
```
6262
This will create a task array with `1000` tasks. Once `5` or more tasks fail, the remaining uncompleted tasks of the job

docs/jobs/jobfile.md

Lines changed: 2 additions & 2 deletions
@@ -22,13 +22,13 @@ command = ["sleep", "1"]
2222
Let us assume that we have named this file as ``myfile.toml``,
2323
then we can run the following command to submit a job:
2424

25-
```commandline
25+
```bash
2626
$ hq job submit-file myfile.toml
2727
```
2828

2929
The effect will be same as running:
3030

31-
```commandline
31+
```bash
3232
$ hq submit sleep 1
3333
```
3434

docs/jobs/jobs.md

Lines changed: 5 additions & 5 deletions
@@ -427,25 +427,25 @@ Here is a list of useful job commands:
427427

428428
### Display a summary table of all jobs
429429

430-
```commandline
430+
```bash
431431
$ hq job summary
432432
```
433433

434434
### Display information about a specific job
435435

436-
```commandline
436+
```bash
437437
$ hq job info <job-selector>
438438
```
439439

440440
### Display information about individual tasks (potentially across multiple jobs)
441441

442-
```commandline
442+
```bash
443443
$ hq task list <job-selector> [--task-status <status>] [--tasks <task-selector>]
444444
```
445445

446446
### Display job `stdout`/`stderr`
447447

448-
```commandline
448+
```bash
449449
$ hq job cat <job-id> [--tasks <task-selector>] <stdout/stderr>
450450
```
451451

@@ -456,7 +456,7 @@ worker. HyperQueue server remembers how many times were a task running while a w
456456
If the count reaches the limit, then the task is set to the failed state.
457457
By default, this limit is `5` but it can be changed as follows:
458458

459-
```commandline
459+
```bash
460460
$ hq submit --crash-limit=<NEWLIMIT> ...
461461
```
462462
