By default, HyperQueue assumes the existence of a shared filesystem, which it uses to exchange metadata required to connect servers and workers and to run various HQ commands.
On systems without a shared filesystem, you will have to distribute an *access file* (`access.json`) to clients and workers.
This file contains the address and port where the server is running, and also secret keys required for encrypted communication.
## Sharing the access file
After you start a server, you can find its `access.json` file in the `$HOME/.hq-server/hq-current` directory. You can then copy it to a different filesystem using a method of your choosing, and configure clients and workers to use that file.
By default, clients and workers search for the `access.json` file in the `$HOME/.hq-server` directory, but you can override that using the `--server-dir` argument, which is available for all `hq` CLI commands. If you moved the `access.json` file into a directory called `/home/foo/hq-access` on the worker's node, you should start the worker like this:
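```bash
$ hq --server-dir=/home/foo/hq-access worker start
```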
You can also configure the server directory using an [environment variable](./server.md#server-directory).
## Generate an access file in advance
In some cases, you might want to generate the access file in advance, before the server is started, and let the server, clients and workers all use that access file. This way, you don't have to redistribute the access file to client/worker nodes every time the server restarts, which could be cumbersome.
To achieve this, an access file can be generated in advance by the `generate-access` command:
```bash
$ hq server generate-access myaccess.json --client-port=6789 --worker-port=1234
```
This generates a `myaccess.json` file that contains generated keys and host information.
The server can later be started with this configuration as follows:
```bash
$ hq server start --access-file=myaccess.json
```
Clients and workers should load the pre-generated access file in the same way as described [above](#sharing-the-access-file). However, you will have to rename the generated file to `access.json`, because clients and workers look it up by that exact name in the provided server directory.
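For example, you can start a worker from the pre-generated `myaccess.json` like this:

```bash
$ mv myaccess.json /mydirectory/access.json
$ hq --server-dir=/mydirectory worker start
```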
!!! note
    The server will still generate and manage its "own" `access.json` in the server directory, even if you provide your own access file. These files are the same, so you can use either of them when connecting clients and workers.
## Splitting access for clients and workers
The default access file contains two secret keys and two TCP/IP addresses, one for clients and one for workers. This metadata can be divided into two separate files, each containing only the information needed by clients or by workers.
```bash
$ hq server generate-access full.json --client-file=client.json --worker-file=worker.json --client-port=6789 --worker-port=1234
```
For starting the server (`hq server start --access-file=...`) you have to use `full.json`.
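Workers (and analogously clients) can then use their respective file in the same way as a full access file. Note that the file still has to be renamed to `access.json` inside the chosen server directory; the `/tmp/hq-worker-access` path below is just an illustrative example:

```bash
$ mkdir /tmp/hq-worker-access
$ mv worker.json /tmp/hq-worker-access/access.json
$ hq --server-dir=/tmp/hq-worker-access worker start
```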
You can use the following command to configure different hostnames under which the server is visible to workers and clients.
```bash
hq server generate-access full.json --worker-host=<WORKER_HOST> --client-host=<CLIENT_HOST> ...
```

not required. A common use-case is to start the server on a login node of an HPC system.
[comment]: <>(TODO: describe scheduler)
Learn more about deploying [server](server.md) and the [workers](worker.md).
There is also a third component that we call the **client**, which represents the users of HyperQueue invoking various `hq` commands to communicate with the server component.
!!! danger "Server directory access"
    Encryption keys are stored in the server directory. Whoever has access to the server directory may submit jobs,
    connect workers to the server and decrypt communication between HyperQueue components. By default, the directory is
    only accessible by the user who started the server.
## Running multiple servers
When you start the server, it will create a new subdirectory in the server directory, which will store the data of the current running instance. It will also create a symlink `hq-current` which will point to the currently active subdirectory. Using this approach, you can start a server using the same server directory multiple times without overwriting data of the previous runs.
## Keeping the server alive
The server is supposed to be a long-lived component. If you shut it down, all workers will disconnect and all
have to be connected to the server after it restarts.

If the server crashes, the last few seconds of progress may be lost. For example,
when a task is finished and the server crashes before the journal is written, then
after resuming the server, the task will be recomputed.

The events will be read from the provided journal and printed to `stdout` encoded in JSON, one
event per line (this corresponds to line-delimited JSON, i.e. [JSON Lines](https://jsonlines.org/)).
You can also directly stream events in real-time from the server using the following command:
```bash
$ hq journal stream
```
The JSON format of the journal events and their definition is currently unstable and can change
with a new HyperQueue version.
### Pruning the journal
The `hq journal prune` command removes all completed jobs and disconnected workers from the journal file, in order to reduce its size on disk.
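For example:

```bash
$ hq journal prune
```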
### Flushing the journal
The `hq journal flush` command forces the server to flush the journal, so that its latest state is persisted to disk. It is mainly useful for testing, or if you are going to run `hq journal export` while the server is running (however, it is usually better to use `hq journal stream`).
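You can trigger the flush like this:

```bash
$ hq journal flush
```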
## Stopping the server
You can stop a running server with the following command:
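```bash
$ hq server stop
```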
The worker will try to automatically detect that it is started under a PBS/Slurm allocation, but you can also explicitly pass
the option `--manager <pbs/slurm>` to tell the worker that it should expect a specific environment.
#### Deploying a worker using SSH

If you have an OpenSSH-compatible `ssh` binary available in your environment, HQ can deploy workers to a set of hostnames using the `deploy-ssh` command:

To use this command, you need to prepare a *hostfile*, which should contain a set of lines describing individual hostnames on which you want to deploy the workers:
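For example, a hostfile listing two nodes could look like this (the hostnames below are purely illustrative):

```
worker-node-1.example.com
worker-node-2.example.com
```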

## Stopping workers

```bash
$ hq worker stop <selector>
```

## Time limit
HyperQueue workers are designed to be volatile, i.e. it is expected that they will be stopped from time to time, because
they are often started inside PBS/Slurm allocations that have a limited duration.
It is very useful for the workers to know how much remaining time ("lifetime") they have until they are stopped.
This duration is called the `Worker time limit`.
When a worker is started manually inside a PBS or Slurm allocation, it will automatically calculate the time limit from the metadata of the allocation. If you want to set a time limit for workers started outside of PBS/Slurm allocations, or if you want to
override the detected settings, you can use the `--time-limit=<DURATION>` option[^1] when starting the worker.
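For example, to start a worker with a time limit of two hours:

```bash
$ hq worker start --time-limit=2h
```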
[^1]: You can use various [shortcuts](../cli/shortcuts.md#duration) for the duration value.

The time limit of a worker affects what tasks can be scheduled to it. For example, a task with a time request longer than 5 minutes
will not be scheduled onto a worker that only has a remaining time limit of 5 minutes.
## Idle timeout
When you deploy *HQ* workers inside a PBS or Slurm allocation, keeping the worker alive will drain resources from your
accounting project (unless you use a free queue). If a worker has nothing to do, it might be better to terminate it
sooner to avoid paying these costs for no reason.
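For example, a worker can be configured to terminate after being idle for five minutes using the worker's `--idle-timeout` option (it accepts the same duration format as `--time-limit`):

```bash
$ hq worker start --idle-timeout=5m
```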

This value will then be used for each worker that does not explicitly specify it.

Each worker can be in one of the following states:
* **Running** Worker is running and is able to process tasks.
* **Connection lost** Worker lost connection to the server. Probably someone manually killed the worker or the walltime
of its PBS/Slurm allocation was [reached](#time-limit).
* **Heartbeat lost** Communication between server and worker was interrupted. It usually signifies a network problem or
a hardware crash of the computational node.
* **Stopped** Worker was [stopped](#stopping-workers).
* **Idle timeout** Worker was terminated due to [Idle timeout](#idle-timeout).
### Lost connection to the server
The behavior of what should happen when a worker loses its connection to the server is configured
via `hq worker start --on-server-lost=<policy>`. You can select from two policies:
* `stop` - The worker immediately terminates and kills all currently running tasks.
* `finish-running` - The worker does not start executing any new tasks, but it tries to finish tasks
that are already running. When all such tasks finish, the worker will terminate.
`stop` is the default policy when a worker is manually started by `hq worker start`.
When a worker is started by the [automatic allocator](allocation.md), then `finish-running` is used
as the default value.
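For example, to start a worker manually with the `finish-running` policy:

```bash
$ hq worker start --on-server-lost=finish-running
```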
## Worker groups
Each worker is a member of exactly one worker group. Groups are used to determine which workers are eligible to execute multi-node tasks. You can find more information about worker groups [here](../jobs/multinode.md#groups).
## Useful worker commands
Here is a list of useful worker commands:

If you also want to include workers that are offline (i.e. that have crashed or disconnected), use `hq worker list --all`.

```bash
$ hq worker info <worker-id>
```