docs: Add architecture doc in developer guide. #163
Open: sitaowang1998 wants to merge 4 commits into y-scope:main from sitaowang1998:architect-doc
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,79 @@ | ||
| # Architecture | ||
|
|
||
| ## Spider Architecture | ||
| `Spider` consists of several components that work together to provide a scalable, low-latency and | ||
| fault-tolerant distributed task execution system. | ||
|
|
||
| ```{image} ./arch.png | ||
| :width: 80% | ||
| :align: center | ||
| :alt: Spider Architecture | ||
| ``` | ||
|
|
||
| ### Storage | ||
| `Spider` relies on fault-tolerant, ACID-compliant storage, e.g., MariaDB, to persist all system | ||
| state. | ||
| The storage holds the following information: | ||
| - Task metadata, including: | ||
|   - Task ID | ||
|   - Task input/output types and values | ||
|   - Task status | ||
| - Job metadata, including: | ||
|   - Job ID | ||
|   - Task graph | ||
|   - Job status | ||
| - Data objects, including: | ||
|   - Data object ID | ||
|   - Data object type | ||
|   - Data object value | ||
|   - References from tasks and clients | ||
| - Client/Scheduler/Worker metadata, including: | ||
|   - Client ID | ||
|   - Scheduler ID | ||
|   - Worker ID | ||
|   - Heartbeat timestamps | ||
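The records listed above could be modeled roughly as follows. This is only an illustrative Python sketch: the class names, field names, and status strings are assumptions made for this example and do not describe Spider's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class TaskRecord:
    # Hypothetical shape of the task metadata kept in storage.
    task_id: str
    input_types: List[str]
    output_types: List[str]
    status: str = "waiting"  # status names are assumed, not Spider's


@dataclass
class JobRecord:
    job_id: str
    # Task graph as: task ID -> IDs of the tasks whose outputs it consumes.
    task_graph: Dict[str, List[str]]
    status: str = "running"


@dataclass
class DataObjectRecord:
    object_id: str
    type_name: str
    value: bytes
    # References from tasks and clients, kept as sets of IDs.
    task_refs: Set[str] = field(default_factory=set)
    client_refs: Set[str] = field(default_factory=set)


@dataclass
class NodeRecord:
    # One row per client, scheduler, or worker.
    node_id: str
    role: str  # "client", "scheduler", or "worker"
    last_heartbeat: float  # seconds since the epoch
```

Keeping the references as sets of IDs makes it cheap to check whether anything still points at a data object.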
|
|
||
| ### Scheduler | ||
| The scheduler is responsible for: | ||
| - Allocating tasks to idle workers upon request | ||
| - Failure detection and recovery | ||
| - Garbage collection | ||
| - Straggler detection and task replication | ||
| For now, `Spider` supports only a single scheduler. We plan to support multiple schedulers if the | ||
| scheduler becomes the bottleneck of the system. | ||
|
|
||
| ### Worker | ||
| A worker executes tasks allocated by the scheduler. It runs the following steps in a loop: | ||
| 1. Request a task from the scheduler | ||
| 2. Fetch task inputs from the storage | ||
| 3. Spawn a process to execute the task | ||
| 4. Store task outputs in the storage and update task and job states | ||
| Each worker executes only one task at a time. | ||
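The four steps above can be sketched as a small loop. The following Python is a minimal illustration only: `InMemoryStorage`, `FifoScheduler`, `TASK_SOURCE`, and `run_worker` are hypothetical stand-ins invented for this example, and the toy task simply sums its inputs. Unlike a real worker, the loop exits once the scheduler has no more tasks.

```python
import json
import subprocess
import sys
from collections import deque


class InMemoryStorage:
    """Hypothetical stand-in for the storage."""

    def __init__(self):
        self.inputs = {}   # task ID -> list of input values
        self.outputs = {}  # task ID -> output value
        self.status = {}   # task ID -> status string (names are assumed)


class FifoScheduler:
    """Hypothetical stand-in for the scheduler: hands out tasks in order."""

    def __init__(self, task_ids):
        self._queue = deque(task_ids)

    def request_task(self):
        # Return the next task ID, or None when nothing is left.
        return self._queue.popleft() if self._queue else None


# Toy task body run in the child process: sums the JSON list read from stdin.
TASK_SOURCE = "import json, sys; print(json.dumps(sum(json.load(sys.stdin))))"


def run_worker(scheduler, storage):
    """Execute tasks one at a time until the scheduler has none left."""
    while True:
        task_id = scheduler.request_task()      # 1. request a task
        if task_id is None:
            break
        task_inputs = storage.inputs[task_id]   # 2. fetch task inputs
        storage.status[task_id] = "running"
        proc = subprocess.run(                  # 3. spawn a process
            [sys.executable, "-c", TASK_SOURCE],
            input=json.dumps(task_inputs),
            capture_output=True,
            text=True,
        )
        if proc.returncode == 0:                # 4. store outputs, update state
            storage.outputs[task_id] = json.loads(proc.stdout)
            storage.status[task_id] = "succeeded"
        else:
            storage.status[task_id] = "failed"
```

Running the task in a child process means a crashing task only affects that process; the worker observes a non-zero exit code and can record the failure.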
|
|
||
| ### Client | ||
| The client communicates only with the storage: it submits jobs, queries job status, and fetches | ||
| job results. | ||
|
|
||
| ## Data Abstraction | ||
| `Spider` provides a simple data abstraction for task inputs and outputs, which encapsulates: | ||
| - the locality of the data, i.e., the addresses where the data resides | ||
| - the checkpoint status, i.e., whether the data is persisted | ||
| This abstraction allows `Spider` to support: | ||
| - locality-aware task scheduling | ||
| - fine-grained failure recovery | ||
| - garbage collection in the background | ||
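As a rough illustration of how such an abstraction could inform scheduling and recovery decisions, here is a minimal Python sketch. The `Data` class, its field names, and the `pick_task` and `survives_failure` helpers are hypothetical and are not taken from Spider's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Data:
    # Hypothetical data handle: where copies live and whether it is persisted.
    key: str
    addresses: Set[str] = field(default_factory=set)
    persisted: bool = False


def pick_task(worker_address: str, ready_tasks: Dict[str, List[Data]]) -> Optional[str]:
    """Pick the ready task with the most inputs already local to the worker."""
    if not ready_tasks:
        return None
    return max(
        ready_tasks,
        key=lambda task_id: sum(
            worker_address in data.addresses for data in ready_tasks[task_id]
        ),
    )


def survives_failure(data: Data, failed_address: str) -> bool:
    """Data stays readable if it is persisted or another copy exists elsewhere."""
    return data.persisted or bool(data.addresses - {failed_address})
```

With this shape, a worker at address `10.0.0.2` asking for work would be handed the task whose inputs already sit on `10.0.0.2`, and only data for which `survives_failure` returns false would need to be recomputed after a failure.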
|
|
||
| ## Fault Tolerance | ||
| `Spider` is designed to be fault-tolerant. The system can recover from failures of a scheduler or a | ||
| worker. | ||
|
|
||
| Schedulers, workers and clients send periodic heartbeats to the storage to indicate their liveness. | ||
| If a scheduler fails, the host can start a new scheduler instance, which fetches the latest state | ||
| from the storage. | ||
| If a worker fails while executing a task, the scheduler detects the failure and recovers the job | ||
| as follows: | ||
| - Identify all the failed tasks within the job | ||
| - Compute the minimum subgraph that contains the failed tasks where all inputs to the subgraph are | ||
| available | ||
| - Invalidate all the tasks in the subgraph, set the tasks on the input boundary as ready and the | ||
| rest as waiting | ||
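The detection and recovery steps above can be sketched in Python. This is one possible reading, not Spider's implementation: the function names, the heartbeat timeout, and the representation of the task graph as a parent map are all assumptions. The subgraph is grown backwards from the failed tasks, pulling in any parent whose output is no longer available.

```python
from typing import Dict, List, Set


def find_stale_nodes(last_heartbeat: Dict[str, float], now: float, timeout: float) -> Set[str]:
    """Return IDs of nodes whose last heartbeat is older than `timeout` seconds."""
    return {node for node, ts in last_heartbeat.items() if now - ts > timeout}


def plan_recovery(
    parents: Dict[str, List[str]],
    failed: Set[str],
    available_outputs: Set[str],
) -> Dict[str, str]:
    """Map each task that must be re-run to its new status.

    parents: task ID -> IDs of the tasks whose outputs it consumes.
    failed: IDs of the failed tasks in the job.
    available_outputs: IDs of tasks whose outputs can still be read.
    """
    # Grow the subgraph backwards until every input crossing its boundary
    # comes from a task whose output is still available.
    subgraph: Set[str] = set()
    stack = list(failed)
    while stack:
        task = stack.pop()
        if task in subgraph:
            continue
        subgraph.add(task)
        for parent in parents.get(task, []):
            if parent not in available_outputs:
                stack.append(parent)

    # Tasks with no parent inside the subgraph sit on the input boundary and
    # can run immediately; the rest wait for their re-run parents.
    new_status = {}
    for task in subgraph:
        blocked = any(p in subgraph for p in parents.get(task, []))
        new_status[task] = "waiting" if blocked else "ready"
    return new_status
```

For example, with a graph `a -> b -> d` and `c -> d`, where `d` failed and the output of `b` was lost but the outputs of `a` and `c` are still available, only `b` and `d` are invalidated: `b` becomes ready and `d` waits for it.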