A hardware task scheduler for heterogeneous multi-core, multi-chiplet SoCs. It accepts a stream of task descriptors with encoded dependency information, resolves inter-task dependencies through a counter-based dependency matrix, and dispatches ready tasks to execution cores.
Authors: Fanchen Kong, Xiaoling Yi, Yunhao Deng
Affiliation: KU Leuven (MICAS)
Host / Software Runtime
|
| Task descriptors (64-bit packed structs)
v
+--------------+ +------------------------------------------+
| Task Queue | | Per-Chiplet HW Manager |
| (AXI-Lite or |------>| |
| Master) | | +-- stream_demux (by core_id) ------+ |
+--------------+ | | | |
| v | |
+--------+----------+ +--------+----------+ | |
| Waiting Queue | | Waiting Queue |...| |
| Core 0 (depth 8) | | Core 1 (depth 8) | | |
+--------+----------+ +--------+----------+ | |
| | | |
dep_check_manager FSM dep_check_manager | |
(IDLE->CHECK->QUEUE->FINISH) | |
| | | |
+--------v-----------+-----------v---------+ | |
| Counter-Based Dependency Matrix | | |
| (per cluster, 8-bit saturating counters)| | |
| set: counter++ (always accepts) | | |
| check: all required counters >= 1 | | |
| clear: counter-- (on successful check) | | |
+--------+-----------+---------------------+ | |
| | |
+--------------+--------------+ | |
v v | |
+-------+--------+ +-------+--------+ | |
| Ready Queue | | Checkout Queue | | |
| [core][cluster]| | [core][cluster]| | |
| -> Device Core | | -> dep_set | | |
+----------------+ +-------+--------+ | |
| | | |
(execute) +---------+----------+ | |
| | | | |
v Local dep_set Remote dep_set (H2H) | |
+-------+--------+ | +-------------------+ | |
| Done Queue | | | Chiplet Dep Set | | |
| [core][cluster]| | | AXI-Lite Master |------+ | |
| (per-pair FIFO)| | | -> remote chiplet | | | |
+-------+--------+ | +-------------------+ | | |
| | | | |
+----> Arbiter -> dep_matrix.set_column() | | |
| | |
+-----------------------------------------------------------+ | |
| From Remote Chiplets (H2H) | |
| -> Chiplet Done Queue -> Arbiter -> dep_matrix.set_column()| |
+--------------------------------------------------------------+ |
+------------------------------------------------------------------+
Each task is a 64-bit packed struct pushed into the task queue:
| Field | Width | Description |
|---|---|---|
| `task_type` | 1 | 0 = normal (executes on core), 1 = dummy (synchronization only) |
| `task_id` | 12 | Unique identifier (0-4095) |
| `assigned_chiplet_id` | 8 | Target chiplet |
| `assigned_cluster_id` | log2(clusters) | Target cluster within chiplet |
| `assigned_core_id` | log2(cores) | Target core within cluster |
| `dep_check_en` | 1 | Enable dependency checking before dispatch |
| `dep_check_code` | N_CORES | Bitmask: which core columns to check in dep matrix |
| `dep_set_en` | 1 | Enable dependency signaling after completion |
| `dep_set_all_chiplet` | 1 | Broadcast dep_set to all chiplets |
| `dep_set_chiplet_id` | 8 | Target chiplet for dep_set |
| `dep_set_cluster_id` | log2(clusters) | Target cluster for dep_set |
| `dep_set_code` | N_CORES | Bitmask: which core rows to signal in dep matrix |
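
To make the field budget concrete, here is a minimal Python sketch that packs and unpacks a descriptor using the widths from the table. The bit order is an assumption (first listed field in the least-significant bits); the authoritative layout is the packed struct in the RTL. The helper names `id_width`, `pack` and `unpack` are hypothetical.

```python
N_CORES = 4     # NUM_CORES_PER_CLUSTER default
N_CLUSTERS = 2  # NUM_CLUSTERS_PER_CHIPLET default

def id_width(n: int) -> int:
    """Bits needed to index n items (at least 1 bit)."""
    return max(1, (n - 1).bit_length())

# (name, width) in table order. ASSUMPTION: the first field sits in the LSBs.
FIELDS = [
    ("task_type", 1),
    ("task_id", 12),
    ("assigned_chiplet_id", 8),
    ("assigned_cluster_id", id_width(N_CLUSTERS)),
    ("assigned_core_id", id_width(N_CORES)),
    ("dep_check_en", 1),
    ("dep_check_code", N_CORES),
    ("dep_set_en", 1),
    ("dep_set_all_chiplet", 1),
    ("dep_set_chiplet_id", 8),
    ("dep_set_cluster_id", id_width(N_CLUSTERS)),
    ("dep_set_code", N_CORES),
]

def pack(**values: int) -> int:
    """Pack named field values into one integer; unspecified fields are 0."""
    if sum(width for _, width in FIELDS) > 64:
        raise ValueError("fields exceed the 64-bit descriptor")
    word, shift = 0, 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word

def unpack(word: int) -> dict:
    """Inverse of pack(): split the integer back into named fields."""
    fields, shift = {}, 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields
```

With the default 4 cores and 2 clusters the fields occupy 44 of the 64 bits, so the descriptor leaves headroom for larger core counts.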
1. PUSH Host writes task descriptor to task queue
2. ROUTE Demux routes task to assigned core's waiting queue
3. CHECK dep_check_manager reads dep_matrix:
- dep_check_en=0: bypass (immediate pass)
- dep_check_en=1: wait until all required counters >= 1
4. CLEAR On pass, decrement checked counters by 1
5. DISPATCH Task enters ready queue; core reads and executes
6. COMPLETE Core writes done_info to per-(core,cluster) done queue
7. SIGNAL Done queue + checkout queue match triggers dep_set:
- Local: increment counter in target cluster's dep matrix
- Remote: AXI-Lite write to target chiplet's H2H mailbox
Each cluster has a dependency matrix with N_CORES x N_CORES cells, where each cell is an 8-bit saturating counter (not a single bit).
Column (signal source core)
core 0 core 1 core 2
Row 0 (co0) [cnt] [cnt] [cnt] <- what core 0 waits for
Row 1 (co1) [cnt] [cnt] [cnt] <- what core 1 waits for
Row 2 (co2) [cnt] [cnt] [cnt] <- what core 2 waits for
Operations:
- `set_column(col, mask)`: For each row in `mask`, increment `counter[row][col]`. Always succeeds (no overlap rejection, `dep_set_ready = '1`).
- `check_row(row, mask)`: True if `counter[row][c] >= 1` for every column `c` in the mask.
- `clear_row(row, mask)`: Decrement `counter[row][c]` by 1 for each column `c` in the mask.
This design eliminates the deadlock caused by the earlier 1-bit cells with overlap detection, where a second set to an already-set bit was rejected, creating circular backpressure through the done queue.
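The counter semantics can be illustrated with a small Python model. This is a sketch of the behavior described above, not the RTL: the class name `DepMatrix` is hypothetical, and clamping `clear_row` at zero is an assumption (in the pipeline a clear only follows a successful check).

```python
class DepMatrix:
    """Toy model of one cluster's counter-based dependency matrix."""

    MAX_COUNT = 255  # 8-bit saturating counter

    def __init__(self, n_cores: int):
        self.n_cores = n_cores
        # counter[row][col]: signals from core `col` pending for core `row`
        self.counter = [[0] * n_cores for _ in range(n_cores)]

    def set_column(self, col: int, row_mask: int) -> None:
        """Increment counter[row][col] for every row selected in row_mask."""
        for row in range(self.n_cores):
            if (row_mask >> row) & 1:
                self.counter[row][col] = min(self.MAX_COUNT,
                                             self.counter[row][col] + 1)

    def check_row(self, row: int, col_mask: int) -> bool:
        """True if every selected column of this row holds at least 1."""
        return all(self.counter[row][c] >= 1
                   for c in range(self.n_cores) if (col_mask >> c) & 1)

    def clear_row(self, row: int, col_mask: int) -> None:
        """Decrement every selected column of this row (floor at 0, assumed)."""
        for c in range(self.n_cores):
            if (col_mask >> c) & 1:
                self.counter[row][c] = max(0, self.counter[row][c] - 1)


# Core 1 signals core 0 twice before core 0 performs its first check.
m = DepMatrix(3)
m.set_column(col=1, row_mask=0b001)
m.set_column(col=1, row_mask=0b001)
first = m.check_row(row=0, col_mask=0b010)   # both signals are retained
m.clear_row(row=0, col_mask=0b010)
second = m.check_row(row=0, col_mask=0b010)  # one signal is still pending
```

With a 1-bit cell the second `set_column` would have had to be rejected or dropped; with counters both signals are recorded and consumed one check at a time.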
Each (core, cluster) pair has its own independent done queue FIFO:
Done Queues: [NUM_CORES][NUM_CLUSTERS] independent FIFOs
done_q[0][0] done_q[0][1] <- core 0, clusters 0..1
done_q[1][0] done_q[1][1] <- core 1, clusters 0..1
done_q[2][0] done_q[2][1] <- core 2, clusters 0..1
The pop condition for each FIFO depends only on signals belonging to its own (core, cluster) pair:
done_q_pop[core][cluster] = !done_q_empty[core][cluster]
&& checkout[core][cluster].task_type == NORMAL
&& arbiter_ready[core + cluster * N_CORES]
The pop logic has no cross-core or cross-cluster dependency. This eliminates the head-of-line blocking that occurs when one core's completion stalls behind another core's entry in a shared FIFO.
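As an illustration, the pop condition can be written as a pure function of per-pair state. The Python below is a behavioral sketch only; the data structures (`deque` FIFOs, a flat `arbiter_ready` list) and the function names are assumptions chosen to mirror the expression above.

```python
from collections import deque

NORMAL, DUMMY = 0, 1          # task_type encodings
N_CORES, N_CLUSTERS = 3, 2    # matches the 3x2 example layout

def arbiter_index(core: int, cluster: int) -> int:
    """Flat arbiter port for a (core, cluster) pair."""
    return core + cluster * N_CORES

def done_q_pop(done_q, checkout_head_type, arbiter_ready,
               core: int, cluster: int) -> bool:
    """Pop decision for one FIFO; reads only that pair's own signals."""
    return (len(done_q[core][cluster]) > 0
            and checkout_head_type[core][cluster] == NORMAL
            and arbiter_ready[arbiter_index(core, cluster)])
```

Because each call reads only entries indexed by its own `core` and `cluster`, a stalled arbiter port for one pair cannot hold back any other pair.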
bingo_hw_manager_top
|
+-- Task Queue (1x)
| +-- write_mailbox (AXI-Lite slave mode) OR
| +-- task_queue_master (AXI-Lite master mode)
|
+-- Per-Core Pipeline (NUM_CORES_PER_CLUSTER instances)
| +-- fifo_v3 (waiting_dep_check_queue, depth=8)
| +-- dep_check_manager (4-state FSM)
| +-- stream_filter (dep_check_en bypass)
| +-- stream_demux (route to cluster)
| +-- stream_filter (dummy task filter)
| +-- stream_demux (route to cluster, ready+checkout path)
|
+-- Per-Cluster Dep Matrix (NUM_CLUSTERS_PER_CHIPLET instances)
| +-- dep_matrix (counter-based, 8-bit cells)
|
+-- Per-(Core, Cluster) Queues (N_CORES x N_CLUSTERS instances each)
| +-- Ready Queue: read_mailbox or fifo_v3
| +-- Checkout Queue: fifo_v3 (depth=CheckoutQueueDepth)
| +-- Done Queue: fifo_v3 (depth=DoneQueueDepth)
| +-- stream_demux (local vs H2H dep_set routing)
| +-- stream_filter (dep_set_en filtering)
|
+-- Dep Matrix Set Arbiter (1x)
| +-- stream_arbiter (N_CORES*N_CLUSTERS + 1 inputs)
| +-- stream_demux (route to cluster dep matrix)
| +-- stream_demux (route to core within cluster)
|
+-- H2H Communication
| +-- Chiplet Dep Set Master (AXI-Lite master, 1x)
| +-- stream_arbiter (chiplet dep set, from all cores)
| +-- Chiplet Done Queue (write_mailbox, 1x)
|
+-- Power Manager (1x)
+-- bingo_hw_manager_pm (idle-based clock gating)
| Parameter | Default | Description |
|---|---|---|
| `NUM_CORES_PER_CLUSTER` | 4 | Execution cores per cluster |
| `NUM_CLUSTERS_PER_CHIPLET` | 2 | Clusters per chiplet |
| `TaskIdWidth` | 12 | Task ID width (max 4096 tasks) |
| `ChipIdWidth` | 8 | Chiplet ID width (max 256 chiplets) |
| `HostAxiLiteAddrWidth` | 48 | Host-side AXI address width |
| `HostAxiLiteDataWidth` | 64 | Host-side AXI data width (task descriptor) |
| `DeviceAxiLiteAddrWidth` | 48 | Device-side AXI address width |
| `DeviceAxiLiteDataWidth` | 32 | Device-side AXI data width (done info) |
| `TaskQueueDepth` | 32 | Incoming task FIFO depth |
| `DoneQueueDepth` | 32 | Per-(core,cluster) done FIFO depth |
| `CheckoutQueueDepth` | 8 | Per-(core,cluster) checkout FIFO depth |
| `ReadyQueueDepth` | 8 | Per-(core,cluster) ready FIFO depth |
| `TASK_QUEUE_TYPE` | 1 | 0: AXI-Lite slave, 1: AXI-Lite master |
| `READY_AND_DONE_QUEUE_INTERFACE_TYPE` | 1 | 0: AXI-Lite, 1: CSR req/resp |
Task Queue:
- Mode 0 (Slave): Host pushes task descriptors via AXI-Lite writes to a mailbox
- Mode 1 (Master): HW manager fetches task descriptors from host memory at `task_list_base_addr_i`
Ready/Done Queues:
- Mode 0 (AXI-Lite): Cores read ready tasks and write completions via AXI-Lite
- Mode 1 (CSR): Cores use lightweight CSR req/resp interface (lower latency)
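
The quantities that follow from these parameters can be checked with a short Python sketch. The `HwManagerConfig` class is hypothetical and exists only for illustration; its defaults copy the parameter table, and the `+ 1` arbiter input is taken from the module hierarchy (it appears to correspond to the chiplet done queue, which is an interpretation rather than something the table states).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HwManagerConfig:
    """Hypothetical mirror of a few top-level parameters (defaults from the table)."""
    num_cores_per_cluster: int = 4
    num_clusters_per_chiplet: int = 2
    task_id_width: int = 12
    chip_id_width: int = 8
    task_queue_type: int = 1                      # 0: AXI-Lite slave, 1: AXI-Lite master
    ready_and_done_queue_interface_type: int = 1  # 0: AXI-Lite, 1: CSR req/resp

    def __post_init__(self):
        # Both mode selectors are documented as two-valued.
        if self.task_queue_type not in (0, 1):
            raise ValueError("TASK_QUEUE_TYPE must be 0 or 1")
        if self.ready_and_done_queue_interface_type not in (0, 1):
            raise ValueError("READY_AND_DONE_QUEUE_INTERFACE_TYPE must be 0 or 1")

    @property
    def queue_pairs(self) -> int:
        """Number of (core, cluster) pairs, each with ready/checkout/done queues."""
        return self.num_cores_per_cluster * self.num_clusters_per_chiplet

    @property
    def set_arbiter_inputs(self) -> int:
        """Inputs of the dep matrix set arbiter: one per pair plus one extra."""
        return self.queue_pairs + 1

    @property
    def max_tasks(self) -> int:
        return 1 << self.task_id_width

    @property
    def max_chiplets(self) -> int:
        return 1 << self.chip_id_width


cfg = HwManagerConfig()
```

For the defaults this gives 8 queue pairs, 9 arbiter inputs, 4096 task IDs and 256 chiplet IDs.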
When a task's dep_set_chiplet_id != chip_id_i, the dependency signal is routed to a remote chiplet via the H2H path:
- Checkout queue entry routed to chiplet dep_set arbiter
- `bingo_hw_manager_chiplet_dep_set` module sends AXI-Lite write to remote chiplet's mailbox
- Remote chiplet receives via `from_remote_axi_lite_req_i` into its chiplet done queue
- Remote chiplet processes the signal through its dep matrix set arbiter
Broadcast mode (dep_set_all_chiplet = 1) sends the signal to all chiplets simultaneously.
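The routing decision can be sketched in Python as follows. This is a behavioral illustration under stated assumptions: the function name `route_dep_set` and the dictionary-based descriptor are hypothetical, and treating the local chiplet's share of a broadcast as a local dep_set (rather than an H2H write to itself) is an assumption, not something stated here.

```python
def route_dep_set(desc: dict, chip_id: int, all_chiplet_ids: list) -> dict:
    """Map each target chiplet ID to the path its dep_set would take."""
    if not desc.get("dep_set_en", 0):
        return {}  # dependency signaling disabled for this task
    if desc.get("dep_set_all_chiplet", 0):
        targets = list(all_chiplet_ids)         # broadcast mode
    else:
        targets = [desc["dep_set_chiplet_id"]]  # single target
    # ASSUMPTION: a target equal to the local chip ID uses the local path.
    return {t: ("local" if t == chip_id else "h2h") for t in targets}


local = route_dep_set({"dep_set_en": 1, "dep_set_chiplet_id": 0}, 0, [0, 1, 2])
remote = route_dep_set({"dep_set_en": 1, "dep_set_chiplet_id": 2}, 0, [0, 1, 2])
broadcast = route_dep_set({"dep_set_en": 1, "dep_set_all_chiplet": 1,
                           "dep_set_chiplet_id": 0}, 0, [0, 1, 2])
```

Here `local` resolves to the in-chiplet path, `remote` to a single H2H write, and `broadcast` to one entry per chiplet.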
- AXI v0.39.1 — AXI-Lite definitions, crossbar
- common_cells v1.37.0 — FIFO, stream arbiter/demux/filter, counters
DARTS extends the static scheduler with conditional execution support for data-dependent workloads (MoE routing, early exit). See dev_doc/ for full architecture documentation.
A 16-entry Conditional Execution Register File (CERF) per chiplet enables runtime task skipping. Tasks marked as conditional are either executed or skipped based on the CERF state, which is written by gating tasks on completion.
The user expresses conditional execution through conditional edges in the DFG:
# Router conditionally activates each expert (compiler handles the rest)
dfg.bingo_add_edge(router, expert_0, cond=True)
dfg.bingo_add_edge(router, expert_1, cond=True)
dfg.bingo_add_edge(expert_0, aggregator) # unconditional
# Compile: auto-assigns CERF groups, promotes router to gating task
compile_dfg(dfg)
# Simulate: specify which nodes are active
run_sim(dfg, config, active_nodes={expert_0})

The compiler pass `bingo_compile_conditional_regions()`:
- Scans edges for `cond=True`
- Auto-promotes source nodes to gating tasks (`task_type=2`)
- Groups conditional targets by connected components (unconditional edges between targets = shared CERF group)
- Assigns CERF group IDs (0-15) automatically
Skipped tasks still propagate dependency signals (via the checkout queue as dummies), preserving graph correctness.
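The grouping step can be sketched with a small union-find over the conditional targets. This is not the actual `bingo_compile_conditional_regions()` implementation: the function name `group_conditional_targets`, the `(src, dst, cond)` edge tuples, and assigning group IDs in first-seen order are all assumptions made for illustration.

```python
MAX_CERF_GROUPS = 16  # CERF has 16 entries, so group IDs are 0-15

def group_conditional_targets(edges):
    """edges: iterable of (src, dst, cond). Returns (gating_nodes, group_of)."""
    edges = list(edges)
    gating, targets = [], []
    for src, dst, cond in edges:
        if cond:
            if src not in gating:
                gating.append(src)   # source of a conditional edge gates it
            if dst not in targets:
                targets.append(dst)  # destination is a conditional target

    parent = {t: t for t in targets}

    def find(node):
        # Path-halving find for the union-find structure.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Unconditional edges between two conditional targets merge their groups.
    for src, dst, cond in edges:
        if not cond and src in parent and dst in parent:
            parent[find(src)] = find(dst)

    group_of, ids = {}, {}
    for t in targets:
        root = find(t)
        if root not in ids:
            if len(ids) >= MAX_CERF_GROUPS:
                raise ValueError("more than 16 CERF groups required")
            ids[root] = len(ids)
        group_of[t] = ids[root]
    return gating, group_of


gating, groups = group_conditional_targets([
    ("router", "expert_0", True),
    ("router", "expert_1", True),
    ("expert_0", "aggregator", False),
])
```

In this example the two experts land in separate groups because no unconditional edge connects them; the edge to `aggregator` does not merge anything since `aggregator` is not a conditional target.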
| Module | Purpose |
|---|---|
| `bingo_hw_manager_cond_exec_controller.sv` | 16-entry CERF register file |
| `bingo_hw_manager_load_monitor.sv` | Per-core pending task counters (load monitoring) |
Evaluated via cycle-accurate Python simulator (scripts/eval_darts.py):
| Workload | Configuration | Speedup |
|---|---|---|
| MoE 8 experts, top-2 | 1 cluster, 3 cores | 2.29x |
| MoE 16 experts, top-1 | 1 cluster, 3 cores | 4.15x |
| MoE 8 experts, top-2 | 2 chiplets | 1.84-1.95x |
| Early exit (stage 0/4) | 1 cluster, 3 cores | 3.28x |
| Level | File | Description |
|---|---|---|
| 0 | `bingo_hw_manager_mailbox_adapter.sv` | AXI-Lite to mailbox adapter |
| 0 | `bingo_hw_manager_read_mailbox.sv` | FIFO-to-AXI-Lite read bridge |
| 0 | `bingo_hw_manager_write_mailbox.sv` | AXI-Lite-to-FIFO write bridge |
| 0 | `bingo_hw_manager_task_queue_master.sv` | AXI-Lite master for task fetching |
| 0 | `bingo_hw_manager_csr_to_fifo*.sv` | CSR interface adapters |
| 1 | `bingo_hw_manager_dep_matrix.sv` | Counter-based dependency matrix |
| 1 | `bingo_hw_manager_chiplet_dep_set.sv` | H2H AXI-Lite master |
| 1 | `bingo_hw_manager_dep_check_manager.sv` | Dependency check FSM |
| 1 | `bingo_hw_manager_pm.sv` | Power manager |
| 1 | `bingo_hw_manager_cond_exec_controller.sv` | CERF (conditional execution) |
| 1 | `bingo_hw_manager_load_monitor.sv` | Load monitoring |
| 2 | `bingo_hw_manager_top.sv` | Top-level integration |
The test infrastructure includes:
- RTL testbench harness (`test/tb_bingo_hw_manager_harness.svh`) with deadlock detection, counter monitoring, and structured trace logging
- 9 structured DFG patterns, including serial chain, parallel fork-join, double buffer, diamond, stacked GEMM, and multi-chiplet chain
- 44 random DAG tests: 10-40 tasks, 1-4 chiplets, sparse/dense edge configurations
- Cycle-accurate Python model (`model/`) mirroring the RTL pipeline behavior
- DFG compiler (`sw/bingo_dfg.py`) with automatic dummy task insertion
# Compile and simulate (requires Questa)
make clean && make compile.log
cd build && echo "run -all" | vsim -sv_seed 0 tb_bingo_hw_manager_top -t 1ns -voptargs="+acc"
# Run Python model tests
source ../.venv/bin/activate
python3 scripts/run_all_tests.py --stress 200

Test results:
- RTL: 66/66 pass (includes CERF conditional tests) with zero deadlocks
- Python model: 68/68 pass (20 structured + 48 random stress)
- Evaluation: 4 experiment suites, all deadlock-free (CSV results in `eval_results/`)