Skip to content

KULeuven-MICAS/bingo_hw_manager

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

47 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Bingo Hardware Task Manager

A hardware task scheduler for heterogeneous multi-core, multi-chiplet SoCs. It accepts a stream of task descriptors with encoded dependency information, resolves inter-task dependencies through a counter-based dependency matrix, and dispatches ready tasks to execution cores.

Authors: Fanchen Kong, Xiaoling Yi, Yunhao Deng
Affiliation: KU Leuven (MICAS)

Architecture Overview

Host / Software Runtime
         |
         | Task descriptors (64-bit packed structs)
         v
  +--------------+       +------------------------------------------+
  | Task Queue   |       |  Per-Chiplet HW Manager                  |
  | (AXI-Lite or |------>|                                          |
  |  Master)     |       |  +-- stream_demux (by core_id) ------+  |
  +--------------+       |  |                                    |  |
                         |  v                                    |  |
                +--------+----------+   +--------+----------+   |  |
                | Waiting Queue     |   | Waiting Queue     |...|  |
                | Core 0 (depth 8)  |   | Core 1 (depth 8)  |   |  |
                +--------+----------+   +--------+----------+   |  |
                         |                       |               |  |
                    dep_check_manager FSM    dep_check_manager   |  |
                    (IDLE->CHECK->QUEUE->FINISH)                 |  |
                         |                       |               |  |
                +--------v-----------+-----------v---------+     |  |
                |  Counter-Based Dependency Matrix         |     |  |
                |  (per cluster, 8-bit saturating counters)|     |  |
                |  set: counter++ (always accepts)         |     |  |
                |  check: all required counters >= 1       |     |  |
                |  clear: counter-- (on successful check)  |     |  |
                +--------+-----------+---------------------+     |  |
                         |                                       |  |
          +--------------+--------------+                        |  |
          v                             v                        |  |
  +-------+--------+   +-------+--------+                       |  |
  | Ready Queue    |   | Checkout Queue |                       |  |
  | [core][cluster]|   | [core][cluster]|                       |  |
  | -> Device Core |   | -> dep_set     |                       |  |
  +----------------+   +-------+--------+                       |  |
          |                     |                                |  |
     (execute)        +---------+----------+                     |  |
          |           |                    |                     |  |
          v      Local dep_set      Remote dep_set (H2H)        |  |
  +-------+--------+  |           +-------------------+         |  |
  | Done Queue     |  |           | Chiplet Dep Set   |         |  |
  | [core][cluster]|  |           | AXI-Lite Master   |------+  |  |
  | (per-pair FIFO)|  |           | -> remote chiplet |      |  |  |
  +-------+--------+  |           +-------------------+      |  |  |
          |            |                                      |  |  |
          +----> Arbiter -> dep_matrix.set_column()           |  |  |
                                                              |  |  |
  +-----------------------------------------------------------+  |  |
  | From Remote Chiplets (H2H)                                   |  |
  |   -> Chiplet Done Queue -> Arbiter -> dep_matrix.set_column()|  |
  +--------------------------------------------------------------+  |
  +------------------------------------------------------------------+

Task Descriptor Format

Each task is a 64-bit packed struct pushed into the task queue:

Field Width Description
task_type 1 0 = normal (executes on core), 1 = dummy (synchronization only)
task_id 12 Unique identifier (0-4095)
assigned_chiplet_id 8 Target chiplet
assigned_cluster_id log2(clusters) Target cluster within chiplet
assigned_core_id log2(cores) Target core within cluster
dep_check_en 1 Enable dependency checking before dispatch
dep_check_code N_CORES Bitmask: which core columns to check in dep matrix
dep_set_en 1 Enable dependency signaling after completion
dep_set_all_chiplet 1 Broadcast dep_set to all chiplets
dep_set_chiplet_id 8 Target chiplet for dep_set
dep_set_cluster_id log2(clusters) Target cluster for dep_set
dep_set_code N_CORES Bitmask: which core rows to signal in dep matrix

Task Lifecycle

1. PUSH      Host writes task descriptor to task queue
2. ROUTE     Demux routes task to assigned core's waiting queue
3. CHECK     dep_check_manager reads dep_matrix:
             - dep_check_en=0: bypass (immediate pass)
             - dep_check_en=1: wait until all required counters >= 1
4. CLEAR     On pass, decrement checked counters by 1
5. DISPATCH  Task enters ready queue; core reads and executes
6. COMPLETE  Core writes done_info to per-(core,cluster) done queue
7. SIGNAL    Done queue + checkout queue match triggers dep_set:
             - Local: increment counter in target cluster's dep matrix
             - Remote: AXI-Lite write to target chiplet's H2H mailbox

Counter-Based Dependency Matrix

Each cluster has a dependency matrix with N_CORES x N_CORES cells, where each cell is an 8-bit saturating counter (not a single bit).

             Column (signal source core)
             core 0    core 1    core 2
Row 0 (co0)  [cnt]     [cnt]     [cnt]    <- what core 0 waits for
Row 1 (co1)  [cnt]     [cnt]     [cnt]    <- what core 1 waits for
Row 2 (co2)  [cnt]     [cnt]     [cnt]    <- what core 2 waits for

Operations:

  • set_column(col, mask): For each row in mask, increment counter[row][col]. Always succeeds (no overlap rejection, dep_set_ready = '1).
  • check_row(row, mask): True if counter[row][c] >= 1 for every column c in the mask.
  • clear_row(row, mask): Decrement counter[row][c] by 1 for each column c in the mask.

This design eliminates the deadlock caused by the old 1-bit overlap detection, where a second set to an already-set bit was rejected, creating circular backpressure through the done queue.

Per-(Core, Cluster) Done Queues

Each (core, cluster) pair has its own independent done queue FIFO:

Done Queues: [NUM_CORES][NUM_CLUSTERS] independent FIFOs

done_q[0][0]   done_q[0][1]     <- core 0, clusters 0..1
done_q[1][0]   done_q[1][1]     <- core 1, clusters 0..1
done_q[2][0]   done_q[2][1]     <- core 2, clusters 0..1

The pop condition for each FIFO depends ONLY on its own state:

done_q_pop[core][cluster] = !done_q_empty[core][cluster]
                          && checkout[core][cluster].task_type == NORMAL
                          && arbiter_ready[core + cluster * N_CORES]

No cross-core or cross-cluster dependency in the pop logic. This eliminates head-of-line blocking where one core's completion stalls behind another core's entry in a shared FIFO.

Module Hierarchy

bingo_hw_manager_top
 |
 +-- Task Queue (1x)
 |    +-- write_mailbox (AXI-Lite slave mode) OR
 |    +-- task_queue_master (AXI-Lite master mode)
 |
 +-- Per-Core Pipeline (NUM_CORES_PER_CLUSTER instances)
 |    +-- fifo_v3 (waiting_dep_check_queue, depth=8)
 |    +-- dep_check_manager (4-state FSM)
 |    +-- stream_filter (dep_check_en bypass)
 |    +-- stream_demux (route to cluster)
 |    +-- stream_filter (dummy task filter)
 |    +-- stream_demux (route to cluster, ready+checkout path)
 |
 +-- Per-Cluster Dep Matrix (NUM_CLUSTERS_PER_CHIPLET instances)
 |    +-- dep_matrix (counter-based, 8-bit cells)
 |
 +-- Per-(Core, Cluster) Queues (N_CORES x N_CLUSTERS instances each)
 |    +-- Ready Queue: read_mailbox or fifo_v3
 |    +-- Checkout Queue: fifo_v3 (depth=CheckoutQueueDepth)
 |    +-- Done Queue: fifo_v3 (depth=DoneQueueDepth)
 |    +-- stream_demux (local vs H2H dep_set routing)
 |    +-- stream_filter (dep_set_en filtering)
 |
 +-- Dep Matrix Set Arbiter (1x)
 |    +-- stream_arbiter (N_CORES*N_CLUSTERS + 1 inputs)
 |    +-- stream_demux (route to cluster dep matrix)
 |    +-- stream_demux (route to core within cluster)
 |
 +-- H2H Communication
 |    +-- Chiplet Dep Set Master (AXI-Lite master, 1x)
 |    +-- stream_arbiter (chiplet dep set, from all cores)
 |    +-- Chiplet Done Queue (write_mailbox, 1x)
 |
 +-- Power Manager (1x)
      +-- bingo_hw_manager_pm (idle-based clock gating)

Parameters

Parameter Default Description
NUM_CORES_PER_CLUSTER 4 Execution cores per cluster
NUM_CLUSTERS_PER_CHIPLET 2 Clusters per chiplet
TaskIdWidth 12 Task ID width (max 4096 tasks)
ChipIdWidth 8 Chiplet ID width (max 256 chiplets)
HostAxiLiteAddrWidth 48 Host-side AXI address width
HostAxiLiteDataWidth 64 Host-side AXI data width (task descriptor)
DeviceAxiLiteAddrWidth 48 Device-side AXI address width
DeviceAxiLiteDataWidth 32 Device-side AXI data width (done info)
TaskQueueDepth 32 Incoming task FIFO depth
DoneQueueDepth 32 Per-(core,cluster) done FIFO depth
CheckoutQueueDepth 8 Per-(core,cluster) checkout FIFO depth
ReadyQueueDepth 8 Per-(core,cluster) ready FIFO depth
TASK_QUEUE_TYPE 1 0: AXI-Lite slave, 1: AXI-Lite master
READY_AND_DONE_QUEUE_INTERFACE_TYPE 1 0: AXI-Lite, 1: CSR req/resp

Interface Modes

Task Queue:

  • Mode 0 (Slave): Host pushes task descriptors via AXI-Lite writes to a mailbox
  • Mode 1 (Master): HW manager fetches task descriptors from host memory at task_list_base_addr_i

Ready/Done Queues:

  • Mode 0 (AXI-Lite): Cores read ready tasks and write completions via AXI-Lite
  • Mode 1 (CSR): Cores use lightweight CSR req/resp interface (lower latency)

Cross-Chiplet Communication

When a task's dep_set_chiplet_id != chip_id_i, the dependency signal is routed to a remote chiplet via the H2H path:

  1. Checkout queue entry routed to chiplet dep_set arbiter
  2. bingo_hw_manager_chiplet_dep_set module sends AXI-Lite write to remote chiplet's mailbox
  3. Remote chiplet receives via from_remote_axi_lite_req_i into its chiplet done queue
  4. Remote chiplet processes the signal through its dep matrix set arbiter

Broadcast mode (dep_set_all_chiplet = 1) sends the signal to all chiplets simultaneously.

Dependencies

  • AXI v0.39.1 — AXI-Lite definitions, crossbar
  • common_cells v1.37.0 — FIFO, stream arbiter/demux/filter, counters

DARTS: Dynamic Adaptive Runtime Task Scheduling

DARTS extends the static scheduler with conditional execution support for data-dependent workloads (MoE routing, early exit). See dev_doc/ for full architecture documentation.

Conditional Execution (CERF)

A 16-entry Conditional Execution Register File (CERF) per chiplet enables runtime task skipping. Tasks marked as conditional are either executed or skipped based on the CERF state, which is written by gating tasks on completion.

The user expresses conditional execution through conditional edges in the DFG:

# Router conditionally activates each expert (compiler handles the rest)
dfg.bingo_add_edge(router, expert_0, cond=True)
dfg.bingo_add_edge(router, expert_1, cond=True)
dfg.bingo_add_edge(expert_0, aggregator)          # unconditional

# Compile: auto-assigns CERF groups, promotes router to gating task
compile_dfg(dfg)

# Simulate: specify which nodes are active
run_sim(dfg, config, active_nodes={expert_0})

The compiler pass bingo_compile_conditional_regions():

  1. Scans edges for cond=True
  2. Auto-promotes source nodes to gating tasks (task_type=2)
  3. Groups conditional targets by connected components (unconditional edges between targets = shared CERF group)
  4. Assigns CERF group IDs (0-15) automatically

Skipped tasks still propagate dependency signals (via the checkout queue as dummies), preserving graph correctness.

Additional Modules

Module Purpose
bingo_hw_manager_cond_exec_controller.sv 16-entry CERF register file
bingo_hw_manager_load_monitor.sv Per-core pending task counters (load monitoring)

Evaluation Results

Evaluated via cycle-accurate Python simulator (scripts/eval_darts.py):

Workload Configuration Speedup
MoE 8 experts, top-2 1 cluster, 3 cores 2.29x
MoE 16 experts, top-1 1 cluster, 3 cores 4.15x
MoE 8 experts, top-2 2 chiplets 1.84-1.95x
Early exit (stage 0/4) 1 cluster, 3 cores 3.28x

Source Files

Level File Description
0 bingo_hw_manager_mailbox_adapter.sv AXI-Lite to mailbox adapter
0 bingo_hw_manager_read_mailbox.sv FIFO-to-AXI-Lite read bridge
0 bingo_hw_manager_write_mailbox.sv AXI-Lite-to-FIFO write bridge
0 bingo_hw_manager_task_queue_master.sv AXI-Lite master for task fetching
0 bingo_hw_manager_csr_to_fifo*.sv CSR interface adapters
1 bingo_hw_manager_dep_matrix.sv Counter-based dependency matrix
1 bingo_hw_manager_chiplet_dep_set.sv H2H AXI-Lite master
1 bingo_hw_manager_dep_check_manager.sv Dependency check FSM
1 bingo_hw_manager_pm.sv Power manager
1 bingo_hw_manager_cond_exec_controller.sv CERF (conditional execution)
1 bingo_hw_manager_load_monitor.sv Load monitoring
2 bingo_hw_manager_top.sv Top-level integration

Testing

The test infrastructure includes:

  • RTL testbench harness (test/tb_bingo_hw_manager_harness.svh) with deadlock detection, counter monitoring, and structured trace logging
  • 9 structured DFG patterns: serial chain, parallel fork-join, double buffer, diamond, stacked GEMM, multi-chiplet chain
  • 44 random DAG tests: 10-40 tasks, 1-4 chiplets, sparse/dense edge configurations
  • Cycle-accurate Python model (model/) mirroring the RTL pipeline behavior
  • DFG compiler (sw/bingo_dfg.py) with automatic dummy task insertion
# Compile and simulate (requires Questa)
make clean && make compile.log
cd build && echo "run -all" | vsim -sv_seed 0 tb_bingo_hw_manager_top -t 1ns -voptargs="+acc"

# Run Python model tests
source ../.venv/bin/activate
python3 scripts/run_all_tests.py --stress 200

Test results:

  • RTL: 66/66 pass (includes CERF conditional tests) with zero deadlocks
  • Python model: 68/68 pass (20 structured + 48 random stress)
  • Evaluation: 4 experiment suites, all deadlock-free (CSV results in eval_results/)

About

The hardware task manager for heterogneous multicore soc

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors