|
# Tiny Tapeout Verilog Processing Unit

- [Read the documentation for the project](docs/info.md)
|
## Overview

This project implements a small-scale, hardware-efficient Tensor Processing Unit (TPU) that performs 2×2 matrix multiplications using a systolic array of Multiply-Accumulate (MAC) units. It is designed in Verilog and deployable via the Tiny Tapeout ASIC flow.
|
Hardware design files are in the `./src` folder, while the ML inference setup is inside `./test/tpu`.

## Hardware Features
|
- **Systolic Array:** A 2×2 grid of MAC units propagates data left-to-right and top-to-bottom, emulating a systolic matrix multiplication engine (modeled in the sketch after this list).
- **Signed 8-bit Inputs, 16-bit Outputs:** Handles signed integers (-128 to 127) and accumulates products in 16-bit precision.
- **Streaming Input/Output:** Supports pipelined loading and output to achieve >99.8M operations/sec.
- **Control FSM:** Automates input loading, matrix multiplication timing, and result collection.
- **Optional Features:** On-chip fused matrix transpose (`Bᵀ`) and ReLU activation.
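
To make the dataflow concrete, here is a minimal Python behavioral model of the 2×2 systolic array described above. It is a sketch, not the Verilog source: the function name `systolic_matmul_2x2` and the exact wavefront schedule are assumptions chosen to illustrate the scheme.

```python
import numpy as np

def systolic_matmul_2x2(A, B):
    """Behavioral model (assumed, for illustration) of a 2x2 systolic array.

    A, B: 2x2 matrices of signed 8-bit values (-128..127). Each PE (i, j)
    owns a 16-bit accumulator; rows of A flow in from the left and columns
    of B from the top, skewed so operands meet at the right PE on time.
    """
    A = np.asarray(A, dtype=np.int16)
    B = np.asarray(B, dtype=np.int16)
    acc = np.zeros((2, 2), dtype=np.int16)  # 16-bit accumulators, wrap on overflow

    # Classic systolic wavefront: at cycle t, PE (i, j) consumes operand
    # pair k only when k = t - i - j lies in range.
    for t in range(4):  # enough cycles for the wavefront to cross a 2x2 grid
        for i in range(2):
            for j in range(2):
                k = t - i - j
                if 0 <= k < 2:
                    acc[i, j] += A[i, k] * B[k, j]  # one MAC per PE per cycle
    return acc

# Sanity check against a software matmul, with signed operands.
A = [[-3, 7], [5, -2]]
B = [[4, -1], [6, 8]]
assert (systolic_matmul_2x2(A, B) == np.array(A) @ np.array(B)).all()
```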
|
## Machine Learning Ecosystem

- **Accurate Low-Precision Training**: the chip runs only 8-bit arithmetic, but training needs higher precision, so I use Quantization-Aware Training (QAT), which simulates the quantization noise of going from 32-bit float to 8-bit integer during training. This keeps the accuracy loss from quantization minimal (see the QAT sketch after this list).
  - The process is made easier by TorchAO, which inserts quantization stubs automatically, and ExecuTorch, which handles fast quantization and low-precision training.
- **Simplified Model Deployment**: rather than running inference by calling the chip's `matmul` kernel manually while training in a separate `torch.nn` module, a PyTorch compiler backend unifies the process in `torch.nn`: train the model using the module, call `torch.compile` with the custom backend, and run inference with `model(input)` (see the backend sketch after this list).
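
As a rough illustration of the QAT step, here is a minimal sketch using PyTorch's eager-mode `torch.ao.quantization` API (a stand-in for the TorchAO/ExecuTorch flow named above); `TinyNet` and the training loop are placeholders, not the project's real model.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Placeholder model; quant/dequant stubs mark the int8 region."""
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # fp32 -> int8 boundary
        self.fc1 = nn.Linear(4, 8)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(8, 2)
        self.dequant = torch.ao.quantization.DeQuantStub()  # int8 -> fp32 boundary

    def forward(self, x):
        return self.dequant(self.fc2(self.relu(self.fc1(self.quant(x)))))

model = TinyNet().train()
# Fake-quantize weights and activations during training so the network
# learns around int8 rounding/clamping noise before it ever hits the chip.
model.qconfig = torch.ao.quantization.get_default_qat_qconfig("qnnpack")
qat_model = torch.ao.quantization.prepare_qat(model)

opt = torch.optim.SGD(qat_model.parameters(), lr=0.01)
for _ in range(100):  # toy loop on random data, purely illustrative
    x, y = torch.randn(16, 4), torch.randint(0, 2, (16,))
    loss = nn.functional.cross_entropy(qat_model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

int8_model = torch.ao.quantization.convert(qat_model.eval())  # real int8 weights
```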
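
And a minimal sketch of the compiler-backend idea: `torch.compile` hands the backend a captured FX graph, whose matmul nodes can be retargeted at the hardware kernel. `tpu_matmul` is hypothetical here (it just falls back to software); the real driver call would stream operands to the chip.

```python
import torch

def tpu_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for the driver call that streams operands
    # to the systolic array; this sketch falls back to software matmul.
    return a @ b

def tpu_backend(gm: torch.fx.GraphModule, example_inputs):
    # Retarget every explicit torch.matmul node at the hardware kernel,
    # then hand back the recompiled graph as the callable to run.
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target is torch.matmul:
            node.target = tpu_matmul
    gm.recompile()
    return gm.forward

class MatMulNet(torch.nn.Module):
    """Placeholder torch.nn module with an explicit matmul in forward."""
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(2, 2))

    def forward(self, x):
        return torch.matmul(x, self.w)

model = MatMulNet()                        # train it as a normal nn.Module ...
compiled = torch.compile(model, backend=tpu_backend)
out = compiled(torch.randn(4, 2))          # ... then inference is model(input)
```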
---
|
## System Architecture
|