
Commit 7d06c64

update README
1 parent dc23efc

1 file changed: README.md

Lines changed: 12 additions & 4 deletions

@@ -1,21 +1,29 @@
 ![](../../workflows/gds/badge.svg) ![](../../workflows/docs/badge.svg) ![](../../workflows/test/badge.svg) ![](../../workflows/fpga/badge.svg)
 
-# Tiny Tapeout Verilog Tensor Processing Unit
+# Tiny Tapeout Verilog Processing Unit
 
 - [Read the project documentation](docs/info.md)
 
-## Overview: Verilog Matrix Multiply Accelerator
+## Overview
 
-This project implements a small-scale, hardware-efficient Tensor Processing Unit (TPU) that performs 2×2 signed matrix multiplications using a systolic array of Multiply-Accumulate (MAC) units. It is designed in Verilog and deployable via the Tiny Tapeout ASIC flow.
+This project implements a small-scale, hardware-efficient Tensor Processing Unit (TPU) that performs 2×2 matrix multiplications using a systolic array of Multiply-Accumulate (MAC) units. It is designed in Verilog and deployable via the Tiny Tapeout ASIC flow.
 
-## Key Features
+Hardware design files are in the `./src` folder, while the ML inference setup is inside `./test/tpu`.
+
+## Hardware Features
 
 - **Systolic Array:** A 2×2 grid of MAC units propagates data left-to-right and top-to-bottom, emulating a systolic matrix multiplication engine.
 - **Signed 8-bit Inputs, 16-bit Outputs:** Handles signed integers (-128 to 127) and accumulates products in 16-bit precision.
 - **Streaming Input/Output:** Supports pipelined loading and output to achieve >99.8M operations/sec.
 - **Control FSM:** Automates input loading, matrix multiplication timing, and result collection.
 - **Optional Features:** On-chip fused matrix transpose (`Bᵀ`) and ReLU activation.
 
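To make the dataflow concrete, here is a minimal software reference model of the 2×2 signed matmul described by this section of the diff. It is a sketch for illustration, not code from the repository; the function name and the `relu` flag are assumptions:

```python
import numpy as np

def tpu_matmul_2x2(a, b, relu=False):
    """Reference model of the 2x2 systolic matmul: signed 8-bit inputs,
    16-bit accumulation, optional on-chip ReLU."""
    a = np.asarray(a, dtype=np.int8)  # signed inputs, -128..127
    b = np.asarray(b, dtype=np.int8)
    acc = np.zeros((2, 2), dtype=np.int16)
    # Step k models one wavefront through the array: MAC cell (i, j)
    # accumulates a[i, k] * b[k, j] as A streams left-to-right and
    # B streams top-to-bottom.
    for k in range(2):
        acc += a[:, k:k + 1].astype(np.int16) * b[k:k + 1, :].astype(np.int16)
    return np.maximum(acc, 0) if relu else acc

# Sanity check against ordinary matrix multiplication:
print(tpu_matmul_2x2([[1, -2], [3, 4]], [[5, 6], [-7, 8]]))
# [[ 19 -10]
#  [-13  50]]
```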

+## Machine Learning Ecosystem
+
+- **Accurate Low-Precision Training**: since the chip runs only 8-bit arithmetic but training needs higher precision, I use Quantization-Aware Training (QAT), simulating the quantization noise of going from 32-bit float to 8-bit integer. This keeps the accuracy loss from quantization minimal.
+  - The process is made easier by TorchAO, which inserts quantization stubs automatically, and by ExecuTorch, which handles fast quantization and low-precision training.
+- **Simplified Model Deployment**: rather than training in a `torch.nn` module and then running inference by manually calling the chip's `matmul` kernel, a custom PyTorch compiler backend unifies both: train the model with the module, call `torch.compile` with the custom backend, and run inference with `model(input)`.
+
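The core mechanism QAT relies on is a quantize-dequantize round trip inserted into the forward pass, which is the kind of stub TorchAO automates. A minimal conceptual sketch, assuming symmetric per-tensor int8 quantization (this is not the repository's training code or TorchAO's API):

```python
import torch

def fake_quantize_int8(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Quantize-dequantize round trip that injects int8 quantization
    noise into the forward pass while training stays in float32."""
    q = torch.clamp(torch.round(x / scale), -128, 127) * scale
    # Straight-through estimator: the forward pass sees quantized values,
    # the backward pass treats the round trip as identity so gradients flow.
    return x + (q - x).detach()

x = torch.randn(2, 2, requires_grad=True)
scale = float(x.abs().max()) / 127   # symmetric per-tensor scale
y = fake_quantize_int8(x, scale)
y.sum().backward()
print(x.grad)  # all ones: gradients pass straight through the rounding
```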
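And a minimal sketch of the custom-backend mechanism the deployment bullet describes. The backend callable signature is PyTorch's documented `torch.compile` interface; the TPU dispatch itself is only hinted at with a hypothetical hook:

```python
import torch

def tpu_backend(gm: torch.fx.GraphModule, example_inputs):
    """torch.compile backend (sketch): receives the captured FX graph and
    returns a callable. A real backend would rewrite matmul nodes to
    dispatch to the chip; this skeleton only shows the interception point."""
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target is torch.matmul:
            pass  # hypothetical: swap node.target for a driver that calls the TPU
    gm.recompile()
    return gm.forward  # run the (possibly rewritten) graph

model = torch.nn.Linear(2, 2)
compiled = torch.compile(model, backend=tpu_backend)
print(compiled(torch.randn(1, 2)))  # inference stays a plain model(input) call
```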
 ---
 
 ## System Architecture
