int4

Here are 18 public repositories matching this topic...

intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on PyTorch, TensorFlow, and ONNX Runtime

sparsity pruning quantization knowledge-distillation auto-tuning int8 low-precision quantization-aware-training post-training-quantization awq int4 large-language-models gptq smoothquant sparsegpt fp4 mxformat

Updated Jun 3, 2026
Python

intel / auto-round

A SOTA quantization algorithm for high-accuracy low-bit LLM inference, seamlessly optimized for CPU/XPU/CUDA, with multi-datatype support and full compatibility with vLLM, SGLang, and Transformers.

transformers rounding quantization int4 llms vllm gguf vlms sglang mxfp4 nvfp4

Updated Jun 4, 2026
Python

tpoisonooo / how-to-optimize-gemm

row-major matmul optimization

vulkan cuda armv7 arm64 ptx gemm-optimization cuda-kernel int4

Updated May 14, 2026
C++

intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization

Updated Aug 30, 2024
C++

Geekgineer / needle-rs

258 KB WASM runtime for Needle, a 26M-parameter tool-calling transformer. Runs in the browser, Cloudflare Workers, and Node.js. No backend required.

Updated May 20, 2026
Rust

hupe1980 / vecgo

🧬🔍 Vecgo is a pure Go, embeddable, hybrid vector database designed for high-performance production workloads. It combines commit-oriented durability with HNSW + DiskANN indexing for best-in-class performance.

Updated Jan 19, 2026
Go

konjoai / squish

🤖🗜️⚡️ Local LLM server for Apple Silicon. 5.4× faster end-to-end on long contexts vs Ollama, 33% less RAM, INT3 support for Qwen3. OpenAI + Ollama drop-in. Built for repeated long-context workloads on memory-constrained Macs.

Updated Jun 4, 2026
Python

NAME0x0 / AVA

Research and training stack for AVA — a tool-using, memory-aware virtual assistant targeting 4 GB VRAM. Spans custom transformers, verifier-RL, external memory, multi-domain benchmarks, and Gemma 4 inference optimization.

Updated May 20, 2026
Python

Danaozhong / rust-bitwriter

Rust library for writing integer types of any bit length into a buffer, from `i1` to `i64`.

rust bytebuffer bitbuffer bitbanging bitwriter int4 int15

Updated Jul 16, 2024
Rust

ambv231 / tinyllama-coreml-ios18-quantization

Quantize TinyLlama-1.1B-Chat from PyTorch to CoreML (float16, int8, int4) for efficient on-device inference on iOS 18+.

nlp mobile ai transformers pytorch llama quantization int8 coreml on-device huggingface apple-silicon int4 llm tinyllama ios18 mlpackage

Updated Jun 4, 2026
Python

katolikov / triad-ptq

PyTorch implementation of TRIAD-PTQ (Trace-Router-Interaction-Aware Decomposition) — weight-only INT3/INT4 PTQ for compact LLMs and edge CNNs/ViTs, with real benchmarks on SmolLM/TinyLlama/MobileNetV2/EfficientNet-B0/MobileViT-S.

pytorch mps quantization model-compression mobilenet efficient-inference edge-ai post-training-quantization apple-silicon vision-transformer int4 ptq llm gptq weight-only-quantization

Updated May 5, 2026
Python

metaSATOKEN / norm-separated-quantization

Training-free fix for KV cache INT4 failures. Norm separation + per-channel quantization. Qwen2-7B: 744× improvement (ΔPPL +238 → +0.32). 12 models, 124M–40B. 4 lines of PyTorch.

reproducible-research pytorch open-science transformer outliers quantization memory-optimization kv-cache int4 llm-inference norm-separation per-channel-quantization

Updated Apr 18, 2026
Python

jvoltci / breccia

Block-scaled FP8 / FP4 / INT4 tensor primitive with Triton scaled-matmul at FP32 parity on H100. NumPy / PyTorch / MLX / JAX backends.

machine-learning deep-learning numpy pytorch triton quantization mlx jax low-precision int4 fp8 fp4 torchao nvfp4 transformer-engine mxfp8

Updated May 24, 2026
Python

GreenBull31 / tinyllama-coreml-ios18-quantization

Quantize TinyLlama-1.1B-Chat from PyTorch to CoreML (float16, int8, int4) for efficient on-device inference on iOS 18+.

nlp mobile ai transformers pytorch llama quantization int8 coreml on-device huggingface apple-silicon int4 llm tinyllama ios18 mlpackage

Updated May 18, 2025
Python

banda-larga / onnx-conv2matmul

Utilities to rewrite ONNX convolution patterns into MatMul forms for optimal LLM-like int4 quantization (esp. Audio/Speech models).

speech-recognition quantization nemo asr onnx matmul int4

Updated Apr 17, 2026
Python

metaSATOKEN / nsn

Training-free INT3 KV cache quantization: 5.09× compression, ~10 lines of Python, <5% WikiText-2 ΔPPL on 8 of 8 open-weight Transformers (GPT-J 2021 → Gemma-4 2026). No calibration, no codebook, no rotation, no adapter. +2.4% decode overhead with torch.compile (no custom CUDA).

reproducible-research transformers pytorch open-science quantization memory-optimization kv-cache int4 calibration-free llm-inference int3 norm-separation per-channel-quantization activation-outliers

Updated May 21, 2026
Python

hafizradzi8901 / spiking-ff-jepa

Backprop-free learning study: spiking (LIF) neurons + Forward-Forward + JEPA + int4 QAT, with a full ablation notebook.

deep-learning pytorch spiking-neural-networks quantization neuromorphic norse self-supervised-learning int4 forward-forward jepa

Updated May 30, 2026
Python

dakshjain-1616 / codynamicslab-latch-qwen2-5-14

Quantizes the unquantized LATCH-Qwen2.5-14B model to GGUF format with a strict perplexity delta constraint. Outputs a Markdown benchmark report comparing FP16 vs Q4_K_M accuracy.

python benchmark quantization model-compression perplexity int4 llm llama-cpp qwen gguf

Updated Mar 26, 2026
Python
