Betlang

CPU source-language detection for code with a tiny 50kb model.

[dependencies]
betlang = "0.1.0"

let detection = betlang::detect("fn main() { println!(\"hi\"); }");

assert_eq!(detection.language(), Some(betlang::Language::Rust));

Use betlang::detect(source) for UTF-8 source strings or byte slices. It returns a Detection; call Detection::language() to read the top language. Call Detection::top_languages() when you need ranked probabilities.

Supported Languages

Slugs parse through the standard FromStr implementation:

assert_eq!("rust".parse::<betlang::Language>(), Ok(betlang::Language::Rust));

asm, batch, c, clojure, cmake, cobol, cpp, cs, css, dart, dockerfile, elixir, erlang, gemfile, gemspec, go, gradle, groovy, haskell, html, ini, java, javascript, json, julia, kotlin, lisp, lua, markdown, objectivec, ocaml, perl, php, powershell, python, r, ruby, rust, scala, shell, sql, swift, toml, typescript, vba, verilog, xml, yaml.

These are the model's 48 output labels. Runtime detections expose them one-to-one with no label aggregation.

The confusion matrix uses the same labels:

Model

The embedded model is assets/magika/source-student-q4.bin, a 47,840-byte weights-only MSQ1 payload with SHA-256:

59ef24167bddd1364eb9c1650add8a67e1a542b5155fac67f5e1cda07df0c0f0

Architecture: wordseq-b1024-k3-m2048-tiny-3conv-hidden, tokenizer version 3. On the manifest-aligned held-out filesystem-label test split it reaches test_fs_accuracy=0.965238 with macro_recall=0.965411.

See MODEL_CARD.md for the training and evaluation summary.

Performance

Betlang uses a fixed 4096-byte Magika window and pads runtime inference to the same 2048-token shape used by evaluation. The model is loaded once per process and then reused through a OnceLock.

Native CPU inference dispatches through fearless_simd. Benchmark entry points are available through cargo bench. Current baseline numbers are tracked in BENCHMARKS.md.

License And Attribution

Betlang is licensed under MIT. The embedded student model was trained from outputs of Google's Magika teacher model; Magika is published by Google under Apache-2.0. Keep this attribution with redistributed model artifacts.

Confusion By File Size

The shipped wordseq model is evaluated below on the held-out bigorig test split. Each panel is a row-normalized confusion matrix for one file-size bucket: actual labels are rows, predicted labels are columns, and the diagonal is correct classification.

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
.claude		.claude
.github		.github
assets		assets
benches		benches
examples		examples
reports		reports
scripts		scripts
snippets		snippets
src		src
tests/fixtures/languages		tests/fixtures/languages
.gitignore		.gitignore
BENCHMARKS.md		BENCHMARKS.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
MODEL_CARD.md		MODEL_CARD.md
README.md		README.md
RELEASING.md		RELEASING.md
_typos.toml		_typos.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Betlang

Supported Languages

Model

Performance

License And Attribution

Confusion By File Size

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Betlang

Supported Languages

Model

Performance

License And Attribution

Confusion By File Size

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages