MAM-AI

On-device medical search for nurses and midwives in Zanzibar

Android app that answers clinical questions offline using on-device RAG — Gemma 4 E4B (LiteRT-LM) for generation, Gecko for embeddings, SQLite for vector search. No internet needed after the initial ~4.5 GB model download.

Architecture

┌─────────────────────────────────────────────────┐
│  Flutter UI (Dart)                              │
│  intro_page.dart · search_page.dart             │
├──────────────┬──────────────────────────────────┤
│ MethodChannel│  EventChannel (streaming)        │
├──────────────┴──────────────────────────────────┤
│  Android Native (Kotlin)                        │
│  MainActivity.kt · RagStream.kt                 │
│  ┌────────────────────────────────────────────┐ │
│  │ RagPipeline.kt                             │ │
│  │  ┌──────────┐ ┌──────────┐ ┌────────────┐ │ │
│  │  │ Gemma 4  │ │  Gecko   │ │  SQLite    │ │ │
│  │  │ LiteRT-LM│ │ Embedder │ │ VectorStore│ │ │
│  │  └──────────┘ └──────────┘ └────────────┘ │ │
│  └────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘

Query → Gecko embeds → SQLite retrieves top-3 guideline chunks → prompt assembled → LiteRT-LM streams response → Flutter renders markdown.

Build & Run

Requires a real Android device (LiteRT-LM needs hardware acceleration, not emulators).

cd app
flutter pub get
flutter run              # debug on connected device
flutter build apk        # release APK
adb logcat -s mam-ai     # timing, memory, inference logs

Install

Download the APK from Releases and sideload onto a real Android device.

Model Files

Downloaded on first launch from public HuggingFace repos — no auth required.

File	Size	Source
`gemma-4-E4B-it.litertlm`	3.65 GB	litert-community/gemma-4-E4B-it-litert-lm
`Gecko_1024_quant.tflite`	146 MB	litert-community/Gecko-110m-en
`sentencepiece.model`	794 KB	litert-community/Gecko-110m-en
`embeddings.sqlite`	~140 MB	mamai-medical-guidelines releases

The pinned RAG bundle version lives in config/rag_assets.lock.json.

Updating RAG Assets

Chunking and embedding are managed in the companion mamai-medical-guidelines repo. To pull in a new bundle:

Bump config/rag_assets.lock.json with the new version + manifest checksum
Run the staging and push scripts:

bash scripts/sync_rag_assets.sh          # download + stage bundle
bash scripts/sync_models.sh              # download Gecko + Gemma from HuggingFace
bash scripts/push_to_device.sh           # push everything to connected device
bash scripts/push_to_device.sh --embedding-models  # push Gecko + tokenizer only

Releasing

Tag from main only. CI builds a signed APK and publishes a GitHub Release automatically.

git tag v0.1.0-beta.1    # beta/alpha/rc → prerelease; vX.Y.Z → stable
git push origin v0.1.0-beta.1

Valid formats: vX.Y.Z, vX.Y.Z-alpha.N, vX.Y.Z-beta.N, vX.Y.Z-rc.N

Required GitHub secrets: ANDROID_KEYSTORE_BASE64, ANDROID_KEYSTORE_PASSWORD, ANDROID_KEY_ALIAS, ANDROID_KEY_PASSWORD. See app/android/key.properties.example for local signing setup.

Evaluation

Benchmarks run across AfriMedQA, MedQA USMLE, MedMCQA, Kenya Vignettes, AfriMedQA SAQ, and WHB Stumps under the app_parity_v1 protocol (same system prompt as the APK, versioned RAG contexts).

Model	MCQ avg	Open-ended avg
GPT-5 (no-RAG)	82.8%	4.19 / 5
Gemma 3n E4B (no-RAG)	45.5%	2.98 / 5
Gemma 4 E4B (deployed, no-RAG)	42.9%	2.61 / 5

RAG slightly hurts both on-device models on MCQ; GPT-5 is unaffected.

On an OPPO Snapdragon 8 Elite device, Gemma 4 E4B averages 11.7 s TTFT, 26.8 s total, 13.8 tok/s — slower TTFT than Gemma 3n E4B (6.8 s) despite faster decode.

With LiteRT-LM 0.11.0 on GPU (opt-in), TTFT drops to ~1–2 s on the same device; decode rate is unchanged. Multi-token Prediction (ExperimentalFlags.enableSpeculativeDecoding, gated behind the useMtpForLlm Gradle property) was smoke-tested and produced a ~10–20% decode slowdown rather than the vendor's claimed >2× — drafter acceptance is likely poor for our long retrieved-context prompts. Off by default; re-test before re-enabling.

Full results: eval report · latency report

Finetuning (archived)

Gemma 3n E4B was finetuned on medical QA data using LoRA (not deployed). Training code removed; artefacts archived externally: dataset · model.

License

Apache 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 263 Commits
.github		.github
app		app
config		config
device_push		device_push
evaluation		evaluation
scripts		scripts
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MAM-AI

Architecture

Build & Run

Install

Model Files

Updating RAG Assets

Releasing

Evaluation

Finetuning (archived)

License

About

Uh oh!

Releases

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MAM-AI

Architecture

Build & Run

Install

Model Files

Updating RAG Assets

Releasing

Evaluation

Finetuning (archived)

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Contributors

Uh oh!

Languages