On-device medical search for nurses and midwives in Zanzibar
Demo · Eval Report · Latency Report
Android app that answers clinical questions offline using on-device RAG — Gemma 4 E4B (LiteRT-LM) for generation, Gecko for embeddings, SQLite for vector search. No internet needed after the initial ~4.5 GB model download.
┌─────────────────────────────────────────────────┐
│ Flutter UI (Dart) │
│ intro_page.dart · search_page.dart │
├──────────────┬──────────────────────────────────┤
│ MethodChannel│ EventChannel (streaming) │
├──────────────┴──────────────────────────────────┤
│ Android Native (Kotlin) │
│ MainActivity.kt · RagStream.kt │
│ ┌────────────────────────────────────────────┐ │
│ │ RagPipeline.kt │ │
│ │ ┌──────────┐ ┌──────────┐ ┌────────────┐ │ │
│ │ │ Gemma 4 │ │ Gecko │ │ SQLite │ │ │
│ │ │ LiteRT-LM│ │ Embedder │ │ VectorStore│ │ │
│ │ └──────────┘ └──────────┘ └────────────┘ │ │
│ └────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────┘
Query → Gecko embeds → SQLite retrieves top-3 guideline chunks → prompt assembled → LiteRT-LM streams response → Flutter renders markdown.
Requires a real Android device (LiteRT-LM needs hardware acceleration, not emulators).
cd app
flutter pub get
flutter run # debug on connected device
flutter build apk # release APK
adb logcat -s mam-ai # timing, memory, inference logsDownload the APK from Releases and sideload onto a real Android device.
Downloaded on first launch from public HuggingFace repos — no auth required.
| File | Size | Source |
|---|---|---|
gemma-4-E4B-it.litertlm |
3.65 GB | litert-community/gemma-4-E4B-it-litert-lm |
Gecko_1024_quant.tflite |
146 MB | litert-community/Gecko-110m-en |
sentencepiece.model |
794 KB | litert-community/Gecko-110m-en |
embeddings.sqlite |
~140 MB | mamai-medical-guidelines releases |
The pinned RAG bundle version lives in config/rag_assets.lock.json.
Chunking and embedding are managed in the companion mamai-medical-guidelines repo. To pull in a new bundle:
- Bump
config/rag_assets.lock.jsonwith the new version + manifest checksum - Run the staging and push scripts:
bash scripts/sync_rag_assets.sh # download + stage bundle
bash scripts/sync_models.sh # download Gecko + Gemma from HuggingFace
bash scripts/push_to_device.sh # push everything to connected device
bash scripts/push_to_device.sh --embedding-models # push Gecko + tokenizer onlyTag from main only. CI builds a signed APK and publishes a GitHub Release automatically.
git tag v0.1.0-beta.1 # beta/alpha/rc → prerelease; vX.Y.Z → stable
git push origin v0.1.0-beta.1Valid formats: vX.Y.Z, vX.Y.Z-alpha.N, vX.Y.Z-beta.N, vX.Y.Z-rc.N
Required GitHub secrets: ANDROID_KEYSTORE_BASE64, ANDROID_KEYSTORE_PASSWORD, ANDROID_KEY_ALIAS, ANDROID_KEY_PASSWORD. See app/android/key.properties.example for local signing setup.
Benchmarks run across AfriMedQA, MedQA USMLE, MedMCQA, Kenya Vignettes, AfriMedQA SAQ, and WHB Stumps under the app_parity_v1 protocol (same system prompt as the APK, versioned RAG contexts).
| Model | MCQ avg | Open-ended avg |
|---|---|---|
| GPT-5 (no-RAG) | 82.8% | 4.19 / 5 |
| Gemma 3n E4B (no-RAG) | 45.5% | 2.98 / 5 |
| Gemma 4 E4B (deployed, no-RAG) | 42.9% | 2.61 / 5 |
RAG slightly hurts both on-device models on MCQ; GPT-5 is unaffected.
On an OPPO Snapdragon 8 Elite device, Gemma 4 E4B averages 11.7 s TTFT, 26.8 s total, 13.8 tok/s — slower TTFT than Gemma 3n E4B (6.8 s) despite faster decode.
With LiteRT-LM 0.11.0 on GPU (opt-in), TTFT drops to ~1–2 s on the same device; decode rate is unchanged. Multi-token Prediction (ExperimentalFlags.enableSpeculativeDecoding, gated behind the useMtpForLlm Gradle property) was smoke-tested and produced a ~10–20% decode slowdown rather than the vendor's claimed >2× — drafter acceptance is likely poor for our long retrieved-context prompts. Off by default; re-test before re-enabling.
Full results: eval report · latency report
Gemma 3n E4B was finetuned on medical QA data using LoRA (not deployed). Training code removed; artefacts archived externally: dataset · model.
