Introduction: The Case for On-Device LLM Inference
The dominant pattern in AI-powered mobile apps has been a thin client that proxies requests to a cloud-hosted model—OpenAI, Anthropic, Gemini API, or a self-hosted endpoint. It works, until it doesn't. Latency spikes during peak hours. API costs scale linearly with user growth. And every prompt your user types transits through infrastructure you don't fully control.
On-device inference inverts this model entirely. By running a quantized LLM directly on the user's hardware, you eliminate four structural problems at once:
| Concern | Cloud API | On-Device LLM |
|---|---|---|
| Privacy | User data leaves the device | All processing is local — zero data exfiltration surface |
| Latency | Network-bound (100ms–2s+ round-trip) | Compute-bound (~10–50ms first-token on modern SoCs) |
| Cost | Per-token billing, scales with MAU | Fixed cost — model ships with (or alongside) the app |
| Offline | Fails without connectivity | Fully functional in airplane mode, tunnels, rural areas |
Google's LiteRT runtime (the successor to TensorFlow Lite), together with its LLM-focused library (the litertlm artifact used throughout this post), makes this practical. Paired with Gemma, Google's family of lightweight, open-weight models, it gives Android developers a production-viable path to ship local AI without maintaining inference infrastructure.
This post walks through the architecture, implementation, and performance characteristics of integrating a local LLM into a production Android application using LiteRT. No toy demos. No single-file prototypes. A multi-module Clean Architecture codebase with streaming inference, session persistence, and lifecycle-aware resource management.
Architecture Overview: Where the LLM Fits in a Production Android App
Dropping an LLM engine into an Android app without architectural boundaries is a recipe for leaked resources, untestable code, and a ViewModel that knows too much. The inference layer must be treated like any other data source — isolated behind abstractions, injected via DI, and observable through reactive streams.
Clean Architecture Boundaries
The dependency flow follows strict inversion: the UI layer knows nothing about LiteRT. The domain layer defines what a chat repository can do. The data layer implements how it does it — including all LiteRT lifecycle management.
┌──────────────────────────────────────────────────────────────┐
│ Presentation Layer │
│ ChatScreen (Compose) → ChatViewModel → StateFlow<UiState> │
└──────────────────────┬───────────────────────────────────────┘
│ observes / invokes
┌──────────────────────▼───────────────────────────────────────┐
│ Domain Layer │
│ GetGemmaResponseUseCase → ChatRepository (interface) │
│ Pure Kotlin. No Android, no LiteRT, no Room imports. │
└──────────────────────┬───────────────────────────────────────┘
│ implemented by
┌──────────────────────▼───────────────────────────────────────┐
│ Data Layer │
│ ChatRepositoryImpl │
│ ├── LiteRT Engine (lifecycle, inference, token streaming) │
│ ├── Room Database (session & message persistence) │
│ └── Entity ↔ Domain mappers │
└──────────────────────────────────────────────────────────────┘
Key architectural decision: ChatRepositoryImpl is the only class that imports com.google.ai.edge.litertlm. This means:
- The entire domain and presentation layers can be unit-tested without a physical model file.
- Swapping the inference backend (e.g., migrating from LiteRT to ONNX Runtime, or switching models) requires changing exactly one class.
- The ViewModel receives tokens as a Flow<String>; it has no idea whether those tokens originate from a local model, a cloud API, or a mock.
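To make the boundary concrete, here is a minimal sketch of what the domain-layer contract and a test double could look like. ChatRepository, GetGemmaResponseUseCase, and generateResponse are named in this post; the exact signature (a nullable Long session ID) and FakeChatRepository are assumptions for illustration, not the project's actual code.

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.runBlocking

// Domain layer: pure Kotlin, no Android, LiteRT, or Room imports.
interface ChatRepository {
    // Assumed signature: a null sessionId means "start a new session".
    fun generateResponse(prompt: String, sessionId: Long?): Flow<String>
}

class GetGemmaResponseUseCase(private val repository: ChatRepository) {
    operator fun invoke(prompt: String, sessionId: Long?): Flow<String> =
        repository.generateResponse(prompt, sessionId)
}

// Hypothetical test double: emits canned tokens, so no model file is needed.
class FakeChatRepository(private val tokens: List<String>) : ChatRepository {
    override fun generateResponse(prompt: String, sessionId: Long?): Flow<String> = flow {
        tokens.forEach { emit(it) }
    }
}

fun main() = runBlocking {
    val useCase = GetGemmaResponseUseCase(FakeChatRepository(listOf("Hel", "lo")))
    val reply = useCase("Hi", sessionId = null).toList().joinToString("")
    println(reply) // prints "Hello"
}
```

Because the ViewModel depends only on the use case, the same fake can drive ViewModel tests on the JVM without a device or a model file.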
Data Flow: User Prompt to Streamed Response
The inference pipeline is fully reactive, built on Kotlin Flow and callbackFlow:
User taps Send
→ ChatViewModel.sendMessage(prompt)
→ Updates UI state to Generating
→ GetGemmaResponseUseCase(prompt, sessionId)
→ ChatRepositoryImpl.generateResponse(prompt, sessionId)
→ Persists user message to Room
→ Creates model message placeholder (state: STREAMING)
→ Builds LiteRT Message object
→ Calls conversation.sendMessageAsync(message)
→ Token-by-token Flow emission
→ Each token: trySend(token) to callbackFlow
→ Batched DB writes (~50ms debounce)
→ On completion: final DB flush, state → COMPLETE
← Flow<String> tokens stream back up
← ViewModel batches tokens (~40ms) into StateFlow
← Compose ChatScreen recomposes with streaming text
This design ensures that token streaming, database persistence, and UI updates all happen concurrently on their appropriate dispatchers, without blocking any single layer.
Implementation Guide
Step 1: Project Setup and Dependencies
LiteRT LLM ships as a single Maven artifact. Add it to your data module's dependency graph:
Version Catalog (gradle/libs.versions.toml):
[versions]
litertlmAndroid = "0.10.2"
[libraries]
litertlm-android = { module = "com.google.ai.edge.litertlm:litertlm-android", version.ref = "litertlmAndroid" }
Data module build.gradle.kts:
dependencies {
implementation(libs.litertlm.android)
}
Note: The LiteRT LLM artifact is distinct from the legacy tensorflow-lite artifacts. It provides a purpose-built API surface for conversational LLM inference (including Engine, Conversation, Message, and streaming primitives) rather than generic tensor operations.
Step 2: Model File Provisioning
Gemma models in .litertlm format are pre-packaged for LiteRT. They are too large for APK bundling (~2–4 GB for quantized variants), so you'll need a provisioning strategy:
| Strategy | Tradeoff |
|---|---|
| ADB push to app's external files dir | Development only. Fast iteration. |
| On-demand download (Play Asset Delivery / custom CDN) | Production. Adds download UX complexity. |
| Bundled in APK Expansion files | Legacy. Not recommended for models >150 MB. |
For development, the simplest path:
adb push gemma-4-E2B-it.litertlm /storage/emulated/0/Android/data/<your.package>/files/
In your repository implementation, resolve the model file path:
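A minimal sketch of that resolution step follows. The helper name resolveModelFile and its parameters are hypothetical; it assumes the caller passes in the directory returned by context.getExternalFilesDir(null), which is nullable on Android and corresponds to the adb push target above.

```kotlin
import java.io.File
import java.io.FileNotFoundException

// Hypothetical helper. `filesDir` stands in for context.getExternalFilesDir(null),
// which can be null when external storage is unavailable.
fun resolveModelFile(
    filesDir: File?,
    fileName: String = "gemma-4-E2B-it.litertlm",
): Result<File> {
    if (filesDir == null) {
        return Result.failure(IllegalStateException("External files directory is unavailable"))
    }
    val model = File(filesDir, fileName)
    return when {
        // Missing file: the model was never pushed or downloaded.
        !model.isFile -> Result.failure(
            FileNotFoundException("Model not found at ${model.absolutePath}")
        )
        // Zero-byte file: most likely an interrupted transfer.
        model.length() == 0L -> Result.failure(
            IllegalStateException("Model file is empty: ${model.absolutePath}")
        )
        else -> Result.success(model)
    }
}
```

Returning Result rather than throwing lets the repository surface a "model missing" state to the UI before any engine initialization is attempted.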
// TODO: Insert model file resolution logic here
// Resolve the .litertlm model file from the app's external files directory.
// Validate that the file exists before proceeding to engine initialization.
// Return Result.failure with a descriptive error if the model is missing.
Step 3: LiteRT Engine Initialization
The LiteRT Engine is the heavyweight object — it loads the model into memory, allocates compute buffers, and prepares the inference graph. Initialization is expensive (2–10 seconds depending on model size and device) and must happen off the main thread.
// TODO: Insert LiteRT Engine initialization code here
// 1. Build an EngineConfig specifying:
// - modelPath: absolute path to the .litertlm file
// - backend: Backend.CPU() for text inference (most compatible)
// - visionBackend: Backend.GPU() if supporting multimodal (image) inputs
// - cacheDir: app's cache directory for LiteRT internal caching
// 2. Construct the Engine instance with the config.
// 3. Call engine.initialize() — this is the blocking operation that loads the model.
// 4. Create an initial Conversation from the engine with a ConversationConfig.
// 5. Wrap in try/catch and return Result<Unit> for clean error propagation.
Design consideration: Treat the engine as a lazy singleton, initialized once on first use and held for the lifetime of the repository. Avoid creating multiple engine instances; each one allocates its own model memory.
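The lazy-singleton idea can be sketched independently of LiteRT. LazyEngineHolder below is a hypothetical name, and its type parameter stands in for the Engine; the factory you pass in would do the actual config-and-initialize work described above. If the engine is never released and re-created, Kotlin's built-in `by lazy` would likely be enough.

```kotlin
// Hypothetical holder: T stands in for the LiteRT Engine.
// `create` is expected to be expensive, so call get() off the main thread.
class LazyEngineHolder<T : Any>(private val create: () -> T) {
    private val lock = Any()

    @Volatile
    private var instance: T? = null

    /** Returns the shared instance, creating it exactly once on first use. */
    fun get(): T = instance ?: synchronized(lock) {
        // Re-check inside the lock so concurrent callers do not both create.
        instance ?: create().also { instance = it }
    }
}
```

If `create` throws, nothing is cached, so a later call to get() retries initialization instead of handing out a half-built engine.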
Step 4: Building Messages and Managing Conversations
LiteRT's conversation API is turn-based. Each Conversation maintains an internal context window with the full dialogue history. The API provides structured message types:
// TODO: Insert message construction code here
// For text-only prompts:
// Message.user(text = prompt)
//
// For multimodal (image + text) prompts:
// Message.user(Contents.of(
// Content.Text(prompt),
// Content.ImageFile(imageFilePath)
// ))
//
// For model responses (when replaying history):
// Message.model(text = responseText)
Session switching requires careful lifecycle management. When the user switches to a different chat session, you must:
- Close the current Conversation to release its context memory.
- Load the historical messages from your persistence layer (Room).
- Create a new Conversation with initialMessages pre-populated from history.
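The mapping half of those steps can be sketched without touching LiteRT or Room. All types here are hypothetical stand-ins: StoredMessage mirrors a persisted row and Turn mirrors the user/model message pair. Skipping blank rows is an assumption, intended to drop placeholders that never finished streaming.

```kotlin
// Hypothetical types: StoredMessage mirrors a Room row, Turn mirrors a LiteRT message.
enum class Role { USER, MODEL }

data class StoredMessage(
    val sessionId: Long,
    val role: Role,
    val text: String,
    val createdAt: Long,
)

sealed interface Turn {
    val text: String

    data class User(override val text: String) : Turn
    data class Model(override val text: String) : Turn
}

/** Builds the chronological turn list used to seed a fresh conversation. */
fun historyFor(sessionId: Long, rows: List<StoredMessage>): List<Turn> =
    rows.filter { it.sessionId == sessionId && it.text.isNotBlank() }
        .sortedBy { it.createdAt }
        .map { row ->
            when (row.role) {
                Role.USER -> Turn.User(row.text)
                Role.MODEL -> Turn.Model(row.text)
            }
        }
```

In the real repository the resulting list would be converted to LiteRT messages and passed as the initial messages of the new conversation.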
// TODO: Insert conversation history restoration code here
// 1. Close the existing conversation instance.
// 2. Query Room for all messages in the target session (ordered chronologically).
// 3. Map each ChatMessageEntity to LiteRT Message objects:
// - USER role → Message.user(...)
// - MODEL role → Message.model(...)
// 4. Create a new Conversation with ConversationConfig(initialMessages = ...).
// 5. This allows the model to "remember" prior context in the new session.
Step 5: Token Streaming and Reactive Pipeline
This is where the architecture pays off. LiteRT's sendMessageAsync() returns a Flow that emits partial response tokens as the model generates them. You need to bridge this into your reactive pipeline without blocking any layer.
// TODO: Insert token streaming implementation here
// Use callbackFlow to bridge LiteRT's streaming API into your repository's Flow<String>:
//
// 1. Guard against concurrent generation (single-flight pattern).
// 2. Inside callbackFlow, launch a coroutine that:
// a. Resolves the target session ID (create new session if null).
// b. Persists the user message to Room.
// c. Inserts a model message placeholder with state STREAMING.
// d. Calls conversation.sendMessageAsync(userMessage).collect { partialToken ->
// - Accumulate tokens in a StringBuilder.
// - Batch DB writes (~50ms intervals) to reduce write pressure.
// - trySend(token) for immediate upstream delivery.
// }
// e. On completion: final DB write with full response, update state to COMPLETE.
// 3. flowOn(Dispatchers.Default) for inference work.
// 4. awaitClose { cancel job, reset generation flag }.
Why callbackFlow? It gives you a ProducerScope that can emit values from within a coroutine launched internally, which suits wrapping LiteRT's own flow collector while adding persistence and error handling as side effects.
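The shape of that bridge can be sketched with kotlinx.coroutines alone. StreamingBridge is a hypothetical class: tokenSource stands in for the LiteRT streaming call and saveResponse for the final Room write. Session resolution and batched intermediate writes are left out to keep the sketch short.

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow
import kotlinx.coroutines.flow.flowOf
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical bridge: `tokenSource` stands in for conversation.sendMessageAsync(...),
// `saveResponse` for the final database write.
class StreamingBridge(
    private val tokenSource: (String) -> Flow<String>,
    private val saveResponse: suspend (String) -> Unit,
) {
    private val isGenerating = AtomicBoolean(false)

    fun generate(prompt: String): Flow<String> = callbackFlow {
        // Single-flight guard: reject a second generation while one is running.
        if (!isGenerating.compareAndSet(false, true)) {
            close(IllegalStateException("A response is already being generated"))
            return@callbackFlow
        }
        val job = launch {
            val full = StringBuilder()
            try {
                tokenSource(prompt).collect { token ->
                    full.append(token)
                    trySend(token) // non-suspending; a token is dropped if the buffer is full
                }
                saveResponse(full.toString())
                close() // normal completion
            } catch (e: CancellationException) {
                throw e // never swallow cancellation
            } catch (e: Exception) {
                close(e) // surface the failure to the collector
            }
        }
        // Runs on completion, failure, or collector cancellation.
        awaitClose {
            job.cancel()
            isGenerating.set(false)
        }
    }
}

fun main() = runBlocking {
    val bridge = StreamingBridge(
        tokenSource = { flowOf("Hel", "lo") },
        saveResponse = { full -> println("saved: $full") },
    )
    println(bridge.generate("Hi").toList())
}
```

One design note: trySend follows the pipeline described above, but because the collector lambda is already suspending, send(token) would be a reasonable alternative when dropping tokens under backpressure is unacceptable.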
Step 6: ViewModel — Batched UI Updates
The ViewModel consumes the token Flow and must update the Compose UI without thrashing recomposition. Raw token emission can trigger 50+ state updates per second — far more than the UI needs.
// TODO: Insert ViewModel token batching code here
// 1. Maintain a StringBuilder as a streaming buffer.
// 2. On each token received from the use case:
// a. Append to the buffer.
// b. Check if >= 40ms has elapsed since the last UI update.
// c. If yes: push buffer.toString() to StateFlow as streamingText.
// d. If no: skip — the next token will catch it.
// 3. This reduces StateFlow emissions from ~50/sec to ~25/sec,
// matching the UI's practical recomposition rate.
This batching pattern is critical for maintaining smooth scroll performance in the chat list while tokens are streaming.
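A minimal sketch of the time-window batching, written as a plain class so it can be unit-tested without a ViewModel. TokenBatcher and its parameters are hypothetical; publish stands in for the StateFlow update. The flush() call is an addition to the steps above: without it, tokens that arrive inside the last window would never reach the UI.

```kotlin
// Hypothetical helper; `now` is injectable so the timing logic can be tested
// with a fake clock, and `publish` stands in for updating the StateFlow.
class TokenBatcher(
    private val intervalMs: Long = 40,
    private val now: () -> Long = { System.nanoTime() / 1_000_000 },
    private val publish: (String) -> Unit,
) {
    private val buffer = StringBuilder()
    private var lastPublishAt: Long? = null

    fun onToken(token: String) {
        buffer.append(token)
        val t = now()
        val last = lastPublishAt
        // Publish the first token immediately, then at most once per interval.
        if (last == null || t - last >= intervalMs) {
            lastPublishAt = t
            publish(buffer.toString())
        }
    }

    /** Call when the stream completes so trailing tokens are not lost. */
    fun flush() = publish(buffer.toString())
}
```

In a ViewModel this would be driven from the token collector, with onToken inside collect and flush in the completion path.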
Performance & Optimization
Memory Management
On-device LLM inference is memory-intensive. A quantized Gemma model can consume 1–4 GB of RAM during active inference. Key strategies:
| Strategy | Implementation |
|---|---|
| Lazy initialization | Don't load the engine at app startup. Initialize on first chat interaction. |
| Singleton engine | Never create multiple Engine instances. Each one allocates its own model copy in memory. |
| Explicit release | Call engine.close() in onCleared() or when the app receives onTrimMemory(TRIM_MEMORY_RUNNING_CRITICAL). |
| Conversation cleanup | Close and recreate Conversation objects on session switch rather than keeping multiple contexts alive. |
| StringBuilder reuse | Reuse streaming buffers across inference calls. Call clear() instead of allocating new instances. |
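The explicit-release row can be sketched as follows. InferenceResources is a hypothetical owner class, and AutoCloseable stands in for the LiteRT Conversation and Engine types; closing the conversation before the engine is an assumption about safe teardown order, not documented LiteRT behavior.

```kotlin
// Hypothetical owner of the native handles; AutoCloseable stands in for
// the LiteRT Conversation and Engine types.
class InferenceResources(
    private var conversation: AutoCloseable?,
    private var engine: AutoCloseable?,
) {
    var isGenerating: Boolean = false

    /** Releases everything and returns any errors instead of throwing. */
    fun release(): List<Throwable> {
        val errors = mutableListOf<Throwable>()
        // Close the conversation first, then the engine that created it.
        for (resource in listOf(conversation, engine)) {
            try {
                resource?.close()
            } catch (e: Exception) {
                errors += e // keep going so one failure cannot leak the other handle
            }
        }
        conversation = null
        engine = null
        isGenerating = false
        return errors
    }
}
```

Because the references are nulled out, calling release() twice (for example from both onCleared() and a memory-pressure callback) is harmless.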
// TODO: Insert resource release / cleanup code here
// 1. Null out the conversation reference.
// 2. Call engine?.close() to release native memory.
// 3. Set engine to null.
// 4. Reset the isGenerating flag.
// 5. Wrap in try/catch: close() can throw if the engine is in a bad state.
Hardware Acceleration
LiteRT supports multiple compute backends. The optimal choice depends on the operation type and target hardware:
| Backend | Best For | Availability | Notes |
|---|---|---|---|
| Backend.CPU() | Text inference | Universal | Most compatible. Reliable across all devices. |
| Backend.GPU() | Vision/image processing | Devices with OpenCL/Vulkan GPU | Significant speedup for image encoding. Not all GPUs are supported. |
| Backend.NNAPI() | General acceleration | Android 8.1+ (API 27) | Delegates to DSP/NPU when available. Performance varies widely by OEM. |
Practical guidance: Use Backend.CPU() for primary text inference and Backend.GPU() selectively for vision tasks. NNAPI support is fragmented — test on your target device matrix before enabling in production.
// TODO: Insert backend configuration code here
// EngineConfig(
// modelPath = modelFile.absolutePath,
// backend = Backend.CPU(), // Primary: text inference
// visionBackend = Backend.GPU(), // Secondary: image encoding
// cacheDir = context.cacheDir.absolutePath
// )
Model Quantization Impact
Quantization is the single most impactful optimization for on-device inference. It reduces model size, memory footprint, and per-token latency at the cost of marginal quality degradation.
| Quantization | Model Size (approx.) | RAM Usage | Quality Impact | Use Case |
|---|---|---|---|---|
| FP32 (full) | ~10 GB | ~12 GB | Baseline | Research only — not viable on mobile |
| FP16 (half) | ~5 GB | ~6 GB | Negligible | High-end tablets with 12+ GB RAM |
| INT8 | ~2.5 GB | ~3 GB | Minor | Flagship phones (8+ GB RAM) |
| INT4 | ~1.3 GB | ~1.8 GB | Moderate | Mid-range devices (6+ GB RAM). Best mobile tradeoff. |
The .litertlm format from Google AI Edge ships pre-quantized models. The Gemma 4 E2B-IT variant used in this project is a 2B-parameter model optimized for instruction-following tasks, packaged with 4-bit quantization — making it viable on devices with 6 GB+ RAM.
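The size column in the table follows from simple arithmetic: weight storage is roughly parameters × bits per weight / 8. The sketch below reproduces it, assuming about 2.5 billion stored parameters; that figure is back-derived from the table's ~10 GB FP32 row and is not an official number. The RAM column is higher because runtime buffers such as the KV cache and activations come on top of the weights.

```kotlin
/** Approximate weight storage in GB (10^9 bytes): parameters × bits per weight / 8. */
fun estimateWeightsGb(parameterCount: Double, bitsPerWeight: Int): Double =
    parameterCount * bitsPerWeight / 8.0 / 1e9

fun main() {
    val params = 2.5e9 // assumption: back-derived from the table's ~10 GB FP32 row
    val formats = listOf("FP32" to 32, "FP16" to 16, "INT8" to 8, "INT4" to 4)
    for ((label, bits) in formats) {
        println("$label: ${estimateWeightsGb(params, bits)} GB")
    }
    // Prints:
    // FP32: 10.0 GB
    // FP16: 5.0 GB
    // INT8: 2.5 GB
    // INT4: 1.25 GB
}
```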
Streaming Performance Tuning
Two batching windows control the tradeoff between responsiveness and system load:
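- Repository → DB writes: ~50ms batching window. Caps SQLite writes at roughly 20 per second of generation instead of one per token. (With continuously arriving tokens this behaves as a throttle rather than a true debounce.)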
- ViewModel → UI updates: ~40ms batching. Caps StateFlow emissions to ~25/sec, matching Compose's practical recomposition throughput.
These values are tunable. On lower-end devices, increasing both to ~100ms reduces CPU contention. On flagships, you can tighten them to ~20ms for a snappier streaming feel.
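One way to keep these numbers in a single place is a small configuration type. Everything below is a hypothetical sketch: the tier names are invented, the values are the ones quoted above, and how a device gets classified into a tier is left out.

```kotlin
// Hypothetical configuration types; the values mirror the guidance above.
enum class DeviceTier { LOW_END, DEFAULT, FLAGSHIP }

data class StreamingWindows(val dbWriteMs: Long, val uiUpdateMs: Long) {
    /** Upper bound on UI updates per second implied by the window. */
    val maxUiUpdatesPerSecond: Long get() = 1000 / uiUpdateMs
}

fun windowsFor(tier: DeviceTier): StreamingWindows = when (tier) {
    DeviceTier.LOW_END -> StreamingWindows(dbWriteMs = 100, uiUpdateMs = 100)
    DeviceTier.DEFAULT -> StreamingWindows(dbWriteMs = 50, uiUpdateMs = 40)
    DeviceTier.FLAGSHIP -> StreamingWindows(dbWriteMs = 20, uiUpdateMs = 20)
}
```

With the default tier, a 40ms window gives 1000 / 40 = 25 updates per second, which is where the ~25/sec figure comes from.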
Conclusion: The On-Device Orchestration Frontier
Running a local LLM on Android is no longer a proof-of-concept — it's a production-viable architecture pattern. The tooling has matured: LiteRT provides a stable inference runtime, Gemma offers capable open-weight models at mobile-friendly sizes, and Kotlin's Flow primitives make reactive token streaming natural.
But local inference is just the beginning. The real opportunity is on-device orchestration — where the local model doesn't just answer questions but coordinates actions: querying local databases, invoking on-device APIs, managing multi-step workflows, all without a network round-trip. Think of it as an AI agent that runs entirely within the user's trust boundary.
The architectural patterns in this post — strict layer separation, reactive streaming, lifecycle-aware resource management — are designed to scale toward that future. The inference engine is isolated behind an interface. Swap Gemma for a larger model when hardware allows. Add tool-calling capabilities. Chain multiple inference steps. The architecture doesn't change.
On-device AI is not a compromise. It's a competitive advantage.
Built with Kotlin, Jetpack Compose, LiteRT LLM 0.10.2, and Google Gemma. The full source code is available on GitHub.