Packaging On-Device Gemma for Flutter Chat: A Dual-Track Runtime Analysis for 2026-06-12
A technical evaluation of Realbits' on-device Gemma strategy across web and Flutter runtimes, comparing ONNX/Transformers.js WebGPU execution with native LiteRT-LM packaging and proposing an architecture that treats model formats as deployable contracts.
Abstract
This article analyzes the practical architecture Realbits needs for local chat in 2026 with Gemma models, focusing on two production tracks already present in the codebase: browser-side Transformers.js running ONNX models and native LiteRT-LM in Flutter apps. The goal is not only to compare benchmark charts but to determine which packaging and integration choices yield stable distribution, faster iteration, and acceptable quality on real devices.
Gemma is attractive for this role because its model family is explicitly positioned as compact and strong for downstream deployment, and recent work has focused on memory-aware and multimodal variants for edge use [S1][S14]. Realbits has already converged on a dual path where web users get dynamic browser execution and app users get native runtime packs, which reduces cloud dependency but increases packaging complexity. Treating each model binary as a platform-specific artifact rather than a single monolith is not just an engineering detail; it is the core contract between the provider layer and the distribution apps.
Realbits Context
Realbits currently has two active product surfaces for chat AI: a web path in packages/web and Flutter app paths in apps/realbits-chat-hub plus vertical apps. The web side exposes a Gemma runtime hook and model metadata logic that is consistent with an on-device-first contract: the app runtime state tracks model load, runtime warmup, and execution status rather than remote inference state. This is visible in the local code path where web inference is prepared through a client-side hook and status model.
The Flutter path is explicitly local-first. The chat app README describes a local-only model flow with speech and Gemma stacks and explicitly notes no cloud inference path for normal chat after setup. The same app surface also tracks platform variance: Android and macOS are in active integration while iOS Gemma readiness is still staged. This creates a realistic production asymmetry, where parity cannot be assumed across all mobile targets.
The repository also includes native LiteRT-LM helper tooling for macOS and a staged C API build flow, indicating that conversion, linking, and handoff into Flutter native layers are already expected operationally, not a future experiment. This means architecture review should optimize package metadata, preflight checks, and fallback behavior across platforms, not merely model quality.
So the practical question is no longer whether Realbits should choose web or native inference. It is how to turn one model strategy into two deterministic and observable bundles.
Related Work
A useful comparison starts with Google’s Gemma family paper, which establishes the design intent: smaller, strong open models released alongside evaluations of text-task performance and safety [S1]. In engineering terms, this supports using the Gemma 2/3/4 family for latency-sensitive UX where inference cost and deployability are first-class constraints.
RecurrentGemma introduces an explicit memory-efficiency lens: fixed-size state transitions and hybrid local-attention structures to improve efficiency for longer contexts. The claim matters to Realbits because chat products must hold persona state and prompt scaffolding across multi-turn sessions, and memory growth directly affects both latency and crash risk [S2]. If a runtime target can reuse context efficiently and avoid runaway memory expansion, the app remains usable under constrained mobile budgets.
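As a rough illustration of why context growth matters, the sketch below estimates key/value cache size for a standard transformer decoder, assuming full attention in every layer. The shape values are placeholders chosen for the arithmetic, not figures from a Gemma model card, and the function name is hypothetical. Architectures with local attention or a fixed-size recurrent state bound this growth, which is the property highlighted above.

```typescript
// Hypothetical decoder shape; these numbers are illustrative assumptions,
// not values taken from a published model card.
interface DecoderShape {
  layers: number;        // transformer blocks that keep a KV cache
  kvHeads: number;       // key/value heads per layer
  headDim: number;       // dimension of each head
  bytesPerValue: number; // 2 for fp16/bf16, 1 for int8
}

// KV cache bytes = 2 (keys and values) * layers * kvHeads * headDim * tokens * bytes.
function kvCacheBytes(shape: DecoderShape, tokens: number): number {
  return 2 * shape.layers * shape.kvHeads * shape.headDim * tokens * shape.bytesPerValue;
}

const exampleShape: DecoderShape = { layers: 26, kvHeads: 4, headDim: 256, bytesPerValue: 2 };

// Cache size grows linearly with the number of tokens held in context.
const growthMiB = [1024, 4096, 8192].map(
  (tokens) => kvCacheBytes(exampleShape, tokens) / (1024 * 1024),
);
// growthMiB is [104, 416, 832]
```

Under these assumed numbers, an eightfold increase in held context produces an eightfold increase in cache memory, which is the kind of growth a constrained mobile budget has to plan for.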
LLMCad contributes a second idea: a compact first-pass generator can be paired with a validation stage to recover quality while reducing compute exposure [S3]. That is not a drop-in replacement for chat completion, but it is an architectural signal: Realbits can preserve speed by splitting responsibilities between lightweight first-pass and heavier guard/repair logic.
For Gemma-specific on-device evolution, the Gemma 3n documentation highlights parameter partitioning, conditional loading, and per-layer embedding caching as explicit techniques to lower active footprint while preserving capability [S14]. This validates a package strategy based on capability toggles and model mode variants, rather than fixed, one-size-fits-all binaries.
The Flutter side work in this repo aligns with these papers at a high level. Flutter’s platform model gives two integration routes: message-based platform channels and FFI for direct native calls. Platform channels are easier to reason about initially; FFI is often faster for binary-heavy paths [S10][S11]. Gemma generation is binary-heavy and often cache-sensitive, so the implementation pattern matters as much as model size.
Architecture Analysis
Track A: Browser runtime with ONNX + Transformers.js
The browser track can be viewed as a pull model: load Gemma as an ONNX graph, run it with Transformers.js, cache outputs and weights locally, and execute in-process with optional WebGPU acceleration. Official docs show that WebGPU support is selectable via device: 'webgpu', while browser defaults often use CPU/WASM when GPU support is absent or unstable [S6]. ONNX Runtime confirms that this environment can choose among execution providers, while warning that not all operators are supported equally across GPU EPs [S8][S9].
For Realbits, this means the web lane is best framed as an adaptive control plane:
- model artifact selection by runtime capabilities,
- explicit fallback ordering (WebGPU -> WebNN/WebGL -> WASM),
- aggressive local caching for weights and runtime files [S7],
- warmup telemetry and model load status persisted for progressive UX.
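The fallback ordering in the list above can be sketched as a pure selection function. This is a minimal sketch under stated assumptions: the capability flags and the `blocked` list are hypothetical inputs the caller would fill from its own probes and warmup results, and apart from 'webgpu' the backend strings simply mirror the list above and may not match a library's exact device identifiers.

```typescript
// Backend names follow the fallback order described in the text; only
// "webgpu" is confirmed there as a device value, the rest are assumptions.
type Backend = "webgpu" | "webnn" | "webgl" | "wasm";

// Hypothetical capability flags gathered by the caller before model load.
interface BrowserCapabilities {
  webgpu: boolean;
  webnn: boolean;
  webgl: boolean;
}

const FALLBACK_ORDER: readonly Backend[] = ["webgpu", "webnn", "webgl", "wasm"];

// Walk the order and return the first backend that is supported and has not
// been blocked (for example after a failed warmup). WASM is the last resort.
function selectBackend(
  caps: BrowserCapabilities,
  blocked: readonly Backend[] = [],
): Backend {
  for (const backend of FALLBACK_ORDER) {
    if (blocked.indexOf(backend) !== -1) continue;
    if (backend === "wasm" || caps[backend]) return backend;
  }
  return "wasm";
}

// Example: no WebGPU or WebNN, but WebGL is present.
const chosenBackend = selectBackend({ webgpu: false, webnn: false, webgl: true });
// chosenBackend === "webgl"
```

Keeping the decision in one function means the chosen backend can be logged with the warmup telemetry, so a session that degraded to WASM is distinguishable from one that started there.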
The benefit of this track is update speed. A model pack can be refreshed without app store submission, and deployment can be tied to manifest metadata. That helps Realbits run experiments and feature flags from web assets quickly.
The biggest risk is environment variance. Chrome 113 enabled WebGPU by default on ChromeOS, macOS, and Windows, with Linux and Android support following later, but support and stability are still uneven across browsers and hardware profiles [S12]. ONNX Runtime also documents execution-provider differences and operator support limits. These are not edge cases: they are routine production conditions for consumer phones and mixed user agents.
Track B: Native Flutter runtime with LiteRT-LM
The native lane is a push model: convert models to .litertlm, deploy via LiteRT-LM runtime, and execute close to native hardware. Google’s LiteRT docs frame LiteRT as an on-device framework for ML and GenAI with conversion, runtime, and optimization stages [S4]. The conversion guide describes conversion from Hugging Face through Torch export tooling, with quantization and graph optimization folded into a single workflow producing LiteRT-LM output for CPU/GPU and optional NPU paths [S5][S13].
Realbits already has a macOS staging pipeline under tooling/litertlm and a native helper flow for packaging model files. That is a strong signal that the real bottleneck is not conversion feasibility but orchestration:
- when to rebuild .litertlm,
- how to map model versions across OS builds,
- where to store local artifacts without exhausting user storage,
- how to validate compatibility with app upgrades.
From a Flutter integration standpoint, this track should avoid serializing too much through platform channels for every token. Platform channels are straightforward, but high-volume inference pipelines benefit from direct native calls where possible, and Flutter docs explicitly call out FFI speed benefits for C-based interfaces [S11]. This suggests a hybrid in Realbits where initialization and lifecycle go through Dart-native coordination, but generation hot paths use lightweight native bindings.
Contract-oriented packaging model
Given the two tracks, the practical architecture for Realbits is a manifest-first model pack contract:
- one canonical logical model ID and version,
- one or more on-device formats (onnx, litertlm),
- per-format checksums and minimum runtime compatibility,
- memory-tier metadata (e.g., lightweight vs full mode),
- security and licensing constraints,
- platform allow-lists.
The web runtime reads this contract and activates ONNX+WebGPU when available, otherwise WebAssembly. The Flutter runtime reads the same contract and requests a .litertlm pack with fallback to smaller quantization tiers. This avoids “two separate product trees” and keeps Realbits features testable by ID rather than by binary assumptions.
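To make the contract concrete, here is a minimal sketch of a manifest shape and a resolver that both runtimes could share. Every field name, the placeholder model ID, and the memory figures are assumptions for illustration; the article lists the kinds of metadata a pack should carry but does not fix a schema. The resolver checks only format, platform allow-list, and memory; the minimum runtime version is carried in the manifest but not compared here.

```typescript
// All field names are hypothetical; they mirror the contract items listed above.
type PackFormat = "onnx" | "litertlm";
type MemoryTier = "lightweight" | "full";

interface PackArtifact {
  format: PackFormat;
  tier: MemoryTier;
  sha256: string;            // per-format checksum
  minRuntimeVersion: string; // minimum compatible runtime (not checked below)
  requiredMemoryMb: number;  // memory needed to activate this tier
  platforms: string[];       // platform allow-list
}

interface ModelPackManifest {
  modelId: string; // canonical logical model ID
  version: string;
  artifacts: PackArtifact[];
}

interface RuntimeRequest {
  format: PackFormat;
  platform: string;
  availableMemoryMb: number;
}

// Pick the largest tier that fits the device, falling back to smaller tiers.
// Returns undefined when nothing in the manifest is allowed on this target.
function resolveArtifact(
  manifest: ModelPackManifest,
  req: RuntimeRequest,
): PackArtifact | undefined {
  const candidates = manifest.artifacts
    .filter((a) => a.format === req.format)
    .filter((a) => a.platforms.indexOf(req.platform) !== -1)
    .filter((a) => a.requiredMemoryMb <= req.availableMemoryMb)
    .sort((a, b) => b.requiredMemoryMb - a.requiredMemoryMb);
  return candidates[0];
}

// Placeholder manifest: the ID, checksums, and sizes are made up.
const exampleManifest: ModelPackManifest = {
  modelId: "example-gemma-chat",
  version: "1.0.0",
  artifacts: [
    { format: "onnx", tier: "full", sha256: "<sha256>", minRuntimeVersion: "1.0.0", requiredMemoryMb: 3000, platforms: ["web"] },
    { format: "litertlm", tier: "full", sha256: "<sha256>", minRuntimeVersion: "1.0.0", requiredMemoryMb: 4000, platforms: ["android", "macos"] },
    { format: "litertlm", tier: "lightweight", sha256: "<sha256>", minRuntimeVersion: "1.0.0", requiredMemoryMb: 2000, platforms: ["android", "macos"] },
  ],
};
```

With this shape, an Android device reporting 3000 MB available resolves to the lightweight litertlm artifact, while a platform missing from every allow-list resolves to nothing, which the app can surface as an explicit "not supported here" state rather than a load failure.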
Limitations
The strongest current constraint is hardware asymmetry. The iOS native Gemma path is incomplete in the repository context while Android/macOS paths are more advanced. That means feature parity planning must be explicit: some features may be web-only, some Flutter-only, and some only on selected OS builds.
Second, browser inference is brittle across environments. WebGPU can be absent or partially supported, and ONNX EP support is not uniform, so fallback paths must be robust and measurable [S8][S9][S12]. If fallback lands in WASM too often, users experience higher first-token latency despite cached assets, which may break “chat responsiveness” as a product claim.
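One way to make the fallback measurable is to aggregate warmup events by backend and compare the WASM share against a budget. The sketch below assumes a hypothetical event shape and threshold; neither the field names nor the budget value come from the repository.

```typescript
// Hypothetical telemetry record emitted once per session after warmup.
type LaneBackend = "webgpu" | "wasm";

interface WarmupEvent {
  backend: LaneBackend;
  firstTokenMs: number;
}

interface BackendStats {
  sessions: number;
  share: number;                   // fraction of all sessions on this backend
  meanFirstTokenMs: number | null; // null when no session used this backend
}

// Summarize how often a backend was used and how long first tokens took on it.
function statsFor(events: WarmupEvent[], backend: LaneBackend): BackendStats {
  const matching = events.filter((e) => e.backend === backend);
  const total = matching.reduce((sum, e) => sum + e.firstTokenMs, 0);
  return {
    sessions: matching.length,
    share: events.length === 0 ? 0 : matching.length / events.length,
    meanFirstTokenMs: matching.length === 0 ? null : total / matching.length,
  };
}

// Flag the lane when too many sessions land on the slow path.
function wasmFallbackExceeds(events: WarmupEvent[], maxShare: number): boolean {
  return statsFor(events, "wasm").share > maxShare;
}
```

Reporting the share and the first-token latency side by side separates two different problems: a high WASM share points at capability detection or driver coverage, while a slow WebGPU path points at warmup or caching.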
Third, runtime storage and app size are real constraints. Mobile binaries grow with native runtimes and embedded runtime assets. Flutter’s own app-size guidance highlights that larger installables affect adoption and update friction [S15]. This is not only a build concern; it is a channel strategy concern because each model variant increase may push users into larger downloads and delayed updates.
Fourth, Gemma documentation repeatedly couples practical deployment with responsible model usage constraints and licensing context. Even with open weights, production pipelines still require policy-aware handling and explicit controls over what is exposed to users [S1][S14]. Packaging should therefore carry not only weights and vocab but also policy metadata and feature gates.
Implications for This Repository
Realbits should lock in a repository-level “Model Pack Contract” layer first, then map each runtime implementation to it.
- Make model manifests versioned and track both web and native fingerprints for every pack. Include explicit fields for format, minimum runtime, activation memory, and fallback mode. This is the only way to keep on-device experiments reproducible across Flutter and web.
- Add a capability matrix in CI and release scripts that rejects an app release if manifest and runtime compatibility diverge. For example, if litertlm generation changed, the iOS/macOS plugin ABI and Flutter integration metadata should be validated together.
- Split first-run telemetry across lanes: web warmup latency, runtime load failures, cache miss rates, and native pack validation failures should be reported with stable tags so the same issue can be diagnosed independently of UI code.
- Use track-specific adaptation, not track-specific products. Realbits should keep one logical model ID and expose runtime mode switches (WebGPU/WASM or LiteRT-LM-small/LiteRT-LM-int4) based on preflight checks. That allows incremental rollout of smaller models while preserving behavioral consistency.
- Treat model conversion as a release artifact process, not ad-hoc local tooling. Since LiteRT conversion includes optimizations and graph transformation [S13], conversion failures should become build-gate failures, with deterministic .litertlm outputs stored and auditable.
- Keep a clear migration policy between packs. For example, if a user updates to a model with larger context window requirements, enforce compatibility checks before activation and expose one clear fallback when memory limits would be exceeded. This avoids silent performance cliffs.
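The capability-matrix gate described in the list above could look roughly like the following. This is a sketch under assumptions: the `PackEntry` and `RuntimeBuild` shapes are hypothetical, versions are assumed to be dotted numeric strings, and the check covers only runtime presence and minimum version, not ABI validation.

```typescript
// Hypothetical shapes: neither the manifest fields nor the runtime matrix
// format come from the repository.
interface PackEntry {
  modelId: string;
  format: "onnx" | "litertlm";
  platform: string;
  minRuntimeVersion: string; // dotted numeric version, e.g. "1.4.0"
}

interface RuntimeBuild {
  platform: string;
  format: "onnx" | "litertlm";
  runtimeVersion: string;
}

// Compare dotted numeric versions; negative if a < b, zero if equal, positive if a > b.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  const len = Math.max(pa.length, pb.length);
  for (let i = 0; i < len; i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Returns one message per incompatibility; an empty array means the release may proceed.
function findIncompatibilities(packs: PackEntry[], builds: RuntimeBuild[]): string[] {
  const problems: string[] = [];
  for (const pack of packs) {
    const build: RuntimeBuild | undefined = builds.filter(
      (b) => b.platform === pack.platform && b.format === pack.format,
    )[0];
    if (!build) {
      problems.push(`${pack.modelId}: no ${pack.format} runtime shipped for ${pack.platform}`);
    } else if (compareVersions(build.runtimeVersion, pack.minRuntimeVersion) < 0) {
      problems.push(
        `${pack.modelId}: ${pack.platform} runtime ${build.runtimeVersion} is older than required ${pack.minRuntimeVersion}`,
      );
    }
  }
  return problems;
}
```

A release script would call `findIncompatibilities` with the manifest entries and the runtime versions baked into each build, then fail the job when the returned list is non-empty. Emitting messages rather than a boolean keeps the failure diagnosable from the CI log alone.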
The net outcome is architecture stability: users get faster startup with fewer server dependencies, while Realbits keeps control of quality and release velocity through explicit contract checks. That is the strongest engineering response to the on-device challenge in this repo.
References
- S1: https://arxiv.org/abs/2403.08295
- S2: https://arxiv.org/abs/2404.07839
- S3: https://arxiv.org/abs/2309.04255
- S4: https://developers.google.com/edge/litert/overview
- S5: https://developers.google.com/edge/litert/conversion/overview
- S6: https://huggingface.co/docs/transformers.js/en/guides/webgpu
- S7: https://huggingface.co/docs/transformers.js/main/api/env
- S8: https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html
- S9: https://onnxruntime.ai/docs/get-started/with-javascript/web.html
- S10: https://docs.flutter.dev/platform-integration/platform-channels?sid=y2sbpB
- S11: https://docs.flutter.dev/resources/architectural-overview
- S12: https://developer.chrome.com/blog/new-in-webgpu-113?hl=en
- S13: https://developers.google.com/edge/litert/conversion/pytorch/genai
- S14: https://ai.google.dev/gemma/docs/gemma-3n
- S15: https://docs.flutter.dev/perf/app-size
Source Ledger
- [S1] Gemma: Open Models Based on Gemini Research and Technology (arxiv): https://arxiv.org/abs/2403.08295 - Grounds model quality and size choices for Gemma-family deployments.
- [S2] RecurrentGemma: Moving Past Transformers for Efficient Open Language Models (arxiv): https://arxiv.org/abs/2404.07839 - Explains memory-aware stateful decoding patterns relevant to on-device long-sequence efficiency.
- [S3] LLMCad: Fast and Scalable On-device Large Language Model Inference (arxiv): https://arxiv.org/abs/2309.04255 - Provides a reference architecture for compact-on-device generators with collaborative validation.
- [S4] LiteRT overview (official-doc): https://developers.google.com/edge/litert/overview - Defines LiteRT as Google’s on-device ML and GenAI deployment stack.
- [S5] Model conversion overview (LiteRT) (official-doc): https://developers.google.com/edge/litert/conversion/overview - Defines the conversion step and required runtime format for LiteRT deployment.
- [S6] Running models on WebGPU (Transformers.js) (official-doc): https://huggingface.co/docs/transformers.js/en/guides/webgpu - Specifies WebGPU execution toggles and browser-side inference expectations.
- [S7] Transformers.js env configuration (official-doc): https://huggingface.co/docs/transformers.js/main/api/env - Documents cache and runtime options used for local model file reuse.
- [S8] ONNX Runtime Web execution providers (official-doc): https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html - Shows how WebGPU EP is enabled and when to prefer WASM CPU execution.
- [S9] ONNX Runtime for JavaScript web runtime (official-doc): https://onnxruntime.ai/docs/get-started/with-javascript/web.html - Defines import behavior and runtime options for browser inference.
- [S10] Flutter platform channels (official-doc): https://docs.flutter.dev/platform-integration/platform-channels?sid=y2sbpB - Defines message-based communication between Dart and native host code.
- [S11] Flutter architectural overview (official-doc): https://docs.flutter.dev/resources/architectural-overview - Explains FFI as a lower-overhead path for native C-based APIs.
- [S12] Chrome 113 WebGPU support (official-doc): https://developer.chrome.com/blog/new-in-webgpu-113?hl=en - Anchors platform baseline for browser-side GPU availability.