May 19, 2026 · 12 sources · 4 arXiv

Packaging On-Device Gemma for Flutter Chat Apps: A LiteRT-LM and Web Inference Architecture Analysis

A technical comparison of Realbits' two Gemma runtime tracks for Flutter chat: local ONNX/Transformers.js in web and native LiteRT-LM packaging on mobile and desktop, with deployment, compatibility, and maintainer burden tradeoffs.

on-device-ai · gemma · flutter · litert

Abstract

Realbits now has two distinct Gemma execution pathways in active development, and the architecture discussion is no longer about whether Gemma can run on device, but about how to package it so the same provider product stack can ship across web, iOS, Android, and macOS without multiplying operational risk. This article evaluates the tradeoff between the current web runtime lane based on ONNX + Transformers.js and the native lane moving toward LiteRT-LM on Flutter surfaces. The goal is a practical packaging strategy that preserves one provider identity layer, one character/policy plane, and stable user-facing behavior when the runtime format changes per platform. The central conclusion is that Realbits should treat packaging as a first-class distribution API: a normalized model manifest that drives both download and runtime selection, rather than separate app-specific assumptions for each client platform.

Realbits Context

Realbits already contains the key constraint pattern this analysis must respect. The chat app family is architecturally provider-first and app-aware: native apps and web surfaces share character logic but differ sharply in compute runtime. The web surface currently uses browser execution through Transformers.js against ONNX model assets, while the mobile and macOS Flutter apps are moving toward a local-first stack built around native model execution, with staged LiteRT-LM tooling and macOS-specific native artifacts. The practical consequence is that engineering must support two artifact families that are not equivalent in format, runtime constraints, or trust boundaries.

For engineers maintaining this repository, this means the critical path is no longer "can we run Gemma" but "can we keep inference semantics stable across model formats and runtime engines without turning every release into manual integration work". A package-centric approach is the only sustainable path: model assets should be versioned independently from app binaries, and each platform wrapper should depend on an agreed manifest contract instead of direct assumptions about filenames, quantization, or hardware support.

Related Work

Gemma itself is positioned for this provider model because Google defines it as a lightweight, commercially usable open model family, and the first paper reports strong relative benchmark results along with explicit safety and reliability framing, which matches a production pipeline that cares about more than accuracy alone [S1]. The newer Gemma family work expands the practical-size design space further, with architecture changes and scaling variants aimed at strong capability within tighter parameter budgets [S2]. For mobile app evaluation, these characteristics are not just nice to have; they directly affect whether a character persona and response policy remain predictable under compression and on-device limits.

The hard part on-device is not just raw matmul cost. KV cache behavior dominates practical latency and memory once context grows, and adaptive KV-cache quantization research in the on-device LLM space shows this pressure explicitly [S3]. This aligns with field behavior: a chat model that appears fast in short prompts can degrade during longer user conversations if cache growth, memory locality, and bandwidth are not managed. The result is that model selection, quantization level, and context policy need to be planned together, not independently.
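As a rough illustration of why cache growth dominates, the sketch below applies the usual per-token accounting for a KV cache (one key tensor and one value tensor per layer). The function name and the model dimensions are hypothetical placeholders chosen for the arithmetic, not Gemma's published configuration or figures from [S3].

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: float) -> float:
    """Estimate KV-cache size: 2 tensors (K and V) per layer, per cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical small-model dimensions, used only to show the scaling.
LAYERS, KV_HEADS, HEAD_DIM = 18, 1, 256

for tokens in (512, 4096, 32768):
    fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens, 2.0)  # 16-bit cache
    int4 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens, 0.5)  # 4-bit cache
    print(f"{tokens:>6} tokens: fp16 {fp16 / 2**20:7.1f} MiB, int4 {int4 / 2**20:7.1f} MiB")
```

The cache grows linearly with context length, so the context policy and the cache precision multiply directly into the memory envelope. That is why they belong in the same planning step as model selection.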

Historically, mobile GPU inference work already suggested that the mobile chipset is a real performance target and that models must be organized for GPU-friendly execution patterns [S4]. That premise holds even more strongly in 2026, because both web and mobile runtime stacks rely on specialized accelerators. Flutter chat surfaces should therefore not be viewed as pure CPU compute targets; they are hybrid compute surfaces where runtime portability and fallback quality are as important as raw tokens per second.

Architecture Analysis

This section separates three layers: model representation, runtime binding, and distribution contract.

First, representation. Browser and native stacks do not share the same deployment unit. Browser execution uses ONNX artifacts and pipeline semantics designed around Transformers.js, which is documented as an ONNX-based web runtime whose pipeline API mirrors Python Transformers workflows and which explicitly supports quantized execution [S12]. In practice this makes onboarding rapid and model evaluation easy, but it also means you inherit browser constraints and no longer own kernel scheduling directly. Transformers.js explicitly supports WebGPU when it is selected in the pipeline options, which provides large wins where available [S12].

Second, representation for native. LiteRT-LM introduces an LLM-focused layer on top of LiteRT for session state, cache behavior, and generative API ergonomics [S6]. LiteRT itself describes the high-level model path as a conversion-oriented mobile edge framework with both classic .tflite and generative model paths and a cross-platform execution focus [S5]. LiteRT-LM deployment then uses .litertlm artifacts and model catalogs often scoped by target SoC and quantization profile, especially in NPU pathways [S8]. For engineering, this is both a strength and a complexity driver: it yields better device utilization, but also turns model packaging into a matrix problem across hardware families.

Third, runtime binding for Flutter. Flutter supports native integration either through platform channels and method invocations, or through native binary bindings in C/C++ through FFI pathways and plugin packaging [S15, S16]. For macOS especially, the binding model requires explicit native compilation and dynamic library loading, which is why Realbits has local staging scripts for native SDK build outputs before app-side integration. That setup is consistent with Flutter guidance on how to package and load C/C++ libraries in plugins [S15] and with the standard mechanism for cross-language platform messages [S16]. This is a good match for LiteRT-LM C++ API expectations, because the runtime control surface can live in native code while Flutter remains UI and policy orchestrator.

A useful way to reason about this for Realbits is a two-axis matrix:

Surface axis: web vs native.

Web uses ONNX/Transformers.js and browser WebGPU fallback semantics.
Native uses LiteRT-LM with .litertlm, accelerator APIs, and compile-time toolchains.

Control axis: interactive latency vs deterministic packaging.

Web is easier to ship and easier to update through CDN-like model references, but it is bounded by browser support and secure-context restrictions.
Native has deeper hardware leverage through GPU/NPU paths, but release complexity increases (build toolchain, per-device model compatibility, multi-format bundles).

LiteRT docs reinforce this distinction by explicitly surfacing runtime acceleration options and CompiledModel-style workflows on Android [S7]. For a provider-owned chat platform, the best architecture is not to force a single path on all platforms, but to publish a model capability map at runtime. In that model map, each pack declares:

model family and revision
minimum OS/runtime version
memory envelope and context policy
fallback chain (for example, GPU -> CPU)
quantization family and cache-sensitive constraints

When the runtime starts, each app selects the highest-utility combination from the map.
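A minimal sketch of that selection step, assuming a capability map shaped like the list above. The class names, field names, pack IDs, and the use of memory envelope as a stand-in for "utility" are all illustrative assumptions, not an existing Realbits schema.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ModelPack:
    model_id: str                      # model family and revision
    min_runtime: Tuple[int, ...]       # minimum OS/runtime version
    memory_mb: int                     # memory envelope the pack needs
    max_context_tokens: int            # context policy
    fallback_chain: Tuple[str, ...]    # accelerators, most preferred first
    quant: str                         # quantization family


@dataclass(frozen=True)
class DeviceProbe:
    runtime_version: Tuple[int, ...]
    free_memory_mb: int
    accelerators: frozenset


def select_pack(packs: Sequence[ModelPack],
                probe: DeviceProbe) -> Optional[Tuple[ModelPack, str]]:
    """Pick the largest pack that fits, with the first usable accelerator in its chain."""
    fits = [
        p for p in packs
        if p.min_runtime <= probe.runtime_version and p.memory_mb <= probe.free_memory_mb
    ]
    # Assumption: a larger memory envelope stands in for "higher utility".
    for pack in sorted(fits, key=lambda p: p.memory_mb, reverse=True):
        for accelerator in pack.fallback_chain:
            if accelerator in probe.accelerators:
                return pack, accelerator
    return None  # nothing viable: the caller should surface a clear fallback reason


# Hypothetical catalog entries.
PACKS = [
    ModelPack("gemma-demo-large-r1", (14, 0), 3000, 8192, ("gpu",), "int8"),
    ModelPack("gemma-demo-small-r1", (12, 0), 900, 4096, ("gpu", "cpu"), "int4"),
]

cpu_only = DeviceProbe((15, 0), 4000, frozenset({"cpu"}))
choice = select_pack(PACKS, cpu_only)
print(choice[0].model_id, choice[1])
```

Here the larger pack fits in memory but declares only a GPU path, so a CPU-only device falls through to the smaller pack instead of failing at load time.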

The current risk in this design is hidden coupling: if the map includes only runtime flags but not behavioral metadata (temperature defaults, decoding caps, persona token budgets), then model-level behavior can diverge across surfaces. The mitigation is a provider-level runtime contract that ships with each model pack. Character cards already encode persona and greeting behavior at the app layer. The missing layer is a runtime behavior contract that pins down generation shape and cache policy so that app behavior remains comparable on web and native.
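One way to make that contract checkable is to treat it as plain data and compare it across surfaces. The sketch below is an assumption about what such a contract could contain; the field names and values are hypothetical and simply mirror the behavioral metadata named above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BehaviorContract:
    temperature: float           # default sampling temperature
    max_new_tokens: int          # decoding cap
    persona_token_budget: int    # tokens reserved for the character card
    max_context_tokens: int      # cache/context policy


def contract_drift(a: BehaviorContract, b: BehaviorContract) -> dict:
    """Return {field: (a_value, b_value)} for every field where the contracts differ."""
    left, right = asdict(a), asdict(b)
    return {name: (left[name], right[name]) for name in left if left[name] != right[name]}


# Hypothetical contracts shipped with a web pack and a native pack.
web = BehaviorContract(temperature=0.7, max_new_tokens=256,
                       persona_token_budget=512, max_context_tokens=4096)
native = BehaviorContract(temperature=0.7, max_new_tokens=256,
                          persona_token_budget=512, max_context_tokens=8192)

print(contract_drift(web, native))
```

An empty result means the two packs agree on generation shape. A non-empty result is a reviewable record of where the surfaces are allowed, or not allowed, to differ.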

Distribution and packaging implications

For Realbits, packaging should be explicit by capability tier, not only by file.

Base tier (universal): .onnx assets for browser path with optional lower precision variants.
Native tier A (mobile GPU/NPU ready): .litertlm packs with vendor-specific model branches.
Native tier B (fallback universal): .litertlm CPU-capable packs for environments lacking accelerator paths.
Meta manifests: platform matrix, quantization, hash, minimum memory, expected prefill/decode rough ceilings.

LiteRT-LM distribution guidance shows that native stacks often depend on per-device or per-SoC packaging, especially when NPU acceleration is used [S8]. Treating these as separate model IDs in metadata prevents runtime failure cascades when a user installs a model built for an unsupported accelerator. The same strategy can be applied to web quantization: browsers with WebGPU can request larger variants while CPU-only sessions keep conservative sizes, matching current Transforme... [S12, S13].

Why this matters for chat UX

Token-level latency is not linear in user-perceived quality. Users care about startup to first token, then continuation stability under repeated turns. LiteRT-LM sessions and cache-aware APIs directly target those stateful constraints [S6], while browser ONNX execution relies more heavily on model-level quantization and GPU availability to preserve continuity [S12, S13]. So a realistic objective is not average throughput; it is minimizing variance across turns and recoverability when acceleration paths fail.
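To make the distinction between average throughput and turn-to-turn variance concrete, the sketch below summarizes time to first token across the turns of a conversation. The helper name and the sample timings are invented for illustration; they are not measurements from either runtime.

```python
from statistics import mean, pstdev


def turn_stats(first_token_ms: list) -> dict:
    """Summarize time-to-first-token across the turns of one conversation."""
    avg = mean(first_token_ms)
    spread = pstdev(first_token_ms)
    return {
        "mean_ms": avg,
        "stdev_ms": spread,
        "worst_ms": max(first_token_ms),
        "cv": spread / avg,  # coefficient of variation: spread relative to the mean
    }


# Hypothetical sessions with the same mean but very different stability.
steady = [400, 420, 410, 430, 440]
degrading = [200, 250, 350, 500, 800]

for name, samples in (("steady", steady), ("degrading", degrading)):
    stats = turn_stats(samples)
    print(f"{name:>9}: mean {stats['mean_ms']:.0f} ms, "
          f"stdev {stats['stdev_ms']:.0f} ms, worst {stats['worst_ms']} ms")
```

Both sessions average the same latency, but only the second shows the late-turn degradation a user would notice, which is the case for tracking spread and worst case alongside the mean.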

Limitations

The architecture still has real constraints. First, platform inconsistency is a standing condition rather than a temporary bug: WebGPU remains uneven in baseline support and secure-context behavior, so browser fallback must remain robust and observable [S13]. Second, native deployment has a larger build-and-ship burden. The LiteRT-LM pipeline spans conversion, backend compatibility, and runtime linkage; each layer can shift independently across OS, compiler, and SoC updates [S10, S11]. Third, model artifacts diverge by precision and accelerator assumptions, creating a combinatorial test matrix unless automation and manifest-driven download checks are strict [S8]. Fourth, on-device reliability under long context remains a live engineering risk because cache growth continues to dominate practical cost [S3].

A further risk is semantic drift between model formats. A character policy that is stable in one runtime can shift when quantization changes or cache management differs. This is especially relevant for Realbits because the product vision emphasizes reusable assets across apps; policy drift across surfaces effectively fragments the catalog. Finally, iOS/desktop coverage for some paths can lag, so any packaging design must preserve a valid fallback path and never gate first-run onboarding on unavailable native features.

Implications for This Repository

Realbits should move from "model artifacts exist" to "model artifacts are contract-driven". Concretely:

Introduce a shared runtime manifest schema that includes fields for model_id, format, quant, min_memory_gb, supported_accelerators, and feature_profile.
Add a manifest-to-runtime resolver in both web and Flutter layers that selects runtime and acceleration based on capability probing, not static feature flags.
Keep ONNX and .litertlm as sibling delivery formats with aligned semantic metadata so app-level character behavior remains stable.
Add explicit acceptance tests for cross-surface parity: same character prompt, same input, multiple accelerators, same truncation policy.
Expand failure telemetry: measure warmup, compile/initialization, prefill latency, decode latency, and fallback reason counts in both web and native.
Require model pack validation checks before publish: hash integrity, accelerator target presence, and minimum runtime compatibility checks.
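The pre-publish validation step in the list above could look roughly like the following. This is a sketch under stated assumptions: the manifest is a plain dictionary carrying the schema fields named earlier, plus hypothetical sha256 and min_runtime keys, and "runtime compatibility" is read as "some shipped runtime is new enough to load the pack".

```python
import hashlib

# Schema fields taken from the list above; sha256 and min_runtime are assumed extras.
REQUIRED_FIELDS = ("model_id", "format", "quant", "min_memory_gb",
                   "supported_accelerators", "feature_profile")


def validate_pack(manifest: dict, payload: bytes, newest_runtime: tuple) -> list:
    """Return a list of problems; an empty list means the pack may be published."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in manifest]

    # Hash integrity: the manifest must describe exactly the bytes being shipped.
    if hashlib.sha256(payload).hexdigest() != manifest.get("sha256"):
        problems.append("sha256 mismatch")

    # Accelerator target presence: at least one declared execution target.
    if not manifest.get("supported_accelerators"):
        problems.append("no accelerator target declared")

    # Minimum runtime compatibility: some shipped runtime must be able to load it.
    min_runtime = tuple(manifest.get("min_runtime", ()))
    if not min_runtime or min_runtime > newest_runtime:
        problems.append("min_runtime missing or newer than any shipped runtime")

    return problems


# Hypothetical pack: the payload stands in for real model bytes.
payload = b"stand-in for model bytes"
manifest = {
    "model_id": "gemma-demo-small-r1",
    "format": "litertlm",
    "quant": "int4",
    "min_memory_gb": 1,
    "supported_accelerators": ["gpu", "cpu"],
    "feature_profile": "chat",
    "sha256": hashlib.sha256(payload).hexdigest(),
    "min_runtime": [1, 2],
}
print(validate_pack(manifest, payload, newest_runtime=(1, 4)))
```

Returning a list of problems rather than raising on the first one lets a publish pipeline report every defect in a pack in a single run.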

This does more than optimize latency. It operationalizes architecture alignment between the web and native paths and preserves what makes Realbits a provider platform: consistent character experience across channels with local compute as the default. The immediate technical win is fewer unreproducible runtime bugs. The strategic win is that model packs become API surfaces in their own right, which is exactly the right abstraction for multi-app distribution.

References

S1: Gemma: Open Models Based on Gemini Research and Technology - https://arxiv.org/abs/2403.08295
S2: Gemma 2: Improving Open Language Models at a Practical Size - https://arxiv.org/abs/2408.00118
S3: Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs - https://arxiv.org/abs/2604.04722
S4: On-Device Neural Net Inference with Mobile GPUs - https://arxiv.org/abs/1907.01989
S5: LiteRT overview - https://ai.google.dev/edge/litert/overview
S6: Deploy GenAI Models with LiteRT - https://ai.google.dev/edge/litert/genai/overview
S7: LiteRT for Android - https://ai.google.dev/edge/litert/android
S8: Run LLMs using LiteRT-LM NPU - https://ai.google.dev/edge/litert/next/litert_lm_npu
S9: LiteRT repository - https://github.com/google-ai-edge/LiteRT
S10: LiteRT-LM README - https://github.com/google-ai-edge/LiteRT-LM/blob/main/README.md
S11: LiteRT-LM build-and-run guide - https://github.com/google-ai-edge/LiteRT-LM/blob/main/docs/getting-started/build-and-run.md
S12: Transformers.js documentation - https://huggingface.co/docs/transformers.js/en/index
S13: WebGPU in Transformers.js guide - https://huggingface.co/docs/transformers.js/guides/webgpu
S14: WebGPU API - https://developer.mozilla.org/en-US/docs/Web/API/WebGPU_API
S15: Binding to native macOS code using dart:ffi - https://docs.flutter.dev/platform-integration/macos/c-interop
S16: Flutter platform channels - https://docs.flutter.dev/platform-integration/platform-channels
S17: onnx-community Gemma ONNX model page - https://huggingface.co/onnx-community/gemma-3-270m-it-ONNX

Source Ledger

[S1] Gemma: Open Models Based on Gemini Research and Technology (arxiv): https://arxiv.org/abs/2403.08295 - Defines the open Gemma family profile, relative performance, and safety framing used as the baseline model-family context for provider-side distribution.
[S2] Gemma 2: Improving Open Language Models at a Practical Size (arxiv): https://arxiv.org/abs/2408.00118 - Shows the architecture and scaling choices in later Gemma releases, useful for sizing and feature planning in Realbits' model catalog.
[S3] Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs (arxiv): https://arxiv.org/abs/2604.04722 - Quantifies KV-cache memory/bandwidth bottlenecks on device and motivates adaptive quantization strategies for mobile chat workloads.
[S4] On-Device Neural Net Inference with Mobile GPUs (arxiv): https://arxiv.org/abs/1907.01989 - Provides early but still relevant constraints for mobile GPU-oriented model design and deployment assumptions.
[S5] LiteRT overview (official-doc): https://ai.google.dev/edge/litert/overview - Primary source for LiteRT model formats, CompiledModel direction, on-device workflow, and cross-platform positioning.
[S6] Deploy GenAI Models with LiteRT (official-doc): https://ai.google.dev/edge/litert/genai/overview - Defines LiteRT-LM responsibilities including session cloning, kv-cache management, and LLM-specific orchestration.
[S7] LiteRT for Android (official-doc): https://ai.google.dev/edge/litert/android - Clarifies Android-side integration points including hardware acceleration and CompiledModel usage patterns.
[S8] Run LLMs using LiteRT-LM NPU guidance (official-doc): https://ai.google.dev/edge/litert/next/litert_lm_npu - Describes .litertlm model requirements, NPU vendor-specific workflows, and device-linked model selection constraints.
[S9] LiteRT project homepage (technical-report): https://github.com/google-ai-edge/LiteRT - Confirms conversion and deployment paths for classic models and LLMs, including generative conversion flow.
[S10] LiteRT-LM project README (technical-report): https://github.com/google-ai-edge/LiteRT-LM/blob/main/README.md - Summarizes supported engines, supported APIs, cross-platform scope, and model families available in practice.
[S11] LiteRT-LM build-and-run guide (technical-report): https://github.com/google-ai-edge/LiteRT-LM/blob/main/docs/getting-started/build-and-run.md - Shows concrete build/run workflow for litert_lm_main and .litertlm deployment steps.
[S12] Transformers.js documentation (official-doc): https://huggingface.co/docs/transformers.js/en/index - Defines browser runtime behavior, ONNX Runtime execution model, pipeline API, and quantized execution guidance.