May 27, 2026 | 11 sources | 3 arXiv

Model-Pack Contracts for Local Gemma in Flutter Chat Apps

A technical analysis of how Realbits should treat on-device Gemma as a versioned model-pack contract for Flutter chat apps, with LiteRT-LM as the native runtime boundary and provider manifests as the cross-app control surface.

Tags: on-device-ai, gemma, flutter, litert

Abstract

Realbits does not need another generic local-LLM integration story. The repository already points at a sharper engineering problem: how to package Gemma as a reusable, versioned model pack that Flutter apps can install, validate, warm, and invoke without turning Dart UI code into an inference runtime. Gemma is a credible base for this because the model line was released as lightweight open models [S2], refined in Gemma 2 for practical-size deployment with grouped-query attention, local-global attention, and distillation [S3], and extended in Gemma 3 with architectural changes that lower long-context KV-cache pressure [S1]. LiteRT and LiteRT-LM are relevant because Google positions LiteRT as an on-device ML and GenAI deployment stack, while LiteRT-LM narrows that stack toward language-model execution across edge platforms [S4][S5][S6].

The architectural question for Realbits is therefore not simply model format. It is contract design. A model pack should include the model artifact, tokenizer and runtime metadata, compatibility constraints, hashes, evaluation probes, prompt-window policy, and install-state semantics. Flutter should present chat, progress, cancellation, and error recovery. Native code should own accelerator selection, file mapping, memory pressure, streaming decode, and runtime-specific failures. That split turns on-device Gemma from an app-specific experiment into provider inventory that can move across Realbits Chat Hub, Situational English Coach, Character Chat, and future vertical surfaces.

Realbits Context

Realbits is now framed as a model and app provider: on-device models, reusable character cards, themed Flutter apps, and a web creator surface. The current repository context includes a local-only Realbits Chat Hub Flutter app, local speech and Gemma runtime wiring, macOS LiteRT-LM helper scripts, and a web Gemma runtime that uses Transformers.js and ONNX model artifacts. That mix matters. Realbits is already operating two different local inference environments: native app runtimes for mobile and desktop, and browser-side inference for web creation and preview. The product unit that can unify them is not a widget, plugin, or chat page. It is the model pack.

For Flutter apps, the pack boundary should be stricter than a normal asset folder. Flutter can package static assets, but large model binaries make base package size and update behavior a product concern [S8]. Android also has platform-native delivery tools such as Play Asset Delivery, where large assets can be delivered install-time, fast-follow, or on-demand rather than always living in the base install [S9]. Realbits may still choose its provider-hosted download pipeline for cross-store consistency, but the same lesson applies: Gemma weights should be treated as managed runtime assets, not ordinary UI resources.

The repository already leans in this direction. The Flutter app README describes local runtime installation and pack delivery as setup concerns, not cloud inference calls. The LiteRT-LM helper directory under the Flutter app tree treats macOS native artifacts as staged SDK material. The web package contains a Gemma runtime hook that loads an ONNX community Gemma model through Transformers.js, which itself runs transformer models in browser or JavaScript environments via ONNX Runtime [S10][S11]. These are not identical execution stacks, but they can share catalog metadata, availability checks, and evaluation expectations.

Related Work

The Gemma papers are useful because they describe a model family built for smaller, more accessible deployments rather than only frontier-scale serving. The original Gemma report released 2B and 7B pretrained and instruction-tuned models and included benchmark and responsibility evaluations [S2]. Gemma 2 expanded the line to practical sizes up to 27B and described architectural and training changes, including grouped-query attention and distillation for smaller variants [S3]. Gemma 3 pushed the family toward multimodal inputs, longer context, and architectural changes intended to reduce the memory growth associated with long-context KV caches [S1]. For Realbits, these details are not abstract model trivia. They directly affect pack size, prompt-window policy, mobile memory budgets, and whether a character-card prompt can stay resident during longer chats.

LiteRT and LiteRT-LM provide the runtime side of the related work. LiteRT's official documentation describes an on-device framework for ML and GenAI deployment using conversion, runtime, and optimization machinery [S4]. The GenAI documentation lists Gemma-family deployment paths and positions LiteRT around edge language models [S5]. LiteRT-LM's own project documentation describes a cross-platform language-model inference framework with Android, iOS, web, desktop, and IoT ambitions, plus language bindings and community Flutter support [S6]. This is important for Realbits because Flutter is not itself a model runtime. Flutter's official platform-channel design exists to let Dart call host-platform code asynchronously, which is the right mental model for invoking native LiteRT-LM without leaking native complexity into the chat UI [S7].

Architecture Analysis

A robust Realbits Gemma pack should have four contracts.

First, the artifact contract. The pack manifest should identify the model family, model id, quantization, context limit, tokenizer files, runtime format, expected accelerator class, minimum runtime version, license terms, byte size, hashes, and unpacked disk footprint. The manifest should not merely say "Gemma installed". It should say which Gemma artifact is installed and which runtime has proved it can open it. This matters because LiteRT-LM, ONNX, and future native formats may all represent Gemma-family models, but they will not share binary compatibility or identical performance behavior.
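As a rough illustration of the artifact contract, the sketch below models a manifest in TypeScript, the language the web provider surface would plausibly emit it from. Every field name, the `RuntimeFormat` values, and both helper functions are assumptions made for illustration; none of them is an existing Realbits or LiteRT-LM schema.

```typescript
// Hypothetical manifest shape. Field names are illustrative only.
type RuntimeFormat = "litert-lm" | "onnx";

interface ModelPackManifest {
  family: string;             // e.g. "gemma"
  modelId: string;            // exact artifact id, not just a family name
  quantization: string;       // e.g. "int4"
  contextLimitTokens: number;
  tokenizerFiles: string[];
  runtimeFormat: RuntimeFormat;
  acceleratorClass: "cpu" | "gpu" | "npu";
  minRuntimeVersion: string;  // dotted version such as "1.2.0"
  license: string;
  byteSize: number;           // size of the packed download
  sha256: string;             // lowercase hex digest of the packed artifact
  unpackedBytes: number;      // disk footprint after install
}

// Returns human-readable problems; an empty list means the manifest is structurally valid.
function validateManifest(m: ModelPackManifest): string[] {
  const problems: string[] = [];
  if (m.modelId.trim() === "") problems.push("modelId is empty");
  if (!/^[0-9a-f]{64}$/.test(m.sha256)) problems.push("sha256 is not a 64-character lowercase hex digest");
  if (!Number.isInteger(m.contextLimitTokens) || m.contextLimitTokens <= 0) {
    problems.push("contextLimitTokens must be a positive integer");
  }
  if (m.byteSize <= 0 || m.unpackedBytes <= 0) problems.push("sizes must be positive");
  if (m.tokenizerFiles.length === 0) problems.push("no tokenizer files listed");
  if (!/^\d+(\.\d+)*$/.test(m.minRuntimeVersion)) problems.push("minRuntimeVersion is not a dotted version");
  return problems;
}

// Compares dotted versions numerically, so "1.10.0" satisfies a "1.9.3" requirement.
function runtimeSatisfies(installed: string, required: string): boolean {
  const a = installed.split(".").map(Number);
  const b = required.split(".").map(Number);
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    if (x !== y) return x > y;
  }
  return true;
}
```

Structural validation of this kind only shows that the manifest is well formed. Whether a runtime can actually open the artifact still has to be proved on the device.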

Second, the runtime contract. Flutter should ask the native host for capabilities, not assume them. A native bridge can report text generation, streaming support, cancellation support, maximum safe prompt tokens, preferred batch or decode settings, accelerator used, and whether the current model supports multimodal inputs or tool calling. LiteRT-LM documentation now describes a broad platform and API surface, including native and community Flutter paths [S6]. That breadth is useful, but it also means Realbits should insulate app code from runtime churn. Dart should depend on a stable Realbits interface such as prepare, generate, cancel, unload, and healthCheck. Kotlin, Swift, Objective-C, C++, or platform plugins can absorb upstream API movement.
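The capability-first idea can be sketched as follows. The five method names come from the paragraph above; their signatures, the `RuntimeCapabilities` fields, and the `planRequest` helper are hypothetical and are written in TypeScript only to keep the examples in one language. A Dart interface would mirror the same shape.

```typescript
// Capabilities as reported by the native host. Field names are assumptions.
interface RuntimeCapabilities {
  streaming: boolean;
  cancellation: boolean;
  maxSafePromptTokens: number;
  accelerator: "cpu" | "gpu" | "npu";
  multimodal: boolean;
  toolCalling: boolean;
}

// Stable app-facing surface. Native adapters absorb upstream API movement behind it.
interface LocalChatRuntime {
  prepare(packId: string): Promise<RuntimeCapabilities>;
  generate(prompt: string, onToken: (text: string) => void): Promise<string>;
  cancel(): void;
  unload(): Promise<void>;
  healthCheck(): Promise<boolean>;
}

type RequestPlan =
  | { kind: "run"; stream: boolean }
  | { kind: "reject"; reason: string };

// Decides how to issue a request from reported capabilities instead of assumed ones.
function planRequest(
  caps: RuntimeCapabilities,
  promptTokens: number,
  needsImages: boolean,
): RequestPlan {
  if (needsImages && !caps.multimodal) {
    return { kind: "reject", reason: "pack does not accept image input" };
  }
  if (promptTokens > caps.maxSafePromptTokens) {
    return { kind: "reject", reason: "prompt exceeds the safe token budget" };
  }
  // Fall back to a single blocking response when the host cannot stream.
  return { kind: "run", stream: caps.streaming };
}
```

Returning a plan rather than throwing keeps the rejection reason available to the UI, which can then offer a shorter prompt or a different pack.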

Third, the interaction contract. Chat is not one blocking function call. The UI needs token streaming, partial text, cancellation, retry, cold-start progress, and clear failure states. Flutter platform channels are asynchronous and are intended for passing messages between Dart and host code [S7], but large token streams and long-running native work require discipline. Realbits should avoid pushing giant prompt or response blobs through poorly typed channels on every frame. A practical design is a command channel for lifecycle operations and an event stream for decode events, errors, and timing metrics. The model host should own context construction, but the app should own conversation policy: which messages are retained, when a summary is inserted, and when a character card is re-applied.
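One way to picture the event stream and the app-owned conversation policy is the sketch below. The event names, the `requestId` field, and the newest-first retention rule are assumptions for illustration. A real policy might also insert summaries or re-apply a character card, which this example omits.

```typescript
// Typed decode events as they might arrive on an event stream from the native host.
type DecodeEvent =
  | { type: "token"; requestId: number; text: string }
  | { type: "done"; requestId: number; tokensPerSecond: number }
  | { type: "error"; requestId: number; code: string }
  | { type: "cancelled"; requestId: number };

interface ChatTurnState {
  requestId: number;
  text: string;
  status: "streaming" | "done" | "failed" | "cancelled";
  errorCode?: string;
}

// Pure reducer: folds one event into the UI state for the active turn.
function applyEvent(state: ChatTurnState, ev: DecodeEvent): ChatTurnState {
  // Drop events from stale requests and anything arriving after a terminal state.
  if (ev.requestId !== state.requestId || state.status !== "streaming") return state;
  switch (ev.type) {
    case "token":
      return { ...state, text: state.text + ev.text };
    case "done":
      return { ...state, status: "done" };
    case "error":
      return { ...state, status: "failed", errorCode: ev.code };
    case "cancelled":
      return { ...state, status: "cancelled" };
  }
}

interface ChatMessage {
  role: "user" | "model";
  text: string;
  approxTokens: number;
}

// App-side policy: keep the newest messages that fit a token budget chosen by the caller.
function retainWithinBudget(history: ChatMessage[], budgetTokens: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const msg = history[i];
    if (msg === undefined || used + msg.approxTokens > budgetTokens) break;
    kept.unshift(msg);
    used += msg.approxTokens;
  }
  return kept;
}
```

Keeping the reducer pure means late or duplicated events from a cancelled request cannot corrupt the visible transcript.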

Fourth, the evaluation contract. A model pack is not ready because the file exists. It is ready when it passes open, tokenize, short-generate, cancellation, warmup, and persona-smoke checks on the target device class. Gemma 3's long-context and KV-cache changes are relevant, but a Realbits chat app still needs empirical thresholds for first-token latency, tokens per second, memory ceiling, and failure recovery [S1]. These probes should be shipped with the pack manifest or generated by the provider pipeline. That makes release gating local and repeatable instead of depending on manual chat impressions.
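A release gate along these lines might look like the sketch below. The check names follow the list in the paragraph above; the threshold fields and the `gatePack` helper are hypothetical, and any concrete limits would have to come from measurements on the target device class rather than from this example.

```typescript
// Functional checks named in the evaluation contract.
const CHECK_NAMES = ["open", "tokenize", "shortGenerate", "cancel", "warmup", "personaSmoke"] as const;
type CheckName = (typeof CHECK_NAMES)[number];

// Empirical limits shipped with the manifest or generated by the provider pipeline.
interface ProbeThresholds {
  maxFirstTokenMs: number;
  minTokensPerSecond: number;
  maxPeakMemoryMb: number;
}

// One probe run on a specific device class.
interface ProbeRun {
  checks: Record<CheckName, boolean>;
  firstTokenMs: number;
  tokensPerSecond: number;
  peakMemoryMb: number;
}

interface GateResult {
  ready: boolean;
  failures: string[];
}

// A pack is ready only if every functional check passes and every metric is within limits.
function gatePack(run: ProbeRun, limits: ProbeThresholds): GateResult {
  const failures: string[] = [];
  for (const name of CHECK_NAMES) {
    if (!run.checks[name]) failures.push(`check failed: ${name}`);
  }
  if (run.firstTokenMs > limits.maxFirstTokenMs) failures.push("first-token latency above limit");
  if (run.tokensPerSecond < limits.minTokensPerSecond) failures.push("decode throughput below limit");
  if (run.peakMemoryMb > limits.maxPeakMemoryMb) failures.push("peak memory above limit");
  return { ready: failures.length === 0, failures };
}
```

Collecting every failure instead of stopping at the first one gives the provider pipeline a complete picture of why a candidate pack was held back.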

This contract approach also clarifies the relationship between model packs and character cards. A character card is a semantic input: persona prompt, greeting, examples, scenario, voice reference, tags, and ownership metadata. A Gemma pack is an execution dependency. Keeping those separate lets Realbits reuse the same character across multiple model versions, or test the same model pack against multiple character cards. It also avoids putting model-specific prompt assumptions into creator-owned assets.
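The separation can be made concrete with a small sketch in which the card carries only semantic fields and the pack carries the model-specific prompt formatting. All type names, fields, and the bracketed prompt markers here are invented for illustration. They are not Gemma's actual chat template or a Realbits schema.

```typescript
// Creator-owned semantic input. It contains no model-specific formatting.
interface CharacterCard {
  id: string;
  persona: string;
  greeting: string;
  tags: string[];
  owner: string;
}

// Pack-owned execution policy, including how a turn is formatted for this model.
interface PackPromptPolicy {
  packId: string;
  maxPersonaChars: number;
  formatTurn(persona: string, userText: string): string;
}

// Binds a card to a pack at request time. Truncation is the pack's concern, not the card's.
function renderPrompt(card: CharacterCard, pack: PackPromptPolicy, userText: string): string {
  const persona = card.persona.slice(0, pack.maxPersonaChars);
  return pack.formatTurn(persona, userText);
}

// Enumerates every card and pack pairing so the same probes can run across the full matrix.
function evaluationMatrix(
  cards: CharacterCard[],
  packs: PackPromptPolicy[],
): Array<{ cardId: string; packId: string }> {
  const pairs: Array<{ cardId: string; packId: string }> = [];
  for (const card of cards) {
    for (const pack of packs) {
      pairs.push({ cardId: card.id, packId: pack.packId });
    }
  }
  return pairs;
}
```

Because the formatter lives on the pack, swapping model versions changes the rendered prompt without touching the creator's card.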

Limitations

The main limitation is that local inference maturity varies by platform. Android may be the first stable target for native LiteRT-LM integration, while macOS can use staged native SDK artifacts and iOS may require separate Swift or Objective-C binding work. LiteRT-LM's documentation shows a broad support matrix, but broad support is not the same as uniform product readiness across accelerators, operating systems, and app-store constraints [S6].

A second limitation is artifact volatility. Quantized model variants can differ in quality, memory footprint, and supported runtime path. The Hugging Face ONNX Gemma repository exposes multiple ONNX data and quantized variants, which is useful for web experimentation but also a warning against treating a model name as a complete deployment identifier [S11]. Realbits needs pack-level hashes and runtime compatibility probes, not just a catalog title.
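To show why a title alone is not a deployment identifier, the following sketch flags catalog titles that resolve to more than one distinct artifact hash. The `CatalogEntry` shape and the `ambiguousTitles` helper are hypothetical.

```typescript
// Minimal catalog row. A real manifest would carry more fields.
interface CatalogEntry {
  title: string;          // display name such as a model name
  quantization: string;
  runtimeFormat: string;
  sha256: string;         // digest of the actual artifact
}

// Returns titles that map to more than one distinct artifact hash.
function ambiguousTitles(catalog: CatalogEntry[]): string[] {
  const hashesByTitle = new Map<string, Set<string>>();
  for (const entry of catalog) {
    const hashes = hashesByTitle.get(entry.title) ?? new Set<string>();
    hashes.add(entry.sha256);
    hashesByTitle.set(entry.title, hashes);
  }
  const result: string[] = [];
  hashesByTitle.forEach((hashes, title) => {
    if (hashes.size > 1) result.push(title);
  });
  return result;
}
```

A provider pipeline could run a check like this before publishing, so that any title shared by several quantized variants is always disambiguated by hash and runtime format in the app.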

A third limitation is user experience. On-device privacy and cost control do not remove install friction. Large downloads, disk pressure, warmup time, and device incompatibility all surface directly to the user. Flutter app-size guidance and Android asset-delivery documentation both make clear that binary and asset placement affects download and update behavior [S8][S9]. A local-only app must make those states legible without falling back to cloud inference.

Implications for This Repository

The immediate implication is to standardize a Realbits model-pack manifest before adding more runtime adapters. The manifest should be produced by the web provider surface and consumed by Flutter apps. It should describe artifact identity, runtime compatibility, hashes, size, prompt-window policy, evaluation probes, and rollout status. This keeps app code from learning provider internals and keeps provider tooling from guessing what a native host can run.

Second, the Flutter layer should remain runtime-agnostic. Realbits should expose one Dart interface for local chat generation and implement platform adapters beneath it. Android can bind to LiteRT-LM through native host code. macOS can continue maturing the staged LiteRT-LM SDK path. Web can keep using Transformers.js and ONNX for creator preview where appropriate [S10]. The shared contract is capability and behavior, not a single binary format.
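A minimal sketch of adapter selection under that split is shown below. The platform-to-runtime table reflects the paragraph above, with iOS left unmapped because the document treats its binding work as pending. The `PackVariant` shape and the smallest-download tie-break are assumptions.

```typescript
type Platform = "android" | "ios" | "macos" | "web";
type RuntimeFormat = "litert-lm" | "onnx";

interface PackVariant {
  packId: string;
  runtimeFormat: RuntimeFormat;
  byteSize: number;
}

// Which runtime each platform adapter speaks. null means no adapter is available yet.
const RUNTIME_FOR_PLATFORM: Record<Platform, RuntimeFormat | null> = {
  android: "litert-lm",
  macos: "litert-lm",
  web: "onnx",
  ios: null,
};

// Picks a variant the platform adapter can run, preferring the smallest download (an assumed policy).
function selectVariant(platform: Platform, variants: PackVariant[]): PackVariant | null {
  const format = RUNTIME_FOR_PLATFORM[platform];
  if (format === null) return null;
  // filter() returns a new array, so sorting does not reorder the caller's list.
  const candidates = variants.filter((v) => v.runtimeFormat === format);
  candidates.sort((a, b) => a.byteSize - b.byteSize);
  return candidates[0] ?? null;
}
```

A null result is a legitimate outcome that the UI should present as "not available on this device" rather than as an error.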

Third, pack readiness should be observable. The app should track discovered, downloading, verified, compatible, warming, ready, generating, degraded, and failed states. Those states should feed both UX and provider analytics. If a pack fails because of checksum mismatch, unsupported accelerator, runtime ABI mismatch, or out-of-memory during warmup, that failure should be classified. Otherwise Realbits cannot know whether the issue is packaging, device coverage, or model choice.
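The state list and failure causes above could be encoded roughly as follows. The state names are taken from the text; the allowed transitions and the mapping from failure code to bucket are assumptions chosen for illustration, and a real app may well classify out-of-memory failures differently.

```typescript
type PackState =
  | "discovered" | "downloading" | "verified" | "compatible"
  | "warming" | "ready" | "generating" | "degraded" | "failed";

// Assumed transition table. Every state can fail, and a failed pack can be rediscovered.
const NEXT_STATES: Record<PackState, PackState[]> = {
  discovered: ["downloading", "failed"],
  downloading: ["verified", "failed"],
  verified: ["compatible", "failed"],
  compatible: ["warming", "failed"],
  warming: ["ready", "failed"],
  ready: ["generating", "degraded", "failed"],
  generating: ["ready", "degraded", "failed"],
  degraded: ["ready", "warming", "failed"],
  failed: ["discovered"],
};

function canTransition(from: PackState, to: PackState): boolean {
  return NEXT_STATES[from].indexOf(to) !== -1;
}

type FailureCode =
  | "checksum_mismatch"
  | "unsupported_accelerator"
  | "runtime_abi_mismatch"
  | "oom_during_warmup";

type FailureBucket = "packaging" | "device_coverage" | "model_choice";

// Maps a concrete failure to the question the provider needs answered.
function classifyFailure(code: FailureCode): FailureBucket {
  switch (code) {
    case "checksum_mismatch":
    case "runtime_abi_mismatch":
      return "packaging";
    case "unsupported_accelerator":
      return "device_coverage";
    case "oom_during_warmup":
      return "model_choice";
  }
}
```

Rejecting illegal transitions at this layer keeps analytics trustworthy, since a pack can never be reported as ready without having passed through verification and warmup.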

Fourth, character evaluation should sit above the runtime contract. Tactics C02 and C09 in the repository's operating model point toward schema-driven character binding and release gates. For Gemma packs, that means every candidate pack should run a small suite of persona and safety probes before it becomes eligible for vertical apps. The model pack should prove that it can serve character cards, not merely generate text.

The practical direction is conservative: make LiteRT-LM the native execution boundary, make Flutter the product surface, and make provider manifests the durable contract between them. That gives Realbits a path to ship local Gemma as reusable infrastructure rather than re-solving inference packaging separately in every chat app.

References

Source Ledger

[S1] Gemma 3 Technical Report (arxiv): https://arxiv.org/abs/2503.19786 - Primary arXiv source for Gemma 3 model scale, multimodal scope, long-context support, and KV-cache-oriented architecture changes.
[S2] Gemma: Open Models Based on Gemini Research and Technology (arxiv): https://arxiv.org/abs/2403.08295 - Primary arXiv source for the original Gemma open model family and its safety and benchmark framing.
[S3] Gemma 2: Improving Open Language Models at a Practical Size (arxiv): https://arxiv.org/abs/2408.00118 - Primary arXiv source for practical-size Gemma improvements such as grouped-query attention, local-global attention, and distillation.
[S4] LiteRT overview (official-doc): https://ai.google.dev/edge/litert/overview - Official Google AI Edge documentation describing LiteRT as an on-device ML and GenAI deployment framework with conversion, runtime, and optimization support.
[S5] Deploy GenAI Models with LiteRT (official-doc): https://ai.google.dev/edge/litert/genai/overview - Official Google documentation for LiteRT GenAI deployment and supported Gemma-family model tracks.
[S6] LiteRT-LM README (web): https://github.com/google-ai-edge/LiteRT-LM/blob/main/README.md - Project documentation for LiteRT-LM platform support, language APIs, and its role as a language-model inference framework for edge devices.
[S7] Writing custom platform-specific code in Flutter (official-doc): https://docs.flutter.dev/platform-integration/platform-channels - Official Flutter documentation for platform channels, the likely bridge between Dart chat UI and native LiteRT-LM host code.
[S8] Measuring your app's size (official-doc): https://docs.flutter.dev/perf/app-size - Official Flutter documentation explaining why bundled assets and self-contained app packages affect app size.
[S9] Play Asset Delivery (official-doc): https://developer.android.com/guide/app-bundle/asset-delivery - Official Android documentation for delivering large asset packs separately from the base app bundle.
[S10] Transformers.js documentation (official-doc): https://huggingface.co/docs/transformers.js/ - Official Hugging Face documentation for browser-side transformer inference using ONNX Runtime, relevant to Realbits' web Gemma runtime path.
[S11] onnx-community/gemma-3-1b-it-ONNX-GQA files (web): https://huggingface.co/onnx-community/gemma-3-1b-it-ONNX-GQA/tree/main/onnx - Technical model repository showing ONNX and quantized file variants for the Gemma 3 1B instruction model used by browser-oriented inference stacks.