May 29, 2026 · 9 sources · 2 arXiv

Manifest-First Voice Packs for Local Mobile Chat

A technical analysis of how Realbits should treat local STT and TTS as versioned voice pack contracts for mobile chat, with emphasis on manifests, runtime boundaries, and release evidence.

voice · stt · tts · model-packs

Abstract

Local voice in Realbits should be specified as a pack contract rather than as UI sugar around a microphone button and a speaker icon. The repository already points that way: the web package models gemma, stt, and tts as downloadable packs with versions, byte counts, checksums, install paths, and renewable URLs, while the mobile shared voice packages define reusable client and DTO layers. The important engineering problem is therefore not simply whether speech recognition and synthesis can run on device. sherpa-onnx already targets offline STT, TTS, VAD, and related speech tasks without an Internet dependency, and its documentation includes mobile and Flutter surfaces [S1][S2]. The harder problem is preserving one product-level voice capability while engines, graphs, acceleration paths, and app targets keep changing.

This article argues for a manifest-first architecture. In that architecture, the Realbits provider surface publishes stable pack metadata, each app verifies and installs artifacts, and engine-specific runtimes load internal manifests below the chat layer. That lets Realbits continue using sherpa-onnx where it is mature, introduce MOSS-TTS-Nano where it improves TTS quality or latency, and still keep character chat orchestration independent from speech runtime details. The proposed contract is practical: it must cover file integrity, disk and memory budgets, locale, cancellation, streaming events, fallback behavior, and release-gate measurements.

Realbits Context

Realbits is now a model and app provider. Its apps distribute local model packs, reusable character cards, and themed chat experiences. Voice belongs in that same provider layer. A voice pack should be a reusable asset consumed by Realbits Chat Hub, Daily Wisdom Guide, Situational English Coach, and Character Chat, not a private implementation detail inside one Flutter app.

The current manifest shape in packages/web/src/lib/mobile-voice-models.ts is a useful start. It treats voice models as required packs with id, displayName, engine, version, artifactKind, contentType, bytes, sha256, locale, install location, and a direct download description. That is exactly the information a mobile client needs before it allocates storage, prompts the user, downloads a large artifact, or starts native inference. The global fields for required and recommended free bytes are also important because voice installation is an app-store-scale operational concern, not just a runtime call.
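A minimal sketch of how a client might use those fields before downloading anything. The field list follows the paragraph above, but the type names, the `installPath` and `downloadUrl` spellings, and both helper functions are assumptions for illustration, not the actual contents of mobile-voice-models.ts.

```typescript
// Hypothetical provider-facing shape; the real schema in the repository may differ.
interface VoicePackEntry {
  id: string;
  displayName: string;
  engine: string;        // engine family that loads the pack
  version: string;
  artifactKind: string;
  contentType: string;
  bytes: number;         // size of the downloadable artifact
  sha256: string;        // hex digest of the artifact
  locale: string;
  installPath: string;   // assumed name for the install location
  downloadUrl: string;   // assumed name for the direct download description
}

interface VoicePackCatalog {
  requiredFreeBytes: number;
  recommendedFreeBytes: number;
  packs: VoicePackEntry[];
}

type StorageVerdict = "ok" | "tight" | "insufficient";

// Decide whether to block, warn, or proceed before prompting the user.
function checkStorage(catalog: VoicePackCatalog, freeBytes: number): StorageVerdict {
  if (freeBytes < catalog.requiredFreeBytes) return "insufficient";
  if (freeBytes < catalog.recommendedFreeBytes) return "tight";
  return "ok";
}

// Cheap structural checks that can run before any network or native call.
function manifestProblems(entry: VoicePackEntry): string[] {
  const problems: string[] = [];
  if (entry.id.trim() === "") problems.push("id must not be empty");
  if (!/^[0-9a-fA-F]{64}$/.test(entry.sha256)) problems.push("sha256 must be 64 hex characters");
  if (entry.bytes <= 0 || Math.floor(entry.bytes) !== entry.bytes) {
    problems.push("bytes must be a positive integer");
  }
  return problems;
}
```

Running both checks first means a malformed or oversized pack is rejected before the app spends bandwidth or touches native inference.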

The repository also documents a MOSS-TTS-Nano migration path for Realbits Chat Hub. That plan correctly frames the move as a backend migration rather than a simple file swap. Existing sherpa lanes can represent TTS through sherpa-specific configuration. A MOSS-style path needs tokenizer support, prefill and decode-step graphs, a sampled-frame model, codec graphs, chunking rules, and voice metadata. ONNX Runtime Extensions matter here because tokenizer and preprocessing behavior may need custom operators on Android, iOS, and web targets [S4].

The architectural implication is straightforward: the app should depend on voice sessions and pack capabilities, not on graph filenames. Chat code should receive transcripts, emit text responses, and request synthesis. Runtime code should own the details of sherpa, MOSS, ONNX Runtime sessions, external data files, and device-specific execution providers.

Related Work

sherpa-onnx is the closest operational precedent for Realbits. Its project documentation describes speech-to-text, text-to-speech, diarization, enhancement, source separation, and VAD using ONNX Runtime without an Internet connection, with support across embedded systems, Android, iOS, macOS, Linux, Raspberry Pi, and Flutter-related surfaces [S1][S2]. For Realbits, that makes sherpa-onnx a credible local speech baseline because it already packages speech functionality in the direction the product wants: on-device by default, cross-platform, and reusable across apps.

ONNX Runtime provides the lower-level deployment lens. Its mobile guidance emphasizes that models must be in ONNX format, must fit on disk and in memory, and must be tested against the target mobile package and execution provider [S3]. It also recommends starting with CPU-oriented paths for consistency, then trying acceleration when CPU performance is insufficient. This matters for Realbits because voice packs are not only ML artifacts. They are installable product units that can fail from storage pressure, unsupported operators, memory spikes, or acceleration partitioning.

The TTS research landscape has moved toward more generative and token-based systems. The MOSS-TTS technical report describes a speech generation family built around discrete audio tokens, autoregressive modeling, and large-scale pretraining, including variants designed for shorter time to first audio [S5]. The public MOSS-TTS-Nano project further emphasizes small footprint, streaming inference, CPU friendliness, and ONNX CPU deployment, which makes it unusually relevant to a mobile local-chat stack [S8]. Realbits should still treat this as a production engineering task, not a model announcement. The pack contract must describe how several cooperating graphs become one stable TTS capability.

For STT, Zipformer is a useful reference because it focuses directly on ASR encoder efficiency. The paper describes a faster and more memory-efficient encoder than common Conformer baselines, with structural changes intended to improve both performance and recognition quality [S6]. Realbits does not need to expose Zipformer to users, but engineers should internalize the constraint: local STT quality and latency are encoder and streaming problems as much as app UI problems.

Platform APIs are helpful but insufficient as the provider contract. Android documentation lists on-device inference benefits such as latency, availability, privacy, and cost, while also warning about compute load, battery usage, application size, and the deprecation status of NNAPI in Android 15 [S7]. On the web, the Web Speech API exposes speech recognition and speech synthesis, but whether recognition runs on a remote service or on device depends on browser capability and policy [S9]. Those APIs can support experiments, but they should not replace Realbits-controlled pack identity.

Architecture Analysis

A Realbits voice pack should be described at two levels. The provider manifest should remain stable and app-facing. It answers product and installation questions: which task is this pack for, which engine family loads it, which locale does it support, how large is it, what checksum validates it, where should it be installed, and what app build can consume it. The internal engine manifest should be runtime-facing. It answers loading questions: which model files exist, which external data files are required, which tokenizer or extension library is needed, which provider hints are allowed, and which synthesis or recognition modes are supported.

This separation prevents engine churn from spreading through the app slate. A sherpa STT bundle, a sherpa TTS bundle, and a MOSS-TTS-Nano bundle can all be published through the same provider channel while loading through different runtime adapters. ONNX Runtime mobile guidance supports this split because the application must choose packages, operators, model formats, and execution providers according to target constraints [S3]. The app catalog should not need to know those details beyond compatibility flags.
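The two levels can be sketched as a runtime-facing manifest plus a small projection that yields the compatibility flags the catalog is allowed to see. Every name here (`EngineManifest`, `summarizeCapabilities`, the mode strings) is hypothetical and chosen only to illustrate the boundary.

```typescript
// Hypothetical runtime-facing manifest carried inside each artifact.
interface EngineFile {
  path: string;
  bytes: number;
  sha256: string;
}

interface EngineManifest {
  schemaVersion: number;
  engine: string;
  graphs: EngineFile[];         // model graph files
  externalData: EngineFile[];   // weight files referenced by the graphs
  tokenizer?: EngineFile;       // present only when the engine needs one
  requiresExtensions: boolean;  // true when a custom-operator library is needed
  providerHints: string[];      // execution providers the adapter may try
  modes: string[];              // assumed values: "one-shot", "streaming"
}

// The only engine facts the app catalog sees.
interface PackCapabilities {
  streaming: boolean;
  requiresExtensions: boolean;
  installedBytes: number;
}

function summarizeCapabilities(m: EngineManifest): PackCapabilities {
  const files = m.graphs.concat(m.externalData, m.tokenizer ? [m.tokenizer] : []);
  return {
    streaming: m.modes.indexOf("streaming") !== -1,
    requiresExtensions: m.requiresExtensions,
    installedBytes: files.reduce((sum, f) => sum + f.bytes, 0),
  };
}
```

Because graph filenames never leave `EngineManifest`, swapping one engine bundle for another changes the adapter input but not the catalog type.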

STT and TTS also need different session models. STT starts before the language model turn. It owns microphone capture, endpointing, partial hypotheses, final transcripts, cancellation, and permission errors. TTS starts after the language model turn. It owns text normalization, chunking, voice selection, time to first audio, playback, interruption, and generated file cleanup. Keeping separate stt and tts pack ids is therefore the right choice. A higher-level app can still present a single voice readiness state, but runtime behavior should remain task-specific.
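One way to present a single readiness state while keeping the packs separate is to derive the summary from the two task-specific states. The state names and the precedence rules below are assumptions, not an existing Realbits API.

```typescript
// Hypothetical per-pack install states.
type PackState = "missing" | "downloading" | "verifying" | "ready" | "failed";

interface VoiceReadiness {
  stt: PackState;
  tts: PackState;
}

type ReadinessSummary = "ready" | "partial" | "preparing" | "unavailable";

function isInProgress(state: PackState): boolean {
  return state === "downloading" || state === "verifying";
}

// Collapse two task-specific states into one label for the UI.
function summarizeReadiness(r: VoiceReadiness): ReadinessSummary {
  if (r.stt === "ready" && r.tts === "ready") return "ready";
  if (isInProgress(r.stt) || isInProgress(r.tts)) return "preparing";
  if (r.stt === "ready" || r.tts === "ready") return "partial";
  return "unavailable";
}
```

The "partial" result lets an app enable dictation without spoken replies, or the reverse, instead of treating voice as all or nothing.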

The MOSS TTS path makes manifest precision especially important. The Realbits-hosted INT8 MOSS-TTS-Nano model card describes separate prefill, decode-step, and local sampled-frame graphs, shared external weight data, and use of ONNX Runtime Mobile plus ONNX Runtime Extensions for a SentencePiece tokenizer graph [S8]. A top-level archive checksum is useful, but it is not enough after extraction. The installer should verify the expanded internal manifest, every referenced graph, every external data file, tokenizer presence, codec files, and declared voice metadata before marking the pack ready.
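The post-extraction check can be sketched as a pure comparison between what the internal manifest declares and what the installer found on disk. The sketch assumes the installer has already measured each extracted file's size and SHA-256; the role names and the function itself are hypothetical.

```typescript
// Hypothetical roles a file can play inside a pack.
type FileRole = "graph" | "external-data" | "tokenizer" | "codec" | "voice-metadata";

interface ExpectedFile {
  path: string;
  role: FileRole;
  bytes: number;
  sha256: string;
}

// What the installer measured after extraction (hashing happens elsewhere).
interface ObservedFile {
  bytes: number;
  sha256: string;
}

function verifyExtractedPack(
  expected: ExpectedFile[],
  observed: Record<string, ObservedFile>,
  requiredRoles: FileRole[],
): string[] {
  const problems: string[] = [];
  // The manifest itself must declare at least one file for every required role.
  for (const role of requiredRoles) {
    if (!expected.some((f) => f.role === role)) {
      problems.push(`manifest declares no ${role} file`);
    }
  }
  // Every declared file must exist with the declared size and digest.
  for (const file of expected) {
    const found = Object.prototype.hasOwnProperty.call(observed, file.path)
      ? observed[file.path]
      : undefined;
    if (!found) {
      problems.push(`missing: ${file.path}`);
      continue;
    }
    if (found.bytes !== file.bytes) problems.push(`size mismatch: ${file.path}`);
    if (found.sha256.toLowerCase() !== file.sha256.toLowerCase()) {
      problems.push(`sha256 mismatch: ${file.path}`);
    }
  }
  return problems;
}
```

A pack would be marked ready only when the returned list is empty, which keeps partially extracted or tampered bundles out of the runtime.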

Streaming should be part of the contract even when the first product path ships one-shot behavior. For STT, the facade should allow interim transcript events and endpoint notifications. For TTS, the facade should allow time-to-first-audio measurement, chunk completion, playback interruption, and final audio completion. The reason is not theoretical neatness. Efficient ASR and modern TTS systems are both latency pipelines [S5][S6]. If Realbits exposes only blocking calls such as transcribeFile and synthesizeText, future streaming work will require invasive changes to chat orchestration.
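To illustrate why an event-based facade costs little up front, the sketch below defines a streaming STT contract and then builds the one-shot call on top of it. The event names, `SttEngine`, and `transcribeOnce` are hypothetical, and the callback style is just one option; the scripted engine emits synchronously purely so the example is easy to exercise.

```typescript
// Hypothetical STT event stream.
type SttEvent =
  | { kind: "partial"; text: string }
  | { kind: "endpoint" }
  | { kind: "final"; text: string }
  | { kind: "cancelled" }
  | { kind: "failed"; code: string };

interface SttSession {
  cancel(): void;
}

interface SttEngine {
  start(onEvent: (event: SttEvent) => void): SttSession;
}

// One-shot convenience layered on the streaming contract.
// Reports the final transcript, or null if the session was cancelled or failed.
function transcribeOnce(engine: SttEngine, onDone: (text: string | null) => void): SttSession {
  let settled = false;
  return engine.start((event) => {
    if (settled) return;
    if (event.kind === "final") {
      settled = true;
      onDone(event.text);
    } else if (event.kind === "cancelled" || event.kind === "failed") {
      settled = true;
      onDone(null);
    }
  });
}

// Test double that replays a fixed script; a real engine would emit over time
// and stop microphone capture when cancel() is called.
function scriptedEngine(events: SttEvent[]): SttEngine {
  return {
    start(onEvent) {
      events.forEach(onEvent);
      return { cancel() {} };
    },
  };
}
```

Chat orchestration written against `SttEngine` can later start rendering partial hypotheses without changing the engine adapters, whereas the reverse migration from a blocking call would touch every caller.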

Acceleration should remain a runtime hint, not a product promise. ONNX Runtime supports execution providers for CPU and mobile accelerators, but provider choice is model and device dependent [S3]. Android also notes that NNAPI is deprecated and that on-device inference can raise battery use and application size [S7]. A voice pack should declare a default CPU-compatible path, then allow platform adapters to try XNNPACK, CoreML, NNAPI, or other providers under telemetry. Failure should degrade to slower local voice, not silently switch to a cloud service.
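The fallback rule can be sketched without touching any real runtime API. Here `tryLoad` stands in for whatever the platform adapter does to create a session with a given provider; it is a hypothetical callback, and it is synchronous only for brevity, since real session creation is typically asynchronous.

```typescript
// Provider labels used by this sketch only; they are not runtime identifiers.
type Provider = "cpu" | "xnnpack" | "coreml" | "nnapi";

interface LoadAttempt {
  provider: Provider;
  ok: boolean;
  error?: string;
}

interface LoadResult {
  active: Provider | null;  // null means even the CPU path failed
  attempts: LoadAttempt[];  // suitable for telemetry
}

// Accelerated hints first, de-duplicated, with CPU always last.
function attemptOrder(hints: Provider[]): Provider[] {
  const order: Provider[] = [];
  for (const p of hints) {
    if (p !== "cpu" && order.indexOf(p) === -1) order.push(p);
  }
  order.push("cpu");
  return order;
}

function loadWithFallback(hints: Provider[], tryLoad: (p: Provider) => void): LoadResult {
  const attempts: LoadAttempt[] = [];
  for (const provider of attemptOrder(hints)) {
    try {
      tryLoad(provider);  // throws when the provider cannot load the pack
      attempts.push({ provider, ok: true });
      return { active: provider, attempts };
    } catch (e) {
      const error = e instanceof Error ? e.message : String(e);
      attempts.push({ provider, ok: false, error });
    }
  }
  return { active: null, attempts };
}
```

Nothing in the result type can express a remote service, so a failed accelerator can only end in CPU inference or an explicit local failure.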

Limitations

The pack model increases release and operations burden. Realbits must publish artifacts, keep checksums current, support range downloads, handle interrupted installs, retire incompatible packs, and preserve older app behavior. minimumAppBuild is necessary but not sufficient. The provider should also track engine version, manifest schema version, ABI or platform constraints, installed byte size, and expected initialization memory.
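A compatibility check that goes beyond minimumAppBuild might look like the following. Apart from `minimumAppBuild`, every field name is an assumption, and the platform strings are placeholders.

```typescript
// Hypothetical compatibility metadata published with a pack.
interface PackRequirements {
  minimumAppBuild: number;
  manifestSchemaVersion: number;
  platforms: string[];      // assumed labels such as "android-arm64"
  installedBytes: number;   // size after extraction
  initMemoryBytes: number;  // expected memory at runtime initialization
}

// What the app knows about itself and the device.
interface DeviceContext {
  appBuild: number;
  supportedSchemaVersions: number[];
  platform: string;
  freeDiskBytes: number;
  availableMemoryBytes: number;
}

// Returns every reason the pack cannot be used; an empty list means compatible.
function incompatibilities(req: PackRequirements, dev: DeviceContext): string[] {
  const reasons: string[] = [];
  if (dev.appBuild < req.minimumAppBuild) reasons.push("app build too old");
  if (dev.supportedSchemaVersions.indexOf(req.manifestSchemaVersion) === -1) {
    reasons.push("unsupported manifest schema version");
  }
  if (req.platforms.indexOf(dev.platform) === -1) reasons.push("platform not supported");
  if (dev.freeDiskBytes < req.installedBytes) reasons.push("not enough disk space");
  if (dev.availableMemoryBytes < req.initMemoryBytes) reasons.push("not enough memory");
  return reasons;
}
```

Returning all reasons rather than the first one gives support and telemetry a complete picture of why a pack was retired or skipped on a device.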

Local voice quality will vary across devices and acoustic conditions. STT can fail because of accent, background noise, microphone routing, endpoint timing, or domain vocabulary. TTS can fail through pronunciation errors, unstable prosody, long-text drift, and voice mismatch. The US/EN-first scope is reasonable, but it should not be confused with solved evaluation. Realbits needs fixtures that include character names, app-specific phrases, short interruptions, and conversational replies.

Privacy also needs operational discipline. Local inference helps keep user audio and transcripts on device, matching the provider positioning. But apps can still leak sensitive content through logs, crash payloads, temporary files, analytics events, or support bundles. Voice DTOs should distinguish content from metrics. Telemetry should record pack ids, engine versions, timings, device class, and error codes, not raw transcript text or generated speech content.
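The content-versus-metrics split can be enforced structurally. In this sketch, with hypothetical type and field names, telemetry is built by copying an explicit allowlist of metric fields, so transcript text cannot reach an analytics event even if the result object grows new properties later.

```typescript
// Metrics are safe to report; content is not.
interface VoiceMetrics {
  packId: string;
  engineVersion: string;
  deviceClass: string;
  durationMs: number;
  errorCode?: string;
}

interface SttOutcome {
  content: { transcript: string };  // stays on device
  metrics: VoiceMetrics;
}

// Copy field by field instead of spreading, so only allowlisted keys survive.
function toTelemetry(outcome: SttOutcome): VoiceMetrics {
  const m = outcome.metrics;
  const event: VoiceMetrics = {
    packId: m.packId,
    engineVersion: m.engineVersion,
    deviceClass: m.deviceClass,
    durationMs: m.durationMs,
  };
  if (m.errorCode !== undefined) event.errorCode = m.errorCode;
  return event;
}
```

The same pattern would apply to TTS outcomes, where the input text and generated audio take the place of the transcript.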

MOSS-style TTS adds complexity. The research and model-card direction is promising, but multi-graph autoregressive synthesis has more moving parts than a simple system TTS wrapper [S5][S8]. Built-in voices are the correct first scope. Voice cloning, creator-uploaded voices, or prompt-generated speaker identities would introduce consent, safety, review, and abuse risks that belong in a separate workstream.

Implications for This Repository

The near-term implementation goal should be a hardened voice pack contract. VoiceModelPackConfig can remain the provider-facing schema, while each artifact carries an internal engine manifest. The mobile realbits_voice_contract package should model session events for STT and TTS explicitly: started, partial result, final result, chunk ready, first audio, completed, cancelled, and failed. The realbits_voice_client package can then adapt those events to Flutter UI without leaking engine details.
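The event list above maps naturally onto a tagged union plus a reducer that turns events into UI state. The mobile packages would express this in Dart; TypeScript is used here only for illustration, and the payload fields and `VoiceUiState` shape are assumptions.

```typescript
// Hypothetical event union covering both STT and TTS sessions.
type VoiceEvent =
  | { kind: "started" }
  | { kind: "partialResult"; text: string }
  | { kind: "finalResult"; text: string }
  | { kind: "chunkReady"; index: number }
  | { kind: "firstAudio"; elapsedMs: number }
  | { kind: "completed" }
  | { kind: "cancelled" }
  | { kind: "failed"; code: string };

interface VoiceUiState {
  phase: "idle" | "active" | "done" | "cancelled" | "error";
  caption: string;              // latest transcript text, if any
  firstAudioMs: number | null;  // time to first audio for TTS sessions
  chunksReady: number;
  errorCode: string | null;
}

const initialVoiceUiState: VoiceUiState = {
  phase: "idle", caption: "", firstAudioMs: null, chunksReady: 0, errorCode: null,
};

// Pure reducer: no engine types appear here, only contract events.
function reduceVoiceEvent(state: VoiceUiState, event: VoiceEvent): VoiceUiState {
  switch (event.kind) {
    case "started":
      return { ...initialVoiceUiState, phase: "active" };
    case "partialResult":
    case "finalResult":
      return { ...state, caption: event.text };
    case "chunkReady":
      return { ...state, chunksReady: state.chunksReady + 1 };
    case "firstAudio":
      return { ...state, firstAudioMs: event.elapsedMs };
    case "completed":
      return { ...state, phase: "done" };
    case "cancelled":
      return { ...state, phase: "cancelled" };
    case "failed":
      return { ...state, phase: "error", errorCode: event.code };
  }
}
```

Since the reducer sees only contract events, a sherpa adapter and a MOSS adapter can drive the same widget tree.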

Release gates should install and exercise voice packs, not only check that files exist. A minimal gate should download or locate the pack, verify SHA-256, expand it, validate internal references, initialize the runtime, run one STT fixture where available, synthesize one character greeting, and record latency plus memory. For TTS, time to first audio and total synthesis time per 100 characters are more useful than a single pass/fail result. For STT, endpoint latency and final transcript accuracy on short chat utterances should be tracked.
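The TTS measurements reduce to simple arithmetic that a gate script can evaluate. In the sketch below the type names are hypothetical and the thresholds are caller-supplied placeholders, not measured Realbits targets.

```typescript
// One synthesized fixture as recorded by a gate run.
interface TtsRun {
  textLength: number;    // characters in the synthesized text
  firstAudioMs: number;  // time to first audio
  totalMs: number;       // total synthesis time
}

interface TtsGateThresholds {
  maxFirstAudioMs: number;
  maxMsPer100Chars: number;
}

interface TtsGateResult {
  pass: boolean;
  msPer100Chars: number;
  reasons: string[];
}

// Normalize total synthesis time to a 100-character reference length.
function msPer100Chars(run: TtsRun): number {
  if (run.textLength <= 0) throw new Error("textLength must be positive");
  return (run.totalMs / run.textLength) * 100;
}

function evaluateTtsGate(run: TtsRun, limits: TtsGateThresholds): TtsGateResult {
  const normalized = msPer100Chars(run);
  const reasons: string[] = [];
  if (run.firstAudioMs > limits.maxFirstAudioMs) reasons.push("time to first audio too high");
  if (normalized > limits.maxMsPer100Chars) reasons.push("synthesis time per 100 characters too high");
  return { pass: reasons.length === 0, msPer100Chars: normalized, reasons };
}
```

For example, a 50-character greeting synthesized in 400 ms normalizes to 800 ms per 100 characters, which makes short and long fixtures comparable across devices.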

The repository should also keep sherpa and MOSS adapters side by side until Realbits has device evidence. sherpa-onnx remains the practical baseline for local speech [S1][S2]. MOSS-TTS-Nano may become the higher-quality or more controllable TTS path, especially with the Realbits INT8 ONNX variant [S8]. The manifest boundary lets both coexist while the provider decides through measurements rather than enthusiasm.

The larger payoff is cross-app reuse. Once voice packs are validated assets, every Realbits app can consume the same local STT and TTS capability with app-specific UI and character context. That keeps voice aligned with the Realbits provider thesis: the durable unit is not a single chat screen, but a reusable local runtime pack that can be shipped, measured, replaced, and trusted across the app slate.

References

Source Ledger