Local Voice Model Packs for Speech Recognition and Text-to-Speech in Mobile Chat
A technical analysis of how Realbits can package local STT and TTS runtimes as portable model packs for mobile chat while balancing latency, bundle size, and cross-platform runtime differences.
Abstract
Realbits is converging on a provider model in which model packs are the reusable product unit, and voice is part of that same contract. The repository excerpts reviewed here show a concrete shape already forming: a shared manifest service in packages/web exposes distinct gemma, stt, and tts packs, while the chat-hub documentation describes a TTS migration from a sherpa-specific runtime toward a MOSS-based ONNX backend. Taken together, these two facts mean that local speech in Realbits is not just an inference problem; it is a packaging problem, a runtime-compatibility problem, and a product-surface problem.
The core architectural question is therefore not merely which speech model is best. It is whether Realbits can define a stable voice-pack contract that survives backend changes across Android, iOS, macOS, and web. Current speech research suggests that the answer is yes, but only if the pack boundary is separated from the engine boundary. Large ASR models such as Whisper remain strong quality baselines, while newer edge-oriented ASR systems such as Moonshine argue for lower-compute specialization on device [S1][S2]. On the TTS side, efficient designs such as Matcha-TTS and newer tokenizer-based systems such as MOSS-TTS show two different routes toward fast local synthesis [S3][S4]. The engineering implication for Realbits is that model-pack architecture should treat STT and TTS as independently versioned assets with explicit runtime metadata, not as opaque blobs coupled to one inference library.
Realbits Context
Based on the repository excerpts reviewed here, Realbits already treats on-device inference as a product principle rather than an optimization. The provider vision explicitly places LLM, STT, and TTS packs in one shared manifest and download pipeline, and packages/web/src/lib/mobile-voice-models.ts exposes a concrete manifest shape with pack IDs, versions, bytes, checksums, install paths, free-space thresholds, and renewable download URLs. That is a meaningful design decision: it turns local speech into a downloadable asset lifecycle rather than an application-binary lifecycle.
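To make the manifest fields listed above concrete, the following is a minimal TypeScript sketch of what such a pack description and a pre-download check could look like. It is illustrative only: the type and field names (VoicePackManifest, minFreeBytes, urlExpiresAt, preflightDownload) are assumptions chosen for readability and are not copied from mobile-voice-models.ts.

```typescript
// Hypothetical manifest shape. Field names are illustrative assumptions,
// not the actual types from packages/web/src/lib/mobile-voice-models.ts.
type VoicePackKind = "gemma" | "stt" | "tts";

interface VoicePackFile {
  relativePath: string; // path under the pack's install directory
  bytes: number;        // expected size on disk
  sha256: string;       // hex digest of the file contents
  downloadUrl: string;  // short-lived URL, renewed by the manifest service
}

interface VoicePackManifest {
  packId: VoicePackKind;
  version: string;
  installPath: string;
  minFreeBytes: number; // free-space threshold required before download
  urlExpiresAt: number; // epoch milliseconds after which URLs need renewal
  files: VoicePackFile[];
}

// Total download size is the sum of the per-file byte counts.
function totalPackBytes(manifest: VoicePackManifest): number {
  return manifest.files.reduce((sum, f) => sum + f.bytes, 0);
}

type PreflightResult = "ready" | "insufficient-space" | "urls-expired";

// Decide whether a download can start, given device free space and the clock.
function preflightDownload(
  manifest: VoicePackManifest,
  freeBytes: number,
  nowMs: number,
): PreflightResult {
  // Require whichever is larger: the declared threshold or the raw payload.
  const required = Math.max(manifest.minFreeBytes, totalPackBytes(manifest));
  if (freeBytes < required) return "insufficient-space";
  if (nowMs >= manifest.urlExpiresAt) return "urls-expired";
  return "ready";
}
```

Keeping the check a pure function of the manifest, the free space, and the clock means the same logic can run on any platform before any bytes are fetched.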
The same excerpts also show why this matters operationally. TTS is not stable at the backend layer. The chat-hub implementation plan describes a migration away from the current sherpa-onnx TTS path toward MOSS-TTS-Nano, while keeping one logical tts pack across Android, macOS, and web. In other words, Realbits already has a live example of the exact problem this article studies: the public pack name should remain stable even when the internal runtime graph, tokenizer, and codec pipeline change.
That is a sound direction. sherpa-onnx is a strong current fit because it is explicitly local-first, supports offline recognition without internet access, and spans desktop and mobile targets [S7]. It also exposes a wide range of TTS families behind one toolkit [S8]. But the same breadth creates pressure to avoid overfitting the pack contract to sherpa-specific configuration. Once a repo needs to support a second backend with different graph composition and tokenizer requirements, the pack manifest becomes the abstraction layer that decides whether the migration is incremental or disruptive.
Related Work
For STT, Whisper remains the most useful reference baseline because it demonstrates how far a large weakly supervised model can go in multilingual robustness and zero-shot transfer [S1]. In practice, however, Whisper also represents the tension Realbits must manage: strong recognition quality often arrives with heavier compute and memory costs than mobile chat apps can comfortably hide. That tradeoff is why smaller edge-focused models matter, even if they do not dominate benchmarks.
Moonshine is relevant here because it targets live transcription and voice commands directly, and reports a significant compute reduction relative to Whisper tiny-en at similar error rates in its stated evaluation [S2]. The practical lesson is not that Realbits should immediately replace one STT model with Moonshine. The lesson is that mobile speech stacks benefit from task-shaped model packs. A flagship chat surface may tolerate a broader transcription model, while a push-to-talk command path may want a smaller, lower-latency pack.
For TTS, Matcha-TTS is a useful reference because it shows a route to lower-memory, fast synthesis through a non-autoregressive design [S3]. By contrast, the MOSS-TTS line takes a tokenizer-plus-autoregressive-generator approach with long-context and control-oriented deployment goals [S4]. Realbits' current migration plan aligns more naturally with the second family, because the provided implementation notes emphasize ONNX sessions, codec graphs, tokenizer assets, chunked long-text synthesis, and browser portability. The official MOSS-TTS-Nano repository further strengthens that fit by documenting an ONNX CPU path, a browser-facing deployment story, and streaming decode, and by claiming built-in multilingual support [S11].
The broader pattern across these systems is clear: modern local speech is less about one universal engine and more about choosing where to freeze complexity. Whisper freezes complexity into a large general ASR model [S1]. Moonshine moves some of that complexity into model specialization for edge constraints [S2]. Matcha-TTS reduces inference burden through architectural efficiency [S3]. MOSS-TTS shifts deployment focus toward tokenizer-based generation plus lighter ONNX CPU execution [S4][S11]. Realbits should respond by freezing complexity at the pack interface, not at the backend implementation.
Architecture Analysis
The strongest part of the Realbits approach is the explicit manifest layer. The supplied code snippet already includes versioning, bytes, SHA-256 hashes, install paths, locale, engine, and renewable URLs. That is the right primitive for local voice packs because it supports three operational needs at once: integrity verification, resumable distribution, and runtime dispatch.
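Of the three needs named above, resumable distribution is the easiest to picture in code. The sketch below assumes the download host honors standard HTTP Range requests, which the source does not confirm; the names planResume and ResumeAction are hypothetical. It decides, per file, whether to skip, restart, or resume based on the byte count in the manifest and the bytes already on disk.

```typescript
// Hypothetical resume planner. Assumes the server supports HTTP Range
// requests; that is an assumption, not something the manifest guarantees.
type ResumeAction =
  | { kind: "skip" }                         // file is already full length
  | { kind: "restart" }                      // nothing usable on disk
  | { kind: "resume"; rangeHeader: string }; // continue from a byte offset

function planResume(expectedBytes: number, bytesOnDisk: number): ResumeAction {
  if (bytesOnDisk === expectedBytes) return { kind: "skip" };
  // An empty or oversized partial file cannot be trusted, so start over.
  if (bytesOnDisk <= 0 || bytesOnDisk > expectedBytes) return { kind: "restart" };
  // Open-ended range: ask for everything from the first missing byte onward.
  return { kind: "resume", rangeHeader: `bytes=${bytesOnDisk}-` };
}
```

A "skip" result only says the length matches. The checksum in the manifest would still need to be verified before the file is treated as installed.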
First, integrity is not optional for speech assets. STT and TTS packs are large, often multi-file, and expensive to redownload. Checksums and version identifiers should remain mandatory because they let the app distinguish a corrupt pack from an outdated one before a user ever opens a chat session. This is especially important once TTS stops being a single-model file and becomes a collection of tokenizer, decoder, and codec artifacts.
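The distinction between a corrupt pack and an outdated one can be sketched as a small classification function. This is a minimal illustration under stated assumptions: the digests are assumed to have been computed elsewhere (for example by a platform hashing API), and the names ExpectedPack, InstalledPack, and checkPackHealth are hypothetical rather than taken from the repository.

```typescript
// Hypothetical types: what the manifest expects versus what is on disk.
interface ExpectedFile {
  relativePath: string;
  sha256: string; // hex digest published in the manifest
}

interface ExpectedPack {
  version: string;
  files: ExpectedFile[];
}

interface InstalledPack {
  version: string;
  // relativePath -> SHA-256 hex digest computed from the file on disk.
  // Computing the digest is out of scope for this sketch.
  digests: Record<string, string>;
}

type PackHealth = "ok" | "outdated" | "corrupt";

function checkPackHealth(expected: ExpectedPack, installed: InstalledPack): PackHealth {
  // A version mismatch means the pack is stale; its digests are not
  // comparable to the current manifest, so report that first.
  if (installed.version !== expected.version) return "outdated";
  for (const file of expected.files) {
    const actual: string | undefined = installed.digests[file.relativePath];
    // A missing file or a digest mismatch both count as corruption.
    if (actual === undefined || actual.toLowerCase() !== file.sha256.toLowerCase()) {
      return "corrupt";
    }
  }
  return "ok";
}
```

Because every file is checked, the same function covers a multi-file TTS pack made of tokenizer, decoder, and codec artifacts without special cases.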
Second, runtime dispatch should be metadata-driven. The sherpa stack demonstrates how much can be hidden behind one toolkit [S7][S8], but the MOSS plan shows the boundary where that convenience stops. ONNX Runtime mobile guidance explicitly recommends aggressive size management, including quantization and custom runtime builds when mobile footprint matters [S5]. A size-reduced runtime may not support every model, so the app cannot assume that any installed pack will load on any installed engine. That means a pack manifest should carry enough capability data to let the app decide, without hardcoded assumptions, whether a pack expects sherpa, plain ONNX Runtime, additional tokenizer extensions, or browser-specific loading behavior.
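A metadata-driven dispatch step could look roughly like the sketch below. Everything in it is an assumption made for illustration: the engine identifiers, the capability flags, and the dispatchPack function are hypothetical and do not reflect an existing Realbits API. The point is only that the decision is computed from declared pack needs and declared host capabilities, not from hardcoded pack names.

```typescript
// Hypothetical engine identifiers and capability flags.
type EngineId = "sherpa-onnx" | "onnxruntime";

interface PackRuntimeNeeds {
  engine: EngineId;
  needsTokenizerExtension: boolean;
  usesExternalData: boolean;
}

interface HostRuntime {
  engines: EngineId[];
  tokenizerExtensionAvailable: boolean;
  externalDataSupported: boolean;
}

type DispatchDecision =
  | { kind: "run"; engine: EngineId }
  | { kind: "reject"; reasons: string[] };

// Compare what the pack declares against what the host declares.
function dispatchPack(pack: PackRuntimeNeeds, host: HostRuntime): DispatchDecision {
  const reasons: string[] = [];
  if (host.engines.indexOf(pack.engine) === -1) {
    reasons.push(`engine ${pack.engine} is not available on this host`);
  }
  if (pack.needsTokenizerExtension && !host.tokenizerExtensionAvailable) {
    reasons.push("tokenizer extension is required but not available");
  }
  if (pack.usesExternalData && !host.externalDataSupported) {
    reasons.push("external-data loading is required but not supported");
  }
  if (reasons.length > 0) return { kind: "reject", reasons };
  return { kind: "run", engine: pack.engine };
}
```

Returning the full list of reasons, rather than failing on the first mismatch, gives the app enough detail to explain why a pack cannot run on a given device.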
Third, web parity is a first-class design constraint, not a nice-to-have. ONNX Runtime Web documents a specific problem with large models and external data: browser code cannot rely on normal filesystem access the way native runtimes can [S6]. That matters directly for any Realbits plan that wants one logical TTS pack on macOS and web. A manifest that is sufficient for Android file paths may still be insufficient for browser hydration, streamed fetch, or external-data coordination. The provided MOSS plan appears to recognize this by calling out dedicated web runtime work rather than assuming native behavior will transfer automatically.
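One way to picture the extra information a browser needs is a manifest entry that lists a graph's external weight files by URL, since the browser cannot resolve them from a local path. The sketch below converts such an entry into a plain load plan. The types and function names are hypothetical, and the plan's externalData shape (a path paired with a data URL) is only modeled loosely on the session option described in the ONNX Runtime Web large-models guidance [S6]; the exact option shape should be confirmed against that documentation before use.

```typescript
// Hypothetical manifest entry for one ONNX graph served to the browser.
interface WebGraphAsset {
  graphUrl: string; // URL of the .onnx graph file
  // External weight files the graph refers to, keyed by the relative
  // path recorded inside the graph.
  externalData: { path: string; url: string }[];
}

// Plain descriptor of what the web runtime layer would need to load.
interface WebSessionPlan {
  modelUrl: string;
  externalData: { path: string; data: string }[];
}

function toWebSessionPlan(asset: WebGraphAsset): WebSessionPlan {
  return {
    modelUrl: asset.graphUrl,
    // Each external file is handed over as a URL to fetch, because the
    // browser has no filesystem path to resolve it from.
    externalData: asset.externalData.map((f) => ({ path: f.path, data: f.url })),
  };
}

// Every URL that must be fetched (or cached) before a session can start.
function listFetchUrls(plan: WebSessionPlan): string[] {
  return [plan.modelUrl].concat(plan.externalData.map((f) => f.data));
}
```

On native platforms the same manifest entry could be reduced to an install directory, which is why the web-specific fields belong in the manifest rather than in platform code.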
The distribution layer also deserves attention. Android and Apple both offer platform-native asset-pack mechanisms, but they are not a complete answer for Realbits. Play Asset Delivery is built around asset packs with flexible download timing, yet it is explicitly an asset system rather than an executable-code distribution path [S9]. Apple's on-demand resources system offers large hosted packs on iOS, but Apple now labels it legacy and notes that macOS does not support it [S10]. From those sources, the reasonable inference is that a custom cross-platform manifest service is the safer architectural center for Realbits than relying on per-platform pack semantics. Platform delivery can still be used tactically, but it should not define the provider contract.
This is also where the TTS migration becomes strategically useful. The official MOSS-TTS-Nano repository claims a standalone ONNX CPU path, streaming decode, long-text handling, and lightweight local deployment characteristics [S11]. Even if those claims need independent benchmarking in Realbits, they point toward a cleaner separation between pack and engine. If the tts pack can describe tokenizer assets, graph files, voice aliases, and chunking limits, then Realbits can swap the backend without changing the product concept of a downloadable voice pack.
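If the tts pack declares a chunking limit, the app can split long replies before synthesis without knowing which backend is installed. The sketch below is a simplified illustration: it assumes the limit is expressed in characters (a real backend may count tokens instead), and chunkForSynthesis is a hypothetical helper, not part of MOSS-TTS-Nano or sherpa-onnx.

```typescript
// Hypothetical helper: split text into chunks no longer than maxChars,
// preferring sentence boundaries. Assumes a character-based limit.
function chunkForSynthesis(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive");

  // Runs of text followed by their closing punctuation, e.g. "How are you?".
  const sentences = (text.match(/[^.!?]+[.!?]*/g) || [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    let rest = sentence;
    // A single sentence longer than the limit is hard-split.
    while (rest.length > maxChars) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    if (rest.length === 0) continue;
    // Pack sentences together while the combined chunk still fits.
    const candidate = current ? `${current} ${rest}` : rest;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      chunks.push(current);
      current = rest;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

With a limit of 20 characters, the input "Hello there. How are you? Fine." yields two chunks: "Hello there." and "How are you? Fine.".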
Limitations
There are still hard constraints. Local voice quality is not free. Whisper-class ASR quality often pushes compute and battery budgets upward [S1]. Smaller models improve latency, but usually through specialization, narrower scope, or tradeoffs hidden outside benchmark summaries [S2]. TTS has a similar split: efficient architectures help, yet expressive synthesis, multilingual coverage, and low time-to-first-audio rarely improve together by accident [S3][S4].
There are also packaging limits. ONNX Runtime's own documentation makes clear that mobile size reduction often requires model-specific custom builds rather than generic binaries [S5]. On the web, external-data handling remains more awkward than on native platforms [S6]. And while the MOSS repository presents encouraging deployment claims, those are implementation claims from the project itself, not neutral cross-device measurements [S11]. Realbits would still need device-level evidence for memory spikes, thermal throttling, startup latency, and long-session stability.
Finally, one logical tts pack can hide meaningful internal diversity. A pack that works well for English short replies may not be the right pack for long-form narration, multilingual code-switching, or future voice-cloning features. The abstraction is useful, but it should not obscure the need for runtime capabilities, locale coverage, and quality tiers inside the manifest.
Implications for This Repository
The immediate implication is that Realbits should keep treating stt and tts as provider assets, but it should strengthen the manifest so that the engine can change without forcing app-level rewrites. In practice, that means adding explicit capability fields for tokenizer requirements, streaming support, external-data requirements, voice inventory, and web compatibility. The repository excerpts suggest this process has already started for TTS; it should become the default pattern for all speech packs.
A second implication is measurement discipline. If the pack is the product unit, then performance should also be measured at the pack boundary: cold download time, install success rate, first transcription latency, first audio latency, sustained decode rate, battery cost, and storage churn after updates. Those metrics matter more to a provider architecture than isolated benchmark numbers from a model paper.
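Measuring at the pack boundary implies that every sample is tagged with a pack ID and version, so results can be compared across pack releases. The sketch below shows one possible shape for such samples and a small summary over them. The record fields and function names are assumptions for illustration, and only two of the metrics listed above (install success rate and first audio latency) are covered.

```typescript
// Hypothetical per-session sample, tagged with the pack that produced it.
interface PackSessionSample {
  packId: string;
  packVersion: string;
  installSucceeded: boolean;
  firstAudioLatencyMs?: number; // absent when no audio was produced
}

interface PackSummary {
  sampleCount: number;
  installSuccessRate: number | undefined;
  medianFirstAudioLatencyMs: number | undefined;
}

// Median of a list of numbers; undefined for an empty list.
function median(values: number[]): number | undefined {
  if (values.length === 0) return undefined;
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Summarize only the samples that belong to one pack ID and version.
function summarizePack(
  samples: PackSessionSample[],
  packId: string,
  packVersion: string,
): PackSummary {
  const own = samples.filter((s) => s.packId === packId && s.packVersion === packVersion);
  const succeeded = own.filter((s) => s.installSucceeded).length;
  const latencies: number[] = [];
  for (const s of own) {
    if (s.firstAudioLatencyMs !== undefined) latencies.push(s.firstAudioLatencyMs);
  }
  return {
    sampleCount: own.length,
    installSuccessRate: own.length === 0 ? undefined : succeeded / own.length,
    medianFirstAudioLatencyMs: median(latencies),
  };
}
```

A median is used here instead of a mean so that a few very slow cold starts do not dominate the summary. That is a design choice of this sketch, not a requirement from the source.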
A third implication is product modularity. Realbits does not need one universal STT pack and one universal TTS pack forever. The better long-term architecture is probably a stable top-level contract with multiple pack families underneath it: low-latency command STT, conversational STT, lightweight built-in-voice TTS, and richer narration TTS. The manifest format shown in packages/web is already close to supporting that future.
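A stable top-level contract with several pack families underneath could be expressed as a catalog lookup. The sketch below is hypothetical throughout: the family names mirror the four examples in the paragraph above, while the catalog fields, the locale matching rule, and the "smallest pack wins" tie-break are assumptions made for illustration.

```typescript
// Family names follow the examples in the text; everything else is assumed.
type PackFamily =
  | "stt-command"
  | "stt-conversational"
  | "tts-lightweight"
  | "tts-narration";

interface PackCatalogEntry {
  packId: string;
  family: PackFamily;
  locales: string[]; // e.g. "en-US", "ko-KR"
  bytes: number;
}

// Pick a pack for a family and locale: exact locale match first, then a
// language-only match, preferring the smallest download in either case.
function pickPack(
  catalog: PackCatalogEntry[],
  family: PackFamily,
  locale: string,
): PackCatalogEntry | undefined {
  const language = locale.split("-")[0];
  const inFamily = catalog.filter((e) => e.family === family);
  const exact = inFamily.filter((e) => e.locales.indexOf(locale) !== -1);
  const sameLanguage = inFamily.filter((e) =>
    e.locales.some((l) => l.split("-")[0] === language),
  );
  const pool = exact.length > 0 ? exact : sameLanguage;
  return pool.slice().sort((a, b) => a.bytes - b.bytes)[0];
}
```

Callers ask for a family and a locale and never name a specific model, so a pack can be replaced inside its family without touching the chat surface.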
The larger conclusion is straightforward. In Realbits, local speech should be designed as a portable asset contract backed by replaceable runtimes. That is the architecture that best matches the repository's current direction, the realities of ONNX-based deployment, and the state of modern local speech systems [S5][S6][S7][S11].
References
- S1: https://arxiv.org/abs/2212.04356
- S2: https://arxiv.org/abs/2410.15608
- S3: https://arxiv.org/abs/2309.03199
- S4: https://arxiv.org/abs/2603.18090
- S5: https://onnxruntime.ai/docs/tutorials/mobile/
- S6: https://onnxruntime.ai/docs/tutorials/web/large-models.html
- S7: https://k2-fsa.github.io/sherpa/onnx/index.html
- S8: https://k2-fsa.github.io/sherpa/onnx/tts/index.html
- S9: https://developer.android.com/guide/playcore/asset-delivery
- S10: https://developer.apple.com/help/app-store-connect/reference/app-uploads/on-demand-resources-size-limits
- S11: https://github.com/OpenMOSS/MOSS-TTS-Nano
Source Ledger
- [S1] Robust Speech Recognition via Large-Scale Weak Supervision (arxiv): https://arxiv.org/abs/2212.04356 - Baseline ASR paper for Whisper, useful as the large-model reference point for offline speech recognition tradeoffs.
- [S2] Moonshine: Speech Recognition for Live Transcription and Voice Commands (arxiv): https://arxiv.org/abs/2410.15608 - Edge-oriented ASR paper showing why smaller STT models matter for real-time and resource-constrained devices.
- [S3] Matcha-TTS: A fast TTS architecture with conditional flow matching (arxiv): https://arxiv.org/abs/2309.03199 - Fast, compact TTS architecture reference for comparing latency and memory goals in local voice synthesis.
- [S4] MOSS-TTS Technical Report (arxiv): https://arxiv.org/abs/2603.18090 - Recent TTS foundation-model report relevant to Realbits' planned move toward MOSS-based local synthesis.
- [S5] ONNX Runtime - Deploy on mobile (official-doc): https://onnxruntime.ai/docs/tutorials/mobile/ - Documents mobile deployment, quantization, and custom-runtime size reduction for ONNX-based inference.
- [S6] ONNX Runtime Web - Working with Large Models (official-doc): https://onnxruntime.ai/docs/tutorials/web/large-models.html - Explains browser constraints around large ONNX models and external data, directly relevant to web parity for model packs.
- [S7] sherpa-onnx documentation overview (official-doc): https://k2-fsa.github.io/sherpa/onnx/index.html - Primary documentation for the offline speech stack Realbits already uses for on-device speech processing.
- [S8] sherpa-onnx Text-to-speech documentation (official-doc): https://k2-fsa.github.io/sherpa/onnx/tts/index.html - Shows the breadth of TTS backends and packaging assumptions in the current sherpa-based lane.
- [S9] Android Developers - Play Asset Delivery (official-doc): https://developer.android.com/guide/playcore/asset-delivery - Official Android guidance for shipping large asset packs and understanding platform-native delivery tradeoffs.
- [S10] Apple Developer - On-demand resources size limits (official-doc): https://developer.apple.com/help/app-store-connect/reference/app-uploads/on-demand-resources-size-limits - Documents current Apple resource-pack limits and notes about legacy status and macOS support constraints.
- [S11] OpenMOSS MOSS-TTS-Nano official repository (official-doc): https://github.com/OpenMOSS/MOSS-TTS-Nano - Official implementation notes for the MOSS-TTS-Nano ONNX CPU path, streaming behavior, language support, and deployment claims.