May 20, 20269 sources3 arXiv

Character Cards as Testable Runtime Inputs

A technical analysis of reusable character-card schemas and persona-evaluation gates for Realbits' Gemma-backed chat surfaces.

character-cardsevaluationpersonagemma

Abstract

Reusable character cards are not just content records. In Realbits, they should become testable runtime inputs: structured objects that produce deterministic prompts, drive Gemma-backed conversations, and enter a release gate before they reach app users. The current repository direction already points there: the web creator surface owns character creation, the app slate consumes provider inventory, and tactics C02 and C09 tie card schema work to persona consistency and eval-gated releases. The architectural question is whether a character card is merely a portable prompt container or a small behavioral contract. This article argues for the latter. A Realbits card should have a schema version, validation rules, prompt-rendering semantics, safety constraints, and an eval artifact that records how the card behaves with a chosen model pack. That framing aligns with role-playing research, which treats character behavior as a measurable ability rather than an aesthetic preference [S1][S2][S3].

Realbits Context

Realbits is a provider stack for on-device models, reusable character cards, and themed chat apps. The important constraint is cross-app reuse. A character created in packages/web cannot be tuned only for a single UI surface, because the same asset may later run in Realbits Chat Hub, Character Chat, Daily Wisdom Guide, or Situational English Coach. That makes the card schema a compatibility layer between creator tooling, catalog storage, entitlement checks, prompt construction, and the on-device runtime.

The repository context describes a v1 card with fields such as name, avatar reference, personaPrompt, greeting, scenario, exampleDialogue, tags, voiceId, creator wallet, and optional NFT token id. Those fields are not all equal. Some are presentation metadata, some are marketplace metadata, and some affect generation behavior. The engineering failure mode is mixing those concerns until every app has its own prompt builder. C02 correctly identifies prompt drift as a product bug: if each consumer formats the same character differently, persona consistency becomes untestable.

Gemma and LiteRT make the issue more concrete. Google documents Gemma as a family where deployment choice depends on hardware, model variant, framework, and prompt formatting; instruction-tuned variants are intended to respond to natural-language instructions, but formatting requirements still matter [S4]. LiteRT-LM then adds runtime concerns such as stateful inference, prompt caching, scoring, and KV-cache management [S5]. A Realbits card therefore should not be treated as arbitrary text appended to a chat. It is input to a renderer that must target a specific model family, runtime, context budget, and safety policy.

Related Work

The character-card ecosystem already has a useful precedent: the Character Card V2 specification separates core fields such as name, description, scenario, first message, example dialogue, tags, creator, system prompt, post-history instructions, alternate greetings, and extensions [S7]. Realbits should not copy that format blindly, because Realbits has provider-specific needs around SFW policy, wallet-backed inventory, voice ids, mobile model packs, and creator publishing. Still, the spec proves that portable character definitions benefit from explicit fields rather than a single prompt blob.

Research on role-playing LLMs points in the same direction. RoleLLM builds role profiles and role-specific data, then evaluates role-playing ability with a benchmark rather than relying only on manual inspection [S1]. RPEval narrows evaluation to dimensions that are close to product risk for Realbits: in-character consistency, emotional understanding, decision-making, and moral alignment [S2]. Persona-aware contrastive learning work further suggests that persona consistency is an alignment problem, not only a prompt-writing problem; models can drift unless the training, prompting, or evaluation procedure rewards role-specific behavior [S3].

Developer tooling is also moving toward repeatable evals. GitHub Models documents evaluation workflows with stored prompt configurations, test inputs, structured metrics, and command-line execution for CI [S8]. That matters because C09 should not be a dashboard-only concept. A gate only has architectural force when it can fail a publish action, release script, or scheduled validation run.

Architecture Analysis

The clean boundary is a three-step pipeline: validate the card, render the runtime prompt, then evaluate generated behavior. Validation should answer whether the card is structurally acceptable. Rendering should answer how the card becomes model input. Evaluation should answer whether the rendered card and selected model pack meet Realbits' behavioral thresholds.

Validation is the easiest to make deterministic. A JSON Schema Draft 2020-12 contract can express required fields, enum values, minimum text lengths, array limits, URL formats, extension namespaces, and migration rules [S6]. For Realbits, schema validation should be split into at least three layers. The base card schema checks portable identity and prompt fields. A publish schema checks creator, moderation, catalog, and ownership metadata. A runtime schema checks the subset that can be safely loaded offline by Flutter apps. This prevents mobile clients from inheriting web-only assumptions.

Prompt rendering is where many systems lose reproducibility. The renderer should be a pure function over card, model profile, app profile, and policy profile. It should not read UI state or hidden application text. For example, the same card might render differently for a Gemma web preview and a LiteRT-LM mobile package, but those differences should be declared in model profiles, not scattered through app code. Gemma documentation explicitly warns that instruction-tuned variants have model-specific formatting requirements [S4]. That makes the prompt builder a compatibility adapter, not just a string template.

A practical Realbits prompt format can keep stable sections: safety preamble, character identity, scenario anchor, dialogue examples, first-turn greeting, app-specific task constraints, and response-style constraints. The order matters. Example dialogue should be bounded so it cannot crowd out safety and current conversation context. Scenario should be treated as a setting anchor, not as a second persona. creatorAddress and nftTokenId should not enter the prompt unless there is a direct user-facing reason; they are provenance and entitlement metadata, not behavior.

Evaluation gates should test the card-renderer-model combination, not the card in isolation. A card that works on a larger cloud model may fail on a quantized on-device Gemma variant because the context budget, instruction following, and decoding behavior differ. LiteRT-LM's runtime features around prompt caching and stateful inference are useful, but they also mean the evaluation harness should run against the same packaging path that production uses [S5]. A browser-only or server-only judge can catch gross prompt errors, but it cannot prove mobile runtime parity.

The gate itself should have several outputs. First, a structural result: schema valid, moderation fields present, no forbidden placeholders, and no unsupported extension keys. Second, a behavioral result: persona consistency, setting fidelity, refusal behavior, empathy, and unsafe-content handling. Third, a regression result: current scores versus the last accepted card or model-pack baseline. Fourth, an artifact: card version, renderer version, model pack id, decoding settings, judge rubric, test prompts, scores, and failure examples. Without this artifact, a pass or fail result is hard to debug.

The strongest design choice is per-character gating. A release should fail if one shipped character regresses below threshold, even when aggregate quality looks acceptable. Character apps are inventory businesses: users select a specific persona, not an average catalog. This also helps creators. A creator studio can show actionable failures before publish, while the release pipeline can protect the public catalog.

Limitations

Persona evaluation remains partly subjective. RPEval and related research make the dimensions clearer, but automated judges can still reward fluent stereotypes or penalize creative but valid responses [S2]. LLM-as-judge workflows are useful for scale, yet they need calibrated rubrics, adversarial cases, and spot human review. This is especially true for relationship-oriented chat, where emotional tone and boundary setting are part of safety rather than polish.

Schemas also underfit behavior. JSON Schema can prove that personaPrompt exists and has reasonable length; it cannot prove that the persona is coherent, SFW, or suitable for a smaller local model [S6]. The evaluation layer must cover those semantic properties. Conversely, the eval layer should not become the only safety boundary. Bad content should be rejected before it reaches prompt rendering.

On-device inference adds another limitation. Gemma variants and LiteRT packaging choices create a quality-latency-memory tradeoff [S4][S5]. If Realbits evaluates only against a high-capability desktop model, it may ship cards that degrade on older phones. If it evaluates only against the smallest mobile model, it may block cards that would work well on stronger devices. The better answer is tiered certification by model pack.

Finally, evaluation gates are governance tools, not guarantees. NIST's AI Risk Management Framework emphasizes measuring and managing AI risks throughout the lifecycle, which implies documentation, monitoring, and post-deployment feedback as well as pre-release checks [S9]. Realbits still needs telemetry for user corrections, drift reports, refusal complaints, and creator-side updates.

Implications for This Repository

The repository should treat the character card as a versioned contract shared by web, app, and eval code. packages/web can remain the canonical creator and publish surface, but the schema should live where Flutter and evaluation tooling can consume it without copy-paste. If TypeScript owns the authoring type, generate JSON Schema from it or validate both against a shared fixture suite. The key is to prevent the web schema, prompt builder, and mobile loader from becoming three informal standards.

The prompt builder should be covered by snapshot-style tests that assert stable rendering for representative cards. Those tests should include empty optional fields, long example dialogue, missing voice ids, wallet-owned premium cards, and app-specific constraints. More importantly, prompt snapshots should include the model profile. A prompt that is correct for a web Gemma preview may not be correct for a .litertlm mobile runtime.

C09 should become a publish and release gate with explicit thresholds. Suggested first gates are structural validity, persona consistency, setting fidelity, SFW safety, refusal quality, and regression versus main. The gate should write an artifact that can be attached to creator review and CI output. GitHub's documented model-evaluation flow is a useful pattern because it treats evals as repeatable engineering assets rather than occasional manual experiments [S8].

The practical outcome is a cleaner provider architecture. Cards become portable assets because their shape is validated. Personas become testable because prompt rendering is deterministic. Releases become safer because regressions block before store submission. Creator tooling becomes more useful because failures can be reported at authoring time. For Realbits, that is the difference between a catalog of attractive prompt files and a provider-grade inventory system for reusable on-device characters.

References

Source Ledger

[S1] RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models (arxiv): https://arxiv.org/abs/2310.00746 - Shows that role-playing quality can be decomposed into role profiles, prompting, role-specific data, and evaluation rather than treated as generic chat quality.
[S2] Role-Playing Evaluation for Large Language Models (arxiv): https://arxiv.org/abs/2505.13157 - Defines role-playing evaluation dimensions including emotional understanding, decision-making, moral alignment, and in-character consistency.
[S3] Enhancing Persona Consistency for LLMs' Role-Playing using Persona-Aware Contrastive Learning (arxiv): https://arxiv.org/abs/2503.17662 - Frames persona consistency as an alignment problem and discusses automatic and human evaluation for role consistency.
[S4] Run Gemma content generation and inferences (official-doc): https://ai.google.dev/gemma/docs/run - Official Gemma documentation on choosing variants, local and edge frameworks, instruction-tuned models, and prompt formatting requirements.
[S5] Deploy GenAI Models with LiteRT (official-doc): https://ai.google.dev/edge/litert/genai/overview - Official LiteRT documentation describing LiteRT-LM concepts such as stateful inference, KV cache management, prompt caching, and on-device acceleration.
[S6] JSON Schema Draft 2020-12 (standards): https://json-schema.org/draft/2020-12 - Provides the schema-validation foundation for treating character cards as versioned, machine-checkable data contracts.
[S7] Character Card V2 Specification (web): https://github.com/malfoyslastname/character-card-spec-v2/blob/main/spec_v2.md - Community technical specification for portable AI character cards, useful as prior art for Realbits' own reusable card model.
[S8] Evaluating AI models - GitHub Docs (official-doc): https://docs.github.com/en/github-models/use-github-models/evaluating-ai-models - Documents a practical developer workflow for structured, repeatable model evaluation and CI-friendly prompt testing.
[S9] AI Risk Management Framework - NIST (technical-report): https://www.nist.gov/itl/ai-risk-management-framework - Provides a risk-management basis for documenting, measuring, and monitoring AI system behavior across development and deployment.