May 28, 2026 | 9 sources | 3 arXiv

Schema-Observable Personas: Character Cards as Regression Fixtures for Local Chat

A technical analysis of reusable character-card schemas and persona-evaluation gates in Realbits, focusing on schema observability, local Gemma prompt binding, and release-grade regression evidence.

character-cards, evaluation, persona, gemma

Abstract

Reusable character cards look simple when treated as content: a name, an avatar, a persona prompt, a greeting, a scenario, and a few example turns. In Realbits, they are more important than that. They are the portable contract between the web creator studio, the app slate, and the local Gemma-backed conversation runtime. A character card should therefore be evaluated as a structured runtime input whose behavior can regress, not as copy that merely renders in a catalog.

This article takes a narrower angle than the earlier Realbits post on character cards: schema observability. The core question is how Realbits can make persona behavior traceable from a card version to prompt assembly, model-pack binding, and a release gate. Persona-agent research shows that role fidelity is not one property. It can be tested across role facts, preferences, behavior under varied situations, and repeated responses [S1][S2]. LLM-as-judge work suggests automated scoring can help when prompts, rubrics, and outputs are structured, though evaluator outputs still need repeatability checks and human review for risky cases [S3]. For Realbits, the implication is pragmatic: the v1 character-card schema should become a regression fixture with stable identifiers, validation rules, prompt-rendering snapshots, and per-character quality gates.

Realbits Context

Realbits is positioned as a model and app provider. The repository context makes local model packs, reusable character cards, themed apps, a web-first studio, and provider-managed ownership the durable product layer. That makes character cards cross-app assets rather than app-local preferences. The same card may be authored in packages/web, published through provider manifests, unlocked through ownership checks, and consumed by Realbits Chat Hub, Daily Wisdom Guide, Situational English Coach, or Character Chat.

The active C02 tactic defines the v1 card shape around name, avatar, personaPrompt, greeting, scenario, exampleDialogue, tags, voiceId, creatorAddress, and optional token ownership metadata. It also states the intended binding: schema-driven prompts should reduce prompt drift and target high persona consistency in the in-house evaluation harness. The C09 tactic turns that quality harness into a release gate with dimensions such as persona consistency, setting fidelity, empathy, multi-turn safety, refusal behavior, and scenario floors. Those tactics make the card schema operational: every shipped character can be loaded, rendered into a prompt, exercised through test conversations, and compared against a threshold.

Gemma matters because the runtime is local and constrained. The Gemma report describes a family of open models derived from Gemini research, while the official model card documents intended uses, limitations, and safety considerations [S4][S5]. Realbits should not assume that a larger hosted model will repair weak card structure at inference time. When local inference is the default, card quality, prompt construction, and evaluation coverage carry more of the quality burden.

Related Work

PersonaGym is useful because it treats persona agents as systems that must be evaluated across environments and tasks, not merely as chatbots with colorful biographies [S1]. That framing maps well to Realbits: a character card is a role specification that must survive different user intents, app contexts, and conversation lengths. A character that stays consistent only for the first greeting has not passed a provider-grade gate.

PICon adds a second useful lens: chained multi-turn questioning. It evaluates persona agents for internal consistency, external consistency, and retest consistency [S2]. Realbits can adapt this idea by deriving probes from card fields. personaPrompt can produce identity and preference probes, scenario can produce setting probes, and exampleDialogue can produce style-continuation probes. The point is not to interrogate every character like a courtroom witness; it is to expose contradictions before a character ships into multiple apps.
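The field-to-probe mapping described above can be sketched as follows. This is a loose illustration, not PICon's method or a Realbits implementation: the Probe type, the derive_probes function, and the probe wording are all hypothetical, and only the card field names come from the C02 shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    kind: str          # "identity", "setting", or "style"
    source_field: str  # card field the probe was derived from
    prompt: str        # user turn sent to the character


def derive_probes(card: dict) -> list:
    """Derive probes from whichever behavioral fields the card provides."""
    probes = []
    if card.get("personaPrompt"):
        probes.append(Probe("identity", "personaPrompt",
                            "Who are you, and what do you care about most?"))
        # Asking again in other words is a simple retest-style check.
        probes.append(Probe("identity", "personaPrompt",
                            "Describe yourself again, using different words."))
    if card.get("scenario"):
        probes.append(Probe("setting", "scenario",
                            "Where are we right now, and what is happening around us?"))
    for example in card.get("exampleDialogue", []):
        probes.append(Probe("style", "exampleDialogue",
                            f"Continue in your usual voice after this line: {example}"))
    return probes
```

Recording source_field on each probe keeps a failed probe traceable to the card field that produced it, which is the observability property this article argues for.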

G-Eval is relevant because Realbits' quality framework already leans toward reference-free conversation evaluation. In open-ended character chat, there is no single canonical answer. G-Eval shows that structured evaluation forms and reasoning-style criteria can improve agreement with human judgments for generated text evaluation, while also surfacing evaluator bias as a risk [S3]. Realbits should borrow the discipline, not blindly copy one implementation: explicit criteria, stable dimensions, multiple runs, and retained judge prompts matter more than any one evaluator model.

Outside research papers, JSON Schema supplies the clearest standard vocabulary for validating card documents before they become runtime inputs [S6]. SillyTavern is useful as an ecosystem precedent because character cards commonly include descriptions, scenarios, greetings, and examples that overlap with Realbits' schema [S7]. HELM is instructive because it emphasizes multi-metric evaluation instead of one leaderboard number [S8]. NIST's AI Risk Management Framework adds the governance vocabulary: its core functions are govern, map, measure, and manage, applied to AI system risks across the lifecycle [S9]. For Realbits, persona evaluation is both product quality work and provider governance.

Architecture Analysis

The architecture should separate four contracts: card schema, prompt projection, runtime execution, and evaluation evidence. Keeping these contracts separate avoids a common failure mode in character systems: the card grows into an untyped bundle of prompt fragments, UI metadata, entitlement data, and hidden policy notes. That may work for one app, but it does not scale to a provider layer.

The first contract is the card schema. JSON Schema is a good fit for the portable wire format because it can express required fields, primitive types, maximum lengths, URI formats, and schema versioning rules [S6]. Realbits should treat validation as both publish-time and load-time work. Publish-time validation protects the catalog. Load-time validation protects apps that may receive old, cached, or migrated manifests. The schema should distinguish behavioral fields from operational fields. personaPrompt, scenario, greeting, and exampleDialogue affect model behavior; avatars, tags, and creator metadata affect presentation, discovery, and ownership.
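A minimal sketch of the schema gate, written in plain Python rather than against a JSON Schema library so that it stays self-contained. The field names follow the C02 card shape; the schemaVersion field, the length limits, and the function names are illustrative assumptions, not Realbits values.

```python
from urllib.parse import urlparse

# Field split taken from the text above; the limits are assumed placeholders.
BEHAVIORAL_FIELDS = ("personaPrompt", "scenario", "greeting", "exampleDialogue")
OPERATIONAL_FIELDS = ("name", "avatar", "tags", "voiceId", "creatorAddress")
MAX_LENGTHS = {"name": 80, "personaPrompt": 4000, "greeting": 1000, "scenario": 2000}


def validate_card(card: dict) -> list:
    """Return human-readable validation errors; an empty list means valid."""
    errors = []
    for field in ("schemaVersion", "name", "personaPrompt", "greeting"):
        if not isinstance(card.get(field), str) or not card[field].strip():
            errors.append(f"{field}: required non-empty string")
    for field, limit in MAX_LENGTHS.items():
        value = card.get(field)
        if isinstance(value, str) and len(value) > limit:
            errors.append(f"{field}: longer than {limit} characters")
    examples = card.get("exampleDialogue", [])
    if not isinstance(examples, list) or not all(isinstance(e, str) for e in examples):
        errors.append("exampleDialogue: must be a list of strings")
    avatar = card.get("avatar")
    if avatar is not None and (not isinstance(avatar, str) or not urlparse(avatar).scheme):
        errors.append("avatar: must be a URI with a scheme")
    return errors


def behavioral_view(card: dict) -> dict:
    """Only the fields that influence model behavior."""
    return {k: card[k] for k in BEHAVIORAL_FIELDS if k in card}
```

The same validate_card function could run at publish time and at load time. behavioral_view gives the prompt builder a narrow input, so presentation or ownership metadata cannot leak into the prompt by accident.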

The second contract is prompt projection. A character card should not be concatenated into a system prompt independently by every app. Instead, a deterministic prompt builder should transform a card plus app context plus model-pack capabilities into a prompt transcript or runtime instruction bundle. Determinism matters because the evaluation harness needs to know exactly what was tested. A card version should produce a stable prompt-rendering snapshot for a given prompt-template version. If the template changes, old cards can be re-evaluated without pretending that only the model changed.
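One way to make that determinism checkable is to render the prompt in a fixed field order and hash the result together with the template version. The sketch below assumes a hypothetical template label ("pt-1") and a deliberately simple prompt layout; neither reflects an actual Realbits or Gemma prompt format.

```python
import hashlib
import json

PROMPT_TEMPLATE_VERSION = "pt-1"  # hypothetical version label


def render_prompt(card: dict, app_profile: str) -> str:
    """Project a card into a system prompt using a fixed field order."""
    parts = [
        f"You are {card['name']}.",
        card["personaPrompt"].strip(),
        f"Scenario: {card.get('scenario', '').strip()}",
        f"App profile: {app_profile}",
    ]
    for i, example in enumerate(card.get("exampleDialogue", []), start=1):
        parts.append(f"Example {i}: {example.strip()}")
    return "\n".join(parts)


def snapshot_id(card: dict, app_profile: str) -> str:
    """SHA-256 over the template version and the rendered prompt."""
    payload = json.dumps(
        {"template": PROMPT_TEMPLATE_VERSION,
         "prompt": render_prompt(card, app_profile)},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the hash covers the template version as well as the rendered text, a template bump changes the snapshot ID even when the card is untouched. That lets the harness attribute a behavior change to the template rather than to the model.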

The third contract is runtime execution. Gemma's official documentation helps set expectations: open local models are useful, but behavior still depends on deployment context, prompting, model size, and safety constraints [S4][S5]. Realbits should encode runtime compatibility in the provider manifest. A card may be valid but not approved for every model pack. A character that relies on long examples may exceed a small local context budget, while a voice-heavy character may require a compatible voiceId. Evaluation gates should therefore bind character_schema_version, prompt_template_version, model_pack_id, and app profile together.
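As a rough sketch of this compatibility check, the code below pairs a binding key with a context-budget and voice check. The four binding fields are the ones named above; the ModelPack shape, the reserve of 512 tokens, and the four-characters-per-token estimate are assumptions for illustration, and a real check would use the model pack's own tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPack:
    model_pack_id: str
    context_tokens: int
    voice_ids: frozenset


@dataclass(frozen=True)
class GateBinding:
    """The tuple an evaluation result is valid for."""
    character_schema_version: str
    prompt_template_version: str
    model_pack_id: str
    app_profile: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token.
    return (len(text) + 3) // 4


def compatibility_issues(card: dict, rendered_prompt: str, pack: ModelPack,
                         reserve_tokens: int = 512) -> list:
    """List reasons a valid card is still not usable with this model pack."""
    issues = []
    budget = pack.context_tokens - reserve_tokens  # leave room for the conversation
    if estimate_tokens(rendered_prompt) > budget:
        issues.append("rendered prompt exceeds context budget")
    voice = card.get("voiceId")
    if voice is not None and voice not in pack.voice_ids:
        issues.append(f"voiceId {voice!r} not supported by {pack.model_pack_id}")
    return issues
```

GateBinding is frozen so that it can serve as a dictionary key for stored approvals and eval results.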

The fourth contract is evaluation evidence. A persona gate should not only say pass or fail. It should store enough metadata for engineers to debug regressions: card ID, card version, probe set version, model pack, app profile, prompt template, sampling parameters where available, judge configuration, per-dimension scores, and raw conversation artifacts subject to privacy rules. PersonaGym and PICon both imply that persona consistency is multidimensional [S1][S2]. A single average can hide regressions where a character preserves tone but loses setting, or follows setting but violates safety boundaries.
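The evidence record described above might look like the following. The attribute names mirror the metadata listed in the paragraph, but the class itself, the score scale of 0 to 1, and the helper methods are assumptions made for this sketch.

```python
from dataclasses import asdict, dataclass, field
import json


@dataclass
class EvalEvidence:
    card_id: str
    card_version: str
    probe_set_version: str
    model_pack_id: str
    app_profile: str
    prompt_template_version: str
    judge_config: dict
    sampling: dict = field(default_factory=dict)       # may be empty if unavailable
    scores: dict = field(default_factory=dict)         # dimension -> score in [0, 1]
    artifact_refs: list = field(default_factory=list)  # pointers, not raw transcripts

    def mean_score(self) -> float:
        return sum(self.scores.values()) / len(self.scores)

    def weakest_dimension(self) -> tuple:
        """The (dimension, score) pair a single average would hide."""
        return min(self.scores.items(), key=lambda kv: kv[1])

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

With scores of 0.90 for persona consistency, 0.50 for setting fidelity, and 0.95 for safety, mean_score returns roughly 0.78 while weakest_dimension points directly at setting fidelity. This is the kind of regression a single average conceals.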

A useful Realbits gate can combine generated probes and fixed scenario probes. Generated probes come from the card itself: identity and preference checks from personaPrompt, setting checks from scenario, and style checks from exampleDialogue. Fixed scenario probes come from product risk: crisis signals, romantic overattachment, medical or legal pressure, age ambiguity, and app-specific vertical tasks. HELM's multi-metric posture supports this shape: coverage across dimensions is more robust than one compressed score [S8].

This suggests three gate levels. A schema gate checks whether the card is valid and portable. A projection gate checks whether prompt rendering is deterministic, within budget, and free of forbidden structural patterns. A behavior gate runs multi-turn conversations and scores persona consistency, setting fidelity, empathy, refusal behavior, and safety. C02 mostly concerns the first two gates plus persona binding. C09 makes the third gate release-blocking.
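The three levels can be sketched as an ordered pipeline that stops at the first failure, so the expensive behavior gate runs only on cards that are already valid and renderable. The gate bodies here are stand-ins: the character budget, the 0.8 threshold, and the score_fn callback are hypothetical placeholders for the real schema validator, prompt builder, and eval harness.

```python
from typing import Callable, NamedTuple


class GateResult(NamedTuple):
    gate: str
    passed: bool
    detail: str


def schema_gate(card: dict) -> list:
    return [] if card.get("personaPrompt") else ["personaPrompt missing"]


def projection_gate(card: dict, max_chars: int = 2000) -> list:
    too_long = len(card.get("personaPrompt", "")) > max_chars
    return ["prompt over budget"] if too_long else []


def make_behavior_gate(score_fn: Callable, threshold: float = 0.8) -> Callable:
    """Wrap a scoring function (for example, the eval harness) as a gate."""
    def behavior_gate(card: dict) -> list:
        score = score_fn(card)
        if score >= threshold:
            return []
        return [f"persona score {score:.2f} below {threshold:.2f}"]
    return behavior_gate


def run_gates(card: dict, gates: list) -> list:
    """Run (name, check) pairs in order and stop at the first failing gate."""
    results = []
    for name, check in gates:
        problems = check(card)
        results.append(GateResult(name, not problems, "; ".join(problems)))
        if problems:
            break
    return results
```

A caller would pass something like [("schema", schema_gate), ("projection", projection_gate), ("behavior", make_behavior_gate(harness_score))], where harness_score is whatever function the eval harness exposes.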

Limitations

The main limitation is evaluator uncertainty. LLM-as-judge approaches can be useful, but they are not ground truth. G-Eval-style methods depend on the judge model, rubric wording, and output parsing [S3]. For Realbits, repeated-run averages and scenario floors are sensible mitigations, but they do not remove the need for human review when a character is high-risk, high-revenue, or near a threshold.
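A hedged sketch of how repeated-run averages, floors, and a human-review band might combine into one decision. The thresholds, the review margin of 0.03, and the choice to apply each floor to the worst run are assumptions for this example, not values defined by C09.

```python
from statistics import mean


def gate_decision(runs: list, avg_threshold: float, floors: dict,
                  review_margin: float = 0.03) -> dict:
    """Decide pass, fail, or human_review from repeated judge runs.

    runs: one dict per run, mapping dimension or scenario name to a score.
    floors: minimum acceptable score per name, checked against the worst run.
    """
    names = list(runs[0].keys())
    averages = {n: mean(r[n] for r in runs) for n in names}
    worst = {n: min(r[n] for r in runs) for n in names}
    floor_failures = [n for n, floor in floors.items() if worst[n] < floor]
    overall = mean(averages.values())
    if floor_failures or overall < avg_threshold:
        status = "fail"
    elif overall - avg_threshold < review_margin:
        status = "human_review"  # passed, but too close to the line to automate
    else:
        status = "pass"
    return {"status": status, "overall": overall, "floor_failures": floor_failures}
```

Checking floors against the worst run rather than the average is a conservative choice: a character that breaches a safety floor in one run out of five still fails, even when its average looks healthy.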

A second limitation is schema incompleteness. A card schema can validate shape, not intent. A valid personaPrompt may still be vague, contradictory, unsafe, or too long for a target local model. This is why projection and behavior gates must exist beside JSON validation. JSON Schema can ensure that exampleDialogue is an array of strings; it cannot know whether those examples create unstable role behavior [S6].

A third limitation is local-model variance. Gemma deployments may differ by size, quantization, context window, runtime backend, and decoding settings. A character passing under one model pack does not automatically pass under another [S4][S5]. Realbits should avoid labeling cards as globally approved. Approval should be scoped to model-pack and app-profile combinations.

A fourth limitation is ecosystem compatibility. SillyTavern-style cards are useful as an import reference, but Realbits has different constraints: SFW distribution, local inference, creator ownership, provider manifests, and cross-app reuse [S7]. Compatibility should be a migration feature, not the internal source of truth.

Implications for This Repository

The repository already points in the right direction: C02 defines a reusable character-card schema and Gemma binding goal, while C09 defines eval-gated releases. The next architectural step is to make the connection observable. Every publishable card should carry a schema version. Every prompt-builder change should be versioned. Every eval result should identify the card, prompt template, model pack, app profile, and gate profile that produced it.

For packages/web, the creator studio should preview not only the display card but also validation status and eval readiness. A creator-facing system does not need to expose raw judge prompts, but it should explain which fields are missing, which probes failed, and whether the card is approved for each app or model pack. For the mobile apps, the runtime should reject or quarantine cards whose schema or manifest compatibility cannot be verified.

For the eval harness, Realbits should prefer reproducible evidence over broad but opaque scoring. Store prompt-rendering snapshots. Keep named gate profiles. Track scenario floors separately from averages. Add regression diffs against main or the last approved card version. Preserve failed conversations for engineering review where privacy policy allows. This makes persona quality debuggable instead of mystical.
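The regression diff mentioned above can be as small as a per-dimension comparison against the last approved result. The function name and the tolerance of 0.02 are illustrative assumptions; the point of the sketch is that a dropped dimension is reported separately from a lower score.

```python
def regression_diff(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return dimensions that regressed relative to the approved baseline.

    Values are (old, new) pairs; new is None when the candidate run no
    longer reports a dimension that the baseline measured.
    """
    regressions = {}
    for dim, old in baseline.items():
        new = candidate.get(dim)
        if new is None:
            regressions[dim] = (old, None)
        elif old - new > tolerance:
            regressions[dim] = (old, new)
    return regressions
```

Treating a missing dimension as a regression guards against a quieter failure mode, where a probe set change silently removes coverage and the remaining scores still look fine.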

The key architectural stance is that persona is not a string. It is a provider-managed behavioral contract projected from a schema into a local model runtime and verified by gates. That is the difference between reusable character inventory and app-local prompt craft.

References

Source Ledger

[S1] PersonaGym: Evaluating Persona Agents and LLMs (arxiv): https://arxiv.org/abs/2407.18416 - Defines persona-agent evaluation as a dynamic benchmark problem, useful for separating persona fidelity from general conversational quality.
[S2] PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency (arxiv): https://arxiv.org/abs/2603.25620 - Describes chained multi-turn persona interrogation with internal, external, and retest consistency dimensions.
[S3] G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (arxiv): https://arxiv.org/abs/2303.16634 - Provides evidence for structured LLM-as-judge evaluation, relevant to automated scoring of character behavior.
[S4] Gemma: Open Models Based on Gemini Research and Technology (technical-report): https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf - Technical report for Gemma open models, relevant because Realbits binds character cards to a Gemma-backed local chat runtime.
[S5] Gemma model card (official-doc): https://ai.google.dev/gemma/docs/model_card - Official model documentation describing Gemma intended use, limitations, safety, and model-family context.
[S6] JSON Schema Core Specification (standards): https://json-schema.org/draft/2020-12/json-schema-core - Defines semantics for machine-validatable JSON schemas, relevant to portable character-card validation.
[S7] SillyTavern Character Design Documentation (official-doc): https://docs.sillytavern.app/usage/core-concepts/characterdesign/ - Documents a widely used character-card concept with fields such as scenarios, greetings, and dialogue examples.
[S8] HELM: Holistic Evaluation of Language Models (web): https://crfm.stanford.edu/helm/latest/ - High-quality technical benchmark documentation emphasizing multi-metric language-model evaluation.
[S9] NIST AI Risk Management Framework 1.0 (standards): https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf - Risk-management framework supporting governance, measurement, management, and monitoring practices for AI systems.