Neural Avatars: Building LLM-Driven Digital Humans

How we stack speech models, emotion classifiers and large language models to make spatial avatars that actually listen, think and respond in character.
The avatar stack
A believable digital human is never one model. It is a real-time orchestration of speech-to-text, emotion classification, an LLM reasoner, a personality prompt, a text-to-speech voice clone and a gesture model that drives the rig.
At Metaverze we run this stack on Apple Vision Pro and Quest 3: the latency-critical pieces (voice activity detection, emotion classification and lipsync) run on the device's local NPU, and the LLM is streamed from the edge.
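To make the orchestration concrete, here is a minimal sketch of one conversational turn, assuming each stage is exposed as a plain callable. Every name in it (AvatarTurn, run_turn, demo_turn and the stage parameters) is hypothetical and chosen for illustration, and the stages in the demo are stand-in lambdas rather than real models. It is not the production pipeline, only one way the pieces could be wired together.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class AvatarTurn:
    """Everything produced for a single user utterance (hypothetical shape)."""
    transcript: str
    emotion: str
    reply: str = ""
    audio_chunks: List[bytes] = field(default_factory=list)
    gestures: List[str] = field(default_factory=list)


def run_turn(
    audio: bytes,
    persona_prompt: str,
    speech_to_text: Callable[[bytes], str],
    classify_emotion: Callable[[bytes], str],
    stream_llm: Callable[[str, str, str], Iterable[str]],
    text_to_speech: Callable[[str], bytes],
    pick_gesture: Callable[[str, str], str],
) -> AvatarTurn:
    # Assumed to be the local, latency-critical stages.
    transcript = speech_to_text(audio)
    emotion = classify_emotion(audio)
    turn = AvatarTurn(transcript=transcript, emotion=emotion)

    # The LLM reply is consumed piece by piece, so speech and gesture
    # are produced for each piece as soon as it arrives.
    for sentence in stream_llm(persona_prompt, transcript, emotion):
        turn.reply += sentence
        turn.audio_chunks.append(text_to_speech(sentence))
        turn.gestures.append(pick_gesture(sentence, emotion))
    return turn


def demo_turn() -> AvatarTurn:
    # Stand-in stages; real models would replace each lambda.
    return run_turn(
        audio=b"\x00\x01",
        persona_prompt="You are Mira, a dry-witted museum guide.",
        speech_to_text=lambda audio: "what is this painting",
        classify_emotion=lambda audio: "curious",
        stream_llm=lambda persona, text, emotion: iter(
            ["It is a seascape. ", "Look at the light."]
        ),
        text_to_speech=lambda sentence: sentence.encode("utf-8"),
        pick_gesture=lambda sentence, emotion: (
            "point" if "Look" in sentence else "nod"
        ),
    )
```

Passing the stages in as callables keeps the orchestration independent of where each model runs, so a stage could move between the local NPU and the edge without changing run_turn.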
Persona before intelligence
The biggest mistake teams make is bolting a generic chatbot onto a 3D model. The avatar should be a character first, with a backstory, opinions, vocal tics and memory, and an AI second. We use system prompts of 2,000 to 4,000 tokens to give every avatar a stable identity that survives long conversations.
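As an illustration of a character-first prompt, the sketch below assembles a system prompt from the persona ingredients named above and checks it against the 2,000 to 4,000 token range. The Persona class, the section layout and the function names are hypothetical. The token count is a crude assumption of roughly four characters per token for English text; a real check would use the target model's own tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Persona:
    """Hypothetical container for the character traits of one avatar."""
    name: str
    backstory: str
    opinions: List[str]
    vocal_tics: List[str]
    memories: List[str]


def build_system_prompt(p: Persona) -> str:
    # Each trait gets its own labeled section so it is easy to edit.
    sections = [
        f"You are {p.name}. Stay in character at all times.",
        "Backstory:\n" + p.backstory,
        "Opinions:\n" + "\n".join(f"- {o}" for o in p.opinions),
        "Vocal tics:\n" + "\n".join(f"- {t}" for t in p.vocal_tics),
        "Memory:\n" + "\n".join(f"- {m}" for m in p.memories),
    ]
    return "\n\n".join(sections)


def estimate_tokens(text: str) -> int:
    # Assumption: about 4 characters per token. This is only a rough
    # heuristic, not the count a real tokenizer would return.
    return max(1, len(text) // 4)


def budget_status(prompt: str, low: int = 2000, high: int = 4000) -> str:
    # Reports whether the estimated size sits in the target range.
    n = estimate_tokens(prompt)
    if n < low:
        return "under"
    if n > high:
        return "over"
    return "ok"
```

A short draft persona will report "under", which is a useful signal that the character still needs more backstory or opinions before it can hold up over a long conversation.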
Where this is going
Multi-avatar scenes. Avatars that remember you across sessions. Avatars trained on a brand's full archive of writing and design. Avatars that improvise music. The next two years are about giving spatial worlds inhabitants, not just NPCs.