Neural Avatars: Building LLM-Driven Digital Humans

18 April 2026 · 7 min read · Metaverze AI Lab

How we stack speech models, emotion classifiers and large language models to make spatial avatars that actually listen, think and respond in character.

The avatar stack

A believable digital human is never one model. It is a real-time orchestration of voice-to-text, emotion classification, an LLM reasoner, a personality prompt, a text-to-speech voice clone and a gesture model that drives the rig.

At Metaverze we run this stack on Apple Vision Pro and Quest 3, with the latency-critical pieces (voice activity detection, emotion classification, lip sync) on the local NPU and the LLM streamed from the edge.
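To make the orchestration concrete, here is a minimal Python sketch of one conversational turn passing through the stages listed above. It is an illustration under stated assumptions, not our production code: every stage is a hypothetical callable standing in for a real model, the names (`AvatarPipeline`, `respond`, the "Mira" persona) are invented for this example, and the stages run sequentially, whereas a real-time system would overlap them and stream partial results.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class AvatarTurn:
    """Everything the rig needs to play back one reply."""
    transcript: str
    emotion: str
    reply: str
    audio: bytes
    gestures: list[str]


@dataclass
class AvatarPipeline:
    # Each field is a hypothetical stand-in for a real model or service.
    speech_to_text: Callable[[bytes], str]
    classify_emotion: Callable[[bytes], str]
    stream_llm: Callable[[str, str, str], Iterator[str]]  # (persona, emotion, transcript)
    text_to_speech: Callable[[str], bytes]
    plan_gestures: Callable[[str, str], list[str]]
    persona_prompt: str

    def respond(self, mic_audio: bytes) -> AvatarTurn:
        # Local, latency-critical stages first.
        transcript = self.speech_to_text(mic_audio)
        emotion = self.classify_emotion(mic_audio)
        # The LLM reply arrives as a stream of text chunks; this sketch
        # simply joins them instead of speaking each chunk as it lands.
        reply = "".join(self.stream_llm(self.persona_prompt, emotion, transcript))
        return AvatarTurn(
            transcript=transcript,
            emotion=emotion,
            reply=reply,
            audio=self.text_to_speech(reply),
            gestures=self.plan_gestures(reply, emotion),
        )


def fake_llm(persona: str, emotion: str, transcript: str) -> Iterator[str]:
    # Stub that mimics a streamed response.
    yield "I hear you. "
    yield f"You sound {emotion}."


pipeline = AvatarPipeline(
    speech_to_text=lambda audio: "hello there",
    classify_emotion=lambda audio: "calm",
    stream_llm=fake_llm,
    text_to_speech=lambda text: text.encode("utf-8"),
    plan_gestures=lambda text, emotion: ["nod"] if emotion == "calm" else ["lean_in"],
    persona_prompt="You are Mira, a museum guide.",
)
turn = pipeline.respond(b"\x00\x01")
print(turn.reply, turn.gestures)
```

Keeping each stage behind a plain callable makes it easy to swap a local model for a remote one without touching the orchestration logic.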

Persona before intelligence

The biggest mistake teams make is bolting a generic chatbot onto a 3D model. The avatar should be a character first (backstory, opinions, vocal tics, memory) and an AI second. We use system prompts of 2,000 to 4,000 tokens to give every avatar a stable identity that survives long conversations.
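As a rough illustration of what a persona-first prompt can look like in code, the Python sketch below assembles a system prompt from the four ingredients named above and checks it against the 2,000 to 4,000 token range. The `Persona` class, the section labels and the example character are hypothetical, and the token count uses a crude assumption of about four characters per token; a real check would use the target model's own tokenizer.

```python
from dataclasses import dataclass, field

MIN_TOKENS, MAX_TOKENS = 2_000, 4_000  # range quoted in the article


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token for English text.
    # Replace with the target model's tokenizer for accurate counts.
    return max(1, len(text) // 4)


@dataclass
class Persona:
    name: str
    backstory: str
    opinions: list[str] = field(default_factory=list)
    vocal_tics: list[str] = field(default_factory=list)
    memory: list[str] = field(default_factory=list)

    def system_prompt(self) -> str:
        def bullets(items: list[str]) -> str:
            return "\n".join(f"- {item}" for item in items)

        sections = [
            f"You are {self.name}. Stay in character at all times.",
            "BACKSTORY\n" + self.backstory,
            "OPINIONS\n" + bullets(self.opinions),
            "VOCAL TICS\n" + bullets(self.vocal_tics),
            "MEMORY\n" + bullets(self.memory),
        ]
        return "\n\n".join(sections)


def check_budget(prompt: str) -> str:
    """Report whether the prompt falls inside the target token range."""
    tokens = estimate_tokens(prompt)
    if tokens < MIN_TOKENS:
        return "too short"
    if tokens > MAX_TOKENS:
        return "too long"
    return "ok"


# Hypothetical character, deliberately underspecified.
mira = Persona(
    name="Mira",
    backstory="Former lighthouse keeper turned museum guide.",
    opinions=["Maps beat GPS."],
    vocal_tics=["Says 'right then' before answering."],
    memory=["The visitor asked about tides yesterday."],
)
prompt = mira.system_prompt()
status = check_budget(prompt)
print(status)
```

A persona this thin lands well under the lower bound, which is the point of the check: a few lines of description are not enough to hold a character steady over a long conversation.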

Where this is going

Multi-avatar scenes. Avatars that remember you across sessions. Avatars trained on a brand's full archive of writing and design. Avatars that improvise music. The next two years are about giving spatial worlds inhabitants, not just NPCs.
