
Neural Avatars: Building LLM-Driven Digital Humans

7 min read · Metaverze AI Lab

How we stack speech models, emotion classifiers and large language models to make spatial avatars that actually listen, think and respond in character.

The avatar stack

A believable digital human is never one model. It is a real-time orchestration of voice-to-text, emotion classification, an LLM reasoner, a personality prompt, a text-to-speech voice clone and a gesture model that drives the rig.

At Metaverze we run this stack for Apple Vision Pro and Quest 3. The latency-critical pieces (voice activity detection, emotion classification, lip sync) run on-device on the local NPU, while the LLM is streamed from the edge.
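As a rough illustration, one conversational turn through the stages above might be orchestrated like this. It is a sketch under stated assumptions, not a production pipeline: every function is a hypothetical stub standing in for a real model, the names (`run_turn`, `Turn`, `PLACEMENT`) are invented for this example, and the placement map only records what the text states about where each piece runs.

```python
import asyncio
from dataclasses import dataclass

# Placement as described in the text; stages not mentioned there are left out.
PLACEMENT = {"vad": "local NPU", "emotion": "local NPU",
             "lipsync": "local NPU", "llm": "edge"}


@dataclass
class Turn:
    transcript: str
    emotion: str
    reply: str
    speech: bytes
    gesture: str


async def speech_to_text(audio: bytes) -> str:
    # Stub: a real system would call a speech recognition model here.
    return audio.decode("utf-8")


async def classify_emotion(audio: bytes) -> str:
    # Stub: a real classifier would look at acoustic features, not punctuation.
    return "excited" if audio.endswith(b"!") else "neutral"


async def llm_reply(persona: str, transcript: str, emotion: str):
    # Stub async generator standing in for a token stream from the LLM.
    for token in f"[{persona}] I hear you sound {emotion}.".split():
        yield token


async def text_to_speech(text: str) -> bytes:
    # Stub: a real system would synthesize audio with the avatar's voice.
    return text.encode("utf-8")


async def pick_gesture(emotion: str) -> str:
    # Stub: map the detected emotion to a named animation for the rig.
    return {"excited": "open_arms"}.get(emotion, "idle_nod")


async def run_turn(audio: bytes, persona: str) -> Turn:
    # Transcription and emotion classification only need the audio,
    # so they can run concurrently.
    transcript, emotion = await asyncio.gather(
        speech_to_text(audio), classify_emotion(audio)
    )
    # Collect the streamed tokens; a real pipeline would start speech
    # synthesis before the stream finishes.
    tokens = [t async for t in llm_reply(persona, transcript, emotion)]
    reply = " ".join(tokens)
    # Voice and gesture are independent once the reply and emotion are known.
    speech, gesture = await asyncio.gather(
        text_to_speech(reply), pick_gesture(emotion)
    )
    return Turn(transcript, emotion, reply, speech, gesture)


if __name__ == "__main__":
    print(asyncio.run(run_turn(b"Hello there!", "Ava")))
```

The point of the shape is that independent stages overlap rather than queue: anything that depends only on the raw audio starts immediately, and the voice and gesture outputs are produced side by side once the reply exists.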

Persona before intelligence

The biggest mistake teams make is bolting a generic chatbot onto a 3D model. The avatar should be a character first, with a backstory, opinions, vocal tics and memory, and an AI second. We use system prompts of 2,000 to 4,000 tokens to give every avatar a stable identity that survives long conversations.
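To make the idea concrete, here is a minimal sketch of assembling a persona into a system prompt and checking it against the 2,000 to 4,000 token range. The `Persona` fields mirror the traits listed above, but the class, the prompt layout and the function names are assumptions made for illustration. The token estimate is a crude heuristic of roughly four characters per token, not a real tokenizer; an actual check should use the tokenizer of the target model.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    # Hypothetical structure; the fields follow the traits named in the text.
    name: str
    backstory: str
    opinions: list[str] = field(default_factory=list)
    vocal_tics: list[str] = field(default_factory=list)
    memory: list[str] = field(default_factory=list)


def build_system_prompt(p: Persona) -> str:
    """Assemble the persona into one system prompt string."""
    def bullets(items: list[str]) -> str:
        return "\n".join(f"- {item}" for item in items) or "- (none yet)"

    return "\n\n".join([
        f"You are {p.name}. Stay in character at all times.",
        f"Backstory:\n{p.backstory}",
        f"Opinions:\n{bullets(p.opinions)}",
        f"Vocal tics:\n{bullets(p.vocal_tics)}",
        f"Things you remember:\n{bullets(p.memory)}",
    ])


def estimate_tokens(text: str) -> int:
    # Assumption: about 4 characters per token. Swap in a real tokenizer.
    return len(text) // 4


def within_budget(prompt: str, low: int = 2000, high: int = 4000) -> bool:
    """True if the estimated prompt size falls in the target token range."""
    return low <= estimate_tokens(prompt) <= high


if __name__ == "__main__":
    ava = Persona(
        name="Ava",
        backstory="A retired deep-sea cartographer who now guides visitors.",
        opinions=["Maps should be drawn by hand."],
        vocal_tics=["Ends explanations with 'so it goes'."],
    )
    prompt = build_system_prompt(ava)
    print(prompt)
    print("estimated tokens:", estimate_tokens(prompt))
    print("within budget:", within_budget(prompt))
```

A short persona like the one in the usage example lands far below the range, which is the useful signal: a prompt that small is probably still a generic chatbot with a name attached.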

Where this is going

Multi-avatar scenes. Avatars that remember you across sessions. Avatars trained on a brand's full archive of writing and design. Avatars that improvise music. The next two years are about giving spatial worlds inhabitants, not just NPCs.