NVIDIA Releases PersonaPlex-7B: Full-Duplex Voice AI Debuts, Ending the Era of "Walkie-Talkie" Conversations

NVIDIA Research has open-sourced PersonaPlex-7B, a 7-billion parameter AI model designed for real-time voice interaction with full-duplex capability. The model enables natural interruptions and simultaneous listening and speaking, marking a significant advancement in conversational AI.

[Silicon Valley, February 16, 2026] NVIDIA Research has officially open-sourced its latest AI model, PersonaPlex-7B. This is not just another 7-billion-parameter language model, but an end-to-end system designed specifically for real-time voice interaction. Its arrival may signal the end of the clumsy "you speak, then I speak" dialogue mode we have grown accustomed to.

Core Breakthrough: AI That Can "Interrupt"

The biggest highlight of PersonaPlex-7B is its Full-Duplex capability.

Many current mainstream voice assistants (such as early Siri or typical speech-to-text systems) operate in "half-duplex" mode: user speaks -> AI records -> silent processing -> AI responds. This is like using a walkie-talkie, where one party must remain silent while the other speaks.

PersonaPlex-7B breaks this limitation. It employs a dual-stream architecture capable of simultaneously processing "listening" and "speaking." This means:

  • It can be interrupted at any time: When the AI is in the middle of a lengthy response, you can directly interject with "Wait, what does that mean?", and it will stop and react immediately like a real person, with a latency of only about 240 milliseconds.
  • Natural verbal cues: It can produce natural backchanneling feedback sounds like "mm-hmm," "yes," and "I'm listening" while you speak, making the conversation feel less like reading to a machine.
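The barge-in behavior described above can be illustrated with a toy simulation. This is a minimal sketch of the idea only, not PersonaPlex's actual interface: one thread "speaks" tokens while incoming user audio is monitored on a queue, and detected speech stops generation mid-response.

```python
import queue
import threading
import time

# Toy full-duplex loop (illustrative, not NVIDIA's API): generation and
# listening run concurrently; user speech triggers a barge-in.

def speak(tokens, inbound: queue.Queue, spoken: list):
    """Emit tokens one by one, but stop as soon as user speech arrives."""
    for tok in tokens:
        if not inbound.empty():        # user started talking -> barge-in
            spoken.append("<interrupted>")
            return
        spoken.append(tok)
        time.sleep(0.05)               # stand-in for audio playback time

inbound = queue.Queue()
spoken = []
t = threading.Thread(target=speak,
                     args=(["The", "answer", "is", "quite", "long"],
                           inbound, spoken))
t.start()
time.sleep(0.12)                       # let a few tokens play out
inbound.put("Wait, what does that mean?")  # user interjects mid-response
t.join()
print(spoken)
```

In a real full-duplex model the "listening" side is a continuous audio token stream rather than a queue check, but the control flow is the same: generation never blocks perception.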

Technical Breakdown: From Patchwork to Unity

Before PersonaPlex, building a voice robot typically required cobbling together three separate models:

  • Automatic Speech Recognition (ASR): Converting speech to text.
  • Large Language Model (LLM): Thinking and generating text responses.
  • Text-to-Speech (TTS): Reading the response aloud.

This "cascaded" approach is not only slow but also loses emotional information contained in the voice. PersonaPlex-7B, based on the Moshi architecture, completes all tasks within a single model. It uses the Mimi codec to convert speech into tokens, combined with the Helium language model backbone, achieving direct mapping from "audio input" to "audio output."
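The contrast between the two designs can be sketched as data flow. All functions below are placeholders (illustrative, not real APIs); the point is where information is lost in the cascaded hand-offs.

```python
# Toy contrast between a cascaded pipeline and a single end-to-end model.

def asr(audio: bytes) -> str:          # speech -> text (prosody/emotion lost here)
    return "user said something"

def llm(text: str) -> str:             # text -> text
    return f"reply to: {text}"

def tts(text: str) -> bytes:           # text -> speech
    return text.encode()

def cascaded(audio: bytes) -> bytes:
    # Three hand-offs; emotional cues in the voice never reach the LLM,
    # and each stage adds latency.
    return tts(llm(asr(audio)))

def end_to_end(audio: bytes) -> bytes:
    # Single model, PersonaPlex-style: audio tokens in, audio tokens out.
    # Placeholder body; the real model maps Mimi codec tokens directly.
    return b"<audio tokens out>"

print(cascaded(b"<audio in>"))
```

Because the end-to-end path never serializes to plain text, tone and timing information survives from input to output, and there is only one model's latency to pay.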

According to NVIDIA's benchmarks, PersonaPlex has a Time to First Token of only 170 milliseconds, faster than average human reaction speed.

Unique Personalities: Fully Controllable "Personas"

The "Persona" in the model's name is well-deserved. NVIDIA has introduced a unique Hybrid Prompting mechanism, allowing developers to precisely control the AI through two dimensions:

  • Voice Prompt: By providing a few seconds of audio sample, the AI can clone that voice tone and speaking style.
  • Text Prompt: Using text to define the character's background, profession, and personality (e.g., "You are an irritable but professional physics teacher").

This capability makes PersonaPlex-7B particularly suitable for game NPCs, virtual customer service, personalized companion assistants, and similar applications.
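A developer-facing API for hybrid prompting might look like the following. To be clear, the class and field names here are illustrative assumptions, not PersonaPlex's actual interface; the sketch only shows how the two conditioning dimensions combine.

```python
from dataclasses import dataclass

# Hypothetical hybrid-prompting payload: one audio dimension (voice clone)
# plus one text dimension (role/personality). Names are illustrative.

@dataclass
class Persona:
    voice_prompt: bytes   # a few seconds of reference audio to clone
    text_prompt: str      # character background and personality

def build_conditioning(p: Persona) -> dict:
    """Combine both prompt dimensions into one conditioning payload."""
    return {
        "voice_sample_bytes": len(p.voice_prompt),
        "role_description": p.text_prompt,
    }

teacher = Persona(
    voice_prompt=b"\x00" * 48000,  # stand-in for a short audio sample
    text_prompt="You are an irritable but professional physics teacher",
)
print(build_conditioning(teacher))
```

The useful property of this split is orthogonality: the same voice sample can back many personalities, and the same text persona can be voiced by many speakers.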

Open Source and Future

NVIDIA has released PersonaPlex-7B's code (MIT license) and model weights (NVIDIA Open Model License) on Hugging Face and GitHub.

Industry Impact:

While OpenAI's GPT-4o and Google's Gemini Live have demonstrated similar real-time voice capabilities, they remain mostly closed-source, paid services. By open-sourcing a model that delivers this experience with just 7B parameters, NVIDIA significantly lowers the barrier for developers, and even lets regular users run their own "Jarvis" on a consumer-grade graphics card such as the RTX 4090.
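A back-of-the-envelope calculation (not an official figure) shows why a 7B-parameter model fits comfortably on a 24 GB consumer card:

```python
# Rough VRAM estimate for 7B parameters stored in 16-bit precision.
params = 7e9
bytes_per_param = 2                        # bf16 / fp16 weights
weights_gb = params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gb:.1f} GB")  # well under a 4090's 24 GB
```

Activations, KV cache, and the audio codec add overhead on top of the weights, but the remaining headroom on a 24 GB card is typically sufficient for single-stream inference.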

Related Links:

Hugging Face Model Page