01 | Background: Original-Voice Video Translation
Hello everyone, we are the bilibili Index team.
Recently, we launched a new capability: translating select Chinese videos into foreign languages with dubbed audio that preserves the original speaker's characteristics. Viewers can now hear "this person speaking in another language" while the voice, tone, rhythm, and even personal mannerisms remain almost identical to the original video. It is no longer the uniform "voice-over artist" sound of traditional dubbing; it sounds as natural as if the speakers themselves were speaking a foreign language. Behind this capability lies a comprehensive upgrade of our cross-modal, multilingual collaborative generation system.
This series of technological explorations originated from an increasingly urgent need: as video content globalization deepens, multilingual dissemination has become a key medium for connecting cultures and communities. Audiences are no longer satisfied with merely "understanding the content"; they pursue "authenticity" and "presence"—hoping to hear the emotional nuances of the original voice and see natural alignment between lip movements and speech. Creators are also increasingly realizing that voice is not just a carrier of information, but a core medium for personal expression and emotional resonance.
To achieve a truly immersive cross-language experience, we must overcome key limitations present in current localization workflows. The most representative challenges fall into the following three categories:
- Loss of Vocal Identity: While traditional dubbing solves language barriers, it erases the creator's unique timbre, intonation, and accent, precisely the core identifiers of "who is speaking." In an era of personality-driven content, voice is a vital component of an IP. Once replaced by standardized voiceovers, the emotional connection breaks, and influence diminishes accordingly.
- Cognitive Load of Subtitles: Subtitles reduce audio to text, losing tone, emotion, and rhythm, thereby weakening the expressiveness of the content. The dual-input mode of "listening + reading" also splits attention, significantly impairing comprehension and viewing experience, especially for knowledge-dense or immersive content.
- Cost Barrier of Localization: Multilingual production relies on complex manual processes: voiceover recording, timeline synchronization, audio mixing, proofreading, and more. With each additional language, costs climb steeply. This puts globalization out of reach for small and medium creators, turning it into a privilege for only a few.
In this article, we will systematically introduce the technical architecture and core challenges behind this capability, and share how we have progressively achieved these goals in practice.
02 | Speech Generation Modeling for Perceptual Consistency
Traditional Text-to-Speech (TTS) systems typically optimize for speech naturalness, intelligibility, and voice similarity, lacking multidimensional modeling capabilities for the original auditory scene. In contrast, video-level speech translation is fundamentally about reconstructing perceptual consistency, requiring coordinated modeling across three key dimensions—speaker identity characteristics, acoustic spatial attributes, and multi-source time-frequency structures—to achieve a complete transfer of auditory experience.
- Reconstruction of Speaker Identity Characteristics: Traditional dubbing often uses fixed voice actors or generic voice libraries, causing a mismatch between the synthesized voice and the original speaker's vocal characteristics. This "voice distortion" weakens the original speaker's tone, personality, and expressiveness. To address this, our self-developed bilibili IndexTTS2 focuses on high-precision voice cloning in the video speech translation scenario: it can accurately reconstruct the original speaker's vocal texture and speaking style from only a small amount of the original audio.
- Preservation of Acoustic Spatial Attributes: Listeners subconsciously register spatial cues such as reverberation, spatial echoes, microphone distance, and ambient noise, and this acoustic environmental information is crucial to auditory authenticity. A key feature of bilibili IndexTTS2 is its ability to preserve the original soundfield characteristics. This spatial consistency significantly enhances auditory coherence and avoids a sense of "disconnection from the scene."
- Fusion of Multi-Source Time-Frequency Structures: In the original audio track, the interplay between speech, background music, and ambient sounds creates dynamic auditory rhythm and emotional tension. To prevent the perceptual breaks caused by naive voice replacement, our audio synthesis applies perceptual weighting during reconstruction, integrating vocals, background sounds, and music so the result closely matches the original video's auditory feel (see the sketch after this list).
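To make the reconstruction step more concrete, here is a minimal re-mixing sketch in Python. It assumes the original track has already been separated into a vocal stem and a background stem (for example, by a source-separation model) and that the translated speech has been synthesized with the cloned voice. The file names are hypothetical, and the simple RMS-based level matching stands in for the perceptual weighting described above; this is an illustrative sketch, not our production pipeline.

```python
import numpy as np
import soundfile as sf

def rms(x: np.ndarray) -> float:
    """Root-mean-square level of a mono signal."""
    return float(np.sqrt(np.mean(np.square(x)) + 1e-12))

# Mono stems at a shared sample rate, assumed to exist on disk
# (hypothetical file names):
orig_vocals, sr = sf.read("orig_vocals.wav")      # separated original speech
background, _ = sf.read("background.wav")         # music + ambient sound stem
new_vocals, _ = sf.read("translated_speech.wav")  # cloned-voice TTS output

# Match the synthesized speech to the original vocal level so the
# speech-to-background balance of the source mix is preserved.
new_vocals = new_vocals * (rms(orig_vocals) / rms(new_vocals))

# Zero-pad the shorter stem so both have equal length before summing.
n = max(len(new_vocals), len(background))
new_vocals = np.pad(new_vocals, (0, n - len(new_vocals)))
background = np.pad(background, (0, n - len(background)))

# Naive additive reconstruction with a peak limiter; a production system
# would apply frequency-dependent perceptual weighting instead of a plain sum.
mix = new_vocals + background
mix = mix / max(1.0, float(np.abs(mix).max()))
sf.write("translated_mix.wav", mix, sr)
```

Even this crude version illustrates the key design choice: the translated vocals are re-anchored to the original vocal level rather than normalized in isolation, which is what keeps the speech-to-background balance feeling like the source video.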
2.1 An Integrated Solution to Cross-Lingual Challenges: Voice Consistency, Emotion Transfer, and Speech Rate Control
In real-world video translation scenarios, achieving a complete and natural "original voice style" experience involves far more than just translating the content into the target language. The real challenge lies in preserving the speaker's "vocal individuality" while transferring the original emotion and controlling speech rate so the translated audio stays in step with the video.
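One concrete facet of speech-rate control is timing: the translated line has to fit the time slot of the original utterance, or the dub drifts out of sync with the picture. The sketch below shows one way such a constraint might be expressed; the Segment fields, the clamping range, and the rate_factor helper are all illustrative assumptions, not IndexTTS2's actual control interface.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float          # segment start time in the video (seconds)
    end: float            # segment end time (seconds)
    translated_text: str

def rate_factor(seg: Segment, est_duration_s: float,
                min_rate: float = 0.8, max_rate: float = 1.25) -> float:
    """Speech-rate multiplier so the synthesized line fits the original slot.

    est_duration_s is the estimated TTS duration at normal rate (e.g. from a
    duration predictor). The factor is clamped to a natural-sounding range;
    segments that would exceed it are better handled by rephrasing the
    translation than by rushing or dragging the speech.
    """
    slot = seg.end - seg.start
    raw = est_duration_s / slot  # >1 means we must speak faster
    return min(max(raw, min_rate), max_rate)

seg = Segment(start=12.4, end=15.1, translated_text="Hello, and welcome back!")
print(rate_factor(seg, est_duration_s=3.2))  # ~1.19: speak about 19% faster
```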
