VoxCPM2 Technical Card: Open-Source TTS Is Moving from Voice Cloning to Voice Design and Controllable Speech Generation

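This Threads post introduces VoxCPM2, open-sourced by OpenBMB. The Allen KB has already covered VoxCPM2's impact on the commoditization of voice AI; this technical card adds verifiable information from the official GitHub README and the Hugging Face model card.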

Official repo: OpenBMB/VoxCPM

Repository status at the time of the GitHub API query:

Repo: OpenBMB/VoxCPM
Description: VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning
License: Apache-2.0
Stars: about 16,875
Forks: about 2,007
Created: 2025-09-16
Model card pipeline_tag: text-to-speech
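
The repository figures above can be refreshed from the public GitHub REST API. The following is a minimal sketch, assuming unauthenticated access (which is rate limited) and the standard repository fields `full_name`, `description`, `license.spdx_id`, `stargazers_count`, `forks_count`, and `created_at`. The helper names `summarize_repo` and `fetch_repo` are made up for this card.

```python
import json
import urllib.request

# Public GitHub REST endpoint for the repository named in this card.
API_URL = "https://api.github.com/repos/OpenBMB/VoxCPM"


def summarize_repo(payload: dict) -> dict:
    """Pick the fields listed in this card out of a GitHub repository payload."""
    # "license" can be null in the API response, so fall back to an empty dict.
    license_info = payload.get("license") or {}
    return {
        "repo": payload.get("full_name"),
        "description": payload.get("description"),
        "license": license_info.get("spdx_id"),
        "stars": payload.get("stargazers_count"),
        "forks": payload.get("forks_count"),
        # created_at is an ISO 8601 timestamp; keep only the date part.
        "created": (payload.get("created_at") or "")[:10],
    }


def fetch_repo(url: str = API_URL) -> dict:
    """Download the repository metadata (requires network access)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `summarize_repo(fetch_repo())` returns the current values; star and fork counts will have drifted from the snapshot recorded here.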

Official positioning of VoxCPM2:

VoxCPM is a tokenizer-free Text-to-Speech system that directly generates continuous speech representations via an end-to-end diffusion autoregressive architecture, bypassing discrete tokenization to achieve highly natural and expressive synthesis.

VoxCPM2 is the latest major release:

2B parameters
trained on over 2 million hours of multilingual speech data
supports 30 languages
Voice Design
Controllable Voice Cloning
Ultimate Cloning
48kHz studio-quality audio output
built on MiniCPM-4 backbone

Supported languages: Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese. Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern (Dongbei), Henan, Shaanxi, Shandong, Tianjin, and Southern Min (Hokkien).

Key features:

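Voice Design: shaping a voice from text

No reference audio is needed. Placing a natural-language description in front of the text, for example "A young woman, gentle and sweet voice", is enough to generate a new voice. TTS is therefore no longer only about reading text aloud: character, age, gender, tone, emotion, and pace can all become generation conditions.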
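
A small sketch of how such a prompt might be assembled before it is passed to the model. The card only says that the description goes in front of the text; wrapping it in parentheses is an assumption here and should be checked against the official README. `build_voice_design_text` is a hypothetical helper, not part of the voxcpm package.

```python
def build_voice_design_text(description: str, text: str) -> str:
    """Prefix the text to synthesize with a natural-language voice description.

    Assumption: the description is delimited with parentheses. Verify the
    exact delimiter against the official README before relying on it.
    """
    # Collapse runs of whitespace so the description stays on one line.
    description = " ".join(description.split())
    text = text.strip()
    if not description:
        raise ValueError("a voice description is required for voice design")
    if not text:
        raise ValueError("text to synthesize must not be empty")
    return f"({description}){text}"


prompt = build_voice_design_text(
    "A young woman, gentle and sweet voice",
    "Welcome back. Here is today's summary.",
)
```

Keeping the description separate from the spoken text until the last step makes it easier to store a voice persona once and reuse it across many utterances.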

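Controllable Voice Cloning: cloning that can be steered

Given a short reference clip, VoxCPM2 can preserve the timbre while a text instruction such as "slightly faster, cheerful tone" controls emotion, pace, and expression. Compared with a plain clone, the result is closer to a voice asset that can be used in production.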

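Ultimate Cloning: high-fidelity continuation

Given reference audio plus its transcript, the model can perform audio-continuation cloning that preserves details such as timbre, rhythm, emotion, and style. This matters for dubbing, character continuity, and consistency across long recordings.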
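
The three modes described above differ mainly in which inputs are supplied. The sketch below makes that decision explicit. `SpeechRequest`, `Mode`, and `select_mode` are hypothetical names used only for illustration; they are not the voxcpm API, and the real library may route these cases differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    VOICE_DESIGN = "voice_design"
    CONTROLLABLE_CLONING = "controllable_cloning"
    ULTIMATE_CLONING = "ultimate_cloning"


@dataclass(frozen=True)
class SpeechRequest:
    """Hypothetical request record; field names are assumptions."""
    text: str
    reference_audio: Optional[str] = None       # path to a reference clip
    reference_transcript: Optional[str] = None  # transcript of that clip
    style_instruction: Optional[str] = None     # e.g. "slightly faster, cheerful tone"


def select_mode(request: SpeechRequest) -> Mode:
    """Map the supplied inputs to one of the three modes in this card."""
    if request.reference_audio is None:
        if request.reference_transcript is not None:
            raise ValueError("a transcript without reference audio is not usable")
        # No reference clip: the voice comes from a text description alone.
        return Mode.VOICE_DESIGN
    if request.reference_transcript is not None:
        # Reference clip plus transcript: audio-continuation cloning.
        return Mode.ULTIMATE_CLONING
    # Reference clip only: timbre is cloned, style is steered by text.
    return Mode.CONTROLLABLE_CLONING
```

Making the mode an explicit value is also convenient for governance: a service can require proof of consent whenever the selected mode involves a reference clip.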

48kHz Studio-Quality Output

According to the official description, the model accepts 16kHz reference audio and outputs 48kHz directly through AudioVAE V2's asymmetric encode/decode and built-in super-resolution, with no external upsampler required.

Streaming / serving

The official README mentions Real-Time Streaming, with an RTF (real-time factor) as low as ~0.3 on an RTX 4090 and about 0.13 with Nano-vLLM or vLLM-Omni; vLLM-Omni provides PagedAttention and an OpenAI-compatible API.
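
To put the RTF figures in perspective: real-time factor is conventionally the synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The short sketch below works through that arithmetic using the README figures quoted above; the function names are made up for this card.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time needed to produce audio of the given length."""
    return audio_seconds * rtf


# A 10-second utterance at the two RTF values quoted from the README.
baseline = synthesis_time(10.0, 0.3)      # roughly 3 seconds of compute
accelerated = synthesis_time(10.0, 0.13)  # roughly 1.3 seconds of compute
```

RTF describes throughput only. Time to first audio chunk, which matters most for conversational use, has to be measured separately on the target hardware.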

Getting started is also simple:

pip install voxcpm

Basic requirements: Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0.
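
Before installing, the requirements above can be checked with a few lines of Python. This is a rough sketch with hypothetical helper names. It compares only the first three numeric components of each version string, and it reads `torch.version.cuda`, which reports the CUDA version PyTorch was built against rather than the installed driver.

```python
import re
import sys


def parse_version(version: str) -> tuple:
    """Turn '2.5.1+cu121' into (2, 5, 1), padding missing parts with zeros."""
    # Drop any local build suffix such as '+cu121' before extracting numbers.
    numbers = re.findall(r"\d+", version.split("+")[0])
    padded = (numbers + ["0", "0", "0"])[:3]
    return tuple(int(n) for n in padded)


def meets_minimum(version: str, minimum: str) -> bool:
    """True when version is at least minimum (major, minor, patch only)."""
    return parse_version(version) >= parse_version(minimum)


def check_environment() -> dict:
    """Compare the local setup with the minimums listed in this card."""
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    results = {"python": meets_minimum(python_version, "3.10")}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so neither check can pass.
        results["torch"] = False
        results["cuda"] = False
        return results
    results["torch"] = meets_minimum(torch.__version__, "2.5.0")
    cuda_version = torch.version.cuda  # None on CPU-only builds
    results["cuda"] = bool(cuda_version) and meets_minimum(cuda_version, "12.0")
    return results
```

`check_environment()` returns a dictionary such as `{"python": True, "torch": False, "cuda": False}` on a machine without PyTorch, which is easy to surface in a setup script.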

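What this means for voice AI:

TTS and voice cloning used to compete on how closely a voice resembled the original and how natural it sounded. Models like VoxCPM2 push the competition to the next layer: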

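Can a voice be designed from text, rather than only cloned from an existing voice?
Can emotion, pace, and expression be controlled after cloning, rather than being locked to the reference clip?
Can multilingual support, dialects, high audio quality, and streaming all hold at the same time?
Can it run locally or self-hosted, reducing dependence on closed-source APIs such as ElevenLabs?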

In practice, adoption calls for attention to a few things:

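Voice cloning carries licensing and voiceprint-abuse risks; the Apache-2.0 license on the model does not settle personality rights, rights in a person's likeness and voice, or contractual authorization.
A 2B TTS model is smaller than a large LLM, but production streaming still requires a GPU and latency tuning.
Multilingual support does not mean every language reaches commercial quality; accents, prosody, and dialects in particular need hands-on testing.
For a commercial voice product, the most valuable parts will be rights governance, content moderation, voice asset management, and workflow integration, not the model alone.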

Possible applications for BigIntTech:

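Voice replies for Telegram / LINE assistants, switched to a controllable voice.
Automatically generated narration for tutorials, knowledge-base articles, and product demos.
Brand voice design for clients: defining a brand voice persona in text.
Games and interactive characters: prototyping a distinct voice per character with voice design.
Internal tools: turning documents, KB entries, and meeting notes into spoken summaries.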

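My take:

The point of VoxCPM2 is not that it is "yet another free ElevenLabs alternative" but that open-source speech models have started moving from cloning to design. Once a voice can be specified by a prompt, the core of a voice product shifts from model capability to three questions: how to manage voice rights safely, how to serve reliably, and how to make voice part of the product experience.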

References:

Original Threads post: https://www.threads.com/@bing_sunzhi/post/DX0gWf6D0S7
GitHub, OpenBMB/VoxCPM: https://github.com/OpenBMB/VoxCPM
Hugging Face: openbmb/VoxCPM2 model card
