Alibaba releases two new speech models

Shanghai Securities News, China Securities Network (reporter Yang Xiangfei): On March 2, Alibaba released two new speech models: Fun-CosyVoice3.5, a voice cloning model that works from reference audio, and Fun-AudioGen-VD, a voice (timbre) design model that needs no reference audio. Both models support instruction following, giving free control over emotion, speech rate, scene, and more. Voices can be customized freestyle to create characters, making the models suitable for audiobooks, gaming, customer service, podcasts, education, live streaming, and other scenarios.

Both models achieved multiple state-of-the-art (SOTA) results on benchmarks among models of similar size. On the hard Chinese cases of the Seed-TTS benchmark, Fun-CosyVoice3.5 stood out, with the lowest Word Error Rate (WER) and the best Speaker Similarity (SSIM). In addition, targeted optimization of pronunciation in difficult cases cut the error rate on rare characters from 15.2% to 5.3%.
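For readers unfamiliar with the metric, the following is a minimal sketch of how an error rate of this kind is commonly computed: the edit distance between the reference transcript and the recognized transcript, divided by the reference length. It is not the benchmark's own scoring script. The function names are hypothetical, and the per-character tokenization of strings is an assumption that reflects common practice for Chinese text. The second helper only restates the article's rare-character figures as a relative reduction.

```python
def word_error_rate(reference, hypothesis):
    """Edit distance between two token sequences, divided by reference length.

    Assumption: a plain string is split into characters (common for Chinese);
    pass lists of words to score at word level instead.
    """
    ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before, after):
    """Fraction by which an error rate fell, relative to its starting value."""
    return (before - after) / before


if __name__ == "__main__":
    print(word_error_rate("abcd", "abxd"))          # one substitution in four tokens
    print(round(relative_reduction(15.2, 5.3), 3))  # figures quoted in the article
```

Read this way, the drop from 15.2% to 5.3% is a relative reduction of roughly 65%.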

The models show significant improvements in speech accuracy, speaker similarity, prosody naturalness, and sound quality, mainly thanks to optimizations in the training process. The reinforcement learning stage uses DiffRO and GRPO and adds rewards along multiple dimensions, such as duration and prosody. DiffRO (Differentiable Reward Optimization), developed by Alibaba's Tongyi Laboratory, is designed specifically for optimizing TTS models. GRPO (Group Relative Policy Optimization) compares a group of candidate outputs against one another and assigns rewards according to their relative quality. GRPO is also applied to reinforcement learning for Flow Matching (which transforms a noise distribution into the real data distribution), marking the industry's first such application in a voice cloning model.
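The group-relative idea behind GRPO can be illustrated with a short sketch. This is a generic textbook-style illustration, not Alibaba's training code: the function name is hypothetical, and the example reward values are made up (in a TTS setting they might combine duration and prosody scores, which is an assumption here). Each candidate's reward is compared with the mean of its own group and scaled by the group's standard deviation, so above-average candidates receive a positive advantage and below-average ones a negative advantage.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other candidates in the same group.

    rewards: scores for several candidate outputs generated from one prompt.
    eps: small constant that avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four candidate syntheses of the same sentence.
    candidate_rewards = [0.62, 0.80, 0.71, 0.55]
    for reward, adv in zip(candidate_rewards, group_relative_advantages(candidate_rewards)):
        print(f"reward={reward:.2f} advantage={adv:+.2f}")
```

Because the baseline is the group's own mean, no separate value model is needed to judge whether a candidate was better or worse than expected.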

In addition, the tokenizer used in Fun-CosyVoice3.5 halves the frame rate, which improves training efficiency, and first-packet latency is reduced by 35%, greatly improving the real-time interaction experience.

Starting today, users can access both new models on Alibaba Cloud Bailian.
