Qwen Open-Sources Advanced ASR And Forced Alignment Models With Multi-Language Capabilities
In Brief
Alibaba Cloud has open-sourced its Qwen3-ASR and Qwen3-ForcedAligner AI models, delivering state-of-the-art speech recognition and forced alignment performance across multiple languages and challenging acoustic conditions.
Alibaba Cloud announced that it has made its Qwen3-ASR and Qwen3-ForcedAligner AI models open source, offering advanced tools for speech recognition and forced alignment.
The Qwen3-ASR family includes two all-in-one models, Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, which support language identification and transcription across 52 languages and accents, leveraging large-scale speech data and the Qwen3-Omni foundation model.
Internal testing indicates that the 1.7B model delivers state-of-the-art accuracy among open-source ASR systems, while the 0.6B version balances performance and efficiency and can transcribe 2,000 seconds of speech in one second under high concurrency.
The Qwen3-ForcedAligner-0.6B model uses a non-autoregressive LLM approach to align text and speech in 11 languages, outperforming leading forced-alignment solutions in both speed and accuracy.
Alibaba Cloud has also released a comprehensive inference framework under the Apache 2.0 license, supporting streaming, batch processing, timestamp prediction, and fine-tuning, aimed at accelerating research and practical applications in audio understanding.
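For readers who want to try the released models, a minimal transcription sketch might look like the following. This is an illustrative assumption rather than documentation from the release: it presumes the checkpoints are published on Hugging Face under an ID such as Qwen/Qwen3-ASR-0.6B and load through the standard transformers automatic-speech-recognition pipeline, which may not match the interface Alibaba Cloud actually ships.

```python
# Hypothetical usage sketch. Assumes the open-sourced checkpoint loads via the
# standard Hugging Face ASR pipeline; the model ID below is an assumption and
# may differ from the actual repository name in the release.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="Qwen/Qwen3-ASR-0.6B",  # assumed model ID, not confirmed
)

# Transcribe a local audio file; the pipeline handles loading and resampling.
result = asr("meeting_recording.wav")
print(result["text"])
```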
Qwen3-ASR And Qwen3-ForcedAligner Models Demonstrate Leading Accuracy And Efficiency
Alibaba Cloud has released performance results for its Qwen3-ASR and Qwen3-ForcedAligner models, demonstrating leading accuracy and efficiency across diverse speech recognition tasks.
The Qwen3-ASR-1.7B model achieves state-of-the-art results among open-source systems, outperforming commercial APIs and other open-source models in English, multilingual, and Chinese dialect recognition, including Cantonese and 22 regional variants.
It maintains reliable accuracy in challenging acoustic conditions, such as low signal-to-noise-ratio environments, child or elderly speech, and even singing-voice transcription, achieving average word error rates of 13.91% in Chinese and 14.60% in English with background music.
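For context on those figures, word error rate is the word-level edit distance between the system transcript and the reference, divided by the number of reference words. The short sketch below computes it for a made-up sentence pair; the example text is invented purely for illustration.

```python
# Word error rate (WER): substitutions + deletions + insertions,
# divided by the number of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Toy example: one substitution and one deletion over five reference words.
print(wer("the quick brown fox jumps", "the quick brown fax"))  # 0.4 -> 40% WER
```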
The smaller Qwen3-ASR-0.6B balances accuracy and efficiency, delivering high throughput and low latency under high concurrency; in online asynchronous mode it can transcribe up to five hours of speech at a concurrency of 128.
Meanwhile, the Qwen3-ForcedAligner-0.6B outperforms leading end-to-end forced-alignment models, including NeMo-Forced-Aligner, WhisperX, and Monotonic-Aligner, offering superior language coverage, timestamp accuracy, and support for a wide range of speech and audio lengths.
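For readers unfamiliar with the task, a forced aligner consumes an audio file together with its known transcript and returns per-word timestamps. The sketch below illustrates only that input/output contract with a placeholder function; it is not the Qwen3-ForcedAligner API, which this article does not describe.

```python
# Conceptual sketch of the forced-alignment contract: audio plus its known
# transcript in, per-word start/end times out. The align() function is a
# placeholder, not the actual Qwen3-ForcedAligner interface.
from dataclasses import dataclass

@dataclass
class WordSpan:
    word: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio

def align(audio_path: str, transcript: str) -> list[WordSpan]:
    # A real aligner scores each transcript word against the audio frames and
    # returns the most likely time span for every word, in order.
    raise NotImplementedError("stand-in for an actual forced-alignment model")

# Typical consumers of this structure: subtitle timing, karaoke-style
# highlighting, and dataset segmentation.
# spans = align("lecture.wav", "welcome to the seminar on speech models")
```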