模型介绍
https://arxiv.org/pdf/2407.10759
https://zhuanlan.zhihu.com/p/712987238
We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. We introduce two distinct audio interaction modes:
- voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input;
- audio analysis: users could provide audio and text instructions for analysis during the interaction;
We've released two models of the Qwen2-Audio series: Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct.
根据 Qwen2-Audio 技术报告,它在语音聊天和音频分析之间实现了无缝切换,不需要明确的系统提示。这两种模式是联合训练的,用户可以自然地与模型交互,模型会根据输入(语音或文本)智能地理解用户的意图,并自动选择适合的模式。
- 语音聊天模式:允许用户进行自由的语音对话,可以直接通过语音与模型互动并获取实时响应。
- 音频分析模式:用户可以通过音频或文本输入,要求模型对音频内容进行分析,例如检测声音、对话或其他音频信息。
这种设计使得用户无需手动切换模式,模型会根据交互内容自动适应两种模式的需求,提供流畅的用户体验。
- 2024.8.9 🎉 We released the checkpoints of both
Qwen2-Audio-7BandQwen2-Audio-7B-Instructon ModelScope and Hugging Face. - 2024.7.15 🎉 We released the paper of Qwen2-Audio, introducing the relevant model structure, training methods, and model performance. Check our report for details!
- 2023.11.30 🔥 We released the Qwen-Audio series.
预训练
训练策略
Model Architecture The training process of Qwen2-Audio is depicted in Figure 2, which contains an audio
encoder and a large language model. Given the paired data (a,x), where the a and x denote the audio
sequences and text sequences, the training objective is to maximize the next text token probability as
Pθ(xt|x<t,Encoderϕ(a)),(1) conditioning on audio representations and previous text sequences x<t, where θ and ϕ denote the trainable parameters of the LLM and audio encoder respectively.
Different from Qwen-Audio, the initialization of the audio encoder of Qwen2-Audio is based on the Whisper large-v3 model (Radford et al., 2023). To preprocess the audio data, we resamples it to a frequency of 16kHz and converts the raw waveform into 128-channel mel-spectrogram using a window size of 25ms and a hop size of 10ms. Additionally, a pooling layer with a stride of two is incorporated to reduce the length of the audio repr

最低0.47元/天 解锁文章
8721

被折叠的 条评论
为什么被折叠?



