I. Introduction
A while ago, Alibaba open-sourced the Qwen Omni multimodal model. From the official introduction and demo videos, we can see that it supports several input modalities: text, text + audio, text + image, and text + video. Its output is no longer limited to text, either: the model can also return synthesized speech, which saves you from integrating a separate speech model. Responses come back quickly, and the overall experience is excellent.
Looking at the Omni model documentation on Alibaba Cloud's Bailian (Model Studio) platform, Alibaba currently offers two versions: a commercial one and an open-source one. The open-source version has only 7B parameters, but because it is a multimodal model, running it locally requires a large amount of GPU memory; according to figures shared by other users, roughly 70 GB of VRAM is needed to run all of its features. For most users, the commercial API offered by Alibaba provides a stable and efficient service and is the recommended option. The API is described below:
II. Omni API Specification
Based on Alibaba's official documentation, this section describes the API specification of the Qwen Omni model.
1. Basic Request Format
According to the official API documentation, the standard text-input endpoint is compatible with the OpenAI API; the request payload has the same format as an OpenAI API request:
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-omni-turbo",
"messages": [
{
"role": "user",
"content": "你是谁?"
}
],
"stream":true,
"stream_options":{
"include_usage":true
},
"modalities":["text","audio"],
"audio":{"voice":"Cherry","format":"wav"}
}'
Note the stream option here: it is set to true, which enables streaming output.
In the modalities setting, the default value is ["text","audio"], meaning the model's response contains both text and audio data. If you do not need audio, keep only ["text"].
The audio setting defines the voice and format of the returned audio. According to the official documentation, the output voice and file format (only "wav" is supported) are configured through the audio parameter, e.g. audio={"voice": "Cherry", "format": "wav"}. For the commercial model, the voice parameter accepts ["Cherry", "Serena", "Ethan", "Chelsie"]; for the open-source model, it accepts ["Ethan", "Chelsie"].
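For readers who prefer Python over raw curl, the request above can be sketched without the official SDK using only the standard library. This is a minimal sketch: the helper names (build_omni_payload, send_omni_request) are illustrative, not part of any SDK, and the endpoint and field names are taken from the curl example above.

```python
import json
import os
import urllib.request


def build_omni_payload(prompt, want_audio=True):
    """Build the request body shown in the curl example above."""
    payload = {
        "model": "qwen-omni-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # Omni results are returned as a stream
        "stream_options": {"include_usage": True},
        "modalities": ["text", "audio"] if want_audio else ["text"],
    }
    if want_audio:
        payload["audio"] = {"voice": "Cherry", "format": "wav"}
    return payload


def send_omni_request(payload):
    """POST the payload and yield the raw SSE lines of the response."""
    req = urllib.request.Request(
        "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + os.environ["DASHSCOPE_API_KEY"],
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # iterate the streamed response line by line
            line = raw.decode("utf-8").strip()
            if line:
                yield line
```

To try it, set DASHSCOPE_API_KEY in the environment and iterate over send_omni_request(build_omni_payload("你是谁?")).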
2. Image/Audio/Video Requests
Here we do not use the SDK, so multimedia resources must be converted to Base64 and embedded directly in the request payload for transmission.
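The Base64 conversion step can be sketched in Python; to_data_url is a hypothetical helper, and the MIME strings follow the official examples in this section:

```python
import base64


def to_data_url(path, mime=""):
    """Read a local file and return a data: URL with a Base64 payload.

    mime is e.g. "image/png" or "image/jpeg"; the official audio/video
    examples below leave it empty ("data:;base64,...").
    """
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return "data:" + mime + ";base64," + encoded
```

For example, to_data_url("cat.png", "image/png") produces the string that goes into the image_url.url field below.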
1) Text + Image Request Format
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-omni-turbo",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": "data:image/png;base64,{base64_image}"}
},
{
"type": "text",
"text": "图中描绘的是什么景象?"
}
]
}
],
"stream":true,
"stream_options":{
"include_usage":true
},
"modalities":["text","audio"],
"audio":{"voice":"Cherry","format":"wav"}
}'
Replace {base64_image} with the Base64 encoding of the image you are uploading.
2) Text + Audio Request Format
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-omni-turbo",
"messages": [
{
"role": "user",
"content": [
{
"type": "input_audio",
"input_audio": {
"data": "data:;base64,{base64_audio}",
"format": "mp3"
}
},
{
"type": "text",
"text": "这段音频在说什么"
}
]
}
],
"stream":true,
"stream_options":{
"include_usage":true
},
"modalities":["text","audio"],
"audio":{"voice":"Cherry","format":"wav"}
}'
Replace {base64_audio} with the Base64 encoding of the input audio file. Note that the audio must be in mp3 format, matching the format field.
3) Text + Video Request Format
Video input supports two request modes: ① sending the video file data itself; ② sending a sequence of image frames.
The request format for sending video file data is as follows:
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-omni-turbo",
"messages": [
{
"role": "user",
"content": [
{
"type": "video_url",
"video_url": {"url": "data:;base64,{base64_video}"}
},
{
"type": "text",
"text": "视频的内容是什么"
}
]
}
],
"stream":true,
"stream_options": {
"include_usage": true
},
"modalities":["text","audio"],
"audio":{"voice":"Cherry","format":"wav"}
}'
Replace {base64_video} with the Base64-encoded data of the video file you want to send.
The request format for sending an image sequence:
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-omni-turbo",
"messages": [
{
"role": "user",
"content": [
{
"type": "video",
"video": [
"data:image/jpeg;base64,{base64_image_1}",
"data:image/jpeg;base64,{base64_image_2}",
"data:image/jpeg;base64,{base64_image_3}",
"data:image/jpeg;base64,{base64_image_4}"
]
},
{
"type": "text",
"text": "描述这个视频的具体过程"
}
]
}
],
"stream": true,
"stream_options": {
"include_usage": true
},
"modalities": ["text", "audio"],
"audio": {
"voice": "Cherry",
"format": "wav"
}
}'
The video list should contain the Base64 encodings of the image sequence; {base64_image_1} and the following placeholders correspond to the Base64 data of each image.
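Assembling this message from a list of Base64-encoded frames can be sketched as follows (build_video_frames_message is an illustrative helper, not an official API):

```python
def build_video_frames_message(frames_b64, question):
    """Assemble the user message for the image-sequence request above.

    frames_b64: Base64-encoded JPEG frames, in playback order.
    """
    return {
        "role": "user",
        "content": [
            {
                # each frame becomes a data: URL inside the "video" list
                "type": "video",
                "video": ["data:image/jpeg;base64," + b64 for b64 in frames_b64],
            },
            {"type": "text", "text": question},
        ],
    }
```

The returned dict drops straight into the "messages" list of the request body.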
3. Response Format
When audio output is enabled, the model can only return streaming data. Each returned chunk has the following structure:
data: {"choices":[{"finish_reason":null,"delta":{"audio":{"transcript":"我是"}},"index":0,"logprobs":null}],"object":"chat.completion.chunk","usage":null,"created":1747031484,"system_fingerprint":null,"model":"qwen-omni-turbo","id":"chatcmpl-92c43996-1c01-90c0-86a6-197f7e00ec5a"}
We need to process the returned chunks one by one, parsing each result as it arrives. Note that the final chunk is data: [DONE]; we can check whether a parsed line contains [DONE] to tell when the stream is finished.
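The chunk-handling logic just described can be sketched in Python (parse_sse_line is an illustrative helper; the field names come from the sample chunk above):

```python
import json


def parse_sse_line(line):
    """Parse one SSE line; returns (done, transcript_or_None)."""
    if not line.startswith("data:"):
        return False, None  # skip empty keep-alive / non-data lines
    body = line[len("data:"):].strip()
    if body == "[DONE]":
        return True, None   # end-of-stream marker
    chunk = json.loads(body)
    for choice in chunk.get("choices", []):
        audio = choice.get("delta", {}).get("audio")
        if audio and audio.get("transcript"):
            return False, audio["transcript"]
    return False, None
```

In real responses the same delta.audio object also carries a data field with Base64-encoded audio, which can be extracted the same way.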
III. Unity-Side Code Implementation
This example is a Unity project featuring an interactive anime-style character. By integrating the Omni model API, it implements AI interaction with plain text, text + real-time voice, text + local audio files, text + local images, and text + local video files. A demo of the project can be seen in the video at the end of this article.
Below is the key code for handling each type of media resource:
1. Plain Text Input
/// <summary>
/// Send text only
/// </summary>
/// <param name="_postWord"></param>
/// <param name="_callback"></param>
/// <returns></returns>
public IEnumerator OnTxtRequest(string _postWord, System.Action<string> _callback)
{
stopwatch.Restart();
using (UnityWebRequest request = new UnityWebRequest(url, "POST"))
{
// Build the request headers and payload
var _sendWord = new SendData();
_sendWord.role = "user";
TextContentData _textContent= new TextContentData();
_textContent.text = _postWord;
_sendWord.content.Add(_textContent);
PostTextData _postData = new PostTextData();
_postData.model = m_ModelName;
_postData.messages.Add(_sendWord);
_postData.stream = true;
_postData.modalities.Add("text");
_postData.modalities.Add("audio");
_postData.audio.voice = m_VoiceType.ToString();// voice (timbre)
string _jsonText = JsonConvert.SerializeObject(_postData);
//string _jsonText = JsonUtility.ToJson(_postData);
byte[] data = Encoding.UTF8.GetBytes(_jsonText);
request.uploadHandler = new UploadHandlerRaw(data);
request.downloadHandler = new DownloadHandlerBuffer();
request.SetRequestHeader("Content-Type", "application/json");
request.SetRequestHeader("Authorization", $"Bearer {api_key}");
// Send the request asynchronously
request.SendWebRequest();
int bytesReceived = 0;
// Process the stream data in real time
while (!request.isDone)
{
// Get the number of bytes received so far
int newBytes = request.downloadHandler.data != null ? request.downloadHandler.data.Length : 0;
if (newBytes > bytesReceived)
{
// Extract the newly received data and decode it
byte[] newData = new byte[newBytes - bytesReceived];
Array.Copy(request.downloadHandler.data, bytesReceived, newData, 0, newData.Length);
string chunk = Encoding.UTF8.GetString(newData);
// Process the data chunk
ProcessChunk(chunk, _callback);
bytesReceived = newBytes;
}
yield return null;
}
// Process any remaining data
if (request.downloadHandler.data != null && bytesReceived < request.downloadHandler.data.Length)
{
byte[] remainingData = new byte[request.downloadHandler.data.Length - bytesReceived];
Array.Copy(request.downloadHandler.data, bytesReceived, remainingData, 0, remainingData.Length);
ProcessChunk(Encoding.UTF8.GetString(remainingData), _callback);
}
// Error handling
if (request.result != UnityWebRequest.Result.Success)
{
Debug.LogError($"Error: {request.error}");
}
}
stopwatch.Stop();
Debug.Log($"Total time: {stopwatch.Elapsed.TotalSeconds}s");
}
2. Text + Audio Input
/// <summary>
/// Send audio together with text
/// </summary>
/// <param name="_postWord"></param>
/// <param name="_base64"></param>
/// <param name="_callback"></param>
/// <returns></returns>
public IEnumerator OnVoiceAndTextRequest(string _postWord, string _base64,System.Action<string> _callback)
{
stopwatch.Restart();
using (UnityWebRequest request = new UnityWebRequest(url, "POST"))
{
if (_postWord == "") { _postWord = "请根据语音内容进行回答"; }// default prompt: "Please answer based on the audio content"
// Build the request headers and payload
var _sendWord = new SendData();
_sendWord.role = "user";
VoiceContentData _voiceContent=new VoiceContentData();
_sendWord.content.Add(_voiceContent);
_voiceContent.input_audio.data += _base64;
// Append the text part
TextContentData _textContent=new TextContentData();
_textContent.text = _postWord;
_sendWord.content.Add(_textContent);
PostTextData _postData = new PostTextData();
_postData.model = m_ModelName;
_postData.messages.Add(_sendWord);
_postData.stream = true;
_postData.modalities.Add("text");
_postData.modalities.Add("audio");
_postData.audio.voice = m_VoiceType.ToString();// voice (timbre)
string _jsonText = JsonConvert.SerializeObject( _postData );
//string _jsonText = JsonUtility.ToJson(_postData);
byte[] data = Encoding.UTF8.GetBytes(_jsonText);
request.uploadHandler = new UploadHandlerRaw(data);
request.downloadHandler = new DownloadHandlerBuffer();
request.SetRequestHeader("Content-Type", "application/json");
request.SetRequestHeader("Authorization", $"Bearer {api_key}");
// Send the request asynchronously
request.SendWebRequest();
int bytesReceived = 0;
// Process the stream data in real time
while (!request.isDone)
{
// Get the number of bytes received so far
int newBytes = request.downloadHandler.data != null ? request.downloadHandler.data.Length : 0;
if (newBytes > bytesReceived)
{
// Extract the newly received data and decode it
byte[] newData = new byte[newBytes - bytesReceived];
Array.Copy(request.downloadHandler.data, bytesReceived, newData, 0, newData.Length);
string chunk = Encoding.UTF8.GetString(newData);
// Process the data chunk
ProcessChunk(chunk, _callback);
bytesReceived = newBytes;
}
yield return null;
}
// Process any remaining data
if (request.downloadHandler.data != null && bytesReceived < request.downloadHandler.data.Length)
{
byte[] remainingData = new byte[request.downloadHandler.data.Length - bytesReceived];
Array.Copy(request.downloadHandler.data, bytesReceived, remainingData, 0, remainingData.Length);
ProcessChunk(Encoding.UTF8.GetString(remainingData), _callback);
}
// Error handling
if (request.result != UnityWebRequest.Result.Success)
{
Debug.LogError($"Error: {request.error}");
}
}
stopwatch.Stop();
Debug.Log($"Total time: {stopwatch.Elapsed.TotalSeconds}s");
}
3. Text + Image Input
/// <summary>
/// Send an image together with text
/// </summary>
/// <param name="_postWord"></param>
/// <param name="_img_base64"></param>
/// <param name="_callback"></param>
/// <returns></returns>
public IEnumerator OnImageAndTextRequest(string _postWord,string _img_base64, System.Action<string> _callback)
{
stopwatch.Restart();
using (UnityWebRequest request = new UnityWebRequest(url, "POST"))
{
// Build the request headers and payload
var _sendWord = new SendData();
_sendWord.role = "user";
ImageContentData _imgContent = new ImageContentData();
_sendWord.content.Add(_imgContent);
_imgContent.image_url.url += _img_base64;
// Append the text part
TextContentData _textContent = new TextContentData();
_textContent.text = _postWord;
_sendWord.content.Add(_textContent);
PostTextData _postData = new PostTextData();
_postData.model = m_ModelName;
_postData.messages.Add(_sendWord);
_postData.stream = true;
_postData.modalities.Add("text");
_postData.modalities.Add("audio");
_postData.audio.voice = m_VoiceType.ToString();// voice (timbre)
string _jsonText = JsonConvert.SerializeObject(_postData);
//string _jsonText = JsonUtility.ToJson(_postData);
byte[] data = Encoding.UTF8.GetBytes(_jsonText);
request.uploadHandler = new UploadHandlerRaw(data);
request.downloadHandler = new DownloadHandlerBuffer();
request.SetRequestHeader("Content-Type", "application/json");
request.SetRequestHeader("Authorization", $"Bearer {api_key}");
// Send the request asynchronously
request.SendWebRequest();
int bytesReceived = 0;
// Process the stream data in real time
while (!request.isDone)
{
// Get the number of bytes received so far
int newBytes = request.downloadHandler.data != null ? request.downloadHandler.data.Length : 0;
if (newBytes > bytesReceived)
{
// Extract the newly received data and decode it
byte[] newData = new byte[newBytes - bytesReceived];
Array.Copy(request.downloadHandler.data, bytesReceived, newData, 0, newData.Length);
string chunk = Encoding.UTF8.GetString(newData);
// Process the data chunk
ProcessChunk(chunk, _callback);
bytesReceived = newBytes;
}
yield return null;
}
// Process any remaining data
if (request.downloadHandler.data != null && bytesReceived < request.downloadHandler.data.Length)
{
byte[] remainingData = new byte[request.downloadHandler.data.Length - bytesReceived];
Array.Copy(request.downloadHandler.data, bytesReceived, remainingData, 0, remainingData.Length);
ProcessChunk(Encoding.UTF8.GetString(remainingData), _callback);
}
// Error handling
if (request.result != UnityWebRequest.Result.Success)
{
Debug.LogError($"Error: {request.error}");
}
}
stopwatch.Stop();
Debug.Log($"Total time: {stopwatch.Elapsed.TotalSeconds}s");
}
4. Text + Video Input
/// <summary>
/// Send a video together with text
/// </summary>
/// <param name="_postWord"></param>
/// <param name="_video_base64"></param>
/// <param name="_callback"></param>
/// <returns></returns>
public IEnumerator OnVideoAndTextRequest(string _postWord, string _video_base64, System.Action<string> _callback)
{
stopwatch.Restart();
using (UnityWebRequest request = new UnityWebRequest(url, "POST"))
{
// Build the request headers and payload
var _sendWord = new SendData();
_sendWord.role = "user";
VideoContentData _videoContent = new VideoContentData();
_sendWord.content.Add(_videoContent);
_videoContent.video_url.url += _video_base64;
// Append the text part
TextContentData _textContent = new TextContentData();
_textContent.text = _postWord;
_sendWord.content.Add(_textContent);
PostTextData _postData = new PostTextData();
_postData.model = m_ModelName;
_postData.messages.Add(_sendWord);
_postData.stream = true;
_postData.modalities.Add("text");
_postData.modalities.Add("audio");
_postData.audio.voice = m_VoiceType.ToString();// voice (timbre)
string _jsonText = JsonConvert.SerializeObject(_postData);
//string _jsonText = JsonUtility.ToJson(_postData);
byte[] data = Encoding.UTF8.GetBytes(_jsonText);
request.uploadHandler = new UploadHandlerRaw(data);
request.downloadHandler = new DownloadHandlerBuffer();
request.SetRequestHeader("Content-Type", "application/json");
request.SetRequestHeader("Authorization", $"Bearer {api_key}");
// Send the request asynchronously
request.SendWebRequest();
int bytesReceived = 0;
// Process the stream data in real time
while (!request.isDone)
{
// Get the number of bytes received so far
int newBytes = request.downloadHandler.data != null ? request.downloadHandler.data.Length : 0;
if (newBytes > bytesReceived)
{
// Extract the newly received data and decode it
byte[] newData = new byte[newBytes - bytesReceived];
Array.Copy(request.downloadHandler.data, bytesReceived, newData, 0, newData.Length);
string chunk = Encoding.UTF8.GetString(newData);
// Process the data chunk
ProcessChunk(chunk, _callback);
bytesReceived = newBytes;
}
yield return null;
}
// Process any remaining data
if (request.downloadHandler.data != null && bytesReceived < request.downloadHandler.data.Length)
{
byte[] remainingData = new byte[request.downloadHandler.data.Length - bytesReceived];
Array.Copy(request.downloadHandler.data, bytesReceived, remainingData, 0, remainingData.Length);
ProcessChunk(Encoding.UTF8.GetString(remainingData), _callback);
}
// Error handling
if (request.result != UnityWebRequest.Result.Success)
{
Debug.LogError($"Error: {request.error}");
}
}
stopwatch.Stop();
Debug.Log($"Total time: {stopwatch.Elapsed.TotalSeconds}s");
}
5. Text + Image Sequence Input
/// <summary>
/// Send an image sequence
/// </summary>
/// <param name="_postWord"></param>
/// <param name="_img_base64"></param>
/// <param name="_callback"></param>
/// <returns></returns>
public IEnumerator OnImageFrameAndTextRequest(string _postWord, List<string> _img_base64, System.Action<string> _callback)
{
stopwatch.Restart();
using (UnityWebRequest request = new UnityWebRequest(url, "POST"))
{
// Build the request headers and payload
var _sendWord = new SendData();
_sendWord.role = "user";
ImageFrameContentData _imageFrameContent = new ImageFrameContentData();
_sendWord.content.Add(_imageFrameContent);
foreach(var item in _img_base64)
{
string _val = "data:image/jpeg;base64," + item;
_imageFrameContent.video.Add(_val);
}
// Append the text part
TextContentData _textContent = new TextContentData();
_textContent.text = _postWord;
_sendWord.content.Add(_textContent);
PostTextData _postData = new PostTextData();
_postData.model = m_ModelName;
_postData.messages.Add(_sendWord);
_postData.stream = true;
_postData.modalities.Add("text");
_postData.modalities.Add("audio");
_postData.audio.voice = m_VoiceType.ToString();// voice (timbre)
string _jsonText = JsonConvert.SerializeObject(_postData);
//string _jsonText = JsonUtility.ToJson(_postData);
byte[] data = Encoding.UTF8.GetBytes(_jsonText);
request.uploadHandler = new UploadHandlerRaw(data);
request.downloadHandler = new DownloadHandlerBuffer();
request.SetRequestHeader("Content-Type", "application/json");
request.SetRequestHeader("Authorization", $"Bearer {api_key}");
// Send the request asynchronously
request.SendWebRequest();
int bytesReceived = 0;
// Process the stream data in real time
while (!request.isDone)
{
// Get the number of bytes received so far
int newBytes = request.downloadHandler.data != null ? request.downloadHandler.data.Length : 0;
if (newBytes > bytesReceived)
{
// Extract the newly received data and decode it
byte[] newData = new byte[newBytes - bytesReceived];
Array.Copy(request.downloadHandler.data, bytesReceived, newData, 0, newData.Length);
string chunk = Encoding.UTF8.GetString(newData);
// Process the data chunk
ProcessChunk(chunk, _callback);
bytesReceived = newBytes;
}
yield return null;
}
// Process any remaining data
if (request.downloadHandler.data != null && bytesReceived < request.downloadHandler.data.Length)
{
byte[] remainingData = new byte[request.downloadHandler.data.Length - bytesReceived];
Array.Copy(request.downloadHandler.data, bytesReceived, remainingData, 0, remainingData.Length);
ProcessChunk(Encoding.UTF8.GetString(remainingData), _callback);
}
// Error handling
if (request.result != UnityWebRequest.Result.Success)
{
Debug.LogError($"Error: {request.error}");
}
}
stopwatch.Stop();
Debug.Log($"Total time: {stopwatch.Elapsed.TotalSeconds}s");
}
6. Data Structure Definitions
// Data structure definitions
[Serializable]
public class SendData
{
public string role;
public List<ContentData> content=new List<ContentData>();
}
[Serializable]
public class PostTextData
{
public string model;
public List<SendData> messages = new List<SendData>();
public bool stream = true;
public List<string> modalities = new List<string>();
public AudioSet audio= new AudioSet();
}
[Serializable]
public class ContentData
{
public string type = "";
}
/// <summary>
/// Text content
/// </summary>
[Serializable]
public class TextContentData : ContentData
{
public TextContentData() { type = "text"; }
public string text = "";
}
/// <summary>
/// Text + audio
/// </summary>
[Serializable]
public class VoiceContentData: ContentData
{
public VoiceContentData() { type = "input_audio"; }
public AudioInput input_audio=new AudioInput();
}
[Serializable]
public class AudioInput
{
public string data = "data:;base64,";// append the Base64-encoded audio after the comma
public string format = "wav";// wav or mp3
}
/// <summary>
/// Text + image
/// </summary>
[Serializable]
public class ImageContentData : ContentData
{
public ImageContentData() { type = "image_url"; }
public ImageInput image_url = new ImageInput();
}
[Serializable]
public class ImageInput
{
public string url = "data:image/png;base64,";// append the Base64-encoded image after the comma
}
/// <summary>
/// Text + video
/// </summary>
[Serializable]
public class VideoContentData : ContentData
{
public VideoContentData() { type = "video_url"; }
public VideoInput video_url = new VideoInput();
}
[Serializable]
public class VideoInput
{
public string url = "data:;base64,";// append the Base64-encoded video after the comma
}
/// <summary>
/// Text + image sequence
/// </summary>
[Serializable]
public class ImageFrameContentData : ContentData
{
public ImageFrameContentData() { type = "video"; }
public List<string> video = new List<string>();// each entry: "data:image/jpeg;base64," + Base64-encoded image data
}
[Serializable]
public class MessageBack
{
public List<Choice> choices = new List<Choice>();
}
[Serializable]
public class Choice
{
public Delta delta=new Delta();
}
[Serializable]
public class Delta
{
public string role= string.Empty;
public string content= string.Empty;
public Audio audio=new Audio();
}
[Serializable]
public class AudioSet
{
public string voice = "Cherry";// Cherry, Serena, Ethan, Chelsie
public string format = "wav";
}
[Serializable]
public class Audio
{
public string transcript="";
public string data="";
}
/// <summary>
/// Voice (timbre)
/// </summary>
public enum VoiceType
{
Cherry,
Serena,
Ethan,
Chelsie
}
IV. Conclusion
This article introduced the API specification of Alibaba's recently open-sourced Qwen2.5-Omni-7B multimodal model and provided sample code for integrating the official Omni API on the Unity side. The code shown here covers the core of the API integration; the end-to-end multimodal interaction results can be seen in the video below.
Unity + Qwen2.5-Omni source code | A multimodal end-to-end interactive AI anime character: text/voice/image/video interaction demos and sample source code