StoryTTS, Music Style Transfer, Prompt Following, Code Execution

This digest looks at how human preference data can improve text-to-image generative models, introducing VisionPrefer, a high-quality, multi-dimensional preference-annotated dataset. It also touches on story retelling, music style transfer, expressive text-to-speech, multi-modal prompt learning for image quality assessment, and small, fast decoder-only Transformer architectures.

This article was first published on the WeChat official account: 机器感知 (Machine Perception)

Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

Recent studies have demonstrated the exceptional potential of leveraging human preference datasets to refine text-to-image generative models, enhancing the alignment between generated images and textual prompts. Despite these advances, current human preference datasets are either prohibitively expensive to construct or suffer from a lack of diversity in preference dimensions, resulting in limited applicability for instruction tuning in open-source text-to-image generative models and hindering further exploration. To address these challenges and promote the alignment of generative models through instruction tuning, we leverage multimodal large language models to create VisionPrefer, a high-quality and fine-grained preference dataset that captures multiple preference aspects. We aggregate feedback from AI annotators across four aspects (prompt-following, aesthetic, fidelity, and harmlessness) to construct VisionPrefer. To validate the effectiveness of VisionPrefer, we train a rewa......
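
The excerpt does not describe how the four aspect scores are combined, so here is a minimal sketch of turning hypothetical per-aspect annotator scores into (chosen, rejected) preference pairs of the kind used to train a reward model. All names (`AnnotatedImage`, `overall_score`, the uniform weights) are illustrative assumptions, not the paper's actual pipeline.

```python
from dataclasses import dataclass

# Hypothetical per-image scores from an AI annotator; the four aspects match
# those named in the abstract. Uniform aggregation weights are an assumption.
ASPECTS = ("prompt_following", "aesthetic", "fidelity", "harmlessness")

@dataclass
class AnnotatedImage:
    image_id: str
    scores: dict  # aspect -> score in [0, 1]

def overall_score(img: AnnotatedImage, weights=None) -> float:
    """Weighted mean over the four preference aspects."""
    weights = weights or {a: 1.0 for a in ASPECTS}
    total = sum(weights[a] * img.scores[a] for a in ASPECTS)
    return total / sum(weights.values())

def to_preference_pair(a: AnnotatedImage, b: AnnotatedImage):
    """Order two generations for the same prompt as (chosen, rejected)."""
    return (a, b) if overall_score(a) >= overall_score(b) else (b, a)

img_a = AnnotatedImage("gen_0", dict(zip(ASPECTS, (0.9, 0.7, 0.8, 1.0))))
img_b = AnnotatedImage("gen_1", dict(zip(ASPECTS, (0.6, 0.8, 0.7, 1.0))))
chosen, rejected = to_preference_pair(img_a, img_b)
print(chosen.image_id, ">", rejected.image_id)  # gen_0 > gen_1
```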

Multi-Modal Prompt Learning on Blind Image Quality Assessment

Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Currently, leveraging semantic information to enhance IQA is a crucial research direction. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. However, the generalist nature of these pre-trained Vision-Language (VL) models often renders them suboptimal for IQA-specific tasks. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. Existing prompt-based VL models overly focus on incremental semantic information from text, neglecting the rich insights available from visual data analysis. This imbalance limits their performance improvements in IQA tasks. This paper introduces an innovative multi-modal prompt-based methodology for IQA. Our approach ......
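
As a rough illustration of what multi-modal prompt learning on a CLIP-style backbone can look like, the sketch below scores quality as the similarity between image features and learnable "good"/"bad" text prompts, with an extra learnable prompt on the visual side. The encoders are random stand-ins for frozen CLIP towers, and the whole design is an assumption for illustration, not the paper's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptIQA(nn.Module):
    """Toy prompt-based IQA head: learnable text contexts + a visual prompt."""
    def __init__(self, embed_dim=512, n_ctx=4):
        super().__init__()
        self.text_ctx = nn.Parameter(torch.randn(2, n_ctx, embed_dim) * 0.02)  # contexts for "good"/"bad"
        self.class_tok = nn.Parameter(torch.randn(2, 1, embed_dim) * 0.02)     # quality class tokens
        self.visual_prompt = nn.Parameter(torch.randn(1, embed_dim) * 0.02)    # prompt on the image side
        self.text_proj = nn.Linear(embed_dim, embed_dim)                       # stand-in text encoder

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, D) features from a frozen image encoder
        img = F.normalize(image_feats + self.visual_prompt, dim=-1)
        txt = self.text_proj(torch.cat([self.text_ctx, self.class_tok], dim=1).mean(dim=1))
        txt = F.normalize(txt, dim=-1)        # (2, D): "good" and "bad" prompt embeddings
        logits = img @ txt.t()                # similarity to each quality prompt
        return logits.softmax(dim=-1)[:, 0]   # probability of "good" as the quality score

model = PromptIQA()
print(model(torch.randn(3, 512)).shape)  # torch.Size([3]): one score per image
```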

StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations

While acoustic expressiveness has long been studied in expressive text-to-speech (ETTS), the inherent expressiveness in text has received insufficient attention, especially for ETTS of artistic works. In this paper, we introduce StoryTTS, a highly expressive TTS dataset that contains rich expressiveness from both the acoustic and textual perspectives, built from the recording of a Mandarin storytelling show. A systematic and comprehensive labeling framework is proposed for textual expressiveness. Drawing on linguistics, rhetoric, etc., we analyze and define speech-related textual expressiveness in StoryTTS to include five distinct dimensions. Then we employ large language models and prompt them with a few manual annotation examples for batch annotation. The resulting corpus contains 61 hours of consecutive and highly prosodic speech equipped with accurate text transcriptions and rich textual expressiveness annotations. Therefore, StoryTTS can aid future ETTS research to fully mine the abundant intrinsic textual ......
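
The batch-annotation step lends itself to a small sketch: build a few-shot prompt from manually labeled examples and ask an LLM to label a batch of sentences on every dimension. The dimension names below are placeholders (the excerpt does not enumerate the five), and `call_llm` is a stub for whatever chat-completion API is used.

```python
import json

# Placeholder dimension names; the paper's actual five dimensions are not
# listed in the excerpt.
DIMENSIONS = ["rhetoric", "scene", "character_imitation", "emotion", "focus"]

FEW_SHOT = [
    {"text": "他猛地一拍桌子,喝道:好!",
     "labels": {"rhetoric": "exclamation", "emotion": "excited"}},
]

def build_prompt(sentences):
    """Few-shot prompt asking an LLM to label each sentence on every dimension."""
    lines = ["Label each sentence on these dimensions: " + ", ".join(DIMENSIONS),
             "Answer with one JSON object per sentence.", "", "Examples:"]
    for ex in FEW_SHOT:
        lines.append(f"Sentence: {ex['text']}")
        lines.append(f"Labels: {json.dumps(ex['labels'], ensure_ascii=False)}")
    lines.append("")
    lines.append("Sentences to label:")
    lines += [f"{i + 1}. {s}" for i, s in enumerate(sentences)]
    return "\n".join(lines)

def annotate_batch(sentences, call_llm):
    """One batched request; expects one JSON object per line in the reply."""
    reply = call_llm(build_prompt(sentences))
    return [json.loads(line) for line in reply.strip().splitlines()]

print(build_prompt(["说书人压低了声音。"]))
```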

Music Style Transfer With Diffusion Model

Previous studies on music style transfer have mainly focused on one-to-one style conversion, which is relatively limited. When converting between multiple styles, previous methods required designing multiple modes to disentangle the complex style of the music, resulting in large computational costs and slow audio generation. Moreover, existing music style transfer methods generate spectrograms with artifacts, leading to significant noise in the generated audio. To address these issues, this study proposes a music style transfer framework based on diffusion models (DM) and uses spectrogram-based methods to achieve multi-to-multi music style transfer. The GuideDiff method is used to restore spectrograms to high-fidelity audio, accelerating audio generation and reducing noise in the generated audio. Experimental results show that our model performs well in multi-mode music style transfer compared to the baseline and can generate high-quality audio in real-......
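
A skeleton of the spectrogram-based pipeline, assuming `torchaudio` is available, might look as follows: waveform to mel spectrogram, a placeholder where the diffusion style-transfer model would run, and Griffin-Lim standing in for the GuideDiff spectrogram-to-audio stage, whose details the excerpt does not give.

```python
import torch
import torchaudio.transforms as T

SR, N_FFT, HOP, N_MELS = 22050, 1024, 256, 80

to_mel = T.MelSpectrogram(sample_rate=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
mel_to_linear = T.InverseMelScale(n_stft=N_FFT // 2 + 1, n_mels=N_MELS, sample_rate=SR)
griffin_lim = T.GriffinLim(n_fft=N_FFT, hop_length=HOP)

def style_transfer(mel: torch.Tensor, target_style: int) -> torch.Tensor:
    """Placeholder: a real diffusion model would denoise toward the target style."""
    return mel

waveform = torch.randn(1, SR * 2)                 # two seconds of dummy audio
mel = to_mel(waveform)
stylized = style_transfer(mel, target_style=3)
audio_out = griffin_lim(mel_to_linear(stylized))  # stand-in spectrogram inversion
print(audio_out.shape)
```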

Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion

Recently, integrating visual controls into text-to-image (T2I) models, such as the ControlNet method, has received significant attention for finer control capabilities. While various training-free methods make efforts to enhance prompt following in T2I models, the issue with visual control is still rarely studied, especially in the scenario where visual controls are misaligned with text prompts. In this paper, we address the challenge of "Prompt Following With Visual Control" and propose a training-free approach named Mask-guided Prompt Following (MGPF). Object masks are introduced to distinguish the aligned and misaligned parts of visual controls and prompts. Meanwhile, a network, dubbed Masked ControlNet, is designed to utilize these object masks for object generation in the misaligned visual control region. Further, to improve attribute matching, a simple yet efficient loss is designed to align the attention maps of attributes with object regions constrained by ControlNet and obje......
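
The attribute-matching loss is named but not specified in the excerpt. As an illustrative stand-in, the sketch below penalizes the cross-attention mass of an attribute token that falls outside its object mask, which captures the stated goal of aligning attention maps with object regions.

```python
import torch

def mask_alignment_loss(attn_map: torch.Tensor, obj_mask: torch.Tensor) -> torch.Tensor:
    """attn_map: (H, W) cross-attention of one attribute token; obj_mask: (H, W) in {0, 1}."""
    attn = attn_map / (attn_map.sum() + 1e-8)  # normalize to a distribution
    return (attn * (1.0 - obj_mask)).sum()     # attention mass leaking outside the object

attn = torch.rand(64, 64)
mask = torch.zeros(64, 64)
mask[16:48, 16:48] = 1.0                       # object region from the object mask
print(float(mask_alignment_loss(attn, mask)))         # diffuse attention: high loss
print(float(mask_alignment_loss(attn * mask, mask)))  # attention inside the mask: zero loss
```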

NExT: Teaching Large Language Models to Reason about Code Execution

A fundamental skill among human developers is the ability to understand and reason about program execution. For example, a programmer can mentally simulate code execution in natural language to debug and repair code (a.k.a. rubber duck debugging). However, large language models (LLMs) of code are typically trained on the surface textual form of programs, and thus may lack a semantic understanding of how programs execute at run-time. To address this issue, we propose NExT, a method to teach LLMs to inspect the execution traces of programs (variable states of executed lines) and reason about their run-time behavior through chain-of-thought (CoT) rationales. Specifically, NExT uses self-training to bootstrap a synthetic training set of execution-aware rationales that lead to correct task solutions (e.g., fixed programs) without laborious manual annotation. Experiments on program repair tasks based on MBPP and HumanEval demonstrate that NExT improves the fix rate of a PaLM 2 model, b......
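
To make "execution traces (variable states of executed lines)" concrete, here is a minimal sketch of collecting such a trace with Python's `sys.settrace`; the exact trace format NExT feeds to the model may differ.

```python
import sys

def trace_execution(func, *args):
    """Run func(*args) and record (line number, local variables) per executed line."""
    trace = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, trace

def buggy_mean(xs):
    total = 0
    for x in xs:
        total += x
    return total / (len(xs) - 1)  # off-by-one bug for the model to spot

result, trace = trace_execution(buggy_mean, [2, 4, 6])
for lineno, local_vars in trace:
    print(f"line {lineno}: {local_vars}")
print("result:", result)  # 6.0 instead of 4.0, visible from the trace
```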

Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation

A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions. With efficiency being a high priority for scaling such models, we observed that the state-of-the-art method Mask2Former uses ~50% of its compute only on the transformer encoder. This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer. With this observation, we propose a strategy termed PROgressive Token Length SCALing for Efficient transformer encoders (PRO-SCALE) that can be plugged into Mask2Former-style segmentation architectures to significantly reduce the computational cost. The underlying principle of PRO-SCALE is: progressively scale the length of the tokens with the layers of the encoder. This allows PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance (~52% GFLOPs reduction with no drop in performan......
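
One way to read "progressively scale the length of the tokens with the layers of the encoder" is sketched below: the first layer sees only the coarsest backbone scale, and finer-scale tokens are appended at later layers, so compute grows with depth instead of being paid in full at every layer. The injection schedule is an assumption, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ProgressiveEncoder(nn.Module):
    """Token length grows with encoder depth: one extra feature scale per layer."""
    def __init__(self, dim=256, n_heads=8, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, scales):
        # scales: list of (B, N_i, D) token sets, coarsest first, one per layer
        tokens = scales[0]
        for i, layer in enumerate(self.layers):
            if i > 0:
                tokens = torch.cat([tokens, scales[i]], dim=1)  # append a finer scale
            tokens = layer(tokens)
        return tokens

enc = ProgressiveEncoder()
coarse, mid, fine = (torch.randn(1, n, 256) for n in (100, 400, 1600))
print(enc([coarse, mid, fine]).shape)  # torch.Size([1, 2100, 256])
```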

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks. To this end, we release OpenELM, a state-of-the-art open language model. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. For example, with a parameter budget of approximately one billion parameters, OpenELM exhibits a 2.36% improvement in accuracy compared to OLMo while requiring $2\times$ fewer pre-training tokens. Diverging from prior practices that only provide model weights and inference code, and pre-train on private datasets, our release includes the complete framework for training and evaluation of the language model on publicly available datasets, including training logs, multiple checkpoints, and pre-training configurations. We also release ......
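
A layer-wise scaling rule of the kind described can be sketched as a simple interpolation of per-layer widths across depth, so early layers are thinner and later layers wider under a fixed parameter budget. The endpoint values below are illustrative, not OpenELM's actual configuration.

```python
def layer_wise_scaling(n_layers, head_range=(4, 16), ffn_mult_range=(0.5, 4.0)):
    """Linearly interpolate per-layer attention heads and FFN width multipliers."""
    configs = []
    for i in range(n_layers):
        t = i / max(n_layers - 1, 1)  # 0.0 at the first layer, 1.0 at the last
        n_heads = round(head_range[0] + t * (head_range[1] - head_range[0]))
        ffn_mult = ffn_mult_range[0] + t * (ffn_mult_range[1] - ffn_mult_range[0])
        configs.append({"layer": i, "n_heads": n_heads, "ffn_mult": round(ffn_mult, 2)})
    return configs

for cfg in layer_wise_scaling(8):
    print(cfg)
```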

Align Your Steps: Optimizing Sampling Schedules in Diffusion Models

Diffusion models (DMs) have established themselves as the state-of-the-art generative modeling approach in the visual domain and beyond. A crucial drawback of DMs is their slow sampling speed, relying on many sequential function evaluations through large neural networks. Sampling from DMs can be seen as solving a differential equation through a discretized set of noise levels known as the sampling schedule. While past works primarily focused on deriving efficient solvers, little attention has been given to finding optimal sampling schedules, and the entire literature relies on hand-crafted heuristics. In this work, for the first time, we propose a general and principled approach to optimizing the sampling schedules of DMs for high-quality outputs, called $\textit{Align Your Steps}$. We leverage methods from stochastic calculus and find optimal schedules specific to different solvers, trained DMs and datasets. We evaluate our novel approach on several image, video as well as 2......
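
For context, a sampling schedule is just the discrete list of noise levels the solver steps through. Below is the widely used hand-crafted rho-schedule of Karras et al., the kind of heuristic this paper replaces with schedules optimized per solver, model, and dataset.

```python
import numpy as np

def karras_schedule(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Hand-crafted noise-level schedule: interpolate in sigma^(1/rho) space."""
    i = np.arange(n_steps)
    inv_rho = 1.0 / rho
    sigmas = (sigma_max**inv_rho
              + i / (n_steps - 1) * (sigma_min**inv_rho - sigma_max**inv_rho)) ** rho
    return np.append(sigmas, 0.0)  # end exactly at zero noise

print(np.round(karras_schedule(10), 4))
```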

SnapKV: LLM Knows What You are Looking for Before Generation

Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an "observation" window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when proc......
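
Based only on the abstract, the per-head selection step might look like the sketch below: score prefix positions by the attention they receive from the observation window at the end of the prompt, pool the scores so that clustered positions survive together, and keep the top-scoring entries plus the window itself. Window size, pooling kernel, and budget are assumptions.

```python
import torch

def snapkv_compress(keys, values, attn, window=32, keep=128):
    """keys/values: (T, D); attn: (T, T) post-softmax attention for one head."""
    prefix_len = keys.shape[0] - window
    # How strongly the observation window attends to each prefix position.
    scores = attn[prefix_len:, :prefix_len].sum(dim=0)          # (prefix_len,)
    scores = torch.nn.functional.max_pool1d(                    # keep clusters, not lone spikes
        scores[None, None], kernel_size=7, stride=1, padding=3)[0, 0]
    top = scores.topk(min(keep, prefix_len)).indices.sort().values   # preserve original order
    idx = torch.cat([top, torch.arange(prefix_len, keys.shape[0])])  # selected prefix + window
    return keys[idx], values[idx]

T, D = 1024, 64
attn = torch.softmax(torch.randn(T, T), dim=-1)
k_small, v_small = snapkv_compress(torch.randn(T, D), torch.randn(T, D), attn)
print(k_small.shape)  # torch.Size([160, 64]): 128 selected positions + 32-token window
```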

Towards smaller, faster decoder-only transformers: Architectural variants and their implications

Research on Large Language Models (LLMs) has recently seen exponential growth, largely focused on transformer-based architectures, as introduced by [1] and further advanced by the decoder-only variations in [2]. Contemporary studies typically aim to improve model capabilities by increasing both the architecture's complexity and the volume of training data. However, research exploring how to reduce model sizes while maintaining performance is limited. This study introduces three modifications to the decoder-only transformer architecture: ParallelGPT (p-gpt), LinearlyCompressedGPT (lc-gpt), and ConvCompressedGPT (cc-gpt). These variants achieve comparable performance to conventional architectures in code generation tasks while benefiting from reduced model sizes and faster training times. We open-source the model weights and codebase to support future research and development in this domain. ......
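
The excerpt does not define the three variants, so the following is a speculative reading of LinearlyCompressedGPT (lc-gpt) only: a linear down-projection between decoder blocks so that later layers run at a smaller hidden size, trading a little capacity for fewer parameters and faster steps.

```python
import torch
import torch.nn as nn

class CompressedDecoder(nn.Module):
    """Speculative lc-gpt-style decoder: wide front block, linear bottleneck, narrow back block."""
    def __init__(self, vocab=32000, d_wide=512, d_narrow=256, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_wide)
        self.front = nn.TransformerEncoderLayer(d_wide, n_heads, batch_first=True)
        self.compress = nn.Linear(d_wide, d_narrow)  # the assumed "linear compression"
        self.back = nn.TransformerEncoderLayer(d_narrow, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_narrow, vocab)

    def forward(self, ids):
        causal = nn.Transformer.generate_square_subsequent_mask(ids.shape[1])
        x = self.front(self.embed(ids), src_mask=causal)
        x = self.back(self.compress(x), src_mask=causal)
        return self.lm_head(x)

model = CompressedDecoder()
print(model(torch.randint(0, 32000, (2, 16))).shape)  # torch.Size([2, 16, 32000])
```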

When managing Python environments with Anaconda, it is essential that VS Code starts with the correct Conda environment. If VS Code is launched directly from a desktop shortcut or by other means, it may not inherit the environment variables of the currently activated Conda environment, causing module imports to fail. To avoid this, launch VS Code from the Anaconda Prompt so that it inherits the correct environment variables.

Concretely: open the Anaconda Prompt and run `code .`. This starts VS Code with the environment variables of the currently activated Conda environment, so VS Code automatically picks up that environment's Python interpreter and installed packages, including dependencies such as `torch`[^1].

If you want to use the Anaconda Prompt as the default terminal inside VS Code, make sure the Anaconda path was added to the system environment variables during installation. Also locate `python.exe` in the Anaconda installation directory and check its properties to confirm that the "Run as administrator" option is not enabled; otherwise, loading the environment variables may fail for permission reasons every time the Anaconda Prompt starts[^2].

To integrate an Anaconda Prompt terminal into VS Code, add a custom terminal configuration in the settings. For example, add the following to `settings.json`:

```json
{
    "terminal.integrated.shell.windows": "C:\\Windows\\System32\\cmd.exe",
    "terminal.integrated.shellArgs.windows": ["/K", "F:\\anaconda3\\Scripts\\activate.bat", "F:\\anaconda3"]
}
```

Adjust the paths above to match your actual installation. This lets you use the Anaconda Prompt directly inside VS Code, keeping environment variables consistent[^3].

In some cases, users run into terminal display problems such as broken color rendering or a non-working paste function. These issues are usually related to Windows console settings, and enabling the legacy console mode can be a temporary workaround. However, this may affect other terminal programs, so weigh the trade-offs before deciding[^4].

With the steps above, VS Code launched from the Anaconda Prompt will use the correct Python environment, avoiding module import errors caused by a mismatched interpreter path.