Engaging Image Captioning via Personality

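This post summarizes a personality-conditioned approach to image captioning. The paper builds the PERSONALITY-CAPTIONS dataset and uses both retrieval and generative models to produce engaging image descriptions. The generative models are trained with SCST, and a new model, TransResNet, is designed for caption retrieval.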

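Original paper

Published: CVPR 2019 (arXiv 2018)

The model architecture figures in the paper are very clear: trained, pretrained, and frozen components are all explicitly labeled.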

Intro

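Typical image captioning systems produce obvious, personality-free captions. Humans, by contrast, aim for engaging and effective captions that avoid stating the obvious. This work generates engaging captions by incorporating personality. It builds the PERSONALITY-CAPTIONS dataset, which contains 241,858 captions, each conditioned on a personality trait.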

PERSONALITY-CAPTIONS

Dataset

Models

The paper considers two kinds of caption models: retrieval models and generative models.

Image Encoders

Two pretrained image encoders are used: ResNet152 and ResNeXt 32×48d.

caption generation models

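The paper reimplements three commonly used state-of-the-art image captioning models: SHOWTELL, SHOWATTTELL, and UPDOWN.

Image and Personality Encoders: the image encoders above produce a 2048-dimensional vector for SHOWTELL and a 7×7×2048 feature map for SHOWATTTELL and UPDOWN. In every model, the image features are ultimately reduced to a 512-dimensional vector. SHOWTELL does this with a linear projection. The other two models first apply a 1×1 convolution to obtain a 7×7×512 tensor, then use an attention mechanism to collapse the 7×7 regions into a single 1×1 (512-dimensional) vector. When personality traits are used, each trait is encoded as a 512-dimensional vector, much like a word embedding.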
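The encoder step above can be sketched in plain Python as below. It applies a 1×1 convolution down to fewer channels, uses attention to collapse the 7×7 grid into one vector, and looks up a trait embedding. This is a minimal, dimension-generic illustration, not the paper's code. The helper names are hypothetical and biases are omitted. The attention query (for example a decoder hidden state) and the plain dot-product scoring are assumptions of this sketch.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # shape: (out_dim, in_dim)


def linear(weight: Matrix, x: Vector) -> Vector:
    """y = W x (bias omitted in this sketch)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def conv1x1(weight: Matrix, grid: List[Vector]) -> List[Vector]:
    """A 1x1 convolution applies the same linear map at every spatial location.

    `grid` holds H*W region vectors (49 for a 7x7 map), each of size C_in
    (2048 in the text); each output region has C_out channels (512).
    """
    return [linear(weight, region) for region in grid]


def softmax(scores: Vector) -> Vector:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(regions: List[Vector], query: Vector) -> Vector:
    """Collapse H*W regions into one vector with dot-product attention.

    ASSUMPTION: the query (e.g. a decoder hidden state) and dot-product
    scoring are illustrative choices, not taken from the paper.
    """
    weights = softmax([sum(q * r for q, r in zip(query, reg)) for reg in regions])
    dim = len(regions[0])
    return [sum(w * reg[d] for w, reg in zip(weights, regions)) for d in range(dim)]


def personality_embedding(table: Matrix, trait_id: int) -> Vector:
    """Each trait is one learned row of an embedding table, like a word embedding."""
    return table[trait_id]


# Toy usage with tiny dimensions (C_in = 4 -> C_out = 3) instead of 2048 -> 512.
grid = [[float(i), 1.0, 0.0, 2.0] for i in range(49)]  # 7x7 = 49 regions
w = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
regions = conv1x1(w, grid)
pooled = attention_pool(regions, query=[0.0, 0.0, 0.0])  # zero query -> uniform weights
```

With a zero query, every region gets weight 1/49, so `pooled` is simply the mean region. A query aligned with one region concentrates almost all weight on it.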

Caption decoders: the decoders differ somewhat from those in the original models; see the paper for details.

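Training and inference: the models are trained with the two-stage SCST procedure. Cross-entropy pretraining is followed by fine-tuning with self-critical sequence training (SCST).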
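The two training objectives can be sketched as follows, assuming the usual SCST setup. Stage 1 minimizes the cross-entropy of the ground-truth caption. Stage 2 uses the reward of the greedily decoded caption as a baseline for a sampled caption. The function names are hypothetical, and the reward metric is left abstract (the original SCST work uses CIDEr). In a real framework, the advantage is treated as a constant, so gradients flow only through the log-probabilities.

```python
from typing import List


def cross_entropy_loss(target_log_probs: List[float]) -> float:
    """Stage 1: negative log-likelihood of the ground-truth caption.

    target_log_probs[t] = log p(w*_t | w*_<t, I, P) under the model.
    """
    return -sum(target_log_probs)


def scst_loss(sample_log_probs: List[float],
              sample_reward: float,
              greedy_reward: float) -> float:
    """Stage 2: self-critical policy-gradient loss for one sampled caption.

    The greedy-decoded caption's reward is the baseline. When this loss is
    minimised, a sample that beats greedy decoding (positive advantage) has
    its log-probability pushed up, and a worse sample is pushed down.
    ASSUMPTION: the reward can be any caption metric; it is passed in as a number.
    """
    advantage = sample_reward - greedy_reward  # treated as a constant w.r.t. gradients
    return -advantage * sum(sample_log_probs)


# Toy numbers: the sampled caption beats the greedy caption by 0.3 reward.
loss = scst_loss([-1.0, -1.0], sample_reward=0.8, greedy_reward=0.5)
```

Subtracting the greedy reward means the model is only rewarded for samples that improve on its own test-time decoding. This reduces the variance of the policy gradient without a separately learned baseline.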

caption retrieval models

The paper designs a simple retrieval architecture called TransResNet. It maps images, personalities, and captions into the same space $S$.

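Image and Personality Encoders: the 2048-dimensional image features are fed into a multi-layer neural network to obtain a 500-dimensional representation. Each personality trait is encoded as a 500-dimensional vector, and the two vectors are summed.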
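The TransResNet context encoder described above can be sketched as follows. An MLP maps the image feature to $r_I$, a lookup table gives $r_P$, and the two are added. The helper names, the ReLU activation, and the number of layers are assumptions of this sketch. The text only states a multi-layer network with a 500-dimensional output.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # shape: (out_dim, in_dim)
Layer = Tuple[Matrix, Vector]  # (weight, bias)


def linear(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def encode_image(layers: List[Layer], image_feat: Vector) -> Vector:
    """Multi-layer network: 2048-d image feature -> r_I (500-d in the text).

    ASSUMPTION: ReLU between layers and no activation after the last one.
    """
    h = image_feat
    for i, (weight, bias) in enumerate(layers):
        h = linear(weight, bias, h)
        if i < len(layers) - 1:
            h = relu(h)
    return h


def encode_context(layers: List[Layer], trait_table: Matrix,
                   image_feat: Vector, trait_id: int) -> Vector:
    """Context vector r_I + r_P: image encoding plus the trait's embedding row."""
    r_i = encode_image(layers, image_feat)
    r_p = trait_table[trait_id]
    return [a + b for a, b in zip(r_i, r_p)]


# Toy usage with 2-d vectors instead of 2048 -> 500.
layers = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
          ([[2.0, 0.0], [0.0, 2.0]], [0.5, 0.5])]
traits = [[0.0, 0.0], [1.0, 1.0]]
context = encode_context(layers, traits, image_feat=[1.0, -2.0], trait_id=1)
```

Summing rather than concatenating keeps the context in the same dimensionality as the caption encoding, which is what the dot-product score requires.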

Caption encoders: each caption is encoded as a vector by a Transformer followed by two fully connected layers, and matching is done by dot product. The model is trained with a log-likelihood loss over k negative samples. For comparison, a simple bag-of-words encoder is also used. Given an image and personality trait $(I, P)$ and a candidate caption $C$, the score is $s(I,P,C) = (r_I + r_P) \cdot r_C$.
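The scoring function and the bag-of-words baseline can be illustrated with the sketch below. Mean pooling of word vectors and whitespace tokenization are assumptions for illustration, and the Transformer caption encoder itself is not reproduced. Because the score is a dot product, it splits into an image term plus a personality term, $r_I \cdot r_C + r_P \cdot r_C$.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def score(r_i: Vector, r_p: Vector, r_c: Vector) -> float:
    """s(I, P, C) = (r_I + r_P) . r_C"""
    return dot([a + b for a, b in zip(r_i, r_p)], r_c)


def bow_encode(caption: str, embeddings: Dict[str, Vector], dim: int) -> Vector:
    """Bag-of-words caption encoder (ASSUMPTION: mean of word vectors).

    Words missing from `embeddings` are skipped; a caption with no known
    words maps to the zero vector.
    """
    vecs = [embeddings[w] for w in caption.lower().split() if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]


# Toy usage with 2-d embeddings.
emb = {"a": [1.0, 0.0], "cat": [0.0, 1.0]}
r_c = bow_encode("A cat", emb, dim=2)
s = score([1.0, 0.0], [0.0, 1.0], r_c)
```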

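Training and inference: given $I$, $P$, and a set of candidates $(c_1, \dots, c_N)$, the model selects the candidate $c$ with the highest score at inference time. During training, the scores are passed through a softmax layer, and the log-likelihood of the correct caption is maximized. The overall architecture is shown in the figure below.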
[Figure: overall TransResNet architecture]
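Inference and training for the retrieval model reduce to an argmax and a softmax cross-entropy over candidate scores. The sketch below assumes the scores have already been computed with $s(I,P,C)$ and that the list holds the positive caption plus its k negatives. How the negatives are drawn is not covered, and the function names are hypothetical.

```python
import math
from typing import List


def pick_caption(scores: List[float]) -> int:
    """Inference: index of the candidate with the highest score s(I, P, c_i)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def retrieval_loss(scores: List[float], positive: int) -> float:
    """Training: negative log-likelihood of the correct caption under a softmax
    over the scores of the positive and its negatives (log-sum-exp trick)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive]


# Toy usage: candidate 1 scores highest; treat it as the positive.
scores = [0.1, 2.0, -1.0]
best = pick_caption(scores)
loss = retrieval_loss(scores, positive=best)
```

Minimizing this loss is the same as maximizing the softmax log-likelihood of the correct caption. With a single candidate the loss is zero, and it grows as negatives score closer to the positive.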

Experiments

[Figures: experimental results]

Conclusion

The paper proposes models that both understand image content and generate engaging captions. It introduces the new PERSONALITY-CAPTIONS dataset and a new retrieval model, TransResNet.
