Notes on "Detecting Photoshopped Faces by Scripting Photoshop"

This post summarizes a method for detecting the popular face-warping manipulation in Photoshop using a deep learning model trained entirely on automatically generated fake images. The model can predict the specific location of edits and, in some cases, recover the original, unedited image.


Abstract

We present a method for detecting one very popular Photoshop manipulation – image warping applied to human faces – using a model trained entirely using fake images that were automatically generated by scripting Photoshop itself.

We show that our model outperforms humans at the task of recognizing manipulated images, can predict the specific location of edits, and in some cases can be used to “undo” a manipulation to reconstruct the original, unedited image. We demonstrate that the system can be successfully applied to real, artist-created image manipulations.

1.Introduction

In this work, we focus on one specific type of Photoshop manipulation – image warping applied to faces.

Our proposed approach is but one tool in a larger toolbox of techniques that together, could be used to help combat the spread of misinformation, and its effects.

Our approach consists of a CNN carefully trained to detect facial warping modifications in images.

Since there are no large-scale datasets of manually created visual fakes, in this work we solve this problem by using Photoshop itself to automatically generate realistic-looking fake training data.

1.We first collect a large dataset of real face images, from different internet sources (Figure 2a).

2.We then directly script the Face-Aware Liquify tool in Photoshop, which abstracts facial manipulations into high level semantic operations, such as “increase nose width” and “decrease eye distance”.

3.By randomly sampling manipulations in this space (Figure 2b), we obtain a training set consisting of pairs of source images and realistic-looking warped modifications (a parameter-sampling sketch follows below).
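As a rough illustration of the random-sampling step, the sketch below draws random Face-Aware Liquify slider settings in Python and writes them to a JSON file. The slider names, value ranges, and file format are assumptions for illustration only; in the paper the FAL tool is driven directly inside Photoshop through its JavaScript (ExtendScript) scripting interface.

```python
import json
import random

# Hypothetical Face-Aware Liquify slider names; the real FAL parameter set
# lives inside Photoshop and is driven from an ExtendScript (.jsx) script.
FAL_PARAMS = [
    "eyeSize", "eyeDistance", "noseWidth", "noseHeight",
    "mouthWidth", "smile", "faceWidth", "jawShape", "foreheadHeight",
]

def sample_manipulation(num_params=4, strength=50):
    """Randomly pick a few sliders and assign values in [-strength, strength]."""
    chosen = random.sample(FAL_PARAMS, k=num_params)
    return {name: random.randint(-strength, strength) for name in chosen}

if __name__ == "__main__":
    # Six random edits per source image; a Photoshop-side script would read
    # this file and apply each edit to the corresponding face photo.
    edits = [sample_manipulation() for _ in range(6)]
    with open("fal_edits.json", "w") as f:
        json.dump(edits, f, indent=2)
    print(edits[0])
```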

We train both global classification and local warping field prediction networks on this dataset.

In particular, our local prediction method uses a combination of loss functions including flow warping prediction, relative warp preservation, and a pixel-wise reconstruction loss.

2.Related work

Image forensics, or forgery detection is an increasingly important area of research in computer vision.

In this section, we focus on works that are either trained from large amounts of data, or directly address the face domain.

Face manipulation

  • Researchers have proposed forensics methods to detect a variety of face manipulations.
  • Zhou et al. [42] and Roessler et al. [30, 31] propose neural network models to detect face swapping and face reenactment.
  • Other work investigates detecting morphed (interpolated) faces [29] and inconsistencies in lighting from specular highlights on the eye [16].
  • In contrast, we consider facial warps which undergo subtle geometric deformations, rather than a complete replacement of the face, or the synthesis of new details.

Learning photo forensics

  • Our approach belongs to the class of “self-supervised” image forensics approaches that are trained on automatically generated fake images.
  • Chen et al. [11] use a convolutional network to detect median filtering.
  • Zhou et al. [43] propose an object detection model, specifically using steganalysis features to reduce the influence of semantics. The model is pretrained on automatically created synthetic fakes using object segmentations, and subsequently fine-tuned on actual fake images.
  • A complementary approach is exploring unsupervised forensics models that learn only from real images, without explicitly modeling the fake image creation process.
  • For example, several models have been proposed to detect spliced images by identifying patches which come from different camera models [9, 24], by using EXIF metadata [15], or by identifying physical inconsistencies.

Hand-defined manipulation cues

  • Other image forensics work has proposed to detect fake images using hand-defined cues [14].
  • Early work detected resampling artifacts [28, 20] by finding periodic correlations between nearby pixels.
  • There has also been work that detects inconsistent quantization [4], double-JPEG artifacts [8, 5], and geometric inconsistencies [26].
  • However, the operations performed by interactive image editing tools are often complex, and can be difficult to model.
  • Our approach, by contrast, learns features appropriate for its task from a large dataset of manipulated images.

3.Datasets

We obtain a large dataset of real face images from the Open Images dataset [21] and Flickr, and create two datasets of fakes: (1) a large, automatically generated set of manipulated images for training a forensics model, and (2) a smaller set of actual manipulations done by an artist for evaluation.

Generating manipulated face images

  • We script the Face-Aware Liquify (FAL) tool [1] in Adobe Photoshop to generate a variety of face manipulations, using built-in support for JavaScript execution.
  • We randomly modify each image from our real face dataset 6 times.
  • In all, the data we used for training is 1.295M faces – 185K unmodified, and 1.1M modified. Additionally, we hold out 5K real faces each from Open Images and Flickr, leaving half of the images unmodified and the rest modified in the same way as the training data.

Test Set: Artist-created face manipulations

  • We test the generalization ability to “real” manipulations by contracting a professional artist to manipulate 50 real photographs.
  • Half are manipulated with the intent of “beautifying”, or increasing attractiveness, and the other half to change facial expression, positively or negatively. This covers two important use cases.

4.Methods

Our goal is to train a system to detect facial manipulations.
We present two models:

  1. A global classification model, tasked with predicting whether a face has been warped.
  2. A local warp predictor, which can be used to identify where manipulations occur, and reverse them.

4.1 Real-or-fake classification

We first address the question “has this image been manipulated?”

We train a binary classifier using a Dilated Residual Network variant (DRN-C-26) [39].
We investigate the effect of resolution by training low and high-resolution models.
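A minimal sketch of such a binary real-or-fake classifier in PyTorch is shown below. torchvision does not ship DRN-C-26, so an ImageNet-pretrained ResNet-18 is used here as a stand-in backbone; the optimizer and learning rate are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Stand-in backbone: the paper uses DRN-C-26 pretrained on ImageNet, which is
# not available in torchvision, so ResNet-18 is substituted here.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Linear(backbone.fc.in_features, 2)  # 2-way: real / warped

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

def train_step(images, labels):
    """images: (B, 3, H, W) float tensor; labels: 0 = real, 1 = warped."""
    logits = backbone(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```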

  • High-resolution models enable preservation of low-level details, potentially useful for identifying fakes.
  • A lower-resolution model may still contain sufficient detail to identify fakes and can be trained more efficiently.
  • During training, images are randomly flipped left-right and cropped to 384 pixels (low-res) and 640 pixels (high-res), respectively.
  • Real-world use cases may involve unexpected post-processing, and forensics algorithms are often sensitive to such operations [28]. To increase robustness, we use more aggressive data augmentation, including different resizing methods (bicubic and bilinear), JPEG compression, and brightness, contrast, and saturation changes (see the augmentation sketch after this list).
  • We experimentally find that this increases robustness to perturbations at test time, even if they are not in the augmentation set.
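A minimal sketch of this augmentation pipeline using torchvision and PIL, for the low-res (384 px) model. The specific parameter ranges (JPEG quality, jitter strength, resize size) are assumptions; the paper only states which types of augmentation are used.

```python
import io
import random
from PIL import Image
from torchvision import transforms

class RandomJPEG:
    """Re-encode a PIL image as JPEG at a random quality level."""
    def __init__(self, quality_range=(30, 95), p=0.5):
        self.quality_range, self.p = quality_range, p

    def __call__(self, img):
        if random.random() > self.p:
            return img
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=random.randint(*self.quality_range))
        buf.seek(0)
        return Image.open(buf).convert("RGB")

# Flips, resizing with different interpolation, crops, color jitter, and JPEG
# re-compression, roughly in the spirit of the paper's robustness training.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomChoice([
        transforms.Resize(400, interpolation=transforms.InterpolationMode.BILINEAR),
        transforms.Resize(400, interpolation=transforms.InterpolationMode.BICUBIC),
    ]),
    transforms.RandomCrop(384),  # 384 px crops for the low-res model
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    RandomJPEG(),
    transforms.ToTensor(),
])
```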

4.2 Predicting what moved where

Upon detecting whether a face has been modified, a natural question for a viewer is how the image was edited:

  • To do this, we predict an optical flow field $\hat{U} \in \mathbb{R}^{H \times W \times 2}$ from the original image $X_{\text{orig}} \in \mathbb{R}^{H \times W \times 3}$ to the warped image $X$, which we then use to try to “reverse” the manipulation and recover the original image.
  • We train a flow prediction model F to predict the per-pixel warping field, measuring its distance to an approximate “ground-truth” flow field U for each training example.
  • To remove erroneous flow values, we discard pixels that fail a forward-backward consistency test, resulting in a binary mask $M \in \mathbb{R}^{H \times W \times 1}$ (see the consistency-check sketch below).
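A minimal NumPy sketch of the forward-backward consistency test used to build the mask M. The threshold value, the nearest-neighbour sampling, and the (x, y) flow channel ordering are assumptions.

```python
import numpy as np

def fb_consistency_mask(flow_fwd, flow_bwd, thresh=1.0):
    """
    flow_fwd, flow_bwd: (H, W, 2) forward and backward flow fields.
    A pixel is kept if warping forward and then backward returns (close to)
    its starting position; otherwise the "ground-truth" flow is unreliable.
    Returns a binary mask of shape (H, W).
    """
    H, W, _ = flow_fwd.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Where each pixel lands under the forward flow
    x2 = np.clip(xs + flow_fwd[..., 0], 0, W - 1)
    y2 = np.clip(ys + flow_fwd[..., 1], 0, H - 1)
    # Sample the backward flow at that location (nearest neighbour for simplicity)
    bwd = flow_bwd[y2.round().astype(int), x2.round().astype(int)]
    # Round-trip displacement should be ~0 for consistent pixels
    err = np.linalg.norm(flow_fwd + bwd, axis=-1)
    return (err < thresh).astype(np.float32)
```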

Undoing a warp

  • Given the correct flow field from the original image to the modified image, one can retrieve the original image by inverse warping (see the sketch below). This leads to a natural reconstruction loss.
  • Applying only the reconstruction loss leads to ambiguities in low-texture regions, which often results in undesirable artifacts. Instead, we jointly train with all three losses.
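A minimal PyTorch sketch of the inverse-warping step using grid_sample. The flow sign convention (whether the predicted field maps original pixels to their source locations in the warped image, or the reverse) is an assumption here and must match how the flow was defined during training.

```python
import torch
import torch.nn.functional as F

def unwarp(warped, flow):
    """
    warped: (B, 3, H, W) manipulated image.
    flow:   (B, 2, H, W) per-pixel displacement, assumed to map each output
            (original) pixel to its source location in the warped image.
    Returns an estimate of the original, unedited image.
    """
    B, _, H, W = warped.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=warped.dtype, device=warped.device),
        torch.arange(W, dtype=warped.dtype, device=warped.device),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]   # x sampling coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]   # y sampling coordinates
    # Normalize coordinates to [-1, 1] as expected by grid_sample
    grid = torch.stack(
        [2.0 * grid_x / (W - 1) - 1.0, 2.0 * grid_y / (H - 1) - 1.0], dim=-1
    )
    return F.grid_sample(warped, grid, align_corners=True)
```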

Architecture

  • We use a Dilated Residual Network variant (DRN-C-26) [39], pretrained on the ImageNet [32] dataset, as our base network for local prediction.
  • The DRN architecture was designed originally for semantic segmentation, and we found it to work well for the warp prediction task.
  • We found that directly training the flow regression network performed poorly, so we train in two stages (see the sketch below):
    1. Multinomial classification. We first recast the problem as multinomial classification, a formulation commonly used for regression problems (e.g., colorization [22, 40], surface normal prediction [36], and generative modeling [27]).
    2. Regression loss. We then fine-tune with a regression loss. We computed ground-truth flow fields using PWC-Net [33].
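A minimal sketch of the two-stage objective, with the flow quantized into discrete bins for the classification stage and a masked endpoint-error term for the regression stage. The bin count and flow range are assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

NUM_BINS, MAX_FLOW = 21, 10.0  # assumed quantization of each flow component
bin_centers = torch.linspace(-MAX_FLOW, MAX_FLOW, NUM_BINS)

def flow_to_bins(flow_gt):
    """flow_gt: (B, 2, H, W) -> integer bin indices of the same shape."""
    clamped = flow_gt.clamp(-MAX_FLOW, MAX_FLOW)
    return torch.bucketize(clamped, bin_centers).clamp(max=NUM_BINS - 1)

def classification_loss(logits, flow_gt):
    """Stage 1: logits (B, 2*NUM_BINS, H, W), one softmax per flow component."""
    B, _, H, W = logits.shape
    logits = logits.view(B, 2, NUM_BINS, H, W)
    target = flow_to_bins(flow_gt)
    return nn.functional.cross_entropy(
        logits.permute(0, 1, 3, 4, 2).reshape(-1, NUM_BINS),
        target.reshape(-1),
    )

def regression_loss(flow_pred, flow_gt, mask):
    """Stage 2: masked endpoint error between predicted and ground-truth flow."""
    epe = torch.norm(flow_pred - flow_gt, dim=1)  # (B, H, W)
    return (epe * mask).sum() / mask.sum().clamp(min=1)
```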

5.Experiments

We evaluate our ability to detect and undo image manipulations, using both automatic and artist-created images.

5.1 Real-or-fake classification

We first investigate whether manipulated images can be detected by our global classifier on our validation set. We test the robustness of the classifier by perturbing the images, and measure its generalization ability to manipulations by a professional artist.

We evaluate several variants: (1) low-res with augmentation, (2) low-res without augmentation, and (3) high-res with augmentation.

Baselines

  • FaceForensics++
  • Self-consistency

Evaluations

  • Evaluate our model’s raw accuracy
  • Ranking-based scores
  • Average Precision (AP) / Two-Alternative Forced Choice (2AFC) (see the metric sketch below)
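A minimal sketch of these ranking-based metrics. AP pools real and fake scores, while 2AFC here compares each modified image against its own unmodified counterpart; the exact pairing protocol is an assumption about the paper's setup.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def average_precision(scores_real, scores_fake):
    """Ranking-based AP over pooled real (label 0) and fake (label 1) scores."""
    y_true = np.concatenate([np.zeros(len(scores_real)), np.ones(len(scores_fake))])
    y_score = np.concatenate([scores_real, scores_fake])
    return average_precision_score(y_true, y_score)

def paired_2afc(scores_orig, scores_modified):
    """2AFC accuracy: for each (original, modified) pair, the model 'chooses'
    the image with the higher fake score; count how often that is the fake."""
    scores_orig = np.asarray(scores_orig)
    scores_modified = np.asarray(scores_modified)
    return float(np.mean(scores_modified > scores_orig))
```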

Evaluation on auto-generated fakes

Artist test set
We collect data from a professional artist, who either makes the subject more attractive or changes the subject's expression. Since the edits here are made to be more noticeable, study participants were able to identify the modified image with 71.1% accuracy.

Baseline
Neither of these methods is designed for our application.

  • FaceForensics++ is split into three manipulation types: face swapping, “deepfakes” face replacement, and Face2Face reenactment.

  • Self-consistency, on the other hand, is designed to detect low-level differences in image characteristics.

  • Generalizing these methods to facial warping manipulations is challenging.

5.2 Localizing and undoing manipulations

Model variations
We ablate the loss functions:
(1) Our full method: trained with endpoint error (EPE) (Eqn. 1), multiscale gradient (Eqn. 2), and reconstruction (Eqn. 3) losses.

Evaluations
(1) End Point Error (EPE)
(2) Intersection Over Union (IOU-τ)
(3) Delta Peak Signal-to-Noise Ratio (∆PSNR) (see the metric sketch below)
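A minimal NumPy sketch of these three metrics. The IOU threshold τ on flow magnitude and the 8-bit PSNR convention are assumptions about the paper's exact definitions.

```python
import numpy as np

def epe(flow_pred, flow_gt):
    """Mean endpoint error between predicted and ground-truth flow, (H, W, 2)."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())

def iou_at_tau(flow_pred, flow_gt, tau=1.0):
    """IOU between regions whose flow magnitude exceeds tau pixels."""
    p = np.linalg.norm(flow_pred, axis=-1) > tau
    g = np.linalg.norm(flow_gt, axis=-1) > tau
    union = np.logical_or(p, g).sum()
    return float(np.logical_and(p, g).sum() / union) if union else 1.0

def delta_psnr(original, warped, unwarped):
    """PSNR gain of the un-warped image over the warped one, w.r.t. the original."""
    def psnr(a, b):
        mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
        return 10 * np.log10(255.0 ** 2 / mse)
    return psnr(original, unwarped) - psnr(original, warped)
```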

Analysis
We found that directly optimizing the reconstruction loss led to better image reconstructions.

5.3. Out-of-distribution manipulations
While our model is trained to detect face warping manipulations made by Photoshop, we also evaluate its ability to detect other kinds of image editing, and discuss its limitations.

We apply our manipulation detection model to this video data and find that it is still able to make reasonable predictions.

We observe that our low-res model with augmentation produces more stable predictions over time than the one trained without augmentation.

Moreover, the high-res model does not generalize to detecting such manipulations. We note that PSNR comparisons on this data are not possible, due to the addition of non-warping image details.

Social media post-processing pipeline
We evaluate post-processing operations performed by Facebook (e.g., extra JPEG compression). The high-res model does not generalize to this scenario, while both the global and local models trained with augmentation perform better.

Other image editing tools
We also tested our local detection model on facial warping by Facetune [2] and Snapchat Lens Studio [3]. Our model is able to perform reasonable recovery of the edits even though it was not trained on these tools.

Generic Liquify filter
Warping edits that fall outside this scope, such as warping applied to hair or the body, cannot be detected by our method.
Despite this, our method can still predict with success well above chance (64.0% accuracy, 85.6 AP).
