《Utilizing Ensemble Learning for Detecting Multi-Modal Fake News》

Original

Last modified 2024-07-27 23:43:08


License: CC 4.0 BY-SA

Tags: #deep-learning #paper-notes #artificial-intelligence

First published 2024-07-27 23:39:44

Paper Reading Series Index

Table of Contents

Paper Reading Series Index
Meaning of the Paper Title
ABSTRACT
INDEX TERMS
I. INTRODUCTION
II. RELATED WORK
III. METHODOLOGY
IV. EVALUATION
V. CONCLUSION AND FUTURE WORK

Article Link

Meaning of the Paper Title

Detecting multi-modal fake news using ensemble learning

ABSTRACT

The spread of fake news has become a critical problem in recent years due to the extensive use of social media platforms. False stories can go viral quickly, reaching millions of people before they can be mocked, e.g., a false story claiming that a celebrity has died when he or she is still alive. Therefore, detecting fake news is essential for maintaining the integrity of information and controlling misinformation, social and political polarization, media ethics, and security threats. From this perspective, we propose an ensemble learning-based detection of multi-modal fake news. First, it exploits the publicly available dataset Fakeddit, which consists of over 1 million samples of fake news. Next, it leverages Natural Language Processing (NLP) techniques for preprocessing the textual information of news. Then, it gauges the sentiment of the text of each news item. After that, it generates embeddings for the text and images of each news item by leveraging Visual Bidirectional Encoder Representations from Transformers (V-BERT). Finally, it passes the embeddings to the deep learning ensemble model for training and testing. The 10-fold evaluation technique is used to check the performance of the proposed approach. The evaluation results are significant and outperform the state-of-the-art approaches, with performance improvements of 12.57%, 9.70%, 18.15%, 12.58%, 0.10, and 3.07 in accuracy, precision, recall, F1-score, Matthews Correlation Coefficient (MCC), and Odds Ratio (OR), respectively.

INDEX TERMS

Ensemble learning, convolutional neural network, multi-modal fake news, classification, boosted CNN, bagged CNN.

I. INTRODUCTION

The concept of fake news is not new. Its roots have long existed in our society. It refers to false information that is disseminated to mislead or deceive the public. For example, fake news about COVID-19 vaccines could discourage people from getting vaccinated, leading to increased rates of illness and death. In the past, every kind of distinct material was considered fake news, such as satire, conspiracies, news manipulation, and click-bait. However, fake news is now becoming jargon [1] and has a huge impact on the critical events happening in our society, e.g., the spread of fake news (false stories) on social media was a major concern in the 2016 US presidential election [2].
Fake news can spread quickly through social media and other online platforms. It can have serious consequences, such as causing panic, influencing elections, and eroding public trust in legitimate news sources. Individuals need to distinguish real news and critically evaluate sources of information before sharing or responding to them. Additionally, news organizations and social media platforms are responsible for combating the spread of fake news by fact-checking and removing false content. Surveys show that about 70% of Americans use social media as a source of news and circulating information [3]. Accessing news and information on the Internet is low-cost and convenient. However, spreading fake news through these channels is just as straightforward and effortless [4]. Fake news can lead to false assumptions that drastically affect our society. Consequently, it is critical to design an automated fake news detection system.
Many researchers are actively developing new and better methods for identifying and combating the spread of misinformation. Some of the key research areas and trends in this field include deep learning approaches, e.g., Convolutional Neural Network (CNN); linguistic features, e.g., sentiment analysis, topic modeling, and stylometric analysis; source-based approaches, e.g., analyzing the domain name, social media presence, or history of the news source; and ensemble approaches, e.g., combining linguistic, source-based, and deep learning models to create a more robust and accurate detection system. Recent research has identified the issues involved and proposed different solutions; for example, pre-trained language models such as Bidirectional Encoder Representations from Transformers (BERT) [5], OpenAI GPT [6], and ELMo [7] have shown their effectiveness in alleviating feature engineering efforts. Nevertheless, the problem still requires significant performance improvement.
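To make the idea of an ensemble approach concrete, here is a minimal sketch of soft voting over three scorers. It is not the paper's model: the three scoring functions, the `example-news.org` domain, the `model_prob` field, and the 0.5 threshold are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Optional, Sequence

# Each scorer maps a news item to an estimated probability that it is fake.
# All three are hypothetical stand-ins, not the models used in the paper.
Scorer = Callable[[dict], float]


def linguistic_score(item: dict) -> float:
    # Toy heuristic: heavy use of exclamation marks suggests sensational wording.
    return min(1.0, item["text"].count("!") / 3)


def source_score(item: dict) -> float:
    # Toy heuristic: an illustrative allow-list of trusted domains (assumed).
    trusted = {"example-news.org"}
    return 0.1 if item["domain"] in trusted else 0.7


def deep_model_score(item: dict) -> float:
    # Placeholder for the output probability of a trained neural network.
    return item.get("model_prob", 0.5)


SCORERS: Sequence[Scorer] = (linguistic_score, source_score, deep_model_score)


def soft_vote(item: dict, scorers: Sequence[Scorer],
              weights: Optional[Sequence[float]] = None) -> float:
    # Weighted average of the individual probabilities (equal weights by default).
    weights = list(weights) if weights is not None else [1.0] * len(scorers)
    return sum(w * s(item) for w, s in zip(weights, scorers)) / sum(weights)


def predict_fake(item: dict, scorers: Sequence[Scorer],
                 threshold: float = 0.5) -> bool:
    # Flag the item as fake when the averaged probability reaches the threshold.
    return soft_vote(item, scorers) >= threshold
```

Averaging probabilities lets a confident model outweigh an uncertain one, whereas hard majority voting would count every model equally regardless of confidence.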
From this perspective, this paper proposes an ensemble learning-based detection of multi-modal fake news (ELD-FN). It first exploits the publicly available dataset Fakeddit, a novel multi-modal dataset consisting of over 1 million samples from multiple categories of fake news. Second, it leverages Natural Language Processing (NLP) techniques for preprocessing the textual information of news. Third, it gauges the sentiment of the text of each news item. Fourth, it generates embeddings for the text and images of each news item by leveraging V-BERT [8]. Finally, it passes the embeddings to the deep learning ensemble model for training and testing. The 10-fold evaluation technique is used to check the performance of ELD-FN. The evaluation results are significant and outperform the state-of-the-art approaches, with performance improvements of 12.57%, 9.70%, 18.15%, 12.58%, 0.10, and 3.07 in accuracy, precision, recall, F1-score, Matthews Correlation Coefficient (MCC), and Odds Ratio (OR), respectively.
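The 10-fold evaluation mentioned above can be sketched as follows, assuming it refers to standard k-fold cross-validation (the text does not spell out the exact protocol). The helper names `k_fold_indices`, `cross_validate`, and `train_majority` are hypothetical, and the majority-class trainer is only a placeholder for a real model.

```python
import random
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index splits; each sample is tested exactly once."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validate(features: Sequence, labels: Sequence[int],
                   train_fn: Callable, k: int = 10) -> float:
    """Mean test accuracy over k folds; train_fn(X, y) returns a predict(x) function."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        predict = train_fn([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(features[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def train_majority(X: Sequence, y: List[int]) -> Callable:
    # Placeholder model: always predicts the most frequent training label.
    majority = max(set(y), key=y.count)
    return lambda x: majority
```

A call such as `cross_validate(embeddings, labels, train_majority, k=10)` would give a majority-class baseline; swapping in a real training function leaves the evaluation loop unchanged.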
The main contributions made in this paper are as follows.
• The proposed approach integrates news sentiment as a crucial feature and employs ensemble learning to identify multi-modal fake news.
• The evaluation results show that ELD-FN outperforms the baseline approaches, with performance improvements of 12.57%, 9.70%, 18.15%, 12.58%, 0.10, and 3.07 in accuracy, precision, recall, F1-score, MCC, and OR, respectively.
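For readers less familiar with the six reported metrics, the sketch below computes them from binary confusion-matrix counts. It assumes that OR denotes the diagnostic odds ratio, (TP × TN) / (FP × FN), which the text does not state explicitly; the function name `binary_metrics` and the example counts are illustrative and are not results from the paper.

```python
import math


def _div(numerator: float, denominator: float) -> float:
    # Return NaN instead of raising when a metric is undefined.
    return numerator / denominator if denominator else float("nan")


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary classification metrics from confusion-matrix counts."""
    precision = _div(tp, tp + fp)
    recall = _div(tp, tp + fn)
    return {
        "accuracy": _div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": _div(2 * precision * recall, precision + recall),
        # Matthews Correlation Coefficient: ranges from -1 to 1.
        "mcc": _div(tp * tn - fp * fn,
                    math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
        # Assumed definition: diagnostic odds ratio.
        "odds_ratio": _div(tp * tn, fp * fn),
    }


# Illustrative counts only.
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=90)
```

MCC uses all four cells of the confusion matrix, so it stays informative on imbalanced data where accuracy alone can be misleading.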
The rest of the paper is organized as follows. Section II discusses the research background. Section III describes the details of ELD-FN. Section IV describes the evaluation methods for ELD-FN, the obtained results, and their threats to validity. Section V summarizes the paper and suggests future work.

II. RELATED WORK

Although extensive research on fake news detection has been performed [9]–[23], most of it is conducted on textual data or uni-modal features. However, the two most relevant studies [24], [25] proposed deep learning-based solutions for detecting fake news. The proposed approach (ELD-FN) differs from these baseline approaches in that it not only works with multi-modal features but also considers the sentiment expressed in the textual information of news.
Most of the state-of-the-art fake news classification approaches can be categorized as follows: 1) fake news classification approaches for single-modality and 2) fake news classification approaches for multi-modality.

A. FAKE NEWS CLASSIFICATION APPROACHES FOR SINGLE-MODALITY

The fake news classification approaches for single-modality can be further divided into two categories, depending on whether they use text or image features.

1) SINGLE-MODALITY BASED CLASSIFICATION APPROACHES USING TEXTUAL FEATURES

Textual features can be divided into generic and latent categories. Traditional machine learning algorithms usually utilize generic textual features. These algorithms analyze text at linguistic levels such as lexicon, syntax, discourse, and semantics. Previous research has compiled a detailed table summarizing these features [10]. Latent textual features, by contrast, consist of embeddings extracted from the textual data of news at the word, sentence, or document level. Latent vectors are constructed from the textual news data and are then used as input for classifiers, e.g., SVM.
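The difference between the two feature families can be illustrated with a small sketch. The generic features below are simple lexical statistics, and the latent vector is an average of word embeddings; the `TOY_EMBEDDINGS` table, its two dimensions, and the specific feature names are hypothetical and chosen only for illustration. In practice the embeddings would come from a trained model, and the resulting vector would be passed to a classifier such as an SVM.

```python
import string
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    words = (token.strip(string.punctuation).lower() for token in text.split())
    return [word for word in words if word]


def generic_features(text: str) -> Dict[str, float]:
    """Hand-crafted lexical statistics (one example of generic textual features)."""
    words = tokenize(text)
    count = len(words)
    return {
        "num_words": float(count),
        "avg_word_len": sum(len(w) for w in words) / count if count else 0.0,
        "num_exclamations": float(text.count("!")),
        "type_token_ratio": len(set(words)) / count if count else 0.0,
    }


# Hypothetical two-dimensional word embeddings, for illustration only.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "celebrity": [0.9, 0.1],
    "dies": [0.8, 0.3],
    "budget": [0.1, 0.9],
    "approved": [0.2, 0.8],
}


def latent_vector(text: str, embeddings: Dict[str, List[float]] = TOY_EMBEDDINGS,
                  dim: int = 2) -> List[float]:
    """Document-level latent vector: the mean of the known word embeddings."""
    known = [embeddings[w] for w in tokenize(text) if w in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
```

Generic features are interpretable but must be designed by hand, while latent vectors are learned representations whose individual dimensions carry no direct linguistic meaning.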
Recurrent neural networks (RNNs) are well suited to modeling and analyzing sequential data. For example, Ma et al. used RNNs to capture relevant information over time by learning hidden layer representations [11]. Meanwhile, Chen et al. proposed a CNN-based approach for the classification [12]. Moreover, a novel technique, the Attention-Residual Network (ARC), was introduced to acquire long-range features. Ma et al. introduced a Generative Adversarial Network (GAN)-based model that employs a Generator network based on Gated Recurrent Units (GRU) to generate contentious instances. Furthermore, a Discriminat
