Multi-Modal Features Representation-Based Convolutional Neural Network Model for Malicious Website

Article link: https://blog.youkuaiyun.com/buyaotutou/article/details/140726419

Index of the Multi-Modal paper reading series

Contents

Index of the Multi-Modal paper reading series
1.Meaning of the Paper Title
2.ABSTRACT
3.INDEX TERMS
4.INTRODUCTION
5.RELATED WORK
6.THE PROPOSED HF-CNN MODEL
7.PERFORMANCE EVALUATION
8.RESULTS AND DISCUSSION
9.CONCLUSION AND FUTURE WORKS
10.ACKNOWLEDGMENT

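1.Meaning of the Paper Title

A convolutional neural network model for malicious website detection, based on multi-modal feature representation.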

2.ABSTRACT

Web applications have proliferated across business sectors and serve as essential tools for billions of users in their daily activities. However, many of these applications are malicious and pose a major threat to Internet users: they can steal sensitive information, install malware, and propagate spam. Detecting malicious websites by analyzing web content is often ineffective because of the complexity of extracting representative features, the huge data volume, the evolving nature of malicious patterns, the stealthy nature of the attacks, and the limitations of traditional classifiers. Uniform Resource Locator (URL) features are static and can often provide immediate insight into a website without loading its content. Relying solely on lexical URL features, however, proves insufficient and can lead to inaccurate classifications. This study proposes a multimodal representation approach that fuses textual and image-based features to improve malicious website detection. Textual features help the deep learning model understand and represent detailed semantic information related to attack patterns, while image features are effective at recognizing more general malicious patterns. As a result, patterns that are hidden in textual form may be recognizable in image form. Two Convolutional Neural Network (CNN) models were constructed to extract hidden features from the textual and image-represented features. The output layers of both models were combined and used as input to an artificial neural network classifier for decision-making. Results show the effectiveness of the proposed model compared to other models: overall performance in terms of the Matthews Correlation Coefficient (MCC) improved by 4.3%, while the false positive rate was reduced by 1.5%.
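The abstract reports results as MCC and false positive rate. As a reminder of how both metrics are derived from a binary confusion matrix, here is a minimal sketch. The function names and the sample counts are assumptions made for illustration and are not taken from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal sum is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of benign samples that were wrongly flagged as malicious."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Hypothetical counts for illustration only, not results from the paper.
tp, tn, fp, fn = 90, 85, 15, 10
print(round(mcc(tp, tn, fp, fn), 3))
print(round(false_positive_rate(fp, tn), 3))
```

MCC uses all four cells of the confusion matrix, which makes it more informative than accuracy when the malicious and benign classes are imbalanced.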

3.INDEX TERMS

Convolutional neural network, malicious URL detection, malicious website detection, multi-modal features representation, URL image representation.

4.INTRODUCTION

According to the Siteefy website [1], there are over 1.11 billion websites in the world, and this number has been growing exponentially in recent years. Every day, 252 thousand new websites are created. As of May 9, 2023, the number of web pages is estimated at more than 50 billion. Although most websites are created for good purposes, many are malicious [2]. Malicious websites are designed to harm users in some way, for example by stealing their personal information or installing malware on their computers. They can be used to spread malware, carry out phishing, distribute spam, or conduct denial-of-service attacks [3]. According to Google's in-depth research, there are an estimated 12.8 million malicious websites on the Internet [4]. Furthermore, the authors of [5] state that 18.5 million websites host malicious code. These numbers change constantly as new malicious websites are created and old ones are taken down.
Malicious website detection has been the subject of much research, and many solutions have been suggested [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23]. Blacklisting is the most common solution used by organizations [24]. However, blacklists are slow to update, and malicious actors can easily bypass them by creating new websites or simply changing the URLs of their existing ones. This makes it difficult for blacklist-based systems to keep up with the ever-changing landscape of malicious websites [25], [26].
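The bypass described above is easy to reproduce with a toy lookup. The sketch below is an illustration under stated assumptions: the blacklist entries, domains, and function names are all made up, and real blacklist services use richer matching than this.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entries; the domains are made up for illustration.
BLACKLIST = {"http://malicious.example/login", "http://bad.example/pay"}


def is_blacklisted(url: str) -> bool:
    """Exact-match lookup, the simplest form of URL blacklisting."""
    return url in BLACKLIST


def is_host_blacklisted(url: str) -> bool:
    """Host-level lookup: stricter, but still blind to newly created domains."""
    blocked_hosts = {urlsplit(entry).hostname for entry in BLACKLIST}
    return urlsplit(url).hostname in blocked_hosts


known = "http://malicious.example/login"
changed_url = "http://malicious.example/login?session=1"
new_domain = "http://malicious2.example/login"

for url in (known, changed_url, new_domain):
    print(url, is_blacklisted(url), is_host_blacklisted(url))
```

Changing the URL defeats the exact-match check, and registering a new domain defeats the host-level check as well, until the list is updated.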
To address the limitations of blacklisting, many researchers have employed machine learning techniques to detect malicious websites. These techniques extract features from web content [27], [28], [29], scripts [15], [16], HTTP(S) responses [29], [30], URLs [6], [7], [8], [9], [10], [11], [12], [13], [14], [31], [32], [33], domain names [25], [34], [35], network traffic data [34], [36], and digital certificates [26]. Machine learning algorithms such as support vector machines, decision trees, logistic regression, and random forests have been used to classify websites as malicious or benign [28], [32]. The effectiveness of machine learning methods depends on the choice of features [13], [14], [17], [18], [19], [20], [21], [22], [23]. However, extracting effective features is challenging because malicious code changes constantly, attackers use obfuscation techniques, the volume of data to be analyzed is huge, and today's attacks are complex. Traditional machine learning is ineffective at extracting useful classification patterns from huge and complex datasets on its own, so effective feature engineering is required to improve detection performance.
Deep learning models are effective in extracting representative features from huge and complex datasets. They can learn features automatically from webpage text data, without the need for intensive manual feature engineering. Convolutional Neural Networks (CNN) [22], Recurrent Neural Networks (RNN) [23], and attention mechanisms are commonly reported methods for malware detection. Many deep learning models are built on features extracted from a website's content. However, acquiring large and diverse datasets from website content for training deep learning models is challenging because web content is dynamic, websites use anti-scraping mechanisms to detect and block automated scrapers, and online threats keep evolving. Some websites require user sessions and authentication to access content, so scraping them may involve simulating user interactions, including logging in. Websites also change their structure and layout frequently, which requires ongoing maintenance of scraping scripts to keep them working correctly. Moreover, extracting representative features from web content may be inefficient on resource-constrained devices such as IoT devices. Although content-based features can be used to detect many types of threats, relying on them is neither effective nor efficient for detecting advanced malicious websites.
URL-based features appear to be a good alternative to web content features. Many researchers have compared models built on both kinds of features and report that URL-based features consistently perform better. However, most existing studies rely solely on lexical features extracted from URLs. Lexical features carry limited semantic information, which leads to sparse feature vectors. Some studies combine URL features with digital certificates to improve detection performance. Malicious websites often lack valid certificates or use self-signed certificates, making certificate analysis a useful indicator of trustworthiness. Analyzing digital certificates can also reveal whether a website employs encryption, which is common practice among reputable sites. However, not all websites use digital certificates, and some may employ self-signed certificates or certificates issued by less reputable Certificate Authorities (CAs). Extracting relevant and meaningful certificate features for machine learning models can be complex, and selecting the right features is crucial for effective detection. In addition, digital certificates can be misconfigured, can expire, and can change frequently, leading to a high rate of false alarms. To sum up, existing solutions that detect malicious web applications through web content analysis often struggle with complex feature extraction, massive data volumes, evolving attack patterns, and the limitations of traditional classifiers, while relying solely on lexical URL features proves insufficient and can lead to inaccurate classifications.
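To make the sparsity problem of lexical URL features concrete, the following minimal sketch builds a bag-of-words vector over URL tokens. The tokenization rule, the function names, and the sample URLs are assumptions chosen for illustration; the studies cited above may extract lexical features differently.

```python
import re
from collections import Counter


def tokenize(url: str) -> list[str]:
    """Split a URL on non-alphanumeric delimiters and lowercase the tokens."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok]


def build_vocabulary(urls: list[str]) -> list[str]:
    """Sorted list of every distinct token seen in the corpus."""
    return sorted({tok for url in urls for tok in tokenize(url)})


def bag_of_words(url: str, vocab: list[str]) -> list[int]:
    """Token-count vector aligned with the vocabulary order."""
    counts = Counter(tokenize(url))
    return [counts.get(tok, 0) for tok in vocab]


# Made-up URLs for illustration.
urls = [
    "http://secure-login.example.com/account/verify",
    "https://news.example.org/world/2023/article",
    "http://free-gift.example.net/claim?id=42",
]
vocab = build_vocabulary(urls)
vector = bag_of_words(urls[0], vocab)
sparsity = vector.count(0) / len(vector)
print(len(vocab), vector, round(sparsity, 2))
```

Even with three URLs, most entries of each vector are zero. With a realistic corpus the vocabulary grows to many thousands of tokens while each URL still contains only a handful, and the vector records which tokens occur but nothing about what they mean.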
To address these challenges, this study proposes a novel multimodal representation approach that integrates textual and image-based features to enhance malicious website detection. The approach leverages the strengths of both modalities: textual features capture detailed semantic information related to attack patterns, and image features recognize broader malicious visual cues. Hidden patterns within textual content may become discernible through image analysis.
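This part of the paper does not say how a URL is turned into an image. One simple possibility, shown here purely as an assumption and not as the authors' method, is to treat the URL's bytes as grayscale pixel intensities and lay them out on a fixed-size grid. The function name and the grid size are hypothetical.

```python
def url_to_grayscale_grid(url: str, side: int = 8) -> list[list[int]]:
    """Map URL bytes to a side x side grid of pixel intensities (0-255)."""
    data = url.encode("utf-8")[: side * side]      # truncate long URLs
    data = data + bytes(side * side - len(data))   # zero-pad short URLs
    return [list(data[r * side:(r + 1) * side]) for r in range(side)]


grid = url_to_grayscale_grid("http://example.com/login")
for row in grid:
    print(row)
```

A 2-D layout like this lets a convolutional filter see characters that are far apart in the string as neighbours in the grid, which is one intuition for why an image view might expose patterns that a purely sequential text view misses.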
The proposed approach employs two Convolutional Neural Networks (CNNs): one for textual features and another for image features. Their outputs are then combined and fed into an artificial neural network classifier for improved decision-making. Our results demonstrate the superiority of the proposed model over existing approaches. We achieve a 4.3% increase in Matthews Correlation Coefficient (MCC) and a 1.5% reduction in the false-positive rate, showcasing the effectiveness of our multimodal approach in accurately identifying malicious web applications.
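The fusion step can be sketched as follows. This is a dependency-free illustration with untrained random weights. It assumes the two CNN outputs are combined by concatenation and classified by a single hidden layer; the text above does not specify the fusion operator, the layer sizes, or the activations, so all of those, along with the function names, are hypothetical.

```python
import math
import random


def relu(z: float) -> float:
    return max(0.0, z)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, bias, activation):
    """Fully connected layer: activation(W x + b)."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def random_layer(rng, n_in, n_out):
    """Untrained weights, used only to show how the shapes fit together."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_classify(text_features, image_features, hidden, output):
    """Concatenate both CNN feature vectors, then apply a small ANN classifier."""
    fused = list(text_features) + list(image_features)  # assumed fusion: concatenation
    h = dense(fused, *hidden, relu)
    return dense(h, *output, sigmoid)[0]  # score in (0, 1); higher = more suspicious


rng = random.Random(0)
text_features = [0.2, 0.7, 0.1, 0.9]   # stand-in for the textual CNN output
image_features = [0.5, 0.3, 0.8]       # stand-in for the image CNN output
hidden = random_layer(rng, n_in=7, n_out=5)
output = random_layer(rng, n_in=5, n_out=1)
print(fuse_and_classify(text_features, image_features, hidden, output))
```

In a trained model the two feature vectors would come from the CNN branches and all weights would be learned jointly or in stages. The sketch only shows the data flow: two modality-specific representations, one combined vector, one decision.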
This study makes the following contributions:
（1）Integrating DNS-de