A First Look at Machine Learning: DGA Domain Detection with LSTM, Decision Trees, and More

Latest recommended article published on 2025-06-20 17:11:39

fured

Licensed under CC 4.0 BY-SA

Tags: machine learning, lstm, decision tree, python, support vector machine, random forest

Article link: https://blog.youkuaiyun.com/fured/article/details/132314208

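This article introduces the background of DGA domain detection, including where DGAs are used and how they can be detected. It focuses on binary classification with decision tree, random forest, and LSTM neural network models. Features of the domain string, such as its length and its ratio of unique characters, are used to train models that separate black samples from white samples. The experiments show that the LSTM model performs well without any manual feature engineering.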

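1 Background

What DGA is: Domain Generation Algorithm, an algorithm that generates domain names.

Usage scenario:

After compromising a server, an attacker needs the compromised machine to communicate with a command-and-control (C&C) server, which controls it and issues commands for it to carry out. This communication goes through domain names. If the attacker used one fixed domain, operations staff could easily add it to a blacklist and block it. To get around blacklists, the attacker runs the same DGA on both the controlled machine and the device issuing commands, so both sides generate the same random-looking domains to communicate over, and static blocking no longer works. Domains from a simple DGA can be predicted, but as DGAs grow more complex, prediction becomes harder and harder. A detection method is therefore needed to identify which domains are DGA-generated so that they can be blocked.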
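To make the "same algorithm on both sides" idea concrete, here is a minimal sketch of a toy generator. It is not any real malware's algorithm: the function name `toy_dga`, the seed string, and the hashing scheme are all assumptions made for illustration. It only shows that two parties sharing a seed and a date derive the same domain list independently.

```python
import hashlib
from datetime import date


def toy_dga(seed: str, day: date, count: int = 3, length: int = 12) -> list:
    """Hypothetical toy DGA: derive domains from a shared seed and a date."""
    domains = []
    for index in range(count):
        # Both sides hash the same inputs, so they get the same names
        # without ever exchanging the list.
        material = f"{seed}-{day.isoformat()}-{index}".encode("utf-8")
        digest = hashlib.md5(material).hexdigest()
        domains.append(digest[:length] + ".com")
    return domains


if __name__ == "__main__":
    today = date(2023, 8, 16)
    controlled_machine = toy_dga("shared-secret", today)
    command_side = toy_dga("shared-secret", today)
    print(controlled_machine == command_side)  # the two lists match
    print(controlled_machine)
```

Because the list changes whenever the date changes, a blacklist built from yesterday's domains does not cover today's, which is why detection has to work on the domain itself.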

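Solutions:

(1) Behavior-based detection: inspect the communication behavior associated with a domain. For example, if the packet contents are common attacker commands, the domain is likely DGA-generated. This approach is outside the scope of this article.

(2) Text-based detection: inspect the domain string itself. For example, "q9b281081e53aa9a21120b41ca9cf5c31d.hk" looks DGA-generated at a glance, while "baidu.com" is clearly a real domain.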

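2 Implementation

Samples:

950,000 black samples, i.e. 950,000 DGA domains

1,000,000 white samples, i.e. 1,000,000 normal domains

2.1 Decision tree, random forest, and other machine learning models

I started with decision trees because they are simple and easy to understand.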

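In essence, these models are trained to perform binary classification: a model learns the relationship between the sample data (features) and the class each sample belongs to, and can then classify new inputs.

2.1.1 Feature analysis

(1) Domain length

(2) Number of unique characters

(3) Ratio of unique characters

(4) Randomness (dispersion) of the string, measured by its entropy: normal domains have low randomness

(5) Ratio of vowels: normal domains tend to be pronounceable, so vowels make up a larger share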

import math

from collections import Counter


class Feature(object):
    # Vowels
    vowels = "aeiou"

    def __init__(self, domain: str):
        self._domain = self.clean_domain(domain)

    @staticmethod
    def clean_domain(domain):
        """Clean samples that contain ':', '/' or '*', for example:
            js5865.2011youxi.com:8080
            *.gmtel.net
            www.jibai.com/index
        """
        domain = domain.strip().lower()
        if ":" in domain:
            domain = domain.split(":")[0]  # drop the port
        if "/" in domain:
            domain = domain.split("/")[0]  # drop the path
        if "*" in domain:
            # drop the wildcard and the dot that follows it
            domain = domain.split("*")[1].lstrip(".")
        return domain
    
    def get_features(self) -> list:
        feature_funcs = (
            self.domain_length, self.unique_char_ratio, 
            self.vowels_ratio, self.entropy, self.unique_char_count,
        )
        rslt = []
        for func in feature_funcs:
            rslt.append(func())
        return rslt

    def domain_length(self):
        """Length of the domain."""
        return len(self._domain)

    def unique_char_count(self):
        """Number of unique characters."""
        return len(set(self._domain))

    def unique_char_ratio(self):
        """Ratio of unique characters to total length."""
        if not self._domain:
            return 0.0  # avoid dividing by zero on an empty domain
        return len(set(self._domain)) / len(self._domain)

    def entropy(self):
        """Shannon entropy of the string, i.e. how random it looks."""
        freqs = Counter(self._domain)
        # Probability of each character
        probs = [f / len(self._domain) for f in freqs.values()]
        # Entropy in bits
        return -sum(p * math.log2(p) for p in probs)

    def vowels_ratio(self):
        """Ratio of vowels among the letters of the first label."""
        domain = self._domain.strip().split(".")[0].lower()
        letter_count = 0
        vowel_count = 0
        for ch in domain:
            if "a" <= ch <= "z":
                letter_count += 1
            if ch in self.vowels:
                vowel_count += 1
        if letter_count == 0:
            return 0
        return vowel_count / letter_count
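
As a quick check of the entropy feature, the sketch below applies the same formula to the two example strings from section 1. The standalone helper `shannon_entropy` is a name introduced here for illustration; it repeats the calculation in `Feature.entropy` so that it can run on its own.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits, the same formula as Feature.entropy."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    normal = "baidu"
    generated = "q9b281081e53aa9a21120b41ca9cf5c31d"
    # "baidu" has 5 distinct characters, each appearing once,
    # so its entropy is exactly log2(5).
    print(round(shannon_entropy(normal), 4))
    print(round(shannon_entropy(generated), 4))
```

The generated string is longer and uses more distinct characters with a flatter distribution, so its entropy comes out higher than that of "baidu", which is the signal feature (4) relies on.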

2.1.2 Training the models

2.1.2.1 Code

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import csv
import math
import random
import typing as t

import joblib
import numpy as np

from collections import Counter

from keras.models import Sequential
from keras.preprocessing import sequence
from keras.layers import Dense, Activation, Embedding, Dropout, LSTM
from sklearn import metrics
from sklearn import tree
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing


class DGADetectByML(object):
    def train(self, x, y):
        self._model.fit(x, y)

    def predict(self, *domains):
        # Note: get_train_test_data standardizes the features, but no scaling is
        # applied here; the same scaling would have to be reused for consistent input.
        x = []
        for domain in domains:
            x.append(Feature(domain).get_features())
        return self._model.predict(x)

    @staticmethod
    def get_train_test_data(test_size=50000):
        """Returns: x, x_test, y, y_test"""
        # Placeholder lists; the real data is the 950,000 black and 1,000,000 white samples
        black_domains = ["xxxxxx.com", "dddddddd.org"]
        white_domains = ["baidu.com", "google.com"]
        tmp = []
        for domain in black_domains:
            features = Feature(domain).get_features()
            features.append(1)  # black sample, label 1
            tmp.append(features)

        for domain in white_domains:
            features = Feature(domain).get_features()
            features.append(0)  # white sample, label 0
            tmp.append(features)

        random.shuffle(tmp)  # shuffle the samples
        x = []
        y = []
        for i in tmp:
            x.append(i[:-1])
            y.append(i[-1])
        x = preprocessing.scale(x)  # preprocessing: standardize each feature
        return train_test_split(x, y, test_size=test_size)

    def load(self, file_path):
        self._model = joblib.load(file_path)

    def dump(self, file_path):
        joblib.dump(self._model, file_path)


class DGADetectByDT(DGADetectByML):
    """Decision tree"""
    def __init__(self):
        self._model = tree.DecisionTreeClassifier()


class DGADetectBySVC(DGADetectByML):
    """Support vector machine"""
    def __init__(self):
        self._model = SVC(kernel='linear')


class DGADetectByRF(DGADetectByML):
    """Random forest"""
    def __init__(self):
        self._model = RandomForestClassifier()


if __name__ == "__main__":
    clf = DGADetectByDT()
    x, x_test, y, y_test = clf.get_train_test_data()
    # Train
    clf.train(x, y)
    # Predict
    clf.predict("cccccccccccccc.com")

2.1.2.2 Results

The score is the F1 score: score = precision * recall * 2 / (precision + recall)
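
For readers who want to see the formula worked through, here is a minimal sketch that computes it from confusion-matrix counts. The function name `f1_from_counts` and the example counts are made up for illustration and are not taken from the experiments. The scores reported in this article appear to be this value expressed as a percentage.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # precision and recall are both 0 (or undefined)
    precision = tp / (tp + fp)  # share of predicted black samples that are black
    recall = tp / (tp + fn)     # share of black samples that were found
    return precision * recall * 2 / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 80 DGA domains caught, 20 false alarms, 20 missed.
    print(round(f1_from_counts(tp=80, fp=20, fn=20) * 100, 4))
```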

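Features used	Model	Score	Notes
Domain length, character entropy, unique character count	Decision tree DGADetectByDT	73.3321
Domain length, character entropy, unique character count, unique character ratio	Decision tree DGADetectByDT	73.3346
Domain length, character entropy, unique character ratio, vowel ratio	Decision tree DGADetectByDT	76.4449
Domain length, character entropy, unique character ratio, vowel ratio	Support vector machine DGADetectBySVC	78.0740	Training takes a long time
Domain length, character entropy, unique character ratio, vowel ratio	Random forest DGADetectByRF	76.7998
Domain length, character entropy, unique character ratio, vowel ratio	Naive Bayes (not included in the code above)	76

As the decision tree results show, the final performance depends heavily on the features: the better the features, the better the result.
With the same features, different models perform about the same, so all of these models depend on the features.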

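A DGA domain is just a string, so it is hard to extract further useful features from it. (I also tried the top-level domain, but the share of each TLD was roughly the same in the white and black samples, so it could not serve as a feature.) To improve on these results, the next section tries a neural network model.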
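
The TLD check mentioned above could be done along the following lines. This is only a sketch of one way to do it: the helper `tld_distribution` and the tiny sample lists are hypothetical, and the real comparison would run over the full black and white sample sets.

```python
from collections import Counter


def tld_distribution(domains):
    """Return the share of each top-level domain in a list of domains."""
    tlds = [d.strip().lower().rsplit(".", 1)[-1] for d in domains if "." in d]
    counts = Counter(tlds)
    total = sum(counts.values())
    return {tld: n / total for tld, n in counts.items()}


if __name__ == "__main__":
    # Placeholder samples, for illustration only.
    black = ["xxxxxx.com", "dddddddd.org", "ccccccc.com"]
    white = ["baidu.com", "google.com", "example.org"]
    print(tld_distribution(black))
    print(tld_distribution(white))
```

If the two distributions come out nearly identical, as the article reports for its data, the TLD carries little information for separating the classes.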

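2.2 LSTM neural network model

An advantage of neural networks is that you do not have to engineer features yourself; the model analyzes and learns features on its own.

2.2.1 Code

The model cannot take strings as input directly, so each domain is converted to a sequence of ints, with every character assigned a specific int (DGADetectByLSTM.VALID_CHARS).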

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import math
import random

import joblib
import numpy as np

from keras.models import Sequential
from keras.preprocessing import sequence
from keras.layers import Dense, Activation, Embedding, Dropout, LSTM
from sklearn.model_selection import train_test_split

# Feature.clean_domain used below refers to the Feature class from section 2.1.1.


class DGADetectByLSTM(object):
    VALID_CHARS = {
        'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 
        'i': 9, 'j': 10, 'k': 11, 'l': 12, 'n': 13, 'm': 14, 'o': 15, 
        'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 
        'w': 23, 'x': 24, 'y': 25, 'z': 26, '0': 27, '1': 28, '2': 29, 
        '3': 30, '4': 31, '5': 32, '6': 33, '7': 34, '8': 35, '9': 36, 
        '-': 37, '_': 38, '.': 39,
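    }  # characters that may appear in the sample domains; domains are case-insensitive, so lowercase is used throughout
    MAX_DOMAIN_LEN = 74  # the longest domain in the samples has length 74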

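    def __init__(self):
        model = Sequential()
        # The embedding layer maps each encoded character to a fixed-size vector, e.g. [[4],[20]]->[[0.25,0.1],[0.6,-0.2]]
        model.add(Embedding(len(self.VALID_CHARS) + 1, 128, input_length=self.MAX_DOMAIN_LEN))
        # The LSTM layer is the core of the model and learns features from the samples
        model.add(LSTM(128))
        # The dropout layer guards against overfitting by randomly dropping a fraction of units during training
        model.add(Dropout(0.5))
        # The dense layer maps the learned features to a single output value
        model.add(Dense(1))
        # The sigmoid activation turns that value into a probability for binary classification
        model.add(Activation('sigmoid'))

        model.compile(loss='binary_crossentropy',
                    optimizer='rmsprop')

        self._model = model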

    
    @classmethod
    def get_train_test_data(cls, test_size=30000):
        """x, x_test, y, y_test"""
        black_dataset = [["xxxxxx.com", 1], ["ccccccc.com", 1]]
        white_dataset = [["baidu.com", 0], ["google.com", 0]]
        tmp  = []
        for domain, label in black_dataset:
            tmp.append(
                [Feature.clean_domain(domain), int(label)],
            )

        for domain, label in white_dataset:
            tmp.append(
                [Feature.clean_domain(domain), int(label)],
            )

        random.shuffle(tmp)


        x = [[cls.VALID_CHARS[j] for j in i[0]] for i in tmp]
        x = sequence.pad_sequences(x, maxlen=cls.MAX_DOMAIN_LEN)
        y = [i[1] for i in tmp]
        y = np.array(y)
        return train_test_split(x, y, test_size=test_size)

    def train(self, x, y, batch_size=128):
        # Each call runs one epoch. Several rounds are needed to reach the best result;
        # too many or too few rounds both hurt model performance.
        self._model.fit(x, y, batch_size=batch_size)

    def predict(self, x_test):
        """
        Args:
            x_test: data to predict
        Returns:
            y_pred: for each input, the probability of the positive class;
                a probability above 0.5 is usually treated as positive
        """
        y_pred = self._model.predict(x_test)
        # return np.round(y_pred).astype(int)  # prob > 0.5 is treated as 1
        return y_pred

    def load(self, file_path):
        self._model = joblib.load(file_path)

    def dump(self, file_path):
        joblib.dump(self._model, file_path)


if __name__ == "__main__":
    clf = DGADetectByLSTM()
    x, x_test, y, y_test = clf.get_train_test_data()
    # First round of training
    clf.train(x, y)
    # Second round of training
    clf.train(x, y)
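
The encoding step in get_train_test_data can be illustrated without Keras. The sketch below is a pure-Python approximation: `CHAR_TO_INT` and `encode_domain` are names introduced here, the mapping is illustrative rather than a copy of VALID_CHARS, and the padding imitates what `pad_sequences` does by default (zeros added at the front, overly long sequences cut from the front).

```python
import string

# Hypothetical alphabet: letters, digits, then '-', '_' and '.'.
# Index 0 is left free so that it can be used for padding.
ALPHABET = string.ascii_lowercase + string.digits + "-_."
CHAR_TO_INT = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def encode_domain(domain: str, max_len: int = 74) -> list:
    """Turn a domain into a fixed-length list of ints, left-padded with 0."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    codes = [CHAR_TO_INT[ch] for ch in domain.lower()]
    codes = codes[-max_len:]  # keep only the last max_len characters
    return [0] * (max_len - len(codes)) + codes


if __name__ == "__main__":
    print(encode_domain("ab.c", max_len=6))  # → [0, 0, 1, 2, 39, 3]
```

Every domain ends up with the same length, which is what lets the Embedding layer take a batch of them as one array.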

2.2.2 Results

score = precision * recall * 2 / (precision + recall)

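The model outputs, for each input, the probability that it is a black sample (class 1). The score therefore depends on the probability cutoff above which an input is counted as a black sample. Below, prob > 0.5 means inputs with probability above 0.5 are treated as black samples.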
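
To show how the cutoff changes the score, here is a small sketch that recomputes the score at several thresholds. The helper `f1_at_threshold` and the six toy labels and probabilities are invented for illustration; they are not the article's data.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 when every probability above `threshold` is counted as class 1."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision * recall * 2 / (precision + recall)


if __name__ == "__main__":
    # Toy data: 1 = black sample, 0 = white sample.
    y_true = [1, 1, 1, 0, 0, 0]
    y_prob = [0.9, 0.6, 0.4, 0.45, 0.2, 0.1]
    for threshold in (0.6, 0.5, 0.4, 0.35):
        print(threshold, round(f1_at_threshold(y_true, y_prob, threshold), 4))
```

On this toy data, lowering the cutoff from 0.5 to 0.35 raises the score from 0.8 to about 0.857, because one more black sample is caught at the cost of one false alarm. The best cutoff depends on the data, so it is worth sweeping rather than fixing at 0.5.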

roc_auc_score is computed with sklearn.metrics.roc_auc_score(y_test, y_pred)
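
Unlike the threshold-based score, ROC AUC does not depend on a cutoff. It can be read as the probability that a randomly chosen black sample receives a higher predicted probability than a randomly chosen white sample. The sketch below computes it that way in plain Python. The function `pairwise_roc_auc` is a hypothetical helper written for illustration; in practice `sklearn.metrics.roc_auc_score` does this job, and the toy data is made up.

```python
def pairwise_roc_auc(y_true, y_prob):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [p for t, p in zip(y_true, y_prob) if t == 1]
    negatives = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0]
    y_prob = [0.9, 0.6, 0.4, 0.45, 0.2, 0.1]
    print(round(pairwise_roc_auc(y_true, y_prob), 4))  # 8 of 9 pairs are ordered correctly
```

The double loop is quadratic in the number of samples, so it only suits small examples like this one.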

loss: the difference, or error, between the model's predictions and the actual labels, i.e. how well the model fits.
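
Since the model is compiled with binary_crossentropy, the loss column can be understood through a minimal sketch of that formula. The function name `binary_crossentropy_loss` and the clipping constant `eps` are assumptions made for this illustration; Keras computes the loss internally.

```python
import math


def binary_crossentropy_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between labels (0 or 1) and probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


if __name__ == "__main__":
    labels = [1, 0]
    print(binary_crossentropy_loss(labels, [0.5, 0.5]))  # undecided predictions
    print(binary_crossentropy_loss(labels, [0.9, 0.1]))  # confident and correct
```

Predictions that are both confident and correct give a lower loss than undecided ones, which matches the falling loss values as training proceeds.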

epoch	score	roc_auc_score	loss
After round 1	prob > 0.5: 84.5244	0.993934938207379	0.15
After round 2	prob > 0.5: 87.2339	0.9958744855838554	0.09
After round 3	prob > 0.5: 89.2756 prob > 0.6: 88.7147 prob > 0.4: 89.8823 prob > 0.35: 90.1995	0.9967410750580266	0.07
After round 4	prob > 0.5: 89.3029	0.9969791104698198	0.0664
After round 5		0.9973556760715847	0.0610
After round 6		0.9975680149333765	0.0574
After round 7	prob > 0.5: 89.6376 prob > 0.35: 90.4406	0.9977849529441264	0.0574
After round 8	prob > 0.35: 90.6403	0.9976811419941605	0.0526
After round 9	prob > 0.35: 89.9612	0.997626276747869	0.0509
After round 10		0.9976578795031775	0.0497