16. Automated Intrusion Detection Techniques: A Complete Guide

perl8

Published 2025-07-30 13:18:09

License: CC 4.0 BY-SA

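Network security threats keep growing, and automated intrusion detection has become an important way to protect systems. This article walks through three common detection scenarios and how to handle each: insider threat detection, DDoS attack detection, and credit card fraud detection.

1. Insider Threat Detection

Insider threats can come from an organization's own employees, who may log on outside working hours, email external recipients, or copy sensitive files. We detect these threats in two stages: feature engineering on the activity logs, followed by anomaly detection.

1.1 Preparation

Install the pandas library: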

pip install pandas

Download the CERT insider threat dataset from:

ftp://ftp.sei.cmu.edu/pub/cert-data/r4.2.tar.bz2

More information about the dataset is available at:

https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=508099

1.2 Feature Engineering Steps

Import the required libraries and set the dataset path:

import numpy as np
import pandas as pd
path_to_dataset = "./r42short/"

Specify the .csv files to read and the columns to keep from each:

log_types = ["device", "email", "file", "logon", "http"]
log_fields_list = [
    ["date", "user", "activity"],
    ["date", "user", "to", "cc", "bcc"],
    ["date", "user", "filename"],
    ["date", "user", "activity"],
    ["date", "user", "url"],
]

Create the feature-encoding dictionary, which maps each feature name to an integer index:

features = 0
feature_map = {}
def add_feature(name):
    """Add a feature to a dictionary to be encoded."""
    global features
    if name not in feature_map:
        feature_map[name] = features
        features += 1

Add the features to use:

add_feature("Weekday_Logon_Normal")
add_feature("Weekday_Logon_After")
add_feature("Weekend_Logon")
add_feature("Logoff")
add_feature("Connect_Normal")
add_feature("Connect_After")
add_feature("Connect_Weekend")
add_feature("Disconnect")
add_feature("Email_In")
add_feature("Email_Out")
add_feature("File_exe")
add_feature("File_jpg")
add_feature("File_zip")
add_feature("File_txt")
add_feature("File_doc")
add_feature("File_pdf")
add_feature("File_other")
add_feature("url")

Define the file feature function:

def file_features(row):
    """Creates a feature recording the file extension of the file used."""
    if row["filename"].endswith(".exe"):
        return feature_map["File_exe"]
    if row["filename"].endswith(".jpg"):
        return feature_map["File_jpg"]
    if row["filename"].endswith(".zip"):
        return feature_map["File_zip"]
    if row["filename"].endswith(".txt"):
        return feature_map["File_txt"]
    if row["filename"].endswith(".doc"):
        return feature_map["File_doc"]
    if row["filename"].endswith(".pdf"):
        return feature_map["File_pdf"]
    else:
        return feature_map["File_other"]

Define the email feature function:

def email_features(row):
    """Creates a feature recording whether an email has been sent externally."""
    outsider = False
    if not pd.isnull(row["to"]):
        for address in row["to"].split(";"):
            if not address.endswith("dtaa.com"):
                outsider = True
    if not pd.isnull(row["cc"]):
        for address in row["cc"].split(";"):
            if not address.endswith("dtaa.com"):
                outsider = True
    if not pd.isnull(row["bcc"]):
        for address in row["bcc"].split(";"):
            if not address.endswith("dtaa.com"):
                outsider = True
    if outsider:
        return feature_map["Email_Out"]
    else:
        return feature_map["Email_In"]
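One caveat with the suffix test above: `address.endswith("dtaa.com")` also accepts an address such as `x@notdtaa.com`. The following is a minimal sketch of a stricter check, not part of the original recipe. The helper names `is_internal` and `has_outsider` are hypothetical, and the sketch assumes internal mail uses exactly the `dtaa.com` domain (subdomains would count as external).

```python
INTERNAL_DOMAIN = "dtaa.com"  # domain used by the feature function above

def is_internal(address, domain=INTERNAL_DOMAIN):
    """Return True only if the part after the last '@' equals the domain."""
    # rpartition splits on the last '@'; a missing '@' yields an empty separator.
    _, sep, host = address.strip().lower().rpartition("@")
    return sep == "@" and host == domain

def has_outsider(field):
    """Check one semicolon-separated recipient field (hypothetical helper)."""
    return any(not is_internal(a) for a in field.split(";") if a.strip())

print(is_internal("Ann.Lee@dtaa.com"))            # True
print(is_internal("x@notdtaa.com"))               # False
print(has_outsider("a@dtaa.com;b@example.org"))   # True
```

With such a helper, each of the three loops over `to`, `cc`, and `bcc` could be reduced to one `has_outsider(row[...])` call, if that stricter behavior is wanted.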

Define the device feature function:

def device_features(row):
    """Creates a feature for whether the user has connected during normal hours or otherwise."""
    if row["activity"] == "Connect":
        if row["date"].weekday() < 5:
            if row["date"].hour >= 8 and row["date"].hour < 17:
                return feature_map["Connect_Normal"]
            else:
                return feature_map["Connect_After"]
        else:
            return feature_map["Connect_Weekend"]
    else:
        return feature_map["Disconnect"]

Define the logon feature function:

def logon_features(row):
    """Creates a feature for whether the user logged in during normal hours or otherwise."""
    if row["activity"] == "Logon":
        if row["date"].weekday() < 5:
            if row["date"].hour >= 8 and row["date"].hour < 17:
                return feature_map["Weekday_Logon_Normal"]
            else:
                return feature_map["Weekday_Logon_After"]
        else:
            return feature_map["Weekend_Logon"]
    else:
        return feature_map["Logoff"]
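The device and logon functions apply the same rule: weekdays between 08:00 and 17:00 count as normal hours, other weekday times as after hours, and Saturday or Sunday as weekend. As an illustration only, that shared rule could be pulled into one helper. The name `time_bucket` and its parameters are assumptions made for this sketch.

```python
from datetime import datetime

def time_bucket(ts, start_hour=8, end_hour=17):
    """Classify a timestamp as 'Normal', 'After', or 'Weekend'."""
    if ts.weekday() >= 5:  # Monday is 0, so 5 and 6 are Saturday and Sunday
        return "Weekend"
    if start_hour <= ts.hour < end_hour:
        return "Normal"
    return "After"

# 2010-01-04 was a Monday; 2010-01-09 was a Saturday.
print(time_bucket(datetime(2010, 1, 4, 9, 30)))   # Normal
print(time_bucket(datetime(2010, 1, 4, 22, 0)))   # After
print(time_bucket(datetime(2010, 1, 9, 10, 0)))   # Weekend
```

Keeping the hour boundaries in one place makes it easier to adjust them later, for example for a site with different working hours.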

Define the HTTP feature function:

def http_features(row):
    """Encodes the URL visited."""
    return feature_map["url"]

Define the date conversion function:

def date_to_day(row):
    """Converts a full datetime to date only."""
    day_only = row["date"].date()
    return day_only

Read and process the data:

log_feature_functions = [
    device_features,
    email_features,
    file_features,
    logon_features,
    http_features,
]
dfs = []
for i in range(len(log_types)):
    log_type = log_types[i]
    log_fields = log_fields_list[i]
    log_feature_function = log_feature_functions[i]
    df = pd.read_csv(
        path_to_dataset + log_type + ".csv", usecols=log_fields, index_col=None
    )
    date_format = "%m/%d/%Y %H:%M:%S"
    df["date"] = pd.to_datetime(df["date"], format=date_format)
    new_feature = df.apply(log_feature_function, axis=1)
    df["feature"] = new_feature
    cols_to_keep = ["date", "user", "feature"]
    df = df[cols_to_keep]
    df["date"] = df.apply(date_to_day, axis=1)
    dfs.append(df)
joint = pd.concat(dfs)
joint = joint.sort_values(by="date")

1.3 Anomaly Detection Steps

Preparation:

pip install scikit-learn pandas matplotlib

Steps:
List the threat actors:

threat_actors = [
    "AAM0658",
    "AJR0932",
    "BDV0168",
    # ...
    "MSO0222",
]

Index the dates:

start_date = joint["date"].iloc[0]
end_date = joint["date"].iloc[-1]
time_horizon = (end_date - start_date).days + 1
def date_to_index(date):
    """Indexes dates by counting the number of days since the starting date of the dataset."""
    return (date - start_date).days

Define a function that extracts one user's time series:

def extract_time_series_by_user(user_name, df):
    """Filters the dataframe down to a specific user."""
    return df[df["user"] == user_name]

Define a function that vectorizes one user's time series:

def vectorize_user_time_series(user_name, df):
    """Convert the sequence of features of a user to a vector-valued time series."""
    user_time_series = extract_time_series_by_user(user_name, df)
    x = np.zeros((len(feature_map), time_horizon))
    event_date_indices = user_time_series["date"].apply(date_to_index).to_numpy()
    event_features = user_time_series["feature"].to_numpy()
    for i in range(len(event_date_indices)):
        x[event_features[i], event_date_indices[i]] += 1
    return x
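To make the shape of this matrix concrete, here is a toy version of the same counting loop. Everything in it (the two-feature map, the start date, and the four events) is made up for illustration and does not come from the CERT data.

```python
import numpy as np
from datetime import date

toy_feature_map = {"Logon": 0, "Email_Out": 1}  # hypothetical two-feature encoding
toy_start = date(2010, 1, 4)
toy_horizon = 3  # days covered by the toy log

# (day, feature id) pairs for a single made-up user
toy_events = [
    (date(2010, 1, 4), 0),
    (date(2010, 1, 4), 1),
    (date(2010, 1, 4), 1),
    (date(2010, 1, 6), 0),
]

# Rows are features, columns are days; each cell counts events of that type.
x = np.zeros((len(toy_feature_map), toy_horizon))
for day, feature in toy_events:
    x[feature, (day - toy_start).days] += 1

print(x)
# [[1. 0. 1.]
#  [2. 0. 0.]]
```

Each row is one feature and each column one day, so the second row shows two external emails on the first day and none afterwards.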

Define a function that vectorizes the whole dataset:

def vectorize_dataset(df):
    """Takes the dataset and featurizes it."""
    users = set(df["user"].values)
    X = np.zeros((len(users), len(feature_map), time_horizon))
    y = np.zeros((len(users)))
    for index, user in enumerate(users):
        x = vectorize_user_time_series(user, df)
        X[index, :, :] = x
        y[index] = int(user in threat_actors)
    return X, y

Vectorize the dataset:

X, y = vectorize_dataset(joint)

Split into training and test sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

Reshape the data so that each user's feature-by-day matrix becomes a single flat vector:

X_train_reshaped = X_train.reshape(
    [X_train.shape[0], X_train.shape[1] * X_train.shape[2]]
)
X_test_reshaped = X_test.reshape([X_test.shape[0], X_test.shape[1] * X_test.shape[2]])

Split into normal and threat subsets:

X_train_normal = X_train_reshaped[y_train == 0, :]
X_train_threat = X_train_reshaped[y_train == 1, :]
X_test_normal = X_test_reshaped[y_test == 0, :]
X_test_threat = X_test_reshaped[y_test == 1, :]

Define and instantiate the Isolation Forest model:

from sklearn.ensemble import IsolationForest
contamination_parameter = 0.035
IF = IsolationForest(
    n_estimators=100, max_samples=256, contamination=contamination_parameter
)
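Before fitting on the real data, it can help to see how Isolation Forest scores behave on data where the anomalies are known. The sketch below uses synthetic Gaussian points rather than the CERT features. The cluster locations, sizes, and `random_state` are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(500, 4))   # made-up "normal" users
outliers = rng.normal(loc=6.0, scale=1.0, size=(15, 4))   # made-up anomalous users
data = np.vstack([inliers, outliers])

model = IsolationForest(n_estimators=100, max_samples=256,
                        contamination=0.035, random_state=0)
model.fit(data)

# decision_function: lower means more anomalous; negative values fall below
# the threshold implied by `contamination`.
inlier_scores = model.decision_function(inliers)
outlier_scores = model.decision_function(outliers)
print("mean inlier score: ", inlier_scores.mean())
print("mean outlier score:", outlier_scores.mean())
```

The `contamination` parameter only sets where the zero point of the score lies; it does not change how the samples are ranked. That is why a separate cutoff can still be chosen by hand afterwards.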

Fit the model:

IF.fit(X_train_reshaped)

Plot the decision scores of the normal subset:

normal_scores = IF.decision_function(X_train_normal)
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 4), dpi=600, facecolor="w", edgecolor="k")
normal = plt.hist(normal_scores, 50, density=True)
plt.xlim((-0.2, 0.2))
plt.xlabel("Anomaly score")
plt.ylabel("Density")
plt.title("Distribution of anomaly score for non threats")

Plot the decision scores of the threat subset:

anomaly_scores = IF.decision_function(X_train_threat)
fig = plt.figure(figsize=(8, 4), dpi=600, facecolor="w", edgecolor="k")
anomaly = plt.hist(anomaly_scores, 50, density=True)
plt.xlim((-0.2, 0.2))
plt.xlabel("Anomaly score")
plt.ylabel("Density")
plt.title("Distribution of anomaly score for threats")

Choose a cutoff score; users whose decision score falls below it are flagged:

cutoff = 0.12

Check the cutoff on the training set:

from collections import Counter
s = IF.decision_function(X_train_reshaped)
print(Counter(y_train[cutoff > s]))

Output:

Counter({0.0: 155, 1.0: 23})

Check the cutoff on the test set:

s = IF.decision_function(X_test_reshaped)
print(Counter(y_test[cutoff > s]))

Output:

Counter({0.0: 46, 1.0: 8})
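The two counters above only describe the users that fell below the cutoff, so they give the precision of the flagged set but not the recall. As a small worked example, the helper below (the name `flagged_precision` is an assumption) turns the printed counts into that precision.

```python
from collections import Counter

def flagged_precision(counts):
    """Fraction of flagged users that are labeled threats (label 1.0)."""
    flagged = sum(counts.values())
    return counts.get(1.0, 0) / flagged if flagged else 0.0

train_flagged = Counter({0.0: 155, 1.0: 23})  # counts printed above for the training set
test_flagged = Counter({0.0: 46, 1.0: 8})     # counts printed above for the test set

print(round(flagged_precision(train_flagged), 3))  # 0.129
print(round(flagged_precision(test_flagged), 3))   # 0.148
```

Roughly one in seven or eight flagged users is a labeled threat at this cutoff. Computing recall would additionally need the total number of threats in each split, for example from `Counter(y_test)`.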

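2. DDoS Attack Detection

A distributed denial-of-service (DDoS) attack floods a victim with traffic from many different sources and disrupts its service. Common types are application-layer attacks, protocol attacks, and volumetric attacks. Here we use a random forest classifier to detect DDoS traffic.

2.1 Preparation

Install the required libraries:

pip install scikit-learn pandas

Extract the ddos_dataset.7z archive.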

2.2 Steps

Import the required libraries and specify the features and their data types:

import pandas as pd
features = [
    "Fwd Seg Size Min",
    "Init Bwd Win Byts",
    "Init Fwd Win Byts",
    "Fwd Pkt Len Mean",
    "Fwd Seg Size Avg",
    "Label",
    "Timestamp",
]
dtypes = {
    "Fwd Pkt Len Mean": "float",
    "Fwd Seg Size Avg": "float",
    "Init Fwd Win Byts": "int",
    "Init Bwd Win Byts": "int",
    "Fwd Seg Size Min": "int",
    "Label": "str",
}
date_columns = ["Timestamp"]

Read the data:

df = pd.read_csv("ddos_dataset.csv", usecols=features, dtype=dtypes, parse_dates=date_columns, index_col=None)

Sort the data by timestamp:

df2 = df.sort_values("Timestamp")

Drop the timestamp column:

df3 = df2.drop(columns=["Timestamp"])

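Split into training and test sets:

# Chronological split: the first 80% of rows for training, the remainder for testing.
split = int(len(df3.index) * 0.8)
train_df = df3.iloc[:split].copy()
test_df = df3.iloc[split:].copy()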

Prepare the labels:

y_train = train_df.pop("Label").values
y_test = test_df.pop("Label").values

Prepare the feature vectors:

X_train = train_df.values
X_test = test_df.values

Import and instantiate the random forest classifier:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50)

Fit and evaluate the model:

clf.fit(X_train, y_train)
clf.score(X_test, y_test)

Output:

0.83262
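An accuracy figure is easier to interpret next to the accuracy of always predicting the most frequent class. The sketch below computes that baseline. The function name `majority_baseline` and the label strings are hypothetical, and the 80/20 mix is only an example, since the class mix of the chronological test split is not reported here.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical label mix: 80% benign and 20% DDoS.
example_labels = ["Benign"] * 80 + ["ddos"] * 20
baseline = majority_baseline(example_labels)
print(baseline)  # 0.8
```

On the real run, `majority_baseline(list(y_test))` gives the number to compare against the score above.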

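3. Credit Card Fraud Detection

Credit card companies must monitor for fraudulent transactions so that customers are not charged for items they did not purchase. Transaction data is usually extremely imbalanced: fraud makes up only a small fraction of all transactions.

Data characteristics:
All input variables are numerical, and apart from Time and Amount they are the result of a PCA transformation.
The Time and Amount features are included as they are.
Time is the number of seconds between each transaction and the first transaction in the dataset.
Amount is the transaction amount.
Class is the response variable: 1 for a fraudulent transaction and 0 for a normal one.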

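Summary

So far we have covered insider threat detection, DDoS attack detection, and credit card fraud detection. All of them rely on machine learning, combining feature engineering with model training to detect intrusions. In practice, pick the method that fits the scenario and its requirements, and keep tuning the model to improve detection accuracy.

The following mermaid flowchart shows the main steps of insider threat detection:

graph LR
    A[Preparation] --> B[Feature engineering]
    B --> C[Anomaly detection]
    C --> D[Model evaluation]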

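Scenario	Main method	Key steps
Insider threat detection	Feature engineering + Isolation Forest	Feature extraction, time-series vectorization, model training and evaluation
DDoS attack detection	Random forest classifier	Data loading, feature selection, model training and evaluation
Credit card fraud detection	To be explored further	Only the data characteristics have been covered so far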

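Automated Intrusion Detection Techniques: A Complete Guide (Part 2)

4. Detailed Comparison of the Detection Scenarios

To make the characteristics and differences of the scenarios clearer, the table below compares insider threat detection, DDoS attack detection, and credit card fraud detection in detail.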

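Scenario	Data characteristics	Main method	Key steps	Evaluation metric
Insider threat detection	CERT insider threat dataset; contains several kinds of logs; relatively high insider threat incidence	Feature engineering + Isolation Forest	Feature extraction, time-series vectorization, model training and evaluation	Counts of normal and threat samples below the cutoff score
DDoS attack detection	Sampled from the CSE-CIC-IDS2018, CICIDS2017, and CIC DoS datasets; 80% benign and 20% DDoS traffic	Random forest classifier	Data loading, feature selection, model training and evaluation	Accuracy on the test set
Credit card fraud detection	Extremely imbalanced; fraud is a small share of transactions; PCA-transformed numerical variables plus the `Time` and `Amount` features	To be explored further	Only the data characteristics covered so far	None yet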

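As the table shows, the scenarios differ clearly in both their data and their methods. Insider threat detection focuses on extracting user behavior features and analyzing them as time series. DDoS attack detection uses a random forest to classify traffic data. Credit card fraud detection needs special handling because the data is extremely imbalanced.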

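5. A Closer Look at Credit Card Fraud Detection

Section 3 only described the data, so here we look at possible ways to handle it.

5.1 Data Preprocessing

Because credit card transaction data is extremely imbalanced, a model trained on it directly tends to favor normal transactions and overlook fraud. The data therefore needs preprocessing.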

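Oversampling: increase the number of minority-class (fraud) samples. A common method is SMOTE (Synthetic Minority Over-sampling Technique).

# imblearn is provided by the imbalanced-learn package: pip install imbalanced-learn
from imblearn.over_sampling import SMOTE

# Resample the training split only, so synthetic samples never reach the test set.
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)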

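Undersampling: reduce the number of majority-class (normal) samples, for example by randomly keeping only part of the normal transactions.

from imblearn.under_sampling import RandomUnderSampler

# As with oversampling, resample the training split only.
rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)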
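To show what random undersampling does under the hood, here is a NumPy-only sketch. It is not the imblearn implementation; the function name `random_undersample` and the toy data are assumptions made for illustration.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Downsample every class to the size of the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # For each class, draw n_keep row indices without replacement.
    kept = [rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
            for c in classes]
    idx = np.sort(np.concatenate(kept))
    return X[idx], y[idx]

# Toy data: 995 normal rows (0) and 5 fraud rows (1).
y_toy = np.array([0] * 995 + [1] * 5)
X_toy = np.arange(len(y_toy) * 2, dtype=float).reshape(-1, 2)
X_bal, y_bal = random_undersample(X_toy, y_toy)
print(np.bincount(y_bal))  # [5 5]
```

The balanced set keeps all five fraud rows but only five of the 995 normal rows, which illustrates the main cost of undersampling: most of the majority-class data is discarded.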

5.2 Model Selection

Logistic regression: simple and highly interpretable, well suited to binary classification.

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

Support vector machine: can handle non-linear classification problems.

from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

5.3 Model Evaluation

On imbalanced data, accuracy can be misleading. Precision, recall, and the F1 score are better suited to evaluating the model.

from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision}, Recall: {recall}, F1-score: {f1}")
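A short worked example shows why accuracy misleads here. The sketch below computes the metrics by hand on made-up labels. The function name `binary_metrics` and the 998-to-2 class mix are assumptions for illustration only.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Made-up test set: 998 normal transactions and 2 frauds.
y_true = [0] * 998 + [1] * 2
always_normal = [0] * 1000  # a "model" that never flags fraud
print(binary_metrics(y_true, always_normal))  # (0.998, 0.0, 0.0, 0.0)
```

A model that never flags anything scores 99.8% accuracy while catching no fraud at all, which recall and F1 expose immediately.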

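6. Ways to Improve Automated Intrusion Detection

The performance of automated intrusion detection can be improved in several ways.

6.1 Feature Engineering

Adding features: for insider threat detection, consider the sentiment of email text, or use psychometrics to analyze employees' personality traits.
Feature selection: use methods such as the chi-squared test or information gain to keep the most valuable features, which reduces dimensionality and speeds up training.

from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative feature values; rescale the features first
# (for example to [0, 1]) if they can be negative.
selector = SelectKBest(score_func=chi2, k=10)
X_new = selector.fit_transform(X, y)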
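scikit-learn's `chi2` scorer rejects negative inputs, which matters for features such as PCA components. One possible workaround, sketched here on synthetic data, is to rescale each column into [0, 1] with `MinMaxScaler` before selection. The data, the informative column, and `k=2` are all choices made for this illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.normal(size=(200, 6))   # contains negative values
X_demo[:, 0] += 3.0 * y_demo         # make column 0 informative on purpose

X_scaled = MinMaxScaler().fit_transform(X_demo)   # map every column into [0, 1]
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X_scaled, y_demo)

print(X_selected.shape)                     # (200, 2)
print(selector.get_support(indices=True))   # indices of the kept columns
```

For continuous features, a scorer such as `mutual_info_classif` from the same module is another option and does not need the rescaling step.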

6.2 Model Optimization

Hyperparameter tuning: use grid search or random search to find the best combination of model parameters.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30]
}

rf = RandomForestClassifier()
grid_search = GridSearchCV(rf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
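Random search follows the same pattern but samples only some of the combinations, which can be cheaper on large grids. The sketch below uses `RandomizedSearchCV` on synthetic data standing in for `X_train` and `y_train`. The parameter values, `n_iter`, and `cv` are arbitrary choices for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

param_distributions = {
    "n_estimators": [10, 20, 40],
    "max_depth": [None, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=4,          # evaluate only 4 of the 9 combinations
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

For imbalanced data, passing a `scoring` argument such as `"f1"` or `"recall"` to either search class may be more appropriate than the default accuracy.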

Ensemble learning: combine the predictions of several models to improve stability and accuracy, for example by voting or stacking.
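As a minimal illustration of voting, the sketch below combines label predictions from three models by simple majority. The function name `hard_vote` and the prediction lists are made up. In scikit-learn, `VotingClassifier` and `StackingClassifier` in `sklearn.ensemble` provide full implementations of the two approaches.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Combine per-model label lists by majority vote, sample by sample."""
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest count; ties go to
        # the label that was counted first.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Made-up predictions from three models on five samples.
model_a = [0, 1, 1, 0, 1]
model_b = [0, 1, 0, 0, 0]
model_c = [1, 1, 1, 0, 0]
print(hard_vote([model_a, model_b, model_c]))  # [0, 1, 1, 0, 0]
```

Using an odd number of models avoids ties in binary problems.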

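7. Summary and Outlook

Automated intrusion detection plays a vital role in network security. With feature engineering and machine learning models, we can effectively detect insider threats, DDoS attacks, credit card fraud, and similar intrusions.

In practice, the method has to match the detection scenario, and the model needs continued tuning to improve detection accuracy. As attack techniques evolve, detection techniques must keep improving as well. For example, deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) can take on more complex intrusion detection tasks.

The following mermaid flowchart shows the main steps of credit card fraud detection: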

graph LR
    A[Data preprocessing] --> B[Oversampling/undersampling]
    B --> C[Model selection]
    C --> D[Logistic regression/SVM etc.]
    D --> E[Model evaluation]
    E --> F[Precision/recall/F1]

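In short, automated intrusion detection is a field that keeps developing, and staying current with it is necessary to meet increasingly complex network security challenges.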