ANOVA-Based Feature Selection and Visualization in MATLAB

Subject.625Ruben

Published 2025-03-22 10:12:15

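Introduction

Feature selection matters in machine learning and statistical modeling because it makes models cheaper to fit, reduces overfitting, and makes them easier to interpret. This article describes a feature selection method based on analysis of variance (ANOVA). It covers data loading, per-feature scoring, statistical filtering, and visualization. The pipeline combines per-feature F-tests with Benjamini-Hochberg control of the false discovery rate and with upper and lower bounds on the number of retained features, and it is applied here to a regression dataset.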

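Data Loading and Preprocessing

First, load the Excel dataset. The original column headers are preserved, the last column is treated as the target, and all other columns are treated as features. If the file cannot be read, the script stops with a list of things to check: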

clc; clear; close all;
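try
    % 'VariableNamingRule','preserve' keeps the original column headers unchanged
    df = readtable('回归数据集.xlsx', 'VariableNamingRule', 'preserve');
catch ME
    % error() expands \n only in the format-style call, so an identifier is passed first.
    % 'anovaFS:readFailed' is an illustrative identifier, not part of any library.
    error('anovaFS:readFailed', ...
        ['Failed to read the file (%s). Please check:\n' ...
         '1. The file path is correct\n' ...
         '2. The Excel file is not password-protected\n' ...
         '3. The first row contains valid headers'], ME.message);
end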

% Extract the feature names and split the table into features (X) and target (y)
featureNames = df.Properties.VariableNames(1:end-1); 
X = df{:, 1:end-1};
y = df{:, end};
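
The loading code above does not itself check for missing or non-numeric values. As a minimal sketch of one way to do that, assuming a small made-up table (the names T, f1, f2 and target are hypothetical and are not the article's data), the rows with missing entries can be dropped before X and y are extracted:

```matlab
% Hypothetical toy table standing in for the spreadsheet (not the article's data)
T = table([1; 2; NaN; 4], [0.5; 0.1; 0.3; 0.9], [10; 12; 11; 15], ...
    'VariableNames', {'f1', 'f2', 'target'});

% Every column must be numeric for T{:, :} to yield a numeric matrix
isNum = varfun(@isnumeric, T, 'OutputFormat', 'uniform');
assert(all(isNum), 'Non-numeric columns found.');

% Flag rows that contain a missing value in any column, then drop them
rowsWithMissing = any(ismissing(T), 2);
Tclean = T(~rowsWithMissing, :);

fprintf('Removed %d of %d rows with missing values.\n', ...
    sum(rowsWithMissing), height(T));
```

Dropping rows is only one option; whether imputation would suit the dataset better depends on how much data is missing, which the article does not state.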

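Parameter Configuration

Define the parameters that control the selection and the exported figures:

alpha = 0.05;       % significance level (target false discovery rate)
max_features = 15;  % maximum number of features to keep
min_features = 5;   % minimum number of features to keep
figDPI = 600;       % resolution of exported figures (DPI)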

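Step 1: Scoring Features with ANOVA

One-way ANOVA is run once per feature to obtain an F-value and a p-value, which measure how strongly the feature is associated with the target variable. Note that the call below passes the feature column as the data and y as the grouping variable, so the test is informative only when y takes a limited number of repeated levels. If every value of y is unique, no within-group degrees of freedom remain and the F-value is undefined.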

m = size(X, 2);
[Fvals, pvals] = deal(zeros(m,1));

for i = 1:m
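    % anova1(data, group, displayopt): the feature is the data, y defines the groups
    [~, tbl] = anova1(X(:,i), y, 'off');
    Fvals(i) = tbl{2,5};  % F-value (row 2 = between-group source, column 5 = F)
    pvals(i) = tbl{2,6};  % p-value (column 6 = Prob>F)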
end
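
To make the F-value concrete, here is a minimal sketch that computes the one-way ANOVA F statistic by hand for a single feature, using only base MATLAB. The toy vectors x and g are made up for illustration; in the article's code the feature column plays the role of x and y plays the role of g.

```matlab
% Toy data: one feature observed in three groups (illustrative values only)
x = [1; 2; 3; 2; 3; 4; 5; 6; 7];
g = [1; 1; 1; 2; 2; 2; 3; 3; 3];

[groups, ~, gi] = unique(g);          % gi maps each observation to its group
k = numel(groups);                    % number of groups
n = numel(x);                         % number of observations

grandMean = mean(x);
groupMean = accumarray(gi, x, [], @mean);
groupSize = accumarray(gi, 1);

% Between-group and within-group sums of squares
SSB = sum(groupSize .* (groupMean - grandMean).^2);
SSW = sum((x - groupMean(gi)).^2);

% F = mean square between / mean square within
F = (SSB / (k - 1)) / (SSW / (n - k));
fprintf('F = %.4f with (%d, %d) degrees of freedom\n', F, k - 1, n - k);
```

For these values the group means are 2, 3 and 6, so SSB is 26, SSW is 6, and F is 6.5. A large F means the group means differ by more than the spread inside the groups would explain.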

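Step 2: Improved Feature Filtering

The Benjamini-Hochberg (BH) procedure is used to control the false discovery rate (FDR) across the m tests. The p-values are sorted in ascending order, the i-th smallest is compared with (i/m) * alpha, and every feature up to the largest rank that passes is declared significant: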

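% Benjamini-Hochberg step-up procedure
[sorted_p, sort_idx] = sort(pvals);             % p-values in ascending order
bh_threshold = (1:m)'/m * alpha;                % rank-dependent thresholds (i/m)*alpha
k = find(sorted_p <= bh_threshold, 1, 'last');  % largest rank that passes its threshold

sig_flag = false(size(pvals));
if ~isempty(k)
    % every feature whose p-value does not exceed the k-th smallest is significant
    sig_flag = pvals <= sorted_p(k);
end

% Order the significant features by F-value, largest first
[~, F_rank] = sort(Fvals, 'descend');
final_idx = F_rank(ismember(F_rank, find(sig_flag)));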
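
The step-up behaviour of the BH procedure is easy to miss, so here is a small worked sketch. The five p-values are made up for illustration and the variable names are local to this example. One p-value (0.025) fails its own threshold but is still declared significant, because a larger p-value further down the sorted list passes.

```matlab
% Made-up p-values for five hypothetical features
p = [0.028; 0.6; 0.004; 0.035; 0.025];
alpha_demo = 0.05;
m_demo = numel(p);

sorted_demo = sort(p);                          % 0.004 0.025 0.028 0.035 0.6
thr_demo = (1:m_demo)' / m_demo * alpha_demo;   % 0.01  0.02  0.03  0.04  0.05
passes = sorted_demo <= thr_demo;               % 1     0     1     1     0

% The largest passing rank sets the cut-off
k_demo = find(passes, 1, 'last');               % rank 4
cutoff = sorted_demo(k_demo);                   % 0.035

sig_demo = p <= cutoff;                         % flags in the original feature order
disp(sig_demo');
```

Here four of the five features are flagged, including the one with p = 0.025, which is the intended behaviour of the procedure rather than a bug.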

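The number of features is then adjusted to stay within the configured bounds. If too many features are significant, only the top max_features by F-value are kept. If too few are significant, the list is topped up with the non-significant features that have the highest F-values: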

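if numel(final_idx) > max_features
    % keep only the strongest max_features features
    final_idx = final_idx(1:max_features);
elseif numel(final_idx) < min_features
    backup_num = min_features - numel(final_idx);
    % 'stable' keeps the F-value ordering; plain setdiff would sort by index
    backup_features = setdiff(F_rank, final_idx, 'stable');
    final_idx = [final_idx; backup_features(1:min(backup_num, numel(backup_features)))];
end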
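
The top-up branch depends on the order in which the backup candidates are listed. The sketch below uses made-up F-values and local variable names to show the difference: setdiff returns its result sorted by index unless 'stable' is passed, so without that option the top-up would pick the lowest-numbered features rather than the ones with the highest F-values.

```matlab
% Made-up scores for five hypothetical features; only feature 2 is significant
F_demo = [2.0; 9.5; 0.7; 6.1; 4.3];
sig_demo2 = logical([0; 1; 0; 0; 0]);
min_demo = 3;

[~, rank_demo] = sort(F_demo, 'descend');                        % [2; 4; 5; 1; 3]
sel_demo = rank_demo(ismember(rank_demo, find(sig_demo2)));      % 2

% Candidates in F-value order versus index order
by_F = setdiff(rank_demo, sel_demo, 'stable');                   % [4; 5; 1; 3]
by_index = setdiff(rank_demo, sel_demo);                         % [1; 3; 4; 5]

% Top up to the minimum using the F-value ordering
need = min_demo - numel(sel_demo);
sel_demo = [sel_demo; by_F(1:min(need, numel(by_F)))];           % [2; 4; 5]
disp(sel_demo');
```

With the index ordering the top-up would have added features 1 and 3, which are the two weakest in this example.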

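Step 3: Visualizing the Selection

The F-values are plotted as horizontal bars on a logarithmic axis, sorted in descending order and split into pages. Bar color distinguishes retained features, significant features that were not retained, and non-significant features. Each page is exported as a PNG at the resolution set by figDPI: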

[~, sort_order] = sort(Fvals, 'descend');
features_per_page = 650;
total_pages = ceil(m / features_per_page);

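% Legend text; \color[rgb]{...} is TeX markup interpreted by the annotation textbox
annotation_text = {
    sprintf('\\color[rgb]{1,0.4,0.4}█ Retained features (FDR < %.2f)', alpha), 
    sprintf('\\color[rgb]{0.4,0.6,1}█ Significant features'), 
    sprintf('\\color[rgb]{0.7,0.7,0.7}█ Non-significant features')
    };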

for page = 1:total_pages
    fig = figure('Color','white', 'Position', [100 100 1200 850],...
        'DefaultAxesFontName', 'Arial', 'DefaultAxesFontSize', 10);
    
    start_idx = (page-1)*features_per_page + 1;
    end_idx = min(page*features_per_page, m);
    current_range = start_idx:end_idx;
    
    current_F = Fvals(sort_order(current_range));
    current_names = featureNames(sort_order(current_range));
    
    barcolor = zeros(length(current_range),3);
    is_selected = ismember(sort_order(current_range), final_idx);
    is_sig = sig_flag(sort_order(current_range));
    
    barcolor(is_selected,:) = repmat([1 0.4 0.4], sum(is_selected),1);
    barcolor(~is_selected & is_sig,:) = repmat([0.4 0.6 1], sum(~is_selected & is_sig),1);
    barcolor(~is_selected & ~is_sig,:) = repmat([0.7 0.7 0.7], sum(~is_selected & ~is_sig),1);
    
    hold on;
    h = barh(current_F, 'FaceColor','flat');
    h.CData = barcolor;
    h.EdgeColor = 'none';
    
    set(gca, 'YDir','reverse', 'YTick', [], 'XScale','log', 'LineWidth',1.2);
    xlabel('F-value (log scale)', 'FontSize',12, 'FontWeight','bold');
    title(sprintf('Feature selection - page %d/%d', page, total_pages), 'FontSize',14);
    
    % Legend in the upper right, page counter in the lower right
    annotation(fig, 'textbox', [0.68 0.82 0.25 0.1], 'String', annotation_text, 'EdgeColor', 'none', 'FontSize', 10);
    annotation(fig, 'textbox', [0.92 0.02 0.08 0.04], 'String', sprintf('Page %d/%d', page, total_pages), 'EdgeColor','none', 'FontSize',10);
    
    % Export one PNG per page at the configured resolution
    exportgraphics(fig, sprintf('feature_selection_page_%d.png', page), 'Resolution',figDPI, 'BackgroundColor','white');
    close(fig);
end