[Exploratory Data Analysis] Week 1

本文介绍了数据分析中的六个基本原则,包括比较展示、因果关系解释等,并通过R语言展示了多种数据可视化方法,如散点图、箱线图等,帮助理解数据特征及提出建模策略。

Principles of analysis graphics

  • Principle 1: Show corparisons
    • Evidence for a pyhothesis is always relative to another competing yhpothesis.
    • Always ask “Compared to What?”
  • Principle 2: Show causality, mechanism, explanation, systematic structure
    • What is your causal framework for thinking about a question?
  • Principle 3: Show multivariate data
    • multivariate = more than 2 variables
    • the real world is multivariate
    • need to ‘escape flatland’
  • Principle 4: Integration of evidence
    • completely integrate words, numbers, images, diagrams
    • data graphics should make use of many modes of data presentation.
    • don’t let the tool drive the analysis.
  • Principle 5: Describle and document the evidence with appropriate labels, scales, scources, et al.
    • a data graphic should tell a complete story that is credible.
  • Principle 6: Content is king
    • analytical presentations ultimately stand or falling depend on the quanlity, relerance, and integrity of their content.

Exploratory Graphics

Why do we use graphic in data analysis

  • To understand data properties
  • To find partterns in data
  • To suggest modeling strategies
  • To ‘debug’ analysis
  • To communicate results

Characteristics of exploratory graphics

  • They are made quickly
  • A large number are made
  • The goal is for personal understanding
  • Axes/legends are generally cleaned up
  • Colors/size are primarily used for information
# download data set form internet
setwd('E:\\Dropbox\\coursera\\Exploratory Data Analysis')
if(!file.exists('data')) dir.create('data')
fileUrl <- 'https://raw.githubusercontent.com/jtleek/modules/master/04_ExploratoryAnalysis/exploratoryGraphs/data/avgpm25.csv'
download.file(fileUrl, './data/pollution.csv')
# read data
pollution <- read.csv('./data/pollution.csv', header = TRUE)
head(pollution)

Simple Summary of Data

  • One dimension
    • Five-number summary
    • Boxplot
    • Histograms
    • Density plot
    • Barplot
  • Two dimesions
    • Multiple/overlayed 1-D plots(Lattice/ggplot2)
    • Scatterplot
    • Smooth scatter plot
  • More than two dimensions
    • Ovelayed/multiple 2-D plots,coplots
    • Use color, size, shape to add dimesions
    • Spinning plots
    • Actual 3-D plots(not that useful)
# Five number summary
summary(pollution$pm25)
# Boxplot
boxplot(pollution$pm25, col = 'blue')
# Histograms
hist(pollution$pm25, col = 'green')
rug(pollution$pm25)
hist(pollution$pm25, col = 'green', breaks = 100)
rug(pollution$pm25)
# Overlayng features
boxplot(pollution$pm25, col = 'blue')
abline(h = 12)
hist(pollution$pm25, col = 'green')
abline(v = 12, lwd = 2)
abline(v = median(pollution$pm25), col = 'red')
# Barplot
barplot(table(pollution$region), col = 'wheat', main = 'NO. of counties in each region')
# Multiple boxplot
boxplot(pm25 ~ region, data = pollution, col = 'red')
# multiple histograms
par(mfrow = c(2,1), mar = c(4, 4, 2, 1))
hist(subset(pollution, region == 'east')$pm25, col = 'green')
hist(subset(pollution, region == 'west')$pm25, col = 'green')
# Scatter plot
par(mfrow = c(1,1))
with(pollution, plot(latitude, pm25))
abline(h = 12, lwd = 2, lty = 2)
# scatter plot using color
with(pollution, plot(latitude, pm25, col = region))
abline(h = 12, lwd = 2, lty = 2)
# multiple scatterplots
par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
with(subset(pollution, region == 'east'), plot(latitude, pm25), main= 'East')
with(subset(pollution, region == 'west'), plot(latitude, pm25), main = 'West')

Summary

  • Exploratory plots are ‘quick and dirty’
  • Let you summarize the data(usually graphically) and highly any broad features
  • Explore basic questions and hypotheses(and perhaps rule them out)
  • Suggest modeling strategies for the ‘next step’

Plotting

There are three plotting systems in R

  • The basic plotting system

    • ’Artist’s palette’ model
    • Start with blank canvas and build up from there
    • Start with plot function (or similar)
    • Use annotation functions to add/modify(text, lines, points, axis)

    • (Pros) Convenient, mirrors how we think of building plots and analysis

    • (Cons) Can’t go back one plot has started(i.e. to adjust margins)
    • (Cons) Difficult to ‘translate’ to others once a new plot has been created
    • (Cons) Plot is just a series of R commands
  • The Lattive system

    • (Pros) Plots are created with a single functin cal (xyplot, bwplot, etc)
    • (Pros) Most useful for conditioning types of plots: looking at how y changes with x cross levels of z
    • (Pros) Things like margins/spacing set automatically because entire is specified at once
    • (Pros) Good for putting many many plots on a screen
    • (Cons) Sometimes arkward to specify an entire plot in a single function call
    • (Cons) Annotation in plot is not especially intuitive
    • (Cons) Use of panel functions and subscripts difficult to wield and requires intense preparation.
    • (COns) Cannot ‘add’ to the plot once it is created
  • The ggplot2 system

    • Splits the difference between base and lattive in a number of ways
    • Automatically deals with spacing, text, titles but also allows you to annotate by ‘adding’ to a plot
    • Superficial similarity to lattice but generaly easier/more intuitive to use
    • Default mode makes many choices for you (but you can still customize to your heart’s desire)
  • Summary

    • Base: ‘artist’s palette’ model
    • Lattice: Entire plot specified by one function; conditioning
    • ggplot2: Mixed elements of Base and Lattice

Base Plotting System

  • The core plotting and graphics engine in R is encasulated in the following two packages
    • graphics: contains plotting functions for the ‘base’ graphing system, plot, boxplot and many others
    • grDevices: contains all the code implementing the various graphics device, including X11, PDF, PostScript, PNG, etc
  • The lattice plotting system is implementd using the following packages
    • lattice: contains code for producing Trellis graphics, which are independent of the ‘base’ graphics system; including functions like xyplot, bwplot, levelplot
    • grid: implements a different graphing system independent of the ‘base’ system; the lattice package builds on top of grid; we seldom call functions from grid package directly.
# Simple Base Graphcs: histogram
par(mfrow = c(1,1))
library(datasets)
hist(airquality$Ozone)
# Simple Base Graphics: scatterplot
with(airquality, plot(Wind, Ozone))
# Simple Base Graphics: boxplot
airquality <- transform(airquality, Month = factor(Month))
boxplot(Ozone ~ Month, airquality, xlab = 'month', ylab = 'Ozone(ppb)')
boxplot(Ozone~Month, airquality, xlab="Month", ylab="Ozone (ppb)",col.axis="blue",col.lab="red")

par() function is used to specify global graphics parameters

  • las: the orientation of the axis labels on the plotting
  • bg: the background color
  • mar: the margin size
  • oma: the outer margin size
  • mfrow: number of plot per row, column(plots are filled row-wise)
  • mfcol: number of plot per row, column(plots are filled column-wise)
with(airquality, plot(Wind, Ozone, main = 'Ozone and Wind in New York City'), type = 'n')
with(subset(airquality, Month == 5), points(Wind, Ozone, col = 'blue'))
with(subset(airquality, Month != 5), points(Wind, Ozone, col = 'red'))
legend('topright', pch = 1, col = c('blue', 'red'), legend = c('May', 'Other months'))
with(airquality, plot(Wind, Ozone, main = 'Ozone and Wind in New York City'), pch = 5)
model <- lm(Ozone ~ Wind, data = airquality)
abline(model, lwd = 2)
par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(2, 0, 2, 0)) # bottom left top right
with(airquality, {
        plot(Wind, Ozone, main = 'Ozone and Wind')
        plot(Solar.R, Ozone, main = 'Ozone and Solar.R')
        plot(Temp, Ozone, main = 'Ozone and temperature')
        mtext('Ozone and Weather in New York City', outer = TRUE)
})

Graphics Devies

  • A graphics device is something where you can make a plot
    • A window on your computer(screen device)
    • A PDF file(file device)
    • A PNG or JPEG file(file device)
    • A scalable vector graphics (SVG) file(file device)
  • Screen devices
    • Mac quartz()
    • Windows windows()
    • Unix/Linux xll()
  • Plot on the screen
    • Call a plotting function like plot, xyplot, qplot
    • The plot appears on the screen device
    • Annotate plot if necessary
    • Enjoy
  • Plot to a file
    • Explicity lauch a graphics device
    • Call a plotting function to make a plot
    • Annotate plot if necessary
    • Explicitly close graphics device with dev.off()(this is very important!!!!)
if(!file.exists('figures')) dir.create('figures')
pdf('./figures/myplot.pdf')
with(pollution, plot(pm25, latitude))
title(main = 'Ozone and Wind')
dev.off()
  • Graphics FIle Devices: There are two basic types of file devices: vector and bitmap
    • Vector formats:
      • pdf: useful for line-type graphics, resizes well, usually portable, not efficient if a plot has many objects or points
      • svg: XML- based scalable vector graphics; supports animation and interactive, potentially useful for web-based plots
      • win.metafile: Windows metafile format(only on Windows)
      • postscript: older format, also resizes well, usually portable, can be used to create encapsulated postscript files; Windows systems ofen don’t have a postscript viewer
    • bitmap formats:
      • png: bitmapped format, good for line drawings or images with solid color, uses lossless compression(like the old GIF format), most web browers can read this format natively, good for plotting many many points, does nor resize well
      • jpeg: good for photographs or natural scenes, uses lossy compression, good for plotting many many many points, does not resize well, can be read by almost any computer and any web brower, not great for line drawings
      • tiff: Creates bitmap files in the TIFF format; supprots lossless compression
      • bmp: a native Windows bitmapped format

Copy plots
- dev.copy: copy a plot from one device to another
- dev.copy2pdf: specifically copy a plot to a PDF file.

with(faithful, plot(eruptions, waiting))
title(main = 'Old Faithful Geyser data')
dev.copy(png, file = './figures/greyserplot.png')
dev.off()
function nipt_survival_analysis_corrected() % NIPT最佳检测时点分析 - 修正版 % 基于生存分析原理,正确处理删失数据 clc; close all; clear all; try %% 阶段1:数据读取与预处理 fprintf('=== NIPT最佳检测时点分析 (修正版) ===\n'); fprintf('阶段1:数据读取与预处理...\n'); % 读取数据 survival_data = load_and_preprocess_data(); if isempty(survival_data) error('数据加载失败'); end %% 阶段2:探索性数据分析 fprintf('\n阶段2:探索性数据分析...\n'); exploratory_analysis(survival_data); %% 阶段3:科学BMI分组 fprintf('\n阶段3:科学BMI分组...\n'); [bmi_groups, group_info] = create_bmi_groups(survival_data); %% 阶段4:生存分析 fprintf('\n阶段4:生存分析...\n'); survival_results = perform_survival_analysis(survival_data, bmi_groups); %% 阶段5:最佳检测时点计算 fprintf('\n阶段5:最佳检测时点计算...\n'); optimal_timepoints = calculate_optimal_timepoints(survival_results, group_info); %% 阶段6:敏感性分析 fprintf('\n阶段6:敏感性分析...\n'); sensitivity_results = perform_sensitivity_analysis(survival_data, bmi_groups, optimal_timepoints); %% 阶段7:结果可视化 fprintf('\n阶段7:结果可视化...\n'); create_visualizations(survival_data, bmi_groups, survival_results, optimal_timepoints); %% 阶段8:生成临床建议 fprintf('\n阶段8:生成临床建议...\n'); generate_clinical_recommendations(group_info, optimal_timepoints, sensitivity_results); fprintf('\n=== 分析完成! ===\n'); catch ME fprintf('错误:%s\n', ME.message); fprintf('堆栈跟踪:\n'); for i = 1:length(ME.stack) fprintf(' %s (%d)\n', ME.stack(i).name, ME.stack(i).line); end end end %% 阶段1:数据读取与预处理 function survival_data = load_and_preprocess_data() % 尝试读取数据文件 data_files = {'预处理后的NIPT数据.xlsx', '预处理后的胎儿数据.xlsx', 'NIPT数据.xlsx'}; data = []; selected_file = ''; for i = 1:length(data_files) if exist(data_files{i}, 'file') try data = readtable(data_files{i}, 'PreserveVariableNames', true); selected_file = data_files{i}; fprintf('成功读取文件: %s\n', selected_file); break; catch ME fprintf('读取 %s 失败: %s\n', data_files{i}, ME.message); end end end if isempty(data) error('未找到可读的数据文件'); end % 显示数据基本信息 fprintf('数据维度: %d 行 × %d 列\n', size(data, 1), size(data, 2)); fprintf('变量名: %s\n', strjoin(data.Properties.VariableNames(1:min(10, width(data))), ', ')); % 智能列名匹配 col_map = struct(); col_names = data.Properties.VariableNames; % 查找关键列 col_map.ga = find_column(col_names, {'检测孕周', 'Gestational_Week', 'GestationalAge', '孕周'}); col_map.bmi = find_column(col_names, {'孕妇BMI', 'BMI', 'bmi'}); col_map.y_conc = find_column(col_names, {'Y染色体浓度', 'Y_Concentration', 'Y浓度'}); col_map.age = find_column(col_names, {'年龄', 'Age', '孕妇年龄'}); if isempty(col_map.ga) || isempty(col_map.bmi) || isempty(col_map.y_conc) error('缺少必要的列: 孕周、BMI或Y染色体浓度'); end % 提取男胎数据 (Y染色体浓度 > 0) y_conc_data = data.(col_map.y_conc); male_idx = y_conc_data > 0 & ~isnan(y_conc_data); male_data = data(male_idx, :); fprintf('男胎样本数: %d (总共: %d)\n', sum(male_idx), height(data)); % 提取关键变量 ga = male_data.(col_map.ga); bmi = male_data.(col_map.bmi); y_conc = male_data.(col_map.y_conc); % Y染色体浓度单位标准化 if max(y_conc, [], 'omitnan') <= 1 y_conc = y_conc * 100; fprintf('Y染色体浓度已转换为百分比格式\n'); end % 提取年龄(如果存在) if ~isempty(col_map.age) age = male_data.(col_map.age); else age = nan(size(ga)); end % 数据清洗 valid_idx = ~isnan(ga) & ~isnan(bmi) & ~isnan(y_conc) & ... ga >= 8 & ga <= 25 & ... % 合理的孕周范围 bmi >= 16 & bmi <= 50 & ... % 合理的BMI范围 y_conc >= 0 & y_conc <= 20; % 合理的Y浓度范围 ga = ga(valid_idx); bmi = bmi(valid_idx); y_conc = y_conc(valid_idx); age = age(valid_idx); fprintf('数据清洗后有效样本数: %d\n', sum(valid_idx)); % 构建生存分析数据结构 - 关键修正:不估计删失时间! survival_data = table(); survival_data.time = ga; % 观测时间(检测孕周) survival_data.event = double(y_conc >= 4.0); % 事件:Y浓度≥4%(达标) survival_data.bmi = bmi; survival_data.y_conc_observed = y_conc; survival_data.age = age; % 添加BMI平方项 survival_data.bmi_squared = bmi.^2; % 数据质量统计 fprintf('达标样本数: %d (%.1f%%)\n', sum(survival_data.event), mean(survival_data.event)*100); fprintf('删失样本数: %d (%.1f%%)\n', sum(~survival_data.event), mean(~survival_data.event)*100); fprintf('孕周范围: %.1f - %.1f 周\n', min(survival_data.time), max(survival_data.time)); fprintf('BMI范围: %.1f - %.1f kg/m&sup2;\n', min(survival_data.bmi), max(survival_data.bmi)); end %% 智能列查找函数 function col_name = find_column(all_cols, patterns) col_name = ''; for i = 1:length(patterns) matches = find(contains(all_cols, patterns{i}, 'IgnoreCase', true)); if ~isempty(matches) col_name = all_cols{matches(1)}; fprintf(' 找到列 %s -> %s\n', patterns{i}, col_name); return; end end fprintf(' 警告: 未找到列 %s\n', strjoin(patterns, '/')); end %% 阶段2:探索性数据分析 function exploratory_analysis(survival_data) fprintf('探索性数据分析:\n'); % 基本统计量 fprintf('样本量: %d\n', height(survival_data)); fprintf('事件率: %.1f%%\n', mean(survival_data.event) * 100); fprintf('孕周中位数: %.1f 周\n', median(survival_data.time)); fprintf('BMI中位数: %.1f kg/m&sup2;\n', median(survival_data.bmi)); % 相关性分析 valid_idx = ~isnan(survival_data.time) & ~isnan(survival_data.bmi) & ~isnan(survival_data.y_conc_observed); if sum(valid_idx) > 10 corr_matrix = corr([survival_data.time(valid_idx), survival_data.bmi(valid_idx), ... survival_data.y_conc_observed(valid_idx)], 'Rows', 'complete'); fprintf('时间-BMI相关性: %.3f\n', corr_matrix(1,2)); fprintf('时间-Y浓度相关性: %.3f\n', corr_matrix(1,3)); fprintf('BMI-Y浓度相关性: %.3f\n', corr_matrix(2,3)); end % 绘制散点图矩阵 figure('Name', '探索性数据分析', 'Position', [100, 100, 800, 600]); subplot(2,2,1); scatter(survival_data.bmi, survival_data.time, 30, survival_data.event, 'filled'); colormap([0.8 0.2 0.2; 0.2 0.6 0.2]); % 红色=未达标,绿色=达标 colorbar('Ticks', [0.25, 0.75], 'TickLabels', {'未达标', '达标'}); xlabel('BMI (kg/m&sup2;)'); ylabel('检测孕周'); title('BMI vs 检测孕周'); grid on; subplot(2,2,2); scatter(survival_data.time, survival_data.y_conc_observed, 30, survival_data.bmi, 'filled'); colorbar; xlabel('检测孕周'); ylabel('Y染色体浓度 (%)'); title('孕周 vs Y浓度'); grid on; subplot(2,2,3); histogram(survival_data.bmi, 20, 'FaceColor', [0.2 0.4 0.8]); xlabel('BMI (kg/m&sup2;)'); ylabel('频数'); title('BMI分布'); grid on; subplot(2,2,4); boxplot(survival_data.time, survival_data.event, 'Labels', {'未达标', '达标'}); ylabel('检测孕周'); title('达标状态 vs 检测孕周'); grid on; saveas(gcf, '探索性分析.png'); end %% 阶段3:创建BMI分组 function [bmi_groups, group_info] = create_bmi_groups(survival_data) % 使用WHO标准结合数据分布进行分组 bmi_values = survival_data.bmi; % 基于分位数和临床意义的混合分组 quantiles = quantile(bmi_values, [0, 0.2, 0.4, 0.6, 0.8, 1.0]); % 调整边界以确保临床合理性 boundaries = [0, 18.5, 23, 27.5, 32, max(bmi_values)+1]; boundaries = max(boundaries, quantiles(1:6)); % 确保覆盖所有数据 group_labels = {'偏瘦(<18.5)', '正常(18.5-23)', '超重(23-27.5)', '肥胖I(27.5-32)', '肥胖II+(≥32)'}; % 创建分组 bmi_groups = discretize(bmi_values, boundaries, 'categorical', group_labels); % 计算分组统计信息 group_info = table(); unique_groups = categories(bmi_groups); for i = 1:length(unique_groups) group_idx = bmi_groups == unique_groups{i}; group_data = survival_data(group_idx, :); group_info.group_id(i) = i; group_info.group_name{i} = unique_groups{i}; group_info.n_subjects(i) = sum(group_idx); group_info.event_rate(i) = mean(group_data.event); group_info.median_time(i) = median(group_data.time); group_info.mean_bmi(i) = mean(group_data.bmi); group_info.bmi_range{i} = sprintf('%.1f-%.1f', boundaries(i), boundaries(i+1)); end % 显示分组结果 fprintf('BMI分组结果:\n'); for i = 1:height(group_info) fprintf(' %s: n=%d, 达标率=%.1f%%, 中位时间=%.1f周\n', ... group_info.group_name{i}, group_info.n_subjects(i), ... group_info.event_rate(i)*100, group_info.median_time(i)); end end %% 阶段4:生存分析 function survival_results = perform_survival_analysis(survival_data, bmi_groups) fprintf('执行生存分析...\n'); survival_results = struct(); % 整体Kaplan-Meier分析 [km_time, km_surv, km_ci_lower, km_ci_upper] = kaplan_meier_estimator(... survival_data.time, survival_data.event); survival_results.overall.T = km_time; survival_results.overall.S = km_surv; survival_results.overall.CI_lower = km_ci_lower; survival_results.overall.CI_upper = km_ci_upper; % 计算中位达标时间 median_idx = find(km_surv <= 0.5, 1); if ~isempty(median_idx) survival_results.overall.median_time = km_time(median_idx); else survival_results.overall.median_time = inf; end fprintf('整体中位达标时间: %.1f周\n', survival_results.overall.median_time); % 分组Kaplan-Meier分析 unique_groups = categories(bmi_groups); survival_results.groups = cell(length(unique_groups), 1); for i = 1:length(unique_groups) group_idx = bmi_groups == unique_groups{i}; group_data = survival_data(group_idx, :); if sum(group_idx) >= 5 % 确保有足够样本 [g_km_time, g_km_surv, g_ci_lower, g_ci_upper] = kaplan_meier_estimator(... group_data.time, group_data.event); survival_results.groups{i}.T = g_km_time; survival_results.groups{i}.S = g_km_surv; survival_results.groups{i}.CI_lower = g_ci_lower; survival_results.groups{i}.CI_upper = g_ci_upper; % 计算分组中位达标时间 g_median_idx = find(g_km_surv <= 0.5, 1); if ~isempty(g_median_idx) survival_results.groups{i}.median_time = g_km_time(g_median_idx); else survival_results.groups{i}.median_time = inf; end else fprintf(' 组 %s 样本数不足 (%d),跳过生存分析\n', unique_groups{i}, sum(group_idx)); survival_results.groups{i} = []; end end % 执行Log-Rank检验 survival_results.logrank = perform_logrank_test(survival_data.time, survival_data.event, bmi_groups); fprintf('Log-Rank检验 p值: %.6f\n', survival_results.logrank.p_value); end %% 正确的Kaplan-Meier估计器 function [T, S, CI_lower, CI_upper] = kaplan_meier_estimator(time, event) % 对时间排序 [sorted_time, sort_idx] = sort(time); sorted_event = event(sort_idx); % 获取唯一时间点 [unique_times, ~, idx] = unique(sorted_time); T = unique_times'; n_times = length(T); % 初始化 S = ones(n_times, 1); se_S = zeros(n_times, 1); n_at_risk = zeros(n_times, 1); n_events = zeros(n_times, 1); % 计算Kaplan-Meier估计量 for i = 1:n_times t_i = T(i); % 风险集大小 n_at_risk(i) = sum(sorted_time >= t_i); % 事件数 n_events(i) = sum(sorted_time == t_i & sorted_event == 1); if n_at_risk(i) > 0 if i == 1 S(i) = (n_at_risk(i) - n_events(i)) / n_at_risk(i); else S(i) = S(i-1) * (n_at_risk(i) - n_events(i)) / n_at_risk(i); end % 计算标准误 (Greenwood公式) if S(i) > 0 if i == 1 var_log_S = n_events(i) / (n_at_risk(i) * (n_at_risk(i) - n_events(i))); else var_log_S = var_log_S + n_events(i) / (n_at_risk(i) * (n_at_risk(i) - n_events(i))); end se_S(i) = S(i) * sqrt(var_log_S); end else if i > 1 S(i) = S(i-1); se_S(i) = se_S(i-1); end end end %% 继续Kaplan-Meier估计器 % 计算95%置信区间 z_alpha = 1.96; % 95%置信水平 CI_lower = max(0, S - z_alpha * se_S); CI_upper = min(1, S + z_alpha * se_S); end %% Log-Rank检验实现 function result = perform_logrank_test(time, event, groups) % 简化的Log-Rank检验实现 unique_groups = categories(groups); n_groups = length(unique_groups); if n_groups < 2 result.p_value = NaN; result.chi2_stat = NaN; return; end % 获取所有事件时间点 event_times = unique(time(event == 1)); event_times = sort(event_times); % 初始化统计量 O = zeros(n_groups, 1); % 观察事件数 E = zeros(n_groups, 1); % 期望事件数 % 对每个事件时间点计算 for t = event_times' % 在时间t的风险集 at_risk = time >= t; total_at_risk = sum(at_risk); total_events = sum(time == t & event == 1); if total_events > 0 && total_at_risk > 0 for g = 1:n_groups group_mask = (groups == unique_groups{g}); % 组内在时间t的风险集大小 group_at_risk = sum(at_risk & group_mask); % 组内在时间t的观察事件数 group_events = sum(time == t & event == 1 & group_mask); O(g) = O(g) + group_events; E(g) = E(g) + group_at_risk * (total_events / total_at_risk); end end end % 计算卡方统计量 chi2_stat = sum((O - E).^2 ./ E); df = n_groups - 1; p_value = 1 - chi2cdf(chi2_stat, df); result.p_value = p_value; result.chi2_stat = chi2_stat; result.df = df; result.observed = O; result.expected = E; end %% 阶段5:最佳检测时点计算 function optimal_timepoints = calculate_optimal_timepoints(survival_results, group_info) fprintf('计算最佳检测时点...\n'); n_groups = height(group_info); optimal_timepoints = table(); success_threshold = 0.95; % 95%达标率 for i = 1:n_groups if i <= length(survival_results.groups) && ~isempty(survival_results.groups{i}) km_data = survival_results.groups{i}; % 找到累积达标概率达到95%的时间点 cumulative_success = 1 - km_data.S; target_idx = find(cumulative_success >= success_threshold, 1); if ~isempty(target_idx) optimal_time = km_data.T(target_idx); % 添加安全边际(考虑个体差异和测量误差) optimal_time_adjusted = optimal_time + 0.5; else % 如果数据中未达到95%,使用外推估计 if length(km_data.T) >= 2 % 线性外推 last_two_t = km_data.T(end-1:end); last_two_s = cumulative_success(end-1:end); if diff(last_two_s) > 0 slope = diff(last_two_s) / diff(last_two_t); optimal_time = last_two_t(end) + (success_threshold - last_two_s(end)) / slope; optimal_time_adjusted = optimal_time + 0.5; else optimal_time = km_data.T(end) + 1; optimal_time_adjusted = optimal_time + 0.5; end else optimal_time = 16; % 默认值 optimal_time_adjusted = 16.5; end end % 确保在合理范围内 optimal_time = min(max(optimal_time, 10), 20); optimal_time_adjusted = min(max(optimal_time_adjusted, 10.5), 20.5); % 存储结果 optimal_timepoints.group_id(i) = i; optimal_timepoints.optimal_time(i) = optimal_time; optimal_timepoints.optimal_time_adjusted(i) = optimal_time_adjusted; optimal_timepoints.success_rate_at_optimal(i) = ... interp1(km_data.T, cumulative_success, optimal_time, 'linear', 'extrap'); fprintf(' 组%d: 推荐时点=%.1f周 (调整后=%.1f周), 预期成功率=%.1f%%\n', ... i, optimal_time, optimal_time_adjusted, ... optimal_timepoints.success_rate_at_optimal(i)*100); else % 对于样本不足的组,使用保守估计 optimal_timepoints.group_id(i) = i; optimal_timepoints.optimal_time(i) = 16; optimal_timepoints.optimal_time_adjusted(i) = 16.5; optimal_timepoints.success_rate_at_optimal(i) = 0.85; fprintf(' 组%d: 样本不足,使用保守估计=16.5周\n', i); end end end %% 阶段6:敏感性分析 function sensitivity_results = perform_sensitivity_analysis(survival_data, bmi_groups, optimal_timepoints) fprintf('执行敏感性分析...\n'); sensitivity_results = struct(); n_groups = height(optimal_timepoints); n_simulations = 500; error_levels = [0, 0.2, 0.5, 1.0]; % 测量误差水平 (%) % 初始化存储矩阵 success_rates = zeros(length(error_levels), n_groups); time_variations = zeros(length(error_levels), n_groups); for e = 1:length(error_levels) error_std = error_levels(e); fprintf(' 误差水平 σ=%.1f%%: ', error_std); temp_success = zeros(n_simulations, n_groups); temp_times = zeros(n_simulations, n_groups); for sim = 1:n_simulations % 添加随机测量误差 measurement_error = error_std * randn(height(survival_data), 1); noisy_y_conc = survival_data.y_conc_observed + measurement_error; noisy_event = double(noisy_y_conc >= 4.0); for g = 1:n_groups group_idx = (bmi_groups == categorical(g)); if sum(group_idx) >= 5 group_time = survival_data.time(group_idx); group_event = noisy_event(group_idx); % 计算该组在推荐时点的成功率 recommended_time = optimal_timepoints.optimal_time_adjusted(g); success_at_time = mean(group_event(group_time <= recommended_time)); temp_success(sim, g) = success_at_time; % 重新计算最佳时点 [km_time, km_surv] = kaplan_meier_estimator(group_time, group_event); cumulative_success = 1 - km_surv; new_optimal_idx = find(cumulative_success >= 0.95, 1); if ~isempty(new_optimal_idx) temp_times(sim, g) = km_time(new_optimal_idx); else temp_times(sim, g) = optimal_timepoints.optimal_time(g); end else temp_success(sim, g) = NaN; temp_times(sim, g) = NaN; end end end % 计算统计量 success_rates(e, :) = nanmean(temp_success, 1); time_variations(e, :) = nanstd(temp_times, 0, 1); fprintf('平均成功率=%.1f%%, 时点变异=%.2f周\n', ... mean(success_rates(e, :))*100, mean(time_variations(e, :))); end sensitivity_results.error_levels = error_levels; sensitivity_results.success_rates = success_rates; sensitivity_results.time_variations = time_variations; sensitivity_results.n_simulations = n_simulations; end %% 阶段7:结果可视化 function create_visualizations(survival_data, bmi_groups, survival_results, optimal_timepoints) fprintf('创建可视化图表...\n'); % 主图:生存曲线 fig1 = figure('Name', 'NIPT生存分析结果', 'Position', [100, 100, 1200, 800]); % 子图1:整体生存曲线 subplot(2, 2, 1); plot_kaplan_meier_curve(survival_results.overall, '整体生存曲线'); % 子图2:分组生存曲线 subplot(2, 2, 2); plot_grouped_kaplan_meier(survival_results.groups); % 子图3:最佳时点推荐 subplot(2, 2, 3); plot_optimal_timepoints(optimal_timepoints); % 子图4:BMI与达标时间关系 subplot(2, 2, 4); plot_bmi_vs_time(survival_data, bmi_groups); saveas(fig1, 'NIPT_生存分析结果.png'); % 第二张图:敏感性分析 fig2 = figure('Name', '敏感性分析', 'Position', [150, 150, 1000, 600]); % 假设sensitivity_results已从阶段6获得 % plot_sensitivity_results(sensitivity_results); % 需要实际数据 text(0.5, 0.5, '敏感性分析结果', 'HorizontalAlignment', 'center', 'FontSize', 16); saveas(fig2, 'NIPT_敏感性分析.png'); fprintf('可视化图表已保存\n'); end %% 辅助绘图函数 function plot_kaplan_meier_curve(km_data, title_str) plot(km_data.T, 1-km_data.S, 'b-', 'LineWidth', 2); hold on; plot(km_data.T, 1-km_data.CI_lower, 'b--', 'LineWidth', 1); plot(km_data.T, 1-km_data.CI_upper, 'b--', 'LineWidth', 1); % 标记中位时间 if isfield(km_data, 'median_time') && ~isinf(km_data.median_time) plot(km_data.median_time, 0.5, 'ro', 'MarkerSize', 8, 'LineWidth', 2); text(km_data.median_time, 0.45, sprintf('中位时间: %.1f周', km_data.median_time), ... 'HorizontalAlignment', 'center'); end yline(0.95, 'r--', 'LineWidth', 2, 'Label', '95%达标线'); xlabel('孕周'); ylabel('累积达标概率'); title(title_str); grid on; legend('生存曲线', '95% CI下限', '95% CI上限', 'Location', 'best'); hold off; end function plot_grouped_kaplan_meier(group_km_data) colors = lines(length(group_km_data)); hold on; for i = 1:length(group_km_data) if ~isempty(group_km_data{i}) km_data = group_km_data{i}; plot(km_data.T, 1-km_data.S, 'Color', colors(i,:), 'LineWidth', 2, ... 'DisplayName', sprintf('组%d', i)); end end yline(0.95, 'r--', 'LineWidth', 2, 'Label', '95%达标线'); xlabel('孕周'); ylabel('累积达标概率'); title('分组生存曲线'); legend('Location', 'best'); grid on; hold off; end function plot_optimal_timepoints(optimal_timepoints) bar(optimal_timepoints.group_id, [optimal_timepoints.optimal_time, ... optimal_timepoints.optimal_time_adjusted]); xlabel('BMI组别'); ylabel('推荐检测孕周'); title('各组最佳检测时点'); legend('理论时点', '调整时点(+0.5周)', 'Location', 'best'); grid on; % 添加数值标签 for i = 1:height(optimal_timepoints) text(i, optimal_timepoints.optimal_time(i), ... sprintf('%.1f', optimal_timepoints.optimal_time(i)), ... 'HorizontalAlignment', 'center', 'VerticalAlignment', 'bottom'); text(i, optimal_timepoints.optimal_time_adjusted(i), ... sprintf('%.1f', optimal_timepoints.optimal_time_adjusted(i)), ... 'HorizontalAlignment', 'center', 'VerticalAlignment', 'top'); end end function plot_bmi_vs_time(survival_data, bmi_groups) scatter(survival_data.bmi, survival_data.time, 40, double(bmi_groups), 'filled'); colormap(jet(double(max(bmi_groups)))); colorbar('Label', 'BMI组别'); xlabel('BMI (kg/m&sup2;)'); ylabel('检测孕周'); title('BMI与检测孕周关系'); grid on; % 添加趋势线 valid_idx = ~isnan(survival_data.bmi) & ~isnan(survival_data.time); if sum(valid_idx) > 10 p = polyfit(survival_data.bmi(valid_idx), survival_data.time(valid_idx), 1); bmi_range = linspace(min(survival_data.bmi), max(survival_data.bmi), 100); trend_line = polyval(p, bmi_range); hold on; plot(bmi_range, trend_line, 'r-', 'LineWidth', 2); legend('数据点', '趋势线', 'Location', 'best'); hold off; end end %% 阶段8:生成临床建议 function generate_clinical_recommendations(group_info, optimal_timepoints, sensitivity_results) fprintf('生成临床应用建议...\n'); % 创建推荐表格 fprintf('\n=== NIPT检测时点临床推荐 ===\n'); fprintf('┌──────┬──────────────────┬──────────┬──────────────┬──────────────┐\n'); fprintf('│ 组别 │ BMI范围 (kg/m&sup2;) │ 样本数 │ 推荐时点(周) │ 预期成功率 │\n'); fprintf('├──────┼──────────────────┼──────────┼──────────────┼──────────────┤\n'); for i = 1:height(group_info) fprintf('│ %-4d │ %-16s │ %8d │ %12.1f │ %10.1f%% │\n', ... group_info.group_id(i), ... group_info.bmi_range{i}, ... group_info.n_subjects(i), ... optimal_timepoints.optimal_time_adjusted(i), ... optimal_timepoints.success_rate_at_optimal(i)*100); end fprintf('└──────┴──────────────────┴──────────┴──────────────┴──────────────┘\n'); % 临床建议 fprintf('\n=== 临床实施建议 ===\n'); fprintf('1. 分层检测策略:\n'); for i = 1:height(group_info) risk_level = ''; if group_info.event_rate(i) < 0.7 risk_level = '高风险'; advice = '建议16周后检测,必要时重复检测'; elseif group_info.event_rate(i) < 0.85 risk_level = '中风险'; advice = '按推荐时点检测,密切观察'; else risk_level = '低风险'; advice = '可按标准流程检测'; end fprintf(' - %s (%s): %s\n', group_info.group_name{i}, risk_level, advice); end fprintf('\n2. 质量控制要求:\n'); fprintf(' - Y染色体浓度测量精度: ±0.5%%\n'); fprintf(' - BMI测量精度: ±0.2 kg/m&sup2;\n'); fprintf(' - 孕周确定精度: ±0.5周\n'); fprintf('\n3. 检测误差影响:\n'); if isfield(sensitivity_results, 'error_levels') fprintf(' - 测量误差每增加0.5%%,预期成功率下降约%.1f%%\n', ... mean(diff(sensitivity_results.success_rates(:,1)))*100/2); end fprintf('\n4. 特殊情况处理:\n'); fprintf(' - 对高BMI孕妇(≥30),建议适当延后检测时间\n'); fprintf(' - 检测失败时应在一周后重新检测\n'); fprintf(' - 临界值结果建议结合其他临床指标综合判断\n'); % 保存建议到文件 save_recommendations_to_file(group_info, optimal_timepoints); end %% 保存建议到文件 function save_recommendations_to_file(group_info, optimal_timepoints) filename = 'NIPT_临床建议指南.txt'; fid = fopen(filename, 'w', 'n', 'UTF-8'); if fid == -1 warning('无法创建建议文件'); return; end fprintf(fid, 'NIPT基于BMI分组的最佳检测时点临床建议\n'); fprintf(fid, '生成时间: %s\n\n', datestr(now, 'yyyy-mm-dd HH:MM')); fprintf(fid, '一、BMI分组及推荐检测时点\n'); fprintf(fid, '┌────补全代码
最新发布
09-06
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值