Hands-on: Using an SVM to Predict Whether Gene Expression Samples Are Cancerous

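This article explores using a support vector machine (SVM) to analyze gene expression data and predict whether a colon sample is a tumor. It works with Alon's data set, which contains 40 cancer samples and 21 healthy samples, and classifies them with a linear-kernel SVM. The experiment shows that even when only the first 20 genes are analyzed, prediction accuracy reaches 93%.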


Predicting Whether Gene Expression Samples Are Cancerous with a Support Vector Machine

As we mentioned earlier, gene expression analysis has a wide variety of applications, including cancer studies. In 1999, Uri Alon analyzed gene expression data for 2,000 genes from 40 colon tumor tissues and compared them with data from colon tissues belonging to 21 healthy individuals, all measured at a single time point. We can represent his data as a 2,000 × 61 gene expression matrix, where the first 40 columns describe tumor samples and the last 21 columns describe normal samples.

Now, suppose you performed a gene expression experiment with a colon sample from a new patient, corresponding to a 62nd column in an augmented gene expression matrix. Your goal is to predict whether this patient has a colon tumor. Since the partition of tissues into two clusters (tumor vs. healthy) is known in advance, it may seem that classifying the sample from a new patient is easy. Indeed, since each patient corresponds to a point in 2,000-dimensional space, we can compute the center of gravity of these points for the tumor sample and for the healthy sample. Afterwards, we can simply check which of the two centers of gravity is closer to the new tissue.
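The center-of-gravity idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic (5 made-up genes instead of 2,000), the function names are hypothetical and not from the original post, and labels follow the convention 1 = tumor, 0 = healthy.

```python
import math
import random


def centroid(samples):
    """Component-wise mean of a list of equal-length expression vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def distance(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid_label(sample, tumor_samples, healthy_samples):
    """Return 1 (tumor) or 0 (healthy), whichever center of gravity is closer."""
    d_tumor = distance(sample, centroid(tumor_samples))
    d_healthy = distance(sample, centroid(healthy_samples))
    return 1 if d_tumor < d_healthy else 0


# Synthetic stand-in data (NOT Alon's measurements): 5 genes per sample.
rng = random.Random(0)
tumor = [[rng.gauss(2.0, 0.5) for _ in range(5)] for _ in range(40)]
healthy = [[rng.gauss(0.0, 0.5) for _ in range(5)] for _ in range(21)]
new_patient = [1.9, 2.1, 2.0, 1.8, 2.2]

print(nearest_centroid_label(new_patient, tumor, healthy))
```

On this well-separated toy data the new sample lands on the tumor side. Real expression data is far noisier, so the result here says nothing about how well the rule works on Alon's matrix.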

Alternatively, we could perform a blind analysis, pretending that we do not already know the classification of samples into cancerous vs. healthy, and analyze the resulting 2,000 × 62 expression matrix to divide the 62 samples into two clusters. If we obtain a cluster consisting predominantly of cancer tissues, this cluster may help us diagnose colon cancer.
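The blind analysis can be sketched with a small k-means (Lloyd's algorithm, k = 2) that ignores the known labels while clustering and only uses them afterwards to see which samples share the new patient's cluster. Everything here is an assumption made for illustration: the data is synthetic, the helper names are hypothetical, and the deterministic choice of starting centers is a simplification rather than a standard initialization.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def two_means(samples, iterations=20):
    """Lloyd's algorithm with k = 2; returns a cluster index (0 or 1) per sample."""
    # Deterministic start (a simplification): the first sample and the
    # sample farthest from it serve as the initial centers.
    first = samples[0]
    farthest = max(samples, key=lambda s: sq_dist(s, first))
    centers = [first, farthest]
    labels = [0] * len(samples)
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest center.
        labels = [0 if sq_dist(s, centers[0]) <= sq_dist(s, centers[1]) else 1
                  for s in samples]
        # Update step: each center moves to the mean of its members.
        for k in (0, 1):
            members = [s for s, lab in zip(samples, labels) if lab == k]
            if members:  # keep the previous center if a cluster is empty
                centers[k] = mean(members)
    return labels


# Synthetic stand-in data (NOT Alon's measurements): 5 genes per sample.
rng = random.Random(0)
tumor = [[rng.gauss(5.0, 0.5) for _ in range(5)] for _ in range(40)]
healthy = [[rng.gauss(0.0, 0.5) for _ in range(5)] for _ in range(21)]
new_patient = [4.8, 5.1, 5.0, 4.9, 5.2]

labels = two_means(tumor + healthy + [new_patient])
patient_cluster = labels[-1]
# Count how many known tumor / healthy samples share the new patient's cluster.
tumor_in_cluster = sum(1 for lab in labels[:40] if lab == patient_cluster)
healthy_in_cluster = sum(1 for lab in labels[40:61] if lab == patient_cluster)
print(tumor_in_cluster, healthy_in_cluster)
```

If the new patient's cluster is made up predominantly of known tumor samples, that is evidence for a tumor. The toy clusters are deliberately far apart; nothing guarantees that real expression data splits along the tumor/healthy boundary.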

Final Challenge: These approaches may seem straightforward, but it is unlikely that either of them will reliably diagnose the new patient. Why do you think this is? Given Alon’s 2,000 × 61 gene expression matrix and gene data from a new patient, derive a superior approach to evaluate whether this patient is likely to have a colon tumor.

1. Principles

See:

https://www.cnblogs.com/dfcao/p/3462721.html

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

2.

Data:

40 Cancer Samples

21 Healthy Samples

Unknown Sample

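Problem analysis:

This is a classification problem with 61 training samples and 2,000 features. With so few samples relative to the number of features, an SVM with a Gaussian (RBF) kernel is prone to overfitting, so a linear kernel is used instead.

Code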
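One way to see why a Gaussian (RBF) kernel is risky here: the kernel value is K(x, x') = exp(-gamma * ||x - x'||^2), and in 2,000 dimensions the squared distance between two different samples is typically large. With a large gamma, every sample then looks similar only to itself, so the model can do little more than memorize the training set. The sketch below uses standard-normal synthetic data and gamma = 10 purely as assumptions; real expression values have a different scale, so the exact numbers will differ.

```python
import math
import random


def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq)


# Synthetic stand-in: 10 samples with 2,000 standard-normal "genes" each.
rng = random.Random(0)
samples = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(10)]

gamma = 10
self_similarity = rbf_kernel(samples[0], samples[0], gamma)
off_diagonal = [rbf_kernel(samples[i], samples[j], gamma)
                for i in range(10) for j in range(i + 1, 10)]

print(self_similarity)    # similarity of a sample with itself
print(max(off_diagonal))  # largest similarity between two different samples
```

A kernel matrix that is close to the identity carries almost no information about which samples resemble each other, which is the overfitting scenario described above. A much smaller gamma or rescaled features would change this picture, so it is an illustration rather than a proof.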

from os.path import dirname, join

from sklearn import svm


def load_samples(filename):
    """Read one sample per line as a list of floats (whitespace-separated)."""
    with open(join(dirname(__file__), filename)) as f:
        return [list(map(float, line.split()))
                for line in f.read().strip().split('\n')]


def Input():
    # Hold out the first 10 cancer and first 5 healthy samples for checking;
    # train on the rest. Labels: 1 = cancer, 0 = healthy.
    cancer = load_samples('colon_cancer.txt')
    healthy = load_samples('colon_healthy.txt')

    X = cancer[10:] + healthy[5:]
    Y = [1] * (len(cancer) - 10) + [0] * (len(healthy) - 5)
    check_x = cancer[:10] + healthy[:5]
    check_y = [1] * 10 + [0] * 5

    # The unknown sample to classify
    test_X = load_samples('colon_test.txt')

    return [X, Y, test_X, check_x, check_y]


if __name__ == '__main__':
    [X_train, y_train, test_X, check_x, check_y] = Input()

    # Linear kernel; gamma has no effect with a linear kernel, so it is not set
    clf = svm.SVC(kernel='linear')
    clf.fit(X_train, y_train)

    # Accuracy on the held-out check samples
    predict_for_check = clf.predict(check_x)
    cnt = 0
    for i in range(len(check_y)):
        if check_y[i] == predict_for_check[i]:
            cnt += 1
    print('Accuracy %.0f%%' % (100 * cnt / len(check_y)))

    # Prediction for the unknown sample
    print(clf.predict(test_X))

Output:

Accuracy 87%
[0]

Curiously, when only the first 20 genes are used for the analysis, accuracy on the held-out check samples rises to 93%:

Accuracy 93%
[0]
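With only 15 held-out samples (10 cancer, 5 healthy), 87% and 93% correspond to 13 and 14 correct predictions, so the two results differ by a single sample. A cross-validated estimate is usually steadier than one small hold-out split. The sketch below shows one way to compare all genes against the first 20 with scikit-learn's `cross_val_score`. The data is synthetic and deliberately built so that only the first 20 columns carry signal; that construction is an assumption made to keep the demo non-trivial and says nothing about the real colon data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in (NOT Alon's measurements): 61 samples x 2,000 genes.
rng = np.random.default_rng(0)
y = np.array([1] * 40 + [0] * 21)   # 1 = cancer, 0 = healthy
X = rng.normal(size=(61, 2000))
X[y == 1, :20] += 1.0               # assumption: only the first 20 genes differ

# Stratified folds keep the cancer/healthy ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = SVC(kernel='linear')

scores_all = cross_val_score(clf, X, y, cv=cv)
scores_top20 = cross_val_score(clf, X[:, :20], y, cv=cv)

print('all genes:      %.2f' % scores_all.mean())
print('first 20 genes: %.2f' % scores_top20.mean())
```

To apply this to the real files, `X` and `y` would be built from all 61 labeled samples instead of the synthetic arrays.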

 

Reposted from: https://www.cnblogs.com/lokwongho/p/9979089.html
