The requirements are as follows:
* Input format (only the first three columns are needed)
```
Gene Sample Value Unit Abundance
ENSG00000000003 A-431 21.3 FPKM Medium
ENSG00000000003 A-549 32.5 FPKM Medium
ENSG00000000003 AN3-CA 38.2 FPKM Medium
ENSG00000000003 BEWO 31.4 FPKM Medium
ENSG00000000003 CACO-2 63.9 FPKM High
ENSG00000000005 A-431 0.0 FPKM Not detected
ENSG00000000005 A-549 0.0 FPKM Not detected
ENSG00000000005 AN3-CA 0.0 FPKM Not detected
ENSG00000000005 BEWO 0.0 FPKM Not detected
ENSG00000000005 CACO-2 0.0 FPKM Not detected
```
* Output format
```
Name A-431 A-549 AN3-CA BEWO CACO-2
ENSG00000000460 25.2 14.2 10.6 24.4 14.2
ENSG00000000938 0.0 0.0 0.0 0.0 0.0
ENSG00000001084 19.1 155.1 24.4 12.6 23.5
ENSG00000000457 2.8 3.4 3.8 5.8 2.9
```
Main text:
I'm a complete beginner, so after a lot of fiddling I solved this with a clumsy approach. Please don't laugh.
I used Python's pandas.
import pandas as pd
# Load the table
df = pd.read_csv("../multipleColExpr.txt", sep="\t")  # sep="\t" parses the tab-separated file into a table
# Filter the data
data = df[["Gene", "Sample", "Value"]]  # select the first three columns (listed by name here)
data_Gene = data['Gene'].drop_duplicates()  # keep only the unique values of the Gene column
data_Gene = data_Gene.reset_index(drop=True)  # drop the old index so that values line up on the same row later
data1 = pd.DataFrame(columns=['Name', 'A4', 'A5', 'AN', 'BEWO', 'CACO'])  # build an empty DataFrame with placeholder column names
data_A4 = data[data['Sample'] == 'A-431']  # keep the rows whose Sample is A-431; same for the samples below
data_A4 = data_A4.reset_index(drop=True)  # reset the index, a very important step
data_A4 = data_A4['Value']  # keep only the Value column for A-431
# The remaining samples repeat the same three steps
data_A5 = data[data['Sample'] == 'A-549']
data_A5 = data_A5.reset_index(drop=True)
data_A5 = data_A5['Value']
data_AN = data[data['Sample'] == 'AN3-CA']
data_AN = data_AN.reset_index(drop=True)
data_AN = data_AN['Value']
data_B = data[data['Sample'] == 'BEWO']
data_B = data_B.reset_index(drop=True)
data_B = data_B['Value']
data_C = data[data['Sample'] == 'CACO-2']
data_C = data_C.reset_index(drop=True)
data_C = data_C['Value']
# Match each column name with its values
data1.Name = data_Gene
data1.A4 = data_A4
data1.A5 = data_A5
data1.AN = data_AN
data1.BEWO = data_B
data1.CACO = data_C
# Rename the columns to the real sample names
data1 = data1.rename(columns={'A4': 'A-431', 'A5': 'A-549', 'AN': 'AN3-CA', 'CACO': 'CACO-2'})
# Output
data1
This is a clumsy approach and probably won't hold up once the data gets large. I'll optimize it later.
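For comparison, the same long-to-wide reshaping can probably be done in one step with pandas' built-in `pivot`. This is only a minimal sketch and not the approach used in this post: the in-memory `raw` string is a stand-in for `multipleColExpr.txt` built from the sample input above, and the name `wide` is made up for illustration. It also assumes each (Gene, Sample) pair appears at most once, since `pivot` raises an error on duplicates.

```python
import io
import pandas as pd

# Stand-in for multipleColExpr.txt (assumption: the real file is tab-separated
# with the same header as the sample input shown earlier).
raw = (
    "Gene\tSample\tValue\tUnit\tAbundance\n"
    "ENSG00000000003\tA-431\t21.3\tFPKM\tMedium\n"
    "ENSG00000000003\tA-549\t32.5\tFPKM\tMedium\n"
    "ENSG00000000003\tAN3-CA\t38.2\tFPKM\tMedium\n"
    "ENSG00000000003\tBEWO\t31.4\tFPKM\tMedium\n"
    "ENSG00000000003\tCACO-2\t63.9\tFPKM\tHigh\n"
    "ENSG00000000005\tA-431\t0.0\tFPKM\tNot detected\n"
    "ENSG00000000005\tA-549\t0.0\tFPKM\tNot detected\n"
    "ENSG00000000005\tAN3-CA\t0.0\tFPKM\tNot detected\n"
    "ENSG00000000005\tBEWO\t0.0\tFPKM\tNot detected\n"
    "ENSG00000000005\tCACO-2\t0.0\tFPKM\tNot detected\n"
)

# Read only the three columns that are needed.
df = pd.read_csv(io.StringIO(raw), sep="\t", usecols=["Gene", "Sample", "Value"])

# One row per gene, one column per sample, cells filled from Value.
wide = df.pivot(index="Gene", columns="Sample", values="Value")

# Drop the "Sample" label on the column axis, turn the Gene index back into
# a regular column, and rename it to match the required output header.
wide = wide.rename_axis(columns=None).reset_index().rename(columns={"Gene": "Name"})

print(wide)
```

Unlike the filter-and-reset approach, `pivot` matches values by gene and sample labels rather than by row position, so it does not depend on every sample listing the genes in the same order. Note that it sorts both the genes and the sample columns.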
Update, 2023/5/4:
I realized that the way I built the data1 DataFrame was silly. There is a quicker way to construct it.
import pandas as pd
df = pd.read_csv("/home/eva/Luolab/data/multipleColExpr.txt",sep="\t")
data = df[["Gene","Sample","Value"]]
data_Gene = data['Gene'].drop_duplicates()  # drop duplicate gene IDs
data_Gene = data_Gene.reset_index(drop=True)  # discard the old index and renumber from 0
data_A4 = data[data['Sample']=='A-431']  # filter by sample
data_A4 = data_A4.reset_index(drop=True)
data_A4 = data_A4['Value']
data_A5 = data[data['Sample']=='A-549']
data_A5 = data_A5.reset_index(drop=True)
data_A5 = data_A5['Value']
data_AN = data[data['Sample']=='AN3-CA']
data_AN = data_AN.reset_index(drop=True)
data_AN = data_AN['Value']
data_B = data[data['Sample']=='BEWO']
data_B = data_B.reset_index(drop=True)
data_B = data_B['Value']
data_C = data[data['Sample']=='CACO-2']
data_C = data_C.reset_index(drop=True)
data_C = data_C['Value']
data2 = pd.DataFrame({'data_Gene':data_Gene,'data_A4':data_A4,'data_A5':data_A5,'data_AN':data_AN,'data_B':data_B,'data_C':data_C})
data2
But I still have to list out a bunch of things by hand. Next time I'll see whether a for loop can handle it.
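One way the for-loop idea might look is sketched below. It keeps the same filter, reset-index, take-Value steps as the code above, but loops over the sample names found in the data instead of listing them by hand. The `raw` string is a stand-in for `multipleColExpr.txt`, and `columns` and `data3` are names made up for this sketch. Like the original code, it assumes every sample lists the genes in the same order, because the columns are lined up by row position.

```python
import io
import pandas as pd

# Stand-in for multipleColExpr.txt (assumption: tab-separated, same header
# as the sample input shown at the top of the post).
raw = (
    "Gene\tSample\tValue\tUnit\tAbundance\n"
    "ENSG00000000003\tA-431\t21.3\tFPKM\tMedium\n"
    "ENSG00000000003\tA-549\t32.5\tFPKM\tMedium\n"
    "ENSG00000000003\tAN3-CA\t38.2\tFPKM\tMedium\n"
    "ENSG00000000003\tBEWO\t31.4\tFPKM\tMedium\n"
    "ENSG00000000003\tCACO-2\t63.9\tFPKM\tHigh\n"
    "ENSG00000000005\tA-431\t0.0\tFPKM\tNot detected\n"
    "ENSG00000000005\tA-549\t0.0\tFPKM\tNot detected\n"
    "ENSG00000000005\tAN3-CA\t0.0\tFPKM\tNot detected\n"
    "ENSG00000000005\tBEWO\t0.0\tFPKM\tNot detected\n"
    "ENSG00000000005\tCACO-2\t0.0\tFPKM\tNot detected\n"
)

df = pd.read_csv(io.StringIO(raw), sep="\t")
data = df[["Gene", "Sample", "Value"]]

# Start with the de-duplicated gene IDs, re-indexed from 0.
columns = {"Name": data["Gene"].drop_duplicates().reset_index(drop=True)}

# unique() returns the sample names in order of first appearance,
# so nothing has to be typed out by hand.
for sample in data["Sample"].unique():
    values = data.loc[data["Sample"] == sample, "Value"]
    # Reset the index so each sample's values line up row by row.
    columns[sample] = values.reset_index(drop=True)

# Build the wide table in one go; the dict keys become the column names.
data3 = pd.DataFrame(columns)
print(data3)
```

Because the dictionary keys are the real sample names, the columns already carry the headers from the required output and no rename step is needed.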




