生信老学长
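"Senior, why does reading my xlsx file keep throwing errors?" "Senior, why did my header row disappear?" When a junior labmate looks at you, hungry for answers, do you find yourself avoiding eye contact? Relax: 生信老学长 is here to do the talking for you.
In day-to-day work, the files we handle fall into three broad kinds: well-structured tables with row and column names, plain text files, and files with a domain-specific format. The common file types include CSV, XLSX, TXT, GTF, and FASTA.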
I. Reading well-structured files
1. CSV files
> read.csv('删除行名1.csv')
  X1 names gene symbol sample
1  2  out1 gen1    sb1    sp1
2  3  out2 gen2    sb2    sp2
3  4  out3 gen3    sb3    sp3
4  5  out4 gen4    sb4    sp4
5  6  out5 gen5    sb5    sp5
> read.csv('跳过注释.csv', comment.char = "#")  # skip the lines that start with "#"
  names gene symbol sample         X   X.1
1  out1 gen1    sb1    sp1     test1 test1
2  out2 gen2    sb2    sp2 red apple test2
3  out3 gen3    sb3    sp3     test3 test3
4  out4 gen4    sb4    sp4           test4
5  out5 gen5    sb5    sp5     test5 test5
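The two calls above can be tried without the sample files. The sketch below writes a small CSV with two leading `#` lines to a temporary file (the contents are invented for illustration, modeled on the output above), reads it with `comment.char`, and then shows `row.names = 1` as one possible way to turn the first column into row names.

```r
# Build a small CSV with leading comment lines (hypothetical contents).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "####test",
  "##test2",
  "names,gene,symbol,sample",
  "out1,gen1,sb1,sp1",
  "out2,gen2,sb2,sp2"
), csv_path)

# comment.char = "#" drops everything from "#" to the end of each line,
# so the two comment lines vanish and the third line becomes the header.
df <- read.csv(csv_path, comment.char = "#")
print(df)

# row.names = 1 uses the first column as row names instead of a data column.
df_rn <- read.csv(csv_path, comment.char = "#", row.names = 1)
print(rownames(df_rn))
```

Note that `read.csv()` sets `comment.char = ""` by default, which is why comment lines end up in the data unless the argument is given explicitly.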
2. XLSX files
library(readxl)
> read_excel("规范数据.xlsx", sheet = "Sheet1")  # read Sheet1
# A tibble: 5 × 4
  names gene  symbol sample
  <chr> <chr> <chr>  <chr>
1 out1  gen1  sb1    sp1
2 out2  gen2  sb2    sp2
3 out3  gen3  sb3    sp3
4 out4  gen4  sb4    sp4
5 out5  gen5  sb5    sp5
> read_excel("规范数据.xlsx", sheet = "Sheet2", skip = 2)  # skip the two leading "#" rows
New names:
• `` -> `...5`
• `` -> `...6`
# A tibble: 5 × 6
  names gene  symbol sample ...5  ...6
  <chr> <chr> <chr>  <chr>  <chr> <chr>
1 out1  gen1  sb1    sp1    test1 test1
2 out2  gen2  sb2    sp2    test2 test2
3 out3  gen3  sb3    sp3    test3 test3
4 out4  gen4  sb4    sp4    NA    test4
5 out5  gen5  sb5    sp5    test5 test5
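When you are not sure what a workbook contains, it can help to list its sheets before reading. The sketch below uses the example workbook that ships with readxl (located with `readxl_example()`) rather than the author's file, so the sheet names `iris` and `mtcars` belong to that bundled workbook and are not part of the original post.

```r
library(readxl)

# readxl_example() returns the path to a workbook bundled with readxl.
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheet names before deciding which one to read.
print(excel_sheets(xlsx_path))

# Read one sheet by name; n_max limits how many data rows are read.
iris_head <- read_excel(xlsx_path, sheet = "iris", n_max = 5)
print(iris_head)

# range reads an explicit cell rectangle, an alternative to skip
# when the table does not start in the first row.
block <- read_excel(xlsx_path, sheet = "mtcars", range = "A1:C4")
print(block)
```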
3. A general-purpose reader
> read.table('跳过注释.csv')  # defaults (whitespace separator, no header): fails
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  line 1 did not have 2 elements
> read.table('跳过注释.csv', sep = ",", header = TRUE, comment.char = "#")  # set the separator explicitly
  names gene symbol sample         X   X.1
1  out1 gen1    sb1    sp1     test1 test1
2  out2 gen2    sb2    sp2 red apple test2
3  out3 gen3    sb3    sp3     test3 test3
4  out4 gen4    sb4    sp4           test4
5  out5 gen5    sb5    sp5     test5 test5
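The first call fails because `read.table()` splits on whitespace unless told otherwise. The sketch below, using invented sample rows, illustrates that `read.csv()` is essentially `read.table()` with CSV-friendly defaults, and that the same function reads tab-delimited text once `sep` is changed.

```r
# Hypothetical sample data, written once as CSV and once as tab-delimited text.
rows <- c("names,gene,symbol,sample",
          "out1,gen1,sb1,sp1",
          "out2,gen2,red apple,sp2")
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
writeLines(rows, csv_path)
writeLines(gsub(",", "\t", rows), tsv_path)

# read.csv() is read.table() with the CSV defaults spelled out here.
via_csv   <- read.csv(csv_path)
via_table <- read.table(csv_path, header = TRUE, sep = ",", quote = "\"",
                        fill = TRUE, comment.char = "")
print(identical(via_csv, via_table))

# The same function reads tab-delimited text once sep is changed.
via_tsv <- read.table(tsv_path, header = TRUE, sep = "\t")
print(via_tsv)
```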
4. Faster reading
Large files are better read with fread() from the data.table package, which is faster.
library(data.table)
> fread('跳过注释.csv', sep = ",", header = TRUE)
   ####test   V2     V3     V4    V5    V6
1:  ##test2
2:    names gene symbol sample
3:     out1 gen1    sb1    sp1 test1 test1
4:     out2 gen2    sb2    sp2 test2 test2
5:     out3 gen3    sb3    sp3 test3 test3
6:     out4 gen4    sb4    sp4       test4
7:     out5 gen5    sb5    sp5 test5 test5
> fread('跳过注释.csv', sep = ",", header = TRUE, skip = 2)  # skip the two comment lines
   names gene symbol sample        V5    V6
1:  out1 gen1    sb1    sp1     test1 test1
2:  out2 gen2    sb2    sp2 red apple test2
3:  out3 gen3    sb3    sp3     test3 test3
4:  out4 gen4    sb4    sp4           test4
5:  out5 gen5    sb5    sp5     test5 test5
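Counting comment lines by hand gets tedious. As a sketch, `fread()` also accepts a string for `skip` and starts reading at the first line that contains it. The file below is a made-up stand-in for the sample CSV, and the `select` call is an extra illustration that the original post does not use.

```r
library(data.table)

# Hypothetical stand-in for the sample file, with two leading comment lines.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "####test,,,",
  "##test2,,,",
  "names,gene,symbol,sample",
  "out1,gen1,sb1,sp1",
  "out2,gen2,sb2,sp2"
), csv_path)

# skip = "names" starts reading at the first line containing that string,
# so the header row is found without counting comment lines.
dt <- fread(csv_path, sep = ",", header = TRUE, skip = "names")
print(dt)

# select reads only the listed columns, which saves memory on wide files.
dt_small <- fread(csv_path, skip = "names", select = c("names", "gene"))
print(dt_small)
```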
II. Reading text files
1. readLines()
> readLines("跳过注释.csv")  # one string per line, comment lines included
[1] "####test,,,,," "##test2,,,,,"
[3] "names,gene,symbol,sample,," "out1,gen1,sb1,sp1,test1,test1"
[5] "out2,gen2,sb2,sp2,red apple,test2" "out3,gen3,sb3,sp3,test3,test3"
[7] "out4,gen4,sb4,sp4,,test4" "out5,gen5,sb5,sp5,test5,test5"
> readLines("普通文本.txt")  # read whole lines
[1] "wx:生信老学长" "课题指导 生信分析 个性化分析 基金指导"
[3] "wx:生信老学长" "课题指导 生信分析 个性化分析 基金指导"
[5] "wx:生信老学长" "课题指导 生信分析 个性化分析 基金指导"
[7] "wx:生信老学长" "课题指导 生信分析 个性化分析 基金指导"
[9] "wx:生信老学长" "课题指导 生信分析 个性化分析 基金指导"
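Because `readLines()` returns a plain character vector, it combines well with ordinary filtering. The sketch below, on an invented temporary file, peeks at the first lines with `n`, then drops the `#` lines by hand and passes the rest to `read.csv()` through its `text` argument.

```r
# Hypothetical stand-in for the sample file.
txt_path <- tempfile(fileext = ".csv")
writeLines(c(
  "####test,,,",
  "##test2,,,",
  "names,gene,symbol,sample",
  "out1,gen1,sb1,sp1",
  "out2,gen2,sb2,sp2"
), txt_path)

# n = 3 reads only the first three lines, handy for peeking at a big file.
head_lines <- readLines(txt_path, n = 3)
print(head_lines)

# Read everything, drop lines that start with "#", and parse the remainder.
all_lines  <- readLines(txt_path)
data_lines <- all_lines[!startsWith(all_lines, "#")]
df <- read.csv(text = data_lines)
print(df)
```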
2. scan()
> scan("普通文本.txt", what = "character", encoding = "UTF-8")  # split on whitespace
Read 25 items
[1] "wx:生信老学长" "课题指导" "生信分析" "个性化分析" "基金指导"
[6] "wx:生信老学长" "课题指导" "生信分析" "个性化分析" "基金指导"
[11] "wx:生信老学长" "课题指导" "生信分析" "个性化分析" "基金指导"
[16] "wx:生信老学长" "课题指导" "生信分析" "个性化分析" "基金指导"
[21] "wx:生信老学长" "课题指导" "生信分析" "个性化分析" "基金指导"
> scan("普通文本.txt", sep = "\n", what = "character", encoding = "UTF-8")  # one element per line
Read 10 items
[1] "wx:生信老学长" "课题指导 生信分析 个性化分析 基金指导"
[3] "wx:生信老学长" "课题指导 生信分析 个性化分析 基金指导"
[5] "wx:生信老学长" "课题指导 生信分析 个性化分析 基金指导"
[7] "wx:生信老学长" "课题指导 生信分析 个性化分析 基金指导"
[9] "wx:生信老学长" "课题指导 生信分析 个性化分析 基金指导"
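`scan()` can also assign a type to each field when `what` is a list. The sketch below uses an invented three-column text file; the field names `id`, `count`, and `score` are hypothetical.

```r
# Hypothetical whitespace-separated file: an id, an integer, and a number.
txt_path <- tempfile(fileext = ".txt")
writeLines(c("sample1 12 0.5", "sample2 7 1.25"), txt_path)

# Default behaviour: split on whitespace and return one character vector.
tokens <- scan(txt_path, what = "character", quiet = TRUE)

# sep = "\n" keeps each line as a single element.
lines <- scan(txt_path, what = "character", sep = "\n", quiet = TRUE)

# A list passed as `what` gives each field its own type;
# the result is a list with one vector per field.
fields <- scan(txt_path, what = list(id = "", count = 0L, score = 0),
               quiet = TRUE)
print(fields)
```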
3. read_all()
library(xfun)
> read_all("普通文本.txt")  # read the whole file
wx:生信老学长
课题指导 生信分析 个性化分析 基金指导
wx:生信老学长
课题指导 生信分析 个性化分析 基金指导
wx:生信老学长
课题指导 生信分析 个性化分析 基金指导
wx:生信老学长
课题指导 生信分析 个性化分析 基金指导
wx:生信老学长
课题指导 生信分析 个性化分析 基金指导
> read_all("普通文本.txt")[1]
[1] "wx:生信老学长"
III. Reading special-format files
1. FASTA files
library(Biostrings)
> readBStringSet("fasta.fa", format="fasta")
BStringSet object of length 2:
width seq names
[1] 72 AGTACGTAGTCGCTGCTGCTACGGGCGCTAG...ACGACGTAGATGCTAGCTGACTAAACGATGC sequence1
[2] 70 AAACGATCGATCGTACTCGACTGATGTAGTA...GTACGTAGCATCGTCAGTTACTGCATGCGGG sequence2
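To see what a FASTA reader actually does, here is a minimal base-R sketch that parses the format by hand. It is not how Biostrings works internally. The helper name `read_fasta_base` and the sample records are invented for illustration, and the sketch assumes every header line is followed by at least one sequence line. For real analyses the Biostrings readers remain the safer choice.

```r
# Hypothetical two-record FASTA file; sequences may span several lines.
fa_path <- tempfile(fileext = ".fa")
writeLines(c(
  ">sequence1 demo record",
  "AGTACGTAGT",
  "CGCTGCTGCT",
  ">sequence2",
  "AAACGATCGA"
), fa_path)

read_fasta_base <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]        # ignore blank lines
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)       # record number for every line
  # Join the sequence lines of each record into one string.
  seqs <- vapply(split(lines[!is_header], record_id[!is_header]),
                 paste, character(1), collapse = "")
  # Name each sequence by the first word of its header, without ">".
  headers <- sub("^>", "", lines[is_header])
  names(seqs) <- sub("[[:space:]].*$", "", headers)
  seqs
}

seqs <- read_fasta_base(fa_path)
print(seqs)
print(nchar(seqs))
```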
2. GTF files
library(rtracklayer)
> import('test1.gtf', format = "gtf")
GRanges object with 10 ranges and 6 metadata columns:
       seqnames    ranges strand |   source        type     score     phase     gene_id
          <Rle> <IRanges>  <Rle> | <factor>    <factor> <numeric> <integer> <character>
   [1]      381   150-200      + | Twinscan        exon        NA      <NA>     381.000
   [2]      381   300-401      + | Twinscan        exon        NA      <NA>     381.000
   [3]      381   380-401      + | Twinscan         CDS        NA         0     381.000
   [4]      381   501-650      + | Twinscan        exon        NA      <NA>     381.000
   [5]      381   501-650      + | Twinscan         CDS        NA         2     381.000
   [6]      381   700-800      + | Twinscan        exon        NA      <NA>     381.000
   [7]      381   700-707      + | Twinscan         CDS        NA         2     381.000
   [8]      381  900-1000      + | Twinscan        exon        NA      <NA>     381.000
   [9]      381   380-382      + | Twinscan start_codon        NA         0     381.000
  [10]      381   708-710      + | Twinscan  stop_codon        NA         0     381.000
       transcript_id
         <character>
   [1]     381.000.1
   [2]     381.000.1
   [3]     381.000.1
   [4]     381.000.1
   [5]     381.000.1
   [6]     381.000.1
   [7]     381.000.1
   [8]     381.000.1
   [9]     381.000.1
  [10]     381.000.1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
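A GTF file is just nine tab-separated columns, so when Bioconductor is not installed it can be read with `read.table()` as a rough alternative to rtracklayer. The sketch below assumes a tiny invented file whose values mirror the first rows of the output above. The column names in `gtf_cols` follow the usual GTF field order but are chosen here for illustration.

```r
# Hypothetical two-feature GTF with one comment line.
gtf_path <- tempfile(fileext = ".gtf")
attrs <- 'gene_id "381.000"; transcript_id "381.000.1";'
writeLines(c(
  "# hypothetical GTF for illustration",
  paste("381", "Twinscan", "exon", "150", "200", ".", "+", ".", attrs, sep = "\t"),
  paste("381", "Twinscan", "CDS",  "380", "401", ".", "+", "0", attrs, sep = "\t")
), gtf_path)

gtf_cols <- c("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attribute")

# quote = "" keeps the double quotes inside the attribute column intact;
# na.strings = "." turns the GTF placeholder for missing values into NA.
gtf <- read.table(gtf_path, sep = "\t", header = FALSE, quote = "",
                  comment.char = "#", col.names = gtf_cols,
                  na.strings = ".", stringsAsFactors = FALSE)

# Pull gene_id out of the free-text attribute column with a regular expression.
gtf$gene_id <- sub('.*gene_id "([^"]+)".*', "\\1", gtf$attribute)
print(gtf[, c("seqname", "feature", "start", "end", "strand", "gene_id")])
```

Unlike `import()`, this gives a plain data frame rather than a GRanges object, so range operations are not available without further conversion.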
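3. PDB files
library(bio3d)
pdb <- read.pdb("test.pdb")  # parse the PDB file into a "pdb" object
summary(pdb)                 # print an overview of the structure
pdb$atom                     # data frame with one row per atom record
pdb$xyz                      # numeric matrix of atomic coordinates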
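Friends, grab this guide to reading files in R and hold forth with confidence in front of your junior labmates. Follow the official account and reply "read data" to get the files and code.
WeChat official account: 生信老学长
Professional project guidance and bioinformatics support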