Rosalind programming problems: Computing GC Content.
Computing GC Content
Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are ‘C’ or ‘G’. For example, the GC-content of “AGCTATAG” is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with ‘>’, followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with ‘>’ indicates the label of the next string.
In Rosalind’s implementation, a string in FASTA format will be labeled by the ID “Rosalind_xxxx”, where “xxxx” denotes a four-digit code between 0000 and 9999.
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Sample input:
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
Sample output:
Rosalind_0808
60.919540
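The definition in the problem statement can be checked directly on its own example. The sketch below is illustrative only: the class name GcContentSketch and both method names are hypothetical, and it assumes a non-empty, uppercase DNA string. It computes the GC-content of "AGCTATAG" and confirms that the reverse complement gives the same value, because reversing does not change base counts and complementing only swaps G with C and A with T.

```java
// Hypothetical helper class for illustration; not part of the original solution.
class GcContentSketch {
    // Percentage of characters in dna that are 'G' or 'C' (assumes dna is non-empty).
    static double gcContent(String dna) {
        int gc = 0;
        for (int i = 0; i < dna.length(); i++) {
            char c = dna.charAt(i);
            if (c == 'G' || c == 'C') {
                gc++;
            }
        }
        return 100.0 * gc / dna.length();
    }

    // Reverse the string and swap A<->T and C<->G.
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder();
        for (int i = dna.length() - 1; i >= 0; i--) {
            switch (dna.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default: throw new IllegalArgumentException("Not a DNA base: " + dna.charAt(i));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dna = "AGCTATAG";
        // 3 of the 8 bases are G or C, so this prints 37.5
        System.out.println(gcContent(dna));
        // The reverse complement is CTATAGCT, which also gives 37.5
        System.out.println(gcContent(reverseComplement(dna)));
    }
}
```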
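GC content is the proportion of guanine and cytosine among all bases in the genome under study, and it influences the stability of the DNA. This problem asks us to read a FASTA file containing several sequences, compute the GC content of each one, and print the header of the sequence with the highest GC content together with that GC content. The approach is as follows:
- 1. Read the FASTA file line by line and remove the line breaks inside each sequence.
- 2. Compute the GC content of each sequence.
- 3. Compare the GC content of all sequences and find the maximum.
- 4. Print the header of the sequence with the highest GC content, followed by its GC content.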
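As a compact illustration of these four steps, here is a minimal sketch that works on lines already held in memory rather than on a file. It is not the implementation used in this post: the class HighestGcSketch and its methods are hypothetical names, the data structure (a LinkedHashMap keyed by ID) is a swapped-in choice, and it assumes uppercase sequences and unique IDs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch class; names and structure are assumptions for illustration.
class HighestGcSketch {
    // Step 1: group wrapped sequence lines under their header, dropping the leading '>'.
    static Map<String, String> parseFasta(List<String> lines) {
        Map<String, String> records = new LinkedHashMap<>();
        String id = null;
        for (String line : lines) {
            if (line.startsWith(">")) {
                id = line.substring(1).trim();
                records.put(id, "");
            } else if (id != null) {
                records.put(id, records.get(id) + line.trim());
            }
        }
        return records;
    }

    // Step 2: GC content of one sequence, as a percentage.
    static double gcPercent(String seq) {
        int gc = 0;
        for (char c : seq.toCharArray()) {
            if (c == 'G' || c == 'C') {
                gc++;
            }
        }
        return 100.0 * gc / seq.length();
    }

    // Step 3: ID of the record with the highest GC content (the first one wins on ties).
    static String highestGcId(Map<String, String> records) {
        String bestId = null;
        double best = -1.0;
        for (Map.Entry<String, String> e : records.entrySet()) {
            double gc = gcPercent(e.getValue());
            if (gc > best) {
                best = gc;
                bestId = e.getKey();
            }
        }
        return bestId;
    }

    public static void main(String[] args) {
        // The sample input from the problem statement.
        List<String> lines = Arrays.asList(
            ">Rosalind_6404",
            "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC",
            "TCCCACTAATAATTCTGAGG",
            ">Rosalind_5959",
            "CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT",
            "ATATCCATTTGTCAGCAGACACGC",
            ">Rosalind_0808",
            "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC",
            "TGGGAACCTGCGGGCAGTAGGTGGAAT");
        Map<String, String> records = parseFasta(lines);
        String id = highestGcId(records);
        // Step 4: print the ID, then the GC content with six decimal places.
        System.out.println(id);
        System.out.println(String.format(Locale.ROOT, "%.6f", gcPercent(records.get(id))));
    }
}
```

Storing the ID without the leading ">" means the printed label needs no manual cleanup, and using double instead of float keeps more significant digits than the answer requires.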
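Accordingly, I defined three helper methods:
- Helper method 1, BufferedReader, reads the FASTA file and removes the line breaks inside each sequence. For details, see the post "Reading a line-wrapped FASTA file line by line in Java" (解决Java逐行读取带有行缩进的fasta文件).
- Helper method 2, FindMaxCount, returns the highest GC content.
- Helper method 3, FindIndex, returns the index of the sequence with the highest GC content, which is used to print the matching header line. The full code follows.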
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class Computing_GC_Content {
    public static void main(String[] args) {
        //1. Read the FASTA file line by line
        ArrayList<String> fasta = BufferedReader("C:/Users/Administrator/Desktop/rosalind_gc.txt", "fasta");//"fasta" makes the method return the base sequences.
        ArrayList<String> tag = BufferedReader("C:/Users/Administrator/Desktop/rosalind_gc.txt", "tag");//"tag" makes the method return the header lines.
        //2. Compute the GC content of each sequence
        float[] GCratio = new float[fasta.size()];//GC ratio (between 0 and 1) of each sequence
        int[] GCcount = new int[fasta.size()];//number of G and C bases in each sequence
        for (int i = 0; i < fasta.size(); i++) {
            for (int j = 0; j < fasta.get(i).length(); j++) {
                if (fasta.get(i).charAt(j) == 'G' || fasta.get(i).charAt(j) == 'C') {
                    //Java does not accept the shorthand fasta.get(i).charAt(j)=='G'||'C'; both comparisons must be written out.
                    GCcount[i]++;
                }
            }
            GCratio[i] = (float) GCcount[i] / fasta.get(i).length();
        }
        //3. Compare the GC content of all sequences and find the maximum
        //The ratio is scaled to a percentage and rounded to six decimal places.
        float maxcount = (float) (Math.round(FindMaxCount(GCratio) * 100000000)) / 1000000;
        int maxIndex = FindIndex(GCratio);
        //4. Print the header of the sequence with the highest GC content, then its GC content
        System.out.println(tag.get(maxIndex));
        System.out.println(maxcount);
    }
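    //Helper method 1: read the FASTA file and store the sequences in the fasta list and the header lines in the tag list.
    //The method shares its name with java.io.BufferedReader. This compiles because methods and classes live in separate namespaces, but it is easy to misread.
    public static ArrayList<String> BufferedReader(String path, String choose) {//The return type is an ArrayList (a List, not a Set or a hash map).
        ArrayList<String> tag = new ArrayList<String>();
        ArrayList<String> fasta = new ArrayList<String>();
        //try-with-resources closes the reader even if an exception is thrown.
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line = reader.readLine();
            StringBuilder sb = new StringBuilder();
            while (line != null) {
                //A header line starts with ">". The original regex ">[\\w*|\\W*]*" matched the same lines, because its character class accepts any character; startsWith states the intent directly.
                if (line.startsWith(">")) {
                    tag.add(line);
                    //At the first header the StringBuilder is still empty, so this special case is skipped.
                    if (sb.length() != 0) {
                        String seq = sb.toString();//seq holds the sequence with its line breaks removed
                        fasta.add(seq);
                        sb.delete(0, sb.length());//clear the StringBuilder
                    }
                } else {
                    sb.append(line);//append the next sequence line to the StringBuilder
                }
                // read next line
                line = reader.readLine();
            }
            String seq = sb.toString();
            fasta.add(seq);//Add the last sequence after the loop ends; otherwise it would be lost.
        } catch (IOException e) {
            e.printStackTrace();
        }
        if (choose.equals("tag")) {
            return tag;
        }
        return fasta;
    }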
    //Helper method 2: return the highest GC ratio
    public static float FindMaxCount(float[] arr) {
        float max = arr[0];
        for (int x = 1; x < arr.length; x++) {
            if (max < arr[x]) {
                max = arr[x];
            }
        }
        return max;
    }
    //Helper method 3: return the index of the sequence with the highest GC ratio
    public static int FindIndex(float[] arr) {
        float max = arr[0];
        int maxValIndex = 0;
        for (int x = 1; x < arr.length; x++) {
            if (max < arr[x]) {
                max = arr[x];
                maxValIndex = x;
            }
        }
        return maxValIndex;
    }
}
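Note that running the code is not the end of the task. The printed result is ">Rosalind_0808", with the leading ">" still attached (see the example below). If you submit the answer to the Rosalind website with the ">" included, the system marks it as incorrect. Pay close attention to this detail: delete the ">" by hand before uploading.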
>Rosalind_0808
60.919540
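The manual cleanup can also be done in code. The following is a minimal sketch under stated assumptions: AnswerFormatSketch, stripMarker, and formatPercent are hypothetical names, and printing six decimal places simply mirrors the sample output (the problem statement only requires an error within 0.001).

```java
import java.util.Locale;

// Hypothetical helper class for illustration; not part of the original solution.
class AnswerFormatSketch {
    // Drop a leading '>' if present, so the label matches the form shown in the sample output.
    static String stripMarker(String header) {
        return header.startsWith(">") ? header.substring(1) : header;
    }

    // Format a GC percentage with six decimal places, independent of the system locale.
    static String formatPercent(double gcPercent) {
        return String.format(Locale.ROOT, "%.6f", gcPercent);
    }

    public static void main(String[] args) {
        // Rosalind_0808 from the sample has 53 G/C bases out of 87.
        System.out.println(stripMarker(">Rosalind_0808"));
        System.out.println(formatPercent(100.0 * 53 / 87));
    }
}
```

In the full program above, the equivalent change would be to print tag.get(maxIndex).substring(1) instead of tag.get(maxIndex).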