Peng, Zhenling, Marcin J. Mizianty, and Lukasz Kurgan. "Genome-scale prediction of proteins with long intrinsically disordered regions." Proteins: Structure, Function, and Bioinformatics 82.1 (2014): 145-158.
Abstract:
Proteins with long disordered regions (LDRs), defined as having 30 or more consecutive disordered residues, are abundant in eukaryotes, and these regions are recognized as a distinct class of biologically functional domains. LDRs facilitate various cellular functions and are important for target selection in structural genomics. Motivated by the lack of methods that directly predict proteins with LDRs, we designed Super-fast predictor of proteins with Long Intrinsically DisordERed regions (SLIDER). SLIDER utilizes logistic regression that takes an empirically chosen set of numerical features, which consider selected physicochemical properties of amino acids, sequence complexity, and amino acid composition, as its inputs. Empirical tests show that SLIDER offers competitive predictive performance combined with low computational cost. It outperforms, by at least a modest margin, a comprehensive set of modern disorder predictors (that can indirectly predict LDRs) and is 16 times faster compared to the best currently available disorder predictor. Utilizing our time-efficient predictor, we characterized abundance and functional roles of proteins with LDRs over 110 eukaryotic proteomes. Similar to related studies, we found that eukaryotes have many (on average 30.3%) proteins with LDRs with majority of proteomes having between 25 and 40%, where higher abundance is characteristic to proteomes that have larger proteins. Our first-of-its-kind large-scale functional analysis shows that these proteins are enriched in a number of cellular functions and processes including certain binding events, regulation of catalytic activities, cellular component organization, biogenesis, biological regulation, and some metabolic and developmental processes. A web server that implements SLIDER is available at http://biomine.ece.ualberta.ca/SLIDER/.
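The LDR definition in the abstract (30 or more consecutive disordered residues) can be checked mechanically when a per-residue disorder annotation is available. The sketch below is only an illustration of that definition, not part of SLIDER, which predicts the protein-level label directly from sequence. The annotation format (a string with `D` for disordered and `O` for ordered) and both function names are assumptions made for this example.

```python
def longest_disordered_run(disorder: str, disordered_symbol: str = "D") -> int:
    """Length of the longest run of consecutive disordered residues.

    `disorder` is a hypothetical per-residue annotation string,
    e.g. "OOODDDDOO" (D = disordered, O = ordered).
    """
    longest = current = 0
    for symbol in disorder:
        # Extend the current run on a disordered residue, otherwise reset it.
        current = current + 1 if symbol == disordered_symbol else 0
        longest = max(longest, current)
    return longest


def has_ldr(disorder: str, min_length: int = 30) -> bool:
    """True if the annotation contains a run of at least `min_length` residues."""
    return longest_disordered_run(disorder) >= min_length
```

Two runs of 29 residues separated by a single ordered residue do not count as an LDR under this definition, because the residues must be consecutive.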
Key words:
intrinsic disorder; long disordered regions; disorder prediction; high-throughput prediction; eukaryotes
Datasets:
MxD
Training: 247 proteins (130 pos. and 128 neg.)
Test: 247 proteins (130 pos. and 128 neg.)
Available:
- Web service: http://biomine.ece.ualberta.ca/SLIDER/
- Mizianty, Marcin J., et al. "Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources." Bioinformatics 26.18 (2010): i489-i496.
Sampling Method:
Training:
All
Validation:
Ten groups of proteins are drawn from the test set, each group containing 100 protein sequences.
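The validation sampling above can be sketched as follows. The notes do not say how the groups are drawn, so this example assumes uniform random sampling without replacement inside each group. Since 10 groups of 100 exceed the size of the test set, the groups necessarily overlap. The function name, the protein identifiers, and the seed are hypothetical.

```python
import random


def draw_validation_groups(test_ids, n_groups=10, group_size=100, seed=0):
    """Draw `n_groups` random subsets of `group_size` proteins each.

    Assumption: no protein repeats inside a group, but the same protein
    may appear in several groups.
    """
    rng = random.Random(seed)  # fixed seed so the groups are reproducible
    return [rng.sample(test_ids, group_size) for _ in range(n_groups)]


# Hypothetical identifiers standing in for the test proteins.
test_ids = [f"protein_{i}" for i in range(247)]
groups = draw_validation_groups(test_ids)
```

Evaluating the predictor on each group separately gives 10 scores, from which a mean and spread can be reported.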
Feature Extraction:
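- AA composition (local-20): percentage of each of the 20 standard amino acids in the sequence
- Low/high complexity region (local-10): regions identified with the SEG algorithm
  - 2: number of amino acids in the low- and high-complexity regions
  - 2: number of low- and high-complexity segments with 4 or more consecutive residues
  - 4 = 2*2: average length and maximum length of the low- and high-complexity regions
  - 2: with a sliding window of 30, number of low- and high-complexity segments of length 30 or more
- AA indices/physicochemical properties (local-144): physicochemical properties of amino acids taken from AAindex
  - 48: mean of each property over all input residues
  - 48: with a sliding window of 30, the minimum of the window means
  - 48: with a sliding window of 30, the maximum of the window means
- Hybrid features (local-328): the features above, recomputed within the low- and high-complexity regions
  - 40 = 20*2: amino acid composition of the low- and high-complexity regions (category 1)
  - 288 = 144*2: physicochemical properties of the amino acids in the low- and high-complexity regions (category 3)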
Total: 502
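A minimal sketch of three of the feature families listed above: amino acid composition, the mean/minimum/maximum sliding-window averages of one physicochemical property, and run statistics over a complexity mask. It is an illustration under assumptions, not the SLIDER implementation. The property scale is passed in as a plain dictionary (a stand-in for an AAindex entry), the low-complexity mask is assumed to come from a separate SEG run, and all function names are hypothetical. Sequences are assumed to be non-empty and to contain only the 20 standard residues.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq):
    """Fraction of each of the 20 standard amino acids in `seq`."""
    return {aa: seq.count(aa) / len(seq) for aa in AMINO_ACIDS}


def window_stats(seq, scale, window=30):
    """Return (overall mean, min window mean, max window mean) of one property."""
    values = [scale[aa] for aa in seq]
    overall = sum(values) / len(values)
    # Assumption: a sequence shorter than the window is treated as one window.
    width = min(window, len(values))
    running = sum(values[:width])
    means = [running / width]
    for i in range(width, len(values)):
        # Slide by one residue: add the new value, drop the oldest one.
        running += values[i] - values[i - width]
        means.append(running / width)
    return overall, min(means), max(means)


def segment_stats(mask, min_len=4):
    """Summarize runs of True in a per-residue mask (e.g. low complexity)."""
    runs, current = [], 0
    for flag in mask:
        if flag:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return {
        "n_residues": sum(runs),
        "n_segments": sum(1 for r in runs if r >= min_len),
        "mean_length": sum(runs) / len(runs) if runs else 0.0,
        "max_length": max(runs, default=0),
    }


# Toy scale for illustration only: 1.0 for K and E, 0.0 otherwise.
toy_scale = {aa: (1.0 if aa in "KE" else 0.0) for aa in AMINO_ACIDS}
overall, lowest, highest = window_stats("A" * 30 + "K" * 30, toy_scale)
```

Calling `segment_stats` once on the low-complexity mask and once on its complement would give the paired low/high values that the list counts as 2 or 2*2 features.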
Feature Selection:
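- Features whose point-biserial correlation with the class label is < 0.2 are removed
- Remaining features whose pairwise Pearson correlation coefficient is > a given threshold are grouped
- From each group, the feature with the maximum point-biserial correlation is selected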
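The three selection steps above can be sketched with a greedy pass. This is one possible reading, not the authors' exact procedure. It relies on the fact that the point-biserial correlation equals the Pearson correlation computed against 0/1 labels. The use of absolute values, the greedy grouping strategy, the `max_pearson=0.7` default (the notes only say "given threshold"), and the function names are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treated as uncorrelated
    return cov / sqrt(vx * vy)


def select_features(columns, labels, min_pb=0.2, max_pearson=0.7):
    """Greedy relevance/redundancy filter.

    columns: dict mapping feature name -> list of values (one per protein)
    labels:  list of 0/1 class labels
    """
    # Step 1: relevance, measured by |point-biserial correlation| with the label.
    pb = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    candidates = sorted((n for n in pb if pb[n] >= min_pb), key=lambda n: -pb[n])
    # Steps 2 and 3: walk from most to least relevant; a feature that is highly
    # correlated with an already selected one belongs to that feature's group
    # and is dropped, so each group is represented by its most relevant member.
    selected = []
    for name in candidates:
        if all(abs(pearson(columns[name], columns[s])) <= max_pearson
               for s in selected):
            selected.append(name)
    return selected


labels = [0, 0, 0, 1, 1, 1]
columns = {
    "f_strong": [0, 0, 0, 1, 1, 1],     # perfectly tracks the label
    "f_redundant": [0, 0, 1, 1, 1, 1],  # relevant but close to f_strong
    "f_noise": [1, 0, 0, 1, 0, 0],      # uncorrelated with the label
    "f_weak": [1, 0, 1, 0, 1, 0],       # weakly relevant, not redundant
}
chosen = select_features(columns, labels)
```

In the toy data, `f_noise` fails the relevance filter and `f_redundant` falls into the same group as `f_strong`, which leaves one representative per group.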
Classifier:
- Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel
- Support Vector Machine (SVM) with a polynomial kernel
- *Ridge Logistic Regression
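For the starred entry, a minimal sketch of ridge (L2-penalized) logistic regression trained by batch gradient descent is given below. It minimizes the mean log-loss plus `(ridge / 2) * ||w||^2`, with the bias left unpenalized. The notes do not state which implementation, ridge value, or optimizer was used, so the hyperparameters, function names, and toy data here are assumptions for illustration only.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def fit_ridge_logistic(X, y, ridge=0.01, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent with an L2 penalty.

    X: list of feature vectors, y: list of 0/1 labels.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The ridge term shrinks the weights toward zero; the bias is not penalized.
        w = [wj - lr * (gj / n + ridge * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy one-feature data set: larger values belong to the positive class.
X = [[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_ridge_logistic(X, y)
```

Prediction is a single dot product followed by a sigmoid, so scoring a protein costs time linear in the number of selected features.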
Others:
Advantages:
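- Because neither PSSM nor any other PSI-BLAST-derived feature is used, the time from feature extraction to classification is very short
- AAindex features are filtered by correlation, and the selection is validated by estimating distributions on a case-by-case basis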
Disadvantages:
Review:
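I think the starting point is well chosen: the study targets LDRs. If each residue were instead classified as disordered or ordered, the numbers of positive and negative samples could become heavily imbalanced.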