1.DeePaC: Predicting pathogenic potential of novel DNA with reverse-complement neural networks
Abstract
Motivation
We expect novel pathogens to arise due to their fast-paced evolution, and new species to be discovered thanks to advances in DNA sequencing and metagenomics. Moreover, recent developments in synthetic biology raise concerns that some strains of bacteria could be modified for malicious purposes. Traditional approaches to open-view pathogen detection depend on databases of known organisms, which limits their performance on unknown, unrecognized, and unmapped sequences. In contrast, machine learning methods can infer pathogenic phenotypes from single NGS reads, even though the biological context is unavailable.
Results
We present DeePaC, a Deep Learning Approach to Pathogenicity Classification. It includes a flexible framework allowing easy evaluation of neural architectures with reverse-complement parameter sharing. We show that convolutional neural networks and LSTMs outperform state-of-the-art methods based on sequence homology as well as those based on machine learning. Combining the deep learning approach with integration of the predictions for both mates in a read pair cuts the error rate almost in half compared with the previous state of the art.
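The two ideas in this paragraph, strand-independent predictions and combining both mates of a read pair, can be illustrated with a minimal sketch. It is not DeePaC's implementation: `toy_score` is a hypothetical stand-in for a trained network, strand invariance is obtained here by simply averaging the scores of a read and its reverse complement (a simpler substitute for sharing parameters inside the network), and averaging the two mate scores is an assumed integration rule.

```python
# Complement table for DNA bases; N (unknown base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def toy_score(seq: str) -> float:
    """Hypothetical model output: here just the fraction of A bases.

    It is deliberately strand-dependent so the effect of averaging is visible.
    """
    return seq.count("A") / len(seq)


def strand_invariant_score(seq: str) -> float:
    """Average the score of a read and of its reverse complement.

    The result is identical whichever strand the read was sequenced from.
    """
    return 0.5 * (toy_score(seq) + toy_score(reverse_complement(seq)))


def pair_score(mate1: str, mate2: str) -> float:
    """Combine both mates of a read pair by averaging (assumed rule)."""
    return 0.5 * (strand_invariant_score(mate1) + strand_invariant_score(mate2))


if __name__ == "__main__":
    read = "AACG"
    print(reverse_complement(read))
    print(strand_invariant_score(read), strand_invariant_score(reverse_complement(read)))
    print(pair_score("AACG", "AAAA"))
```

A read and its reverse complement receive the same score by construction, so the classifier cannot disagree with itself depending on which strand happened to be sequenced.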
Availability
The code and the models are available at: https://gitlab.com/rki_bioinformatics/DeePaC
Supplementary information
Supplementary data are available at Bioinformatics online.
2.Identification of Expression Signatures For Non-Small-Cell Lung Carcinoma Subtype Classification
Ran Su, Jiahang Zhang, Xiaofeng Liu, Leyi Wei
Bioinformatics, btz557, https://doi.org/10.1093/bioinformatics/btz557
Published: 11 July 2019
Abstract
Motivation
Non-Small-Cell Lung Carcinoma (NSCLC) mainly consists of two subtypes: lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD). It has been reported that the genetic and epigenetic profiles vary strikingly between LUAD and LUSC in the process of tumorigenesis and development. Treatment can be made more efficient and precise if the subtypes are identified correctly. Identification of discriminative expression signatures has been explored recently to aid the classification of NSCLC subtypes.
Results
In this study, we designed a classification model integrating both mRNA and long non-coding RNA (lncRNA) expression data to effectively classify the subtypes of NSCLC. A gene selection algorithm, named WGRFE, was proposed to identify the most discriminative gene signatures within the recursive feature elimination (RFE) framework. GeneRank scores, which consider both expression level and correlation, were combined with the feature importance generated by classifiers to improve the selection performance. Moreover, a module-based initial filtering of the genes was performed to reduce the computation cost of RFE. We validated the proposed algorithm on The Cancer Genome Atlas (TCGA) data set. The results demonstrate that the developed approach identified a small number of expression signatures for accurate subtype classification and, in particular, show for the first time the potential application of lncRNA for NSCLC subtype classification. The R implementation for the proposed approach is available at https://github.com/RanSuLab/NSCLC-subtype-classification.
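The recursive feature elimination loop that WGRFE builds on can be sketched in a few lines. This is plain RFE only, under stated assumptions: the GeneRank weighting and the module-based pre-filtering are not reproduced, the classifier is a small logistic regression written here for illustration, and the expression matrix and labels are made-up toy values.

```python
import math


def fit_logistic(X, y, epochs=200, lr=0.1):
    """Fit logistic regression by batch gradient descent; return the weights."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w


def rfe(X, y, n_keep):
    """Refit the model repeatedly, dropping the lowest-|weight| feature each round."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        sub = [[row[j] for j in remaining] for row in X]
        weights = fit_logistic(sub, y)
        worst = min(range(len(remaining)), key=lambda k: abs(weights[k]))
        remaining.pop(worst)
    return remaining


# Toy data (hypothetical): column 0 tracks the label, column 1 is
# uncorrelated with it, column 2 is constant.
X = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0], [-1.0, 1.0, 0.0], [-1.0, -1.0, 0.0]]
y = [1, 1, 0, 0]

if __name__ == "__main__":
    print(rfe(X, y, n_keep=1))
```

Because the model is refit after every removal, a feature's rank can change as correlated features are eliminated, which is what distinguishes RFE from a one-pass filter.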
Supplementary information
Supplementary data are available at Bioinformatics online.
3.MM-6mAPred: Identifying DNA N6-methyladenine sites based on Markov Model
Cong Pian, Guangle Zhang, Fei Li, Xiaodan Fan
Bioinformatics, btz556, https://doi.org/10.1093/bioinformatics/btz556
Published: 11 July 2019
Abstract
Motivation
Recent studies have shown that DNA N6-methyladenine (6mA) plays an important role in epigenetic modification of eukaryotic organisms. It has been found that 6mA is closely related to embryonic development, stress response, and other processes. Developing a new algorithm to quickly and accurately identify 6mA sites in genomes is important for exploring their biological functions.
Results
In this paper, we proposed a new classification method called MM-6mAPred, based on a Markov model, which makes use of the transition probability between adjacent nucleotides to identify 6mA sites. The sensitivity and specificity of our method are 89.32% and 90.11%, respectively. The overall accuracy of our method is 89.72%, which is 6.59 percentage points higher than that of the previous method i6mA-Pred. This indicates that, compared with the 41 nucleotide chemical properties used by i6mA-Pred, the transition probability between adjacent nucleotides can capture more discriminant sequence information.
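To make the idea of classifying by transition probabilities concrete, here is a minimal sketch. It assumes a single first-order Markov chain per class with add-one smoothing and a log-likelihood-ratio decision rule; whether MM-6mAPred uses position-specific transitions or a different decision rule is not stated in the abstract. The training sequences are invented toy data, and only the bases A, C, G, and T are handled.

```python
import math

ALPHABET = "ACGT"


def fit_transitions(seqs, pseudocount=1.0):
    """Estimate P(next base | current base) from a list of sequences."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in ALPHABET}
    return probs


def log_likelihood(seq, probs):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(probs[a][b]) for a, b in zip(seq, seq[1:]))


def predict(seq, pos_model, neg_model):
    """Call a site positive when the positive-class chain explains it better."""
    return log_likelihood(seq, pos_model) > log_likelihood(seq, neg_model)


# Hypothetical toy training sets, not real 6mA data.
positives = ["ACACACAC", "CACACACA"]
negatives = ["GGTTGGTT", "TTGGTTGG"]
pos_model = fit_transitions(positives)
neg_model = fit_transitions(negatives)

if __name__ == "__main__":
    print(predict("ACAC", pos_model, neg_model))
    print(predict("GGTT", pos_model, neg_model))
```

The pseudocount keeps unseen transitions from producing a zero probability, which would otherwise make the log-likelihood undefined.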
Availability
The web server of MM-6mAPred is freely accessible at http://www.insect-genome.com/MM-6mAPred/.
Supplementary information
Supplementary data are available at Bioinformatics online.
4.MEpurity: estimating tumor purity using DNA methylation data
Bowen Liu, Xiaofei Yang, Tingjie Wang, Jiadong Lin, Yongyong Kang, Peng Jia, Kai Ye
Bioinformatics, btz555, https://doi.org/10.1093/bioinformatics/btz555
Published: 11 July 2019
Abstract
Motivation
Tumor purity is a fundamental property of each cancer sample and affects downstream investigations. Current tumor purity estimation methods either require a matched normal sample or report moderately high tumor purity even on normal samples. It is critical to develop a novel computational approach that estimates tumor purity with sufficient precision from tumor-only samples.
Results
In this study, we developed MEpurity, a beta mixture model-based algorithm, to estimate the tumor purity based on tumor-only Illumina Infinium 450k methylation microarray data. We applied MEpurity to both The Cancer Genome Atlas (TCGA) cancer data and cancer cell line data, demonstrating that MEpurity reports low tumor purity on normal samples and, on tumor samples, results comparable to those of other state-of-the-art methods.
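As an illustration of what fitting a beta mixture to methylation beta values involves, the sketch below fits two components with an EM-style loop. It makes several assumptions not taken from the abstract: exactly two components, a method-of-moments update in place of an exact maximum-likelihood M-step, and invented beta values. How MEpurity converts its fitted mixture into a purity estimate is not described in the abstract and is not attempted here.

```python
import math


def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x, with 0 < x < 1."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm


def moment_match(values, weights):
    """Weighted method-of-moments estimate of Beta(a, b) parameters."""
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    # Keep the variance inside the range a beta distribution can have.
    var = min(max(var, 1e-6), 0.99 * mean * (1 - mean))
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


def fit_beta_mixture(values, iterations=50):
    """Fit a two-component beta mixture; return (parameters, mixing weights)."""
    eps = 1e-6
    xs = [min(max(v, eps), 1 - eps) for v in values]  # avoid log(0)
    params = [(2.0, 5.0), (5.0, 2.0)]  # one low and one high starting component
    mix = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in xs:
            logs = [math.log(mix[k]) + beta_logpdf(x, *params[k]) for k in range(2)]
            top = max(logs)
            probs = [math.exp(l - top) for l in logs]
            norm = sum(probs)
            resp.append([p / norm for p in probs])
        # M-step: moment matching with responsibilities as weights.
        for k in range(2):
            weights = [max(r[k], 1e-12) for r in resp]
            params[k] = moment_match(xs, weights)
            mix[k] = sum(weights) / len(xs)
    return params, mix


# Hypothetical beta values forming a low and a high methylation cluster.
beta_values = [0.15, 0.18, 0.20, 0.22, 0.25, 0.75, 0.78, 0.80, 0.82, 0.85]
params, mix = fit_beta_mixture(beta_values)

if __name__ == "__main__":
    for (a, b), weight in zip(params, mix):
        print(f"mean={a / (a + b):.3f} weight={weight:.3f}")
```

Beta distributions are a natural choice here because methylation beta values are bounded between 0 and 1.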
Availability
MEpurity is a C++ program available at https://github.com/xjtu-omics/MEpurity.
Supplementary information
Supplementary data are available at Bioinformatics online.