Xiaogeng Wan* and Xinying Tan Pages 1113 - 1129 ( 17 )
Background: Protein is a kind of important organics in life. It is varied with its sequences, structures and functions. Protein evolutionary classification is one of the popular research topics in computational bioinformatics. Many studies have used protein sequence information to classify the evolutionary relationships of proteins. As the amount of protein sequence data increases, efficient computational tools are needed to make efficient protein evolutionary classifications with high accuracies in the big data paradigm.
Methods: In this study, we propose a new simple and efficient computational approach based on the normalized mutual information rates to compute the relationship between protein sequences, we then use the “distances” defined on the relationships to perform the evolutionary classifications of proteins. The new method is computational efficient, model-free and unsupervised, which does not require training data when performing classifications.
Results: Simulation studies on various examples demonstrate the efficiency of the new method. We use precision-recall curves to compare the efficiency of our new method with traditional methods, results show that the new method outperforms the traditional methods in most of the cases when performing evolutionary classifications.
Conclusion: The new method is simple and proved to be efficient in protein evolutionary classifications, which is useful in future evolutionary analysis particularly in the big data paradigm.
Protein evolutionary classification, mutual information rate, protein sequence, precision-recall, computational, machine learning.
Department of Mathematics, College of Mathematics and Physics, Beijing University of Chemical Technology, Beijing, 100029, The Fourth Center of PLA General Hospital, Beijing, 100037