Ting-He Zhang and Shao-Wu Zhang* Pages 406 - 421 ( 16 )
Background: Revealing the subcellular location of a newly discovered protein can bring insight into their function and guide research at the cellular level. The experimental methods currently used to identify the protein subcellular locations are both time-consuming and expensive. Thus, it is highly desired to develop computational methods for efficiently and effectively identifying the protein subcellular locations. Especially, the rapidly increasing number of protein sequences entering the genome databases has called for the development of automated analysis methods.
Methods: In this review, we will describe the recent advances in predicting the protein subcellular locations with machine learning from the following aspects: i) Protein subcellular location benchmark dataset construction, ii) Protein feature representation and feature descriptors, iii) Common machine learning algorithms, iv) Cross-validation test methods and assessment metrics, v) Web servers.
Result & Conclusion: Concomitant with a large number of protein sequences generated by highthroughput technologies, four future directions for predicting protein subcellular locations with machine learning should be paid attention. One direction is the selection of novel and effective features (e.g., statistics, physical-chemical, evolutional) from the sequences and structures of proteins. Another is the feature fusion strategy. The third is the design of a powerful predictor and the fourth one is the protein multiple location sites prediction.
Protein subcellular location, prediction, dataset construction, feature representation, machine learning, protein sequences.
School of Automation, Northwestern Polytechnical University, Xi'an, 710072, School of Automation, Northwestern Polytechnical University, Xi'an, 710072