BDselect: a package for k-mer selection based on the binomial distribution

Author(s):

Fu-Ying Dao, Hao Lv, Zhao-Yue Zhang and Hao Lin*  

Abstract:


Background: Feature extraction often leads to dimension disaster. The extracted features may contain a large amount of redundant information, which increases the computational burden and causes overfitting.

Objective: Feature selection is an important strategy for overcoming the problems caused by dimension disaster. In most machine learning tasks, the features determine the upper limit of model performance. Therefore, further feature selection methods should be developed to optimize features.

Methods: In this paper, we introduce a new technique for optimizing sequence features based on the binomial distribution (BD). First, the principle of the BD algorithm is described in detail. Then, the proposed algorithm is compared with other commonly used feature selection methods on three different types of datasets, using a Random Forest classifier with the same parameters in every case.
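
The abstract does not give the scoring rule itself, so the following is only a minimal sketch of how a binomial-distribution-based k-mer ranking could work, under stated assumptions. It assumes that, for each k-mer, the probability of seeing at least the observed number of occurrences in a class is computed from a binomial distribution whose success probability is that class's share of all k-mer occurrences, and that the confidence level is one minus this probability. The function names (`kmer_counts`, `bd_confidence`) and the input layout are hypothetical and are not the API of the BDselect package.

```python
from collections import Counter
from math import comb


def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def bd_confidence(class_seqs, k):
    """Score each k-mer by its highest binomial confidence level over classes.

    class_seqs: dict mapping a class label to a list of sequences.
    Assumption (not taken from the abstract): for k-mer i and class j,
      P_ij = sum_{m = n_ij}^{N_i} C(N_i, m) * q_j^m * (1 - q_j)^(N_i - m)
    where n_ij is the count of k-mer i in class j, N_i its count over all
    classes, and q_j the share of all k-mer occurrences found in class j.
    The confidence level is CL_ij = 1 - P_ij.
    """
    per_class = {label: kmer_counts(seqs, k) for label, seqs in class_seqs.items()}
    totals = {label: sum(c.values()) for label, c in per_class.items()}
    grand_total = sum(totals.values())
    all_kmers = set().union(*per_class.values())

    scores = {}
    for kmer in all_kmers:
        n_total = sum(c[kmer] for c in per_class.values())
        best = 0.0
        for label, c in per_class.items():
            q = totals[label] / grand_total
            n = c[kmer]
            # Upper tail: chance of n or more occurrences in this class.
            p = sum(
                comb(n_total, m) * q ** m * (1 - q) ** (n_total - m)
                for m in range(n, n_total + 1)
            )
            best = max(best, 1.0 - p)
        scores[kmer] = best
    return scores
```

Sorting the returned dictionary by score in descending order and keeping the top-ranked k-mers would give a reduced feature set to pass to a classifier such as Random Forest; the cutoff is left to the user in this sketch.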

Results and Conclusion: The results show that BD yields a promising improvement in feature selection and classification accuracy. The source code and an executable program package are available at http://lin-group.cn/server/BDselect/, so that users can easily apply the algorithm in their own research.

Keywords:

dimension disasters, feature selection, binomial distribution, machine learning, classification, datasets

Affiliation:

Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054


