Prediction for Compound Activity in Large Drug Datasets Using Efficient Machine Learning Approaches
Abstract
Modern drug design requires predicting the activity of large numbers of chemical compounds from their descriptors, which are often noisy and high-dimensional. Applying machine learning algorithms successfully to such data challenges both computational performance and classification quality. For computational efficiency, we implement the proximal support vector machine (PSVM), which depends only on linear operations and can therefore be trained faster than standard support vector machines (SVMs), which require quadratic optimization. For even larger datasets, we use parallel computing to keep training and classification times acceptable. To improve classification quality, we implement and compare the SVM, k-nearest neighbor, decision tree, and naive Bayes classifiers. We measure classification quality using cross-validation accuracy, generalization accuracy, and the false positive and false negative ratios in ROC (receiver operating characteristic) curves. We also conduct feature selection to find the most important features and gain insight into the nature of the compound descriptors. Features are easy to select using linear SVMs, but the selection may be biased, so we use a nonlinear kernel SVM in the feature selection process to achieve a higher ranking quality. To fully understand the properties of the noisy features in the dataset, we experiment with different numbers of features using the SVM classifier to obtain an optimal number of features.
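The claim that the PSVM depends only on linear operations can be illustrated with a minimal sketch. The abstract does not give the exact formulation used, so the code below assumes the standard linear PSVM of Fung and Mangasarian, in which training reduces to solving one (n+1)-by-(n+1) linear system. The function names `train_linear_psvm` and `predict_psvm`, the parameter `nu`, and the use of NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def train_linear_psvm(A, d, nu=1.0):
    """Train a linear proximal SVM (assumed standard formulation).

    A  : (m, n) array of compound descriptors, one row per compound
    d  : (m,) array of class labels in {-1, +1}
    nu : positive trade-off between training error and regularization

    Returns (w, gamma) defining the separating plane x.w - gamma = 0.
    """
    A = np.asarray(A, dtype=float)
    d = np.asarray(d, dtype=float)
    m, n = A.shape

    # E = [A, -e], where e is a column of ones.
    E = np.hstack([A, -np.ones((m, 1))])

    # Solve (I/nu + E'E) z = E'd for z = [w; gamma].
    # Only matrix products and one linear solve are needed,
    # with no quadratic programming step.
    lhs = np.eye(n + 1) / nu + E.T @ E
    rhs = E.T @ d
    z = np.linalg.solve(lhs, rhs)
    return z[:n], z[n]


def predict_psvm(A, w, gamma):
    """Classify each row of A as +1 or -1 by the sign of x.w - gamma."""
    scores = np.asarray(A, dtype=float) @ w - gamma
    return np.where(scores >= 0, 1, -1)
```

Because the system size depends on the number of descriptors rather than the number of compounds, the training cost in this sketch grows only linearly with the number of compounds once the descriptor count is fixed.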