A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining
2005 (English). In: 10th Conference on Artificial Intelligence in Medicine, AIME2005 - Aberdeen, UK, 2005, 434-443 p. Conference paper (Other academic)
In medicine, data mining methods such as Decision Tree Induction (DTI) can be trained for extracting rules to predict the outcomes of new patients. However, incompleteness and high dimensionality of stored data are a problem. Canonical Correlation Analysis (CCA) can be used prior to DTI as a dimension reduction technique to preserve the character of the original data by omitting non-essential data. In this study, data from 3949 breast cancer patients were analysed. Raw data were cleaned by running a set of logical rules. Missing values were replaced using the Expectation Maximization algorithm. After dimension reduction with CCA, DTI was employed to analyse the resulting dataset. The validity of the predictive model was confirmed by ten-fold cross validation and the effect of pre-processing was analysed by applying DTI to data without pre-processing. Replacing missing values and using CCA for data reduction dramatically reduced the size of the resulting tree and increased the accuracy of the prediction of breast cancer recurrence.
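The ten-fold cross validation mentioned in the abstract can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not the authors' implementation: the function names `k_fold_indices` and `cross_validate` are hypothetical, and the majority-class classifier is only a stand-in for DTI so that the example stays self-contained.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled, disjoint, near-equal folds."""
    if n_samples < k:
        raise ValueError("need at least one sample per fold")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index gives fold sizes that differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(records, labels, fit, predict, k=10, seed=0):
    """Return the mean held-out accuracy over k folds."""
    folds = k_fold_indices(len(records), k, seed)
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(records)) if i not in held]
        # Train on the other k-1 folds, then score on the held-out fold.
        model = fit([records[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, records[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Hypothetical stand-in for a decision tree: always predict the majority class.
def fit_majority(records, labels):
    return max(set(labels), key=labels.count)


def predict_majority(model, record):
    return model
```

In a comparison like the one the abstract describes, the same `cross_validate` call would be run twice, once on the pre-processed records and once on the raw records, with a real tree learner passed in as `fit` and `predict`.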
Place, publisher, year, edition, pages
2005. 434-443 p.
Lecture Notes in Computer Science, ISSN 0302-9743 (Print), 1611-3349 (Online); 3581
Engineering and Technology
Identifiers
URN: urn:nbn:se:liu:diva-12707
DOI: 10.1007/11527770_59
ISBN: 978-3-540-27831-3
OAI: oai:DiVA.org:liu-12707
DiVA: diva2:16891