Misplacing the Code: An Examination of Data Quality Issues in Bayesian Text Classification for Automated Coding of Medical Diagnoses

Author(s): Eitel J.M. Lauria (Marist College, USA) and Alan D. March (Universidad del Salvador, Argentina)
Copyright: 2007
Pages: 3
Source title: Managing Worldwide Operations and Communications with Information Technology
Source Editor(s): Mehdi Khosrow-Pour, D.B.A. (Information Resources Management Association, USA)
DOI: 10.4018/978-1-59904-929-8.ch296
ISBN13: 9781599049298
EISBN13: 9781466665378

Keywords: Information Science Reference / IT Research & Theory / IT Research and Theory / Library & Information Science

Abstract

In this article we discuss the effect of dirty data on text mining for automated coding of medical diagnoses. Using two Bayesian machine learning algorithms, naive Bayes and shrinkage, we build ICD-9-CM classification models trained on free-text diagnoses. We investigate the effect of training the classifiers on both clean and (simulated) dirty data. The research focuses on the impact that erroneous labeling of training data has on the classifiers' predictive accuracy.
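
The kind of experiment described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a multinomial naive Bayes model with add-one smoothing and a uniform random label-swap as the simulated dirt, and it leaves out the shrinkage classifier entirely. The function names (`train_naive_bayes`, `predict`, `corrupt_labels`, `accuracy`) and the toy diagnosis texts are hypothetical and invented for illustration only.

```python
import math
import random
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect the per-class document and word counts used by naive Bayes."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def corrupt_labels(labels, noise_rate, rng):
    """Simulate dirty data: reassign a fraction of labels to a wrong class."""
    classes = sorted(set(labels))
    noisy = list(labels)
    n_flip = round(noise_rate * len(labels))
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted code matches the given label."""
    hits = sum(predict(model, d) == y for d, y in zip(docs, labels))
    return hits / len(labels)


# Toy free-text diagnoses with illustrative code labels (invented data).
train_docs = [
    "type 2 diabetes mellitus", "diabetes mellitus uncontrolled",
    "essential hypertension", "high blood pressure hypertension",
    "pneumonia right lower lobe", "community acquired pneumonia",
]
train_labels = ["250.00", "250.00", "401.9", "401.9", "486", "486"]
test_docs = ["diabetes", "hypertension", "pneumonia"]
test_labels = ["250.00", "401.9", "486"]

rng = random.Random(0)  # fixed seed so the corruption is repeatable
clean_model = train_naive_bayes(train_docs, train_labels)
dirty_labels = corrupt_labels(train_labels, noise_rate=0.5, rng=rng)
dirty_model = train_naive_bayes(train_docs, dirty_labels)

clean_acc = accuracy(clean_model, test_docs, test_labels)
dirty_acc = accuracy(dirty_model, test_docs, test_labels)
```

Comparing `clean_acc` with `dirty_acc` while sweeping `noise_rate` gives a rough picture of how mislabeled training records affect predictive accuracy. A data set this small only demonstrates the mechanics, not a meaningful effect size.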
