Text Categorization

Author(s): Megan Chenoweth (Innovative Interfaces Inc., USA) and Min Song (New Jersey Institute of Technology, USA)
Copyright: 2009
Pages: 6
Source title: Encyclopedia of Data Warehousing and Mining, Second Edition
Source Author(s)/Editor(s): John Wang (Montclair State University, USA)
DOI: 10.4018/978-1-60566-010-3.ch296

Abstract

Text categorization (TC) is a data mining technique for automatically assigning documents to one or more predefined categories. This article introduces the principles of TC, discusses common TC methods and steps, gives an overview of the various types of TC systems, and discusses future trends. TC systems begin with a group of known categories and a set of training documents, each already assigned to a category, usually by a human expert. Depending on the system, the documents may undergo a process called dimensionality reduction, which reduces the number of words or features that the classifier evaluates during the learning process. The system then analyzes the documents and “learns” which words or features of each document caused it to be classified into a particular category. This is known as supervised learning, because it relies on human knowledge of the categories and their criteria. The learning process produces a classifier, which applies the rules it learned during the training phase to new, unlabeled documents.
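
The workflow described in the abstract (labeled training documents, optional dimensionality reduction, supervised learning, then classification of new documents) can be illustrated with a minimal sketch. The abstract does not name a specific learning method, so everything concrete here is an assumption made for illustration: the classifier is a multinomial Naive Bayes with add-one smoothing, dimensionality reduction is done by keeping the words with the highest document frequency, and the training documents, category names, and function names (tokenize, select_features, train, classify) are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training set: (document text, category assigned by a human expert).
TRAINING_DOCS = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the coach praised the team after the match", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("the stock market fell as interest rates rose", "finance"),
    ("investors sold bank shares on the stock market", "finance"),
]


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]


def select_features(docs, max_features):
    """Dimensionality reduction (assumed method): keep the max_features
    words that occur in the most training documents."""
    doc_freq = Counter()
    for text, _ in docs:
        doc_freq.update(set(tokenize(text)))
    ranked = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
    return set(ranked[:max_features])


def train(docs, vocab):
    """Supervised learning step: count documents per category and
    in-vocabulary word occurrences per category."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    for text, label in docs:
        class_docs[label] += 1
        word_counts[label].update(w for w in tokenize(text) if w in vocab)
    return class_docs, word_counts


def classify(text, model, vocab):
    """Return the category with the highest Naive Bayes log-probability."""
    class_docs, word_counts = model
    total_docs = sum(class_docs.values())
    words = [w for w in tokenize(text) if w in vocab]
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing so unseen words do not zero out a category.
            smoothed = (word_counts[label][w] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


vocab = select_features(TRAINING_DOCS, max_features=8)
model = train(TRAINING_DOCS, vocab)
print(classify("the team lost the match", model, vocab))
print(classify("interest rates hurt the stock market", model, vocab))
```

Document-frequency selection is used here only because it is simple to show; it keeps very common words such as "the", which a real system would likely remove with a stop-word list or a more discriminating selection criterion.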
