The IRMA Community
Newsletters
Research IRM
Click a keyword to search titles using our InfoSci-OnDemand powered search:
|
Text Categorization
Abstract
Text categorization (TC) is a data mining technique for automatically classifying documents to one or more predefined categories. This paper will introduce the principles of TC, discuss common TC methods and steps, give an overview of the various types of TC systems, and discuss future trends. TC systems begin with a group of known categories and a set of training documents already assigned to a category, usually by a human expert. Depending on the system, the documents may undergo a process called dimensionality reduction, which reduces the number of words or features that the classifier evaluates during the learning process. The system then analyzes the documents and “learns” which words or features of each document caused it to be classified into a particular category. This is known as supervised learning, because it is based on human knowledge of the categories and their criteria. The learning process results in a classifier which can apply the rules it learned during the training phase to additional documents.
Related Content
Girija Ramdas, Irfan Naufal Umar, Nurullizam Jamiat, Nurul Azni Mhd Alkasirah.
© 2024.
18 pages.
|
Natalia Riapina.
© 2024.
29 pages.
|
Xinyu Chen, Wan Ahmad Jaafar Wan Yahaya.
© 2024.
21 pages.
|
Fatema Ahmed Wali, Zahra Tammam.
© 2024.
24 pages.
|
Su Jiayuan, Zhang Jingru.
© 2024.
26 pages.
|
Pua Shiau Chen.
© 2024.
21 pages.
|
Minh Tung Tran, Thu Trinh Thi, Lan Duong Hoai.
© 2024.
23 pages.
|
|
|