
Text Mining Methods for Hierarchical Document Indexing

Author(s): Han-Joon Kim (The University of Seoul, Korea)
Copyright: 2009
Pages: 9
Source title: Encyclopedia of Data Warehousing and Mining, Second Edition
Source Author(s)/Editor(s): John Wang (Montclair State University, USA)
DOI: 10.4018/978-1-60566-010-3.ch299

Abstract

We have recently seen tremendous growth in the volume of online text documents from networked resources such as the Internet, digital libraries, and company-wide intranets. One of the most common and successful methods of organizing such huge numbers of documents is to categorize them hierarchically according to topic (Agrawal, Bayardo & Srikant, 2000; Kim & Lee, 2003). Documents indexed according to a hierarchical structure (termed ‘topic hierarchy’ or ‘taxonomy’) are kept in internal categories as well as in leaf categories, in the sense that documents in lower categories have increasing specificity. Through the use of a topic hierarchy, users can quickly navigate to any portion of a document collection without being overwhelmed by a large document space. As is evident from the popularity of web directories such as Yahoo (http://www.yahoo.com/) and Open Directory Project (http://www.dmoz.org/), topic hierarchies have increased in importance as a tool for organizing or browsing a large volume of electronic text documents.

Currently, the topic hierarchies used by most information systems are constructed and maintained manually by human editors. A topic hierarchy must be continuously subdivided to cope with the high rate of increase in the number of electronic documents. For example, the topic hierarchy of the Open Directory Project has now reached about 590,000 categories. However, manually maintaining the hierarchical structure incurs several problems. First, such a manual task is prohibitively costly as well as time-consuming. Until now, large search portals such as Yahoo have invested significant time and money in maintaining their taxonomies, but they will not be able to keep up with the pace of growth and change in electronic documents through such manual activity. Moreover, for a dynamic networked resource (e.g., the World Wide Web) that contains highly heterogeneous documents accompanied by frequent content changes, maintaining a ‘good’ hierarchy is fraught with difficulty, and oftentimes is beyond human experts’ capabilities. Lastly, since human editors’ categorization decisions are not only highly subjective but also variable over time, it is difficult to maintain a reliable and consistent hierarchical structure.

The above limitations call for information systems that can provide intelligent organization capabilities with topic hierarchies. Related commercial systems include Verity Knowledge Organizer (http://www.verity.com/), Inktomi Directory Engine (http://www.inktomi.com/), and Inxight Categorizer (http://www.inxight.com/), which enable a browsable web directory to be built automatically. However, these systems did not address capabilities for (semi-)automatically evolving organizational schemes and classification models at all. This is one of the reasons why commercial taxonomy-based services tend not to be as popular as their manually constructed counterparts, such as Yahoo.
