IRMA-International.org: Creator of Knowledge
Information Resources Management Association
Advancing the Concepts & Practices of Information Resources Management in Modern Organizations

A Data Mining Driven Approach for Web Classification and Filtering Based on Multimodal Content Analysis

Author(s): Mohamed Hammami (Faculté des Sciences de Sfax, Tunisia), Youssef Chahir (Université de Caen, France), and Liming Chen (Ecole Centrale de Lyon, France)
Copyright: 2008
Pages: 29
Source title: Data Warehousing and Mining: Concepts, Methodologies, Tools, and Applications
Source Author(s)/Editor(s): John Wang (Montclair State University, USA)
DOI: 10.4018/978-1-59904-951-9.ch117

Abstract

The growth of the Web has been accompanied by a proliferation of objectionable content, such as sex, violence, and racism, and efficient tools are needed to classify and filter it. In this chapter, we investigate this problem through WebGuard, our automatic machine-learning-based system for classifying and filtering pornographic Web sites. Because the Internet is becoming increasingly visual and multimedia, as pornographic Web sites exemplify, we focus on combining skin-color-related visual content analysis with textual and structural content analysis to improve pornographic Web site filtering. Most commercial filtering products rely mainly on textual content analysis, such as detecting indicative keywords or checking manually collected blacklists. The originality of our work lies in adding structural and visual content analysis to the classical textual analysis, together with several major data mining techniques for learning and classification. On a test bed of 400 Web sites, comprising 200 adult sites and 200 nonpornographic ones, WebGuard achieved a classification accuracy of 96.1% using only textual and structural content analysis, and 97.4% when skin-color-related visual content analysis was added. Further experiments on a blacklist of 12,311 adult Web sites, manually collected and classified by the French Ministry of Education, showed that WebGuard achieved a classification accuracy of 87.82% using only textual and structural content analysis, and 95.62% when visual content analysis was added. The basic framework of WebGuard can be applied to other Web site categorization problems that combine textual and visual content, as most do today.
