Information Resources Management Association

A Method to Quantify Corpus Similarity and its Application to Quantifying the Degree of Literality in a Document

Author(s): Etienne Denoual (ATR—Spoken Language Communication Research Labs, Japan and CLIPS—GETA—IMAG, Joseph Fourier University, France)
Copyright: 2008
Pages: 13
Source title: Information Communication Technologies: Concepts, Methodologies, Tools, and Applications
Source Author(s)/Editor(s): Craig Van Slyke (Northern Arizona University, USA)
DOI: 10.4018/978-1-59904-949-6.ch048



Comparing and quantifying corpora are key issues in corpus-based translation and corpus linguistics, yet there is still a notable lack of standards for doing so. This makes it difficult for a user to isolate, transpose, or extend the interesting features of a corpus to other NLP systems. In this work, we address the issue of measuring similarity between corpora. We propose a scale defined by two user-chosen reference corpora, on which any third corpus can be assigned a coefficient of similarity based on the cross-entropy of statistical character N-gram models. One possible application of this framework is to quantify similarity in terms of literality (or, conversely, orality). To this end, we carry out experiments on several well-known corpora in both English and Japanese and show that the defined similarity coefficient is robust to variations in language and model order. A comparison with existing similarity measures shows similar performance, while the character-based approach considerably extends the range of application to electronic data written in languages with no clear word segmentation. Within this framework, we further investigate the notion of homogeneity in the case of a large multilingual resource.
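
The idea of placing a third corpus on a scale between two reference corpora can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the author's implementation: the add-one smoothing, the `alphabet_size` parameter, the class and function names, and the normalization H_A / (H_A + H_B) are all chosen here for brevity, and the chapter's own coefficient may be defined differently.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character N-gram model with add-one smoothing (an assumed, simple choice)."""

    def __init__(self, text, order=3, alphabet_size=256):
        # alphabet_size is an assumed upper bound on the number of distinct
        # characters; raise it for languages with large scripts (e.g. Japanese).
        self.order = order
        self.alphabet_size = alphabet_size
        self.ngrams = Counter()    # counts of (context, next character)
        self.contexts = Counter()  # counts of each context
        padded = "\x00" * (order - 1) + text
        for i in range(order - 1, len(padded)):
            context = padded[i - order + 1:i]
            self.ngrams[(context, padded[i])] += 1
            self.contexts[context] += 1

    def prob(self, context, char):
        # Add-one smoothed estimate of P(char | context).
        return (self.ngrams[(context, char)] + 1) / (
            self.contexts[context] + self.alphabet_size
        )

    def cross_entropy(self, text):
        """Cross-entropy of `text` under this model, in bits per character."""
        if not text:
            raise ValueError("text must be non-empty")
        padded = "\x00" * (self.order - 1) + text
        total = 0.0
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            total -= math.log2(self.prob(context, padded[i]))
        return total / len(text)


def similarity_coefficient(ref_a, ref_b, target, order=3):
    """Place `target` on a 0..1 scale between two reference corpora.

    Values near 0 mean the target is better predicted by the model of ref_a,
    values near 1 mean it is better predicted by the model of ref_b.
    The normalization used here is an illustrative assumption.
    """
    h_a = CharNgramModel(ref_a, order).cross_entropy(target)
    h_b = CharNgramModel(ref_b, order).cross_entropy(target)
    return h_a / (h_a + h_b)
```

In use, `ref_a` and `ref_b` would be the two user-chosen corpora (for instance a written and a spoken one) and `target` the corpus to be placed between them. Because the model works on characters rather than words, no word segmentation step is needed, which is what makes this kind of measure applicable to languages such as Japanese.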
