Experimenting Language Identification for Sentiment Analysis of English Punjabi Code Mixed Social Media Text

Author(s): Neetika Bansal (College of Engineering & Management, India), Vishal Goyal (Punjabi University, India) and Simpel Rani (Yadavindra College of Engineering, India)
Copyright: 2022
Pages: 10
Source title: Research Anthology on Implementing Sentiment Analysis Across Multiple Disciplines
Source Author(s)/Editor(s): Information Resources Management Association (USA)
DOI: 10.4018/978-1-6684-6303-1.ch076

Keywords: Data Analysis and Statistics / Data Mining / Engineering Science Reference / Library & Information Science

Abstract

People do not always use Unicode; rather, they mix multiple languages. Processing code-mixed data is challenging because of its linguistic complexity, and noisy text makes language identification harder still. The dataset used in this article contains Facebook and Twitter messages collected through the Facebook Graph API and the Twitter API. A model was trained on the annotated English-Punjabi code-mixed dataset using a pipeline with a Dictionary Vectorizer and an N-gram approach with some features. Logistic Regression, Decision Tree Classifier, and Gaussian Naïve Bayes classifiers are used to perform language identification at the word level. The results show that Logistic Regression performs best, with an accuracy of 86.63% and an F-1 measure of 0.88. The success of machine learning approaches depends on the quality of labeled corpora.
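As a rough illustration of the word-level setup the abstract describes, the sketch below turns each token into a dictionary of character N-gram counts plus a few surface features. The feature names, the boundary markers, and the sample tokens are assumptions made for illustration only; the abstract does not specify the authors' actual feature set.

```python
from typing import Any, Dict, List


def char_ngrams(word: str, n: int) -> List[str]:
    """Return the character n-grams of a lowercased word.

    The word is padded with '^' and '$' (an assumed convention) so that
    prefixes and suffixes become distinct n-grams.
    """
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_features(tokens: List[str], i: int, max_n: int = 3) -> Dict[str, Any]:
    """Build a feature dictionary for the token at position i.

    All feature names here are hypothetical, chosen only to show the shape
    of input that a dictionary vectorizer expects.
    """
    word = tokens[i]
    feats: Dict[str, Any] = {
        "lower": word.lower(),
        "length": len(word),
        "has_digit": any(c.isdigit() for c in word),
        "is_ascii": word.isascii(),
        # Neighbouring words give the classifier some context.
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }
    # Count character n-grams of length 2 up to max_n.
    for n in range(2, max_n + 1):
        for gram in char_ngrams(word, n):
            key = f"ng{n}={gram}"
            feats[key] = feats.get(key, 0) + 1
    return feats


# Made-up code-mixed message, one feature dictionary per token.
tokens = ["main", "bahut", "happy", "haan"]
features = [word_features(tokens, i) for i in range(len(tokens))]
```

A list of dictionaries like `features` is the kind of input that a dictionary vectorizer such as scikit-learn's `DictVectorizer` accepts. The resulting matrix, together with one language label per token, could then be passed to a classifier such as `LogisticRegression`.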
