Text Classification Using “Anti”-Bayesian Quantile Statistics-Based Classifiers

Oommen, John; Khoury, Richard; Schmidt, Aron

dc.contributor.author	Oommen, John
dc.contributor.author	Khoury, Richard
dc.contributor.author	Schmidt, Aron
dc.date.accessioned	2017-03-30T13:31:48Z
dc.date.available	2017-03-30T13:31:48Z
dc.date.created	2016-12-15T15:13:06Z
dc.date.issued	2016
dc.identifier.citation	Transactions on Computational Collective Intelligence. 2016.
dc.identifier.issn	2190-9288
dc.identifier.uri	http://hdl.handle.net/11250/2436389
dc.description.abstract	The problem of Text Classification (TC) has been studied for decades. It is particularly interesting because the features are derived from syntactic or semantic indicators, while the classification itself is based on statistical Pattern Recognition (PR) strategies. Thus, all the recorded TC schemes work using the fundamental paradigm that, once the statistical features are inferred from the syntactic/semantic indicators, the classifiers themselves are the well-established ones, such as the Bayesian, the Naïve Bayesian, the SVM, etc., and those that are neural or fuzzy. In this paper, we demonstrate that, by virtue of the skewed distributions of the features, one can advantageously work with information latent in certain “non-central” quantiles (i.e., those distant from the mean) of the distributions. We demonstrate that such classifiers exist and are attainable, and show that the design and implementation of such schemes work with the recently introduced paradigm of Quantile Statistics (QS)-based classifiers. These classifiers, referred to as Classification by Moments of Quantile Statistics (CMQS), are essentially “Anti”-Bayesian in their modus operandi. (The foundational properties of CMQS, for generic and some straightforward distributions, were initially described in [17]. Their properties for uni-dimensional distributions of the exponential family are included in [9], and for multi-dimensional distributions in [18]. The authors of [17], [9] and [18] had initially proposed their results as being based on the Order-Statistics of the distributions. This was later corrected in [19], where they showed that their results were instead based on the Quantile Statistics.) To achieve our goal, we demonstrate the power and potential of CMQS to describe the very high-dimensional TC-related vector spaces in terms of a limited number of “outlier-based” statistics. Thereafter, the PR task in classification invokes the CMQS classifier for the underlying multi-class problem by using a linear number of pair-wise CMQS-based classifiers. Through rigorous testing on the standard 20-Newsgroups corpus, we show that CMQS-based TC attains accuracy that is comparable to the best-reported classifiers. We also suggest the potential of fusing the results of a CMQS-based methodology with those obtained from a more traditional scheme.
dc.language.iso	eng
dc.title	Text Classification Using “Anti”-Bayesian Quantile Statistics-Based Classifiers
dc.type	Peer reviewed
dc.type	Journal article
dc.source.pagenumber	26
dc.source.journal	Transactions on Computational Collective Intelligence
dc.identifier.doi	10.1007/978-3-662-53580-6_7
dc.identifier.cristin	1413594
cristin.unitcode	201,15,4,0
cristin.unitname	Institutt for informasjons- og kommunikasjonsteknologi
cristin.ispublished	true
cristin.fulltext	preprint
cristin.qualitycode	1

Associated file(s)

Filename: OommenPaper07.pdf
Size: 310.9 KB
Format: PDF
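
Illustrative note on the abstract: the idea of classifying with "non-central" quantiles instead of means can be sketched for a one-dimensional, two-class case. The sketch below is only an illustration under stated assumptions and is not the authors' implementation. The choice of tertiles (the 1/3 and 2/3 quantiles), the nearest-point decision rule, and the names `tertiles`, `fit` and `classify` are all hypothetical and are not taken from this record.

```python
from statistics import quantiles


def tertiles(values):
    """Return the (1/3, 2/3) quantiles of a 1-D sample."""
    lower, upper = quantiles(values, n=3, method="inclusive")
    return lower, upper


def fit(left_samples, right_samples):
    """Summarise each class by one quantile that lies away from its centre.

    Assumption (illustrative only): the class lying to the left is
    represented by its upper tertile and the class lying to the right by
    its lower tertile, so each class is described by a point close to the
    other class rather than by its mean.
    """
    _, left_point = tertiles(left_samples)
    right_point, _ = tertiles(right_samples)
    return left_point, right_point


def classify(x, left_point, right_point):
    """Assign x to the class whose representative quantile is nearest."""
    if abs(x - left_point) <= abs(x - right_point):
        return "left"
    return "right"


# Two overlapping toy classes (made-up values, for illustration only).
left = [0, 1, 2, 3, 4, 5, 6]
right = [4, 5, 6, 7, 8, 9, 10]
left_point, right_point = fit(left, right)
print(classify(4.5, left_point, right_point))
print(classify(5.5, left_point, right_point))
```

For a multi-class problem, the abstract states that a linear number of such pair-wise classifiers is used; how the pairs are chosen and combined is not described in this record, so the sketch stops at the two-class case.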
