Determining geographic origin of social media users with Bayesian Analysis of common syntactical and spelling errors when using foreign languages

Lei, Xiaoqi

Master thesis

URI

http://hdl.handle.net/11250/137531

Date

2011

Collections

Master's theses in Information and Communication Technology [505]

Abstract

With the growing influence and importance of social media, the need to categorize the authors of publicly available social media text by their geographic origin is becoming more urgent than ever. To this end, several methods have been developed, for instance classifying authors by their language, by their time zone, or by the geographic terms used in the text.

This thesis explores a different kind of classifier for determining the geographic background of social media users: the Native Language Classifier, which identifies an author's native language from text the author has written in English.

The Native Language Classifier was trained on an English corpus of 6 million words written by 800 authors from 4 different language backgrounds: Chinese, Russian, Spanish and French. In a test on 200 users (50 from each language group), the classifier achieved an overall accuracy of 75% in assigning authors to the correct language background by combining the results of word-level n-gram algorithms, character-level n-gram algorithms, and a spell-checking algorithm. This should be valuable for both social media analysts and text classification researchers.
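As a rough illustration of the character-level n-gram component only, the following is a minimal sketch of a naive Bayes classifier over character trigrams with add-one smoothing. It is not the thesis implementation: the class name `NgramNaiveBayes`, the helper `char_ngrams`, the labels `lang_a` and `lang_b`, and the toy training sentences are all hypothetical, and the abstract does not say how the word-level, character-level, and spell-checking results are combined, so that step is left out.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams (illustrative sketch only)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter()               # label -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def train(self, text, label):
        grams = char_ngrams(text, self.n)
        self.counts[label].update(grams)
        self.docs[label] += 1
        self.vocab.update(grams)

    def log_scores(self, text):
        """Return log prior plus smoothed log likelihood for every label."""
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, grams in self.counts.items():
            total = sum(grams.values())
            score = math.log(self.docs[label] / total_docs)
            for g in char_ngrams(text, self.n):
                # Add-one (Laplace) smoothing keeps unseen n-grams from zeroing the score.
                score += math.log((grams[g] + 1) / (total + vocab_size))
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy usage with invented placeholder sentences (not data from the thesis).
clf = NgramNaiveBayes(n=3)
clf.train("i very like this movie and i very like music", "lang_a")
clf.train("i am agree with the people of my country", "lang_b")
predicted = clf.classify("she very like the new song")
```

With a corpus of realistic size, each label would be trained on many texts per author group, and the per-label scores from this component could then be merged with those of the other components by whatever combination rule is chosen.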

Beyond the classification results, the tests also yielded some interesting observations that reveal regularities underlying the languages involved. The method developed in this thesis could therefore also become a useful tool for researchers analyzing the features of languages.

Description

Master's thesis in Information and Communication Technology IKT590 2011 – Universitetet i Agder, Grimstad

Publisher

Universitetet i Agder / University of Agder