Automatic Categorization of Web Sites
Master thesis
Permanent link: http://hdl.handle.net/11250/137039
Publication date: 2008
Abstract
In this thesis we present a solution for classifying web sites by
geographical attribute codes (NUTS) and economic activity attribute codes (NACE).
We propose a solution for web site classification with high accuracy. We use keyword-based
document classification methods, which have shown good performance. After
classification, each document is assigned a class label from a set of predefined
categories, based on a pool of pre-classified sample documents.
Our solution consists of three steps. First, it removes stop words and skips HTML
tags, which identifies the informative terms and discards non-informative or
redundant terms to improve classification accuracy. Second, it uses mutual
information for feature selection to reduce the dimensionality of the feature space
and produce vectors for classification. Finally, it uses the Naïve Bayes and
Decision Tree algorithms to perform the classification and provides a comparison
of their performance.
The system performed well in the experiment. It classifies web sites into NACE
categories with a maximum accuracy of 97% on 46 web pages, while NUTS
classification reaches a best accuracy of 93% on 223 web pages.
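
The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation; it only shows one plausible way the named steps fit together, using the Python standard library. The stop-word list, the tokenization rule, the sample pages, the class labels "farming" and "software", and all function names are assumptions made for illustration. Only the Naïve Bayes classifier is sketched (a multinomial variant with add-one smoothing, which is itself an assumption); the Decision Tree is omitted.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Hypothetical, tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "we", "our"}


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping tags and script/style content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def tokenize(html):
    """Strip HTML, lowercase, keep alphabetic words, drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
    return [w for w in words if w not in STOP_WORDS]


def mutual_information(docs, labels):
    """Score each term by the mutual information between its presence and the class."""
    n = len(docs)
    class_count = Counter(labels)
    term_class = defaultdict(Counter)  # term -> class -> docs containing the term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_class[term][label] += 1
    scores = {}
    for term, per_class in term_class.items():
        n_term = sum(per_class.values())
        score = 0.0
        for c, n_c in class_count.items():
            n11 = per_class[c]   # docs of class c that contain the term
            n01 = n_c - n11      # docs of class c that do not
            for joint, marginal in ((n11, n_term), (n01, n - n_term)):
                if joint:
                    score += joint / n * math.log2(joint * n / (marginal * n_c))
        scores[term] = score
    return scores


def train_naive_bayes(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one smoothing over the selected vocabulary."""
    class_count = Counter(labels)
    term_count = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_count[label].update(t for t in tokens if t in vocab)
    model = {}
    for c, n_c in class_count.items():
        total = sum(term_count[c].values()) + len(vocab)
        log_prior = math.log(n_c / len(labels))
        log_like = {t: math.log((term_count[c][t] + 1) / total) for t in vocab}
        model[c] = (log_prior, log_like)
    return model


def classify(model, tokens):
    """Return the class with the highest log posterior; unknown terms are ignored."""
    def score(c):
        log_prior, log_like = model[c]
        return log_prior + sum(log_like[t] for t in tokens if t in log_like)
    return max(model, key=score)


# Hypothetical pre-classified sample pages (not data from the thesis).
train_pages = [
    ("<h1>Dairy farm</h1><p>We sell milk and cheese from our farm.</p>", "farming"),
    ("<p>Organic farm with cattle and fresh milk.</p>", "farming"),
    ("<h1>Software house</h1><p>We build web software and apps.</p>", "software"),
    ("<p>Custom software development for the cloud.</p><script>var x = 1;</script>", "software"),
]
docs = [tokenize(html) for html, _ in train_pages]
labels = [label for _, label in train_pages]
scores = mutual_information(docs, labels)
# Keep the 3 highest-scoring terms as the reduced feature space.
vocab = set(sorted(scores, key=lambda t: (-scores[t], t))[:3])
model = train_naive_bayes(docs, labels, vocab)
prediction = classify(model, tokenize("<p>Fresh milk from the farm</p>"))
print(sorted(vocab), prediction)
```

In this toy setting the terms that occur in every page of one class and in no page of the other receive the highest mutual information scores, so they form the reduced vocabulary on which the classifier is trained.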
Description
Master's thesis in information and communication technology 2008 – Universitetet i Agder, Grimstad