dc.contributor.author | Zhu, Lida | |
dc.date.accessioned | 2009-03-12T14:16:03Z | |
dc.date.issued | 2008 | |
dc.identifier.uri | http://hdl.handle.net/11250/137039 | |
dc.description | Master's thesis in information and communication technology 2008 – University of Agder, Grimstad | en |
dc.description.abstract | This thesis presents a solution for classifying web sites by
geographical attribute code (NUTS) and by economic activity attribute code (NACE).
The solution uses keyword-based document classification methods, which have shown
good performance. After classification, each document is assigned a class label
from a set of predefined categories, based on a pool of pre-classified sample documents.
The solution removes stop words and skips HTML tags in order to identify the
informative terms and discard non-informative or redundant ones, which improves
classification accuracy; uses mutual information for feature selection to reduce the
dimensionality of the feature space and produce vectors for classification; and finally
uses the Naïve Bayes and Decision Tree algorithms to perform the classification and
compare their performance. In the experiments, the system classifies web
sites into NACE categories with a maximum accuracy of 97% on 46 web
pages, and into NUTS categories with a best accuracy of 93% on 223 web
pages. | en |
dc.format.extent | 806386 bytes | |
dc.format.mimetype | application/pdf | |
dc.language.iso | eng | en |
dc.publisher | Universitetet i Agder / Agder University | en |
dc.subject.classification | IKT590 | |
dc.title | Automatic Categorization of Web Sites | en |
dc.type | Master thesis | en |
dc.subject.nsi | VDP::Mathematics and natural science: 400::Information and communication science: 420::Knowledge based systems: 425 | en |
dc.source.pagenumber | 69 | en |