Semi-automatic web resource discovery using ontology-focused crawling

Kristoffersen, Erik; Sætren, Marius André

dc.contributor.author	Kristoffersen, Erik
dc.contributor.author	Sætren, Marius André
dc.date.accessioned	2007-06-28T11:18:40Z
dc.date.issued	2005
dc.identifier.uri	http://hdl.handle.net/11250/137184
dc.description	Master's thesis in information and communication technology 2005 - Høgskolen i Agder, Grimstad	en
dc.description.abstract	The enormous amount of information available on the Internet makes it difficult to find relevant resources using regular breadth-first crawlers. Focused crawlers seek to find only web pages that are relevant to the user and to avoid downloading irrelevant ones. Ontologies have recently been proposed as a tool for defining the target domain of focused crawlers. In this project we have developed a prototype of an ontology-focused crawler by writing additional modules for the open source Java crawler Heritrix. One of these modules measures the relevancy of web pages in relation to an ontology describing the area of interest. We have also developed a link analysis module to determine the importance of web pages. This module uses the link analysis component from the open source search engine Nutch. The importance measure is used to ensure that the most important web pages are downloaded first. This thesis also contains an evaluation of several open source crawlers. We found that Heritrix was the easiest to extend and the best suited for our purpose, so our prototype is built upon it. To measure the performance of the prototype, several test crawls with different settings were carried out. Focused crawlers are often evaluated by harvest rate, which is the ratio of relevant web pages to all web pages downloaded. The prototype performed well in the tests, and in one of them it had a harvest rate of about 0.55. In a similar unfocused crawl, the harvest rate was only about 0.15. Both the prototype and the algorithm are designed to be easily configured. More testing and adjustment of the settings could improve the performance of the prototype even further, but we have shown that ontologies are a suitable technology for creating focused crawlers.	en
dc.format.extent	647556 bytes
dc.format.mimetype	application/pdf
dc.language.iso	eng	en
dc.publisher	Høgskolen i Agder
dc.publisher	Agder University College
dc.subject.classification	IKT590
dc.title	Semi-automatic web resource discovery using ontology-focused crawling	en
dc.type	Master thesis	en
dc.subject.nsi	VDP::Mathematics and natural science: 400::Information and communication science: 420::Algorithms and computability theory: 422
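
The abstract defines harvest rate as the ratio of relevant web pages to all web pages downloaded. As a minimal sketch of that definition only (not the thesis's code), the Python function below computes it from a list of per-page relevance judgements. The name `harvest_rate` and the two sample lists are hypothetical; the samples are made-up data chosen so that the ratios match the approximate figures quoted in the abstract (0.55 and 0.15), and they do not reproduce the actual test crawls.

```python
def harvest_rate(relevance_flags):
    """Return the fraction of downloaded pages that were judged relevant.

    relevance_flags: one boolean per downloaded page (True = relevant).
    An empty crawl is given a harvest rate of 0.0 by convention here.
    """
    flags = list(relevance_flags)
    if not flags:
        return 0.0
    relevant = sum(1 for flag in flags if flag)
    return relevant / len(flags)


# Hypothetical crawls of 20 pages each (illustrative data, not from the thesis).
focused_crawl = [True] * 11 + [False] * 9     # 11 relevant out of 20
unfocused_crawl = [True] * 3 + [False] * 17   # 3 relevant out of 20

if __name__ == "__main__":
    print(f"focused:   {harvest_rate(focused_crawl):.2f}")
    print(f"unfocused: {harvest_rate(unfocused_crawl):.2f}")
```

How a page is judged relevant is the hard part and is left out here; in the thesis that judgement comes from the ontology-based relevancy module.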

Associated file(s)

File name: master_ikt_2005_kristoffersen.pdf
Size: 632.3Kb
Format: PDF

This item appears in the following collection(s)

Master's theses in Information and Communication Technology [491]
MM500, IKT590, IKT591