Semi-automatic web resource discovery using ontology-focused crawling
Abstract
The enormous amount of information available on the Internet makes it difficult to find
resources with relevant information using regular breadth-first crawlers. Focused crawlers
seek to exclusively find web pages that are relevant for the user, and avoid downloading
irrelevant web pages. Ontologies have recently been proposed as a tool for defining the
target domain for focused crawlers.
In this project we have developed a prototype of an ontology-focused crawler. We have
accomplished this by developing extra modules for the open source Java crawler Heritrix.
One of these modules measures the relevancy of web pages in relation
to an ontology describing the area of interest. We have also developed a link analysis
module to determine the importance of web pages. This module uses the link analysis
component from the open source search engine Nutch. The importance measure is used to
ensure that the most important web pages are downloaded first.
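The idea of downloading the most important pages first can be illustrated with a priority queue. The following is only a minimal sketch under stated assumptions: the class `ImportanceFrontier`, its methods, and the numeric scores are hypothetical names chosen for illustration, and none of this is the Heritrix or Nutch API or the module developed in the thesis.

```java
import java.util.PriorityQueue;

/** Hypothetical frontier that hands out the highest-scored URL first. */
public final class ImportanceFrontier {

    /** A URL waiting to be fetched, paired with an assumed importance score. */
    private static final class Candidate {
        final String url;
        final double importance;

        Candidate(String url, double importance) {
            this.url = url;
            this.importance = importance;
        }
    }

    // Order candidates by descending importance, so poll() returns the best one.
    private final PriorityQueue<Candidate> queue =
            new PriorityQueue<>((a, b) -> Double.compare(b.importance, a.importance));

    /** Schedules a URL with the importance score assigned by link analysis. */
    public void add(String url, double importance) {
        queue.add(new Candidate(url, importance));
    }

    /** Returns the most important pending URL, or null if none is left. */
    public String next() {
        Candidate best = queue.poll();
        return best == null ? null : best.url;
    }

    public boolean isEmpty() {
        return queue.isEmpty();
    }

    public static void main(String[] args) {
        ImportanceFrontier frontier = new ImportanceFrontier();
        // Scores below are made up for the example.
        frontier.add("http://example.org/a", 0.2);
        frontier.add("http://example.org/b", 0.9);
        frontier.add("http://example.org/c", 0.5);
        while (!frontier.isEmpty()) {
            System.out.println(frontier.next());
        }
    }
}
```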
This thesis also contains an evaluation of several open source crawlers. We found that
Heritrix was the easiest to extend, and best suited for our purpose. Our prototype is
therefore built upon Heritrix.
To measure the performance of the prototype, several test crawls with different settings have
been carried out. Focused crawlers are often evaluated by harvest rate, which is the ratio
of relevant web pages to all web pages downloaded. The prototype
performed well in the tests, and in one of them the prototype had a harvest rate of about
0.55. In a similar unfocused crawl, the harvest rate was only about 0.15. Both the prototype
and the algorithm are designed to be easily configured. More testing and adjustments of the
settings could improve the performance of the prototype even further, but we have shown
that ontologies are a suitable technology for creating focused crawlers.
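As a small illustration of the harvest rate defined above, the ratio can be computed as follows. This is a sketch only: the class and method names are hypothetical, and the page counts in `main` are invented so that they reproduce the approximate rates quoted in the abstract; they are not the actual counts from the test crawls.

```java
/** Hypothetical helper that computes the harvest rate of a crawl. */
public final class HarvestRate {

    private HarvestRate() {
    }

    /**
     * Returns relevant pages divided by all downloaded pages.
     * An empty crawl is given a harvest rate of 0.0 by convention here.
     */
    public static double harvestRate(long relevantPages, long downloadedPages) {
        if (relevantPages < 0 || downloadedPages < 0 || relevantPages > downloadedPages) {
            throw new IllegalArgumentException(
                    "need 0 <= relevantPages <= downloadedPages");
        }
        if (downloadedPages == 0) {
            return 0.0;
        }
        return (double) relevantPages / downloadedPages;
    }

    public static void main(String[] args) {
        // Invented counts: 550 of 1000 pages relevant in a focused crawl,
        // 150 of 1000 pages relevant in an unfocused crawl.
        System.out.println("focused:   " + harvestRate(550, 1000));
        System.out.println("unfocused: " + harvestRate(150, 1000));
    }
}
```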
Description
Master's thesis in information and communication technology 2005 - Høgskolen i Agder, Grimstad
Publisher
Høgskolen i Agder / Agder University College