Incremental web crawling as a competitive game of learning automata

Myrer, Svein Arild; Olsen, Morten Goodwin

dc.contributor.author	Myrer, Svein Arild
dc.contributor.author	Olsen, Morten Goodwin
dc.date.accessioned	2007-06-28T12:19:06Z
dc.date.issued	2005
dc.identifier.uri	http://hdl.handle.net/11250/137178
dc.description	Masteroppgave i informasjons- og kommunikasjonsteknologi 2005 - Høgskolen i Agder, Grimstad	en
dc.description.abstract	There is no doubt that the World Wide Web has lived up to it’s hype of being the world’s central information highway through the past years. An increasing amount of versatile services keeps finding their way onto the Web as information providers continue to embrace the possibilities that the Web can offer. Especially the possibility of producing dynamic content has been an accelerant factor and is the reason why we now conveniently can participate in online auctions or see the latest development of our favorite stocks in near real-time from our own living rooms. However, for automated data mining applications that deploy crawlers to continuously capture the information provided by this new breed of services, the highly dynamic nature of the content is not convenient at all. As a matter of fact, a complete new set of challenges emerges where traditional crawling strategies are shown to be sub-optimal. Accordingly a new class of methods for crawling operations are clearly needed. Nonetheless, the problem area has so far been given limited attention in literature. In this thesis we address the new problem area of monitoring highly dynamic data sources of different importance. We use the concept of an incremental web crawler as a basis for our novel approach where we consider the incremental crawling task as a continuous learning problem where scheduling of monitoring tasks is combined with parameter estimation in an on-line manner. By mapping the problem to two variants of the so called knapsack problem we propose two solutions based on a machine learning technique known as learning automata. We show empirically that our proposed solutions continuously improve their performance through a learning process and that they are capable of operating in non-stationary environments. We also show their performance in comparison to alternative algorithms where, most notably, our schemes are shown to outdo the traditional uniform crawling scheme by factors up to 550% in certain situations.	en
dc.format.extent	1074894 bytes
dc.format.mimetype	application/pdf
dc.language.iso	eng	en
dc.publisher	Høgskolen i Agder
dc.publisher	Agder University College
dc.subject.classification	IKT590
dc.title	Incremental web crawling as a competitive game of learning automata	en
dc.type	Master thesis	en
dc.subject.nsi	VDP::Matematikk og naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420::Algoritmer og beregnbarhetsteori: 422
dc.subject.nsi	VDP::Matematikk og naturvitenskap: 400::Informasjons- og kommunikasjonsvitenskap: 420::Simulering, visualisering, signalbehandling, bildeanalyse: 429

Tilhørende fil(er)

Filnavn:: master_ikt_2005_myrer.pdf
Størrelse:: 1.025Mb
Format:: PDF

Åpne

Denne innførselen finnes i følgende samling(er)

Master's theses in Information and Communication Technology [491]
MM500, IKT590, IKT591

Vis enkel innførsel