A novel spatio-temporal scheme for reducing the rate of false positives in bloom filter based URL-caching
Master thesis
Permanent link
http://hdl.handle.net/11250/137494
Publication date
2010
Abstract
Achieving efficient use of available resources is an important problem in the
field of web mining. Monitoring and analyzing the web is extremely resource-demanding,
and therefore more efficient use of resources often translates directly
into improved web monitoring coverage and accuracy. One important subproblem
is to reduce the memory consumption of the URL cache in a web crawler
system. Using a Bloom filter, a space-efficient data structure, as the URL cache
reduces memory consumption. However, the Bloom filter introduces false
positives: a URL that has never been crawled may be reported as already seen,
which leads to loss of valuable web content when the filter is used as a
URL cache in a web crawler system. To address this problem of false positives,
this thesis proposes three novel strategies, namely a temporal, a spatial, and a
spatio-temporal strategy, each aiming to reduce the false positive rate introduced
by the Bloom filter. During testing and evaluation of the strategies, we found that
both the spatial and the temporal strategy are able to reduce the false positive rate of the
Bloom filter. These two strategies were then combined to test whether it is possible
to further decrease the false positive probability. Testing and evaluation of
the combined strategy show that it does yield a reduction in the false positive
probability.
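
For readers unfamiliar with the data structure, the following is a minimal sketch of a plain Bloom filter used as a URL cache. It is not the temporal, spatial, or spatio-temporal strategy proposed in the thesis, which the abstract does not specify. The class name `BloomFilter`, the helper `expected_false_positive_rate`, the SHA-256 double-hashing scheme, and the example URLs and sizes are all assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Plain Bloom filter over strings (illustrative; not the thesis's scheme)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # bit array, all zeros

    def _positions(self, item: str):
        # Assumed hashing scheme: derive k bit positions from two 64-bit
        # halves of a single SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True if all derived bits are set. This can be a false positive,
        # but an item that was added is never reported as absent.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def expected_false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    # Standard approximation for a plain Bloom filter: (1 - e^(-k*n/m))^k
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


if __name__ == "__main__":
    cache = BloomFilter(num_bits=8 * 1024, num_hashes=5)
    cache.add("http://example.com/a")  # hypothetical URL
    print("http://example.com/a" in cache)  # an added URL is always found
    print(expected_false_positive_rate(8 * 1024, 5, 1000))
```

In a crawler, a false positive from `in cache` means a URL that was never fetched is skipped as already seen. That is the content loss the abstract describes, and the proposed strategies aim to reduce it.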
Description
Master's thesis in information and communication technology 2010 – Universitetet i Agder, Grimstad